Frequency Analysis
by fortenforge, Jul 16, 2009, 3:19 AM
If you have been looking at the shouts, you will notice someone mentioned something called "frequency analysis".
Frequency Analysis is an algorithm to easily decrypt substitution ciphers.
To understand how it works, you have to first understand that in the English language, or in any language, some letters will appear more often than others and some will appear less often.
For example, in English the most common letters are E, T, A, and O. The least common letters are Q, X, Z, and J.
In the table below, you can see the frequency analysis for the first two paragraphs of this entry next to the average frequencies for English. The first row is the alphabet, the second is the frequency of each letter in the first two paragraphs, and the third is the average frequency of each letter in English. All values are whole-number percentages.
Letter     A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
This post  8  1  2  3 11  2  3  4  7  0  1  6  2  9  8  2  1  5  8  9  4  1  2  0  3  0
English    8  2  3  4 13  2  2  6  7  0  1  4  2  7  8  2  0  6  6  9  3  1  2  0  2  0
The biggest gap between this post's text and the English average is 2, which happens for the letters E, H, L, N, and S. The same holds for any reasonably long piece of English text: the letter frequencies will be roughly the same. Of course, the larger the text you have, the more closely it will match the averages. It is just like a coin flip: the more times you flip the coin, the closer the percentage of heads gets to 50%.
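If you want to build a row like the second one yourself, here is a rough Python sketch. The function name `letter_percentages` is just something I made up for illustration, and the rounding is my own assumption, so your numbers may differ by 1 here and there from the table above.

```python
from collections import Counter
import string

def letter_percentages(text):
    """Return {letter: whole-number percentage} for A-Z, ignoring non-letters."""
    letters = [ch for ch in text.upper() if ch in string.ascii_uppercase]
    counts = Counter(letters)
    total = len(letters)
    # Avoid dividing by zero when the text contains no letters at all.
    return {ch: (round(100 * counts[ch] / total) if total else 0)
            for ch in string.ascii_uppercase}

sample = "Frequency Analysis is an algorithm to easily decrypt substitution ciphers."
freqs = letter_percentages(sample)
print(" ".join(string.ascii_uppercase))
print(" ".join(str(freqs[ch]) for ch in string.ascii_uppercase))
```

A single sentence like the sample is far too short to match the English averages well, which is the coin flip point again: feed it more text and the row settles down.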
If I encrypt the first two paragraphs with a Caesar cipher of key 1, I get:
JGZPV IBWFC FFOMP PLJOH BUUIF TIPVU TZPVX
JMMOP UJDFT PNFPO FNFOU JPOFE TPNFU IJOHD
BMMFE GSFRV FODZB OBMZT JTGSF RVFOD ZBOBM
ZTJTJ TBOBM HPSJU INUPF BTJMZ EFDSZ QUTVC
TUJUV UJPOD JQIFS TUPVO EFSTU BOEIP XJUXP
SLTZP VIBWF UPGJS TUVOE FSTUB OEUIB UJOUI
FFOHM JTIMB OHVBH FPSJO BOZMB OHVBH FTPNF
MFUUF STXJM MBQQF BSNPS FPGUF OUIBO PUIFS
TBOET PNFXJ MMBQQ FBSMF TTPGU FO.
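In case you want to reproduce this, a Caesar cipher of key 1 just moves every letter one place forward in the alphabet (Z wraps around to A). Here is a minimal sketch in Python. The helper names `caesar_encrypt` and `group_in_fives` are my own, and I am assuming spaces and punctuation are simply dropped before the letters are grouped into blocks of five.

```python
import string

def caesar_encrypt(plaintext, key):
    """Shift each letter forward by `key` places; drop everything that is not a letter."""
    shifted = []
    for ch in plaintext.upper():
        if ch in string.ascii_uppercase:
            shifted.append(chr((ord(ch) - ord("A") + key) % 26 + ord("A")))
    return "".join(shifted)

def group_in_fives(text):
    """Split a string into space-separated blocks of five characters."""
    return " ".join(text[i:i + 5] for i in range(0, len(text), 5))

print(group_in_fives(caesar_encrypt("If you have been looking at the shouts", 1)))
# JGZPV IBWFC FFOMP PLJOH BUUIF TIPVU T
```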
Let's do the frequency analysis on this ciphertext:
Letter     A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
Cipher     0  8  1  2  3 11  2  3  4  7  0  1  6  2  9  8  2  1  5  8  9  4  1  2  0  3
English    8  2  3  4 13  2  2  6  7  0  1  4  2  7  8  2  0  6  6  9  3  1  2  0  2  0
It is clear from the frequency analysis that the second row is just the row for the original text shifted to the right by 1 column. This tells us that the message has been encrypted using a Caesar cipher of key 1. If we shift the row back one column, the numbers line up with the English averages much better, and now we can easily decrypt the message. If I had used a general substitution cipher, where each letter is swapped for an arbitrary other letter instead of a shifted one, it would have been a little more difficult, but still doable. We could immediately guess that whichever letter makes up 11% of the ciphertext is most likely E, because E averages 13% in English. From there we could move on to the other letters and gradually decrypt the cipher. In the next post I will show you an actual example.
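The "shift it back and see if it matches" step can also be done by a program. The sketch below is only one way to do it, not the official method: it tries all 26 keys and picks the one where the ciphertext row sits closest to the English row, using the sum of the absolute differences as the score. The two lists are copied from the tables in this post, and the names `mismatch`, `guess_key`, and `caesar_decrypt` are made up for illustration.

```python
# Rows copied from the tables in this post (whole-number percentages, A..Z).
ENGLISH = [8, 2, 3, 4, 13, 2, 2, 6, 7, 0, 1, 4, 2, 7, 8, 2, 0, 6, 6, 9, 3, 1, 2, 0, 2, 0]
CIPHER = [0, 8, 1, 2, 3, 11, 2, 3, 4, 7, 0, 1, 6, 2, 9, 8, 2, 1, 5, 8, 9, 4, 1, 2, 0, 3]

def mismatch(cipher_row, english_row, key):
    """Total absolute difference, assuming plaintext letter i became letter i + key."""
    return sum(abs(cipher_row[(i + key) % 26] - english_row[i]) for i in range(26))

def guess_key(cipher_row, english_row=ENGLISH):
    """Return the Caesar key (0-25) whose shifted row is closest to the English row."""
    return min(range(26), key=lambda k: mismatch(cipher_row, english_row, k))

def caesar_decrypt(ciphertext, key):
    """Shift each letter back by `key` places; leave spaces and punctuation alone."""
    out = []
    for ch in ciphertext.upper():
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") - key) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)

key = guess_key(CIPHER)
print(key)  # 1
print(caesar_decrypt("JGZPV IBWFC FFOMP", key))  # IFYOU HAVEB EENLO
```

This only works for a Caesar cipher, where there are just 26 keys to try. For a general substitution cipher you would have to match letters up one at a time, starting with the most common ones.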
One note: if the plaintext had not been in English, the average frequencies would have been different, so comparing against the English averages would not work well. You would need the averages for the plaintext's language instead.