San Diego State University

Decrypt Vigenere

So now we have:

P('a';'a') = .08 * .08 = .0064
P('e';'e') = .122 * .122 = .0149
P('a';'a' OR 'e';'e') = .0064 + .0149 = .0213

So now we have the method for solving the original problem: to find the probability of two English letters coinciding, we add the probabilities of coincidence for all the individual letters:

P(coincidence) = Sum over x = a...z of P(x;x)

We compute this (using a computer) and discover:

P(coincidence) = .065
This is a pretty solid number that will hold up for most randomly sampled English texts.
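The sum above can be checked with a few lines of Python. This is a minimal sketch, not the course's own code: the names `ENGLISH_FREQ` and `coincidence_probability` are made up for illustration, and the frequencies are the ones from the English letter frequency table used in these notes.

```python
# English letter frequencies, copied from the table used in these notes.
ENGLISH_FREQ = {
    'e': 0.122, 't': 0.091, 'a': 0.080, 'o': 0.077, 'i': 0.075, 'n': 0.069,
    's': 0.064, 'r': 0.061, 'h': 0.054, 'l': 0.042, 'd': 0.039, 'c': 0.031,
    'u': 0.029, 'm': 0.024, 'f': 0.022, 'g': 0.021, 'p': 0.020, 'y': 0.020,
    'w': 0.020, 'b': 0.016, 'v': 0.010, 'k': 0.008, 'j': 0.002, 'x': 0.002,
    'q': 0.001, 'z': 0.001,
}

def coincidence_probability(freq):
    """Add up P(x;x) = P(x) * P(x) over every letter x."""
    return sum(p * p for p in freq.values())

p_coincidence = coincidence_probability(ENGLISH_FREQ)
print(round(p_coincidence, 3))  # 0.065
```

The two largest terms are the ones worked by hand above: .0149 for 'e' and .0064 for 'a'.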

Now suppose we perform this same computation on a monoalphabetic substitution cipher. What do we get?

Let's be precise. Assume DH as follows:

cipher = oja gpddju hbkglno cjq tpdo xlrluh uj xlhbwpo pnukv p xkgltpw bjldu lu lh ltbjvupdu uj pyjlx rlyldr p tlhupikd ltbvkhhljd jn ykvo clrc bvkglhljd qckd oja idjq ojav datkvlg vkhawuh pvk jdwo pggavpuk uj p nkq xlrluh
        
ciph_plain_dict = {}

hypothesis = --- ------ ------- --- ---- ------ -- ------- ----- - ------- ----- -- -- --------- -- ----- ------ - -------- ---------- -- ---- ---- --------- ---- --- ---- ---- ------- ------- --- ---- -------- -- - --- ------

letter_frequencies = [('l', 21), ('j', 16), ('p', 14), ('u', 14), ('d', 13), ('k', 13), ('h', 11), ('v', 10), ('o', 8), ('g', 7), ('a', 6), ('b', 6), ('t', 6), ('r', 5), ('x', 5), ('c', 4), ('n', 4), ('q', 4), ('w', 4), ('y', 3), ('i', 2)]

top_digraph_frequencies = [('lh', 4), ('vk', 4), ('gl', 3), ('lu', 3), ('lr', 3), ('rl', 3), ('jd', 3), ('ja', 3), ('oj', 3), ('xl', 3), ('lt', 3), ('pd', 3), ('uh', 3), ('uj', 3), ('kg', 3), ('kv', 3), ('hb', 2), ('ld', 2), ('lj', 2), ('tb', 2), ('dj', 2), ('du', 2), ('jq', 2), ('bj', 2), ('jl', 2), ('tp', 2), ('hl', 2), ('bv', 2), ('up', 2), ('uk', 2), ('av', 2), ('kd', 2), ('kh', 2)]

One Letter Words = ['p', 'p', 'p']

Two Letter Words = ['uj', 'lu', 'lh', 'uj', 'jn', 'uj']

Three Letter Words = ['oja', 'cjq', 'oja', 'pvk', 'nkq']
Then we compute the Friedman test for the specific characters of this cipher as follows:
>>> DH.friedman_test()
Letter Count: 176
Letter  Prob   ProbSq Incidence
----------------------------
  l    0.119   0.014   0.014 
  j    0.091   0.008   0.023 
  p    0.080   0.006   0.029 
  u    0.080   0.006   0.035 
  d    0.074   0.005   0.041 
  k    0.074   0.005   0.046 
  h    0.062   0.004   0.050 
  v    0.057   0.003   0.053 
  o    0.045   0.002   0.055 
  g    0.040   0.002   0.057 
  a    0.034   0.001   0.058 
  b    0.034   0.001   0.059 
  t    0.034   0.001   0.060 
  r    0.028   0.001   0.061 
  x    0.028   0.001   0.062 
  c    0.023   0.001   0.062 
  n    0.023   0.001   0.063 
  q    0.023   0.001   0.064 
  w    0.023   0.001   0.064 
  y    0.017   0.000   0.064 
  i    0.011   0.000   0.064 
Total Index of Coincidence 0.064
Notice the total is very close to 0.065!

Why?

Because the numbers are not that far off from English letter frequencies; of course, different letters than usual are giving rise to the probabilities, but that doesn't matter to the computation.
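Here is a rough sketch of the same computation in Python. It is not the `DH.friedman_test()` method itself (its source is not shown here); `friedman_total` is a hypothetical stand-in that sums the squared relative frequency of each letter, which is what the ProbSq column adds up to. The last lines relabel every letter with a fixed shift, an arbitrary monoalphabetic substitution chosen only for illustration, to show that the total does not move.

```python
import math
import string
from collections import Counter

def friedman_total(text):
    """Sum of squared relative letter frequencies, ignoring non-letters."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    n = len(letters)
    counts = Counter(letters)
    return sum((c / n) ** 2 for c in counts.values())

# The substitution ciphertext from these notes.
cipher = (
    "oja gpddju hbkglno cjq tpdo xlrluh uj xlhbwpo pnukv p xkgltpw bjldu "
    "lu lh ltbjvupdu uj pyjlx rlyldr p tlhupikd ltbvkhhljd jn ykvo clrc "
    "bvkglhljd qckd oja idjq ojav datkvlg vkhawuh pvk jdwo pggavpuk uj p "
    "nkq xlrluh"
)

total = friedman_total(cipher)
print(round(total, 3))  # about 0.064, as in the table above

# Relabel the letters again (shift by 3): the counts just move to new letters.
alphabet = string.ascii_lowercase
relabel = str.maketrans(alphabet, alphabet[3:] + alphabet[:3])
print(math.isclose(total, friedman_total(cipher.translate(relabel))))  # True
```

Only the multiset of letter counts enters the sum, so any one-to-one renaming of the letters leaves it unchanged.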

The plain text version of this message is:

If we do the Friedman calculation for the plaintext, we get:
Letter Count: 177
Letter  Prob   ProbSq Incidence
----------------------------
  i    0.119   0.014   0.014 
  o    0.090   0.008   0.022 
  a    0.079   0.006   0.029 
  t    0.079   0.006   0.035 
  e    0.073   0.005   0.040 
  n    0.073   0.005   0.046 
  s    0.068   0.005   0.050 
  r    0.056   0.003   0.053 
  y    0.045   0.002   0.055 
  c    0.040   0.002   0.057 
  m    0.034   0.001   0.058 
  p    0.034   0.001   0.059 
  u    0.034   0.001   0.060 
  d    0.028   0.001   0.061 
  g    0.028   0.001   0.062 
  f    0.023   0.001   0.062 
  h    0.023   0.001   0.063 
  l    0.023   0.001   0.064 
  w    0.023   0.001   0.064 
  v    0.017   0.000   0.064 
  k    0.011   0.000   0.064 
Total Index of Coincidence 0.064

Of course this is the same calculation as before. The only thing that has changed is the first column. And the reason we get a number close to .065 is that, all told, the letter frequencies of this message do not differ all that much from those of larger representative samples of English. And the encrypted version preserves that property as long as we're using a monoalphabetic substitution cipher.

But let's take the same message and encrypt it using a Vigenere cipher (the key is 'milk'):

'kwf oiyxab cbmnsrg rae wmvj pqrsfa da osexwkk lpfmc m oooqxkx ayuvec qe ua syxzbfiyd bz mdzsp rshqyq i wuaekwmy uuabqadsav yr godg ruos bzpmuatyz hrqv iac uzwh kwfb vfwqztm zpcgtec ico wyvk lmocckfm da l rmh pqrsfa'
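For reference, here is a minimal Vigenere encryption sketch. The function name and the `advance_on_spaces` flag are illustrative assumptions, not part of the notes. The sample input "you cannot specify" is what the first three ciphertext words decrypt to under the key 'milk'; those words are reproduced only when a space also uses up one key position, so the flag is switched on in the example. The more common convention advances the key on letters only, which is the default here.

```python
def vigenere_encrypt(plaintext, key, advance_on_spaces=False):
    """Shift each letter by the matching key letter (a=0, ..., z=25).

    advance_on_spaces is a hypothetical flag for illustration: when True,
    a non-letter character also consumes one position of the key.
    """
    out = []
    k = 0  # position in the repeating key
    for ch in plaintext.lower():
        if ch.isalpha():
            shift = ord(key[k % len(key)]) - ord('a')
            out.append(chr((ord(ch) - ord('a') + shift) % 26 + ord('a')))
            k += 1
        else:
            out.append(ch)  # spaces and punctuation pass through unchanged
            if advance_on_spaces:
                k += 1
    return ''.join(out)

print(vigenere_encrypt("you cannot specify", "milk", advance_on_spaces=True))
# kwf oiyxab cbmnsrg
```

Because the same plaintext letter lands under different key letters, it is encrypted to different ciphertext letters, which is what flattens the frequency distribution.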

Now we run the Friedman test on this:
Letter Count: 177
Letter  Prob   ProbSq Incidence
----------------------------
  a    0.079   0.006   0.006
  m    0.062   0.004   0.010
  c    0.051   0.003   0.013
  o    0.051   0.003   0.015
  q    0.051   0.003   0.018
  s    0.051   0.003   0.020
  r    0.051   0.003   0.023
  w    0.051   0.003   0.026
  y    0.051   0.003   0.028
  f    0.045   0.002   0.030
  k    0.045   0.002   0.032
  u    0.045   0.002   0.034
  z    0.045   0.002   0.036
  b    0.040   0.002   0.038
  e    0.034   0.001   0.039
  d    0.034   0.001   0.040
  p    0.034   0.001   0.041
  v    0.034   0.001   0.043
  i    0.028   0.001   0.043
  x    0.028   0.001   0.044
  g    0.023   0.001   0.045
  h    0.023   0.001   0.045
  l    0.017   0.000   0.045
  t    0.017   0.000   0.046
  j    0.006   0.000   0.046
  n    0.006   0.000   0.046
Total Index of Coincidence 0.046

In general the index of coincidence will be lowered significantly when a polyalphabetic cipher like Vigenere is being used.

How much? What's the limit?

Well, suppose we have a completely random sample of letters. What would its index of coincidence be?

In a random sample each letter has probability 1/26, so the probability that two randomly chosen letters are both some particular letter is 1/26 * 1/26.

There are 26 possible letters for which this is the probability of a coincidence, so we multiply this by 26:
P(coincidence) = 26*(1/26 * 1/26) = 0.038
So this is a lower bound. In general, if the index of coincidence is close to 0.038, we've got a polyalphabetic cipher.
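A quick simulation can back up the 1/26 figure. This is only a sketch: the sample size and the seed are arbitrary choices, and `friedman_total` is a hypothetical helper that sums squared relative letter frequencies.

```python
import random
import string
from collections import Counter

def friedman_total(text):
    """Sum of squared relative letter frequencies, ignoring non-letters."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    n = len(letters)
    counts = Counter(letters)
    return sum((c / n) ** 2 for c in counts.values())

random.seed(0)  # fixed seed so the run is repeatable
sample = ''.join(random.choice(string.ascii_lowercase) for _ in range(100_000))

exact = 26 * (1 / 26) ** 2      # the hand computation above
simulated = friedman_total(sample)

print(round(exact, 4))          # 0.0385
print(abs(simulated - exact) < 0.001)  # True
```

The simulated value sits just above 1/26, since any unevenness in the counts can only raise the sum of squares.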

Kasiski Test

Find repetitions. If the cipher is a Vigenere cipher, the distance between repeated occurrences of the same string of letters is usually a multiple of the length of the key (an occasional repetition is just chance).

Example.
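One way to look for repetitions is sketched below. The function names are hypothetical and the choice of n-gram length is a judgment call: short n-grams repeat by chance, longer ones are more trustworthy. Positions are counted in the string exactly as given, spaces included. The ciphertext is the Vigenere ciphertext from these notes (key 'milk', length 4).

```python
from collections import defaultdict
from functools import reduce
from math import gcd

def repeat_distances(text, n=3):
    """Map each repeated n-gram to the distances between its occurrences."""
    positions = defaultdict(list)
    for i in range(len(text) - n + 1):
        positions[text[i:i + n]].append(i)
    return {
        gram: [b - a for a, b in zip(pos, pos[1:])]
        for gram, pos in positions.items()
        if len(pos) > 1
    }

def common_divisor(distances):
    """Greatest common divisor of the distances. The key length divides it,
    provided every repetition is a genuine one and not a chance repeat."""
    return reduce(gcd, distances)

cipher = (
    "kwf oiyxab cbmnsrg rae wmvj pqrsfa da osexwkk lpfmc m oooqxkx "
    "ayuvec qe ua syxzbfiyd bz mdzsp rshqyq i wuaekwmy uuabqadsav "
    "yr godg ruos bzpmuatyz hrqv iac uzwh kwfb vfwqztm zpcgtec ico "
    "wyvk lmocckfm da l rmh pqrsfa"
)

repeats = repeat_distances(cipher, n=6)
print(repeats["pqrsfa"])  # [180], and 180 = 4 * 45
all_distances = [d for ds in repeats.values() for d in ds]
print(common_divisor(all_distances))  # 180
```

With a single long repetition the test only says the key length divides 180; more repetitions, or the Friedman test, are needed to narrow it down.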

Friedman Test

The index of coincidence: The probability that two letters randomly selected from the text are the same.

  1. n = the number of letters in the text
  2. the number of a's
  3. the number of b's
  4. ...
  5. the number of z's

How do we compute this?

We'll begin with the problem of computing the probability of having two randomly chosen English letters turn out to be the same letter.

First some elementary probability theory.

Probability  

Let's assume that two events are independent. For example, rolling a 6 on your first toss of a single die and rolling a 5 on the second.

Each of these events has probability 1/6.

To find the probability of two independent events BOTH happening we simply multiply:

P(6) = 1/6
P(5) = 1/6
P(6;5) = 1/6 * 1/6 = 1/36
Thus the chances of rolling a 6 followed by a 5 are 1 in 36.

What are the chances of rolling a 6 OR a 5 on the first roll and a 6 or a 5 on the second?

To find the probability of either of two mutually exclusive events happening, just add their probabilities.

The probability of rolling a 6 on the first toss is 1/6. The probability of rolling a 5 on the first toss is 1/6. We add those to get the probability of tossing either a 5 or a 6:

P(6) = 1/6
P(5)= 1/6
P(5 OR 6) =  1/6 + 1/6 = 1/3
Now what is the probability of rolling a 6 or a 5 followed by a 6 or a 5?
P(6 OR 5) = 1/3
P((6 OR 5);(6 OR 5)) = 1/3 * 1/3 = 1/9
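These three results can be confirmed by simply listing all 36 outcomes of two tosses and counting. A small sketch (the helper name `probability` is made up for illustration):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first toss, second toss) outcomes.
rolls = list(product(range(1, 7), repeat=2))

def probability(event):
    """Fraction of the 36 outcomes for which event(first, second) is true."""
    return Fraction(sum(1 for a, b in rolls if event(a, b)), len(rolls))

print(probability(lambda a, b: a == 6 and b == 5))            # 1/36
print(probability(lambda a, b: a in (5, 6)))                  # 1/3
print(probability(lambda a, b: a in (5, 6) and b in (5, 6)))  # 1/9
```

Counting outcomes gives the same answers as the multiply rule (independent events) and the add rule (mutually exclusive events).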

Application to coincidence problem

The problem is to compute the probability of randomly choosing two English letters and having them turn out to be the same.

First we compute the probability of having two randomly chosen English letters turn out to be 'a'.

What is the probability of choosing one English letter and having it turn out to be 'a'?

To find out we consult our table of English letter frequencies:

[('e', '0.122'), ('t', '0.091'), ('a', '0.080'),
('o', '0.077'), ('i', '0.075'), ('n', '0.069'),
('s', '0.064'), ('r', '0.061'), ('h', '0.054'),
('l', '0.042'), ('d', '0.039'), ('c', '0.031'),
('u', '0.029'), ('m', '0.024'), ('f', '0.022'),
('g', '0.021'), ('p', '0.020'), ('y', '0.020'),
('w', '0.020'), ('b', '0.016'), ('v', '0.010'),
('k', '0.008'), ('j', '0.002'), ('x', '0.002'),
('q', '0.001'), ('z', '0.001')]
The probability is .08.

Therefore, to compute the probability of drawing two 'a's in a row:

P('a') = .08
P('a';'a') = .08 * .08 = .0064

Now what is the probability of drawing two 'a's in a row OR two 'e's in a row?

We first compute the probability of drawing two 'e's:

P('e') = .122
P('e';'e') = .122 * .122 = .0149
Summary  

We have two tests:

  1. Use the index of coincidence (Friedman test) to determine if a polyalphabetic cipher is being used.
  2. Use the Kasiski test to determine if it's a Vigenere cipher and, if so, to at least constrain the possible key lengths.

Vigenere weakness

Once you know the key length of a Vigenere cipher, its great weakness becomes apparent:

      Vigenere Cipher text with period 4
      m i l k
      k w f o
      i y x a
      b c b m
      n s r g
      r a e w
      m v j p
      q r s f
      a d a o
      s e x w
      k k l p
      f m c m
      o o o q
      x k x a
      y u v e
      c q e u
      a s y x
      z b f i
      y d b z
      m d z s
      p r s h
      q y q i
      w u a e
      k w m y
      u u a b
      q a d s
      a v y r
      g o d g
      r u o s
      b z p m
      u a t y
      z h r q
      v i a c
      u z w h
      k w f b
      v f w q
      z t m z
      p c g t
      e c i c
      o w y v
      k l m o
      c c k f
      m d a l
      r m h p
      q r s f
      a
Each column becomes a shift cipher, meaning it has one of 26 possible keys.

At this point brute force can enter in, at least with computers around. All we need to do is try all 26 keys for column 1, times all 26 for column 2, times all 26 for column 3, and so on, looking for combinations of shifts that produce word matches somewhere in the message. With a key of length 4 that is 26^4 = 456,976 combinations.
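The column idea can be sketched as follows. All names here are hypothetical, and the toy ciphertext is an assumption made for the example: it is "youcannot" encrypted with the key 'milk' on letters only, so the key advances once per letter. The sketch shows the building blocks (split into columns, undo a shift per column, weave the columns back together); a real search would loop over the candidate shifts and score each result against a word list.

```python
def split_columns(letters, period):
    """Column i holds every period-th letter, starting at position i."""
    return [letters[i::period] for i in range(period)]

def shift_decrypt(column, shift):
    """Undo a shift cipher: move every letter back by `shift` places."""
    return ''.join(
        chr((ord(c) - ord('a') - shift) % 26 + ord('a')) for c in column
    )

def candidates(column):
    """All 26 possible decryptions of one column, indexed by shift."""
    return [shift_decrypt(column, s) for s in range(26)]

def interleave(columns):
    """Rebuild the text by reading the columns row by row."""
    out = []
    for i in range(max(len(c) for c in columns)):
        for col in columns:
            if i < len(col):
                out.append(col[i])
    return ''.join(out)

ciphertext = "kwfmmvyyf"              # toy example, see the assumptions above
cols = split_columns(ciphertext, 4)
shifts = [ord(k) - ord('a') for k in "milk"]   # 12, 8, 11, 10
print(interleave([shift_decrypt(c, s) for c, s in zip(cols, shifts)]))
# youcannot
```

Each column has only 26 candidates, so the search space grows as 26 to the power of the key length rather than with the length of the message.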

So determining the length of the key is crucial.

A better way  

The Friedman test actually opens the door to producing a formula that allows us to compute the length of the key.

Look here .