Application to coincidence problem
The problem is to compute the probability of randomly
choosing two English letters and having them turn
out to be the same.
First we compute the probability of having two randomly
chosen English letters turn out to be 'a'.
What is the probability of choosing one English
letter and having it turn out to be 'a'?
To find out we consult our table of English letter frequencies:
[('e', '0.122'), ('t', '0.091'), ('a', '0.080'),
('o', '0.077'), ('i', '0.075'), ('n', '0.069'),
('s', '0.064'), ('r', '0.061'), ('h', '0.054'),
('l', '0.042'), ('d', '0.039'), ('c', '0.031'),
('u', '0.029'), ('m', '0.024'), ('f', '0.022'),
('g', '0.021'), ('p', '0.020'), ('y', '0.020'),
('w', '0.020'), ('b', '0.016'), ('v', '0.010'),
('k', '0.008'), ('j', '0.002'), ('x', '0.002'),
('q', '0.001'), ('z', '0.001')]
The probability is .08.
Therefore, to compute the probability of drawing
two 'a's in a row:
P('a') = .08
P('a';'a') = .08 * .08 = .0064
Now what is the probability of drawing two 'a's
in a row OR two 'e's in a row?
We first compute the probability of drawing two 'e's:
P('e') = .122
P('e';'e') = .122 * .122 = .0149
The two outcomes cannot both happen on the same pair of draws,
so we add their probabilities. So now we have:
P('a';'a') = .08 * .08 = .0064
P('e';'e') = .122 * .122 = .0149
P('a';'a' OR 'e';'e') = .0064 + .0149 = .0213
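The arithmetic above can be checked with a few lines of Python. This is only a sketch; the variable names are ours and the two frequencies are the table values for 'a' and 'e'.

```python
# Frequencies for 'a' and 'e', taken from the table above.
p_a = 0.080
p_e = 0.122

# Two independent draws of the same letter: multiply.
p_aa = p_a * p_a
p_ee = p_e * p_e

# "Two a's" and "two e's" are mutually exclusive: add.
p_aa_or_ee = p_aa + p_ee

print(round(p_aa, 4), round(p_ee, 4), round(p_aa_or_ee, 4))
# prints: 0.0064 0.0149 0.0213
```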
So now we have the method for solving the original problem:
To find the probability of two English letters coinciding
we add the probabilities of coincidence for all
the individual letters:
P(coincidence) = Sum over x = a...z of P(x;x)
We compute this (using a computer) and discover:
P(coincidence) = .065
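One way to carry out that computation is sketched below. The dictionary simply restates the frequency table as floats (the table stores them as strings), and the function name coincidence_probability is a hypothetical helper of ours, not something defined elsewhere in these notes.

```python
# English letter frequencies, restated from the table above as floats.
ENGLISH_FREQ = {
    'e': 0.122, 't': 0.091, 'a': 0.080, 'o': 0.077, 'i': 0.075,
    'n': 0.069, 's': 0.064, 'r': 0.061, 'h': 0.054, 'l': 0.042,
    'd': 0.039, 'c': 0.031, 'u': 0.029, 'm': 0.024, 'f': 0.022,
    'g': 0.021, 'p': 0.020, 'y': 0.020, 'w': 0.020, 'b': 0.016,
    'v': 0.010, 'k': 0.008, 'j': 0.002, 'x': 0.002, 'q': 0.001,
    'z': 0.001,
}


def coincidence_probability(freqs):
    """Add up P(x;x) = P(x) * P(x) over every letter x."""
    return sum(p * p for p in freqs.values())


print(round(coincidence_probability(ENGLISH_FREQ), 3))  # prints: 0.065
```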
This is a pretty solid number that will hold up for
most randomly sampled English texts.
Now suppose we perform this same computation
on a monoalphabetic substitution cipher. What
do we get?
Let's be precise. Assume DH as follows:
cipher = oja gpddju hbkglno cjq tpdo xlrluh uj xlhbwpo pnukv p xkgltpw bjldu lu lh ltbjvupdu uj pyjlx rlyldr p tlhupikd ltbvkhhljd jn ykvo clrc bvkglhljd qckd oja idjq ojav datkvlg vkhawuh pvk jdwo pggavpuk uj p nkq xlrluh
ciph_plain_dict = {}
hypothesis = --- ------ ------- --- ---- ------ -- ------- ----- - ------- ----- -- -- --------- -- ----- ------ - -------- ---------- -- ---- ---- --------- ---- --- ---- ---- ------- ------- --- ---- -------- -- - --- ------
letter_frequencies = [('l', 21), ('j', 16), ('p', 14), ('u', 14), ('d', 13), ('k', 13), ('h', 11), ('v', 10), ('o', 8), ('g', 7), ('a', 6), ('b', 6), ('t', 6), ('r', 5), ('x', 5), ('c', 4), ('n', 4), ('q', 4), ('w', 4), ('y', 3), ('i', 2)]
top_digraph_frequencies = [('lh', 4), ('vk', 4), ('gl', 3), ('lu', 3), ('lr', 3), ('rl', 3), ('jd', 3), ('ja', 3), ('oj', 3), ('xl', 3), ('lt', 3), ('pd', 3), ('uh', 3), ('uj', 3), ('kg', 3), ('kv', 3), ('hb', 2), ('ld', 2), ('lj', 2), ('tb', 2), ('dj', 2), ('du', 2), ('jq', 2), ('bj', 2), ('jl', 2), ('tp', 2), ('hl', 2), ('bv', 2), ('up', 2), ('uk', 2), ('av', 2), ('kd', 2), ('kh', 2)]
One Letter Words = ['p', 'p', 'p']
Two Letter Words = ['uj', 'lu', 'lh', 'uj', 'jn', 'uj']
Three Letter Words = ['oja', 'cjq', 'oja', 'pvk', 'nkq']
Then we compute the Friedman index of coincidence for the specific
characters of this cipher as follows:
>>> DH.friedman_test()
Letter Count: 176
Letter Prob ProbSq Incidence
----------------------------
l 0.119 0.014 0.014
j 0.091 0.008 0.023
p 0.080 0.006 0.029
u 0.080 0.006 0.035
d 0.074 0.005 0.041
k 0.074 0.005 0.046
h 0.062 0.004 0.050
v 0.057 0.003 0.053
o 0.045 0.002 0.055
g 0.040 0.002 0.057
a 0.034 0.001 0.058
b 0.034 0.001 0.059
t 0.034 0.001 0.060
r 0.028 0.001 0.061
x 0.028 0.001 0.062
c 0.023 0.001 0.062
n 0.023 0.001 0.063
q 0.023 0.001 0.064
w 0.023 0.001 0.064
y 0.017 0.000 0.064
i 0.011 0.000 0.064
Total Index of Coincidence 0.064
Notice the total is very close to 0.065!
Why?
Because the numbers are not that far off of English letter
frequencies; of course, different letters
than usual are giving rise to the probabilities,
but that doesn't matter to the computation.
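The columns of the table above can be rebuilt from the letter counts alone. The sketch below is our guess at the kind of computation a method like friedman_test performs, not its actual source; friedman_rows is a hypothetical name, and the counts are copied from letter_frequencies above.

```python
# Ciphertext letter counts, copied from letter_frequencies above.
letter_counts = [('l', 21), ('j', 16), ('p', 14), ('u', 14), ('d', 13),
                 ('k', 13), ('h', 11), ('v', 10), ('o', 8), ('g', 7),
                 ('a', 6), ('b', 6), ('t', 6), ('r', 5), ('x', 5),
                 ('c', 4), ('n', 4), ('q', 4), ('w', 4), ('y', 3), ('i', 2)]


def friedman_rows(counts):
    """Return (rows, total); each row is (letter, prob, prob_sq, running_total)."""
    n = sum(count for _, count in counts)
    rows = []
    running = 0.0
    for letter, count in counts:
        prob = count / n          # relative frequency of this letter
        prob_sq = prob * prob     # chance of drawing it twice in a row
        running += prob_sq        # cumulative "Incidence" column
        rows.append((letter, prob, prob_sq, running))
    return rows, running


rows, total = friedman_rows(letter_counts)
print("Letter Count:", sum(count for _, count in letter_counts))
for letter, prob, prob_sq, running in rows:
    print(f"{letter} {prob:.3f} {prob_sq:.3f} {running:.3f}")
print(f"Total Index of Coincidence {total:.3f}")
```

Note that the letters themselves never enter the arithmetic; only the counts do.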
The plain text version of this message is:
If we do the Friedman calculation for the plain text, we get:
Letter Count: 177
Letter Prob ProbSq Incidence
----------------------------
i 0.119 0.014 0.014
o 0.090 0.008 0.022
a 0.079 0.006 0.029
t 0.079 0.006 0.035
e 0.073 0.005 0.040
n 0.073 0.005 0.046
s 0.068 0.005 0.050
r 0.056 0.003 0.053
y 0.045 0.002 0.055
c 0.040 0.002 0.057
m 0.034 0.001 0.058
p 0.034 0.001 0.059
u 0.034 0.001 0.060
d 0.028 0.001 0.061
g 0.028 0.001 0.062
f 0.023 0.001 0.062
h 0.023 0.001 0.063
l 0.023 0.001 0.064
w 0.023 0.001 0.064
v 0.017 0.000 0.064
k 0.011 0.000 0.064
Total Index of Coincidence 0.064
Of course this is the same calculation as before.
The only thing that has changed is the first column.
And the reason we get a number close to .065
is that all told the letter frequencies of this
message do not differ all that much from
those of larger representative samples of English.
And the encrypted version preserves that property
as long as we're using a monoalphabetic
substitution cipher.
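That invariance is easy to test on any text. The sketch below is an illustration under our own assumptions: index_of_coincidence and random_substitution are hypothetical helpers, the substitution alphabet is a random shuffle, and the sample sentence is arbitrary.

```python
import random
import string
from collections import Counter


def index_of_coincidence(text):
    """Sum of squared relative frequencies of the letters in text.

    Assumes text contains at least one letter a-z.
    """
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    n = len(letters)
    return sum((count / n) ** 2 for count in Counter(letters).values())


def random_substitution(text, seed=0):
    """Encrypt text with a randomly chosen monoalphabetic substitution."""
    rng = random.Random(seed)
    shuffled = list(string.ascii_lowercase)
    rng.shuffle(shuffled)
    table = str.maketrans(string.ascii_lowercase, "".join(shuffled))
    return text.lower().translate(table)


sample = "the quick brown fox jumps over the lazy dog"
encrypted = random_substitution(sample)

# The substitution only relabels the letters, so both values agree.
print(index_of_coincidence(sample), index_of_coincidence(encrypted))
```

Any seed gives the same pair of values, because the multiset of letter counts is unchanged by the relabeling.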
But let's take the same message
and encrypt it using a Vigenere cipher
(with the key 'milk'):
'kwf oiyxab cbmnsrg rae wmvj pqrsfa da osexwkk lpfmc m oooqxkx
ayuvec qe ua syxzbfiyd bz mdzsp rshqyq i wuaekwmy uuabqadsav
yr godg ruos bzpmuatyz hrqv iac uzwh kwfb vfwqztm zpcgtec ico
wyvk lmocckfm da l rmh pqrsfa'
Now we run the Friedman test on this:
Letter Count: 177
Letter Prob ProbSq Incidence
----------------------------
a 0.079 0.006 0.006
m 0.062 0.004 0.010
c 0.051 0.003 0.013
o 0.051 0.003 0.015
q 0.051 0.003 0.018
s 0.051 0.003 0.020
r 0.051 0.003 0.023
w 0.051 0.003 0.026
y 0.051 0.003 0.028
f 0.045 0.002 0.030
k 0.045 0.002 0.032
u 0.045 0.002 0.034
z 0.045 0.002 0.036
b 0.040 0.002 0.038
e 0.034 0.001 0.039
d 0.034 0.001 0.040
p 0.034 0.001 0.041
v 0.034 0.001 0.043
i 0.028 0.001 0.043
x 0.028 0.001 0.044
g 0.023 0.001 0.045
h 0.023 0.001 0.045
l 0.017 0.000 0.045
t 0.017 0.000 0.046
j 0.006 0.000 0.046
n 0.006 0.000 0.046
Total Index of Coincidence 0.046
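To see why the frequencies flatten out, it helps to look at how a Vigenere encryption treats repeated letters. The sketch below is a minimal version with a hypothetical function name. It assumes that every character, spaces included, advances the key position; this is an inference from the ciphertext shown above (its first three words decrypt to 'you cannot specify' under the key 'milk' with that convention), not a documented property of the tool that produced it.

```python
import string

ALPHABET = string.ascii_lowercase


def vigenere_encrypt(plaintext, key):
    """Shift each letter by the key letter at the same position.

    Assumption: every character, including spaces, advances the key
    position; non-letters are copied through unchanged. The key is
    assumed to consist of lowercase letters only.
    """
    out = []
    for position, ch in enumerate(plaintext.lower()):
        if ch in ALPHABET:
            shift = ALPHABET.index(key[position % len(key)])
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)
    return "".join(out)


# The same plaintext letter lands on four different ciphertext letters.
print(vigenere_encrypt("aaaa", "milk"))                # prints: milk
print(vigenere_encrypt("you cannot specify", "milk"))  # prints: kwf oiyxab cbmnsrg
```

Because one plaintext letter is spread over several ciphertext letters, no single ciphertext letter inherits the full weight of a common letter like 'e', and the sum of squared frequencies drops.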
In general the index of coincidence will be lowered significantly
when a polyalphabetic cipher like Vigenere is being used.
How much? What's the limit?
Well, suppose we have a completely random sample of
letters. What would its index of coincidence be?
Each letter is drawn with probability 1/26, so the probability of
drawing a given letter twice in a row is 1/26 * 1/26.
There are 26 possible letters for which this is the probability
of a coincidence, so we multiply this by 26:
P(coincidence) = 26*(1/26 * 1/26) = 0.038
So this is a lower bound. In general, if the index
of coincidence is close to 0.038, we've probably got a polyalphabetic cipher.
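The two reference values can be put side by side in code. The comparison function below is only a rough heuristic of our own (a nearest-reference rule), not a test prescribed by these notes, and its name is hypothetical; short messages can easily land in between.

```python
ENGLISH_IC = 0.065              # value computed earlier for English
RANDOM_IC = 26 * (1 / 26) ** 2  # uniformly random letters


def nearer_reference(ic):
    """Rough heuristic (our assumption): which reference value is ic nearer to?"""
    if abs(ic - RANDOM_IC) < abs(ic - ENGLISH_IC):
        return "polyalphabetic"
    return "monoalphabetic"


print(round(RANDOM_IC, 3))      # prints: 0.038
print(nearer_reference(0.064))  # prints: monoalphabetic
print(nearer_reference(0.046))  # prints: polyalphabetic
```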