|
|
|
Hiragana Writing |
  |
The hiragana writing system is one of three writing systems used in the Japanese language. Two of the three (hiragana and katakana) are syllabaries (one symbol per syllable); one (kanji) is ideographic (one symbol per word). Since languages have many more words than syllables, syllabaries are a lot easier to learn than ideographic writing systems. Depending on how you count, hiragana has as many as 107 characters, though many of these are composable from simple rules. Kanji, on the other hand, has thousands of characters. The term romaji is also used. This denotes a conventional transliteration of Japanese sounds into Roman characters. It is not an official Japanese writing system, but it shows up fairly often, usually for the convenience of foreigners. |
||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
The basic characters |
  |
|
||||||||||
| The sounds |   |
The romaji representations are pretty good approximations, but this website has some nice recordings to help you hear some differences. Pay close attention to the "r" series (syllables whose romaji representation begins with "r"). Is that sound at the beginning really an "r"? What English vowel is "e" closest to? The one in let or the one in late? Is the t in the T-series really an English t, as in the word top? |
||||||||||
| Pitch accent |   |
In some sense the idea of a syllable-based writing system is very natural to Japanese. Japanese has pitch accent. This means that the accented syllables of words are signalled by pitch (frequency) rather than stress (a combination of amplitude and duration). This means the timing of Japanese syllables is constant, which gives it a prosody very unlike that of English. It also means Japanese has the option of multi-syllable words with no accent (no high tone); English does not seem to have unaccented multi-syllable words.
Accent on first syllable / Accent elsewhere, or accentless
In principle this should reduce the constraints on combining Japanese syllables. In English there are syllable types that are very likely to be unstressed (any syllable with a schwa in it). Japanese does not have this. Syllables are more equal, at least as far as timing and accent go. |
||||||||||
|
General Uesugi's Cipher |
  |
A simple polysyllabic substitution cipher is described here. Actually, a message written in General Uesugi's cipher gives one more important clue that a substitution cipher might be in use. Normal Japanese texts mix all 3 writing systems. Here is an example. All 3 writing systems are used in the very first line. In fact, Japanese quite commonly switches writing systems in the middle of a word. Normally hiragana is used for function words, such as auxiliary verbs, as well as endings, such as verb suffixes and case markers. In the first line we see, right after the kanji characters 社が出資, the sequence している (si te i ru), which is the progressive form of the very common verb do, written in hiragana. So when we see a message written entirely in hiragana, either it is being written by or for a child, or for some stylistic "cuteness" effect, or something funny is going on. |
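The "entirely in hiragana" clue can be checked mechanically. The following is a minimal sketch, assuming that Unicode block membership is a good enough proxy for the writing system; the function names `script_of`, `script_mix`, and `is_all_hiragana` are made up for this illustration.

```python
def script_of(ch):
    """Classify one character by Unicode block (a rough heuristic)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs (most kanji)
        return "kanji"
    return "other"


def script_mix(text):
    """Return the fraction of non-space characters found in each script."""
    counts = {}
    total = 0
    for ch in text:
        if ch.isspace():
            continue
        total += 1
        name = script_of(ch)
        counts[name] = counts.get(name, 0) + 1
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


def is_all_hiragana(text):
    """True when every non-space character falls in the hiragana block."""
    mix = script_mix(text)
    return bool(mix) and set(mix) == {"hiragana"}


print(script_mix("社が出資している"))   # mixed kanji and hiragana
print(is_all_hiragana("している"))      # hiragana only
```

A normal text should show a mix of scripts; a long message that comes back as 100% hiragana is the suspicious case described above.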
||||||||||
|
Cryptographic Properties |
  |
Let us investigate the question of whether this cipher should be harder or easier to break than an English substitution cipher. We COULD make the following assumption (consistent with Kerckhoffs's Principle): the cipher is disyllabic, that is, we know that 2 hiragana characters of ciphertext are always substituted for 1 hiragana character of plaintext. However, we could also just GUESS that and do a simple statistical test to see if the guess is right. We do the index of coincidence test (Friedman's test), but for digraphs. That is, we check how likely it is that two pairs of characters picked at random from the ciphertext are the same pair. The reasoning is as follows:
Some questions:
Answers:
|
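The digraph version of the index of coincidence test can be sketched as follows. This is only an illustration under stated assumptions: the ciphertext is a plain string, pairs are taken without overlap, and the helper names `digraphs` and `index_of_coincidence` are hypothetical.

```python
from collections import Counter


def digraphs(text, offset=0):
    """Split text into non-overlapping pairs, starting at the given offset."""
    return [text[i:i + 2] for i in range(offset, len(text) - 1, 2)]


def index_of_coincidence(units):
    """Probability that two units drawn without replacement are identical."""
    counts = Counter(units)
    n = sum(counts.values())
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


# Toy "ciphertext" using Latin letters in place of hiragana.
toy = "abababcd"
print(index_of_coincidence(digraphs(toy, offset=0)))  # pairs: ab ab ab cd
print(index_of_coincidence(digraphs(toy, offset=1)))  # pairs: ba ba bc
```

If the cipher really replaces each plaintext character with a fixed pair, the correctly aligned pairs inherit the uneven frequencies of the plaintext, so their index of coincidence should be well above what uniformly random pairs would give.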
||||||||||
|
Frequencies for Hiragana |
  |
Here is a table of Japanese hiragana frequencies. Of course, what we are interested in is not the frequencies of hiragana in Japanese text as it is usually written (which would be the frequencies of grammatical formative syllables and function word syllables), but the frequencies of the hiragana characters for text that is ALL transcribed into hiragana (which should roughly approximate the frequencies of Japanese syllables in speech). The table was constructed using the following steps.
|
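The counting step behind a frequency table like this can be sketched in a few lines. This is an assumption-laden illustration, not the procedure actually used for the table: it presumes the text has already been converted entirely to hiragana, and `relative_frequencies` is a made-up name.

```python
from collections import Counter


def relative_frequencies(text):
    """Return (character, relative frequency) pairs, most frequent first."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return [(ch, n / total) for ch, n in counts.most_common()]


# A tiny hiragana-only sample; a real table needs a large corpus.
for ch, freq in relative_frequencies("ここに いる"):
    print(ch, freq)
```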
||||||||||
|
Interpreting the numbers |
  |
The numbers generally show that hiragana frequencies are "bumpy". That is, there is quite a variation in frequency from most frequent to least frequent. But how does this compare to the case of English letters? We need a measure of how "bumpy" a probability distribution is. We hereby introduce the concept of entropy. |
||||||||||
| Entropy |   |
Entropy is a measure of the average amount of surprise in a probability distribution. If the probability distribution characterizes a set of signals in some channel (such as English letters in English texts), it measures how easy it is to predict the signals. The higher the entropy, the harder it is to predict. High entropy means high average surprise, which means low predictability. Probability is itself a measure of surprise. The lower the probability, the greater the surprise:
(1) Surp(x) = 1/Prob(x)
We also want a measure of surprise that has the following property. The amount of surprise of two independent events is just the sum of their surprise values:
(2) Surp(x;y) = Surp(x) + Surp(y)
Equation (1) does not have this neat property:
Prob(x;y) = Prob(x) * Prob(y)
Surp(x;y) = 1/(Prob(x) * Prob(y))
This does not in general equal:
1/Prob(x) + 1/Prob(y)
For example:
1/4 * 1/3 = 1/12
1/(1/12) = 12
But:
1/(1/4) = 4
1/(1/3) = 3
4 + 3 = 7
What we need to get this to work is some function f such that:
f(x * y) = f(x) + f(y)
It turns out the log function does this!
Surp(x) = - log Prob(x)
The log of a number between 0 and 1 is always negative, so we throw in the minus sign to get surprising events to have bigger surprise values. For example (taking logs base 2), suppose:
Prob(A) = 1/8
log Prob(A) = - log 8 = -3
And suppose:
Prob(B) = 1/4
log Prob(B) = - log 4 = -2
Now A is the more surprising (less probable) event, so we throw in the minus signs to assign A the bigger surprise value:
Surp(A) = - log Prob(A) = - -3 = 3
Surp(B) = - log Prob(B) = - -2 = 2
The last thing you need to know is that computer scientists and information technologists, and generally people who worry about signals, channels, channel capacity, and noise, like to call surprise "information". So rather than saying the measure of the surprise of A is 3, they say the measure of the information of A is 3. Finally, we're interested in the AVERAGE amount of surprise for all the events in the probability distribution (for all possible signals), so we add the information measure for each signal weighted by its probability. This really is the general definition of what an average is. For all the signals 1,n:
H(p) = - Sum_{i=1..n} p(i) * log p(i)
For some reason H is the letter used for entropy. |
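The argument above can be checked numerically. The sketch below assumes base-2 logs (which is what log 8 = 3 implies); the function names `surprise` and `entropy` are chosen for this illustration.

```python
import math


def surprise(p):
    """Surprise (information) of an event with probability p, in bits."""
    return -math.log2(p)


def entropy(probs):
    """Average surprise: each signal's surprise weighted by its probability."""
    return sum(p * surprise(p) for p in probs if p > 0)


p_x, p_y = 1 / 4, 1 / 3

# The reciprocal measure is not additive for independent events ...
print(1 / (p_x * p_y), "vs", 1 / p_x + 1 / p_y)

# ... but the log measure is (up to floating-point rounding).
print(surprise(p_x * p_y), "vs", surprise(p_x) + surprise(p_y))

# The A and B examples from the text.
print(surprise(1 / 8), surprise(1 / 4))

# Entropy of a small three-signal distribution.
print(entropy([1 / 2, 1 / 4, 1 / 4]))
```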
||||||||||
|
Example Entropy of English letters |
  |
Entropy for ../models/let_nr.txt
Let  Prob      Information    Avg
---------------------------------------------
e    0.123519  3.01719511707  0.372680923665
t    0.091202  3.45479072768  0.315083823946
a    0.080872  3.6282158994   0.293421076216
o    0.075482  3.72772354013  0.281376028256
i    0.073973  3.75685740382  0.277906012733
n    0.070675  3.82265621172  0.270166227763
s    0.064620  3.95187543917  0.255370190879
r    0.063176  3.9844795943   0.251723482849
h    0.051893  4.26831624746  0.221495735029
l    0.042018  4.57284869646  0.192141956528
d    0.037956  4.71952822809  0.179134413425
c    0.032046  4.96371189971  0.159067111538
u    0.027356  5.1919988953   0.14203232178
m    0.024467  5.35301897232  0.130972315196
f    0.022234  5.49108867051  0.1220888655
p    0.021212  5.55897553618  0.117916989074
g    0.020374  5.61712693929  0.114443344261
w    0.018895  5.72585167118  0.108189967327
y    0.018290  5.77280111469  0.105584532388
b    0.016459  5.92497950522  0.0975192376765
v    0.010586  6.56169863069  0.0694621417045
k    0.007437  7.07106351252  0.0525874993426
x    0.001921  9.02392676566  0.0173349633168
j    0.001648  9.24506804214  0.0152358721334
q    0.001034  9.91754809901  0.0102547447344
z    0.000656  10.5740165647  0.00693655486645
Sample Space: 26
Entropy: 4.18706288699
Entropy per signal: 0.890781105188
(4.1870628869941813, 26, 0.89078110518776332)
To get the entropy per signal, I just divided by log 26, because there are 26 letters. This tells me on average how surprising each letter is. |
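The calculation can be re-run from the Prob column alone. The sketch below is not the original script (which is not shown); it assumes base-2 logs, and the names `ENGLISH_LETTER_PROBS`, `entropy`, and `per_signal_entropy` are illustrative. Since it starts from the rounded probabilities printed in the table, expect agreement with the printed totals to roughly two decimal places rather than digit for digit.

```python
import math

# Probabilities copied from the Prob column of the English letter table.
ENGLISH_LETTER_PROBS = {
    "e": 0.123519, "t": 0.091202, "a": 0.080872, "o": 0.075482,
    "i": 0.073973, "n": 0.070675, "s": 0.064620, "r": 0.063176,
    "h": 0.051893, "l": 0.042018, "d": 0.037956, "c": 0.032046,
    "u": 0.027356, "m": 0.024467, "f": 0.022234, "p": 0.021212,
    "g": 0.020374, "w": 0.018895, "y": 0.018290, "b": 0.016459,
    "v": 0.010586, "k": 0.007437, "x": 0.001921, "j": 0.001648,
    "q": 0.001034, "z": 0.000656,
}


def entropy(probs):
    """H(p) = -sum of p(i) * log2 p(i) over all signals with p(i) > 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def per_signal_entropy(probs):
    """Entropy divided by log2 of the number of signals (needs >= 2 signals)."""
    probs = list(probs)
    return entropy(probs) / math.log2(len(probs))


h = entropy(ENGLISH_LETTER_PROBS.values())
hps = per_signal_entropy(ENGLISH_LETTER_PROBS.values())
print("Sample Space:", len(ENGLISH_LETTER_PROBS))
print("Entropy:", h)
print("Entropy per signal:", hps)
```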
||||||||||
|
Example Entropy of Hiragana Characters |
  |
Entropy for ../japanese_models/hir_freq_nr.txt
Let  Prob      Information    Avg
---------------------------------------------
う   0.073907  3.75814517605  0.277753235527
ん   0.069157  3.85398090291  0.266529757302
い   0.066484  3.91084900552  0.260008885283
し   0.040915  4.61122633785  0.188668325613
き   0.031462  4.99024580579  0.157003113542
に   0.030106  5.05380515104  0.152149857877
か   0.029626  5.07699233802  0.150410975006
ょ   0.029277  5.09408846084  0.149139627868
の   0.027069  5.20721459818  0.140954091958
ち   0.026263  5.25082446631  0.137902402959
く   0.025535  5.29138013072  0.135115391638
と   0.025279  5.30591679602  0.134128270687
は   0.021296  5.55327371341  0.118262517001
た   0.020157  5.63257525367  0.113535819388
ゅ   0.017647  5.82443324441  0.102783773464
こ   0.017515  5.83526520163  0.102204670007
て   0.017004  5.87798202568  0.0999492063647
さ   0.016562  5.91597928893  0.0979804489832
つ   0.015702  5.99290785955  0.0941006392107
な   0.015462  6.01512924674  0.0930059284131
せ   0.014067  6.15154150488  0.0865337343492
じ   0.013827  6.17636801627  0.0854006405609
が   0.013742  6.18526420162  0.0849979006587
る   0.013649  6.19506093445  0.0845563866942
ろ   0.012913  6.27503197749  0.0810294879253
り   0.012262  6.34966188002  0.0778595539727
け   0.011061  6.49837436751  0.071878518879
を   0.011014  6.50451767617  0.0716407576854
ど   0.010402  6.58699524673  0.0685179245565
っ   0.010286  6.60317413097  0.0679202491112
よ   0.010232  6.61076802044  0.0676413783852
ぜ   0.010139  6.62394082207  0.0671601359949
で   0.009697  6.68824580072  0.0648559195296
お   0.008829  6.82353424166  0.0602449838196
ら   0.008434  6.88956726305  0.0581066102965
ご   0.007954  6.97410372252  0.0554720210089
す   0.007768  7.0082410839   0.0544400167398
あ   0.007744  7.01270533205  0.0543063900914
ー   0.007620  7.03599328694  0.0536142688465
も   0.007613  7.03731920646  0.0535751111188
だ   0.007597  7.04035446342  0.0534855728586
め   0.006962  7.1662824706   0.0498916585603
ま   0.006853  7.18904859766  0.0492665500398
れ   0.006768  7.2070547162   0.0487773463193
え   0.006567  7.25054982942  0.0476143607298
ぎ   0.005908  7.40311445856  0.0437376002212
ひ   0.005862  7.41439131663  0.0434631618981
ほ   0.005288  7.56306210782  0.0399934724261
み   0.005273  7.56716028794  0.0399016361983
そ   0.005249  7.57374168711  0.0397545701157
ね   0.004676  7.74050935479  0.036194621743
げ   0.004413  7.82402453735  0.0345274202833
ゃ   0.004389  7.83189201446  0.0343741740515
わ   0.004366  7.83947215433  0.0342271354258
ふ   0.004227  7.88615017229  0.0333347567783
ぶ   0.004219  7.8888831971   0.0332831982086
や   0.003630  8.10581473644  0.0294241074933
ン   0.003607  8.11498486154  0.0292707503956
ば   0.002747  8.50792737425  0.0233712764971
ぱ   0.002708  8.52855654575  0.0230953311259
ル   0.002406  8.69914764215  0.020930149227
ス   0.002274  8.78055203043  0.0199669753172
べ   0.002135  8.87154821482  0.0189407554386
む   0.002057  8.9252424908   0.0183592238036
ざ   0.002057  8.9252424908   0.0183592238036
ト   0.002049  8.93086430031  0.0182993409513
び   0.002003  8.96362186351  0.0179541345926
イ   0.001879  9.0558192179   0.0170158843104
へ   0.001747  9.16090467641  0.0160041004697
ゆ   0.001716  9.18673473182  0.0157644367998
ぽ   0.001709  9.19263188765  0.015710207896
n    0.001701  9.19940114366  0.0156481813454
o    0.001693  9.20620231143  0.0155861005132
v    0.001693  9.20620231143  0.0155861005132
ぞ   0.001654  9.23982505015  0.015282670633
ぼ   0.001468  9.41193231648  0.0138167166406
ラ   0.001383  9.49798312817  0.0131357106663
フ   0.001352  9.53068913304  0.0128854917079
ク   0.001329  9.55544318005  0.0126991839863
ド   0.001220  9.67890313687  0.011808261827
ず   0.001120  9.80228555238  0.0109785598187
リ   0.001019  9.93863023316  0.0101274642076
ぐ   0.000996  9.97156663726  0.00993168037071
レ   0.000941  10.0535176566  0.00946036011486
づ   0.000879  10.1518492142  0.00892347545926
ッ   0.000732  10.415868731   0.00762441591112
ぴ   0.000717  10.4457392606  0.00748959504987
Sample Space: 87
Entropy: 5.49661603106
Entropy per signal: 0.85312187428
(5.4966160310600634, 87, 0.85312187428022668)
To get the entropy per signal, I just divided by log 87, because there were 87 characters. This tells me on average how surprising each character is.
Now compare the per signal entropy measure for English and Japanese hiragana:
English letters: 0.89078110518776332
Japanese hiragana: 0.85312187428
The English entropy per signal is actually a little higher. This means the average surprise on seeing a new English letter is greater than the average surprise on seeing a new Japanese hiragana character. |
||||||||||
|
Evaluating the Code |
  |
We considered the hypothesis that the hiragana substitution code might be a better code because hiragana characters would be more evenly distributed than English characters. This turned out to be wrong. In fact the hiragana distribution was quite bumpy. We tried to get precise about the idea of the bumpiness of a probability distribution by introducing the notion of entropy:
H(p) = - Sum_{i=1..n} p(i) * log p(i)
This measures the average amount of surprise (or informativeness) of the n signals. We adjusted the entropy for the size of the signal space to get something called per signal entropy. This measure turned out to be slightly lower for Japanese:
English letters: 0.89078110518776332
Japanese hiragana: 0.85312187428
How good a measure of "toughness of code" is this? Well, we argued that the toughest code is one in which every character has equal probability. The per signal entropy for such a code is always 1. For example, if a signal system has 8 characters, all equally probable, then the entropy is:
H(p) = 8 * (1/8 * - log (1/8))
= 8 * (1/8 * (- - 3))
= 8 * (1/8 * 3)
= 8 * 3/8
= 3
To get the per signal entropy we divide by log 8 = 3:
Hps(p) = H(p)/log 8 = 3/3 = 1
So the per signal entropy is a measure of how close a system is to the hardest case. It will always be a number between 0 and 1. And by this measure hiragana makes an easier substitution cipher than English characters. |
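The 8-character example, and the claim that bumpier distributions score below 1, can be checked with a short sketch. It assumes base-2 logs; the function name `per_signal_entropy` and the skewed distribution are made up for illustration.

```python
import math


def per_signal_entropy(probs):
    """Entropy in bits divided by log2 of the number of signals.

    Assumes at least two signals, so the divisor is not zero.
    """
    probs = list(probs)
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return h / math.log2(len(probs))


# Eight equally probable characters: the hardest case.
uniform = [1 / 8] * 8

# Eight characters with a very "bumpy" distribution (sums to 1).
skewed = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 32, 1 / 64, 1 / 128, 1 / 128]

print(per_signal_entropy(uniform))
print(per_signal_entropy(skewed))
```

The uniform case reproduces the value 1 worked out by hand above, while the skewed case falls well below it.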