Literate Programming
The file names of literate haskell programs end with ".lhs".
Everything in literate haskell is comment by default. Codes are in
lines starting with '>' character, e.g.,
> import Data.Char (ord, chr, toUpper, isAlpha)
Histograms
==========
We write haskell code to compute a histogram of any string.
> histogram :: String -> [(Char, Double)]
> histogram str = map countChar ['A'..'Z']
> where countChar c = (c,
> (fromIntegral $ length $ filter (== c) str) /
> (fromIntegral $ length str))
Cracking the Caeser Cipher
=====================
Below is a familiar implementation of a shift cipher (Caeser cipher)
over all-uppercase messages.
> encode :: Char -> Int
> encode c = ord c - 65
> decode :: Int -> Char
> decode i = chr (i + 65)
> encodeMsg :: String -> [Int]
> encodeMsg = map encode
> decodeMsg :: [Int] -> String
> decodeMsg = map decode
> caeserEnc :: Int -> Char -> Char
> caeserEnc k c = decode $ (k + encode c) `mod` 26
> caeserDec :: Int -> Char -> Char
> caeserDec k c = decode $ (encode c - k ) `mod` 26
> caeserEncMsg :: Int -> String -> String
> caeserEncMsg = map . caeserEnc -- a point-free form
> caeserDecMsg :: Int -> String -> String
> caeserDecMsg = (map . caeserDec)
Assume the armed force intercepted the following message:
"AOLPTWLYPHSQHWHUOHZKLJPKLKUVAAVHAAHJRBZZY"
It is easy to try all possible keys to decrypt this message with
Caeser cipher:
> crack ctxt = map (\k -> caeserDecMsg k ctxt) [0..25]
Running crack function on "AOLPTWLYPHSQHWHUOHZKLJPKLKUVAAVHAAHJRBZZY" gives 26 possible decodings:
"AOLPTWLYPHSQHWHUOHZKLJPKLKUVAAVHAAHJRBZZY",
"ZNKOSVKXOGRPGVGTNGYJKIOJKJTUZZUGZZGIQAYYX",
"YMJNRUJWNFQOFUFSMFXIJHNIJISTYYTFYYFHPZXXW",
"XLIMQTIVMEPNETERLEWHIGMHIHRSXXSEXXEGOYWWV",
"WKHLPSHULDOMDSDQKDVGHFLGHGQRWWRDWWDFNXVVU",
"VJGKORGTKCNLCRCPJCUFGEKFGFPQVVQCVVCEMWUUT",
"UIFJNQFSJBMKBQBOIBTEFDJEFEOPUUPBUUBDLVTTS",
"THEIMPERIALJAPANHASDECIDEDNOTTOATTACKUSSR",
"SGDHLODQHZKIZOZMGZRCDBHCDCMNSSNZSSZBJTRRQ",
"RFCGKNCPGYJHYNYLFYQBCAGBCBLMRRMYRRYAISQQP",
"QEBFJMBOFXIGXMXKEXPABZFABAKLQQLXQQXZHRPPO",
"PDAEILANEWHFWLWJDWOZAYEZAZJKPPKWPPWYGQOON",
"OCZDHKZMDVGEVKVICVNYZXDYZYIJOOJVOOVXFPNNM",
"NBYCGJYLCUFDUJUHBUMXYWCXYXHINNIUNNUWEOMML",
"MAXBFIXKBTECTITGATLWXVBWXWGHMMHTMMTVDNLLK",
"LZWAEHWJASDBSHSFZSKVWUAVWVFGLLGSLLSUCMKKJ",
"KYVZDGVIZRCARGREYRJUVTZUVUEFKKFRKKRTBLJJI",
"JXUYCFUHYQBZQFQDXQITUSYTUTDEJJEQJJQSAKIIH",
"IWTXBETGXPAYPEPCWPHSTRXSTSCDIIDPIIPRZJHHG",
"HVSWADSFWOZXODOBVOGRSQWRSRBCHHCOHHOQYIGGF",
"GURVZCREVNYWNCNAUNFQRPVQRQABGGBNGGNPXHFFE",
"FTQUYBQDUMXVMBMZTMEPQOUPQPZAFFAMFFMOWGEED",
"ESPTXAPCTLWULALYSLDOPNTOPOYZEEZLEELNVFDDC",
"DROSWZOBSKVTKZKXRKCNOMSNONXYDDYKDDKMUECCB",
"CQNRVYNARJUSJYJWQJBMNLRMNMWXCCXJCCJLTDBBA",
"BPMQUXMZQITRIXIVPIALMKQLMLVWBBWIBBIKSCAAZ"
Obliviously, it is tedious to manually examine the decrypted messages
to look for the "correct" entry. To automate the process, we can
leverage ideas of *statistic tests*. The statistical tool we want to
show is Chi-square test.
Histogram of Enlish:
> baseline_hist = map (\(a,b)->(a, b/100)) [('A', 8.167), ('B', 1.492), ('C', 2.782), ('D', 4.253), ('E', 12.70), ('F', 2.228), ('G', 2.015), ('H', 6.094), ('I', 6.966), ('J', 0.153), ('K', 0.772), ('L', 4.025), ('M', 2.406), ('N', 6.749), ('O', 7.507), ('P', 1.929), ('Q', 0.095), ('R', 5.987), ('S', 6.327), ('T', 9.056), ('U', 2.758), ('V', 0.978), ('W', 2.360), ('X', 0.150), ('Y', 1.974), ('Z', 0.074)]
Chi-square can be implemented as
> chi_square :: [(Char, Double)] -> [(Char, Double)] -> Double
> chi_square ha hb = sum [ (a-b)^2 / b | ((_, a), (_, b)) <- zip ha hb]
Our improved crack can be realized as
> betterCrack ctxt = map (\k -> chi_square (histogram $ caeserDecMsg k ctxt) baseline_hist) [0..25]
Vigenère Cipher
=====
Observing that the 26-key key space would be too small, the Japanese
decided to use a Vigenère cipher. With Vigenère, a password is a short
phrase, e.g., "attack", which is first converted to a sequence of
numbers (0--25) and will be repreated to encrypt the corresponding
letters in the original message. This idea of repeatedly using the
same key is known as the Vigenère cipher.
> vigEncMsg :: String -> String -> String
> vigEncMsg passphrase msg = [ caeserEnc k m | (k, m) <- zip pcode msg]
> where pcode = concat $ repeat $ map encode $ map toUpper passphrase
> vigDecMsg :: String -> String -> String
> vigDecMsg passphrase msg = [ caeserDec k m | (k, m) <- zip pcode msg]
> where pcode = concat $ repeat $ map encode $ map toUpper passphrase
The Index of Coincidence (IoC) for a given string 'str' can be
computed by the ic function given below. Note that `hist` is an array
of 26 counters, each counting the number of appearances of an English
letter (as opposed to the probabilities used in computing `histogram`
above). This allows us to calculate the IoC value in the
"draw-without-replacement" model.
> ic :: String -> Double
> ic str = sum [fromIntegral (x*(x-1)) / fromIntegral (n*(n-1)) | x <- hist]
> where hist = map countif ['A'..'Z']
> countif c = length $ filter (== c) str
> n = length str
Let the length of the ciphertext be n and the length of the Virgenere
cipher key is k. Assume the keys are chosen uniformly random and n >>
k. Since letters in the original message is partitioned into k
interleaving categories in terms of the shift cipher keys. The
probability that two letters are picked from the same category is 1/k,
in which case the resulting IoC would be the same as that of
English. On the other hand, the two letters will be picked from two
different categories with probability (k-1)/k, in which case the
resulting IoC is 1/26 (exactly the same as that for uniformly random
texts). Thus,
IoC_ctxt = 1/k * IoC_English + (k-1)/k * (1/26)
Solving the equation for k, we get the following key length prediction
function:
> keyLength ctxt = (0.065 - (1/26)) / (ic ctxt - (1/26))
In practice (like the demos we had in class), you may want to remove
some noise by not assuming n >> k. Then a slightly more accurate
analysis can go as follows:
- View the message as being partitioned into k categories, where each
category contains n/k letters.
- There are a total of n(n-1)/2 ways to pick (without replacement) two
letters from the n-letter message (when order does not matter).
- There are n(n/k - 1)/2 possible ways where the two randomly picked
letters belong to the same category.
- There are n(n-n/k)/2 possible ways where the two randomly picked letters belong to two different categories.
So, IoC_ctxt = ((n(n/k - 1)/2) * IoC_English + n(n-n/k)/2 * (1/26)) / (n(n-1)/2)
Let IoC_English be 0.065 and solve k from the above equation, we can
derive the following slightly more accurate predictor:
> keyLength' ctxt = 0.027*n / ((n-1)*(ic ctxt)- 0.038*n + 0.065)
> where n = fromIntegral $ length ctxt