Editorial for Baltic OI '01 P2 - Crack the Code


Remember to use this editorial only when stuck, and not to copy-paste code from it. Please be respectful to the problem author and editorialist.
Submitting an official solution before solving the problem yourself is a bannable offence.

Encryption Program

The encryption program encrypts (one by one) the successive characters from the input file. Characters that are not letters (of the English alphabet) are left unaltered. The key given by the user specifies the way characters are coded. Namely, the i^\text{th} character from the input file (counting from 1) is "rotated" in a "cyclic" English alphabet (consisting of capital letters only) by the value of the [(i-1) \bmod 10+1]^\text{th} element of the key (Elements of the key are also numbered from 1).

Reverse Engineering Algorithm

It can be easily seen, that knowing the key we can decode the message in a similar way it was coded — we can apply the coding algorithm given above for a key consisting of negated elements of the coding key. Hence, the problem reduces to finding the coding key. In case of texts written in the natural language, the relative frequencies of particular letters are different. So, we can use some statistical methods to determine the key. Basing on the uncoded piece of text we can calculate the probability with which particular letters appear in the text.

We will denote by p_i the probability that (the next) character in the text is i, and by P the probability distribution itself. To find the i^\text{th} element of the key, we calculate the probability distribution for letters appearing in the coded text on positions i, (i+10), (i+20), \dots.

We will denote by q_i the probability of appearance of letter i, and by Q the probability distribution itself.

Now, it is enough to find such a cyclic rotation of sequence (q_A, q_B, \dots, q_Z) that is closest to the distribution P.

Intended Solution

Our solution is based on the method described above. We also need a measure of how "close" the distribution P is to R = (r_A, r_B, \dots, r_Z) = (\dots, q_Z, q_A, q_B, \dots) (some cyclic rotation of the sequence (q_A, q_B, \dots, q_Z)). We will use here statistical test \chi^2, which can be expressed as:

\displaystyle n \sum_{i=A}^Z \frac{(p_i - r_i)^2}{p_i}

where n is the number of letters in the sample on which Q is based.

Remark. The test \chi^2 should be used only in the cases when for each i the value np_i is not too small (let's say, not less than 0.1). Therefore, sometimes it may be necessary to group a few events in the distribution P, so that their total probability is big enough, relatively to n.

Other Solutions

This solution differs from the one above only in the measure used to compare probability distributions. We can view the distributions P and R as vectors in a 26-dimensional space (26 is the number of letters in the alphabet). We can calculate their dot product and normalise it, dividing it by the square root of the product of lengths of the two vectors. The closer the result is to 1, the closer are the two distributions to each other.


Comments

There are no comments at the moment.