Editorial for Baltic OI '18 P5 - Genetics
Submitting an official solution before solving the problem yourself is a bannable offence.
This problem had a simple cubic solution, lots of potential for optimization, and a nice probabilistic quadratic solution. We'll describe the latter.
Let's start by solving the problem for the binary alphabet case. In this case, we can make the math slightly nicer: instead of having the letters A
and C
, we use the numbers and . Computing the difference between two DNA strings and then becomes isomorphic to taking a dot product between two rows of a matrix , i.e., a sum of for (up to some constant factor and linear rescaling – instead of wanting sums to equal , we want them to equal ).
Now, what we want to check is that the dot products are for all rows . Rather than doing this individually for all , we will check the sum against every other row at once. There are a bunch of different approaches for this, but one nice way is to pick random values for each row, and then check that the dot product against the combined sum equals (where is the row we're checking).
In less abstract mathematical terms, and generalizing to larger alphabets, we pick random for each row, and then for each column and letter , compute as the sum of for each row which has the letter in the column. Then, we can check a row against every other by computing the sum . If this row is the answer, this equals times the sum of the other rows' 's, since it differs in exactly positions from each other row , and the sum thus includes times.
If the row isn't the answer, it highly likely does not equal that. Changing any of a row that differed in something else than positions would result in a changed sum, and if we say do all the arithmetic modulo we have a probability of accidentally passing the test on the order of . Hence, the solution passes with very close to probability.
Interesting notes:
Test data generation for this task was pretty tricky. For the naive solution not to pass, we want almost all distances between rows to be , but all but one row should also have some other row with distance not . The only matrices that the authors are aware of where distances are all equal are identity matrices (with , ), constant matrices (), and generalized Hadamard matrices (, ), and combinations of the three. Here denotes the alphabet size. For a binary alphabet, Hadamard matrices are defined to be matrices of with such that all rows differ in exactly positions. They are well studied, and a simple construction of them is a recursive one: start with the matrix:
and repeatedly replace by:
where means with all entries of XORed by . This results in a Hadamard matrix of any size which is a power of .
For alphabets of size we can do something more complicated, with instead replacing by:
Proof that this works is left as an exercise for the reader (the construction is derived from the multiplication table for , for the mathematically inclined).
Given matrices with all pairwise distances , we can perturb the matrix in various ways to make the answer unique, e.g. duplicating rows or changing bits of the matrix.
These complex constructions partly explain the constraints section – it is difficult to construct Hadamard matrices for sizes that are not powers of for alphabet size .
- There is also a fun sub-cubic solution: with the formulation that values are in , we can think of the problem simply as asking for a matrix product , from where we can check which values are . Matrix multiplication can in theory be computed in time, although in practice the algorithms that do this are very non-trivial to implement and have too high constant factors.
Comments