Editorial for Baltic OI '18 P5 - Genetics - DMOJ: Modern Online Judge

Editorial for Baltic OI '18 P5 - Genetics

Remember to use this editorial only when stuck, and not to copy-paste code from it. Please be respectful to the problem author and editorialist.
Submitting an official solution before solving the problem yourself is a bannable offence.

This problem had a simple cubic solution, lots of potential for optimization, and a nice probabilistic quadratic solution. We'll describe the latter.

Let's start by solving the problem for the binary alphabet case. In this case, we can make the math slightly nicer: instead of having the letters A and C, we use the numbers $1$ ~1~ and $-1$ ~-1~. Computing the difference between two DNA strings $a$ ~a~ and $b$ ~b~ then becomes isomorphic to taking a dot product between two rows of a matrix $A$ ~A~, i.e., a sum of $A_{i,j} \cdot A_{i',j}$ ~A_{i,j} \cdot A_{i',j}~ for $j = 1 \dots M$ ~j = 1 \dots M~ (up to some constant factor and linear rescaling – instead of wanting sums to equal $K$ ~K~, we want them to equal $K' = M-2K$ ~K' = M-2K~).

Now, what we want to check is that the dot products are $K'$ ~K'~ for all rows $b = A_{i' \ne i}$ ~b = A_{i' \ne i}~. Rather than doing this individually for all $b$ ~b~, we will check the sum against every other row at once. There are a bunch of different approaches for this, but one nice way is to pick random values $w_i$ ~w_i~ for each row, and then check that the dot product against the combined sum $\sum_{i'} w_{i'} A_{i'}$ ~\sum_{i'} w_{i'} A_{i'}~ equals $(\sum_{i'} w_{i'}-w_i) \cdot K'$ ~(\sum_{i'} w_{i'}-w_i) \cdot K'~ (where $i$ ~i~ is the row we're checking).

In less abstract mathematical terms, and generalizing to larger alphabets, we pick random $w_i$ ~w_i~ for each row, and then for each column $j$ ~j~ and letter $c \in \{A, C, G, T\}$ ~c \in \{A, C, G, T\}~, compute $D_c$ ~D_c~ as the sum of $w_i$ ~w_i~ for each row which has the letter $c$ ~c~ in the $j^\text{th}$ ~j^\text{th}~ column. Then, we can check a row $a_i$ ~a_i~ against every other by computing the sum $\sum_j \sum_{d \ne A_{i,j}} D_d$ ~\sum_j \sum_{d \ne A_{i,j}} D_d~. If this row is the answer, this equals $K$ ~K~ times the sum of the other rows' $w$ ~w~'s, since it differs in exactly $K$ ~K~ positions from each other row $A_{i'}$ ~A_{i'}~, and the sum thus includes $w_{i'}$ ~w_{i'}~ $K$ ~K~ times.

If the row isn't the answer, it highly likely does not equal that. Changing any $w$ ~w~ of a row that differed in something else than $K$ ~K~ positions would result in a changed sum, and if we say do all the arithmetic modulo $2^{64}$ ~2^{64}~ we have a probability of accidentally passing the test on the order of $2^{-50}$ ~2^{-50}~. Hence, the solution passes with very close to $100\%$ ~100\%~ probability.

Interesting notes:

Test data generation for this task was pretty tricky. For the naive solution not to pass, we want almost all distances between rows to be $K$ ~K~, but all but one row should also have some other row with distance not $K$ ~K~. The only matrices that the authors are aware of where distances are all equal are identity matrices (with $N = M$ ~N = M~, $K = 2$ ~K = 2~), constant matrices ( $K = 0$ ~K = 0~), and generalized Hadamard matrices ( $N = M$ ~N = M~, $K = N(1-1/A)$ ~K = N(1-1/A)~), and combinations of the three. Here $A$ ~A~ denotes the alphabet size. For a binary alphabet, Hadamard matrices are defined to be matrices of $\{1, 0\}$ ~\{1, 0\}~ with $N = M$ ~N = M~ such that all rows differ in exactly $N/2$ ~N/2~ positions. They are well studied, and a simple construction of them is a recursive one: start with the matrix:

$\displaystyle H = \begin{bmatrix}1\end{bmatrix}$

and repeatedly replace $H$ ~H~ by:

$\displaystyle \begin{bmatrix}H & H \\ H & H \oplus 1\end{bmatrix}$

where $H \oplus 1$ ~H \oplus 1~ means $H$ ~H~ with all entries of $H$ ~H~ XORed by $1$ ~1~. This results in a Hadamard matrix of any size which is a power of $2$ ~2~.

For alphabets of size $4$ ~4~ we can do something more complicated, with instead replacing $H$ ~H~ by:

$\displaystyle \begin{bmatrix}H & H & H & H \\ H & H \oplus 1 & H \oplus 2 & H \oplus 3 \\ H & H \oplus 2 & H \oplus 3 & H \oplus 1 \\ H & H \oplus 3 & H \oplus 1 & H \oplus 2\end{bmatrix}$

Proof that this works is left as an exercise for the reader (the construction is derived from the multiplication table for $GF(4)$ ~GF(4)~, for the mathematically inclined).

Given matrices with all pairwise distances $K$ ~K~, we can perturb the matrix in various ways to make the answer unique, e.g. duplicating rows or changing bits of the matrix.

These complex constructions partly explain the constraints section – it is difficult to construct Hadamard matrices for sizes that are not powers of $A$ ~A~ for alphabet size $A$ ~A~.
There is also a fun sub-cubic solution: with the formulation that values are in $\{-1, 1\}$ ~\{-1, 1\}~, we can think of the problem simply as asking for a matrix product $A \cdot A^T$ ~A \cdot A^T~, from where we can check which values are $K$ ~K~. Matrix multiplication can in theory be computed in $\mathcal O(n^{2.373})$ ~\mathcal O(n^{2.373})~ time, although in practice the algorithms that do this are very non-trivial to implement and have too high constant factors.

Comments

There are no comments at the moment.