Editorial for Baltic OI '18 P2 - Martian DNA

Remember to use this editorial only when stuck, and not to copy-paste code from it. Please be respectful to the problem author and editorialist.
Submitting an official solution before solving the problem yourself is a bannable offence.

Subtask 1

Let $s = s_1 s_2 \dots s_N$ ~s = s_1 s_2 \dots s_N~ denote the DNA string. Given a substring $t = s_i s_{i+1} \dots s_j$ ~t = s_i s_{i+1} \dots s_j~, we can check in time $\mathcal O(NR)$ ~\mathcal O(NR)~ whether it contains sufficiently many of all nucleobases, by simply looping over each required nucleobase and counting how many occurrences there are of that nucleobase in the substring. Suppose we have a function $HasEnough(t)$ ~HasEnough(t)~ which returns $\text{true}$ ~\text{true}~ if the substring $t$ ~t~ contains sufficiently many of all nucleobases. Then we can find the optimal substring by looping over all substrings (there are $\binom{N+1}{2} = \mathcal O(N^2)$ ~\binom{N+1}{2} = \mathcal O(N^2)~ of them) and picking the shortest substring $t$ ~t~ such that $HasEnough(t) = \text{true}$ ~HasEnough(t) = \text{true}~. The time complexity of this algorithm is $\mathcal O(N^3 R)$ ~\mathcal O(N^3 R)~.

Subtask 2

If we do some precomputations we can actually compute the function $HasEnough(t)$ ~HasEnough(t)~ in time $\mathcal O(R)$ ~\mathcal O(R)~. The idea is to use prefix sums, which means that we start by computing, for each required nucleobase, how many times that nucleobase occurs in each prefix of the DNA, which allows us to quickly compute how many times that nucleobase occurs in any substring. If we let $count_c(m)$ ~count_c(m)~ denote the number of times the nucleobase $c$ ~c~ occurs in the prefix $s_1 \dots s_m$ ~s_1 \dots s_m~, then we can compute how many times $c$ ~c~ occurs in the substring $t = s_i s_{i+1} \dots s_j$ ~t = s_i s_{i+1} \dots s_j~ as $count_c(j)-count_c(i-1)$ ~count_c(j)-count_c(i-1)~, which only takes time $\mathcal O(1)$ ~\mathcal O(1)~ per type of nucleobase. We can therefore check if we have enough of all required nucleobases in time $\mathcal O(R)$ ~\mathcal O(R)~.

The precomputation can be done in time $\mathcal O(NR)$ ~\mathcal O(NR)~, and then we can use the $\mathcal O(R)$ ~\mathcal O(R)~ implementation of $HasEnough$ ~HasEnough~ to test all substrings in time $\mathcal O(N^2 R)$ ~\mathcal O(N^2 R)~.

Subtask 3

The key realization needed to solve subtask 3, where $N$ ~N~ may be as large as $100\,000$ ~100\,000~, is that we don't actually have to check all substrings. Instead, it would be sufficient to know, for each $i \in \{1, \dots, N\}$ ~i \in \{1, \dots, N\}~, what is the minimum $j$ ~j~ such that the substring $s_i \dots s_j$ ~s_i \dots s_j~ contains sufficiently many of all nucleobases. We can find $j$ ~j~ using binary search, because whenever $HasEnough(s_i \dots s_{j'})$ ~HasEnough(s_i \dots s_{j'})~ returns $\text{true}$ ~\text{true}~ then we get an upper bound $j \le j'$ ~j \le j'~, and whenever $HasEnough(s_i \dots s_{j'})$ ~HasEnough(s_i \dots s_{j'})~ returns $\text{false}$ ~\text{false}~ we get a lower bound $j \ge j'$ ~j \ge j'~. If we also compute $HasEnough$ ~HasEnough~ using prefix sums then the algorithm runs in time $\mathcal O(NR+NR \log N) = \mathcal O(NR \log N)$ .

Subtask 4

To solve the problem in time $\mathcal O(N)$ ~\mathcal O(N)~, we can use an approach based on two pointers. At all times we keep track of a substring $s_\ell \dots s_r$ ~s_\ell \dots s_r~. If $HasEnough(s_\ell \dots s_r) = \text{false}$ then the substring is too small, so we make it bigger by setting $r \gets r+1$ ~r \gets r+1~. If, on the other hand, $HasEnough(s_\ell \dots s_r) = \text{true}$ then we might be able to make the substring smaller, so we set $\ell \gets \ell+1$ ~\ell \gets \ell+1~. This way we only have to check $\mathcal O(N)$ ~\mathcal O(N)~ different substrings.

The problem is that precomputing all the prefix sums needed for $HasEnough$ ~HasEnough~ takes time $\Theta(NR)$ ~\Theta(NR)~, which is too slow. However, we can exploit the fact that when we increment $\ell$ ~\ell~ or $r$ ~r~, the substring $s_\ell \dots s_r$ ~s_\ell \dots s_r~ doesn't change very much, so we can keep track of:

How many nucleobases there are of each type in the substring $s_\ell \dots s_r$ ~s_\ell \dots s_r~.
How many of the required nucleobases have too few occurrences in the substring $s_\ell \dots s_r$ ~s_\ell \dots s_r~.

Whenever we increment the left pointer $\ell$ ~\ell~, we first decrement the number of nucleobases of type $s_\ell$ ~s_\ell~, and if the number of nucleobases of that type becomes too small we also increment the number of nucleobases with an insufficient number of occurrences.

Similarly, whenever we increment the right pointer $r$ ~r~, we first increment the number of nucleobases of type $s_{r+1}$ ~s_{r+1}~, and if the number of nucleobases of that type becomes exactly the number we need then we also decrement the number of nucleobases with an insufficient number of occurrences.

If we keep track of the number of nucleobases with too few occurrences, then we can implement $HasEnough(s_\ell \dots s_r)$ ~HasEnough(s_\ell \dots s_r)~ by simply checking if there are no nucleobases with too few occurrences, which is a constant time operation. Hence this algorithm runs in $\mathcal O(N)$ ~\mathcal O(N)~.

Comments

There are no comments at the moment.