Editorial for Baltic OI '18 P2 - Martian DNA
Submitting an official solution before solving the problem yourself is a bannable offence.
Subtask 1
Let denote the DNA string. Given a substring , we can check in time whether it contains sufficiently many of all nucleobases, by simply looping over each required nucleobase and counting how many occurrences there are of that nucleobase in the substring. Suppose we have a function which returns if the substring contains sufficiently many of all nucleobases. Then we can find the optimal substring by looping over all substrings (there are of them) and picking the shortest substring such that . The time complexity of this algorithm is .
Subtask 2
If we do some precomputations we can actually compute the function in time . The idea is to use prefix sums, which means that we start by computing, for each required nucleobase, how many times that nucleobase occurs in each prefix of the DNA, which allows us to quickly compute how many times that nucleobase occurs in any substring. If we let denote the number of times the nucleobase occurs in the prefix , then we can compute how many times occurs in the substring as , which only takes time per type of nucleobase. We can therefore check if we have enough of all required nucleobases in time .
The precomputation can be done in time , and then we can use the implementation of to test all substrings in time .
Subtask 3
The key realization needed to solve subtask 3, where may be as large as , is that we don't actually have to check all substrings. Instead, it would be sufficient to know, for each , what is the minimum such that the substring contains sufficiently many of all nucleobases. We can find using binary search, because whenever returns then we get an upper bound , and whenever returns we get a lower bound . If we also compute using prefix sums then the algorithm runs in time .
Subtask 4
To solve the problem in time , we can use an approach based on two pointers. At all times we keep track of a substring . If then the substring is too small, so we make it bigger by setting . If, on the other hand, then we might be able to make the substring smaller, so we set . This way we only have to check different substrings.
The problem is that precomputing all the prefix sums needed for takes time , which is too slow. However, we can exploit the fact that when we increment or , the substring doesn't change very much, so we can keep track of:
- How many nucleobases there are of each type in the substring .
- How many of the required nucleobases have too few occurrences in the substring .
Whenever we increment the left pointer , we first decrement the number of nucleobases of type , and if the number of nucleobases of that type becomes too small we also increment the number of nucleobases with an insufficient number of occurrences.
Similarly, whenever we increment the right pointer , we first increment the number of nucleobases of type , and if the number of nucleobases of that type becomes exactly the number we need then we also decrement the number of nucleobases with an insufficient number of occurrences.
If we keep track of the number of nucleobases with too few occurrences, then we can implement by simply checking if there are no nucleobases with too few occurrences, which is a constant time operation. Hence this algorithm runs in .
Comments