Editorial for Baltic OI '18 P2 - Martian DNA


Remember to use this editorial only when stuck, and not to copy-paste code from it. Please be respectful to the problem author and editorialist.
Submitting an official solution before solving the problem yourself is a bannable offence.

Subtask 1

Let s=s1s2sN denote the DNA string. Given a substring t=sisi+1sj, we can check in time O(NR) whether it contains sufficiently many of all nucleobases, by simply looping over each required nucleobase and counting how many occurrences there are of that nucleobase in the substring. Suppose we have a function HasEnough(t) which returns true if the substring t contains sufficiently many of all nucleobases. Then we can find the optimal substring by looping over all substrings (there are (N+12)=O(N2) of them) and picking the shortest substring t such that HasEnough(t)=true. The time complexity of this algorithm is O(N3R).

Subtask 2

If we do some precomputations we can actually compute the function HasEnough(t) in time O(R). The idea is to use prefix sums, which means that we start by computing, for each required nucleobase, how many times that nucleobase occurs in each prefix of the DNA, which allows us to quickly compute how many times that nucleobase occurs in any substring. If we let countc(m) denote the number of times the nucleobase c occurs in the prefix s1sm, then we can compute how many times c occurs in the substring t=sisi+1sj as countc(j)countc(i1), which only takes time O(1) per type of nucleobase. We can therefore check if we have enough of all required nucleobases in time O(R).

The precomputation can be done in time O(NR), and then we can use the O(R) implementation of HasEnough to test all substrings in time O(N2R).

Subtask 3

The key realization needed to solve subtask 3, where N may be as large as 100000, is that we don't actually have to check all substrings. Instead, it would be sufficient to know, for each i{1,,N}, what is the minimum j such that the substring sisj contains sufficiently many of all nucleobases. We can find j using binary search, because whenever HasEnough(sisj) returns true then we get an upper bound jj, and whenever HasEnough(sisj) returns false we get a lower bound jj. If we also compute HasEnough using prefix sums then the algorithm runs in time O(NR+NRlogN)=O(NRlogN).

Subtask 4

To solve the problem in time O(N), we can use an approach based on two pointers. At all times we keep track of a substring ssr. If HasEnough(ssr)=false then the substring is too small, so we make it bigger by setting rr+1. If, on the other hand, HasEnough(ssr)=true then we might be able to make the substring smaller, so we set +1. This way we only have to check O(N) different substrings.

The problem is that precomputing all the prefix sums needed for HasEnough takes time Θ(NR), which is too slow. However, we can exploit the fact that when we increment or r, the substring ssr doesn't change very much, so we can keep track of:

  1. How many nucleobases there are of each type in the substring ssr.
  2. How many of the required nucleobases have too few occurrences in the substring ssr.

Whenever we increment the left pointer , we first decrement the number of nucleobases of type s, and if the number of nucleobases of that type becomes too small we also increment the number of nucleobases with an insufficient number of occurrences.

Similarly, whenever we increment the right pointer r, we first increment the number of nucleobases of type sr+1, and if the number of nucleobases of that type becomes exactly the number we need then we also decrement the number of nucleobases with an insufficient number of occurrences.

If we keep track of the number of nucleobases with too few occurrences, then we can implement HasEnough(ssr) by simply checking if there are no nucleobases with too few occurrences, which is a constant time operation. Hence this algorithm runs in O(N).


Comments

There are no comments at the moment.