Editorial for Baltic OI '18 P2 - Martian DNA


Remember to use this editorial only when stuck, and not to copy-paste code from it. Please be respectful to the problem author and editorialist.
Submitting an official solution before solving the problem yourself is a bannable offence.

Subtask 1

Let s = s_1 s_2 \dots s_N denote the DNA string. Given a substring t = s_i s_{i+1} \dots s_j, we can check in time \mathcal O(NR) whether it contains sufficiently many of all nucleobases, by simply looping over each required nucleobase and counting how many occurrences there are of that nucleobase in the substring. Suppose we have a function HasEnough(t) which returns \text{true} if the substring t contains sufficiently many of all nucleobases. Then we can find the optimal substring by looping over all substrings (there are \binom{N+1}{2} = \mathcal O(N^2) of them) and picking the shortest substring t such that HasEnough(t) = \text{true}. The time complexity of this algorithm is \mathcal O(N^3 R).

Subtask 2

If we do some precomputations we can actually compute the function HasEnough(t) in time \mathcal O(R). The idea is to use prefix sums, which means that we start by computing, for each required nucleobase, how many times that nucleobase occurs in each prefix of the DNA, which allows us to quickly compute how many times that nucleobase occurs in any substring. If we let count_c(m) denote the number of times the nucleobase c occurs in the prefix s_1 \dots s_m, then we can compute how many times c occurs in the substring t = s_i s_{i+1} \dots s_j as count_c(j)-count_c(i-1), which only takes time \mathcal O(1) per type of nucleobase. We can therefore check if we have enough of all required nucleobases in time \mathcal O(R).

The precomputation can be done in time \mathcal O(NR), and then we can use the \mathcal O(R) implementation of HasEnough to test all substrings in time \mathcal O(N^2 R).

Subtask 3

The key realization needed to solve subtask 3, where N may be as large as 100\,000, is that we don't actually have to check all substrings. Instead, it would be sufficient to know, for each i \in \{1, \dots, N\}, what is the minimum j such that the substring s_i \dots s_j contains sufficiently many of all nucleobases. We can find j using binary search, because whenever HasEnough(s_i \dots s_{j'}) returns \text{true} then we get an upper bound j \le j', and whenever HasEnough(s_i \dots s_{j'}) returns \text{false} we get a lower bound j \ge j'. If we also compute HasEnough using prefix sums then the algorithm runs in time \mathcal O(NR+NR \log N) = \mathcal O(NR \log N).

Subtask 4

To solve the problem in time \mathcal O(N), we can use an approach based on two pointers. At all times we keep track of a substring s_\ell \dots s_r. If HasEnough(s_\ell \dots s_r) = \text{false} then the substring is too small, so we make it bigger by setting r \gets r+1. If, on the other hand, HasEnough(s_\ell \dots s_r) = \text{true} then we might be able to make the substring smaller, so we set \ell \gets \ell+1. This way we only have to check \mathcal O(N) different substrings.

The problem is that precomputing all the prefix sums needed for HasEnough takes time \Theta(NR), which is too slow. However, we can exploit the fact that when we increment \ell or r, the substring s_\ell \dots s_r doesn't change very much, so we can keep track of:

  1. How many nucleobases there are of each type in the substring s_\ell \dots s_r.
  2. How many of the required nucleobases have too few occurrences in the substring s_\ell \dots s_r.

Whenever we increment the left pointer \ell, we first decrement the number of nucleobases of type s_\ell, and if the number of nucleobases of that type becomes too small we also increment the number of nucleobases with an insufficient number of occurrences.

Similarly, whenever we increment the right pointer r, we first increment the number of nucleobases of type s_{r+1}, and if the number of nucleobases of that type becomes exactly the number we need then we also decrement the number of nucleobases with an insufficient number of occurrences.

If we keep track of the number of nucleobases with too few occurrences, then we can implement HasEnough(s_\ell \dots s_r) by simply checking if there are no nucleobases with too few occurrences, which is a constant time operation. Hence this algorithm runs in \mathcal O(N).


Comments

There are no comments at the moment.