Subtask 1
Let
denote the DNA string. Given a substring
, we can check in time
whether it contains sufficiently many of all nucleobases, by simply looping over each required nucleobase and counting how many occurrences there are of that nucleobase in the substring. Suppose we have a function
which returns
if the substring
contains sufficiently many of all nucleobases. Then we can find the optimal substring by looping over all substrings (there are
of them) and picking the shortest substring
such that
. The time complexity of this algorithm is
.
Subtask 2
If we do some precomputations we can actually compute the function
in time
. The idea is to use prefix sums, which means that we start by computing, for each required nucleobase, how many times that nucleobase occurs in each prefix of the DNA, which allows us to quickly compute how many times that nucleobase occurs in any substring. If we let
denote the number of times the nucleobase
occurs in the prefix
, then we can compute how many times
occurs in the substring
as
, which only takes time
per type of nucleobase. We can therefore check if we have enough of all required nucleobases in time
.
The precomputation can be done in time
, and then we can use the
implementation of
to test all substrings in time
.
Subtask 3
The key realization needed to solve subtask 3, where
may be as large as
, is that we don't actually have to check all substrings. Instead, it would be sufficient to know, for each
, what is the minimum
such that the substring
contains sufficiently many of all nucleobases. We can find
using binary search, because whenever
returns
then we get an upper bound
, and whenever
returns
we get a lower bound
. If we also compute
using prefix sums then the algorithm runs in time
.
Subtask 4
To solve the problem in time
, we can use an approach based on two pointers. At all times we keep track of a substring
. If
then the substring is too small, so we make it bigger by setting
. If, on the other hand,
then we might be able to make the substring smaller, so we set
. This way we only have to check
different substrings.
The problem is that precomputing all the prefix sums needed for
takes time
, which is too slow. However, we can exploit the fact that when we increment
or
, the substring
doesn't change very much, so we can keep track of:
- How many nucleobases there are of each type in the substring
.
- How many of the required nucleobases have too few occurrences in the substring
.
Whenever we increment the left pointer
, we first decrement the number of nucleobases of type
, and if the number of nucleobases of that type becomes too small we also increment the number of nucleobases with an insufficient number of occurrences.
Similarly, whenever we increment the right pointer
, we first increment the number of nucleobases of type
, and if the number of nucleobases of that type becomes exactly the number we need then we also decrement the number of nucleobases with an insufficient number of occurrences.
If we keep track of the number of nucleobases with too few occurrences, then we can implement
by simply checking if there are no nucleobases with too few occurrences, which is a constant time operation. Hence this algorithm runs in
.
Comments