Editorial for Baltic OI '18 P5 - Genetics
Submitting an official solution before solving the problem yourself is a bannable offence.
This problem had a simple cubic solution, lots of potential for optimization, and a nice probabilistic quadratic solution. We'll describe the latter.
Let's start by solving the problem for the binary alphabet case. In this case, we can make the math slightly nicer: instead of having the letters A
and C
, we use the numbers and
. Computing the difference between two DNA strings
and
then becomes isomorphic to taking a dot product between two rows of a matrix
, i.e., a sum of
for
(up to some constant factor and linear rescaling – instead of wanting sums to equal
, we want them to equal
).
Now, what we want to check is that the dot products are for all rows
. Rather than doing this individually for all
, we will check the sum against every other row at once. There are a bunch of different approaches for this, but one nice way is to pick random values
for each row, and then check that the dot product against the combined sum
equals
(where
is the row we're checking).
In less abstract mathematical terms, and generalizing to larger alphabets, we pick random for each row, and then for each column
and letter
, compute
as the sum of
for each row which has the letter
in the
column. Then, we can check a row
against every other by computing the sum
. If this row is the answer, this equals
times the sum of the other rows'
's, since it differs in exactly
positions from each other row
, and the sum thus includes
times.
If the row isn't the answer, it highly likely does not equal that. Changing any of a row that differed in something else than
positions would result in a changed sum, and if we say do all the arithmetic modulo
we have a probability of accidentally passing the test on the order of
. Hence, the solution passes with very close to
probability.
Interesting notes:
Test data generation for this task was pretty tricky. For the naive solution not to pass, we want almost all distances between rows to be
, but all but one row should also have some other row with distance not
. The only matrices that the authors are aware of where distances are all equal are identity matrices (with
,
), constant matrices (
), and generalized Hadamard matrices (
,
), and combinations of the three. Here
denotes the alphabet size. For a binary alphabet, Hadamard matrices are defined to be matrices of
with
such that all rows differ in exactly
positions. They are well studied, and a simple construction of them is a recursive one: start with the matrix:
and repeatedly replace
by:
where
means
with all entries of
XORed by
. This results in a Hadamard matrix of any size which is a power of
.
For alphabets of size
we can do something more complicated, with instead replacing
by:
Proof that this works is left as an exercise for the reader (the construction is derived from the multiplication table for
, for the mathematically inclined).
Given matrices with all pairwise distances
, we can perturb the matrix in various ways to make the answer unique, e.g. duplicating rows or changing bits of the matrix.
These complex constructions partly explain the constraints section – it is difficult to construct Hadamard matrices for sizes that are not powers of
for alphabet size
.
- There is also a fun sub-cubic solution: with the formulation that values are in
, we can think of the problem simply as asking for a matrix product
, from where we can check which values are
. Matrix multiplication can in theory be computed in
time, although in practice the algorithms that do this are very non-trivial to implement and have too high constant factors.
Comments