Editorial for Baltic OI '18 P5 - Genetics


Remember to use this editorial only when stuck, and not to copy-paste code from it. Please be respectful to the problem author and editorialist.
Submitting an official solution before solving the problem yourself is a bannable offence.

This problem had a simple cubic solution, lots of potential for optimization, and a nice probabilistic quadratic solution. We'll describe the latter.

Let's start by solving the problem for the binary alphabet case. In this case, we can make the math slightly nicer: instead of having the letters A and C, we use the numbers 1 and -1. Computing the difference between two DNA strings a and b then becomes isomorphic to taking a dot product between two rows of a matrix A, i.e., a sum of A_{i,j} \cdot A_{i',j} for j = 1 \dots M (up to some constant factor and linear rescaling – instead of wanting sums to equal K, we want them to equal K' = M-2K).

Now, what we want to check is that the dot products are K' for all rows b = A_{i' \ne i}. Rather than doing this individually for all b, we will check the sum against every other row at once. There are a bunch of different approaches for this, but one nice way is to pick random values w_i for each row, and then check that the dot product against the combined sum \sum_{i'} w_{i'} A_{i'} equals (\sum_{i'} w_{i'}-w_i) \cdot K' (where i is the row we're checking).

In less abstract mathematical terms, and generalizing to larger alphabets, we pick random w_i for each row, and then for each column j and letter c \in \{A, C, G, T\}, compute D_c as the sum of w_i for each row which has the letter c in the j^\text{th} column. Then, we can check a row a_i against every other by computing the sum \sum_j \sum_{d \ne A_{i,j}} D_d. If this row is the answer, this equals K times the sum of the other rows' w's, since it differs in exactly K positions from each other row A_{i'}, and the sum thus includes w_{i'} K times.

If the row isn't the answer, it highly likely does not equal that. Changing any w of a row that differed in something else than K positions would result in a changed sum, and if we say do all the arithmetic modulo 2^{64} we have a probability of accidentally passing the test on the order of 2^{-50}. Hence, the solution passes with very close to 100\% probability.

Interesting notes:

  • Test data generation for this task was pretty tricky. For the naive solution not to pass, we want almost all distances between rows to be K, but all but one row should also have some other row with distance not K. The only matrices that the authors are aware of where distances are all equal are identity matrices (with N = M, K = 2), constant matrices (K = 0), and generalized Hadamard matrices (N = M, K = N(1-1/A)), and combinations of the three. Here A denotes the alphabet size. For a binary alphabet, Hadamard matrices are defined to be matrices of \{1, 0\} with N = M such that all rows differ in exactly N/2 positions. They are well studied, and a simple construction of them is a recursive one: start with the matrix:

    \displaystyle H = \begin{bmatrix}1\end{bmatrix}

    and repeatedly replace H by:

    \displaystyle \begin{bmatrix}H & H \\ H & H \oplus 1\end{bmatrix}

    where H \oplus 1 means H with all entries of H XORed by 1. This results in a Hadamard matrix of any size which is a power of 2.

    For alphabets of size 4 we can do something more complicated, with instead replacing H by:

    \displaystyle \begin{bmatrix}H & H & H & H \\ H & H \oplus 1 & H \oplus 2 & H \oplus 3 \\ H & H \oplus 2 & H \oplus 3 & H \oplus 1 \\ H & H \oplus 3 & H \oplus 1 & H \oplus 2\end{bmatrix}

    Proof that this works is left as an exercise for the reader (the construction is derived from the multiplication table for GF(4), for the mathematically inclined).

    Given matrices with all pairwise distances K, we can perturb the matrix in various ways to make the answer unique, e.g. duplicating rows or changing bits of the matrix.

    These complex constructions partly explain the constraints section – it is difficult to construct Hadamard matrices for sizes that are not powers of A for alphabet size A.

  • There is also a fun sub-cubic solution: with the formulation that values are in \{-1, 1\}, we can think of the problem simply as asking for a matrix product A \cdot A^T, from where we can check which values are K. Matrix multiplication can in theory be computed in \mathcal O(n^{2.373}) time, although in practice the algorithms that do this are very non-trivial to implement and have too high constant factors.

Comments

There are no comments at the moment.