Editorial for Baltic OI '18 P5 - Genetics - Olympiads Online Judge

Remember to use this editorial only when stuck, and not to copy-paste code from it. Please be respectful to the problem author and editorialist.
Submitting an official solution before solving the problem yourself is a bannable offence.

This problem had a simple cubic solution, lots of potential for optimization, and a nice probabilistic quadratic solution. We'll describe the latter.

Let's start by solving the problem for the binary alphabet case. In this case, we can make the math slightly nicer: instead of having the letters A and C, we use the numbers $1$ and $-1$. Computing the difference between two DNA strings $a$ and $b$ then becomes equivalent to taking a dot product between two rows of a matrix $A$, i.e., a sum of $A_{i,j} \cdot A_{i',j}$ for $j = 1 \dots M$ (up to a linear rescaling: instead of wanting the number of differing positions to equal $K$, we want the dot product to equal $K' = M-2K$).
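
As a quick sanity check of the rescaling above, here is a minimal sketch. The helper names (`hamming`, `to_signs`, `dot`) and the mapping A to $1$, C to $-1$ are illustrative assumptions, not part of the official solution.

```python
def hamming(a, b):
    """Number of positions where the two strings differ."""
    return sum(x != y for x, y in zip(a, b))


def to_signs(s):
    """Assumed encoding for the binary case: 'A' -> 1, anything else ('C') -> -1."""
    return [1 if ch == "A" else -1 for ch in s]


def dot(u, v):
    """Plain dot product of two equal-length integer vectors."""
    return sum(x * y for x, y in zip(u, v))


def check_identity(a, b):
    """True if dot product == M - 2 * (Hamming distance) for this pair."""
    M = len(a)
    return dot(to_signs(a), to_signs(b)) == M - 2 * hamming(a, b)
```

Each matching position contributes $+1$ to the dot product and each mismatch contributes $-1$, which is where $M - 2K$ comes from.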

Now, what we want to check is that the dot products are $K'$ for all rows $b = A_{i'}$ with $i' \ne i$. Rather than doing this individually for each $b$, we check row $i$ against all other rows at once. There are a bunch of different approaches for this, but one nice way is to pick a random value $w_i$ for each row, and then check that the dot product of $A_i$ against the combined sum $\sum_{i' \ne i} w_{i'} A_{i'}$ equals $(\sum_{i'} w_{i'}-w_i) \cdot K'$ (where $i$ is the row we're checking). In practice the sum over all rows is computed once, and row $i$'s own contribution $w_i \cdot M$ is subtracted from the dot product.
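
To spell out why the combined check works, here is a short derivation. It reads the combined sum as excluding row $i$ itself and uses only linearity of the dot product. If row $i$ is the answer, every $A_i \cdot A_{i'}$ with $i' \ne i$ equals $K'$, so:

$\displaystyle A_i \cdot \sum_{i' \ne i} w_{i'} A_{i'} = \sum_{i' \ne i} w_{i'} \left(A_i \cdot A_{i'}\right) = K' \sum_{i' \ne i} w_{i'} = \left(\sum_{i'} w_{i'} - w_i\right) \cdot K'$

If the sum is instead taken over all rows including $i$, the left-hand side picks up an extra $w_i \cdot (A_i \cdot A_i) = w_i \cdot M$, which has to be subtracted before comparing.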

In less abstract mathematical terms, and generalizing to larger alphabets, we pick a random $w_i$ for each row, and then for each column $j$ and letter $c \in \{A, C, G, T\}$, compute $D_{j,c}$ as the sum of $w_i$ over the rows which have the letter $c$ in the $j^\text{th}$ column. Then, we can check a row $A_i$ against every other row by computing the sum $\sum_j \sum_{d \ne A_{i,j}} D_{j,d}$. If this row is the answer, this equals $K$ times the sum of the other rows' $w$'s, since it differs in exactly $K$ positions from each other row $A_{i'}$, and the sum thus includes $w_{i'}$ exactly $K$ times.

If the row isn't the answer, the sum is very unlikely to equal that value. Changing the weight $w$ of any row whose distance to the checked row is not $K$ changes the sum, so if we do all the arithmetic modulo $2^{64}$, say, the probability of accidentally passing the test is on the order of $2^{-50}$. Hence, the solution passes with probability very close to $100\%$.
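
The whole randomized check can be sketched as follows. This is a minimal illustration under stated assumptions, not the official solution: the function name `find_special_row`, the `seed` parameter, and the use of Python's `random` module are all hypothetical choices. It uses the fact that the sum of the weights over the letters different from $A_{i,j}$ in column $j$ equals the total weight minus the weight sum for the letter $A_{i,j}$ itself.

```python
import random

MASK = (1 << 64) - 1  # all arithmetic is reduced modulo 2**64


def find_special_row(rows, K, seed=None):
    """Return the 0-based index of a row that (with high probability) differs
    from every other row in exactly K positions, or -1 if none passes."""
    rng = random.Random(seed)
    n, m = len(rows), len(rows[0])
    w = [rng.getrandbits(64) for _ in range(n)]  # random weight per row
    total = sum(w) & MASK

    # D[j][c] = sum of weights of the rows that have letter c in column j
    D = [dict.fromkeys("ACGT", 0) for _ in range(m)]
    for i, row in enumerate(rows):
        for j, ch in enumerate(row):
            D[j][ch] = (D[j][ch] + w[i]) & MASK

    for i, row in enumerate(rows):
        # Weighted count of mismatches between row i and all rows:
        # in column j, the mismatching rows carry weight total - D[j][row[j]].
        s = 0
        for j, ch in enumerate(row):
            s += total - D[j][ch]
        # Row i never mismatches itself, so the expected value is
        # K times the weight of all *other* rows.
        if s & MASK == (K * (total - w[i])) & MASK:
            return i
    return -1
```

Both precomputing `D` and checking all rows take $\mathcal O(NM)$ time, which is the quadratic bound mentioned at the start.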

Interesting notes:

Test data generation for this task was pretty tricky. For the naive solution not to pass, we want almost all distances between rows to be $K$, but all but one row should also have some other row with distance not $K$. The only matrices that the authors are aware of where distances are all equal are identity matrices (with $N = M$, $K = 2$), constant matrices ($K = 0$), and generalized Hadamard matrices ($N = M$, $K = N(1-1/A)$), and combinations of the three. Here $A$ denotes the alphabet size. For a binary alphabet, Hadamard matrices are defined to be matrices with entries in $\{1, 0\}$ and $N = M$ such that all rows differ in exactly $N/2$ positions. They are well studied, and a simple construction of them is a recursive one: start with the matrix:

$\displaystyle H = \begin{bmatrix}1\end{bmatrix}$

and repeatedly replace $H$ by:

$\displaystyle \begin{bmatrix}H & H \\ H & H \oplus 1\end{bmatrix}$

where $H \oplus 1$ means $H$ with all entries XORed by $1$. This results in a Hadamard matrix of any size which is a power of $2$.
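
The doubling step can be sketched in a few lines. The function name `hadamard01` and its parameter `k` (the number of doubling steps) are illustrative assumptions.

```python
def hadamard01(k):
    """Build a 2**k x 2**k matrix over {0, 1} by repeated doubling,
    starting from [[1]] and replacing H by [[H, H], [H, H xor 1]]."""
    H = [[1]]
    for _ in range(k):
        top = [row + row for row in H]                      # [H, H]
        bottom = [row + [x ^ 1 for x in row] for row in H]  # [H, H xor 1]
        H = top + bottom
    return H
```

For `k >= 1`, any two distinct rows of the result should differ in exactly half of the columns, which is easy to confirm by brute force for small `k`.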

For alphabets of size $4$ we can do something more complicated, instead replacing $H$ by:

$\displaystyle \begin{bmatrix}H & H & H & H \\ H & H \oplus 1 & H \oplus 2 & H \oplus 3 \\ H & H \oplus 2 & H \oplus 3 & H \oplus 1 \\ H & H \oplus 3 & H \oplus 1 & H \oplus 2\end{bmatrix}$

Proof that this works is left as an exercise for the reader (the construction is derived from the multiplication table for $GF(4)$, for the mathematically inclined).
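
A sketch of the four-letter construction, under two assumptions that the text does not state explicitly: letters are encoded as the integers $0$ to $3$, and $\oplus$ is bitwise XOR on those integers (which is addition in $GF(4)$). The names `GF4_MUL` and `hadamard4` are hypothetical.

```python
# Multiplication table of GF(4) on the elements 0, 1, 2, 3;
# entry [r][c] is the offset XORed into block (r, c) of the new matrix.
GF4_MUL = [
    [0, 0, 0, 0],
    [0, 1, 2, 3],
    [0, 2, 3, 1],
    [0, 3, 1, 2],
]


def hadamard4(k):
    """Build a 4**k x 4**k matrix over {0, 1, 2, 3} by repeatedly replacing H
    with the 4 x 4 block matrix whose block (r, c) is H xor GF4_MUL[r][c]."""
    H = [[0]]
    for _ in range(k):
        H = [
            [x ^ GF4_MUL[r][c] for c in range(4) for x in row]
            for r in range(4)
            for row in H
        ]
    return H


def to_dna(H):
    """Map the integers 0..3 to letters (assumed order: A, C, G, T)."""
    return ["".join("ACGT"[x] for x in row) for row in H]
```

For small `k`, a brute-force check confirms that any two distinct rows differ in exactly $N(1 - 1/4)$ positions, matching the formula for $K$ given earlier.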

Given matrices with all pairwise distances $K$, we can perturb the matrix in various ways to make the answer unique, e.g. duplicating rows or changing bits of the matrix.

These complex constructions partly explain the constraints section: it is difficult to construct Hadamard matrices for sizes that are not powers of $A$ for alphabet size $A$.

There is also a fun sub-cubic solution: with the formulation that values are in $\{-1, 1\}$, we can think of the problem simply as asking for a matrix product $A \cdot A^T$, from where we can check which values are $K'$ (that is, which rows are at distance $K$). Matrix multiplication can in theory be computed in $\mathcal O(n^{2.373})$ time, although in practice the algorithms that do this are very non-trivial to implement and have too high constant factors.
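
To make the matrix-product formulation concrete, here is a small sketch for the binary alphabet. It computes $A \cdot A^T$ with the naive cubic method, so it only illustrates the reformulation and not the sub-cubic running time; the function name `rows_at_distance_k` is a hypothetical choice.

```python
def rows_at_distance_k(rows, K):
    """Binary alphabet only (assumed letters 'A' and 'C').
    Returns the indices of rows whose distance to every other row is K,
    read off the Gram matrix A * A^T of the +1/-1 encoding."""
    A = [[1 if ch == "A" else -1 for ch in r] for r in rows]
    n, m = len(A), len(A[0])
    target = m - 2 * K  # dot product corresponding to distance K

    # Naive O(n^2 * m) product; a fast matrix multiplication would go here.
    gram = [
        [sum(x * y for x, y in zip(A[i], A[p])) for p in range(n)]
        for i in range(n)
    ]
    return [
        i for i in range(n)
        if all(gram[i][p] == target for p in range(n) if p != i)
    ]
```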
