Editorial for IOI '11 P6 - Parrots
Submitting an official solution before solving the problem yourself is a bannable offence.
For this task, contestants are given a message of length , which consists of a sequence of integers between and , inclusive, and they are asked to find an encoding and decoding scheme such that the encoded message are invariant up to permutations. The encoded message should be as concise as possible. The scores will be rewarded based on the ratio of the length of the encoded message to the length of the original message .
There are many ways to solve this task. It will be instructive to look at the first few subtasks.
Subtask 1
For this subtask, recall that the original message is just a sequence of two possible numbers, or . Assuming that the original message is indexed from to , one can instead choose to send the positions in the original message where all number 's are located. This technique uses at most integers to send the message.
Subtask 2
For subtask 2, since the numbers in the original message could be from to , they can be encoded using only bits. However, this subtask allows one to use up to bits per number in the encoded message . Hence, you can encode the position of each number using the higher bits (similar to the solution to subtask 1), and use the lower bits to encode the actual number in the original message , i.e.
where denotes the number in the original message.
Note that the encoded numbers are also ordered, i.e. , for . Therefore, when we would like to decode the message from the shuffled array , we can sort it to get . Then, to obtain the original message from the sorted should be pretty obvious.
This method also uses at most parrots, which is satisfactory for this subtask.
Subtask 3
Thinking about the first two subtasks should get us familiar with the approach. There are many roads to take from here. However, to formalize the settings, we shall think of a general representation first.
Because we shall keep integers in the encoded array, a natural representation would be to use a sorted sequence. Note that this representation is very robust against permutation; provided that the encoded array is sorted, by sorting a shuffled array , we can easily recover the encoded array , as in subtask 2.
We shall try to extend the approach in subtask 2. However, for subtasks 3–5, we do not have "extra" bits for keeping the position data. In these cases, we look at the original message as a sequence of binary digits (or bits), and , by representing each number using only bits. Thus, the original message of length will have bits.
This leads to a way to solve subtask 3. Note that the original message contains at most bits. For now, let be the bit of the original message, indexing from to . In this case, it is sufficient to take bits to represent all positions of every single bit , and use one last bit to represent the actual data. Hence, the encoding scheme looks like the following:
therefore, it takes bits to represent possible positions and another bit to keep the data. That is, for each bit , we shall encode it as . The length of the encoded message for each test case of this subtask is bytes.
Subtask 4
For this subtask, there are bit positions; and it requires bits just to represent the positions. Hence, there is no way to include the information of each bit of the original message in the encoding. If we recall the technique we use to solve subtask 1, we can improve the technique we used in subtask 3 by choosing to encode the positions of all bits. This approach again sends at most bytes.
Subtask 5
For this subtask, contestants are allowed to obtain some partial scores, depending on the ratio of the encoding message to the original message. In our point of view, to obtain nearly perfect scores like is relatively much easier than to obtain the full points. We consider the last points to be the reward to those who choose to implement more sophisticated methods of encoding that achieve better results than our expectations.
We discuss many potential solutions that might achieve some partial scores.
A approach
We are going to generalize the solution to the previous subtask case since the number of bits is more than . In this case, there are at most bit positions.
To deal with this issue, we partition the message into pairs of bits. We define the bit pair to be . Again, the bit pair should be indexed from to .
Note that there are possible values for each bit pair, which can be represented with integers from to . One possible method is to send the positions of the bit pair (which uses bits to encode) in the encoding data, and then the actual bit pair data will be encoded as the number of occurrences of such positions.
Using this approach, for each pair of bits, we may have to send at most bytes. Therefore, we send at most bytes. Contestants will achieve points on this subtask for implementing this method.
A approach
With a simple observation, we can reduce the size of the encoded messages by roughly a factor of two. Note that the best case (in terms of encoding length) for the previous encoding method is when the original message only contains zero bits, and the worst case is when the message contains only one bits. We shall try to get the average of these two cases.
Consider two encoding schemes, called ONE and ZERO. For a pair of bits , the ONE scheme sends copies of , but the ZERO scheme sends copies. Note that both schemes work, provided that the decoder knows which of the schemes the encoder actually uses.
For each pair of bits, the sum of the number of encoded bytes used by both schemes is always bytes. Considering all pairs, the sum is . Now, if we choose the scheme with the lower number of encoding data, we will need at most half of this number, i.e., at most bytes.
There are many ways to signal the decoder which encoding scheme the encoder uses without too much penalty on the ratio. For example, we can add copies of iff the ZERO scheme is used. This will make the ratio go up by . However, since , the worst ratio is at most . Contestants will achieve points on this subtask for implementing this method.
Better approaches
There are many other approaches that give a better compression ratio.
One byte to 4 numbers
Consider the number of encoded messages of length no greater than , only consisting of numbers , , , and ( is one such message). To analyze this, it might be easier to consider an equivalent scheme: encoded messages of length exactly , only consisting of numbers , , , , and BLANK. There are of them, which is enough to encode one byte of the original message.
Since the original message is at most bytes long, it is possible to allocate numbers per byte (, , , and for the first byte, for example), and therefore numbers from to are sufficient. This scheme yields a compression ratio of at most .
Optimal ratio
Theoretically, one can encode bytes of data to the minimum of bytes of encoded message without loss of information. There are different encoded messages using no more than bytes. This is just more than , the number of different -byte original messages. The compression ratio achieved is about . For smaller data, the ratio is even smaller.
This method yields a perfect score for this task. However, implementation of this scheme is comparatively difficult.
Comments