Editorial for COCI '06 Contest 5 #6 Dvaput
Submitting an official solution before solving the problem yourself is a bannable offence.
If there is a string of length
We'll use an algorithm based on the Rabin-Karp string searching algorithm. Each substring will be assigned an integer (finite, relatively small compared to the size of the string), the so called hash value. The hash value of a string contains information about the string it came from.
The function that maps strings to their hash values is called the hash function. The function must always assign the same hash value to the same string, but different strings may map to the same hash value (the hash function is not injective). However, if two hash values are different, then they couldn't have come from the same string.
The algorithm uses a special type of hash function, a rolling hash function, with the additional property that the hash value of the next substring can be efficiently calculated from the hash value of the previous substring. One such function is similar to a positional numeral system, with an arbitrary base (say
The hash value of the string
Similarly,
Notice the relation
Here's the algorithm for the subproblem:
calculate the hash value of the first substring of length K and insert the pair (hash value, substring) into a set
for every other substring:
calculate the new hash value
if the hash value is already in the set
check if there is a substring in the set equal to the current string (if so, this is the second occurrence)
else:
insert the pair (hash value, substring) into the set
The additional check in step 5 is needed since different strings can map to the same hash value (this is called a hash collision). The speedup over the naive algorithm is reducing the string comparison (slow) to an integer comparison (fast, step 4). Rarely do we have to do an actual string comparison (step 5).
All arithmetic operations are done modulo some integer
It is important that the number
The above algorithm does set
class in C++, the overall complexity is
We can do better. If we make
Reducing
Inserting into a hash table is
The complexity of a find operation depends on the number of elements in the relevant bucket. But if the hash function disperses the elements into buckets mostly uniformly, the complexity will be
The overall complexity of the solution is
Comments
" (with M as large as L, the collision ratio is about 0.5%)."
Can someone explain why the collision ratio is so small? Thanks!
This editorial's first sentence is blatantly wrong, it is actually 94.3467% of the text that focuses on the sub problem and not 94.5%, misleading information is very dangerous for readers.
Hi! Everything makes sense here except for the first paragraph.
"If there is a string of length K that occurs twice, then there is such a string of length K−1 and less. We can use binary search to find the largest such K."
How does the first sentence relate to the second one? And how do we find the largest K with just a binary search? An explaination would be helpful. Thanks!
When you are binary searching for the answer in a problem, you are typically searching for the largest or smallest value that meets some given condition. Assume without loss of generality that you are searching for the largest value that meets some condition. If
is the largest value for which the condition is true and for all integers less than 
, the condition is true, then you are allowed to binary search for 
. Note that this is actually a biconditional statement - if it is not true that for all integers less than 
, the condition is true, then it is not promised that binary search will yield the proper value for 
. By way of example demonstrating the latter, if the condition is true for 
and 
, and your binary search checks 
and decides the condition is not true, it will incorrectly conclude that the answer is strictly less than 
when the answer is actually 
.
Therefore, it is critical when binary searching for the answer that the condition switches from being always true to being always false exactly once as you sweep across all values in some direction.
To find the largest
, compute some value where it is guaranteed the condition is true, and some value where it is guaranteed the condition is false. Binary search by picking some value between these two values where you don't know whether the condition is true or not (typically done by taking the arithmetic mean of the largest value that you know fits the condition and the smallest value that you know doesn't fit the condition, and casting to an integer), and then check whether the condition is true for this value. After doing so, update the pertinent endpoint appropriately.