Editorial for COCI '06 Contest 5 #6 Dvaput


Remember to use this editorial only when stuck, and not to copy-paste code from it. Please be respectful to the problem author and editorialist.
Submitting an official solution before solving the problem yourself is a bannable offence.

If there is a string of length K that occurs twice, then there is such a string of length K-1 and less. We can use binary search to find the largest such K. The remaining 94.5\% of this text focuses on solving this subproblem.
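As a sketch (in C++, using a hypothetical predicate occursTwice(K) that reports whether some substring of length K appears at least twice; the monotonicity above is exactly what makes this search valid), the binary search could look like:

```cpp
#include <functional>

// Find the largest K such that some substring of length K occurs twice.
// occursTwice must be monotone: if it holds for K, it holds for all K' < K.
int largestRepeatedLength(int n, const std::function<bool(int)>& occursTwice) {
    int lo = 0, hi = n - 1;          // a length-0 substring trivially "repeats"
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2; // round up so the loop always makes progress
        if (occursTwice(mid))
            lo = mid;                // mid works; try a longer length
        else
            hi = mid - 1;            // mid fails; try a shorter length
    }
    return lo;
}
```

The predicate is evaluated \mathcal O(\log N) times, so whatever the cost of one check is, the search multiplies it by a logarithmic factor.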

We'll use an algorithm based on the Rabin-Karp string searching algorithm. Each substring will be assigned an integer (bounded, and small compared to the number of possible substrings), its so-called hash value. The hash value of a string contains information about the string it came from.

The function that maps strings to their hash values is called the hash function. The function must always assign the same hash value to the same string, but different strings may map to the same hash value (the hash function is not injective). However, if two hash values are different, then they couldn't have come from the same string.

The algorithm uses a special type of hash function, a rolling hash function, with the additional property that the hash value of the next substring can be efficiently calculated from the hash value of the previous substring. One such function is similar to a positional numeral system, with an arbitrary base (say 26, the size of our alphabet).

The hash value of the string abc is hash(abc) = a \cdot 26^2 + b \cdot 26 + c.

Similarly, hash(bcd) = b \cdot 26^2 + c \cdot 26 + d.

Notice the relation hash(bcd) = 26 \cdot hash(abc) - a \cdot 26^3 + d. When generalized for strings of any length, the relation allows us to calculate the hash value of the next substring in \mathcal O(1) complexity, instead of calculating it anew in \mathcal O(K).
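A minimal sketch of this rolling update (base 26, an arbitrary modulus; function and variable names are my own, and the modular arithmetic discussed below is already folded in):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hash every substring of length K of s in O(N) total time, using
// hash(next) = BASE * hash(prev) - s[i-K] * BASE^K + s[i]  (mod M).
std::vector<int64_t> rollingHashes(const std::string& s, int K) {
    const int64_t BASE = 26, M = 1000000007;
    int n = (int)s.size();
    std::vector<int64_t> out;
    if (K > n || K <= 0) return out;

    int64_t pow = 1;                 // BASE^K mod M
    for (int i = 0; i < K; i++) pow = pow * BASE % M;

    int64_t h = 0;                   // hash of the first window
    for (int i = 0; i < K; i++)
        h = (h * BASE + (s[i] - 'a')) % M;
    out.push_back(h);

    for (int i = K; i < n; i++) {    // slide the window one step right
        h = (h * BASE - (s[i - K] - 'a') * pow % M + (s[i] - 'a')) % M;
        h = (h % M + M) % M;         // keep the value in [0, M)
        out.push_back(h);
    }
    return out;
}
```

For example, on abcab with K = 3 this produces the hashes of abc, bca and cab, each new one in \mathcal O(1) from the previous.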

Here's the algorithm for the subproblem:

1. calculate the hash value of the first substring of length K and insert the pair (hash value, substring) into a set
2. for every other substring:
3.   calculate the new hash value
4.   if the hash value is already in the set:
5.     check if there is a substring in the set equal to the current string (if so, this is the second occurrence)
6.   else:
7.     insert the pair (hash value, substring) into the set

The additional check in step 5 is needed since different strings can map to the same hash value (this is called a hash collision). The speedup over the naive algorithm comes from reducing the string comparison (slow) to an integer comparison (fast, step 4). Rarely do we have to do an actual string comparison (step 5).
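A sketch of the whole subproblem check in C++ (names are my own; as one deviation from the pseudocode above, it stores start indices instead of substring copies to keep memory linear, and compares the actual substrings only on a hash match):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Does any substring of length K occur at least twice in s?
bool occursTwice(const std::string& s, int K) {
    const int64_t BASE = 26, M = 1000000007;
    int n = (int)s.size();
    if (K < 1 || K > n) return false;

    int64_t pow = 1, h = 0;
    for (int i = 0; i < K; i++) {
        pow = pow * BASE % M;                    // BASE^K mod M
        h = (h * BASE + (s[i] - 'a')) % M;       // hash of the first window
    }

    // hash value -> start indices of the substrings with that hash
    std::unordered_map<int64_t, std::vector<int>> seen;
    seen[h].push_back(0);

    for (int i = K; i < n; i++) {
        h = ((h * BASE - (s[i - K] - 'a') * pow % M + (s[i] - 'a')) % M + M) % M;
        int start = i - K + 1;
        auto it = seen.find(h);
        if (it != seen.end()) {
            // hash match: verify with a real comparison (collisions possible)
            for (int j : it->second)
                if (s.compare(j, K, s, start, K) == 0)
                    return true;
            it->second.push_back(start);
        } else {
            seen[h].push_back(start);
        }
    }
    return false;
}
```

Plugging this predicate into the binary search over K solves the full problem.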

All arithmetic operations are done modulo some integer M, so that all hash values are between 0 and M-1. The larger the M, the fewer collisions occur. In fact, if M is large enough (say 10^{15}), collisions become rare enough that we can omit the safety string comparison and have the algorithm work correctly with high probability (albeit not 100\%).

It is important that the number M be relatively prime with the base of our hash function (26), so that the hash values disperse uniformly between 0 and M-1. If M is prime, then it is relatively prime with every smaller base, so a prime number is always a good choice.
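One practical caveat (not from the editorial, but worth noting): with M around 10^{15}, the product of two residues no longer fits in a 64-bit integer, so multiplications have to be routed through 128-bit arithmetic. A sketch, assuming a compiler that provides the __int128 extension (GCC and Clang do):

```cpp
#include <cstdint>

// Compute a * b mod m without 64-bit overflow, assuming 0 <= a, b < m < 2^63.
// __int128 is a GCC/Clang extension; MSVC would need a different approach.
uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)((unsigned __int128)a * b % m);
}
```

With a modulus that fits comfortably in 32 bits (such as 10^9 + 7), plain 64-bit multiplication suffices and this helper is unnecessary.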

The above algorithm does \mathcal O(N) arithmetic operations and \mathcal O(N-K) set operations. We still need to implement a set structure. If we use the set class in C++, the overall complexity is \mathcal O(N \log^2 N).

We can do better. If we make M smaller (say 200\,003), we can build what is called a hash table. The table consists of M buckets, with bucket i being a linked list containing all substrings that map to the hash value i.

Reducing M implies having the additional check in step 5, because the number of collisions is considerable (with M as large as L, the collision ratio is about 0.5\%). Increasing the number of buckets reduces the number of collisions, but the ratio is already small enough that decreasing it doesn't affect the overall efficiency of the solution.

Inserting into a hash table is \mathcal O(1) since we can always put the new element at the front of the linked list.

The complexity of a find operation depends on the number of elements in the relevant bucket. But if the hash function disperses the elements into buckets mostly uniformly, the complexity will be \mathcal O(1) on average.
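A sketch of such a table (names are my own; like the earlier sketch, it stores start positions rather than substring copies, and it assumes the caller supplies hash values already reduced modulo the bucket count):

```cpp
#include <string>
#include <vector>

// A fixed-size chained hash table of substring start positions.
// Bucket h holds the starts of all length-K substrings whose hash value is h.
struct SubstringTable {
    static const int M = 200003;          // number of buckets (a prime)
    const std::string& s;                 // the underlying string
    int K;                                // substring length
    std::vector<std::vector<int>> bucket; // chains of start indices

    SubstringTable(const std::string& s, int K) : s(s), K(K), bucket(M) {}

    // Insert start position p whose substring hashes to h: O(1),
    // the new element simply goes to the end of the chain.
    void insert(int h, int p) { bucket[h].push_back(p); }

    // Is some stored substring equal to s[p, p+K)? O(1) on average,
    // assuming the hash spreads substrings evenly over the buckets.
    bool contains(int h, int p) const {
        for (int q : bucket[h])
            if (s.compare(q, K, s, p, K) == 0) return true;
        return false;
    }
};
```

Each bucket here is a std::vector rather than a literal linked list; the chaining idea is the same, and appending to the vector keeps insertion \mathcal O(1) amortized.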

The overall complexity of the solution is \mathcal O(N \log N).


Comments


  • 0
    jhinezeal123  commented on Aug. 31, 2024, 12:32 p.m.

    " (with M as large as L, the collision ratio is about 0.5%)."

    Can someone explain why the collision ratio is so small? Thanks!


  • -1
    LeonLiur  commented on Jan. 22, 2022, 9:23 p.m.

    This editorial's first sentence is blatantly wrong, it is actually 94.3467% of the text that focuses on the sub problem and not 94.5%, misleading information is very dangerous for readers.


  • 0
    Badmode  commented on Aug. 27, 2021, 3:16 a.m.

    Hi! Everything makes sense here except for the first paragraph.

    "If there is a string of length K that occurs twice, then there is such a string of length K−1 and less. We can use binary search to find the largest such K."

    How does the first sentence relate to the second one? And how do we find the largest K with just a binary search? An explanation would be helpful. Thanks!


    • 1
      chika  commented on Aug. 27, 2021, 4:33 a.m.

      How does the first sentence relate to the second one?

      When you are binary searching for the answer in a problem, you are typically searching for the largest or smallest value that meets some given condition. Assume without loss of generality that you are searching for the largest value that meets some condition. If X is the largest value for which the condition is true and for all integers less than X, the condition is true, then you are allowed to binary search for X. Note that this is actually a biconditional statement - if it is not true that for all integers less than X, the condition is true, then it is not promised that binary search will yield the proper value for X. By way of example demonstrating the latter, if the condition is true for 1 and 3, and your binary search checks 2 and decides the condition is not true, it will incorrectly conclude that the answer is strictly less than 2 when the answer is actually 3.

      Therefore, it is critical when binary searching for the answer that the condition switches from being always true to being always false exactly once as you sweep across all values in some direction.

      And how do we find the largest K with just a binary search?

      To find the largest K, compute some value where it is guaranteed the condition is true, and some value where it is guaranteed the condition is false. Binary search by picking some value between these two values where you don't know whether the condition is true or not (typically done by taking the arithmetic mean of the largest value that you know fits the condition and the smallest value that you know doesn't fit the condition, and casting to an integer), and then check whether the condition is true for this value. After doing so, update the pertinent endpoint appropriately.