Editorial for CCC '20 S3 - Searching for Strings


Remember to use this editorial only when stuck, and not to copy-paste code from it. Please be respectful to the problem author and editorialist.
Submitting an official solution before solving the problem yourself is a bannable offence.

Author: wleung_bvg

For 3 of the 15 marks, we try can all |N|! permutations of the needle string (one way to do this is with recursion), and use any standard string searching algorithm to see if it is a substring of the haystack string.

Time Complexity: \mathcal O(|N|! \cdot |N| \cdot |H|)

For an additional 2 of the 15 marks, we can observe that if a substring of the haystack is a permutation of the needle, then it must contain the same frequency of each letter. For each of the \frac{N \cdot (N + 1)}{2} substrings, we can compute a frequency array of each character in the substring, and compare it to the frequency array of the needle. If they match, we will add the string to a set, in order to ignore duplicate substrings.

Time Complexity: \mathcal O(|N| + |H|^3)

For another 2 marks out of 15, we can make the observation that the frequency arrays of a substring of the haystack, and the needle only match if the length of the substring is equal to the length of the needle. There are only |H| - |N| + 1 such substrings. We can then use the same linear time method to compute the frequency array of each substring.

Time Complexity: \mathcal O(|N| + |H|^2)

For full marks, we first need to improve the computation of the frequency arrays. We can think of each of the substrings as a window of size |N| sliding from left to right over the haystack. When the window slides one unit to the right, the leftmost character is removed, while the character to the right of the rightmost character is added. We can decrement the frequency of the character that was removed, and increment the frequency of the character that was added in order to quickly compute the frequency array based on the previous window. We then need to determine if there is a substring in a set that is equal to another substring. We will use a similar technique to the sliding window in order to compute a rolling hash of the string. We can store this hash, instead of the full substring in order to quickly compare equality. One caveat is that a 32-bit hash may not be sufficient enough to avoid hash collisions. Tuples of multiple hashes (with prime moduli) should be used instead in order to minimize the chances of a hash collision.

Time Complexity: \mathcal O(|N| + |H|)

An alternate method to determining whether two substrings are equal, that does not have any chance of failure, is using a suffix array. Using a suffix array, we can quickly sort the suffixes of a string in lexicographical order in \mathcal O(|H| \cdot \log |H|) time. Afterwards, we can use Kasai's algorithm to compute the longest common prefix array, where LCP[i] stores the longest common prefix between the i^\text{th} and the (i+1)^\text{th} suffixes when sorted lexicographically. To compare whether two substrings are equal, we first identify their ranks in the sorted suffix array. We want to find the rank of the suffix in the haystack that starts from the first index in the substring (and extends to the last character in the haystack). Suppose the ranks of the two suffixes are i and j, where i \le j. If i = j, then the two substrings are trivially equal. Otherwise, the two substrings are equal if and only if the minimum element of LCP[k] where i \le k < j must be greater than or equal to the length of substring. We can quickly find the minimum element using either a segment tree, or a sparse table.

Time Complexity: \mathcal O(|N| + |H| \cdot \log |H|)


Comments

There are no comments at the moment.