Science  People  Locations  Timeline
Index: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Home > Rabin-Karp string search algorithm


 Contents
The Rabin-Karp algorithm is a string searching algorithm that seeks a pattern, i.e. a substring, within a text by using hashing. It is not widely used for single pattern matching, but is of considerable theoretical importance and is very effective for multiple pattern matching. Historically, it predates the very popular Knuth-Morris-Pratt algorithm. For text of length n and pattern of length m, its average and best case running time is O(n), but the (highly unlikely) worst case performance is O(nm), which is one of the reasons why it is not widely used. However, it has the unique advantage of being able to find any one of k strings or less in O(n) time on average, regardless of the size of k.

One of the simplest practical applications of Rabin-Karp is in detection of plagiarism. Say, for example, that a student is writing an English paper on Moby Dick. A cunning professor might locate a variety of source material on Moby Dick and automatically extract a list of all sentences in those materials. Then, Rabin-Karp can rapidly search through a particular paper for any instance of any of the sentences from the source materials. To avoid easily thwarting the system with minor changes, it can be made to ignore details such as case and punctuation by removing these first. Because the number of strings we're searching for, k, is very large, other string searching algorithms are impractical.

1 Shifting substrings search and competing algorithms

The basic problem the algorithm addresses is finding a fixed substring of length m, called the pattern, within a text of length n; for example, finding the string "sun" in the sentence "Hello sunshine in this vale of tears." One very simple algorithm for this task just looks for the substring at all possible positions:

The following is wikicode , a proposed pseudocode for use in many articles:

1 function NaiveSearch(string s[1..n], string sub[1..m]) 2 for i from 1 to n 3 for j from 1 to m 4 if s[i+j-1] ≠ sub[j] 5 jump to next iteration of outer loop 6 return i 7 return not found

This algorithm works well in many practical cases, but fails spectacularly on certain examples, such as searching for a string of 10,000 "a"s followed by a "b" in a string of 10 million "a"s, in which case it exhibits its worst-case Ω(mn) time.

Knuth-Morris-Pratt algorithm is an elegant elaboration on this naive algorithm that uses precomputed data to skip forward not by 1 character, but by as many as possible for the search to succeed, effectively decreasing the number of times we iterate through the outer loop. However, Rabin-Karp focuses instead on speeding up lines 3-6, as we'll discuss in the next section.

2 Use of hashing for shifting substring search

Rather than pursuing more sophisticated skipping, the Rabin-Karp algorithm seeks to speed up the testing of equality of the pattern to the substrings in the text by using a hash function. A hash function is a function which converts every string into a numeric value, called its hash value; for example, we might have hash("hello")=5. Rabin-Karp exploits the fact that if two strings are equal, their hash values are also equal. Thus, it would seem all we have to do is compute the hash value of the substring we're searching for, and then look for a substring with the same hash value.

However, there are two problems with this. First, because there are so many different strings, to keep the hash values small we have to assign some strings the same number. This means that if the hash values match, the strings might not match; we have to verify that they do, which can take a long time for long substrings. Luckily, a good hash function promises us that on most reasonable inputs, this won't happen too often, which keeps the average search time good.

The algorithm is as shown:

1 function RabinKarp(string s[1..n], string sub[1..m]) 2 hsub := hash(sub[1..m]) 3 hs := hash(s[1..m]) 4 for i from 1 to n 5 if hs = hsub 6 if s[i..i+m-1] = sub 7 return i 8 hs := hash(s[i+1..i+m]) 9 return not found

Lines 2, 3, and 6 each require Ω(m) time. However, lines 2 and 3 are only executed once, and line 6 is only executed if the hash values match, which is unlikely to happen more than a few times. Line 5 is executed n times, but only requires constant time. We now come to the second problem: line 8.

If we naively recompute the hash value for the substring s[i+1..i+m], this would require Ω(m) time, and since this is done on each loop, the algorithm would require Ω(mn) time, the same as the most naive algorithms. The trick to solving this is to note that the variable hs already contains the hash value of s[i..i+m-1]. If we can use this to compute the next hash value in constant time, then our problem will be solved.

We do this using what is called a rolling hash . A rolling hash is a hash function specially designed to enable this operation. One simple example is adding up the values of each character in the substring. Then, we can use this formula to compute the next hash value in constant time:

s[i+1..i+m] = s[i..i+m-1] - s[i] + s[i+m]

This simple function works, but will result in statement 6 being executed more often than other more sophisticated rolling hash functions such as those discussed in the next section.

Notice that if we're very unlucky, or have a very bad hash function such as a constant function, line 6 might very well be executed n times, on every iteration of the loop. Because it requires Ω(m) time, the whole algorithm then takes a worst-case Ω(mn) time.



Read more »

Non User