Bitap algorithm

The bitap algorithm is an approximate or exact string matching algorithm that is one of the underlying algorithms of the Unix utility agrep (approximate grep). What differentiates it from other string matching algorithms is its use of bitwise operations, that make it run super fast.

The algorithm for exact string matching works by encoding the pattern to be matched in a bit matrix whose dimensions are the length of a long [31] by the length of the alphabet. So, let’s work with a small alphabet of size 3. A pattern such as “abcc” would be encoded as

1111111111111111111111111111110 # pattern/a
1111111111111111111111111111101 # pattern/b
1111111111111111111111111110011 # pattern/c

see more ยป

In the “a” row, there’s a 0 for every instance of “a” in the pattern, as indexed into the corresponding bit array. The same applies to “b” and “c.” The letter “c” occurs in the 3rd and 4th place of the pattern, so the 3rd and 4th lower order bits are set to 0. Everything else is a 1.

Let’s match this pattern to the text “aaabcca,” which requires a separate bit array for keeping track of how far we’ve gotten, if we need to start over, etc. More precisely, this bit array keeps track of all matches of the current iteration. We initialize this bit array to ~1 (not 1, long format)

1111111111111111111111111111110 # state

Now, we iterate through the characters of our text “aaabcca.” Here’s where the bitwise operations come in. We 1) select the first letter of the text and bitwise-or the corresponding pattern bit matrix with our state and 2) left shift the state over 1:

Text aaabcca
1111111111111111111111111111110 # pattern/a
1111111111111111111111111111110 # old state
1111111111111111111111111111100 # new state

Awesome, a 0 indicates a match and a 1 indicates a mis-match at the corresponding position. So, so far, we’ve got a match of “a” and are ready to see if the next letter matches too. Next:

Text aaabcca
1111111111111111111111111111110 # pattern/a
1111111111111111111111111111100 # old state
1111111111111111111111111111100 # new state

We’ve still just got a match at “a”…

Text aaabcca
1111111111111111111111111111110 # pattern/a
1111111111111111111111111111100 # old state
1111111111111111111111111111100 # new state

OK, I’m getting bored…

Text aaabcca
1111111111111111111111111111101 # pattern/b
1111111111111111111111111111100 # old state
1111111111111111111111111111010 # new state

Text aaabcca
1111111111111111111111111110011 # pattern/c
1111111111111111111111111111010 # old state
1111111111111111111111111110110 # new state

Text aaabcca
1111111111111111111111111110011 # pattern/c
1111111111111111111111111110110 # old state
1111111111111111111111111101110 # new state

Suppose the pattern is of length m. Each iteration checks the new state for whether the m+1th lowest bit is a 0. If it is, that means the pattern was matched to completion, like in this case! Awesome, we win.

Due to the limitation of this algorithm to strings of 31 characters or less, the runtime is O(1). The number of bitwise operations required is 2n, where n is the length of the text.