UP | HOME

Grepping File Streams

This is a little something I wrote for a job interview. This code is not production ready (I wrote it in a few hours, I'm certain I missed edge cases and valuable optimizations), but I thought it was a nice exercise and I found many of the explanations of Boyer-Moore lacking.

Search Algorithm

The Boyer-Moore alogrithm works by skipping forward in haystack at most length(needle) characters, then stepping backwards, comparing each character. This way you only have to process every length(haystack) / length(needle) bytes in the best case (when there are no matches or partial matches). Once you find a complete match, you skip forward a single character from the end of the match. In the absolute simplest case, it looks like this (in zero-indexed "pseudocode"):

needle = testing
haystack = testing
offset = length(testing) - 1 = 6
haystack[offset] = g
g = g
n = n
i = i
...

However, you cannot always skip length(needle) characters, you'll miss any matches that don't align, for example:

needle = testing
haystack = atesting
offset = legnth(needle) - 1 = 6
haystack[offset] = n
g = n

The algorithm uses two rules to determine how many characters to skip: the bad character rule, and the good suffix rule. You can take the max of the two, or just implement the bad character rule, as I did, for simplicity's sake. You might be able to infer how the bad character rule from the above example. You need to skip enough characters to get to the end of the needle before stepping backwards. In this case, that would be 1.

needle = testing
haystack = atesting
offset = legnth(needle) - 1 = 6
haystack[offset] = n
offset = offset + badChar(n)
haystack[offset] = g
g = g
n = n
i = i
...

What does "badChar" look like? It's the offset from the end of the needle, for any given character. For "testing", it looks like this:

3 5 4 3 2 1 0
t e s t i n g

If the character doesn't appear in the search term, you can skip the max, length(needle). Consider a slightly more complicated example:

needle = testing
haystack = atestingatesting
offset = length(needle) - 1 = 6
haystack[offset] = n
offset = offset + badChar(n)
haystack[offset] = g
g = g
n = n
i = i
...
t = t
offset = offset + 1
haystack[offset] = a
g = a
offset = offset + length(needle) - 1
haystack[offset] = n
g = n
badChar(n) = 1
offset = offset + 1
haystack[offset] = g
g = g
n = n
i = i
...
t = t

If this doesn't make sense, try squinting at the pseudocode on wikipedia 1.

File Buffer

This is simple enough if you're searching a fixed buffer, but what if you want to search arbitrarily large files that don't fit in memory? I implemented a circular buffer to "roll over" the file. Condering the above, the buffer class should support the following operations, loading new data and discarding old data as necessary:

  • Skip forward N characters ("offset = offset + N")
  • Reading characters back M characters from the offset ("haystack[offset - M]")

The buffer should be at least twice the size of the needle for this to work correctly.

import java.io.IOException;
import java.io.Reader;

public class CircularBuffer {

  private int bufferSize;
  private char[] buffer;
  private int head;
  private int tail;
  private int count;
  private Reader input;

  public CircularBuffer(int bufferSize, Reader input) {
    this.bufferSize = bufferSize;
    this.buffer = new char[bufferSize];
    this.head = 0;
    this.tail = 0;
    this.count = 0;
    this.input = input;
  }

  private int append() throws IOException {
    // Here we assume that the "working space" behind tail is half bufferSize
    // this should be made configurable
    int copySize = Math.min(bufferSize - count, bufferSize / 2);
    if (copySize == 0) {
      return 0;
    }

    if (head + copySize > bufferSize) {
      int copyLen1 = bufferSize - head;
      int copyLen2 = copySize - copyLen1;
      int read = 0;
      if (copyLen1 > 0) {
        read = input.read(buffer, head, copyLen1);
        if (read == -1) {
          return -1;
        }
      }
      int read2 = input.read(buffer, 0, copyLen2);
      if (read2 == -1) {
        if (read > 0) {
          return read;
        } else {
          return -1;
        }
      }
      head = read2;
      count += read + read2;
      return read + read2;
    } else {
      int read = input.read(buffer, head, copySize);
      head += read;
      count += read;
      return read;
    }
  }

  public char readBack(int offset) {
    if (tail - offset - 1 < 0) {
      return buffer[bufferSize - Math.abs(tail - offset) - 1];
    } else {
      return buffer[tail - offset - 1];
    }
  }

  public int skip(int offset) throws IOException {
    if (count < offset) {
      if (append() == -1) {
        return -1;
      } else {
        return skip(offset);
      }
    }

    if (tail + offset >= bufferSize) {
      tail = offset - (bufferSize - tail);
    } else {
      tail += offset;
    }

    count -= offset;

    return 0;
  }

}

This is fairly specific to the string search algorithm, but does succeed in abstracting away the fact that the file we're searching isn't entirely in memory, so we can implement the search algorithm as normal.

public static int findMatches(char[] needle, Reader haystack, int bufferSize) throws IOException {
  Preconditions.checkArgument(bufferSize >= needle.length * 2);
  if (needle.length == 0) {
    return 0;
  }

  int matches = 0;

  // Roll over input file with a circular buffer, so we don't have to load
  // large files into memory
  CircularBuffer buffer = new CircularBuffer(bufferSize, haystack);
  int[] badCharTable = buildBadCharTable(needle);

  // Step over the buffer needle.length bytes at a time, if we encounter a character
  // from the search string, look up its offset from the end of the search string and
  // skip that many bytes forward so we can try to backtrack from the end of the search
  // string.
  int n = needle.length;
  while (buffer.skip(n) != -1) {
    int i;
    for (i = 0; needle[needle.length - i - 1] == buffer.readBack(i); i++) {
      if (i == needle.length - 1) {
        matches++;
        i = 1;
        break;
      }
    }

    n = badCharTable[buffer.readBack(i)];
  }

  return matches;
}

// Build a table mapping characters to their offset from the end of the string
// Default = needle.length
private static int[] buildBadCharTable(char[] needle) {
  // This will be accessed frequently, use an array instead of a map for cache locality
  int[] table = new int[Character.MAX_VALUE + 1];

  // Missing characters will return needle.length
  for (int i = 0; i < table.length; i++) {
    table[i] = needle.length;
  }

  for (int i = 0; i < needle.length - 1; i++) {
    table[needle[i]] = needle.length - 1 - i;
  }

  return table;
}

Footnotes: