Bug 545252 - ConfigurableLineTracker: use the Aho-Corasick algorithm

The ConfigurableLineTracker had quadratic performance characteristics,
making line-end detection even only with the standard line terminators
extremely slow already on moderately sized sources.

Add an efficient MultiStringMatcher using the Aho-Corasick algorithm
to efficiently find any of a fixed set of strings in a text. The
algorithm is a well-described efficient general-purpose multi-string
matching algorithm based on a trie. Its main search loop is very
simple; the more tricky bits are in the construction of the auxiliary
links in the trie. The algorithm as described in 1975 returns all
matches including overlaps within a text, but it can be adapted fairly
easily to support the "leftmost longest match" semantics needed by
ConfigurableLineTracker.

The Aho-Corasick algorithm has a complexity of O(n + m + z),
where n is the text length, m the sum of the length of patterns,
and z the number of matches. For line end matching, this is roughly
O(n + L), where L is the number of lines.

Note that it can produce a quadratic number of matches on degenerate
input (patterns "x", "xx", "xxx", and so on, and a text consisting
only of "x"s).

Adapt ConfigurableLineTracker to use the new MultiStringMatcher. The
matcher provides a find() operation to find all (possibly overlapping)
occurrences of the pattern strings; this is the normal way the
Aho-Corasick algorithm works. An indexOf() method provides a way to
efficiently find only the leftmost longest match.

Since this new matcher is generally useful and fast, provide a builder
interface through which pattern strings can be added iteratively
before matching.

Change-Id: I3e66482cfca271248caaad79157f843bc2c819d7
Signed-off-by: Thomas Wolf <thomas.wolf@paranor.ch>
Also-by: Paul Pazderski <paul-eclipse@ppazderski.de>
8 files changed