Post-Hoc Index Design: From Regex to PEG

In September of last year, the team faced one of its first challenges
in responding to the growing user base of Hosted Chef. The on-call
engineer at the time received an alert that the queue used to store
Chef objects waiting to be indexed for search was backed up. Soon we
received alerts that calls to the search API were failing and we
posted an outage notice. Restoring search functionality required
us to compromise on search freshness and we began investigating the
incident’s root cause.

Chef’s search feature is powered by Lucene via Solr. Every
time an object is created or updated in Chef (e.g. a node, a
role, or a data bag) the object is converted to JSON and put on a
queue (we use RabbitMQ) where it will be picked up for
indexing. The indexer expands the JSON into a set of key/value
pairs (deep structure is flattened by concatenating keys with
underscore) and sends the data to Solr for indexing.

The search outage was the result of Lucene doing a merge of its index
that took minutes. Analysis of our Lucene index revealed that we had
well over 900,000 fields in our index. If you look hard, you can find
benchmarks on Lucene using 30 fields. Our system was using Lucene
wrong.

In addition to mapping each flattened key into a field, the
indexer was “expanding” all compound keys by inserting an ‘X’ to
allow a kind of wild card searching. For example, the JSON:

{"a": {"b": {"c": 1}}}

would have a flattened key of a_b_c which would get expanded
to:

X_b_c
a_X_c
a_b_X

This was a non-optimal approach to say the least — yet Lucene
actually made it work for quite some time.

We needed to change how Chef objects were indexed to avoid creating an
unbounded number of fields in the index. The solution we came up with
was to concatenate the flattened key value pairs with a delimiter (we
chose __=__) and put all of the concatenated pairs into a single
content field in the search index separated by whitespace. Besides
reducing the number of fields in the index, this approach eliminated
the need for the “X” expansion since Lucene wild card queries using
“*” could be used instead; this saved considerable time and memory
during the indexing preprocessing phase.

Experiments with simulated node data were run to validate the new
approach. Two 10,000 node save test scenarios were run: the first was
optimistic, in which two random keys were added to a node being saved,
and the second pessimistic, in which 50 random keys were added for
each save. The results showed a decrease in average commit time of 4x
and 23x, respectively.

This left us with two problems. We needed to reindex our data and all
queries would fail once we did; querying the new index required a new
query syntax.

The reindexing was solved with a well practiced deployment plan and a
small amount of scheduled maintenance. Dealing with the change in
the query syntax was somewhat more challenging.

Chef’s query syntax directly exposes the Lucene syntax and, as a
result, the initial index layout. Chef object keys map to fields in
the query. For example, to search for all nodes with the “web” role,
the user query would be role:web which will be passed along to
Lucene unchanged. Lucene will interpret this as a search for documents
with “web” in the “role” field. The equivalent query for the new
“no-fields” index looks like this: content:role__=__web. In order to
deploy the new indexing scheme, we needed to be able to convert user
queries into the new format.

We were under operational duress at the time and needed a
fast-as-possible solution. By examining system logs, we determined that
we could cover the types of queries actually in use with a few regular
expressions. If you’re ever looking for a definition of “quick and
dirty”, this is it. To help verify the solution, we wrote a test
harness
that populated an organization with a known set of objects
and randomly generated queries with Boolean operators in such a way
that the result set was computable. This helped increase our
confidence in our duct tape solution.

Having introduced a series of regular expressions into our query
handling code, we now had an additional problem. The regular
expression based solution was a stop-gap measure and didn’t support
many aspects of the Lucene query syntax. For example, we failed to
handle the ! and - operators (see Lucene syntax). The query
-role:blah was transformed into content:-role__=__blah when it
should have been -content:role__=__blah. We considered adding
more regular expression based replacement rules to the already
somewhat convoluted logic, but it became clear that we’d need a
fair bit more logic to be able to identify the leading dash and ignore
the embedded dash in a query like tags:b||(-role:web-server tags:a)
(which should be transformed to content:tags__=__b ||
(-content:role__=__web-server content:tags__=__a
).

In order to restore support for more of Lucene’s flexible syntax and
to end up with a maintainable query handling system that isn’t a
laundry list of regular expressions, we began work on code to parse
Lucene’s query syntax. The initial implementation used a Parsing
Expression Grammar (PEG) created with Treetop. Here’s the grammar
file
we used. The result was a query handling system that we have
full control over and that provides a layer of abstraction over
Lucene’s query syntax. A side-benefit of the full-parse approach is
that we can now validate user queries much earlier.

Since deploying the reorganized index, Lucene’s commit times have
remained short and consistent even in the face of significant growth
of the index. We’ve ported the Hosted Chef search API endpoint to
Erlang and ported the Lucene PEG to Erlang using neotoma. We have
been able to address a few further edge cases of the Lucene query
syntax in a structured and robust fashion (no regular expressions
required).

Author Seth Falcon