Welcome back to having fun with Elasticsearch and Python. In the first part of this series, we learned the basics of getting Elasticsearch set up and running, and wrote the bare minimum we needed to index and search our Gmail metadata. In the second part, we extended the search and querying to cover the full text of the emails as well.

That theoretically got us most of what we wanted, but there’s still work to be done. Even for a toy, this isn’t doing quite what I want yet, so let’s see what we can do with another thirty minutes of work.

Improving the Query Tool

There are two big sticking points. First, right now we’re just passing raw search queries straight to Elasticsearch and relying on Lucene’s search syntax to take care of things. That’s kind of okay, but it means we can’t easily do something I really care about: saying that labels must match, while everything else can be best-effort. Second, we’re not printing out all the data I want; while we did a basic extension of the query tool last time, the output is still pretty ugly and annoying to read through.

Fixing the first part isn’t too bad. Remember earlier how I said that Elasticsearch provides a structured query language you can use? Let’s use it to solve our problem.

The structured query language is really just a JSON document describing what you’re looking for. The JSON document has a single top-level key, query, which in turn has sub-elements describing exactly what we’re trying to query and how. For example, if you wanted to look at all the documents in a given index, that’s just

{
  "query": {
    "match_all": {}
  }
}

This, of course, is a bit silly; you usually want to look for something. For example, to explicitly match all subjects that contain go, we could do something like

{
  "query": {
    "match": {
      "subject": "go"
    }
  }
}

match is a simple full-text query operator: it runs your query text through the same analyzer the field was indexed with, so, depending on the analyzer, “go” can also match variants like “going” and “goes”. Using the query DSL from Python is really simple. Give it a shot in your Python prompt by passing the query document as the body parameter to es.search(). E.g.,

es.search('mail', 'message', body={
    'query': {
        'match': {
            'subject': 'go',
        }
    }
})

Of course, what we want to do here is to leverage the real power of the DSL search syntax: combining queries in specific ways. For example, I mentioned earlier that I wanted labels to be required, while everything else stays best-effort.

Thankfully, Elasticsearch provides the bool query operator to allow just this:

{
  "query": {
    "bool": {
      "must": [{"match": {"labels": "camlistore"}}],
      "should": [{"match": {"subject": "go"}}]
    }
  }
}

bool takes a dictionary containing at least one of must, should, and must_not, each of which takes a list of matches or further search operators. In this case, we only care about must versus should: while our labels must match, the text in general should match if it can, but it’s okay if we don’t have an exact correspondence.
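
The third clause works the same way. Here’s a quick hypothetical sketch that keeps our query above but also excludes anything carrying a (made-up) spam label:

{
  "query": {
    "bool": {
      "must": [{"match": {"labels": "camlistore"}}],
      "should": [{"match": {"subject": "go"}}],
      "must_not": [{"match": {"labels": "spam"}}]
    }
  }
}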

So let’s put everything together here.

First, we unfortunately have to replicate the Gmail-style query axes that Lucene gave us for free. Doing this properly would require writing a legitimate (if tiny) parser. While that can be done pretty easily, it’s a bit out of scope for this series, so we’ll cheat: since we know our keys can’t contain colons, we’ll say that everything must be in the rigid format header:value. If you want to specify that multiple values can match, we’ll allow you to specify a given header multiple times. If a given token has no : in it, then we’ll assume it’s part of the full-body search. That leaves us with code that looks something like this:

import io
from collections import defaultdict

fulltext = io.StringIO()     # accumulates free-text terms for the body search
keywords = defaultdict(str)  # maps each header to its accumulated values
for token in query.split():
    idx = token.find(':')
    if 0 <= idx < len(token):
        # header:value token; file it under its header
        key, value = token.split(':', 1)
        keywords[key] += ' ' + value
    else:
        # no colon, so treat the token as part of the full-body search
        fulltext.write(' ' + token)
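
To make that concrete, feeding the loop a hypothetical query like labels:camlistore from:brett go faster would leave us with:

keywords            == {'labels': ' camlistore', 'from': ' brett'}
fulltext.getvalue() == ' go faster'

The leading spaces are leftovers from the naive concatenation; they’re harmless, since Elasticsearch strips whitespace when it analyzes the query text.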

That will allow us to search to, from, labels, and so on as per-field matches, while still keeping track of anything that isn’t a key/value field so we can use it for a fuzzy body search.

We also introduced a new class from Python’s standard library, defaultdict. defaultdict is one of those tools that, once you learn about it, you can’t put down. defaultdict takes a function that returns the default value to be used when you attempt to access a key that doesn’t exist. Since str() returns an empty string (''), we can avoid having to do a bunch of checks for existing keys, and instead simply concatenate directly onto the default value.
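
Here’s a minimal illustration, with made-up values:

from collections import defaultdict

headers = defaultdict(str)          # str() is the factory: missing keys start as ''
headers['labels'] += ' camlistore'  # no KeyError, even though 'labels' is brand new
headers['labels'] += ' work'
print(headers['labels'])            # prints ' camlistore work'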

Next, we just need to take the keywords dict we built above and put that into the must field, and then take the fulltext string we built up and use that for the body match. This is really straightforward to do in Python:

q = {
    'query': {
        'bool': {
            'must': [{'match': {k: v}} for k, v in keywords.items()]
        }
    }
}

fulltext = fulltext.getvalue()
if fulltext:
    q['query']['bool']['should'] = [{'match': {'contents': fulltext}}]

That’s it; we’ve got our must and our should, all combined into an Elasticsearch DSL search query.
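
To make that concrete: for a hypothetical query like labels:camlistore go fast, q would come out looking like this (the leading spaces are our tokenizer’s leftovers again, and are just as harmless here):

{
    'query': {
        'bool': {
            'must': [{'match': {'labels': ' camlistore'}}],
            'should': [{'match': {'contents': ' go fast'}}]
        }
    }
}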

There’s only one piece missing at this point: enhancing the query output itself. While we took a passing stab at this last time, we really want to do a bit better. In particular, since we’re indexing the entire message, it’d sure be nice if we showed part of the message body in the results.

Doing this properly gets complicated, but we can make use of two easy tricks in Python to get something that works well enough for most of our use cases. First, we can use the re module to write a regex that replaces all instances of annoying whitespace characters (carriage returns, newlines, and tabs) with single spaces. Such a regex is simply [\r\n\t], so that’s easy enough. Second, Python allows us to trivially truncate a string using slicing syntax, so s[:80] will return up to the first 80 characters of s.
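
A quick sketch of the two tricks together, using a made-up string:

import re

s = 'first line\nsecond\tline\r\nthird line'
flattened = re.sub(r'[\r\n\t]', ' ', s)  # 'first line second line  third line'
preview = flattened[:80]                 # slicing never raises, even on short strings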

Finally, we want to pull in one more special function from the Python standard library, textwrap.dedent. textwrap is a module that contains all sorts of useful utility functions for wrapping text. A shocker, I know. dedent is an incredibly handy function that removes the leading whitespace common to every line of a string. This is incredibly useful when writing strings inline in a Python file, because you can keep the string itself properly indented with the rest of the code, but have it print flush at the left margin. We can use this to make writing our template string a lot cleaner than last time.
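
For example, a tiny sketch:

import textwrap

message = '''\
    Subject: hello world
    From: someone@example.com
    '''
print(textwrap.dedent(message))
# prints, flush against the left margin:
# Subject: hello world
# From: someone@example.com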

Putting this all together, our display code would look like this:

es = elasticsearch.Elasticsearch()
matches = es.search('mail', 'message', body=q)
hits = matches['hits']['hits']
if not hits:
    click.echo('No matches found')
else:
    if raw_result:
        click.echo(json.dumps(matches, indent=4))
    for hit in hits:
        click.echo(textwrap.dedent('''\
            Subject: {}
            From: {}
            To: {}
            Content: {}...
            Path: {}
            '''.format(
            hit['_source']['subject'],
            hit['_source']['from'],
            hit['_source']['to'],
            re.sub(r'[\r\n\t]', ' ', hit['_source']['contents'])[:80],
            hit['_source']['path']
        )))

The only tricky part is that we’ve combined the regex substitution and the truncation together. Otherwise, this is a very straightforward modification of what we already had.

That’s it. Here’s the full version, including all the imports and the initial #! line, in case you don’t want to perform all of the edits by hand:

#!/usr/bin/env python
    
import io
import json
import re
import textwrap
from collections import defaultdict
    
import click
import elasticsearch
    
    
@click.command()
@click.argument('query', required=True)
@click.option('--raw-result/--no-raw-result', default=False)
def search(query, raw_result):
    fulltext = io.StringIO()
    keywords = defaultdict(str)
    for token in query.split():
        idx = token.find(':')
        if 0 <= idx < len(token):
            key, value = token.split(':', 1)
            keywords[key] += ' ' + value
        else:
            fulltext.write(' ' + token)
    
    q = {
        'query': {
            'bool': {
                'must': [{'match': {k: v}} for k, v in keywords.items()]
            }
        }
    }
    
    fulltext = fulltext.getvalue()
    if fulltext:
        q['query']['bool']['should'] = [{'match': {'contents': fulltext}}]
    
    es = elasticsearch.Elasticsearch()
    matches = es.search('mail', 'message', body=q)
    hits = matches['hits']['hits']
    if not hits:
        click.echo('No matches found')
    else:
        if raw_result:
            click.echo(json.dumps(matches, indent=4))
        for hit in hits:
            click.echo(textwrap.dedent('''\
                Subject: {}
                From: {}
                To: {}
                Content: {}...
                Path: {}
                '''.format(
                hit['_source']['subject'],
                hit['_source']['from'],
                hit['_source']['to'],
                re.sub(r'[\r\n\t]', ' ', hit['_source']['contents'])[:80],
                hit['_source']['path']
            )))

if __name__ == '__main__':
    search()
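
If you save this as, say, search.py (a hypothetical name) and mark it executable, an invocation looks like:

$ ./search.py 'labels:camlistore subject:go faster'

Add --raw-result to also dump the full JSON response before the formatted hits.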

That’s it. We now have a full command-line search utility that can look through all of our Gmail messages in mere moments, thanks to the power of Python and Elasticsearch. Easy as pie.

Of course, command-line tools are cool, but it’d be really nice if we had a more friendly, graphical interface to use our tool. Thankfully, as we’ll see next time, it’s incredibly easy to extend our tool and start making it into something a bit more friendly for the casual end-user. For now, we’ve demonstrated how we can really trivially get a lot done in very little time with Python and Elasticsearch.