Having Fun: Python and Elasticsearch, Part 3

Welcome back to having fun with Elasticsearch and Python. In the first part of this series, we learned the basics of setting up and running Elasticsearch, and wrote just enough code to do basic indexing and searching of Gmail metadata. In the second part, we extended the indexing and querying to cover the full text of the emails as well.

That theoretically got us most of what we wanted, but there’s still work to be done. Even for a toy, this isn’t doing quite what I want yet, so let’s see what we can do with another thirty minutes of work.

Improving the Query Tool

There are two big sticking points. First, right now, we’re just passing the raw search queries to Elasticsearch and relying on Lucene’s search syntax to take care of things. That’s kind of okay, but it means we can’t easily do something I really care about, which is saying that labels must match, while everything else can be best-effort. Second, we’re not printing out all the data I want; while we did a basic extension of the query tool last time, that data is still kind of disgusting and annoying to read through.

Fixing the first part isn’t too bad. Remember earlier how I said that Elasticsearch provides a structured query language you can use? Let’s use it to solve our problem.

The structured query language is really just a JSON document describing what you’re looking for. The JSON document has a single top-level key, query, which then has sub-elements describing exactly what we’re trying to query and how. For example, if you wanted to look at all the documents in a given index, that’s just

{
  "query": {
    "match_all": {}
  }
}

This, of course, is a bit silly; you usually want to look for something. For example, to explicitly match all subjects that contain go, we could do something like

{
  "query": {
    "match": {
      "subject": "go"
    }
  }
}

match is a simple query operator that does analyzed, rather than strictly exact, matching; depending on how the field is analyzed, “go” can also match “going”, “goes”, and the like. Using the query DSL from Python is really simple. Give it a shot in your Python prompt by passing it as the body parameter to es.search(). E.g.,

es.search('mail', 'message', body={
    'query': {
        'match': {
            'subject': 'go',
        }
    }
})

Of course, what we really want here is to leverage the full power of the DSL search syntax: the ability to combine queries in specific ways. For example, I mentioned earlier that I wanted labels to be required, while everything else is requested but optional.

Thankfully, Elasticsearch provides the bool query operator to allow just this:

{
  "query": {
    "bool": {
      "must": [{"match": {"labels": "camlistore"}}],
      "should": [{"match": {"subject": "go"}}]
    }
  }
}

bool takes a dictionary containing at least one of must, should, and must_not, each of which takes a list of matches or other further search operators. In this case, we only care about must versus should: while our labels must match, the text in general should match if it can, but it’s okay if we don’t have an exact correspondence.
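
To see all three clauses side by side, here’s a quick sketch: labels and subject are the fields we’ve been working with, and the “newsletters” label is purely illustrative.

{
  "query": {
    "bool": {
      "must": [{"match": {"labels": "camlistore"}}],
      "should": [{"match": {"subject": "go"}}],
      "must_not": [{"match": {"labels": "newsletters"}}]
    }
  }
}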

So let’s put everything together here.

First, we unfortunately have to replicate the Gmail-style query axes that Lucene gave us for free. Doing this properly would require writing a legitimate (if tiny) parser. While that could be done pretty easily, it’s a bit out of scope for this series, so we’ll cheat: since we know our keys can’t contain colons, we’ll say that everything must be in the rigid format header:value. If you want to specify that multiple things can match, you can simply repeat a given header. If a given token has no : in it, we’ll assume it’s part of the full-body search. That leaves us with code that looks something like this:

import io
from collections import defaultdict
ft = io.StringIO()
kw = defaultdict(str)
for token in query.split():
    idx = token.find(':')
    if 0 <= idx < len(token):
        key, value = token.split(':', 1)
        kw[key] += ' ' + value
    else:
        ft.write(' ' + token)

That will let us search to, from, labels, and so on as required matches, while still keeping track of anything that isn’t a key:value pair so we can use it for a fuzzy body search.

We also introduced a new class from Python’s standard library, defaultdict. defaultdict is one of those tools that, once you learn about it, you can’t put down. defaultdict takes a factory function that returns the default value to use when you access a key that doesn’t exist. Since str() returns an empty string (''), we can avoid a bunch of checks for existing keys, and instead simply concatenate directly onto the default value.
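
If you haven’t run into defaultdict before, here’s a quick illustration (the key is just made up for the example):

from collections import defaultdict

kw = defaultdict(str)            # missing keys default to str(), i.e. ''
kw['labels'] += ' camlistore'    # no KeyError: '' + ' camlistore'
kw['labels'] += ' go'
print kw['labels']               # prints ' camlistore go'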

Next, we just need to take the kw (keyword) dict we built above and put that into the must field, and then take the ft (full-text) string we built up and use that for the body match. This is really straightforward to do in Python:

q = {
    'query': {
        'bool': {
            'must': [{'match': {k: v}} for k, v in kw.viewitems()]
        }
    }
}

ft = ft.getvalue()
if ft:
    q['query']['bool']['should'] = [{'match': {'contents': ft}}]

That’s it; we’ve got our must and our should, all combined into an Elasticsearch DSL search query.
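
To make that concrete, suppose the query string were labels:camlistore deployment notes (an invented example). The loop above would collect {'labels': ' camlistore'} in kw and ' deployment notes' in ft, so q would come out looking like:

{
    'query': {
        'bool': {
            'must': [{'match': {'labels': ' camlistore'}}],
            'should': [{'match': {'contents': ' deployment notes'}}]
        }
    }
}

(The stray leading spaces are harmless; Elasticsearch’s analyzer throws them away when it tokenizes the text.)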

There’s only one piece missing at this point: enhancing the query output itself. While we took a passing stab at this last time, we really want to do a bit better. In particular, since we’re indexing the entire message, it’d sure be nice if we showed at least part of the message body.

Doing this properly gets complicated, but we can make use of two easy tricks in Python to get something that works well enough for most of our use cases. First, we can use the re module to replace all instances of annoying whitespace characters (tabs, newlines, and carriage returns) with single spaces; the regex for that is simply [\r\n\t], so that’s easy enough. Second, Python lets us trivially truncate a string with slice syntax, so s[:80] returns up to the first 80 characters of s.
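
As a quick illustration of those two tricks together (the sample string is made up):

import re

s = 'Kickoff is at 10:30pm.\r\n\tPass-out is at <early Saturday morning>.'
print re.sub(r'[\r\n\t]', ' ', s)[:80]
# each \r, \n, and \t becomes a space, and the result is clipped to at
# most 80 characters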

Finally, we want to pull in one more special function from the Python standard library, textwrap.dedent. textwrap is a module that contains all sorts of useful utility functions for wrapping text. A shocker, I know. dedent is a handy little function that strips whatever leading whitespace is common to every line of a string. This is incredibly useful when writing strings inline in a Python file, because you can keep the string properly indented with the rest of the code, but have it print to the screen flush at the left margin. We can use this to make writing our template string a lot cleaner than last time.
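
Here’s roughly what that buys us, using a couple of headers from the email we looked at last time:

import textwrap

print textwrap.dedent('''\
    Subject: Celebrating 21 years
    From: No Longer a Student
    ''')
# both lines print flush against the left margin, even though the string
# is indented to line up with the surrounding code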

Putting this all together, our display code would look like this:

es = elasticsearch.Elasticsearch()
matches = es.search('mail', 'message', body=q)
hits = matches['hits']['hits']
if not hits:
    click.echo('No matches found')
else:
    if raw_result:
        click.echo(json.dumps(matches, indent=4))
    for hit in hits:
        click.echo(textwrap.dedent('''\
            Subject: {}
            From: {}
            To: {}
            Content: {}...
            Path: {}
            '''.format(
            hit['_source']['subject'],
            hit['_source']['from'],
            hit['_source']['to'],
            re.sub(r'[\r\n\t]', ' ', hit['_source']['contents'])[:80],
            hit['_source']['path']
        )))

The only tricky part is that we’ve combined the regex substitution and the truncation together. Otherwise, this is a very straightforward modification of what we already had.

That’s it. Here’s the full version, including all the imports and the initial #! line, in case you don’t want to perform all of the edits by hand:

#!/usr/bin/env python

import io
import json
import re
import textwrap
from collections import defaultdict

import click
import elasticsearch


@click.command()
@click.argument('query', required=True)
@click.option('--raw-result/--no-raw-result', default=False)
def search(query, raw_result):
    ft = io.StringIO()
    kw = defaultdict(str)
    for token in query.split():
        idx = token.find(':')
        if 0 <= idx < len(token):
            key, value = token.split(':', 1)
            kw[key] += ' ' + value
        else:
            ft.write(' ' + token)

    q = {
        'query': {
            'bool': {
                'must': [{'match': {k: v}} for k, v in kw.viewitems()]
            }
        }
    }

    ft = ft.getvalue()
    if ft:
        q['query']['bool']['should'] = [{'match': {'contents': ft}}]

    es = elasticsearch.Elasticsearch()
    matches = es.search('mail', 'message', body=q)
    hits = matches['hits']['hits']
    if not hits:
        click.echo('No matches found')
    else:
        if raw_result:
            click.echo(json.dumps(matches, indent=4))
        for hit in hits:
            click.echo(textwrap.dedent('''\
                Subject: {}
                From: {}
                To: {}
                Content: {}...
                Path: {}
                '''.format(
                hit['_source']['subject'],
                hit['_source']['from'],
                hit['_source']['to'],
                re.sub(r'[\r\n\t]', ' ', hit['_source']['contents'])[:80],
                hit['_source']['path']
            )))

if __name__ == '__main__':
    search()

That’s it. We now have a full command-line search utility that can look through all of our Gmail messages in mere moments, thanks to the power of Python and Elasticsearch. Easy as pie.

Of course, command-line tools are cool, but it’d be really nice if we had a more friendly, graphical interface to use our tool. Thankfully, as we’ll see next time, it’s incredibly easy to extend our tool and start making it into something a bit more friendly for the casual end-user. For now, we’ve demonstrated how we can really trivially get a lot done in very little time with Python and Elasticsearch.

C++ Programming and Brain RAM

I have a tricky relationship with C++. There is a narrow subset of the language that, when properly used, I find to be a strict improvement over C. Specifically, careful use of namespaces, RAII, some pieces of the STL (such as std::string and std::unique_ptr), and a very small bit of light templating can actually simplify a lot of common C patterns, while making it a lot harder to shoot yourself in the foot via macros and memory leaks.

That said, C++ faces a choking combination of wanting to simultaneously maintain backwards compatibility and also extend the language to be more powerful and flexible. I’ve recently been reading the final draft of Effective Modern C++ by Scott Meyers. It is an excellently written book, and it does a superb job covering what new features have been introduced in the last couple of C++ versions, and how to make your code base properly take advantage of them.

And a lot of the new stuff in C++ is awesome. I had a chance to start taking advantage of the new features when I was working at Fog Creek on [MESSAGE REDACTED], and I was actually really pleasantly surprised by how much of an improvement the additions made in my day-to-day coding. In fact, I was so pleasantly surprised that I actually gave a whole presentation on how C++ didn’t have to be awful.

But reading through Scott’s book the last few days has also reminded me why I was somewhat relieved to effectively abandon C++ when I joined Knewton.

Take move semantics, one of the hallmark features of C++11/14. Previously, in C++, you always had to either pass around pointers to objects, or copy the objects themselves.2 This is a problem, because raw pointers aren’t easily amenable to RAII-style automatic cleanup, yet copies are very expensive. In order to get both your memory safety and your speed, you end up having a lot of code where you semantically want something like

std::vector<something> generate_somethings();

...

std::vector<something> foo = generate_somethings();

but, for performance reasons, you have to actually write something closer to

void generate_somethings(std::vector<something> &empty_vector);

...

std::vector<something> foo;
generate_somethings(foo);

However rapidly C++ developers acclimate to this pattern, I think we can safely agree that it’s much less clear than the far-less-efficient first variant. You can’t even tell, simply by looking at the call site, that foo is mutated. You can infer it, certainly, but you have to actually find the prototype to be sure.

In theory, move semantics (also known as rvalue references) allow C++ to explicitly acknowledge when a value is “dead” in a specific context, which allows for much greater efficiency and clarity. The reason it’s called “move semantics” comes from the idea that you can move the contents of the old object to the new one, rather than copying them, since you know that the old object can no longer be referenced. For example, if you’re moving a std::string from one variable to another, you could simply assign the underlying char * and length, rather than making a full-blown copy of the underlying buffer, even if neither string is a const. The original can’t be accessed anymore, so it’s fine if you mutate memory that the original owned to your heart’s content.

In practice, though, things aren’t that simple. Scott Meyers helpfully notes that

std::move doesn’t move anything, for example […]. Move operations aren’t always cheaper than copying; when they are, they’re not always as cheap as you’d expect; and they’re not always called in a context where moving is valid. The construct type&& doesn’t always represent an rvalue reference.1

Got that?

In fact, Scott’s point is obvious if you understand how C++ gets realized under the hood. For example, when returning from a function, anything on the stack that’s returned via rvalue reference is going to have to be copied, so you’re only going to win if the object has enough data on the heap that moving actually saves copies. But understanding that requires that you already bring a lot of C++ knowledge to the table.

This is a fractal issue with modern C++. Congratulations, you get type inference via auto! auto type inference works via template type inference, so make sure you understand that first.3 This comes up in especially fun situations: Foo &&bar is always an rvalue reference, but auto &&bar is a universal reference, which may turn out not to be one.

Or to quit picking on rvalue references, how about special member generation—those freebies like default constructors and copy constructors that the compiler will write on your behalf if you don’t write them? There are two new ones in C++11, the move constructor and the move assignment operator, and the compiler will write them for you!…unless you write one of the two, in which case, unlike the other special members, you have to write both. But at least that’ll be a compile-time issue, whereas, if you have an explicit copy constructor, you won’t get either of the move-related special members autogenerated; you’ll have to write both yourself if you want them. This isn’t purely academic: if you add a copy constructor and forget this fact, you may get a chance to enjoy an “unexplained” slowdown in your code when your silently generated move constructor vanishes.

To be clear again, these rules are emphatically not arbitrary. They make complete sense if you take a step back and think about why the standard would have mandated things work this way. But it’s not immediately transparent; you have to think.

And this is why I find it so amazingly hard to write code productively in C++. My brain has a limited amount of working memory. When I’m writing in a language with a simple runtime and syntax, such as C, Go, Python, Smalltalk, or (to an arguably slightly lesser extent) OCaml, then I need to dedicate relatively little space in my brain to the nuances of the language. I can spend nearly all of my working space on solving the actual problem at hand.

When I write in C++, by contrast, I find that I’m constantly having to dedicate a large amount of thought to what the underlying C++ is actually going to do. Is this a template, a macro, or an inline function? Was that the right choice? How many copies of this templated class am I actually generating in the compiled code? If I switch this container to have const members, is that going to speed things up, or slow them down? Is this class used in a DLL for some silly reason? If so, how can I make this change without altering the vtable? Is this function supposed to be called from C? Do I even need to care in this instance?

It’s not that I can’t do this. I did it for years, and, as I noted, I was voluntarily, intentionally working in C++ for the last couple of months I was at Fog Creek. Sometimes, at least for now, C++ is unquestionably the right tool, and that project was one of those times. But as happy as I am that C++ is getting a lot of love, and that working with it is increasingly less painful, I can’t help but feel that the amount of baggage it’s dragging around at this point means that I have to spend far too much of my brain on the language, not the problem at hand. My brain RAM ends up being all about C++; most of the problem gets swapped to disk.

C++ still has a place in my toolbox, but I’m very, very glad that improvements elsewhere in the ecosystem are ever shrinking the tasks that require it. I’m optimistic that languages like Rust may shrink its uses even further, and that I may live to see when the answer to “when is C++ the best tool for the job?” can finally genuinely be “never.” In the meantime, if you have to write C++, go buy Effective Modern C++.


  1. Effective Modern C++, p. 355. 

  2. I’m oversimplifying slightly, mostly by omitting things like std::auto_ptr and boost::scoped_ptr, but they don’t really change my point. 

  3. Did you know templates had type inference? No? Me neither. I somehow was able to work in C++ for several years without learning this fact, and am now scared to look back at my old code and figure out how dumb some of it is. 

Having Fun: Python and Elasticsearch, Part 2

In my earlier post on Elasticsearch and Python, we did a huge pile of work: we learned a bit about how to use Elasticsearch, we learned how to use Gmvault to back up all of our Gmail messages with full metadata, we learned how to index the metadata, and we learned how to query the data naïvely. While that’s all well and good, what we really want to do is to index the whole text of each email. That’s what we’re going to do today.

It turns out that nearly all of the steps involved in doing this don’t involve Elasticsearch; they involve parsing emails. So let’s take a quick time-out to talk a little bit about emails.

A Little Bit About Emails

It’s easy to think of emails as simple text documents. And they kind of are, to a point. But there’s a lot of nuance to the exact format, and while Python has libraries that will help us deal with them, we’re going to need to be aware of what’s going on to get useful data out.

To start, let’s take a look again at the raw email source we looked at yesterday a bit more completely:

$ gzcat /Users/bp/src/vaults/personal/db/2005-09/11814848380322622.eml.gz
X-Gmail-Received: 887d27e7d009160966b15a5d86b579679
Delivered-To: benjamin.pollack@gmail.com
Received: by 10.36.96.7 with SMTP id t7cs86717nzb;
        Wed, 14 Sep 2005 19:35:45 -0700 (PDT)
Received: by 10.70.13.4 with SMTP id 4mr150611wxm;
        Wed, 14 Sep 2005 19:35:45 -0700 (PDT)
Return-Path: <probablyadefunctaddressbutjustincase@duke.edu>
[...more of the same...]
Message-ID: <4328DDFA.4050903@duke.edu>
Date: Wed, 14 Sep 2005 22:35:38 -0400
From: No Longer a Student <probablyadefunctaddressbutjustincase@duke.edu>
Reply-To: probablyadefunctaddressbutjustincase@duke.edu
User-Agent: Mozilla Thunderbird 1.0.6 (Macintosh/20050716)
MIME-Version: 1.0
To: benjamin.pollack@gmail.com
Subject: Celebrating 21 years
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

    It's my birthday, Blue Devils!  At least it will be in a few days, so I
am opening my apartment, porch, and environs this Friday, Sept. 16th
to all of you for some celebration.  Come dressed up, come drunk, or
whatever...just come.  There will be plenty to drink, and for those of
you that are a wine connessiouers, Cheerwine is the closest you'll get.
Kickoff is at 10:30pm.  Pass-out is at <early Saturday morning>.  If you
have some drink preferences, let me know and we'll see what we can
snag.  In addition to that, let me know if you think you can make it,
even if only for a while, so we can judge the amount of booze that we'll
be stocking.

All you really need to know:
Friday, Sept. 16th
10:30pm-late
alcohol

This is about the simplest form of an email you can have. At the top, we have a bunch of metadata about the email itself. Notably, while these look kind of like key/value pairs, we can see that at least some headers (Received, for example) are allowed to appear more than once. That said, we’d like to try to merge this with the existing metadata we’ve got if we can.

There’s also the, you know, actual content of the email. In this particular case, that’s clearly just a blob of plain text, but let’s be honest: we know from experience that some emails have a lot of other things—attachments, HTML, and so on.1 Emails that have formatting or attachments are called multipart messages: each chunk corresponds to a different piece of the email, like an attachment, or a formatted version, or an encryption signature. For a toy tool, we don’t really need to do something special with all of the attachments and whatnot; we just want to grab as much as we can from the email itself. Since, in real life, even multipart emails have a plain text part, it’ll be good enough if we can just grab that.

Let’s make that the goal: we do care about the header values, and we’ll extract any plain text in the email, but the rest can wait for another day.

Parsing Emails in Python

So we know what we want to do. How do we do it in Python?

Well, we’ll need two things: we’ll need to decompress the .eml.gz files, and we’ll need to parse the emails. Thankfully, both pieces are pretty easy.

Python has a gzip module that trivially handles reading compressed data. Basically, wherever you’d otherwise write open(path_name, mode), you instead write gzip.open(path_name, mode). That’s really all there is to that part.
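
In other words, something like this (the path is purely illustrative):

import gzip

# reading a gzip-compressed file looks just like reading a normal one;
# only the open() call changes
with gzip.open('/path/to/an/email.eml.gz', 'r') as fp:
    first_line = fp.readline()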

For parsing the emails, Python provides a built-in library, email, which does this tolerably well. For one thing, it allows us to easily grab out all of those header values without figuring out how to parse them. (We’ll see shortly that it also provides us a good way to get the raw text part of an email, but let’s hold that thought for a moment.)

There’s unfortunately one extra little thing: emails are not in a reliable encoding. Sure, they might claim they’re in something consistent, like UTF-7, but you know they aren’t. This is a bit of a problem, because Elasticsearch is going to want to be handed nothing but pure, clean Unicode text.

For the purposes of a toy, it’ll be enough if we just make a good-faith effort to grab useful information out, even if it’s lossy. Since most emails are sent in Latin-1 or ASCII encoding, we can be really, really lazy about this by introducing a utility function that tries to decode strings as Latin-1, and just replaces anything it doesn’t recognize with the Unicode unknown character symbol, �.

def unicodish(s):
    return s.decode('latin-1', errors='replace')

With that in mind, we can start playing with these modules immediately. In your Python REPL, try something like this:

import email
import gzip

with gzip.open('/path/to/an/email.eml.gz', 'r') as fp:
    message = email.message_from_file(fp)
print '%r' % (message.items(),)

This looks awesome. The call to email.message_from_file() gives us back a Message object, and all we have to do to get all the header values is to call message.items().

All that’s left for this part is to merge the email headers with the Gmail metadata, so let’s do that first. While headers can repeat, we don’t actually care: the fields we actually query, like From and To, don’t, and if we accidentally end up with only one Received field when we should have fifteen, we don’t care. This is, after all, something we’re hacking together for fun, and I’ve never in my life cared to query the Received field. This gives us an idea for a way to quickly handle things: we can just combine the existing headers with our current metadata.

So, ultimately, we’re really just changing our original metadata loading code from

with open(path.join(base, name), 'r') as fp:
    meta = json.load(fp)

to

with gzip.open(path.join(base, name.rsplit('.', 1)[0] + '.eml.gz'), 'r') as fp:
    message = email.message_from_file(fp)
meta = {unicodish(k).lower(): unicodish(v) for k, v in message.items()}
with open(path.join(base, name), 'r') as fp:
    meta.update(json.load(fp))

Not bad for a huge feature upgrade.

Note that this prioritizes Gmail metadata over email headers, which we want: if some email has an extra, non-standard Label header, we don’t want it to trample our Gmail labels. We’re also normalizing the header keys, making them all lowercase, so we don’t have to deal with email clients that secretly write from and to instead of From and To.
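
If the ordering seems backwards, here’s a tiny sketch (with made-up values) of why building the dict from the headers first and then calling update() with the Gmail metadata gives the Gmail side the last word:

meta = {'labels': 'X-Label-From-Some-Odd-Client'}   # from the email headers
meta.update({'labels': ['Registrations']})          # from the .meta file
print meta['labels']                                # prints ['Registrations']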

That’s it for headers. Give it a shot: try running your modified loader script, and then querying with the --raw-result flag we added to the query tool last time. We’re not yet printing the new data in a useful, user-friendly way, but it’s already searchable and useful.

In fact, you know what? Sure, this is a toy, but it honestly isn’t hard to make this print out at least a little more useful data. Just having From and To would be helpful, so let’s quickly tweak the tool to do that by altering the final click.echo() call:

#!/usr/bin/env python

import json

import click
import elasticsearch


@click.command()
@click.argument('query', required=True)
@click.option('--raw-result/--no-raw-result', default=False)
def search(query, raw_result):
    es = elasticsearch.Elasticsearch()
    matches = es.search('mail', q=query)
    hits = matches['hits']['hits']
    if not hits:
        click.echo('No matches found')
    else:
        if raw_result:
            click.echo(json.dumps(matches, indent=4))
        for hit in hits:
            # This next line and the two after it are the only changes
            click.echo('To:{}\nFrom:{}\nSubject:{}\nPath: {}\n\n'.format(
                hit['_source']['to'],
                hit['_source']['from'],
                hit['_source']['subject'],
                hit['_source']['path']
            ))

if __name__ == '__main__':
    search()

Bingo, done. Not bad for a three-line edit.

For the body itself, we need to do something a little bit more complicated. As we discussed earlier, emails can be simple or multipart, and Python’s email module unfortunately exposes that difference to the user. For simple emails, we’ll just grab the body, which will likely be plain text. For multipart, we’ll grab any parts that are plain text, smash them all together, and use that for the body of the email.

So let’s give it a shot. I’m going to pull out the io module so we can access StringIO for efficient string building, but you could also just do straight-up string concatenation here and get something that would perform just fine. Our body reader then is going to look something like this:

content = io.StringIO()
if message.is_multipart():
    for part in message.get_payload():
        if part.get_content_type() == 'text/plain':
            content.write(unicodish(part.get_payload()))
else:
    content.write(unicodish(message.get_payload()))

This code simply looks for anything labeled plain text and builds a giant blob of it, handling the plain case and the multipart case differently.2

Well, if you think about it, we’ve done all the actual parsing we need to do. That just leaves Elasticsearch integration. We want to combine this with the metadata parsing we already had, so our final code for indexing will look like:

def parse_and_store(es, root, email_path):
    gm_id = path.split(email_path)[-1]
    with gzip.open(email_path + '.eml.gz', 'r') as fp:
        message = email.message_from_file(fp)
    meta = {unicodish(k).lower(): unicodish(v) for k, v in message.items()}
    with open(email_path + '.meta', 'r') as fp:
        meta.update(json.load(fp))

    content = io.StringIO()
    if message.is_multipart():
        for part in message.get_payload():
            if part.get_content_type() == 'text/plain':
                content.write(unicodish(part.get_payload()))
    else:
        content.write(unicodish(message.get_payload()))

    meta['account'] = path.split(root)[-1]
    meta['path'] = email_path

    body = meta.copy()
    body['contents'] = content.getvalue()
    es.index(index='mail', doc_type='message', id=gm_id, body=body)

That’s it. On my system, this can index every last one of the tens of thousands of emails I’ve got in only a minute or so, and the old query tool we wrote can easily search through all of them in tens of milliseconds.

Making a Real Script

Last time, we used click to make our little one-off query tool have a nice UI. Let’s do that for the data loader, too. All we really need to do is make that ad-hoc parse_and_store function be a real main function. The result will look like this:

#!/usr/bin/env python

import email
import json
import gzip
import io
import os
from os import path

import click
import elasticsearch


def unicodish(s):
    return s.decode('latin-1', errors='replace')


def parse_and_store(es, root, email_path):
    gm_id = path.split(email_path)[-1]

    with gzip.open(email_path + '.eml.gz', 'r') as fp:
        message = email.message_from_file(fp)
    meta = {unicodish(k).lower(): unicodish(v) for k, v in message.items()}
    with open(email_path + '.meta', 'r') as fp:
        meta.update(json.load(fp))

    content = io.StringIO()
    if message.is_multipart():
        for part in message.get_payload():
            if part.get_content_type() == 'text/plain':
                content.write(unicodish(part.get_payload()))
    else:
        content.write(unicodish(message.get_payload()))

    meta['account'] = path.split(root)[-1]
    meta['path'] = email_path

    body = meta.copy()
    body['contents'] = content.getvalue()

    es.index(index='mail', doc_type='meta', id=gm_id, body=meta)
    es.index(index='mail', doc_type='message', id=gm_id, body=body)


@click.command()
@click.argument('root', required=True, type=click.Path(exists=True))
def index(root):
    """imports all gmvault emails at ROOT into INDEX"""
    es = elasticsearch.Elasticsearch()
    root = path.abspath(root)
    for base, subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.meta'):
                parse_and_store(es, root, path.join(base, name.split('.')[0]))

if __name__ == '__main__':
    index()

Until Next Time

For now, you can see that what we’ve got works by using the old query tool with the --raw-result flag, and you can use it to do queries across all of your stored email. But the query tool is lacking in multiple ways: it doesn’t output everything we care about (specifically, a useful chunk of the message bodies), and it doesn’t treat some fields (like labels) as the exact matches we want. We’ll fix these next time, but for now, we can rest knowing that we’re successfully storing everything we care about. Everything else is going to be UI.


  1. After all, if you can’t attach a Word document containing your cover letter to a blank email saying “Job Application”, what’s the point of email? 

  2. I actually think the Python library messes this up: simple emails and multipart emails really ought to look the same to the developer, but unfortunately, that’s the way the cookie crumbled. 

Having Fun: Python and Elasticsearch, Part 1

I find it all too easy to forget how fun programming used to be when I was first starting out. It’s not that a lot of my day-to-day isn’t fun and rewarding; if it weren’t, I’d do something else. But it’s a different kind of rewarding: the rewarding feeling you get when you patch a leaky roof or silence a squeaky axle. It’s all too easy to get into a groove where you’re dealing with yet another encoding bug that you can fix with that same library you used the last ten times. Yet another pile of multithreading issues that you can fix by rewriting the code into shared-nothing CSP-style. Yet another performance issue you can fix by ripping out code that was too clever by half with Guava.

As an experienced developer, it’s great to have such a rich toolbox available to deal with issues, and I certainly feel like I’ve had a great day when I’ve fixed a pile of issues and shipped them to production. But it just doesn’t feel the same as when I was just starting out. I don’t get the same kind of daily brain-hurt as I did when everything was new,1 and, sometimes, when I just want to do something “for fun”, all those best practices designed to keep you from shooting yourself (or anyone else) in the foot just get in the way.

Over the past several months, Julia Evans has been publishing a series of blog posts about just having fun with programming. Sometimes these are “easy” topics, but sometimes they’re quite deep (e.g., You Can Be a Kernel Hacker). Using her work as inspiration, I’m going to do a series of blog posts over the next couple of months that just have fun with programming. They won’t demonstrate best practices, except incidentally. They won’t always use the best tools for the job. They won’t always be pretty. But they’ll be fun, and show how much you can get done with quick hacks when you really want to.

So, what’ll we do as our first project? Well, for a while, I’ve wanted super-fast offline search through my Gmail messages for when I’m traveling. The quickest solution I know for getting incredibly fast full-text search is to whip out Elasticsearch, a really excellent full-text search engine that I used to great effect on Kiln at Fog Creek.2

We’ll also need a programming language. For this part of the series, I’ll choose Python, because it strikes a decent balance between being flexible and being sane.3

I figure we can probably put together most of this in a couple of hours spread over the course of a week. So for our first day, let’s gently put best practices on the curb, and see if we can’t at least get storage and maybe some querying done.

Enough Elasticsearch to Make Bad Decisions

I don’t want to spend this post on Elasticsearch; that’s really well handled elsewhere. What you should do is read the first chapter or two of Elasticsearch: the Definitive Guide. And if you actually do that, skip ahead to the Python bit. But if you’re not going to do that, here’s all you need to know about Elasticsearch to follow along.

Elasticsearch is a full-text search database, powered by Lucene. You feed it JSON documents, and then you can ask Elasticsearch to find those documents based on the full-text data within them. A given Elasticsearch instance can have lots of indexes, which is what every other database on earth calls a database, and each index can have different document types, which every other database on earth calls a table. And that’s about it.

“Indexing” (storing) a document is really simple. In fact, it’s so simple, let’s just do it.

First, if you haven’t already, install the Python library for Elasticsearch using pip via a simple pip install elasticsearch, and then launch a Python REPL. I like bpython for this purpose, since it’s very lightweight and provides great tab completion and as-you-type help, but you could also use IPython or something else. Next, if you haven’t already, grab a copy of Elasticsearch and fire it up. This involves the very complicated steps of

  1. Downloading Elasticsearch;
  2. Extracting it; and
  3. Launching it by firing up the bin/elasticsearch script in a terminal.

That’s it. You can make sure it’s running by hitting http://localhost:9200/ in a web browser. If things are looking good, you should get back something like

{
  "status" : 200,
  "name" : "Gigantus",
  "version" : {
    "number" : "1.3.4",
    "build_hash" : "a70f3ccb52200f8f2c87e9c370c6597448eb3e45",
    "build_timestamp" : "2014-09-30T09:07:17Z",
    "build_snapshot" : false,
    "lucene_version" : "4.9"
  },
  "tagline" : "You Know, for Search"
}

Then, assuming you are just running a vanilla Elasticsearch instance, give this a try in your Python shell:

import elasticsearch
es = elasticsearch.Elasticsearch()  # use default of localhost, port 9200
es.index(index='posts', doc_type='blog', id=1, body={
    'author': 'Santa Clause',
    'blog': 'Slave Based Shippers of the North',
    'title': 'Using Celery for distributing gift dispatch',
    'topics': ['slave labor', 'elves', 'python',
               'celery', 'antigravity reindeer'],
    'awesomeness': 0.2
})

That’s it. You didn’t have to create a posts index; Elasticsearch made it when you tried storing the first document there. You likewise didn’t have to specify what the document schema was; Elasticsearch just inferred it, based on the first document you provided.4
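
If you’re curious what Elasticsearch came up with, you can peek at the mapping it generated; this is purely for poking around, and nothing later depends on it:

import json
print json.dumps(es.indices.get_mapping(index='posts'), indent=2)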

Want to store more documents? Just repeat the process:

es.index(index='posts', doc_type='blog', id=2, body={
    'author': 'Benjamin Pollack',
    'blog': 'bitquabit',
    'title': 'Having Fun: Python and Elasticsearch',
    'topics': ['elasticsearch', 'python', 'parseltongue'],
    'awesomeness': 0.7
})
es.index(index='posts', doc_type='blog', id=3, body={
    'author': 'Benjamin Pollack',
    'blog': 'bitquabit',
    'title': 'How to Write Clickbait Titles About Git Being Awful Compared to Mercurial',
    'topics': ['mercurial', 'git', 'flamewars', 'hidden messages'],
    'awesomeness': 0.95
})

Getting documents is just as easy. E.g., how can we see the post we just indexed?

es.get(index='posts', doc_type='blog', id=2)

Of course, this is boring; we’re using Elasticsearch as a really bizarre key/value store, but the whole point of Elasticsearch is to allow you to, well, search. So let’s do that.

Elasticsearch provides two different ways to search documents. There’s a structured query language, which allows you to very carefully and unambiguously specify complex queries; and there’s a simple, Lucene-based syntax that is great for hacking things together. For the moment, let’s just play with the Lucene-based one. What’s that look like? Well, if you wanted to find all posts where I was the author, you could simply do

es.search(index='posts', q='author:"Benjamin Pollack"')

So all we have to do is write field:value and we get search. You could also just do something like

es.search(index='posts', q='Santa')

to search across all fields, or mix and match:

es.search(index='posts', q='author:"Benjamin Pollack" python')

It’s just that simple.5

And, hey, this seems really close to Gmail’s search syntax. Maybe that’ll come in handy later.

Getting the Gmail Data

So with Elasticsearch under our belt, let’s look at actually coding up how to get the Gmail data! We’ll need to write an IMAP client, and then de-duplicate messages due to labels, and figure out what the labels are, and…

…or, better yet, let’s not, because someone else already figured that part out for us. Gmvault already allows mirroring your Gmail account, including all the special stuff, like which labels are on which emails and so on. So let’s just install and use that. You can install it with a simple pip install gmvault==1.8.1-beta --allow-external IMAPClient6, and then you can sync your email with a simple

gmvault sync you@gmail.com -d path/to/where/you/want/the/email/archived

Not only will this sync things for you; it’ll do it with proper OAuth semantics and everything. So that takes care of getting the emails for offline access. Next up, let’s figure out how to start getting data into Elasticsearch.

Loading the Metadata

Once Gmvault finishes downloading your emails, if you go poke around, you’ll see there’s a really simple structure going on in the downloaded data. Assuming you synced to ~/gmails, then you’ll see something like:

~/gmails/db/2005-09/118148483803226229.meta
~/gmails/db/2005-09/118148483803226229.eml.gz
~/gmails/db/2007-03/123168411054578126.meta
~/gmails/db/2007-03/123168411054578126.eml.gz
...

This looks really promising. I wonder what format those .metas are?

$ cat 2007-03/123168411054578.meta | python -mjson.tool
{
    "flags": [
        "\\Seen"
    ],
    "gm_id": 123168411054578,
    "internal_date": 1174611101,
    "labels": [
         "Registrations"
    ],
    "msg_id": "19b71702474dd770796e8aa45d@www.rememberthemilk.com",
    "subject": "Welcome to Remember The Milk!",
    "thread_ids": 123168411054578,
    "x_gmail_received": null
}

Perfect! Elasticsearch takes JSON, and these are already JSON, so all we have to do is to submit these to Elasticsearch and we’re good. Further, these have a built-in ID, gm_id, that matches the file name of the actual email on disk, so we’ve got a really simple mapping to make this all work.

And what are the .eml.gz files?

$ gzcat 2005-09/11814848380322.eml.gz
X-Gmail-Received: 887d27e7d009160966b15a5d86b579679
Delivered-To: benjamin.pollack@gmail.com
Received: by 10.36.96.7 with SMTP id t7cs86717nzb;
        Wed, 14 Sep 2005 19:35:45 -0700 (PDT)
Received: by 10.70.13.4 with SMTP id 4mr150611wxm;
        Wed, 14 Sep 2005 19:35:45 -0700 (PDT)
Return-Path: <probablyadefunctaddressbutjustincase@duke.edu>

Okay, so: good news, that’s the email that goes with the metadata; bad news, parsing emails in Python sucks. For today, let’s start by indexing just the metadata, and then spin back around to handle the full email text later.

To make this work, we really only need three tools:

  • os.walk, which lets us walk a directory hierarchy;
  • json, the Python module for working with JSON; and
  • elasticsearch, the Python interface for Elasticsearch we already discussed earlier.

We’ll walk all the files in the root of the Gmvault database using os.walk, find all files that end in .meta, load the JSON in those files, tweak the JSON just a bit (more on that in a second), and then shove the JSON into Elasticsearch.

root = '/home/you/gmails'
for base, subdirs, files in os.walk(root):
    for name in files:
        if name.endswith('.meta'):
            with open(path.join(base, name), 'r') as fp:
                meta = json.load(fp)
            meta['account'] = path.split(root)[-1]
            meta['path'] = path.join(base, name)
            es.index(index='mail', doc_type='message', id=meta['gm_id'], body=meta)

And that’s seriously it. Elasticsearch will automatically create the index and the document type based on the first document we provide it, and will store everything else. On my machine, this chews through tens of thousands of meta files in just a couple of seconds.

I did throw in two extra little bits to help us later: I explicitly track the path to the metadata (that’s the meta['path'] = path.join(base, name) bit), and I explicitly set the account name based on the path so we can load up multiple email accounts if we want (that’s the meta['account'] = path.split(root)[-1] part). Otherwise, I’m just doing a vanilla es.index() call on the raw JSON we loaded.

So far, so good. But did it work?

Searching the Metadata

We can start by being really lazy. As noted earlier, Elasticsearch provides two search mechanisms: a structured query language and a string-based API that just passes your query onto Lucene. For production apps, you pretty much always want to use the first one; it’s more explicit, more performant, and less ambiguous.

But this isn’t a production app, so let’s be lazy. As you know if you read the Elasticsearch section first, doing a Lucene-based query is this monstrosity:

es.search('mail', q=query)

Yep. That’s all it takes to do a full-text search using the built-in Lucene-backed query syntax. Try it right now. You’ll see you get back a JSON blob (already decoded into native Python objects) with all of the results that match your query. So all we’d really have to do to exceed our goal for today, to have both storing data and querying, would be to whip up a little command-line wrapper around this simple command call.
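
If you want a feel for the shape of what comes back before we wrap it in a tool, try something along these lines in the REPL (the query string is just an example):

matches = es.search('mail', q='subject:welcome')
for hit in matches['hits']['hits']:
    print hit['_score'], hit['_source']['subject']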

When I think of writing command-line tools in Python, I think click, a super-simple library for whipping up great command-line parsing. Let’s bring in click via pip install click, and use it to write our tool.

All we have to do for this tool is allow passing a Lucene-like string to Elasticsearch. I’ll also add an extra command-line parameter for printing the raw results from Elasticsearch, since that’ll be useful for debugging.

Here’s all it takes for the first draft of our tool:

#!/usr/bin/env python

import json

import click
import elasticsearch


@click.command()
@click.argument('query', required=True)
@click.option('--raw-result/--no-raw-result', default=False)
def search(query, raw_result):
    es = elasticsearch.Elasticsearch()
    matches = es.search('mail', q=query)
    hits = matches['hits']['hits']
    if not hits:
        click.echo('No matches found')
    else:
        if raw_result:
            click.echo(json.dumps(matches, indent=4))
        for hit in hits:
            click.echo('Subject:{}\nPath: {}\n\n'.format(
                hit['_source']['subject'],
                hit['_source']['path']
            ))

if __name__ == '__main__':
    search()

Let’s walk through this slowly. The initial @click.command() tells Click that the following function is going to be exposed as a command-line entry point. @click.argument and @click.option allow us to specify mandatory arguments and optional parameters, respectively, just by naming them. Then we can declare our function, make sure that our function arguments match the command-line arguments, and Click takes care of the rest.

The actual body of the function has no real surprises, but just to go over things: the es.search() call we’ve already discussed. We don’t care about any of the result metadata that Elasticsearch provides, so we simply explicitly grab matches['hits']['hits'] early on;7 and, by default, Elasticsearch stores the entire document when you index it, so instead of loading the original document explicitly, we can be really lazy and just look at the _source dictionary key in the hit.8

Aaaand we’ve already got full-text and axis-based searching across message metadata, including a nice command-line client. In less than an hour.

In the next post, we’ll explore parsing emails in Python and doing full-text search through the whole body. For today, we’ve already got lightning-fast search across message metadata.


  1. While it’s true that picking up my fifteenth package management tool certainly makes my brain hurt, that’s not quite what I mean here. 

  2. Sure, something like Xapian might make more sense, but it’s harder and I don’t know it very well, and this is for fun, so who cares. 

  3. Besides which, this particular example is heavily adapted from a talk I gave at NYC Python a few weeks back, so Python already had one foot in the door. You should come to NYC Python if you’re in the New York area; it’s a great group. 

  4. You can, and in production usually need to, specify an explicit schema. Elasticsearch calls these mappings. However, the defaults are usually totally fine for messing around, so I’m not going to explore mappings here at all. 

  5. For some queries, you may notice that Elasticsearch returns more documents than you think ought to match. If you look, you’ll see that, while it’s returning them, it thinks they’re really lousy results: their _score property will be very low, like 0.004. We’ll explore ways to mitigate this later, but note that this is similar to search engines like DuckDuckGo, Google, and Bing trying very hard to find you websites when your terms just don’t honestly match very much. 

  6. IMAPClient unfortunately is not hosted on PyPI, so you’ll need to explicitly enable fetching it from non-approved sources to proceed. 

  7. The upper parts of the dictionary contain really useful metadata about the results, but we don’t care about them for our work here, so we’ll just throw them out. 

  8. See the previous footnote about mappings? This is part of why production apps almost always need them: mappings allow you to disable storing the full document, since you probably have the original stored elsewhere anyway. But, again, it’s actually really handy for debugging and toy projects. 

Factor 0.97 Released

I’m really happy to see that Factor 0.97 is now available. Factor is a modern, concatenative programming language, similar to FORTH or Joy, but actively maintained. It’s got great performance, solid documentation, and rich libraries and tooling, including a robust web framework that powers the Factor website itself. Along with Pharo Smalltalk, Factor is one of two languages and environments I go to when I just want to have fun for a bit, which is how I ended up making my own little contribution to the release in the form of a rewritten and more robust Redis package. If you want a really mind-bending change from your day-to-day coding, I’d strongly suggest you check it out.

Code reviews and bad habits

Sometimes, I feel that my career as a coder is demarcated by the tech stacks I used to write software. Partly that’s about the programming language—Smalltalk in college, C# and Python at Fog Creek—but it’s also about all the other tools you use to get the job done. I spent eight years working for Fog Creek, and in that capacity, I had a pretty consistent stack: FogBugz for bugs, customer support, and documentation; Trello for general feature development; Kiln for code review; Mercurial for source control; Vim and Visual Studio for actual coding; and our in-house tool, Mortar, for continuous integration.2 While those tools of course changed a bit over my time at Fog Creek, they only ever changed very gradually, component-by-component, so my effective workflow remained largely unchanged.

About a month ago, when I joined Knewton, my entire stack changed all at once.3 Out with Visual Studio; in with IntelliJ. Out with Mortar; in with Jenkins. Out with Mercurial; in with Git. Out with FogBugz; in with JIRA.

While you might think that’d be a headache-inducing amount of churn, it really hasn’t been, because most of these minimally impacted my workflow. Git and Mercurial end up being surprisingly similar, JIRA is just a kind of half-finished FogBugz, and IntelliJ is at worst Visual Studio’s equal. Maybe I had to relearn some keystrokes and button placements, but the actual pattern of how I wrote code didn’t really change.

With one exception: I hate using Gerrit for code review. I don’t hate it because it’s poorly written; I hate it because its workflow encourages bad habits.

Knewton is really, really big on code review. That’s awesome, because so am I, to the point that I created a whole tool around it. So it’s certainly not the idea of code review that I object to.

Further, Gerrit’s design is actually almost exactly identical to the original Kiln prototype. There are fundamentally two ways to do code review: pre-merge, meaning that you review code before it’s in the main repository, and post-commit, which means you review it afterwards. Modern Kiln permits both, but back in 2008, when Tyler and I won the Django Dash with what became Kiln, we were all about the pre-merge workflow. Pushing to the main repository was forbidden; instead, you’d create a review workspace, push your changes there, discuss them, and, on approval, we’d automatically merge them. That’s still my preferred workflow, which is why Kiln still supports it (via the “Read and Branch” permission), and since that happens to be the only workflow supported by Gerrit, I ought to love it.

Kiln's Early UI

And I almost do, except for one, fatal flaw: the granularity of a review is wrong. In all versions of Kiln, reviews are on a series of related changes. In Gerrit, reviews are on a single patch. Thus, in Kiln, many commits may be involved in a single review, and the review approves or rejects the collection of them, whereas Gerrit’s reviews are single, isolated commits.

Each of these two patterns has lots of tools in its camp. GitHub and Bitbucket both join Kiln in the review-a-collection-of-patches camp, while Review Board, Barkeep, and Phabricator all land in the review-a-single-patch camp. So it’s not as if I’m saying one particular tool got it right, and all of the others got it wrong. But I sure am going to say that one collection of tools got it right, and the others got it wrong, because single-patch systems encourage bad habits.

There are two fundamental problems with single-patch review systems:

  1. They encourage lumping at-best-weakly-related changes together. Frequently, when I start implementing a new feature, there are at least three steps: first, refactor the existing code to make it clean to add the new feature; next, add the new feature; and finally, add unit tests. The bigger the feature, the more likely each of these steps is to itself consist of several logical steps. If you can store several distinct commits in a single review, then you can simply keep these commits grouped together. But if I’m in a single-patch system, I’m going to be strongly encouraged to do everything in one massive commit. That’s especially frustrating because refactoring existing code and adding new code get lumped together, demanding much more mental energy on my part to figure out what’s actually going on in a given review.

    You might argue that you can keep the patches split, creating one review per commit, but that’s actually worse. At best, you’re now separating the tests from the feature, and the refactoring from the motivation for the refactoring. But the real issue is that many single-patch systems make it very easy to approve any one of the commits in isolation, which is exactly the opposite of what you want. Thus, one-review-per-commit switches the balance from “annoying” to “dangerous.” Not really an improvement.

  2. They encourage you to hide your history. The whole point of a source control system is to tell you the history of how code got to be the way it was. I want to be able to see what it looked like yesterday, and last February at 2 PM, and anything in between. Sometimes that’s because I know that the code worked then and doesn’t now, and I want to know why, but lots of the time, it’s because I want to know why a given change was done. What was the context? What was the motivation? When you just keep updating one single patch the whole time it’s under review, I’m losing tons of history: all I’ll get is a single finished product as a single patch, without any of the understanding of how it got that way.1

And that’s why I’m finding myself extremely frustrated with Gerrit. It’s not that Gerrit’s a bad piece of software; it’s that it’s encouraging me to develop bad habits in how I use source control. And that’s why, for all the parts of my stack that I’ve switched out, the only one I’m truly frustrated to give up is Kiln.


  1. I’m aware that many people do this in Git anyway via aggressive rebasing, but at least GitHub, Bitbucket, and Kiln all allow the agglutinative workflow. It’s not even possible in a single-patch system. 

  2. Given the amount of love Mortar’s gotten since I wrote it as part of a hackathon, I think it should be renamed “Moribund”, but that’s neither here nor there. 

  3. Well, almost. We still use Trello heavily, and if I have anything to say about it, it’ll stay that way. 

Walled Gardens, Walled Ghettos

I’ve seen a lot of posts recently about how Windows 8, and Windows Phone 8, are failures. These posts inevitably talk about how the new user interface is a complete mess, or how, no matter how great Windows Phone 8 may be, the app situation is so bad that Microsoft should simply give up on the platform.

I actually disagree with these arguments as such. While OS X and iOS are my daily operating systems, Metro is, in my opinion, a great touch interface, and will make a wonderful tablet experience…once they remove the desktop. Almost all of the improvements delivered in Windows 8 on the desktop side of things were welcome, and I think that people will actually grow to love them…once Microsoft quits forcing its users to check into Metro every so often. On the Windows Phone front, Microsoft has made it clear that they can buy themselves out of the app problem—which, if not a solution I particularly care for, nevertheless seems to be working well for them right now. In other words, as much as these are genuine problems, I think they have solutions, and those appear to be the solutions Microsoft is actively pursuing right now.

Yet there is a subtler and more nefarious problem that Microsoft seems uninterested in fixing, and that problem keeps me from using Windows.

I was at Microsoft Build this past year. My focus was on learning about Azure and doing some labs with the .NET engineers, but, given that Microsoft gave every single attendee two free Windows 8 tablets, and that I happened to win a Windows Phone, it was kind of hard to ignore the consumer side of things. And, honestly, when I got those free devices, I was in need of both a new phone, and a new laptop. I used Windows at work; why not give it a shot at home?

So I tried. I really did. And even now, I sometimes break out my Surface, or transfer my SIM to my Windows Phone, just to see if things have changed. But the honest answer is that I can’t really use these devices as my main desktop and phone.

And it’s not because of apps. And it’s not because of user interfaces. And it’s not because of battery life, or whatever else you’ve read.

It’s because of mail, calendars, and contacts.

Last March, Microsoft rolled out an anticipatory update that removed support in their Windows 8 apps for talking to Google via Exchange ActiveSync, or EAS. This meant you could no longer access your Google calendars natively in Metro. Nearly a year later, Windows 8 still has no ability to natively work with Google calendars. Or, really, any calendaring system that doesn’t run EAS, because they don’t support the CalDAV standard that virtually every calendar server except theirs uses. Their recommended solution? Oh, you know, just start using Outlook.com instead.

Meanwhile, Windows Phone, unlike its desktop cousin, works just fine with Google Calendar. At first blush, it works great with Gmail, too, so you’d think you’re golden. But try deleting a message from a Gmail account on your phone, and you’ll discover that your definition of deleting a message is rather different from Microsoft’s: despite the phone having a special workflow for registering Gmail accounts, and showing their unique status clearly in the accounts section, deleting a Gmail message results in your phone creating a new label called “Deleted Items”, tagging your message with that label, and then archiving it. And the situation with other email providers isn’t much better: Windows Phone allows no folder customization, so you get to enjoy all kinds of lovely dummy folders on your other providers. Don’t want your sent folder called “Sent Items”? Tough luck, because your phone sure does.

What about contacts? Windows Phone actually gets this right: you can have lots of sources for contacts, and it’s easy to link duplicates from across multiple services into a kind of übercontact. That makes Windows Phone better than at least iOS, and better than the last version of Android I heavily used, too. Yet when I jump back to Windows proper, it’s Microsoft services or nothing: in practice, if I want ubiquitous access to my contacts in a Windows world, it’s Outlook.com or bust.

These all have workarounds—using a web browser to access your calendars directly on Google, for example, or giving up on the bundled Metro apps and using your old desktop apps instead—but these workarounds remove any real impetus to use Windows in the first place. If I’m just accessing everything through the web, I’d be better off with a Chromebook than a Surface, or a Geeksphone than a Lumia. For all the accusations that Apple has a walled garden on their app store, their phone does a great job integrating tons of third-party services, from Outlook.com to Google to LinkedIn to Facebook, and you better believe that their mail applications know how to handle Gmail’s unique mail structure. It’s deplorable that Microsoft can’t get its act together a year after Windows 8’s release and make it actually work with the rest of the world.

I don’t mind living in a walled garden of apps, but I’m not willing to live in a walled ghetto of services. If Microsoft wants me to use Windows, they’re going to have to admit that the world has changed and tear down that wall.

But that’s impossible!

For the past week, I have felt a wave of relief that we shipped Kiln Harmony, the first DVCS-agnostic source control system. Kiln Harmony’s translation engine ruled my life for the better part of a year, and, as the technical blog series is revealing, probably took some of my sanity with it. But we’ve received nearly universally positive feedback, and built a product that I myself love to use, so I can’t help but feel the project was an incredible success.

A success that started with me half-screaming “that’s impossible” across a table.

That’s impossible!

I like to think of myself as fairly open-minded, but I found those words flying out of my mouth almost before I’d had a chance to think about them.

Joel and I were sitting in his office in January 2012, discussing adding Git support to Kiln. As much as I loved Mercurial, and as happy as we both were that Kiln had been built on top of it, we both also agreed that the time had come to add Git support as well. The support I wanted to build was simple and straightforward: just like every other service that offers both Git and Mercurial, we’d make you pick one of the two at repository creation. To make it easy to change your mind later, I also proposed a one-click method to switch a repository to the alternative DVCS, but any given repository would ultimately be either Git or Mercurial—not both.

Joel’s idea was fundamentally different: he wanted anyone to be able to use either system on any repository at any time. And that’s why I found myself screaming “that’s impossible.”

Developers expect that pushing data to Kiln and cloning it back out again gives them the same repository. To do that, we’d have to round-trip a steady state between Mercurial and Git. No conversion tool at the time could do that, and for a very good reason: while Mercurial and Git are nearly isomorphic, they have some fundamental differences under the hood that seemed genuinely intractable. There are concepts that exist only in one of the two systems, and therefore cannot possibly be translated in a meaningful fashion into the other. Unless you’re the magic data fairy, you’re going to lose that information while round-tripping. In addition, since some of these data are tightly tied to their tool’s preferred workflow, you’re also going to severely hamper day-to-day DVCS operations in the process.

In other words, if we tried to build what Joel wanted, we’d end up with the world’s first lossy version control system, supporting only the narrow band of functionality common between Git and Mercurial. Talk about products that are dead on arrival. I wanted no part.

I ended up going back to my office after that meeting incredibly annoyed. Joel just didn’t understand the internals of the two tools well enough. If he did, he’d agree with me. I’d therefore draft up a paper explaining exactly why you could not possibly build a tool like what he was proposing.

That’s improbable!

Except I ended up doing quite the opposite.

Over and over, I’d pick a topic that I knew was fundamentally intractable—say, handling Git- and Mercurial-style tags at the same time—and bang out a few paragraphs going into the technical minutiae of why it was impossible. Then I’d read back over those paragraphs and realize there were gaps in my logic. So I’d fill gap after gap, until, finally, I’d written a detailed how-to manual for handling exactly the thing that I originally said you couldn’t.

The impossible things fell down one by one. I designed a scheme for Mercurial named branches that involved non-fast-forward merges on the Git side for lossless preservation.1 I proposed extending Mercurial’s pushkey system to handle annotated tags. I developed a workflow that could handle the competing branching strategies with minimal user confusion. I ended up going home at nearly 8 PM, only to keep writing.

The next morning, I came into the office not with a document explaining why we couldn’t make Kiln repositories simultaneously Git and Mercurial, but rather, a document explaining exactly how to do so.

That’s very unlikely!

At the time, I was still serving as Kiln’s shit-umbrella more than a developer, so I asked Kevin, one of the Kiln team’s best developers, to write me a prototype based on the white paper, thus freeing me up for the vastly sexier task of merging 15,000 SQL Server databases into one so our backup strategy would finally be viable. (Hey, I was a good shit umbrella.) He put together a surprisingly robust prototype using ideas from the white paper, extending them nicely in the process, thus indicating pretty strongly that my ideas weren’t all that bad.

Based on Kevin’s prototype, I proposed a new strategy: we would, indeed, make a version of Kiln that had Git and Mercurial “either/or” support, and we’d still aim to have that project shippable by summer on roughly the original schedule. That would be our safety option. Meanwhile, a parallel effort, dubbed “Doublespeak,” would launch to make repositories DVCS-agnostic. If Doublespeak ended in failure, we’d still have a new Kiln version to show in summer with Git support. If, on the other hand, it were promising, we’d delay the launch long enough to ship the more ambitious vision of Kiln with Doublespeak.

By a stroke of luck, things worked out such that I could swap my team-lead duties for developer duties just as Doublespeak development kicked off in earnest, and that’s how I ended up as the lead engineer on the translation engine.

The first thing I did was throw out the prototype. Dogma or no, I realized in my first day of hacking that the original design, by definition, would not perform as well as we needed it to. Its architecture would also have required me to spend a considerable amount of time modifying Git’s internals, which was not appealing to me.2

I opted for a new design that would directly leverage Mercurial’s code base and a Python Git implementation called Dulwich. I reimplemented the prototype in a day or two. Then I began the long slog through the gnarly parts of the white paper: managing ref/bookmark translation and movement. Handling tags. Handling octopus merges. Handling Mercurial’s concepts of filenode parents and linkrevs. As the Kiln “either/or” project wrapped up, more developers started joining me on the translation layer to fill in the increasingly esoteric gaps.
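
To give a flavor of what leaning on Dulwich looks like in practice, here’s a minimal sketch of the ref/bookmark half of that job: list a Git repository’s branch heads with Dulwich’s Repo API and hand them to the Mercurial side. The only real API used here is dulwich.repo.Repo and its get_refs() method (which returns bytes in recent Dulwich releases); write_hg_bookmark is a hypothetical stand-in for the Mercurial half of the translation, and the actual engine obviously did far more than copy names around.

from dulwich.repo import Repo

def git_branch_heads(path):
    # Yield (branch name, commit sha) pairs for a Git repository on disk.
    repo = Repo(path)
    for ref, sha in repo.get_refs().items():
        if ref.startswith(b"refs/heads/"):
            yield ref[len(b"refs/heads/"):].decode("utf-8"), sha.decode("ascii")

def mirror_branches(git_path, write_hg_bookmark):
    # write_hg_bookmark(name, node) is hypothetical: in a real bridge it would
    # point a Mercurial bookmark at the translated changeset, not the raw sha.
    for name, sha in git_branch_heads(git_path):
        write_hg_bookmark(name, sha)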

But it’ll still never actually fly!

It wasn’t long before we had a nearly complete solution that was ready for dogfooding. Unfortunately, almost as soon as we began using the new magic, we hit two major problems that threatened to scuttle the project.

The first was that the entire engine was just too slow, limited almost entirely by how much disk I/O Doublespeak had to do to get its job done. This was already brutal on Fog Creek’s private Kiln instance; our ops team was having nightmares about what the disk damage would look like on Kiln On Demand. Thus began months of work to try to get our disk access as minimal as possible. The general mantra was to read any given revision at most once when converting—and, whenever possible, not at all. We introduced caches, memoization, and more. At the beginning, I was landing order-of-magnitude performance improvements daily. By the end, we’d optimized the code base so much that dinky little 2% performance improvements were frequently the result of days of work. But we had the system performing at a level we could actually use.
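
As a purely illustrative sketch of that mantra (this is not Kiln’s actual code; read_revision and node are names I made up), the basic shape is a cache sitting in front of the expensive per-revision disk read, so asking for the same revision twice only pays for the first read:

class RevisionCache(object):
    # Illustrative only: ensure any given revision is read from disk at most once.

    def __init__(self, read_revision):
        self._read = read_revision   # the expensive, disk-backed loader
        self._cache = {}

    def get(self, node):
        if node not in self._cache:
            self._cache[node] = self._read(node)
        return self._cache[node]

The interesting work, of course, wasn’t the cache itself but figuring out which reads could be skipped entirely: the “whenever possible, not at all” half of the mantra.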

The second problem we hit was that, while we had lots of experience with how Mercurial repositories were supposed to look, and how Git data was supposed to look, we grossly underestimated how much real-world variation there’d be in Mercurial and Git repositories. The Kiln team spent weeks adding more and more logic to Doublespeak to preserve “corrupt” data before it could handle real-world repositories like Mercurial’s own. But we ultimately got to a place where nearly every repository we threw at the thing losslessly round-tripped.3

But we tackled both of these challenges. And soon, dogfooding became alpha, alpha became beta, the Doublespeak code-name became the much less Orwellian “Kiln Harmony,” and, last Tuesday, Kiln Harmony shipped.

That was supposed to be impossible!

I don’t think Joel was prescient and knew Kiln Harmony was doable, and I certainly don’t think he knew how to do everything the white paper explained before I wrote it. But I definitely believe he knew that pushing me as hard as he did would force me to find a way if there were one to be found.

In case it’s not clear, I’m very glad he did.


  1. This strategy ended up being too clever by half, so we dropped it for a less-perfect but less-surprising solution, but I was very excited to realize it was possible at all. 

  2. I honestly couldn’t care less whether you like Mercurial or Git at the end of the day, but I think it’s objectively obvious that Mercurial is far easier to hack and extend than Git. Its code base is DRY, orthogonal, written in Python, and (in the sense of executable count) monolithic. Therefore, solutions where the hacking could be more in Mercurial, and less in Git, were highly attractive. 

  3. Annoyingly, we somehow spaced on ever checking the Linux kernel, so of course that was one of the first repositories someone fed Kiln Harmony on launch day, and it crashed the translation engine. Thankfully, while there are lots of user features headlining Kiln Harmony, one of the features the developers are most excited about is that we are finally in a place where Kiln can be continuously deployed. Open source may make all bugs shallow, but continuous deployment makes all bugs short-lived.

Stepping back and being quiet

I always travel ready to get stuck and be forced to work remotely. My tool of choice for that varies, but has recently been a third-generation iPad armed with my Nokia 800’s old folding keyboard, PocketCloud, and Prompt. With these four simple tools, plus Azure and AWS in a pinch, I can pretty easily get a good day’s work done anywhere. So when I got stuck in Los Angeles this past Saturday, I wasn’t worried: I knew I’d still be able to help Fog Creek get stuff done.

You know what an iPad, PocketCloud, and Prompt do not help with?

Shuttling diesel fuel, five gallons at a time, up seventeen flights of stairs.

I’ve been at Fog Creek seven years. I’ve worked through emergencies before: I’ve been there moving systems around, rebuilding databases, getting emergency code fixes out to work around infrastructure problems. I’ve even written code on an airplane to fix a bug in the account registration system on Thanksgiving, pushing it out right after we landed. When stuff breaks, I’m there.

But this time, there was nothing I could do to help. With the exception of providing some very minor assistance with the shutdown and power-up when we thought the datacenter’s death was imminent, all I could do was sit and watch.

I tried to think of something to do, anything, to make up for my absence. I spooled up machines on AWS and Azure so I could…write code if I had to, I guess? And nabbed copies of the deployment system for…some reason. I wanted to do something to help out, and was, briefly, being a noisy jerk in the chat room trying desperately to find something to do.

One thing I’ve come to realize is that, sometimes, the best thing to do is to shut up and stay out.

I love that at Fog Creek, everyone I work with is bright, focused, and eager to build amazing stuff. But that means that my individual absence, while not ideal, is not going to make or break anything, and right now, the help Fog Creek needs is 100% people on the ground. The best way I can help from LA is to quietly monitor the chat room, pipe up if I know a specific answer that no one else currently available knows, and otherwise, keep to myself.

The annoying thing to me is that I’ve been in the reverse situation before, many times: trying to get a system back online while answering every five minutes whether it’s back yet, or whether someone can help, is ludicrously frustrating. But it was incredibly hard to recognize that I was doing the same thing when my role was reversed. I was so used to being able to help that it took a while for me to genuinely understand that I couldn’t.

The next time this happens, I’m going to follow a strict check-list before interjecting:

  1. Make sure I try to understand the full problem first. That includes not asking, “what’s the situation with ABC, and have you tried XYZ?” until I’ve read the full chat logs (if applicable) or talked to someone on a private channel or face-to-face who is clearly currently idle, and therefore interruptible.
  2. Evaluate whether I have any relevant skills or expertise to help. Is the team trying to figure out how to quickly ship 10 TB of data to S3? Unless I actually have real experience trying to accomplish that, I’ll be quiet. Anyone can google the random comment I saw on Serverfault if that actually becomes relevant.
  3. Even when I do have relevant experience, if someone is already providing accurate, relevant information, I should be quiet. The random guy on the other team describing the migration process for PostgreSQL may not have as much experience with it as I do, but if he’s right, no one needs to hear me validate it. If they’re actually unsure if the information they’re getting is accurate, they will ask for confirmation from someone else, and I’ll provide it.
  4. Evaluate whether what I’m about to say is actually productive. Even when I have experience at hand, if the comment I want to make isn’t going to move things forward, it’s useless. If something will take ten days to complete, I know a cool hack that’ll cut it to eight, and the actual problem is any solution has to be done in five hours, then I can keep it to myself. And finally,
  5. Do not 4chan the conversation. People stress out and need to blow off steam. When I need to do that, I’ll do it off-channel, to keep the main forum clean. When other people do it, I’ll ignore it, and not add fuel to the fire.

Instead of trying to help with the data center, I’m responding to public forums, working on marketing campaigns for a new product we’re releasing when this is all over, and doing some performance work on that same new product—basically, the stuff I’d be doing if there weren’t an emergency. And I’m also making sure I’m well-rested so that, whenever I actually manage to get back to Manhattan, I, too, can help out with the bucket brigade.

I’m annoyed it took a hurricane to teach me to stand back and not throw myself into an adrenaline-fueled craze of helping when something goes wrong, but I’m happy the lesson took.

To all my fellow Creekers, and the awesome folks from Stack Exchange and Squarespace who are helping out: you are all mad awesome people, and you deserve serious accolades. I’m sorry I can’t help, but I can’t think of any group of people I’d trust more than all of you to get this done. Good luck!

Using Trello to organize your summer job search

I remember when I was back in college looking for summer internships. It stank. No one gave me any meaningful guidance (very much including the college employment office). I had no idea how to organize the process. I had no central location to store contact information, or to easily make sure I’d done the next step. I just kind of had a stack of brochures and business cards on my desk that I tried to follow up on, and an Entourage calendar of any upcoming interviews I had. Basically, I was a mess.

Thankfully, Fog Creek’s resident organizational ninja and jack-of-all-trades, Liz, took time out of her day to help all of you college students manage your summer internship applications. So when you start applying to companies this fall, you should have a really easy time keeping everything organized in a single, easy-to-use spot.

Good luck!