Every once in a while, you have an idea that you totally think is the bee’s knees but which ends up being the bee’s ass.
When Kiln and FogBugz went to a weekly release cycle a few weeks ago, we had what we thought, if not a brilliant idea, was a pretty good one: we’d upgrade Fog Creek On Demand at midnight on Saturday, when all the normal people were asleep. It’s pretty common practice, it minimizes customer impact, overall a good thing. Right?
By 12:30 AM on Monday, I had developed a very different opinion of the whole matter. In particular, I think the idea of doing a midnight deploy on the weekend has got to be the absolute dumbest thing I’ve ever heard of, right up there with Capris and the Nintendo Power Glove. Because, you see, you know how no one else works on the weekend? You know how that’s why it seems so alluring? Well, that applies to you, too. So, for the same reason that your customers are less likely to be hitting your website, you have a drastically decreased ability to handle things if something actually does go wrong.
This got drilled into my head very painfully this past weekend. Our build manager Rock did the deploy to production at 10 PM on Saturday, checked out that everything looked good, and then enjoyed his weekend. At about the same time, I checked my test account, saw the update had gone out, and verified that all seemed in order. So we both decided the deployment had been a great success, and had wonderful evenings, during which I may possibly have made a total ass of myself in front of a Kinect playing Dance Central, but that’s for another post.
But while we were enjoying ourselves, Fog Creek On Demand had a deep, dark secret:
The deployment had failed.
You see, Kiln actually has three components: Kiln itself (what you hit when you go to e.g. https://mirrors.kilnhg.com/); FogBugz (which is used for a variety of purposes even if you don’t actually use FogBugz in your On Demand account); and a plugin for FogBugz which allows the two to communicate. Unbeknownst to us, but beknownst to the servers, the version of the Kiln Plugin we deployed did not actually work, in a very subtle way: all the APIs were valid except one, and that one call would fail. The real kicker, though, was that we anticipated that FogBugz might drop out for a moment when we originally designed the system, so we programmed Kiln to ignore being unable to communicate with FogBugz as long as possible. The failed API call turns out to be one that’s trivially cached for a very long time, and so is one that Kiln would allow to fail without actually dying. So even though everything looked just fine, and would work just fine for a while, and you could even use Kiln and FogBugz and not notice a problem, Kiln accounts were doomed to start flicking off as they finally had to invalidate their cache and would be forced to disable themselves until FogBugz came back online.
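To make that failure mode concrete, here is a minimal Python sketch. It is not Kiln's actual code: `TolerantCache`, `smoke_test`, `make_fake_upstream`, `FakeClock`, and the endpoint names are all hypothetical, invented purely for illustration. It shows how a long-lived cache can keep serving good answers after the upstream call has broken, and how a post-deploy check that bypasses the cache would catch the breakage right away.

```python
import time
from typing import Callable, Dict, Iterable, List, Set, Tuple


class TolerantCache:
    """Caches upstream results; a fresh entry never touches the upstream."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # key -> (value, expiry)

    def get(self, key: str) -> str:
        entry = self._entries.get(key)
        if entry is not None and self._clock() < entry[1]:
            return entry[0]  # cache hit: a broken upstream goes unnoticed
        value = self._fetch(key)  # raises only once the entry has expired
        self._entries[key] = (value, self._clock() + self._ttl)
        return value


def smoke_test(fetch: Callable[[str], str], keys: Iterable[str]) -> List[str]:
    """Call every endpoint directly, bypassing the cache; return the failures."""
    failures = []
    for key in keys:
        try:
            fetch(key)
        except Exception as exc:
            failures.append(f"{key}: {exc}")
    return failures


class FakeClock:
    """Manually advanced clock so the example needs no real waiting."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def make_fake_upstream(broken: Set[str]) -> Callable[[str], str]:
    """Stand-in for the plugin API: endpoints listed in `broken` raise."""
    def fetch(endpoint: str) -> str:
        if endpoint in broken:
            raise RuntimeError("plugin call failed")
        return f"ok:{endpoint}"
    return fetch


if __name__ == "__main__":
    clock, broken = FakeClock(), set()
    upstream = make_fake_upstream(broken)
    cache = TolerantCache(upstream, ttl_seconds=3600, clock=clock)

    cache.get("license")       # warm the cache before the "deploy"
    broken.add("license")      # the deploy breaks exactly one call
    print(cache.get("license"))                        # still served from cache
    print(smoke_test(upstream, ["license", "users"]))  # direct call exposes it
```

In this sketch, advancing the clock past the TTL is what finally makes `cache.get("license")` raise, which mirrors accounts dropping off one by one hours after a deploy that looked healthy.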
The first people to discover this were exactly the people you’d expect, given our logo: Kiwis, since, due to the all-powerful Theory of Relativity, or perhaps merely Magic, they have managed to have their Monday mornings at about 7 PM on Sunday evening. So one of our Kiwi customers went to hit their Kiln install, and was greeted by an infinite redirect loop, as a very lonely Kiln installation desperately attempted to find any FogBugz installation out in the world who would talk to it, and, finding none, simply tried again indefinitely. Tim called Tyler and me, who in turn called Rock, and just for good measure another FogBugz developer, and we all got the problem taken care of, so everything was perfectly fine.
No, wait! That’s not what happened at all!
- Because it was Sunday, it took Tim about half an hour or so to actually get in touch with Tyler and me, during which our customers were stuck on a busy weekday, and Tim could make no forward progress on resolving the issue.
- Once he did finally get in touch with us, it took us a long time to spin up on solving the problem, since we were forced to get to our apartments, fire up wimpy laptops, and get into our office computers via slow RDC-via-VPN connections.
- Once we finally figured out the problem, we realized that, even though we knew exactly what was wrong, we had absolutely no idea how to fix it, because the issue was in the bowels of how FogBugz On Demand deploys its plugins. I therefore called Rock, who, in addition to being our build manager, also does an amazing job running FogBugz in his spare time. Rock unfortunately told me he couldn’t be available for half an hour, but if I wanted to take a stab at fixing things, I just needed to 齰蝌齰蜡. I said thank-you and then cried a little inside after he hung up.
- I then tried to find a FogBugz developer to explain to me how exactly one did…whatever it was I was told to do. So I started calling FogBugz developers at random. The first one I tried to call was the plugin guru, but he was busy. So then I called our Canadian guru, Dane, who understood how to do the thingie, which ended up roughly translating to “load random assemblies into a running AppDomain.” This idea ends up not working, so let’s skip ahead.
- About an hour later, Rock was able to join us, and we were finally able to try to cut new builds. Because the QA team does not generally work Sunday evenings, we were forced to do a very minimal amount of QA before just opting to pull the trigger and deploy to all Fog Creek On Demand customers. By this point, it was already about 11:30.
So, for those keeping score: it took half an hour to reach anybody who could fix the problem; it took more than an hour before we were actually in a position to properly debug it; and it took another couple of hours to actually get in touch with everyone required to fix the problem. Further, throughout the whole evening, everything happened in slow motion: people would step away from their computer and miss a HipChat notification, they’d need to put kids to bed, they’d need to take dogs for walks, they’d need to outrun the cops, and so on. Even when everyone was fully on-deck, I think conservatively that things happened at half the normal speed at best.
Conversely, if we had simply deployed first thing on a workday, and things had gone wrong, we’d have had everyone right there ready to take care of it immediately. We’d have had a real QA team to make sure the build was high-quality before we unleashed it. We’d have been able to pile into a single office to discuss the problem in a very high-bandwidth maneuver, and I think we literally might have been able to fix the problem in 30 minutes, rather than five and a half hours. And despite our midnight Saturday deploy, customers were still adversely affected: we ended up negatively impacting them anyway, and then being extremely slow to respond compared to what we’d have been capable of with a midweek deploy.
Well, screw that.
New policy: Kiln and FogBugz will deploy midweek. We’ll still do it late at night, but in the middle of the week, we’ll have everyone on-deck first thing in the morning in case anything blows up, and our chance of being able to reach people on a Wednesday is dramatically higher than on a Sunday night even if things do have an immediate negative impact.
Weekend deployments are for chumps.
If you want the full post mortem, plus some more details on what went wrong, please visit my write-up on the Fog Creek status blog. I’ve also spelled out how we’ll be avoiding getting into this situation in the future, Wednesday deploys or no.