Explanation for Site Blip on March 19
Last night (3/19/2013) between about 6:50pm and 7:30pm Pacific, we were effectively “down.” This was an unplanned outage and I wanted to take this opportunity to explain what happened.
Foundational Thoughts
Before I start, I think it’s appropriate to explain how we think about our site. We rarely schedule maintenance windows. In fact, I think it’s fair to say we never schedule downtime. In the two years that I’ve been here we’ve only scheduled one window, and that was to switch out our edge routers. Even then, we could probably have done it without any downtime, but the effort required would have far outweighed the benefits, so we opted for a short, sharp outage.
The reason we never want to schedule these windows is that we believe, as a live site, we should always, always be live. That thinking permeates everything we do, and we have built our deployment tooling so that we can deploy code at will. Compared with many companies our size, this is a pretty radical notion. On any given day we release code to our production users tens of times, and much of the work we’re investing in now aims to let us deploy even more often. We’re working hard to build a great product, and deploying code is how we’ll get there.
So what happened?
We pushed a code change that moved where we fetch data about follower relationships on our site. Our second most visited page is http://www.twitch.tv/directory/following, which displays the channels you follow that are currently live. The “which channels are live” data is stored separately from the “which channels do you follow” data. Since the set of live channels is often smaller than the set of channels an individual follows, we chose to enumerate all of the live channels in the query. In pseudo-SQL (table and column names are illustrative):
    select channel_id
      from follows
     where user_id = <you>
       and channel_id in ( <the list of live channel ids> );
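To make that concrete, here’s a minimal sketch of what that lookup might look like from application code. The follows table, its columns, and the get_live_channel_ids helper are illustrative names, not our actual schema or code.

    import psycopg2

    def followed_live_channels(conn, viewer_id, live_channel_ids):
        """Return the subset of currently live channels that this viewer follows."""
        with conn.cursor() as cur:
            # The live-channel set is usually small, so it is passed straight into
            # the query rather than joining against the separate "who is live" store.
            cur.execute(
                """
                SELECT channel_id
                  FROM follows
                 WHERE user_id = %s
                   AND channel_id = ANY(%s)
                """,
                (viewer_id, list(live_channel_ids)),
            )
            return [row[0] for row in cur.fetchall()]

    # Roughly one of these queries runs on every view of /directory/following:
    #   conn = psycopg2.connect("dbname=follows_example")
    #   live_ids = get_live_channel_ids()   # hypothetical: reads the live-channel store
    #   channels = followed_live_channels(conn, viewer_id, live_ids)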
Individually, each query is fairly simple and finishes very quickly; what we failed to account for is just how many of these queries actually take place. The page is so heavily visited that this change dumped a 10x load on our database, and not being prepared for that jump is what effectively rendered the site unable to function.
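To illustrate why individually fast queries can still swamp a database, here’s the kind of back-of-the-envelope math we should have done. The numbers below are made up for the example and are not our real traffic figures.

    # All numbers here are hypothetical, purely to show the shape of the math.
    page_views_per_sec = 2000     # traffic to /directory/following (made up)
    queries_per_view   = 1        # one follows-vs-live query per page view
    ms_per_query       = 5        # each query is individually fast

    db_work_per_sec = page_views_per_sec * queries_per_view * ms_per_query / 1000.0
    print(f"{db_work_per_sec:.0f} seconds of query work per wall-clock second")
    # 10 seconds of work arriving every second means the database needs at least
    # ten cores/connections dedicated to this one query just to keep up.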
More about our storage
Prior to this change, our follower relationships had been stored in MongoDB. We’re not big fans of MongoDB, and we consider this portion of our stack technical debt; it doesn’t burn us, so we’ve generally put fixing it on the back burner. This very simple change moved those queries to our PostgreSQL cluster. We love PostgreSQL, so we were pretty happy that we’d finally taken the time to make the switch.
In this case, however, we failed to really assess the impact this change would have. We didn’t look at the load these queries put on MongoDB, and we didn’t consider how such an important infrastructure change needed to be rolled out. We simply dropped the ball. We often dark launch changes like this, but we were lulled into a false sense of security because we ran a sample of the queries and they were fast; we didn’t anticipate the sheer volume at all.
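For anyone unfamiliar with the term, “dark launching” means running the new code path in production without ever serving its results, so you can measure it safely. A minimal sketch of that pattern follows; the helper functions are passed in as parameters because this is a hypothetical illustration, not our actual tooling.

    import logging
    import random
    import time

    log = logging.getLogger("follows_migration")

    def followed_live_dark(viewer_id, live_ids, fetch_old, fetch_new, sample_rate=0.01):
        """Serve traffic from the old (mongo) path; shadow a sample of requests
        against the new (postgres) path to measure load and compare results."""
        result = fetch_old(viewer_id, live_ids)

        if random.random() < sample_rate:
            start = time.time()
            shadow = fetch_new(viewer_id, live_ids)   # never shown to users
            log.info("shadow query: %.1f ms, results match: %s",
                     (time.time() - start) * 1000.0,
                     sorted(shadow) == sorted(result))

        return result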
Conclusion
First and foremost, we’ll be more careful. We still need to make this change, and the graphs of the load spike now give us a reasonable baseline for predicting how much additional capacity we’ll need. Even then, we’ll introduce the change gradually, verifying that the results are correct and that it doesn’t hurt the performance of the site. This change will let us add a bunch of cool features that permit you to better personalize your Twitch experience, so getting it right is paramount to us.
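One common way to introduce a change like this gradually is a deterministic percentage ramp keyed off the user id, so the same user always gets the same code path and the percentage can be dialed up, or back down to zero, from configuration. A rough sketch of that idea, not our actual rollout tooling:

    import zlib

    def use_new_follows_path(user_id: int, rollout_percent: int) -> bool:
        """Deterministically bucket users 0-99; rollout_percent comes from live config."""
        bucket = zlib.crc32(str(user_id).encode("utf-8")) % 100
        return bucket < rollout_percent

    # Typical ramp: 1% -> 5% -> 25% -> 100%, watching database load and spot-checking
    # that the new path returns the same results as the old one at each step.
    # If anything looks wrong, setting rollout_percent to 0 falls back to the old path.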
Mike
Director of Site Engineering, Twitch