Regarding Recent Chat Issues
(Chat is complicated. There, we said it. And recently, Twitch has experienced its first significant chat outages in quite a long while. In the past year, our team has rolled out a new chat infrastructure that we lovingly refer to as TMI, and so far it’s been able to meet the challenge of scaling alongside our rapidly expanding community. Here to discuss last night’s chat outage, along with the complexity that exists in making something like this work, is Mike Ossareh.)
Chat is a key part of our product, and we love it; more importantly, you guys love it! After our site outage last Friday we decided to check the “remove password hash from our pages” task off our TODO list. On Wednesday we migrated you guys to using OAuth2 for your chat connections. This brings our site in line with our mobile clients and broadcasting SDK, which have used OAuth for many months. This change altered the balance of our systems, which caused our chat authentication to break. In this post we aim to explain what happened.
Our chat stack looks somewhat like this:
Your client (either an IRC client or our web client) connects to an edge server. The edge servers communicate with the clue server using HTTP and JSON. The clue server is responsible for answering questions that the edge servers ask (e.g. the edge asks “Can user foo123 send a message to room bar?”, and the clue server answers “yes” if appropriate). As such, the majority of the chat logic lives in the clue server. The clue server is also totally stateless, which lets us roll out changes to the chat logic without bouncing the edge servers and forcing your clients to reconnect.
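To make that question-and-answer flow concrete, here is a minimal sketch of what an edge-to-clue lookup could look like. The endpoint path and JSON field names are assumptions for the sake of illustration, not the actual TMI wire format:

```python
import json
import urllib.request

def can_send_message(clue_host: str, user: str, room: str) -> bool:
    """Ask the clue server whether `user` may post in `room`."""
    payload = json.dumps({"action": "send_message", "user": user, "room": room}).encode()
    req = urllib.request.Request(
        f"http://{clue_host}/permissions",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=20) as resp:
        answer = json.loads(resp.read())
    return answer.get("allowed") is True

# e.g. can_send_message("clue.internal.example", "foo123", "bar")
```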
Moving to OAuth tokens for authentication was as simple as changing our web client to supply an OAuth token, fetched via AJAX, in place of the password hash. This change took place entirely on the frontend and required no code changes to our chat stack.
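From a chat client’s point of view the change is small: the OAuth token goes where the password hash used to go. A rough sketch, with hostnames and handshake details that are illustrative rather than exact (our web client does the same thing, except it fetches the token over AJAX first):

```python
import socket

def connect_to_chat(username: str, oauth_token: str,
                    host: str = "irc.twitch.tv", port: int = 6667) -> socket.socket:
    """Open a chat connection, authenticating with an OAuth token."""
    sock = socket.create_connection((host, port))
    # The OAuth token replaces the old password hash.
    sock.sendall(f"PASS oauth:{oauth_token}\r\n".encode())
    sock.sendall(f"NICK {username}\r\n".encode())
    return sock
```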
Over the course of the day we had been getting reports of people not being able to log into chat. It was fair to assume the problem was related to the OAuth change, but we were not sure why code that had been running in production for a long time would suddenly cause these problems. As it turned out, all the extra requests were causing the clue server CPUs to get bogged down in SSL negotiation.
The edge server makes a request to the clue server with the OAuth token and waits for a response. There is a 20 second timeout, after which the edge server cancels the request and informs the client that auth failed. The clue server has a queue from which workers pull requests to handle, and many of these requests require a response from our Rails API. Historically the volume of these requests had been low enough that the speed penalty of SSL negotiation was not an issue. Now that we were making many times more requests to this API, we ran out of CPU on the clue server processes. Requests from the edge servers were timing out, but because the clue server is never told that an edge server has abandoned a request, it was left chugging through the queue while the edge servers kept adding new requests to the end of it.
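Here is a toy model of that failure mode (the numbers and per-request cost are made up; only the shape of the problem matters): the edge gives up after 20 seconds, but the clue workers have no way of knowing that, so they keep spending CPU on requests nobody is waiting for while new ones pile up behind them.

```python
import collections
import time

EDGE_TIMEOUT = 20.0   # seconds before the edge abandons a request
HANDLE_TIME = 0.5     # pretend per-request cost (SSL negotiation, API round trip, ...)

queue = collections.deque()

def edge_enqueue(request_id: int) -> None:
    """The edge server hands an auth request to the clue server's queue."""
    queue.append((request_id, time.monotonic()))

def clue_worker_step() -> None:
    """A clue worker handles the next request, whether or not anyone still cares."""
    request_id, enqueued_at = queue.popleft()
    time.sleep(HANDLE_TIME)  # the work happens regardless
    waited = time.monotonic() - enqueued_at
    if waited > EDGE_TIMEOUT:
        # The edge already told the client "auth failed"; this CPU was wasted.
        print(f"request {request_id}: answered after {waited:.1f}s, edge long gone")

# A burst of 100 logins arrives at once; everything queued behind the
# first ~40 requests is answered only after the edge has given up.
for i in range(100):
    edge_enqueue(i)
while queue:
    clue_worker_step()
```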
We don’t need SSL for requests within our own data center, so the fix was simply to stop using it for those internal calls. This change reduced CPU utilization on our chat tier by 20%:
As well as ~50% on our web tier:
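For the curious, the fix itself is not much more exciting than flipping the URL scheme for internal calls. The flag, hostname, endpoint, and header format below are made up for illustration; the point is just that plain HTTP skips the handshake that was eating the CPU:

```python
import urllib.request

USE_SSL_FOR_INTERNAL_API = False  # previously True

def api_url(path: str) -> str:
    scheme = "https" if USE_SSL_FOR_INTERNAL_API else "http"
    return f"{scheme}://api.internal.example{path}"

def check_token(token: str) -> bytes:
    """Ask the Rails API about an OAuth token, without the SSL handshake."""
    req = urllib.request.Request(
        api_url("/oauth/check"),                      # hypothetical endpoint
        headers={"Authorization": f"OAuth {token}"},  # header format is illustrative
    )
    with urllib.request.urlopen(req, timeout=20) as resp:
        return resp.read()
```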
“SSL negotiation is CPU intensive” is a phrase we’re all well acquainted with; even so, it was a surprise to find it as the root cause of this incident. Chat is stable now, and we hope you learned a little about the infrastructure behind it.