Technically Speaking — Group Chat and General Chat Engineering
Twitch Engineering is very excited about our new Group Chat feature — not only because it’s friggin’ awesome, but also because we have been using it to facilitate broader improvements to our chat architecture. These changes are aimed at improving our QoS — we hate dropped messages as much as you. For this post, I’d like to give an overview of some of the things we’ve been doing to improve overall chat reliability over the past few months.
One of Twitch Engineering’s long-term goals is to move away from a monolithic Ruby on Rails application toward a Service Oriented Architecture, where each feature of the site is a separate component. Our Group Chat design advances this goal by creating a new service that will warehouse all chat data (mod lists, chat colors, etc.), data that has historically lived in databases owned by our Rails system. This data migration is currently a work in progress.
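To make that concrete, here is a minimal sketch (in Go, with hypothetical names, not our actual service) of the kind of interface such a chat-data service might expose once this data moves out of Rails:

```go
// Package chatdata sketches a standalone store for per-channel chat metadata.
package chatdata

import "context"

// ChannelChatData holds the per-channel metadata that previously lived in
// Rails-owned databases.
type ChannelChatData struct {
	Channel    string
	Moderators []string          // mod list
	UserColors map[string]string // user -> chat color
}

// Store abstracts the new chat-data service; both Rails and the chat backend
// would talk to this API instead of sharing database tables.
type Store interface {
	GetChannelChatData(ctx context.Context, channel string) (*ChannelChatData, error)
	AddModerator(ctx context.Context, channel, user string) error
	SetUserColor(ctx context.Context, user, color string) error
}
```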
Once the migration is complete, this data separation will mean that chat remains unaffected when Rails is hit hard. To illustrate why this is important, you can see the correlation between Rails load and chat performance spikes in the following graphs (chat => Rails request timings on the left, internal chat timings on the right, all in milliseconds):
Analysis of our data shows that load spikes in non-chat-related parts of the site have a direct impact on the chat system. Separating our chat data from our Rails data will therefore improve chat reliability and help us focus on solving other performance issues.
We’ve also been integrating new build systems and instrumentation tools that have been developed by other Twitch engineers. Our new build system uses virtual machines, allowing us to easily build and test in environments identical to our production boxes. This lets us debug faster and more reliably. This new system is also starting to be used by other engineering teams at Twitch.
We’ve also integrated a tool that allows us to trace individual messages through our entire chat pipeline, helping us correlate system bottlenecks with usage patterns. Integrating this tool into our chat system lets us work more efficiently in the long run and gives us more insight into how the system behaves under various conditions.
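As a rough illustration (hypothetical types and names, not our actual tooling), per-message tracing can be as simple as carrying a trace ID with each message and recording a timestamped span at every stage of the pipeline:

```go
// Package trace sketches per-message tracing across pipeline stages.
package trace

import (
	"log"
	"time"
)

// Span records when a message entered and left one stage of the pipeline.
type Span struct {
	TraceID string
	Stage   string
	Start   time.Time
	End     time.Time
}

// Stage wraps one pipeline step so its latency is recorded for the message.
func Stage(traceID, stage string, fn func()) Span {
	s := Span{TraceID: traceID, Stage: stage, Start: time.Now()}
	fn()
	s.End = time.Now()
	log.Printf("trace=%s stage=%s took=%s", traceID, stage, s.End.Sub(s.Start))
	return s
}
```

Aggregating these spans by stage is what lets us line up bottlenecks against usage patterns.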
Of course, there is also the question that’s on everyone’s mind: how do we prevent another “Twitch Plays Pokémon” scenario, where chat breaks down completely due to huge spikes in load? We’re using a separate cluster of machines to handle all Group Chat messages, similar to how our Event Chat cluster is separate from our primary chat cluster, though with an even greater degree of separation here.
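In simplified form (hypothetical names, assuming a router in front of the chat servers), the cluster selection looks something like this:

```go
// Package routing sketches how chat rooms are pinned to separate clusters so
// that load on one pool of machines cannot affect the others.
package routing

type Cluster string

const (
	PrimaryCluster Cluster = "primary"
	EventCluster   Cluster = "event"
	GroupCluster   Cluster = "group"
)

// Room is the minimal information the router needs about a chat room.
type Room struct {
	Name    string
	IsGroup bool // true for Group Chat rooms
	IsEvent bool // true for special-event channels
}

// ClusterFor picks which pool of chat servers should handle a room.
func ClusterFor(r Room) Cluster {
	switch {
	case r.IsGroup:
		return GroupCluster
	case r.IsEvent:
		return EventCluster
	default:
		return PrimaryCluster
	}
}
```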
While multiple clusters work well for Group and Event Chat, our more general strategy for dealing with massive load spikes going forward is to artificially duplicate our chat traffic. The idea is that if we artificially double the traffic to our backend, then we must provision our system to handle twice the load we actually need. If we get a sudden surge, we can automatically dial back this artificial traffic to absorb the spike while we work on expanding capacity even further, all without affecting QoS.
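Here is a minimal sketch of that idea (hypothetical code, not our production implementation): every real message is re-sent to the backend some fraction of the time, and that fraction can be dialed down instantly when a real spike hits.

```go
// Package loadbuffer sketches artificial traffic duplication with a tunable ratio.
package loadbuffer

import (
	"math/rand"
	"sync/atomic"
)

// Duplicator re-sends a configurable fraction of real traffic as synthetic load.
type Duplicator struct {
	factor atomic.Int64     // duplication ratio in percent (100 = double all traffic)
	send   func(msg []byte) // delivery function for the chat backend
}

func New(send func(msg []byte)) *Duplicator {
	d := &Duplicator{send: send}
	d.factor.Store(100) // start by doubling traffic
	return d
}

// SetFactor lowers (or raises) the duplication ratio, e.g. during a surge.
func (d *Duplicator) SetFactor(percent int64) { d.factor.Store(percent) }

// Handle forwards the real message and, probabilistically, a duplicate tagged
// as synthetic so the backend can discard it after doing the work.
func (d *Duplicator) Handle(msg []byte) {
	d.send(msg)
	if rand.Int63n(100) < d.factor.Load() {
		d.send(append([]byte("synthetic:"), msg...))
	}
}
```

Tagging the duplicates means the backend still does the real work of routing and fan-out before dropping them, so the synthetic load exercises the same code paths as genuine messages.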
We’ve also made many other improvements: rewriting Python code in Go for better performance, tweaking server and database configurations, adopting smarter caching strategies, and of course fixing various bugs. We’re excited to elevate our chat QoS even further in the coming months, and we hope you enjoy Group Chat!