The idea for this project began when a coworker and I were talking about NVIDIA’s photo-realistic generated human faces using StyleGAN and they mentioned, “I wish someone made one of those for Twitch emotes.” I had always wanted to take some time to learn more about convolutional neural networks and was in the middle of a machine learning project for work, so it seemed like trying to build this would be a quick and relevant personal project.
The original plan (with my time estimates):
Scrape all of the Twitch emote image assets (1 day)
Write a progressive growing GAN implementation using Keras as a proof of concept (1 week)
Adapt the emote dataset for use with a real research-caliber GAN implementation (1–2 days)
Train the GAN on the emote dataset (1 day)
1. Scrape all of the emote image assets
This part of the project was actually quite straightforward. Twitch stores all of the emote images on a CDN using a monotonically increasing numeric emote_id. Each emote is available in three different sizes (1.0, 2.0, 3.0) corresponding to the resolutions of 28x28, 56x56, and 112x112.
I wrote a quick Python scraper which went through ~2 million emote_ids in all three sizes and downloaded the images locally. Even at a modest 200 requests per second, this was able to finish overnight. In fact, I think the limiting factor for the speed of this step was writing the image files to my HDD, not the network requests themselves. So far, I was right on track with my time estimate.
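A minimal sketch of how the id and size enumeration could look. This is an illustration, not the scraper used for this project: the URL template is a placeholder (the real CDN path is not given in this post), and the function names are made up for the example.

```python
from itertools import product

# Placeholder template: the actual CDN host and path are NOT from the post.
URL_TEMPLATE = "https://cdn.example.invalid/emoticons/{emote_id}/{size}"
SIZES = ("1.0", "2.0", "3.0")  # 28x28, 56x56, 112x112


def emote_urls(first_id, last_id):
    """Yield (emote_id, size, url) for every id in [first_id, last_id] and every size."""
    for emote_id, size in product(range(first_id, last_id + 1), SIZES):
        yield emote_id, size, URL_TEMPLATE.format(emote_id=emote_id, size=size)


def estimated_hours(num_ids, requests_per_second=200):
    """Rough wall-clock estimate for fetching every size of every id at a fixed rate."""
    return num_ids * len(SIZES) / requests_per_second / 3600
```

At 200 requests per second, `estimated_hours(2_000_000)` comes out to a little over 8 hours, which lines up with the scrape finishing overnight.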
2. Write a progressive growing GAN implementation using Keras as a proof of concept
As I had no experience with CNNs or deep learning before this project, I wanted to at least attempt to write a progressive growing GAN myself as a way to learn before switching to an existing implementation. As Twitch emote sizes are not powers of two, it was not possible to exactly replicate the 4x4 pixel to 128x128 pixel growing architecture used by Karras et al. Instead, I chose to start off with a 7x7 base convolution layer, and then repeatedly double the image size up to 112x112 pixels using Upsample + 3x3 Conv + 3x3 Conv blocks. The higher resolution layers were slowly faded in using a weight factor as the GAN increased its output resolution. The mirror of this architecture was used for the discriminator. This GAN used a random 200k sample of the 2 million emotes as the training data.
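To make the fade-in idea concrete, here is a small NumPy sketch of the blending step. It is not the Keras code from this project: it only shows how a weight factor (called alpha here, my naming) can mix the upsampled output of the previous resolution with the output of the newly added block, plus the resolution schedule implied by starting at 7x7.

```python
import numpy as np


def upsample_2x(img):
    """Nearest-neighbour 2x upsampling of an (H, W, C) image."""
    return img.repeat(2, axis=0).repeat(2, axis=1)


def faded_output(low_res_img, high_res_img, alpha):
    """Blend the old (upsampled) output with the new block's output.

    alpha ramps from 0 (only the old resolution) to 1 (only the new block).
    """
    alpha = float(np.clip(alpha, 0.0, 1.0))
    return (1.0 - alpha) * upsample_2x(low_res_img) + alpha * high_res_img


def resolutions(base=7, final=112):
    """Doubling schedule from the base resolution up to the final one."""
    sizes = [base]
    while sizes[-1] < final:
        sizes.append(sizes[-1] * 2)
    return sizes
```

With the defaults, `resolutions()` gives the 7, 14, 28, 56, 112 schedule described above. How quickly alpha is ramped is the knob that matters for the collapse discussed later in this section.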
The low resolution images produced by this GAN showed some promise. They had a rough visually consistent color scheme within each generated emote and displayed early signs of creating faces. You can see this in the bottom-right corner generated emote in the figure below. When the generated emotes were compared side-by-side with down-scaled real emotes, it was not trivial to determine if an emote was generated or came from the training data.
With promising results on the lowest resolution, I began to allow my GAN to grow to larger sized emotes. At this resolution, the generated emotes were definitely more discernible from the training data. However, they were also developing more complex shapes and the beginnings of real faces. I attempted to continue to grow the output resolution of the GAN.
However, my architecture was not able to reliably grow the generated image resolution above 14x14 pixels. All attempts to grow the generator to produce 28x28 pixel images led to generator collapse. At this time, I decided to move to an existing GAN implementation to see how the results would compare. Looking back, my original implementation was fading in the weights for the new higher resolution layers much too rapidly. However, later attempts to correct this after the fact by increasing the fade in time were not able to prevent the collapse. Instead, this just prolonged how long it took for the collapse to occur.
Although this step was unsuccessful at being able to produce full resolution emotes, I felt as though it gave me a necessary introduction to CNNs and was good preparation for the rest of the project.
In total, this section took ~3 weeks from end-to-end. A majority of this was spent developing the initial custom network architecture and trying to correct GAN collapse problems.
3. Adapt the emote dataset for use with a real research-caliber GAN implementation
I decided to use the TensorFlow implementation of a progressive growing GAN provided here. There were three main reasons for picking this specific GAN, given the wide variety of GAN architectures available.
The first reason was that this implementation had a proven history of being able to generate very high resolution images on diverse datasets, far above the resolution of what I would need for my emote dataset. Given the trouble I had consistently growing the output resolution of my own GAN, I wanted to see if this more complex implementation would be able to resolve these issues.
Secondly, this GAN’s ability to progressively grow the generator’s image output resolution would dramatically reduce the training time, as the first few low resolution stages would complete very quickly. For reference, it took an equal amount of time to grow the GAN from an output resolution of 4x4 to 64x64 as it took to grow from 64x64 to 128x128. This meant that I could very quickly inspect low resolution results before committing to continue training the GAN up to much higher resolutions.
Finally, this GAN had significant tooling already in place for dataset generation and output visualization that I would be able to leverage. This made it easier to compose new datasets and validate results, and allowed me to create the first version of the training dataset in one night.
As this GAN required square images with power-of-two dimensions, I chose to pad the Twitch emotes from 112x112 to 128x128 with a white background. For the first test, I chose another random 200k sample of the full dataset and started the training.
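The padding step can be sketched in a few lines of NumPy. This is an illustration under one assumption the post does not state: that the emote is centered on the larger white canvas (8 pixels of border on each side).

```python
import numpy as np


def pad_to_square(img, target=128, fill=255):
    """Center an (H, W, C) uint8 image on a target x target canvas filled with `fill`."""
    h, w = img.shape[:2]
    if h > target or w > target:
        raise ValueError("image is larger than the target canvas")
    canvas = np.full((target, target, img.shape[2]), fill, dtype=img.dtype)
    top, left = (target - h) // 2, (target - w) // 2  # centering is an assumption
    canvas[top:top + h, left:left + w] = img
    return canvas
```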
4. Train the GAN on the emote dataset
After 72 hours of training locally on my GTX 1070, the GAN had grown to the full output resolution of 128x128. I opened up the results directory, awaiting my emotes. The actual results were…
… a bit of a mess. These were certainly far better than what my GAN was able to achieve, but most of the generated emotes were composed of disjointed text, blurred blob faces, and a melting mix of colors. Not exactly the photo-realism that I had been promised!
I realized very quickly that the problem wasn’t the GAN itself, but the training set.
After pulling together a sample of the training emotes actually being fed into the neural network, it was clear that this jumble of unrelated images did not provide a clear set of features which could be learned by the GAN. The mix of cartoon-style faces, “realistic” faces, text, anthropomorphic random objects and hearts had so much variety that the GAN was forced to over-generalize to all styles, leaving it unable to produce high quality images within a specific style. A GAN is not magic; it still requires a clear visual pattern which can be learned in order to produce high quality images.
One interesting idea for future GAN research is developing an algorithm to train Progressive GANs that learns to automatically cluster implicit styles in the training data. This would allow GANs to be trained on mixed-style datasets directly, instead of requiring additional filtering steps to separate these datasets into their component styles before training.
I believe this issue was further exacerbated by the GAN including a minibatch standard deviation layer at the end of the discriminator. This layer was meant to increase the variation in the images generated by the GAN by allowing the discriminator to easily discriminate against generated minibatches of images which have low variation. This was included in the original implementation to prevent the GAN from converging to a single local optimum face when trained on sets of human faces (CelebA-HQ) which were all visually similar. Given the huge variation of the emote training dataset, this layer effectively forced the generator to over-generalize, sacrificing image quality and visual consistency for image diversity.
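For readers who have not seen this layer, here is a simplified NumPy sketch of the idea. It treats the whole minibatch as a single group and is not the TensorFlow code from the GAN I used, which differs in details such as how the batch is grouped.

```python
import numpy as np


def minibatch_stddev(features):
    """Append one constant feature map holding the batch-wide average std.

    features: array of shape (N, H, W, C). Returns shape (N, H, W, C + 1).
    """
    std = features.std(axis=0)   # (H, W, C): std of each feature over the batch
    mean_std = std.mean()        # single scalar summarising batch variation
    n, h, w, _ = features.shape
    extra = np.full((n, h, w, 1), mean_std, dtype=features.dtype)
    return np.concatenate([features, extra], axis=-1)
```

A batch of near-identical generated images produces a value close to zero in the extra channel, which gives the discriminator an easy signal that the batch is fake.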
With this in mind, I decided to create a more consistent dataset for a subset of the emotes available.
5.1 Moving towards a more consistent dataset
On Twitch, there are two main pieces of metadata about an emote. The first, emote_id, is the actual unique id that is used to reference an emote and was used for the data scraping in step 1. This id is never exposed to an actual Twitch user, as it is only used on the backend. Instead, there is a second identifier, emote_code, which is what users type in chat to get an emote to show up. This emote_code is actually composed of two parts, the emote_prefix and the_part_that_isnt_the_prefix (emote_suffix? emote_remainder?). All emote_codes for a given Twitch broadcaster will share the same emote_prefix. However, the rest of the emote_code is customizable by the broadcaster when they upload a new emote.
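To make the prefix/suffix structure concrete, here is a small Python sketch that splits an emote_code and tallies how often each suffix appears. The splitting rule (a lowercase/digit prefix followed by a suffix starting with an uppercase letter) is my assumption for the example, and the emote codes in the usage line are made up.

```python
import re
from collections import Counter

# Assumed convention: lowercase/digit prefix, then a suffix starting with an uppercase letter.
CODE_PATTERN = re.compile(r"^([a-z0-9]+)([A-Z].*)$")


def split_emote_code(emote_code):
    """Return (emote_prefix, emote_suffix), or None if the code has no prefix."""
    match = CODE_PATTERN.match(emote_code)
    if match is None:
        return None
    return match.group(1), match.group(2)


def suffix_counts(emote_codes):
    """Count how many emote_codes share each emote_suffix."""
    counts = Counter()
    for code in emote_codes:
        parts = split_emote_code(code)
        if parts is not None:
            counts[parts[1]] += 1
    return counts
```

For example, `suffix_counts(["abcHype", "xyzHype", "abcLove"]).most_common()` ranks the suffixes by how many broadcasters use them, which is the same information a SUBSTRING plus GROUP BY query would give.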
What you will quickly realize using Twitch is that many different broadcasters have emotes with the same emote_suffix. Although the exact emotes can be very different, the general visual styles appeared to be roughly consistent. My first thought was using these shared emote_suffixes to build a better dataset upon which to train the GAN. The only concern was that there would not be enough emotes with the exact same emote_suffix to allow the GAN to learn a generalized emote style instead of learning to reproduce the exact training images from the limited dataset as closely as possible. After a few SUBSTRINGs and GROUP BYs, I had the following table:
The top few emote_suffixes would definitely have enough data, so I chose “Hype” as the first emote_suffix to try and rebuilt the dataset with just those emotes. A sample of that training dataset follows below:
Although there was still a significant amount of variation between different emotes, this was looking a lot better and I felt confident to try it as a proof of concept. I fed these into the GAN and came back after a day to look at the partially trained 64x64 resolution network.
These were developing faces and the “hype” text was clearly recognizable, so this was definitely better than the first training attempt. However, many of the generated emotes had the “hype” text mirrored, reading right-to-left instead of left-to-right. This didn’t occur in the training dataset at all, and it was very clearly the whole word reversed, not just a single letter. This told me that there was something wrong within the GAN itself. After digging through the code, I found a config parameter called train_mirror_augment which should have defaulted to false but was accidentally overridden to true when I copied configuration data from another dataset’s configuration. This brings me to the first major lesson learned from this project:
This was the first of multiple times I had to spend a bunch of time on issues that were caused by improper configuration. Many of these caused performance problems that were not immediately noticed, wasting both my debugging time and potential GAN training time. Although it may seem excessive, breaking out the debugger and stepping right up to the first training epoch to inspect the configuration data for the first few training cycles will help catch a lot of small problems that would otherwise stick around for a while. In the words of Andrej Karpathy:
Neural net training fails silently
When you break or misconfigure code you will often get some kind of an exception. You plugged in an integer where something expected a string. The function only expected 3 arguments. This import failed. That key does not exist. The number of elements in the two lists isn’t equal. In addition, it’s often possible to create unit tests for a certain functionality.
This is just a start when it comes to training neural nets. Everything could be correct syntactically, but the whole thing isn’t arranged properly, and it’s really hard to tell. The “possible error surface” is large, logical (as opposed to syntactic), and very tricky to unit test … Therefore, your misconfigured neural net will throw exceptions only if you’re lucky; Most of the time it will train but silently work a bit worse.
With the mirror augment fixed, I fired up another training iteration.
The text direction was fixed and it was actually pretty readable, but the remainder of the image still had that amorphous blob look. In general, the training emotes fell into three different categories:
1. “Hype” on the top of the image with something below it
2. “Hype” or “Sub Hype” all alone on a white background
3. “Hype” on the bottom of the image with something above it
As a last-ditch effort, I manually curated a dataset of only emotes from category 3, then retrained the GAN on this. This greatly stabilized the general structure of the generated emotes, even if the faces were still blurry messes.
5.2 Just the faces, please
Seeing how a more consistent training dataset had been able to stabilize the generated hype emotes, my next goal was removing all of the non-faces from the original dataset. I pulled down the face_recognition Python library, which uses a pre-trained CNN to detect the position of faces in an image. After running this for a few hours, I was left with a filtered dataset of ~400k images each containing exactly one face. This collection of images will subsequently be referred to as the “face dataset” and formed the base for the remainder of this project. The strength of this library is that it not only provides a bounding box for the face, but also the locations of 68 specific landmark points. These can be used for further filtering on the details of the image.
For most emotes, the locations of these landmark points were highly accurate.
I chose specific locations for the left eye, right eye, and bottom of the chin (points 36, 45, and 8 in the diagram above), then filtered the face dataset to only emotes which had similar locations for their corresponding facial landmarks. My intuition was that these emotes would all share similar poses, even if the exact content of the emote was different. I kicked off the GAN training, and in three days came back to a bunch of people looking left.
People may have been a bit of an exaggeration, but these were looking better than anything I had been able to produce before. The removal of emotes with no faces as well as the consistent pose had really improved the quality. In fact, some of them were looking too good.
Somehow the GAN had been able to create nearly pixel-perfect replications of this emote. I remembered seeing the yellow-haired emote in the training dataset, but there was no clear reason at first why this one looked so much better. However, the fact that I could even remember the emote from the dataset was odd, since I remembered almost none of the other thousands of emotes.
Digging in more, the problem revealed itself: duplication. Since Twitch has no validation to ensure that emotes are distinct, there were multiple copies of many emotes. This duplication would most frequently occur if someone uploaded a new emote multiple times, slightly adjusting the cropping or color when testing what it would look like. This particular yellow-haired emote was massively duplicated, being fed through the GAN during training far more often than the typical emote. This over-weighting was strong enough to converge the GAN to a local optimum, producing a near-perfect replication of this specific emote.
Luckily, face_recognition provides a way to detect similar faces. Along with the position of face_landmarks, a face_encoding is output which should be nearly identical for all faces of the same person. You can read more about the ResNet CNN which is behind this encoding here. Since duplicate faces are likely to have similar emote_ids (due to being uploaded at nearly the same time), I ran a window function over the images sorted by emote_id. This compared the face_encoding values for all faces within a 1000 emote window to detect those which were near identical. The window size of 1000 was chosen empirically based on the max difference in observed emote_ids for duplicated emotes.
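The windowed comparison can be sketched as follows. This is a plain NumPy illustration rather than the code I ran: it assumes the encodings have already been computed, compares them by Euclidean distance, and uses a distance threshold of 0.3 that is purely a placeholder.

```python
import numpy as np


def find_duplicates(emote_ids, encodings, window=1000, threshold=0.3):
    """Return the emote_ids that near-duplicate an earlier emote within `window` ids.

    encodings: one vector per emote; `threshold` is a placeholder value.
    """
    order = np.argsort(emote_ids)
    ids = np.asarray(emote_ids)[order]
    enc = np.asarray(encodings, dtype=float)[order]
    duplicates = set()
    start = 0
    for i in range(len(ids)):
        # Slide the window start forward so that ids[i] - ids[start] <= window.
        while ids[i] - ids[start] > window:
            start += 1
        for j in range(start, i):
            if int(ids[j]) in duplicates:
                continue  # only compare against emotes that are being kept
            if np.linalg.norm(enc[i] - enc[j]) < threshold:
                duplicates.add(int(ids[i]))
                break
    return duplicates
```

The first copy of each emote is kept and later copies are flagged. Copies whose emote_ids are more than `window` apart are never compared, which is the trade-off that keeps this cheap on a few hundred thousand images.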
Out of the ~400k face emotes, this identified ~20k which were duplicated. After confirming these were actually duplicates via brief visual inspection, these emotes were purged from the dataset.
5.3 Filters, Filters, Filters
After seeing how filtering the dataset based on the position of specific face_landmarks was able to stabilize the training, I removed the original face_landmarks filter and created four new rules:
Emotes should include between 20% and 50% white space. Emotes with more than 50% white space were usually text miscategorized as faces or were just the outlines of faces. Emotes with less than 20% whitespace were too zoomed in or had large colored backgrounds.
The chin position of the face (point 8) should be roughly centered horizontally and near the bottom of the emote. Although this would filter out a few good training images which had slightly strange face positions, many of the emotes which failed this rule had small faces with vertical text next to them.
The size of the bounding box for each face should be between 50x50 and 90x90 pixels. Faces smaller than 50x50 were usually not the main focus of the emote. Faces larger than 90x90 were too zoomed in to be usable.
None of the face_landmark points should fall outside of the bounds of the emote (112x112 pixels). This was also adapted to combat very zoomed in faces or emotes which had highly cropped faces.
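The four rules above can be sketched as a single predicate. This is an illustration, not the filter code from this project: the white level, the chin tolerances, and reading the chin rule as "horizontally centered, in the bottom quarter" are all my assumptions. The bounding box uses a (top, right, bottom, left) order and the landmarks are a 68 x 2 array of (x, y) points.

```python
import numpy as np

SIZE = 112  # emote width and height in pixels


def whitespace_fraction(img, white_level=250):
    """Fraction of pixels whose RGB channels are all >= white_level (assumed cutoff)."""
    return float(np.mean(np.all(img[..., :3] >= white_level, axis=-1)))


def passes_rules(img, landmarks, box):
    """img: (112, 112, 3) uint8; landmarks: (68, 2) of (x, y); box: (top, right, bottom, left)."""
    pts = np.asarray(landmarks, dtype=float)
    top, right, bottom, left = box
    width, height = right - left, bottom - top
    chin_x, chin_y = pts[8]
    rules = (
        # Rule 1: between 20% and 50% white space.
        0.20 <= whitespace_fraction(img) <= 0.50,
        # Rule 2: chin centered and low in the image (tolerances are assumptions).
        abs(chin_x - SIZE / 2) <= 15 and chin_y >= 0.75 * SIZE,
        # Rule 3: face bounding box between 50x50 and 90x90 pixels.
        50 <= width <= 90 and 50 <= height <= 90,
        # Rule 4: every landmark lies inside the emote.
        bool(np.all((pts >= 0) & (pts < SIZE))),
    )
    return all(rules)
```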
The intersection of these rules formed a new dataset. These were composed into a new training set and I soon began to see much more promising results.
I was particularly happy with these six, as they were either highly realistic or showed the ability of the GAN to generalize non-face features (headphones and sunglasses).
This brings me to my second major lesson learned from this project:
Training dataset inspection is incredibly helpful at identifying sources of problems and can drastically improve the results generated by a GAN
When you do this, it is very easy to determine if you (as a human) can see a consistent pattern across most of the training data. If there are training images which you think are not good representative examples or are unlikely to contribute to the GAN’s generalization learning, build an algorithm to filter them out.
Of course, this approach works best when you have significantly more data than you actually need for training and can be picky about which subset of the dataset to train on. Since the de-duplicated face dataset had ~390k images to choose from and only ~30k or fewer were needed for training, it was worthwhile to trade recall for precision and create loose rules which would potentially filter out “good” emotes, so long as the majority of emotes removed by a rule were “bad”. In doing this, the relative proportion of “good” emotes in the training dataset would increase with each additional rule. Adding new rules followed this general pattern:
1. Look through the training data and find a few instances of emotes which are not high quality faces or have other noise that would likely degrade the quality of the generated GAN emotes
2. Create a rule which would filter these out
3. Run the rule against the entire remaining dataset and keep track of all emotes which would be filtered out by the new rule
4. Quickly visually inspect the filtered out emotes to make sure a majority of them were “bad”
5. Run the rule against the entire remaining dataset and keep only the emotes which would not be filtered out by the new rule
6. Repeat steps 1-5 until the quality and consistency of the training data has improved enough that you feel confident rerunning the GAN training
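The loop above boils down to splitting the dataset by a candidate rule and checking what the rule removes before committing to it. A minimal sketch, with made-up function names and a stand-in for the visual inspection step:

```python
def split_by_rule(dataset, is_bad):
    """Split the dataset into (kept, removed) according to a candidate rule."""
    kept, removed = [], []
    for item in dataset:
        (removed if is_bad(item) else kept).append(item)
    return kept, removed


def removed_precision(removed, looks_bad):
    """Share of removed items that a reviewer agrees are bad.

    `looks_bad` stands in for the manual visual inspection step.
    """
    if not removed:
        return 0.0
    return sum(1 for item in removed if looks_bad(item)) / len(removed)
```

A rule is worth keeping when `removed_precision` is comfortably above one half, even if it also throws away some good emotes, because there is far more data available than is needed for training.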
As a next step, two new rules were added. These were aimed at stabilizing the color scheme of the dataset by requiring that 90% of the pixels in every image be made of white, black, or skin-tone colors. In addition, no large portion of any emote could be gray. These rules were remarkably effective at improving the stability and realism of the generated emotes:
One interesting realization from this is that the generated emotes with their entire faces blurred are actually intentionally produced by the GAN, as these blurry emotes occur with some frequency in the training dataset. The reason for this is that not all Twitch broadcasters provide full 112x112 resolution images for their emotes. Twitch requires uploading all three image resolutions when creating an emote, but some people choose to just upscale low resolution images instead of providing native 112x112 resolution assets. This degraded image quality is then replicated by the GAN.
In addition, some generated emotes still had strange block-like artifacts. I believed that these artifacts were caused by the presence of text within emotes in the training dataset. Although emotes which were only text had almost entirely been filtered out by the addition of the prior rules, there were still some emotes that had small portions of text within the emote, usually above or below the face.
To remove these text emotes, I ran this implementation of EAST over the training dataset, then removed any emotes which were detected to contain text. EAST is an Efficient and Accurate Scene Text detector which uses a combination of a CNN and an NMS merging stage to detect text at any orientation. This filtered out ~1k face emotes which also contained some text. The GAN was then retrained, but the results did not change much after the removal of the text. The majority of the artifacts were gone, but it appeared that question marks alone were not detected by EAST, as one of the generated emotes showed this feature.
5.4 Moving training to the cloud with AWS Deep Learning AMIs
Prior to this point, I had been training on my home computer using a GTX 1070. This took ~1 day to grow the resolution of the GAN output to 64x64, and a further 2 days to grow to 128x128. This meant that I was unable to rapidly iterate on new training datasets or try multiple examples at once.
More importantly, I had to turn my graphics settings to low when playing video games. The graphics card was unable to handle 70% CUDA utilization and high resolution gaming at the same time, as it would frequently run into GPU memory allocation issues which required me to restart GAN training from the last checkpoint.
Because of this, I moved to training using AWS deep learning AMIs on GPU-accelerated EC2 instances. I chose the p3.2xlarge instance type, which includes one NVIDIA® V100 GPU. This reduced my training time from the aforementioned 3 days to almost exactly 24 hours.
However, this also increased my training cost from $0/hour to $3.06/hour. Although this is inexpensive compared to buying a V100 for $6718, at ~$75 for each end-to-end training run I certainly began to put more focus on pre-training dataset validation and testing. After moving to AWS, my new training workflow became:
Create dataset locally, push to S3
Start the EC2 deep learning instance and ssh into it
Pull down the dataset from S3 onto the instance, git clone the GAN repo, start training using screen
Come back in 24 hours, copy the trained NN checkpoint files to my local machine using
Kill the EC2 instance
A significant amount of this manual work could have been automated, but I never expected to run so many different training iterations that the cost of automating this would have paid off.
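For reference, the cost figures quoted above work out as follows. The numbers are the ones from this section; the break-even count is just the quoted V100 price divided by the cost of one run, and ignores electricity, storage, and data transfer.

```python
HOURLY_RATE = 3.06   # USD per hour for the p3.2xlarge, as quoted above
RUN_HOURS = 24       # one end-to-end training run
V100_PRICE = 6718    # USD, as quoted above

cost_per_run = HOURLY_RATE * RUN_HOURS        # about 73 USD, i.e. the "~$75" above
break_even_runs = V100_PRICE / cost_per_run   # runs before buying the card is cheaper

print(f"cost per run: ${cost_per_run:.2f}")
print(f"break-even: {break_even_runs:.1f} runs")
```

At roughly 90 full training runs before the card would pay for itself, renting was the easy choice for a side project.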
Outside of the improved training performance, the strong advantage I saw from using EC2 with deep learning AMIs is that all of the configuration work was already complete. The only thing left to do when connecting to any of the hosts was run source activate tensorflow_p36, then python train.py. This was a stark contrast to the many early hours I spent on this project properly installing the CUDA toolkit, getting the correct drivers, and managing Python packages and their own dependencies on specific TensorFlow and CUDA versions. I considered using a TensorFlow Docker image to simplify this configuration work, but at that point I already had everything working locally and the thought of making any new changes which could potentially tip over the house of cards that was my local development stack seemed unattractive.
5.5 Using the discriminator to improve data quality
At this point, I started struggling with new programmatic ways to improve the dataset quality. New rule-based approaches had low precision, filtering out too many good emotes along with the bad. However, there was still a non-trivial proportion of low quality emotes, far too many to remove by hand. The goal was to build a more complex system to be able to detect these and remove them from the training dataset. While coming up with ideas, I realized that I already had something that could work for this task.
Although the GAN trained two neural networks, a discriminator and generator, only the generator was actually used. The discriminator was discarded after training, as it was no longer needed to synthesize emotes.
While training a GAN, the purpose of the discriminator is to detect which emotes come from the training set and which are generated by the generator. As such, the discriminator becomes very adept at detecting “strange” images which do not match the patterns found in the majority of the training data. This is used during training to update the generator’s weights to produce better quality images.
I took the training dataset and fed every emote through the trained discriminator. The output of this was a score for each emote ranging from -100 to 100. This score indicated the discriminator’s belief that the emote came from the training dataset (100) or was generated (-100). The scores for the training emotes were roughly normally distributed, having a large center cluster and tail of outliers with both very positive and very negative scores. Interestingly, the mean score for the training emotes was only weakly positive.
I then visually inspected the emotes which were scored at the extreme ends of this range. Training emotes which scored very low (below 0) were generally overly simplified, heavily rotated, did not have a very clear face in the image, or had text in the image that EAST had missed. These were removed from the dataset.
On the opposite side, the very high-scored emotes were also useful to remove. Although these were generally high quality faces, they were plagued by another problem — duplication.
Each of these emotes was duplicated dozens of times in the dataset. As discussed earlier, this duplication caused training to drastically over-weight the features from these emotes. This could be observed in the generated emotes from the last training run.
The reason that these were not able to be filtered out during the previous image-deduplication was that they were uploaded many days apart. As such, their drastically different emote_ids would fall outside of the 1000 emote width window function when the duplication detection code was run. These were also removed from the dataset and training was re-run.
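The score-based clean-up can be sketched as a simple three-way split. This assumes the discriminator scores have already been computed for every training emote; the low cutoff of 0 matches what is described above, while the high cutoff of 50 is a placeholder, since both tails were inspected by eye before anything was removed.

```python
def split_by_score(emote_ids, scores, low=0.0, high=50.0):
    """Split emotes into (keep, low_outliers, high_outliers) by discriminator score.

    `high` is a placeholder cutoff; both outlier lists are meant to be
    inspected visually before the emotes are actually dropped.
    """
    keep, low_outliers, high_outliers = [], [], []
    for emote_id, score in zip(emote_ids, scores):
        if score < low:
            low_outliers.append(emote_id)
        elif score > high:
            high_outliers.append(emote_id)
        else:
            keep.append(emote_id)
    return keep, low_outliers, high_outliers
```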
I wonder if this approach of leveraging the trained discriminator to provide useful work outside of the training backpropagation step could be applied more broadly for other aspects of GAN training. One potential use of the trained discriminator is automatically adjusting the composition of the data during training in order to speed GAN convergence. Being able to identify which specific training examples are providing most of the “learning” would allow you to disproportionately feed these through the GAN, potentially improving the training rate.
At this point, I felt as though the improvements in generated emote quality had slowed enough that small iterative changes were unlikely to lead to further realism. As such, I chose to move on and create new GAN training datasets using the lessons learned from trying to synthesize realistic faces.
5.6 Cartoon Emotes and Wide Emotes
Up next was attempting to generate cartoon-style emotes. The approach for these was generally similar to the one described for realistic face emotes, with one important exception: emotes were only included in the training dataset if they (1) had a low number of distinct colors and (2) had a large fraction of the total image space composed of only a few distinct colors. These two rules were remarkably effective at isolating cartoon-style emotes. The results of this are shown below:
Furthermore, the GAN for these emotes was able to linearly interpolate across different input noise vectors with relative smoothness.
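The two cartoon filtering rules described earlier in this section can be sketched with NumPy. The thresholds here (at most 64 distinct colors, at least 60% of pixels covered by the 4 most common colors) are placeholders chosen for the example, not the values used for this dataset.

```python
import numpy as np


def color_stats(img, top_k=4):
    """Return (number of distinct RGB colors, fraction of pixels in the top_k colors)."""
    pixels = img[..., :3].reshape(-1, 3)
    _, counts = np.unique(pixels, axis=0, return_counts=True)
    top = np.sort(counts)[::-1][:top_k]
    return len(counts), float(top.sum() / pixels.shape[0])


def looks_cartoon(img, max_colors=64, top_k=4, min_top_fraction=0.6):
    """Rule 1: few distinct colors. Rule 2: a few colors cover most of the image."""
    n_colors, top_fraction = color_stats(img, top_k)
    return n_colors <= max_colors and top_fraction >= min_top_fraction
```

Anti-aliased edges add many near-identical colors, so in practice the colors would probably need to be quantized first or the limits loosened.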
Up next was wide emotes. The only main change to the filtering rules for this was removing the prior face size and whitespace restrictions and prioritizing faces whose bounding box filled almost all of the image.
These emotes actually had the smoothest linear interpolations, as the placement of facial features was very consistent across all emotes.
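Linear interpolation between two input noise vectors is simple to write down. A short NumPy sketch (the function name and step count are made up for the example); each row of the result would be fed through the trained generator to produce one frame of the transition:

```python
import numpy as np


def interpolate_latents(z_start, z_end, steps=8):
    """Return `steps` noise vectors evenly spaced on the line from z_start to z_end."""
    t = np.linspace(0.0, 1.0, steps)[:, None]  # column of blend weights, 0 to 1
    return (1.0 - t) * np.asarray(z_start) + t * np.asarray(z_end)
```

When the generated frames change gradually from one row to the next, it suggests the generator has learned a continuous space of emotes rather than memorizing individual training images.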
6. Next steps and final reflections
In total, what I originally estimated as a short 2 week project ended up taking about 2 months by the time the last GAN was finished training. This brings me to the last major lesson learned from this project:
Many small improvements can combine to create a great result; focus on making one good change at a time
Given the initial lack of high quality results with both my custom GAN implementation and the progressive growing GAN, it was tempting to shelve the project. I knew that there was a lot of work that needed to be done, and I didn’t have clear steps forward on what would be the best approach. Instead of trying complicated ideas to fix all the problems at once, focusing on making single iterative improvements (1) unblocked progress and (2) stopped any analysis paralysis.
Overall I am quite happy with how this project has turned out and the diversity of emotes that were able to be synthesized. I don’t think I’ll be able to compete with human emote artists anytime soon, but maybe with some dedicated research dollars we can change this. In the meantime, I’ll say bye in the most succinct way I can.