Product instrumentation best practices
This post covers a dozen best practices we’ve developed at Twitch on the design and engineering of product instrumentation via events. Better instrumentation leads to better analytics and better decisions for the whole company. While there are resources covering this topic, they tend to be scarce and introductory. Our data staff has accrued a lot of experience over the years, so we thought it’d be worth sharing our own design patterns and best practices.
General best practices
Send events from the backend. In most modern apps, front-end clients facing the end user, like web or mobile apps, send API requests to backend servers. Sending events from the backend is more reliable because the backend runs trusted code in a trusted environment. The frontend, on the other hand, can be tampered with, simulated by bots, or lose its connection. Sending events from the backend also saves time: events need to be implemented only once for all clients hitting that API. Sometimes, however, sending events from the frontend is unavoidable. The table below lists some examples of when it’s preferable to use the frontend or the backend.
| Backend | Front-end |
| --- | --- |
| Recommendations served | Modals and screens displayed |
| Microservice response time | Experienced latency |
| Text of comments | Clicks and hovering |
On the front-end, forward backend values verbatim. If firing from the front-end, avoid translating or converting values passed by the backend. This drastically reduces the amount of coordination required between client teams. For example, user IDs and Twitch channel names are great to use verbatim in all events, and don’t need any translation table or conversion scheme like lowercasing or removing special characters. When a creator changes their display name, all clients will seamlessly pass the new name.
Do not reinvent the wheel. Look at the existing data documentation, and ask fellow data analysts and engineers if an existing event fits your tracking needs. For example, if an event already exists for page loads, see if you can use it as-is, or at most add properties to it, but avoid adding a new one. This also highlights the importance of data governance and having a holistic data dictionary.
Send standard fields in all frontend client events. On our web platform for example, every front-end event passes the current page location and the user ID. This makes it easy to split by location on the site, or to join with a user dimension table and filter by country.
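As a minimal sketch of this idea, a client could route every event through one helper that attaches the standard fields. The helper `with_standard_fields` and the field names `page_location` and `user_id` are assumptions made for illustration, not our actual schema, and Python is used here only for brevity.

```python
from typing import Any

# Hypothetical standard fields; the names are assumptions for illustration.
STANDARD_FIELDS = ("page_location", "user_id")


def with_standard_fields(event_name: str,
                         properties: dict[str, Any],
                         context: dict[str, Any]) -> dict[str, Any]:
    """Build an event payload that always carries the standard fields."""
    missing = [f for f in STANDARD_FIELDS if f not in context]
    if missing:
        raise ValueError(f"missing standard fields: {missing}")
    payload = {"event": event_name, **properties}
    # Standard fields are copied last so event-specific properties
    # cannot shadow them.
    payload.update({f: context[f] for f in STANDARD_FIELDS})
    return payload


event = with_standard_fields(
    "chat_send",
    {"message_length": 12},
    {"page_location": "/directory", "user_id": "1234"},
)
```

Failing loudly when a standard field is missing is one possible design choice; it surfaces instrumentation gaps during development rather than in analysis.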
Future-proof and look outside your silo. If you foresee potential use cases for your events in the near future, or other products being able to leverage your events, design with those in mind. Renaming and retrofitting events and fields is painful and time-consuming. For example, if we launch a feature allowing viewers to search for any channel, an event like `search_for_channel` may later need to cover searches for games too. The event could simply be called `search`, with a field `search_content_type` taking values “channel” or “game” or even “any”.
Descriptive and unambiguous names. Descriptive and concise event names are really worth spending time thinking about. For example, at Twitch, `content` is a vague and ambiguous field name. It could relate to the game being watched, the video bitrate, or the email subject of a marketing campaign. This again highlights the importance of data governance.
Use snake_case, not CamelCase, and avoid dashes. In most SQL dialects, unquoted identifiers are case-insensitive, so capitalization is lost, and dashes in table or column names must be escaped with double quotes.
Prefix event names with the product domain. Twitch frontends and backends fire hundreds of unique events for dozens of teams. Using the same prefix for events concerning the same product makes it easy to find related events when a data catalog is sorted alphabetically.
| Bad event name | Better event name |
| --- | --- |
| video-play | playback_start |
| message | chat_send |
| tryCreatingClip | clip_create_attempt |
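The naming rules above lend themselves to an automated check, for example in code review or CI. The sketch below is one possible way to do it; the prefix list `KNOWN_PREFIXES` and the function `is_valid_event_name` are hypothetical.

```python
import re

# Lowercase snake_case: words of [a-z0-9] separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Hypothetical list of product-domain prefixes.
KNOWN_PREFIXES = ("playback", "chat", "clip")


def is_valid_event_name(name: str) -> bool:
    """Return True if the name is snake_case and starts with a known prefix."""
    if not SNAKE_CASE.match(name):
        return False
    return name.split("_", 1)[0] in KNOWN_PREFIXES


candidates = ["video-play", "message", "tryCreatingClip",
              "playback_start", "chat_send", "clip_create_attempt"]
results = {name: is_valid_event_name(name) for name in candidates}
```

With these assumptions, the three names in the left column of the table are rejected and the three in the right column are accepted.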
Click Through Rate
Click-through rates (CTRs) are probably the most common type of metric. At Twitch for example, we compute CTRs for carousel recommendations, signups, and notifications. Although these CTRs rely on different events and cover different product areas, their formulas all consist of a numerator and a denominator.
Best practices
- Compute the denominator from a `display` event, and the numerator from a `click` event.
- Or use the same events as above, with a shared UUID to join them together, in case the denominator fires multiple times. This happens for example when scrolling down a feed, then back up, which makes a given post appear, disappear, and appear again.
- Or compute the numerator and denominator from the same event, with a field `action_type` taking values “display” or “click”.
- Or compute the denominator from a `display` event, and extract the numerator from URL parameters (see Netflix example below).
At Twitch, such events often include fields like `carousel_rank`, `position_in_carousel`, and `recommender_id`. LinkedIn also passes the page location.
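To illustrate the shared-UUID approach, here is a minimal sketch that deduplicates repeated displays before computing a CTR. The field name `impression_id` stands in for the shared UUID and is an assumption, as is the in-memory list representation of events.

```python
def click_through_rate(displays: list[dict], clicks: list[dict]) -> float:
    """CTR = distinct clicked impressions / distinct displayed impressions."""
    displayed = {e["impression_id"] for e in displays}
    # Only count clicks that can be joined back to a display.
    clicked = {e["impression_id"] for e in clicks} & displayed
    return len(clicked) / len(displayed) if displayed else 0.0


displays = [
    {"impression_id": "a", "carousel_rank": 1},
    {"impression_id": "a", "carousel_rank": 1},  # scrolled back into view
    {"impression_id": "b", "carousel_rank": 2},
]
clicks = [{"impression_id": "a"}]

ctr = click_through_rate(displays, clicks)  # 1 clicked / 2 displayed
```

Without the UUID, the repeated display of item "a" would inflate the denominator to 3 and understate the CTR.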
Example of URL parameters: Netflix on web.
After clicking on the 3rd item of the 5th carousel, the URL to the movie page is this:
https://www.netflix.com/watch/80223779?trackId=12345678&tctx=5,3,d681a17d-c5bc-4830-84d6-f0c1e78a6d1e-166054377
Parameter `tctx` contains the carousel number (5), the position in the carousel (3), and a UUID of the previous page load. `trackId` might be the user ID.
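Extracting the click context from such a URL can be sketched with Python's standard `urllib.parse` module. The interpretation of the three `tctx` parts follows the guess above, and the output key names are our own.

```python
from urllib.parse import parse_qs, urlparse

url = ("https://www.netflix.com/watch/80223779"
       "?trackId=12345678"
       "&tctx=5,3,d681a17d-c5bc-4830-84d6-f0c1e78a6d1e-166054377")

query = parse_qs(urlparse(url).query)

# Split into at most three parts: carousel, position, and the remaining ID.
carousel, position, page_load_id = query["tctx"][0].split(",", 2)

click_context = {
    "track_id": query["trackId"][0],
    "carousel_number": int(carousel),
    "position_in_carousel": int(position),
    "page_load_id": page_load_id,
}
```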
Funnels, flows, and lifecycles
In a way, funnel tracking is a generalization of CTR tracking. Conceptually, these are a series of steps that need to be tied together. For example, an advertising funnel could rely on events for opportunity, request, impression, and click, all tied with the same UUID. The flowchart below details how these events could fit together.
Best practices
- Each step should have a unique entity ID or a contextual UUID to join all steps together.
- As with CTRs, either each step fires the same event with a string field `funnel_step` storing the step name, or each step fires its own event.
- Do not use an integer to index the steps in the funnel: it makes it impossible to track intermediary steps later on.
- For each step, document its name, purpose, start, and end.
- Explicitly track errors as an end state in the flow.
- Fire all events in the front-end, or all in the backend, but avoid mixing front-end and backend events. Mixed funnels tend to require a lot of debugging and long reconciliation sessions.
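A minimal sketch of a funnel built on a single event with a string step field: it counts how many distinct funnel UUIDs reach each step, with errors tracked as an explicit end state. The field name `funnel_id` and the step list are assumptions for illustration.

```python
from collections import defaultdict

# Named steps; a new intermediary step can be inserted without renumbering.
FUNNEL_STEPS = ["opportunity", "request", "impression", "click", "error"]


def funnel_counts(events: list[dict]) -> dict[str, int]:
    """Count the distinct funnel UUIDs that reached each step."""
    seen = defaultdict(set)
    for e in events:
        seen[e["funnel_step"]].add(e["funnel_id"])
    return {step: len(seen[step]) for step in FUNNEL_STEPS}


events = [
    {"funnel_id": "u1", "funnel_step": "opportunity"},
    {"funnel_id": "u1", "funnel_step": "request"},
    {"funnel_id": "u1", "funnel_step": "impression"},
    {"funnel_id": "u1", "funnel_step": "click"},
    {"funnel_id": "u2", "funnel_step": "opportunity"},
    {"funnel_id": "u2", "funnel_step": "request"},
    {"funnel_id": "u2", "funnel_step": "error"},  # explicit end state
]

counts = funnel_counts(events)
```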
Intent vs completion
This is a special case of CTR tracking, where the front-end does not know which ID the backend will assign after the action has completed. For example, when uploading a video, the video ID is generated by the backend after the video has started uploading.
Best practices
- Have the front-end generate a UUID, and fire a front-end intent event.
- Have the front-end pass that UUID to the backend, and have the backend fire a completion event with the front-end’s UUID.
- Regularly audit that front-end and backend volumes are comparable, to detect bots.
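The handshake above can be sketched as follows. The event names `video_upload_intent` and `video_upload_complete`, the field `request_id`, and the in-memory event lists are all hypothetical; in practice the UUID would travel in the API request.

```python
import uuid


def fire_intent(events: list[dict]) -> str:
    """Front-end: generate a UUID and record the intent event."""
    request_id = str(uuid.uuid4())
    events.append({"event": "video_upload_intent", "request_id": request_id})
    return request_id


def fire_completion(events: list[dict], request_id: str, video_id: str) -> None:
    """Backend: echo the front-end UUID alongside the ID it assigned."""
    events.append({"event": "video_upload_complete",
                   "request_id": request_id, "video_id": video_id})


frontend_events: list[dict] = []
backend_events: list[dict] = []

rid = fire_intent(frontend_events)
fire_completion(backend_events, rid, video_id="v42")
fire_intent(frontend_events)  # a second upload that never completes

intents = {e["request_id"] for e in frontend_events}
completions = {e["request_id"] for e in backend_events}

completion_rate = len(intents & completions) / len(intents)
# Completions with no matching intent may indicate bots or unofficial clients.
orphan_completions = completions - intents
```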
In-app navigation and third-party referrals
Twitch has clients on multiple platforms, like web, mobile, and console. Navigation events and fields tend to vary slightly.
Best practices
- On web, the page load event should track the URL of the previous page, via the HTTP `Referer` header. This enables Sankey diagrams of navigation paths in the app.
- On web, leverage URL parameters, especially for email campaigns and SEO.
- On mobile and console, track the previous app location instead of the previous URL.
- On mobile, there are vendors who specialize in deep linking and third-party referrals.
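Once each page load carries its previous location, the weighted edges of a Sankey diagram reduce to counting (previous, current) pairs. A small sketch, with hypothetical field names `location` and `previous_location`:

```python
from collections import Counter

page_loads = [
    {"location": "/directory", "previous_location": "/"},
    {"location": "/some_channel", "previous_location": "/directory"},
    {"location": "/directory", "previous_location": "/"},
]

# Each (source, target) pair is one Sankey edge; its count is the edge weight.
edges = Counter((e["previous_location"], e["location"]) for e in page_loads)
```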
Object lifecycle
This is about the regular lifecycle of complex objects like video collections or user accounts. Using events to track object lifecycle may seem redundant with production databases, but this redundancy can be useful. Moreover, database snapshots only happen at discrete points in time, whereas events enable reconstructing the database at any point in time.
Best practices
- Fire one backend event for each operation.
- Prefix all events with the object name, for namespacing.
- Fields for the update events depend on the use case: they can carry only the new value(s), or both new and old values, or both new value(s) and delta(s).
- Regularly audit that the volume of update events matches the volume of rows updated between 2 snapshots of a database.
Example:
- `account_create` has fields `user_id` (immutable) and `display_name` (mutable).
- `account_update` has `user_id` (unchanged) and `new_display_name`.
- `account_delete` has `user_id` and `reason`.
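To show how such events allow reconstructing state at any point in time, here is a minimal replay sketch using the three account events from the example. The timestamp field `ts` and the function `state_at` are assumptions for illustration.

```python
def state_at(events: list[dict], as_of: int) -> dict[str, str]:
    """Replay lifecycle events up to `as_of` to rebuild user_id -> display name."""
    accounts: dict[str, str] = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        if e["ts"] > as_of:
            break
        if e["event"] == "account_create":
            accounts[e["user_id"]] = e["display_name"]
        elif e["event"] == "account_update":
            accounts[e["user_id"]] = e["new_display_name"]
        elif e["event"] == "account_delete":
            accounts.pop(e["user_id"], None)
    return accounts


events = [
    {"ts": 1, "event": "account_create", "user_id": "u1", "display_name": "alice"},
    {"ts": 2, "event": "account_update", "user_id": "u1", "new_display_name": "Alice_TV"},
    {"ts": 3, "event": "account_delete", "user_id": "u1", "reason": "user_request"},
]

after_create = state_at(events, as_of=1)
after_update = state_at(events, as_of=2)
after_delete = state_at(events, as_of=3)
```

Comparing the replayed state against a database snapshot taken at the same time is one way to run the audit suggested above.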
Long activity
By “long activity” we mean activity that takes place over minutes, hours, or even days.
Best practices
- Fire an event at the beginning and one at the end, if possible.
- Or use a heartbeat event, fired at regular intervals throughout the activity, to know when users abandon.
- Use a UUID to join all heartbeats from the same session together.
- Optionally, offset the first heartbeat by a random amount, so it’s possible to estimate average durations for activities shorter than the heartbeat interval. However, it’s often better to use a shorter heartbeat in this case.
- Heartbeats generate a high volume of events. Monitor their volume, and aggregate or sessionize them whenever possible.
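A rough sketch of sessionizing heartbeats: group them by session UUID and estimate the duration as the number of heartbeats times the interval. The 60-second interval and the field name `session_id` are assumptions, and the estimate treats every heartbeat as one full interval.

```python
from collections import Counter

HEARTBEAT_INTERVAL_S = 60  # assumed fixed interval


def estimated_durations(heartbeats: list[dict]) -> dict[str, int]:
    """Estimate seconds of activity per session from its heartbeat count."""
    counts = Counter(h["session_id"] for h in heartbeats)
    return {sid: n * HEARTBEAT_INTERVAL_S for sid, n in counts.items()}


heartbeats = [
    {"session_id": "s1"},
    {"session_id": "s1"},
    {"session_id": "s1"},
    {"session_id": "s2"},
]

durations = estimated_durations(heartbeats)
```

Storing one aggregated row per session instead of one row per heartbeat is also a simple way to keep volumes under control.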
Short activity (~seconds)
Best practices
- Fire an event at the beginning and one at the end.
- Join them on a common UUID.
- Do not fire just one event at the end, with a field storing the activity duration: it would completely miss abandons!
- It’s possible to capture short and long activity together by firing heartbeats, e.g. every 10 seconds for the first 180 seconds, then every 60 seconds. This works best with a `seconds_elapsed` field and sessionization via `max(seconds_elapsed)`. However, it can confuse people who are unfamiliar with the data and naively try `count(*)`.
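The variable-interval schedule can be sketched as below, using the 10-second and 60-second intervals from the text. The function names and the `session_id` field are hypothetical. The example shows why the maximum of the elapsed-seconds field, rather than a row count, gives the session duration.

```python
def heartbeat_schedule(total_seconds: int) -> list[int]:
    """Elapsed times at which heartbeats fire: every 10 s up to 180 s, then every 60 s."""
    times = list(range(10, min(total_seconds, 180) + 1, 10))
    times += list(range(240, total_seconds + 1, 60))
    return times


def session_durations(heartbeats: list[dict]) -> dict[str, int]:
    """Sessionize by taking the largest seconds_elapsed seen per session."""
    durations: dict[str, int] = {}
    for h in heartbeats:
        sid = h["session_id"]
        durations[sid] = max(durations.get(sid, 0), h["seconds_elapsed"])
    return durations


heartbeats = (
    [{"session_id": "s1", "seconds_elapsed": t} for t in heartbeat_schedule(300)]
    + [{"session_id": "s2", "seconds_elapsed": t} for t in heartbeat_schedule(25)]
)

durations = session_durations(heartbeats)
s1_rows = sum(1 for h in heartbeats if h["session_id"] == "s1")
```

Session `s1` lasts 300 seconds but produces 20 rows, so multiplying the row count by either interval gives the wrong duration.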
Parent relationship
This is useful when tracking N-to-1 relationships such as a comment tree.
Best practices
- Pass a `parent_id` field in the events tracking child creation or update.
- Or fire an update event for the parent, with a JSON array or comma-separated list of the new or latest children.
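With a parent ID on every child event, the tree can be rebuilt downstream. A minimal sketch for a comment tree, where the field `comment_id` and both function names are assumptions:

```python
from collections import defaultdict
from typing import Optional


def children_by_parent(events: list[dict]) -> dict[Optional[str], list[str]]:
    """Group child IDs under their parent ID (None marks a top-level comment)."""
    tree = defaultdict(list)
    for e in events:
        tree[e["parent_id"]].append(e["comment_id"])
    return dict(tree)


def thread_size(tree: dict, root: str) -> int:
    """Count a comment plus all of its descendants."""
    return 1 + sum(thread_size(tree, child) for child in tree.get(root, []))


events = [
    {"comment_id": "c1", "parent_id": None},
    {"comment_id": "c2", "parent_id": "c1"},
    {"comment_id": "c3", "parent_id": "c1"},
    {"comment_id": "c4", "parent_id": "c2"},
]

tree = children_by_parent(events)
size = thread_size(tree, "c1")
```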
Collections
Collections can consist of sets, ordered lists, hash maps, and so on. Production databases often track creation, deletion, and other metadata about a collection, via fields like `created_by` and `last_updated_at` for example. If it’s possible to use snapshots of production databases to capture the information of interest, then it’s always better to use those. However, databases don’t always record all we need, for example when an item is added to or removed from a collection, and by whom. In these cases, we must use events.
To create and delete collections: see object lifecycle.
To add an item to a collection: fire event `mycollection_add_myitem`, with fields `myitem_id`, `new_position`, `mycollection_id`, and `mycollection_result_list`, a JSON array or comma-separated list.
For example: collection a7852cb2 has items 1a7fbcde and 2bc9d6ab. Adding item 3bc7db8c to it, in first position, triggers this event: '3bc7db8c',1,'a7852cb2','3bc7db8c,1a7fbcde,2bc9d6ab'
To remove an item from a collection: fire `mycollection_remove_myitem` with fields `myitem_id`, `old_pos`, `mycollection_id`, `mycollection_result_list`.
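The add and remove events can be sketched as two small builders that return the event payload, including the resulting list. Positions are assumed to be 1-based, the collection ID field is named `mycollection_id` in both events here, and the function names are hypothetical.

```python
def add_item_event(collection_id: str, items: list[str],
                   item_id: str, position: int) -> dict:
    """Build the add event for inserting item_id at a 1-based position."""
    result = items[:position - 1] + [item_id] + items[position - 1:]
    return {
        "event": "mycollection_add_myitem",
        "myitem_id": item_id,
        "new_position": position,
        "mycollection_id": collection_id,
        "mycollection_result_list": ",".join(result),
    }


def remove_item_event(collection_id: str, items: list[str], item_id: str) -> dict:
    """Build the remove event, recording the item's old 1-based position."""
    old_pos = items.index(item_id) + 1
    result = [i for i in items if i != item_id]
    return {
        "event": "mycollection_remove_myitem",
        "myitem_id": item_id,
        "old_pos": old_pos,
        "mycollection_id": collection_id,
        "mycollection_result_list": ",".join(result),
    }


added = add_item_event("a7852cb2", ["1a7fbcde", "2bc9d6ab"], "3bc7db8c", 1)
removed = remove_item_event(
    "a7852cb2", ["3bc7db8c", "1a7fbcde", "2bc9d6ab"], "1a7fbcde")
```

Carrying the full resulting list in each event makes every event self-describing, at the cost of larger payloads for big collections.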
Final thoughts
Instrumenting events in a consistent and reliable way can be challenging. We hope the best practices we shared in this article will be as useful to you as they were to us! And if this kind of work sounds interesting to you, have a look at our data engineer and data analyst open positions.
Thanks to Brian Eng and Nicholas Ngorok for reviewing this article.