In November 2020, a Rainbow Six Siege player posted a video to Twitter that showed that hit registration seemed to have failed during their game.
"C'est un bon jeu, rien à dire 🙂" ("It's a good game, nothing to say 🙂") pic.twitter.com/dO36ym4gVw - Kramiche (@KramaFPS), November 23, 2020
When the dev team came across the post, it set out to debug the issue—even though all it had to go on was the single tweet, the bug not having been officially logged.
Yep: the R6S dev team aims to investigate or evaluate all bugs, whether they’re logged officially via R6fix, a platform where community members can submit bugs directly to the devs’ Jira board for examination, or plastered across a random subsection of the internet.
“We have one hand on the temperature of the community at all times,” says Jason Egginton, the Network Programming Team Lead on Rainbow Six Siege. “So we react to what members post.”
That being said, the dev team’s vision of responding to community feedback has to intersect with the ops team’s reality of efficiently and usefully storing petabytes — literally — of data.
It’s another dev vs. ops dilemma, and one that R6S’s network team solved rather elegantly over the past year.
The hidden face of observability
Rainbow Six Siege servers, located on all continents, host several million matches daily. Each match generates a log of 10 to 50 megabytes, which adds up to five to seven terabytes of data per day, or roughly 200 terabytes per month.
With a season of Rainbow Six Siege lasting three months, should the team want to compare regression and improvements from one season to another, it would have to keep data for at least six months, which means storing 1.2 petabytes of data.
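The arithmetic behind these figures can be checked in a few lines. The sketch below assumes a 30-day month and decimal units (1 petabyte = 1,000 terabytes); neither assumption is stated in the article, while the daily range and the rounded monthly figure are the article's own.

```python
# Back-of-the-envelope check of the storage figures quoted above.
# Assumptions (not from the article): a 30-day month and decimal units.
DAYS_PER_MONTH = 30
TB_PER_PB = 1000

daily_tb_low, daily_tb_high = 5, 7  # daily log volume quoted in the article

monthly_tb_low = daily_tb_low * DAYS_PER_MONTH
monthly_tb_high = daily_tb_high * DAYS_PER_MONTH

# The article rounds the monthly figure to 200 TB, which sits inside this range.
monthly_tb_quoted = 200
retention_months = 6  # two three-month seasons
retained_pb = monthly_tb_quoted * retention_months / TB_PER_PB

print(f"monthly: {monthly_tb_low}-{monthly_tb_high} TB")
print(f"six months at {monthly_tb_quoted} TB/month: {retained_pb} PB")
```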
Data storage is costly, and storing petabytes of data over extended periods is nearly unfathomable. As such, until recently, R6S servers were simply not storing all this data. Instead, servers collected actionable logs (logs that can lead to a direct action, such as recorded errors and warnings), and then randomly saved a sample of the rest of the raw logs “in case”. This was enough data to allow the teams to detect DDoS attacks, cheaters, performance losses, bugs, and other events that give a snapshot of the state of servers and game quality.
This was not, however, enough data to solve any random issue that might be posted to the internet. And so, a dev investigating an issue based on a screen cap or a video not only had to spend a considerable amount of time identifying the server where the game had been hosted, but they also had to contend with rather low odds that the server had saved the logs necessary to debug the issue.
What’s more, data logs were generally only stored for a few days, so the investigating dev had to act fast; no small feat when one has to Sherlock Holmes their way from an anonymous, undated screencap to—hopefully—a fix.
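To see why the odds were low under the old approach, here is a minimal sketch of sample-based retention. The 5% sample rate, the level names, and the function names are all hypothetical; the article does not say how the sample was drawn or how large it was.

```python
import random

# Assumption: "actionable" means errors and warnings, as the article describes.
ACTIONABLE_LEVELS = {"ERROR", "WARNING"}

def keep_line(level):
    """Actionable log lines were always kept."""
    return level in ACTIONABLE_LEVELS

def keep_raw_log(sample_rate, rng):
    """The rest of a match's raw log was kept only if it fell in the random sample."""
    return rng.random() < sample_rate

SAMPLE_RATE = 0.05  # hypothetical value, not from the article
matches = 10_000
rng = random.Random(0)  # seeded so the run is repeatable
kept = sum(keep_raw_log(SAMPLE_RATE, rng) for _ in range(matches))

print(f"raw logs kept for {kept} of {matches} matches")
```

Whatever the real rate was, the chance that the one match shown in a player's video had its raw log saved was no better than that rate.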
The developer’s dream of ease
“I can tell you from experience that having a one in anything chance of actually finding the game server logs that you’re looking for is very frustrating,” Egginton says. “It’s much better to guarantee that the logs will exist when someone goes looking for them.
“So basically, I wondered: Why can’t we have everything all the time?” he says, matter-of-factly. “To start, you need to store the data in an efficient way, because it’s costly.”
Over the past year, through R&D, the team has developed a two-tiered system for storing petabytes of game server data, based on the importance or the pertinence of data.
The bulk of the data, the raw logs (every single line that every single game server encounters during a run), is now sent to a decentralized, inexpensive, low-performance storage solution in the cloud (Blob storage). These logs are written in a rapid-to-produce JSON format that is designed to be highly compressible, yet organized in such a way as to be searchable in the future.
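One common way to get a JSON format that is both cheap to write and highly compressible is newline-delimited JSON compressed with gzip. The sketch below illustrates that general idea only; the article does not describe the team's actual schema or compression, and the field names (ts, level, session_id, msg) are hypothetical.

```python
import gzip
import json

def encode_log_lines(records):
    """Serialize records as newline-delimited JSON, then gzip the result."""
    # Stable key order and compact separators make consecutive lines look alike,
    # which gives the compressor more repetition to work with.
    text = "\n".join(
        json.dumps(r, sort_keys=True, separators=(",", ":")) for r in records
    )
    return gzip.compress(text.encode("utf-8"))

def decode_log_lines(blob):
    """Reverse encode_log_lines: decompress and parse one JSON object per line."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]

# Hypothetical log records for a single session.
records = [
    {"ts": 1606140000 + i, "level": "INFO", "session_id": "s-001", "msg": f"tick {i}"}
    for i in range(1000)
]
blob = encode_log_lines(records)
raw_size = len("\n".join(json.dumps(r) for r in records).encode("utf-8"))

print(f"raw JSON: {raw_size} bytes, gzipped: {len(blob)} bytes")
```

Because each line is a self-contained JSON object, a log stored this way can still be parsed and filtered line by line after it is pulled back out of cold storage.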
Then, two specific subsets of logs are sent to an Elasticsearch database, a high-performance storage solution.
“We’re sending the very high-level information, errors and warnings, and this kind of thing, to Elasticsearch, so we can keep an eye on whether the game servers are running properly,” Egginton says. “And at the same time, we’re producing metadata in another index in Elasticsearch. We’re using Elasticsearch more optimally now; we no longer store data just in case we need it, which really reduces costs.”
The metadata logs Egginton describes include information such as the data centre, session ID, and player IDs, as well as the locations of raw logs in Blob storage. The team spent some time refining the raw log data down to the best categories that made sense to keep for optimal searches.
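The split between the two Elasticsearch subsets can be sketched as follows. This is an illustration under stated assumptions, not the team's implementation: the metadata fields mirror the ones listed above, but the class name, the function name, the level names, and the sample values are all hypothetical, and plain Python objects stand in for the Elasticsearch indices.

```python
from dataclasses import dataclass, asdict

# Assumption: which log levels count as "very high-level information".
HOT_LEVELS = {"ERROR", "WARNING"}

@dataclass
class SessionMetadata:
    data_centre: str
    session_id: str
    player_ids: list
    blob_path: str  # where the full raw log lives in cold storage

def split_tiers(meta, lines):
    """Return the documents destined for the two hot-tier indices.

    alerts: the error and warning lines, for monitoring server health.
    metadata document: a small record that points at the raw log in cold storage.
    """
    alerts = [line for line in lines if line["level"] in HOT_LEVELS]
    return alerts, asdict(meta)

# Hypothetical session data.
meta = SessionMetadata("eu-west", "s-001", ["p-17", "p-42"], "logs/eu-west/s-001.json.gz")
lines = [
    {"level": "INFO", "msg": "round start"},
    {"level": "WARNING", "msg": "high latency"},
    {"level": "ERROR", "msg": "connection reset"},
]
alerts, metadata_doc = split_tiers(meta, lines)

print(f"{len(alerts)} of {len(lines)} lines go to the hot tier")
print(metadata_doc)
```

The metadata document stays small no matter how long the match was, which is what keeps the expensive tier cheap while every raw log remains findable.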
Finding the needle in the haystack, in one click
In summary, the strategy is ultimately quite simple: reference and critical data are sent to Elasticsearch for real-time search and visualization, while raw data is sent to cheap cold storage and brought back into Elasticsearch (hot storage) on demand with a single click.
Concretely, this means a dev uses Kibana to investigate an issue by searching the metadata index of Elasticsearch using a parameter such as player IDs. The results return the location of the appropriate game server log, which is clickable and activates a log recovery service.
Next thing you know, the raw log has been fetched from Blob storage and injected into Elasticsearch for a more thorough investigation of the issue at hand. It takes minutes.
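The lookup-then-recover flow can be sketched with in-memory stand-ins. Everything named here is hypothetical: a list plays the metadata index, a dict plays Blob storage, another list plays the hot index, and the raw log is assumed to be gzipped newline-delimited JSON. The real system uses Kibana, Elasticsearch, and a log recovery service whose interfaces the article does not describe.

```python
import gzip
import json

def find_sessions(metadata_index, player_id):
    """Return metadata entries for sessions that included the given player."""
    return [doc for doc in metadata_index if player_id in doc["player_ids"]]

def recover_log(blob_path, cold_store, hot_index):
    """Fetch a compressed raw log from cold storage and load its lines into the hot index."""
    text = gzip.decompress(cold_store[blob_path]).decode("utf-8")
    lines = [json.loads(line) for line in text.splitlines() if line]
    hot_index.extend(lines)
    return len(lines)

# Hypothetical data standing in for Blob storage and the two hot-tier indices.
raw = [
    {"level": "INFO", "msg": "round start"},
    {"level": "WARNING", "msg": "high latency"},
]
path = "logs/eu-west/s-001.json.gz"
cold_store = {path: gzip.compress("\n".join(json.dumps(r) for r in raw).encode("utf-8"))}
metadata_index = [
    {"session_id": "s-001", "data_centre": "eu-west",
     "player_ids": ["p-17", "p-42"], "blob_path": path},
]
hot_index = []

hits = find_sessions(metadata_index, "p-42")               # step 1: search metadata
loaded = recover_log(hits[0]["blob_path"], cold_store, hot_index)  # step 2: the "click"

print(f"{len(hits)} session found, {loaded} raw lines loaded")
```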
In the past, a dev could spend days on an issue: first searching via Kibana, then hunting for the proper logs in Azure (with fingers crossed that the logs had been saved within a random sample), then downloading the log file, then using a text editor to read the log…
“Now, basically, it’s one tool, just Kibana,” Egginton concludes. “We have a very efficient way to investigate an issue and fetch the raw data if we need it.”
An eye on the bottom line: the player experience
And so, after one year of research and experimentation, the R6S network team turned a Kafkaesque problem into a one-click solution. “Our vision was to give our devs easy access to every log and metric generated by every game server,” Egginton says.
The new system, with its optimized game server log settings and one-click solution, was successfully activated and verified on live servers in early March 2021, and went live with the launch of the new season on March 16, 2021.
“Imagine: before, you were sending some logs to expensive data storage media in case you needed them, but then you never looked at them,” Egginton says. “Now, the general vision is that it should be as easy to investigate something that happened three months ago in the game or just now in development.”
This new way of judiciously selecting which data to save to an expensive data store (Elasticsearch) versus a cheaper one (Blob storage) could mean a cost reduction of tens of thousands of dollars per month for R6S, which is remarkable—but Egginton is already thinking about the bugs devs can now investigate, identify and solve.
“We’re just making sure that we have all the information about the game so we can investigate any issue,” he says. “We hope that the production will be able to leverage this solution to increase our ability to investigate and fix live issues so we can make sure players keep having a good time.”
Flashback to real-world issues
So what happened with the hit-registration issue mentioned at the top?
Using a few key pieces of information—the player names that are viewable in the video, as well as the map being played—the production team was able to locate the specific game server that had hosted the game. From there, they accessed the logs and were able to investigate the issue.
In this particular instance, it turns out the issue wasn’t hit registration. Rather, the player was suffering from massive lag, but was not receiving the signs and feedback that would have told them so. A different problem altogether.
“Community members tell us what we need to look at,” Egginton says. And R6S devs have even more time and resources to solve their issues thanks to a single-click tool.
Have a Rainbow Six Siege bug to flag? Sure, you can post it to YouTube if you want, but do also use R6fix, the dedicated R6S platform where community members can submit bugs directly to the devs’ Jira board for examination. Players can gain visibility on the status of their submissions, as well as on the latest updates and fixes.