Fortnite is a force to be reckoned with, if the latest numbers are any sign. This past weekend, the free-to-play game broke its concurrent player record, topping out at 3.4 million players. That absurd number of players caused backend issues for Epic Games, which addressed the fiasco on its official blog, going into depth on what went wrong and how it plans to alleviate the situation.
Under the “extreme load” of February 3rd and 4th, a series of issues led to the meltdown: MCP database latency, MCP thread configuration, an account service outage, an XMPP outage, cloud capacity throttling, and available IP exhaustion. There’s an entire root cause analysis on the official blog, breaking down everything that happened from the beginning.
Epic is aware of how popular their game has gotten, so they’ve come up with a plan to make sure this doesn’t happen in the future:
- Identify and resolve the root cause of our DB performance issues. We’ve flown Mongo experts on-site to analyze our DB and usage, as well as provide real-time support during heavy load on weekends.
- Optimize, reduce, and eliminate all unnecessary calls to the backend from the client or servers. Some examples are: periodically verifying user entitlements when this is already happening implicitly with each game service call; registering and unregistering individual players on a gameplay session when these calls can be done more efficiently in bulk; deferring XMPP connections to avoid thrashing during login/logout scenarios; and social features recovering quickly from ELB or other connectivity issues. When 3.4 million clients are connected at the same time, these inefficiencies add up quickly.
- Optimize how we store the matchmaking session data in our DB. Even without a root cause for the current write queue issue we can improve performance by changing how we store this ephemeral data. We’re prototyping in-memory database solutions that may be more suited to this use case, and looking at how we can restructure our current data in order to make it properly shardable.
- Improve our internal operation excellence focus in our production and development process. This includes building new tools to compare API call patterns between builds, setting up focused weekly reviews of performance, expanding our monitoring and alerting systems, and continually improving our post-mortem processes.
- Improve our alerting and monitoring of known cloud provider limits, and subnet IP utilization.
- Reducing blast radius during incidents. A number of our core services are globally impacting to all players. While we operate game servers all over the world, expanding to additional cloud providers and supporting core services in multiple geographical locations will help reduce player impact when services fail. Expanding our footprint also increases our operational overhead and complexity. If you have experience in running large worldwide multi cloud services and/or infrastructure we would love to hear from you.
- Rearchitecting our core messaging stack. Our stack wasn’t architected to handle this scale and we need to look at larger changes in our architecture to support our growth.
- Digging deeper into our data and DB storage. We hit new and interesting limits as our services grow and our data sets and usage patterns grow larger and larger every day. We’re looking for experienced DBAs to join our team and help us solve some of the scaling bottlenecks we run into as our games grow.
- Scaling our internal infrastructure. When our game services grow in size so do our internal monitoring, metrics, and logging along with other internal needs. As our footprint expands our needs for more advanced deployment, configuration tooling and infrastructure also increases. If you have experience scaling and improving internal systems and are interested in what is going on here at Epic, let’s have a chat.
- Performance at scale. Along with a number of things mentioned, even small performance changes over N nodes collectively make large impacts for our services and player experience. If you have experience with large scale performance tuning and want to come make improvements that directly impact players please reach out to us.
- MCP Re-architecture
  - Move specific functionality out of MCP to microservices
  - Event sourcing data models for user data
  - Actor based modeling of user sessions
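To make the “in bulk” item from the list above concrete, here is a minimal Python sketch of batching player registrations before they reach the backend. It is an illustration only, not Epic’s implementation: the `BulkRegistrar` class, the `send_batch` callback, and the batch size of 50 are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BulkRegistrar:
    """Buffers player IDs and hands them to the backend in batches.

    `send_batch` stands in for a hypothetical bulk registration call;
    the real service API is not described in the article.
    """
    send_batch: Callable[[List[str]], None]
    batch_size: int = 50  # assumed value, purely illustrative
    _pending: List[str] = field(default_factory=list)

    def register(self, player_id: str) -> None:
        # Queue the player and flush once a full batch has accumulated.
        self._pending.append(player_id)
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is queued as a single backend call, if anything.
        if self._pending:
            self.send_batch(list(self._pending))
            self._pending.clear()


# Record each simulated backend call so the call count can be inspected.
calls: List[List[str]] = []
registrar = BulkRegistrar(send_batch=calls.append, batch_size=50)

# Register a 100-player session.
for i in range(100):
    registrar.register(f"player-{i}")
registrar.flush()  # send any leftover partial batch

print(f"backend calls made: {len(calls)}")
```

With these assumed numbers, a 100-player session costs two backend calls rather than one per player, which is the kind of saving that matters when millions of clients are connected at once.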
“Problems that affect service availability are our primary focus above all else right now. We want you all to know we take these outages very seriously, conducting in-depth postmortems on each incident to identify the root cause and decide on the best plan of action,” Epic said about the problems. “The online team has been working diligently over the past month to keep up with the demand created by the rapid week-over-week growth of our user base.”
It’s been a wonderful surprise to see how well gamers have responded to Fortnite, a game that was so much smaller not that long ago. Today, I see more than half of my friends list playing Fortnite every day, for hours at a time.