What caused the Atlassian Cloud outage?
The outage began on April 4 and was not fully resolved until April 18, affecting about 400 Atlassian Cloud customer accounts. Atlassian Cloud is the hosted version of popular Atlassian products such as Jira and OpsGenie. Customers affected by the outage lost access to these tools and to the data they manage with them.
According to Atlassian, the problem was triggered by a misstep in an application migration process. Engineers wrote a script to deactivate an obsolete version of an app. However, due to what Atlassian describes as a "communication gap" between its teams, the script ended up deactivating not just the obsolete app but entire Atlassian Cloud sites for the affected customers.
To make matters worse, the script was run in a mode that permanently deletes data, rather than the intended mode that merely marks it for deletion. As a result, data in the affected accounts was permanently removed from the production environment.
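To see how a single parameter can turn a routine deactivation into permanent data loss, here is a minimal sketch in Python. Everything in it is hypothetical (the `deactivate_sites` function, the `DeletionMode` values, and the ID lists); Atlassian has not published its actual tooling, so this only illustrates the pattern described above.

```python
from enum import Enum


class DeletionMode(Enum):
    MARK_FOR_DELETION = "mark"    # recoverable: the site is flagged and can be restored
    PERMANENTLY_DELETE = "purge"  # irreversible: production data is removed outright


def deactivate_sites(site_ids: list[str], mode: DeletionMode) -> None:
    """Hypothetical deactivation routine for a list of cloud site IDs."""
    for site_id in site_ids:
        if mode is DeletionMode.MARK_FOR_DELETION:
            print(f"Flagging {site_id} for deferred deletion (recoverable)")
        else:
            print(f"PURGING {site_id} permanently (not recoverable)")


# The intended call: deactivate only the obsolete app, keeping data recoverable.
# deactivate_sites(legacy_app_ids, DeletionMode.MARK_FOR_DELETION)

# What reportedly happened: the wrong targets combined with the irreversible mode.
# deactivate_sites(customer_site_ids, DeletionMode.PERMANENTLY_DELETE)
```

The damage came from the combination of the two mistakes: either one alone would have been recoverable, but the wrong targets in the irreversible mode were not.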
The good, the bad, and the ugly from the Atlassian incident
The Atlassian Cloud outage may not have been the worst incident imaginable. Failures like Facebook's 2021 outage were arguably worse, since they affected far more people and recovery was complicated by the need for physical access to hardware. But the Atlassian incident was still plenty ugly: production data was permanently deleted, and hundreds of enterprise customers experienced total service interruptions that lasted for days.
Given the seriousness of the incident, it is tempting to point fingers at the engineers who let it happen in the first place. They wrote a script with serious flaws and then ran it against production without testing it first, which is the opposite of SRE best practice.
Atlassian, on the other hand, deserves a lot of credit for ultimately responding to the incident openly. The company was quiet at first, but it eventually shared details of what happened and why, even though those details were a bit embarrassing for its engineers.
Importantly, Atlassian also had backups, along with failover procedures it could use to speed up recovery. The company said the main reason the outage lasted so long was that restoring production data from backup required merging backup data for individual customers into storage that is shared by multiple customers. That is a difficult process that Atlassian cannot perform automatically (or chooses not to, because doing it automatically would be very risky).
Unfortunately for affected customers, no fallback tools or services appear to be available to them while they wait for Atlassian's services to be restored.
We imagine this is more than a minor inconvenience for teams that rely on tools like Jira and OpsGenie to manage projects and respond to incidents. Maybe those teams have set up alternative tools in the meantime, or maybe they have spent the past several days waiting it out, hoping their project and incident management tools come back online soon.
Takeaways for SREs from the Atlassian outage
For SREs, the key takeaways from this incident are:
- Always do dry runs of migration processes under test conditions before running them in production. Presumably, if Atlassian engineers had tested their application migration script first, they would have noticed its flaws before it ever touched live customer environments (see the dry-run sketch after this list).
- Ensure that you have backups, backups, backups, as well as failover services that can be rebuilt from those backups. As bad as this outage is, it would have been 100 times worse if Atlassian had been unable to restore service from its backups and the data had been lost for good.
- Ideally, each customer's data should be stored separately. As we mentioned above, the fact that Atlassian used shared storage appears to have been a factor in the slow recovery. That said, it is hard to fault Atlassian too much on this point; isolating data for each customer is not always practical due to cost and administrative complexity.
- It is wise for SRE teams to think about how they would respond if their own reliability management software went offline. For example, it may be worthwhile to export and back up the data from those tools periodically, so you can still access it if your tool provider experiences an incident like this one.
- Communicate with your customers frequently and proactively. In this case, Atlassian went radio silent for a while, leaving customers wondering what to do; much of the conversation eventually spilled into public forums.
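As a concrete illustration of the first takeaway, the sketch below shows one common way to build a dry-run guard into a destructive script: the script defaults to printing what it would do, and only acts when destructive behavior is requested explicitly. The `delete_site` function and the `--execute` flag are hypothetical, not Atlassian's actual tooling.

```python
import argparse


def delete_site(site_id: str) -> None:
    """Placeholder for a destructive operation against production."""
    print(f"Deleted {site_id}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Deactivate legacy app sites")
    parser.add_argument("site_ids", nargs="+", help="IDs of the sites to act on")
    # Destructive behavior must be requested explicitly; the default is a dry run.
    parser.add_argument("--execute", action="store_true",
                        help="actually perform deletions (default: dry run)")
    args = parser.parse_args()

    for site_id in args.site_ids:
        if args.execute:
            delete_site(site_id)
        else:
            print(f"[dry run] would delete {site_id}")


if __name__ == "__main__":
    main()
```

Running the script without `--execute` produces a reviewable preview of every ID it would touch, which is exactly the kind of check that could have caught the wrong target list before it hit production.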
Conclusion
The Atlassian Cloud outage was notable both for its length and, somewhat ironically, for the fact that it took down the very software teams use to prevent and manage issues like this in their own businesses.
The good news is that Atlassian had the backups and resources it needed to restore service. The shared data storage architecture made recovery slow, which is unfortunate, but again, it is hard to fault Atlassian too harshly for not setting up dedicated storage for each customer.