Key points:
- A software update caused a 13-hour outage in 10 of Snowflake’s 23 global regions, affecting customers’ ability to execute queries or ingest data on Microsoft Azure, AWS, and Google Cloud Platform.
- The outage was caused by a backwards-incompatible database schema update, which led to version mismatch errors and caused operations to fail or take an extended time to complete, highlighting the importance of backward compatibility in cloud data platforms.
- The incident raises questions about Snowflake’s staged deployment process and about the effectiveness of regional redundancy in preventing multi-region outages.
A recent software update caused a significant disruption to Snowflake’s cloud data platform, leaving customers unable to execute queries or ingest data for 13 hours. The outage, which occurred on December 16, affected 10 of Snowflake’s 23 global regions, including Azure East US 2 in Virginia, AWS US West in Oregon, and Google Cloud Platform Europe West 2 in London. According to Snowflake’s incident report, the outage was caused by a backwards-incompatible database schema update, which led to version mismatch errors and caused operations to fail or take an extended time to complete.
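To make the failure mode concrete, here is a toy sketch of how a backwards-incompatible schema change breaks readers that are still running the previous release. It uses SQLite for portability, and the table and column names are invented; none of this reflects Snowflake’s internal metadata or code.

```python
# Toy illustration (SQLite, invented names -- not Snowflake's internal code):
# a backwards-incompatible schema change breaks readers still running the
# previous release, surfacing as version-mismatch-style errors.
import sqlite3

def v1_reader(conn: sqlite3.Connection) -> list:
    # Built against schema v1; it keeps running while the migration lands.
    return conn.execute("SELECT id, stage_path FROM ingest_jobs").fetchall()

def migrate_to_v2(conn: sqlite3.Connection) -> None:
    # v2 renames a column: new code works, but every v1 reader now fails.
    # (RENAME COLUMN needs SQLite >= 3.25, bundled with modern Python.)
    conn.execute("ALTER TABLE ingest_jobs RENAME COLUMN stage_path TO source_uri")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ingest_jobs (id INTEGER, stage_path TEXT)")
conn.execute("INSERT INTO ingest_jobs VALUES (1, '@stage/file1.csv')")

print(v1_reader(conn))   # fine under schema v1
migrate_to_v2(conn)
try:
    v1_reader(conn)      # same code, new schema: "no such column: stage_path"
except sqlite3.OperationalError as err:
    print("v1 reader broken by v2 schema:", err)
```

A backward-compatible alternative would add the new column and backfill it while keeping the old one readable until every client has upgraded, which is the discipline the incident report points to.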
The company’s initial investigation revealed that the update introduced a change to the database schema that was not compatible with previous release packages. This resulted in errors and disruptions to Snowpipe and Snowpipe Streaming file ingestion, as well as data clustering issues. Snowflake initially estimated that service would be restored by 15:00 UTC, but later revised the estimate to 16:30 UTC as the Virginia region took longer than expected to recover.
The outage highlights the limitations of regional redundancy in preventing multi-region outages. According to Sanchit Vir Gogia, chief analyst at Greyhound Research, regional redundancy works when the failure is physical or infrastructural, not when it is logical and shared. Here, the backwards-incompatible schema change was a logical failure that propagated across regions despite Snowflake’s staged deployment process.
Gogia notes that the issue raises questions about Snowflake’s staged deployment process, which is designed to monitor activity as accounts are moved to a new release and to respond to any issues that occur. In this case, however, the outage hit 10 regions simultaneously and lasted well beyond the expected 24-48 hour window for follow-up and rollback.
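The article does not describe Snowflake’s internal tooling, but the general shape of a staged deployment with per-region health gating looks roughly like the sketch below. The region list, error budget, soak window, and helper functions are all assumptions for illustration, not a real deployment API.

```python
# Hypothetical staged-rollout sketch with a health gate per region.
# deploy/rollback/error_rate are stand-ins, not a real deployment API.
import time

REGIONS = ["aws-us-west", "azure-east-us-2", "gcp-europe-west-2"]
ERROR_BUDGET = 0.01     # halt the rollout if >1% of operations fail
SOAK_SECONDS = 5        # demo value; real soak windows run hours or days

def deploy(region: str, version: str) -> None:
    print(f"deploying {version} to {region}")

def rollback(region: str) -> None:
    print(f"rolling back {region}")

def error_rate(region: str) -> float:
    return 0.0          # stand-in for a real post-deploy metrics query

def staged_rollout(version: str) -> bool:
    completed = []
    for region in REGIONS:
        deploy(region, version)
        time.sleep(SOAK_SECONDS)        # let mixed-version traffic soak
        if error_rate(region) > ERROR_BUDGET:
            # Roll back everything deployed so far, newest first.
            for r in [region] + completed[::-1]:
                rollback(r)
            return False
        completed.append(region)
    return True

staged_rollout("9.0.1")
```

The catch, as this incident shows, is that a logical fault in shared state such as a schema migration can pass a per-region health gate and still surface later, or in many regions at once, when incompatible versions interact.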
The incident also raises concerns about the effectiveness of Snowflake’s testing and validation processes. Gogia notes that production involves drifting client versions, cached execution plans, and long-running jobs that cross release boundaries, conditions that are difficult to simulate exhaustively before release.
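One common mitigation is a cross-version compatibility matrix that tests every supported client release against every schema version before shipping. The pytest-based sketch below, with invented version numbers and an invented compatibility rule, would flag a static break like this one; as Gogia’s point implies, though, it cannot enumerate dynamic states such as cached plans or jobs that straddle a release boundary.

```python
# Hedged sketch: a (client release, schema version) compatibility matrix.
# Versions and the compatibility rule are invented for illustration.
import itertools
import pytest

CLIENT_RELEASES = ["8.40", "8.41", "8.42"]   # hypothetical fleet versions
SCHEMA_VERSIONS = [1, 2]

def client_can_read(client: str, schema: int) -> bool:
    # Stand-in for a real integration test against a deployed build;
    # here, schema v2 is modeled as breaking clients older than 8.42.
    return schema < 2 or client >= "8.42"

@pytest.mark.parametrize(
    "client,schema", itertools.product(CLIENT_RELEASES, SCHEMA_VERSIONS)
)
def test_client_schema_compat(client, schema):
    # Fails for ("8.40", 2) and ("8.41", 2), flagging the incompatible
    # pairs before the release leaves staging.
    assert client_can_read(client, schema)
```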
The outage is not an isolated incident, but rather a manifestation of a deeper issue with control maturity under stress. In 2024, Snowflake faced security troubles when approximately 165 customers were targeted by criminals using credentials stolen through infostealer infections. Gogia argues that both incidents show why CIOs need to move beyond compliance language and uptime averages and ask behavioral questions about how platforms behave when their assumptions fail.
In response to the outage, Snowflake has promised to share a root cause analysis (RCA) document within five working days. The company has also acknowledged that it has no workaround to offer beyond recommending failover to non-impacted regions for customers with replication enabled.
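For customers who do have replication configured, the failover itself amounts to promoting the replica. Below is a minimal sketch using the snowflake-connector-python package, assuming a failover group was already set up before the outage; the group, account, and credential names are placeholders.

```python
# Minimal failover sketch. The failover group, account, and credentials are
# placeholders; this assumes replication was configured in advance, per
# Snowflake's recommendation in the incident report.
import snowflake.connector

# Connect to the SECONDARY account in a non-impacted region.
conn = snowflake.connector.connect(
    account="myorg-dr_account",      # hypothetical secondary account
    user="dr_admin",
    password="***",                  # source from a secret manager in practice
)

# Promote the replica: the secondary becomes the new primary, and clients
# can be redirected to it while the impacted region recovers.
conn.cursor().execute("ALTER FAILOVER GROUP my_failover_group PRIMARY")
```

Because promotion only helps if a replica already exists in another region, the recommendation offers nothing to customers who had not set replication up beforehand, which is exactly the gap Snowflake acknowledged.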
As the cloud landscape continues to evolve, incidents like this underline the importance of robust testing, validation, and control maturity. CIOs must ask how platforms behave when assumptions fail, and weigh backward compatibility and failover readiness as heavily as headline uptime figures when assessing the resilience of their cloud data platforms.