Google Cloud burst by 12-hour power outage in German region

Loose juice led to cooling issue in one zone, but the pain was widespread

The Register

Google Cloud apologized on Thursday after its europe-west3 region – located in Frankfurt, Germany – experienced an outage lasting half a day.

The incident began at 02:30 local time on Thursday, October 24 and ended at 15:09 – a total of 12 hours and 39 minutes.

"We apologize for the inconvenience this service disruption/outage may have caused," wrote the cloudy giant.

Google identified the root cause as a power failure and cooling issue that caused parts of one of the region's three zones, europe-west3-c, to power down. Degraded services inevitably followed.

"Google engineers implemented a fix to return the datacenter to full operation and this mitigated the issue," stated the advisory.

Services and features affected included: Cloud Build, Cloud Developer Tools, Cloud Machine Learning, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Pub/Sub, Google Compute Engine, Google Kubernetes Engine, Persistent Disk, and Vertex AI Batch Prediction.

Users experienced a range of issues across several Google Cloud services.

On Google Compute Engine, some users faced VM creation failures and delays in processing deletions, while certain instances in the affected zone were unavailable for operations.

In Google Kubernetes Engine, nodes in the impacted location were inaccessible, and some attempts to create new nodes failed. Persistent Disk instances were unreachable, preventing operations on them.

Users of Google Cloud Dataflow saw delays in scaling workers for batch jobs, and some streaming jobs failed to progress or scale properly. Existing Google Cloud Dataproc clusters remained functional, but some attempts to create new clusters failed. Cloud Build users may have experienced delays in starting custom worker pools.

While most problems were experienced at the zonal level, there was some regional-level impact.

"For the other two zones in the same region, less than one percent of the operations that touch instance and disk resources experienced internal errors," insisted Google.

Multi-zonal problems occurred with Vertex AI Batch Prediction, which failed for some with the error message "Unable to prepare an infrastructure for serving within time."

The chocolate factory first notified users of the failure 26 minutes after it began, but did not offer any workaround until almost three hours into the outage. It eventually told impacted users to migrate workloads to other regions or zones and advised those with a degraded regional persistent disk to take regular snapshots.
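For those inclined to follow the snapshot advice, the gist might look something like the sketch below, using the google-cloud-compute Python client. The project, region, and disk names are hypothetical placeholders, and the exact calls reflect our reading of that library's generated API rather than anything Google published in its advisory.

```python
# Sketch: take a timestamped snapshot of a regional persistent disk
# with the google-cloud-compute client (pip install google-cloud-compute).
# PROJECT and DISK are hypothetical placeholders.
from datetime import datetime, timezone

from google.cloud import compute_v1

PROJECT = "my-project"     # hypothetical project ID
REGION = "europe-west3"    # the affected region
DISK = "my-regional-pd"    # hypothetical regional persistent disk name


def snapshot_regional_disk() -> str:
    """Create a snapshot of a regional persistent disk and wait for it."""
    snapshot = compute_v1.Snapshot()
    snapshot.name = f"{DISK}-{datetime.now(timezone.utc):%Y%m%d-%H%M%S}"

    client = compute_v1.RegionDisksClient()
    operation = client.create_snapshot(
        project=PROJECT,
        region=REGION,
        disk=DISK,
        snapshot_resource=snapshot,
    )
    operation.result()  # block until the snapshot operation completes
    return snapshot.name


if __name__ == "__main__":
    print(f"Created snapshot: {snapshot_regional_disk()}")
```

Run on a schedule – Cloud Scheduler, cron, whatever you trust – something like this at least bounds how much data a zonal meltdown can take with it.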

Although europe-west3 is not particularly known for outages, it did experience an incident that affected cloud workstations late last year.

This past May, Google Cloud had a very bad day when failed maintenance ops brought pain to users of 33 services – including Compute Engine and Kubernetes Engine – for approximately two hours and 48 minutes.

That incident occurred around a week after the cloud provider deleted Australian pension fund UniSuper's entire account. That error was attributed to a tragic alignment of a bug and a misconfiguration. ®