Lousy Outages

Fetched: October 28, 2025 2:48 pm

Cloudflare

Degraded

Minor Service Outage

  • WARP with MASQUE Connection Issues in Toronto, Canada

    Minor • October 28, 2025 1:50 pm

    The cause of this issue has been identified and a fix is being implemented.

    View incident
  • Cloudflare for SaaS Custom Hostname search issue

    Minor • October 28, 2025 10:41 am

    A fix has been implemented and we are currently monitoring the results.

    View incident
View status →

GitLab

UNKNOWN

Status temporarily unavailable (HTTP 404)

No active incidents.

View status →

Okta

UNKNOWN

Status temporarily unavailable (HTTP 401)

No active incidents.

View status →

PagerDuty

UNKNOWN

Status temporarily unavailable (HTTP 404)

No active incidents.

View status →

Zoom

UNKNOWN

Service Under Maintenance

No active incidents.

View status →

Zscaler

UNKNOWN

Status unknown — view status page for updates.

No active incidents.

View status →

Stripe

UNKNOWN

Status temporarily unavailable (HTTP 404)

No active incidents.

View status →

Fastly

UNKNOWN

Status temporarily unavailable (HTTP 403)

No active incidents.

View status →

Datadog

Operational

All Systems Operational

No active incidents.

View status →

Notion

UNKNOWN

cURL error 6: Could not resolve host: www.notionstatus.com

No active incidents.

View status →
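
The "Status temporarily unavailable (HTTP 4xx)" and "cURL error 6" entries above reflect the dashboard's own fetches of provider status endpoints failing, rather than confirmed outages at those providers. A minimal sketch of that kind of poller, assuming Statuspage-style /api/v2/summary.json endpoints (the URLs and provider names below are illustrative, not taken from this page):

```python
# Minimal sketch of a status poller like the one behind this dashboard.
# Endpoints are assumptions; real providers expose different URLs/formats.
import requests

STATUS_ENDPOINTS = {
    "Cloudflare": "https://www.cloudflarestatus.com/api/v2/summary.json",
    "Datadog": "https://status.datadoghq.com/api/v2/summary.json",
}

def fetch_status(name: str, url: str) -> str:
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException as exc:
        # DNS failures surface like Notion's "Could not resolve host" above.
        return f"{name}: UNKNOWN ({exc})"
    if resp.status_code != 200:
        # 401/403/404 responses map to "Status temporarily unavailable".
        return f"{name}: UNKNOWN (HTTP {resp.status_code})"
    description = resp.json().get("status", {}).get("description", "Unknown")
    return f"{name}: {description}"

if __name__ == "__main__":
    for provider, endpoint in STATUS_ENDPOINTS.items():
        print(fetch_status(provider, endpoint))
```

Separating fetch failures (UNKNOWN) from reported incidents keeps a dead or blocked status endpoint from being mistaken for a provider outage.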

Amazon Web Services

Operational

No active incidents.

View status →

Microsoft Azure

Operational

No active incidents.

View status →

Google Cloud

Degraded

Incident

  • Incident

    • July 18, 2025 8:47 am

    # Incident Report ## Summary On Friday, 18 July 2025 07:50 US/Pacific, several Google Cloud Platform (GCP) and Google Workspace (GWS) products experienced elevated latencies and error rates in the us-east1 region for a duration of up to 1 hour and 57 minutes. **GCP Impact Duration:** 18 July 2025 07:50 - 09:47 US/Pacific : 1 hour 57 minutes **GWS Impact Duration:** 18 July 2025 07:50 - 08:40 US/Pacific : 50 minutes We sincerely apologize for this incident, which does not reflect the level of quality and reliability we strive to offer. We are taking immediate steps to improve the platform’s performance and availability. ## Root Cause The service interruption was triggered by a procedural error during a planned hardware replacement in our datacenter. An incorrect physical disconnection was made to the active network switch serving our control plane, rather than the redundant unit scheduled for removal. The redundant unit had been properly de-configured as part of the procedure, and the combination of these two events led to partitioning of the network control plane. Our network is designed to withstand this type of control plane failure by failing open, continuing operation. However, an operational topology change while the network control plane was in a failed open state caused our network fabric's topology information to become stale. This led to packet loss and service disruption until services were moved away from the fabric and control plane connectivity was restored. ## Remediation and Prevention Google engineers were alerted to the outage by our monitoring system on 18 July 2025 07:06 US/Pacific and immediately started an investigation. The following timeline details the remediation and restoration efforts: * **07:39 US/Pacific**: The underlying root cause (device disconnect) was identified and onsite technicians were engaged to reconnect the control plane device and restore control plane connectivity. At that moment, network failure open mechanisms worked as expected and no impact was observed. * **07:50 US/Pacific**: A topology change led to traffic being routed suboptimally, due to the network being in a fail open state. This caused congestion on the subset of links, packet loss, and latency to customer traffic. Engineers made a decision to move traffic away from the affected fabric, which mitigated the impact for the majority of the services. * **08:40 US/Pacific**: Engineers mitigated Workspace impact by shifting traffic away from the affected region. * **09:47 US/Pacific**: Onsite technicians reconnected the device, control plane connectivity was fully restored and all services were back to stable state. Google is committed to preventing a repeat of the issue in the future, and is completing the following actions: * Pause non-critical workflows until safety controls are implemented (complete). * Strengthen safety controls for hardware upgrade workflows by end of Q3 2025. * Design and implement a mechanism to prevent control plane partitioning in case of dual failure of upstream routers by end of Q4 2025. ## Detailed Description of Impact ### GCP Impact: Multiple products in us-east1 were affected by the loss of network connectivity, with the most significant impacts seen in us-east1-b. Other regions were not affected. The outage caused a range of issues for customers with zonal resources in the region, including packet loss across VPC networks, increased error rates and latency, service unavailable (503) errors, and slow or stuck operations up to loss of networking connectivity. While regional products were briefly impacted, they recovered quickly by failing over to unaffected zones. A small number (0.1%) of Persistent Disks in us-east1-b were unavailable for the duration of the outage: these disks became available once the outage was mitigated, with no customer data loss. ### GWS Impact: A small subset of Workspace users, primarily around the Southeast US, experienced varying degrees of unavailability and increased delays across multiple products, including Gmail, Google Meet, Google Drive, Google Chat, Google Calendar, Google Groups, Google Doc/Editors, and Google Voice.

    View incident
  • Incident

    • June 12, 2025 5:18 pm

    # Incident Report ## **Summary** *Google Cloud, Google Workspace and Google Security Operations products experienced increased 503 errors in external API requests, impacting customers.* ***We deeply apologize for the impact this outage has had. Google Cloud customers and their users trust their businesses to Google, and we will do better. We apologize for the impact this has had not only on our customers’ businesses and their users but also on the trust of our systems. We are committed to making improvements to help avoid outages like this moving forward.*** ### **What happened?** Google and Google Cloud APIs are served through our Google API management and control planes. Distributed regionally, these management and control planes are responsible for ensuring each API request that comes in is authorized, has the policy and appropriate checks (like quota) to meet their endpoints. The core binary that is part of this policy check system is known as Service Control. Service Control is a regional service that has a regional datastore that it reads quota and policy information from. This datastore metadata gets replicated almost instantly globally to manage quota policies for Google Cloud and our customers. On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code. As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash. Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected, the issue would have been caught in staging. On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds. This policy data contained unintended blank fields. Service Control then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop. This occurred globally given each regional deployment. Within 2 minutes, our Site Reliability Engineering team was triaging the incident. Within 10 minutes, the root cause was identified and the red-button (to disable the serving path) was being put in place. The red-button was ready to roll out ~25 minutes from the start of the incident. Within 40 minutes of the incident, the red-button rollout was completed, and we started seeing recovery across regions, starting with the smaller ones first. Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure. Service Control did not have the appropriate randomized exponential backoff implemented to avoid this. It took up to ~2h 40 mins to fully resolve in us-central-1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load. At that point, Service Control and API serving was fully recovered across all regions. Corresponding Google and Google Cloud products started recovering with some taking longer depending upon their architecture. ### **What is our immediate path forward?** Immediately upon recovery, we froze all changes to the Service Control stack and manual policy pushes until we can completely remediate the system. ### **How did we communicate?** We posted our first incident report to Cloud Service Health about ~1h after the start of the crashes, due to the Cloud Service Health infrastructure being down due to this outage. For some customers, the monitoring infrastructure they had running on Google Cloud was also failing, leaving them without a signal of the incident or an understanding of the impact to their business and/or infrastructure. We will address this going forward. ### **What’s our approach moving forward?** Beyond freezing the system as mentioned above, we will prioritize and safely complete the following: * We will modularize Service Control’s architecture, so the functionality is isolated and fails open. Thus, if a corresponding check fails, Service Control can still serve API requests. * We will audit all systems that consume globally replicated data. Regardless of the business need for near instantaneous consistency of the data globally (i.e. quota management settings are global), data replication needs to be propagated incrementally with sufficient time to validate and detect issues. * We will enforce all changes to critical binaries to be feature flag protected and disabled by default. * We will improve our static analysis and testing practices to correctly handle errors and if need be fail open. * We will audit and ensure our systems employ randomized exponential backoff. * We will improve our external communications, both automated and human, so our customers get the information they need asap to react to issues, manage their systems and help their customers. * We'll ensure our monitoring and communication infrastructure remains operational to serve customers even when Google Cloud and our primary monitoring products are down, ensuring business continuity.

    View incident
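
The Service Control report above calls out the missing randomized exponential backoff that turned simultaneous task restarts into a herd against the underlying Spanner-backed datastore. A minimal sketch of that pattern (capped exponential backoff with full jitter; names and limits are illustrative, not Google's implementation):

```python
# Minimal sketch of randomized ("full jitter") exponential backoff.
import random
import time

def call_with_backoff(operation, max_attempts=6, base_delay=0.1, max_delay=30.0):
    """Retry `operation`, sleeping a random amount inside a capped exponential window."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Cap the window, then pick a random point inside it so restarting
            # tasks do not retry in lockstep against the shared datastore.
            window = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, window))
```

Randomizing within the capped window spreads retries out over time, so recovering tasks do not synchronize their load on shared infrastructure.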
  • Incident

    • May 20, 2025 4:05 am

    # Incident Report ## Summary On 19 May 2025, Google Compute Engine (GCE) encountered problems affecting Spot VM termination globally, and performance degradation and timeouts of reservation consumption / VM creation in us-central1 and us-east4 for a duration of 8 hours, 42 minutes. Consequently, multiple other Google Cloud Platform (GCP) products relying on GCE also experienced increased latencies and timeouts. To our customers who were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. ## Root Cause A recently deployed configuration change to a Google Compute Engine (GCE) component mistakenly disabled a feature flag that controlled how VM instance states are reported to other components. Safety checks intended to ensure gradual rollout of this type of change failed to be triggered, resulting in an unplanned rapid rollout of the change. This caused Spot VMs to be stuck in an unexpected state. Consequently, Spot VMs that had initiated their standard termination process due to preemption began to accumulate as they failed to complete termination, creating a backlog that degraded performance for all VM types in some regions. ## Remediation and Prevention Google engineers were alerted to the outage via internal monitoring on 19 May 2025, at 21:08 US/Pacific, and immediately started an investigation. Once the nature and scope of the issue became clear, Google engineers initiated a rollback of the change on 20 May 2025 at 03:29 US/Pacific. The rollback completed at 03:55 US/Pacific, mitigating the impact. Google is committed to preventing a repeat of this issue in the future and is completing the following actions: * Google Cloud employs a robust and well-defined methodology for production updates, including a phased rollout approach as standard practice to avoid rapid global changes. This phased approach is meant to ensure that changes are introduced into production gradually and as safely as possible, however, in this case, the safety checks were not enforced. We have paused further feature flag rollouts for the affected system, while we undertake a comprehensive audit of safety checks and fix any exposed gaps that led to the unplanned rapid rollout of this change. * We will review and address scalability issues encountered by GCE during the incident. * We will improve monitoring coverage of Spot VM deletion workflows. Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business. ## Detailed Description of Impact Customers experienced increased latency for VM control plane operations in us-central1 and us-east4. VM control plane operations include creating, modifying, or deleting VMs. For some customers, Spot VM instances became stuck while terminating. Customers were not billed for Spot VM instances in this state. Furthermore, running virtual machines and the data plane were not impacted. VM control plane latency in the us-central1 and us-east4 regions began increasing at the start of the incident (19 May 2025 20:23 US/Pacific), and peaked around 20 May 2025 03:40 US/Pacific. At peak, median latency went from seconds to minutes, and tail latency went from minutes to hours. 
Several other regions experienced increased tail latency during the outage, but most operations in these regions completed as normal. Once mitigations took effect, median and tail latencies started falling and returned to normal by 05:15 US/Pacific. Customers may have experienced similar latency increases in products which create, modify, failover or delete VM instances: GCE, GKE, Dataflow, Cloud SQL, Google Cloud Dataproc, Google App Engine, Cloud Deploy, Memorystore, Redis, Cloud Filestore, among others.

    View incident
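
The Spot VM report above concerns the platform-side termination workflow, but it is a reminder that workloads on Spot capacity should expect preemption at any time. One common guest-side pattern is to watch the GCE metadata server's preemption flag; a rough sketch follows (the polling approach is illustrative, and the metadata path should be checked against current GCE documentation):

```python
# Sketch: poll the GCE guest metadata server for a preemption notice so the
# workload can checkpoint before termination. Path/headers assumed per GCE
# metadata conventions.
import time
import requests

METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
HEADERS = {"Metadata-Flavor": "Google"}

def wait_for_preemption(poll_seconds: float = 1.0) -> None:
    while True:
        try:
            resp = requests.get(METADATA_URL, headers=HEADERS, timeout=5)
            if resp.ok and resp.text.strip() == "TRUE":
                print("Preemption notice received; checkpointing and shutting down.")
                return
        except requests.RequestException:
            pass  # Metadata server briefly unreachable; keep polling.
        time.sleep(poll_seconds)
```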
  • Incident

    • March 29, 2025 6:15 pm

    # Incident Report ## Summary: On Saturday, 29 March 2025, multiple Google Cloud Services in the us-east5-c zone experienced degraded service or unavailability for a duration of 6 hours and 10 minutes. To our Google Cloud customers whose services were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. ## Root Cause: The root cause of the service disruption was a loss of utility power in the affected zone. This power outage triggered a cascading failure within the uninterruptible power supply (UPS) system responsible for maintaining power to the zone during such events. The UPS system, which relies on batteries to bridge the gap between utility power loss and generator power activation, experienced a critical battery failure. This failure rendered the UPS unable to perform its core function of ensuring continuous power to the system. As a direct consequence of the UPS failure, virtual machine instances within the affected zone lost power and went offline, resulting in service downtime for customers. The power outage and subsequent UPS failure also triggered a series of secondary issues, including packet loss within the us-east5-c zone, which impacted network communication and performance. Additionally, a limited number of storage disks within the zone became unavailable during the outage. ## Remediation and Prevention: Google engineers were alerted to the incident from our internal monitoring alerts at 12:54 US/Pacific on Saturday, 29 March and immediately started an investigation. Google engineers diverted traffic away from the impacted location to partially mitigate impact for some services that did not have zonal resource dependencies. Engineers bypassed the failed UPS and restored power via generator by 14:49 US/Pacific on Saturday, 29 March. The majority of Google Cloud services recovered shortly thereafter. A few services experienced longer restoration times as manual actions were required in some cases to complete full recovery. Google is committed to preventing a repeat of this issue in the future and is completing the following actions: * Harden cluster power failure and recovery path to achieve a predictable and faster time-to-serving after power is restored. * Audit systems that did not automatically failover and close any gaps that prevented this function. * Work with our uninterruptible power supply (UPS) vendor to understand and remediate issues in the battery backup system. Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business. ## Detailed Description of Impact: Customers experienced degraded service or unavailability for multiple Google Cloud products in the us-east5-c zone of varying impact and severity as noted below: **AlloyDB for PostgreSQL:** A few clusters experienced transient unavailability during the failover. Two impacted clusters did not failover automatically and required manual intervention from Google engineers to do the failover. **BigQuery:** A few customers in the impacted region experienced brief unavailability of the product between 12:57 US/Pacific until 13:19 US/Pacific. **Cloud Bigtable:** The outage resulted in increased errors and latency for a few customers between 12:47 US/Pacific to 19:37 US/Pacific. 
**Cloud Composer:** External streaming jobs for a few customers experienced increased latency for a period of 16 minutes. **Cloud Dataflow:** Streaming and batch jobs saw brief periods of performance degradation. 17% of streaming jobs experienced degradation from 12:52 US/Pacific to 13:08 US/Pacific, while 14% of batch jobs experienced degradation from 15:42 US/Pacific to 16:00 US/Pacific. **Cloud Filestore:** All basic, high scale and zonal instances in us-east5-c were unavailable and all enterprise and regional instances in us-east5 were operating in degraded mode from 12:54 to 18:47 US/Pacific on Saturday, 29 March 2025. **Cloud Firestore:** Limited impact of approximately 2 minutes where customers experienced elevated unavailability and latency, as jobs were being rerouted automatically. **Cloud Identity and Access Management:** A few customers experienced slight latency or errors while retrying for a short period of time. **Cloud Interconnect:** All us-east5 attachments connected to zone1 were unavailable for a duration of 2 hours, 7 minutes. **Cloud Key Management Service:** Customers experienced 5XX errors for a brief period of time (less than 4 mins). Google engineers rerouted the traffic to healthy cells shortly after the power loss to mitigate the impact. **Cloud Kubernetes Engine:** Customers experienced terminations of their nodes in us-east5-c. Some zonal clusters in us-east5-c experienced loss of connectivity to their control plane. No impact was observed for nodes or control planes outside of us-east5-c. **Cloud NAT:** Transient control plane outage affecting new VM creation processes and/or dynamic port allocation. **Cloud Router:** Cloud Router was unavailable for up to 30 seconds while leadership shifted to other clusters. This downtime was within the thresholds of most customers' graceful restart configuration (60 seconds). **Cloud SQL:** Based on monitoring data, 318 zonal instances experienced 3h of downtime in the us-east5-c zone. All external high-availability instances successfully failed out of the impacted zone. **Cloud Spanner:** Customers in the us-east5 region may have seen a few minutes of errors or latency increase during the few minutes after 12:52 US/Pacific when the cluster first failed. **Cloud VPN:** A few legacy customers experienced loss of connectivity of their sessions up to 5 mins. **Compute Engine:** Customers experienced instance unavailability and inability to manage instances in us-east5-c from 12:54 to 18:30 US/Pacific on Saturday, 29 March 2025. **Managed Service for Apache Kafka:** CreateCluster and some UpdateCluster commands (those that increased capacity config) had a 100% error rate in the region, with the symptom being INTERNAL errors or timeouts. Based on our monitoring, the impact was limited to one customer who attempted to use these methods during the incident. **Memorystore for Redis:** High availability instances failed over to healthy zones during the incident. 12 instances required manual intervention to bring back provisioned capacity. All instances were recovered by 19:28 US/Pacific. **Persistent Disk:** Customers experienced very high I/O latency, including stalled I/O operations or errors in some disks in us-east5-c from 12:54 US/Pacific to 20:45 US/Pacific on Saturday, 29 March 2025. Other products using PD or communicating with impacted PD devices experienced service issues with varied symptoms. **Secret Manager:** Customers experienced 5XX errors for a brief period of time (less than 4 mins).
Google engineers rerouted the traffic to healthy cells shortly after the power loss to mitigate the impact. **Virtual Private Cloud:** Virtual machine instances running in the us-east5-c zone were unable to reach the network. Services were partially unavailable from the impacted zone. Customers wherever applicable were able to fail over workloads to different Cloud zones.

    View incident
  • Incident

    • March 4, 2025 1:40 pm

    The issue with Apigee Hybrid, Apigee Edge Public Cloud, Apigee has been resolved for all affected users as of Tuesday, 2025-03-04 13:30 US/Pacific. From preliminary analysis, the root cause appears to be due to certificate expiration. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • February 19, 2025 11:56 pm

    The issue with Cloud Asset Inventory has been resolved for all affected users as of Wednesday, 2025-02-19 23:20 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • February 18, 2025 6:41 pm

    # Mini Incident Report We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All Times US/Pacific) Incident Start: 18 February 2025, 14:42 Incident End: 18 February 2025, 18:40 Duration: 3 hours 58 minutes Affected Services and Features: * Chronicle Security * Chronicle SOAR Regions/Zones: Global Description: Starting on 18 February 2025 14:42 US/Pacific, Chronicle Security and Chronicle SOAR experienced an issue where customers encountered the error message "An error occurred while loading dashboards" when attempting to access dashboards in Google Cloud Security for a duration of 3 hours 58 minutes. From preliminary analysis, the root cause of the issue is related to a revocation of associated internal Looker licenses; all revoked licenses were restored to mitigate the issue. Customer Impact: Customers encountered an error message "An error occurred while loading dashboards" when attempting to access certain security dashboards in Google Cloud.

    View incident
  • Incident

    • February 16, 2025 7:49 am

    The issue with Vertex AI Search, Recommendation AI has been resolved for all affected users as of Sunday, 2025-02-16 07:21 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • February 13, 2025 9:39 pm

    The issue with Chronicle Security has been resolved for all affected users as of Thursday, 2025-02-13 21:38 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • February 3, 2025 1:54 am

    The issue with Chronicle Security has been resolved for all affected users as of Monday, 2025-02-03 01:49 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • January 31, 2025 6:13 am

    # Mini Incident Report We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All Times US/Pacific) **Incident Start:** 30 January 2025, 16:53 **Incident End:** 31 January 2025 06:13 **Duration:** 13 hours, 20 minutes **Affected Services and Features:** Chronicle Security **Regions/Zones:** asia-south1, asia-southeast1 **Description:** Chronicle Security WORKSPACE_ACTIVITY data ingestion observed delays in asia-southeast1 and asia-south1 regions for 13 hours, 20 minutes due to network connectivity issues between the trans-Pacific regions. While the network connectivity issues are being fixed, the data ingestion delays were mitigated by increasing the Network Quality of Service (QoS) for this traffic to ensure timely processing. **Customer Impact:** * Chronicle Security customers observed WORKSPACE_ACTIVITY data was delayed in the Secops instance. The impacted regions processed only around one third of the actual traffic. * Rule detections which depend on WORKSPACE_ACTIVITY data may have been delayed. * Search on WORKSPACE_ACTIVITY data may have shown fewer events due to ingestion delays.

    View incident
  • Incident

    • January 28, 2025 8:54 pm

    The issue with Cloud Translation has been resolved for all affected users as of Tuesday, 2025-01-28 20:53 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • January 28, 2025 2:16 pm

    The issue with Cloud Translation has been resolved for all affected users as of Tuesday, 2025-01-28 14:00 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • January 24, 2025 11:20 am

    # Incident Report ## Summary On Friday, 24 January 2025, AppSheet customers were unable to load AppSheet apps with the app editor or the app load page due to ‘500’ errors and timeouts. Around 60% of the requests were impacted in us-east4 and europe-west4 for a duration of 1 hour and 50 minutes. We sincerely apologize to our Google Cloud customers for the disruption you experienced. ## Root Cause A database schema migration in production triggered a cascading incident. The migration caused failures and timeouts on the primary database, disrupting most AppSheet operations and preventing apps loading for users in us-east4 and europe-west4. The sustained outage occurred due to a surge of retries, overloading the secondary authentication database and rendering it completely unresponsive for requests to the affected regions. The authentication database is responsible for storing user authentication tokens. Traffic was migrated to the us-central1 and us-west1 regions, after which issues pertaining to user auth tokens were resolved. However, this triggered an increase in load on our service for validating users’ Workspace license entitlements, due to that information no longer being available in cache. The request rate went up significantly, triggering aggressive load shedding, resulting in elevated latency for 95% of the traffic. This further aggravated latency after traffic migration to us-central1 and us-west1 was performed. ## Remediation and Prevention Google engineers were alerted to the outage via an automated alert on 24 January 2025 09:42 US/Pacific and immediately started an investigation. To mitigate the impact, engineers redirected the traffic from us-east4 and europe-west4 to us-central1 and us-west1. The resultant load shedding that occurred on the licensing server recovered by 11:20 US/Pacific, once we restored our authentication database and gradually reverted traffic to us-east4 and europe-west4. Google is committed to preventing a repeat of this issue in the future and is completing the following actions: - Improve alerting and monitoring of license server traffic to reduce impact on latencies when traffic migration happens. - Gradually reduce dependency on licensing servers to avoid failures arising from either increased traffic, or unavailability of licensing servers. - We are reviewing measures to increase the stability of our authentication database, to ensure optimal handling of any surge in requests. ## Detailed Description of Impact On Friday, 24 January 2025, from 09:30 to 11:20 US/Pacific, approximately 60% of the AppSheet requests in us-east4 and europe-west4 may have failed. - Affected customers were unable to load AppSheet apps with the app editor or the app load page. - Affected customers experienced elevated ‘500’ errors and timeouts. - Some customers may also have observed intermittent latency.

    View incident
  • Incident

    • January 15, 2025 3:43 pm

    The issue with Chronicle Security has been resolved for all affected users as of Wednesday, 2025-01-15 15:30 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • January 13, 2025 6:16 am

    The issue with Vertex Gemini API has been resolved for all affected projects as of Monday, 2025-01-13 06:16 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • January 10, 2025 12:46 am

    The issue with Chronicle Security has been resolved for all affected users as of Friday, 2025-01-10 00:30 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • January 8, 2025 11:43 am

    # Mini Incident Report We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All Times US/Pacific) **Incident Start:** 3 January 2025 12:14 **Incident End:** 8 January 2025 11:08 **Duration:** 5 days, 10 hours, 55 minutes **Affected Services and Features:** Google SecOps (Chronicle Security) - SOAR Permissions **Regions/Zones:** europe, europe-west12, europe-west2, europe-west3, europe-west6, europe-west9, asia-northeast1, asia-south1, asia-southeast1, australia-southeast1, me-central1, me-central2, me-west1, northamerica-northeast2, southamerica-east1 **Description:** Google SecOps (Chronicle Security) experienced an increase in permission errors for non-admin users accessing SOAR cases. From preliminary analysis, the issue was due to a software defect introduced by a recent service update that had been rolled out to non-US regions. The issue was fully mitigated once the affected service update was rolled back, restoring service for all affected users. **Customer Impact:** * When a non-admin user attempted to access the SOAR cases view, they received a 403 error. ---

    View incident
  • Incident

    • January 8, 2025 8:07 am

    # Incident Report ## Summary On Wednesday, 8 January 2025 06:54 to 08:07 US/Pacific, Google Cloud Pub/Sub experienced a service outage in multiple regions resulting in customers unable to publish or subscribe to the messages for a duration of 1 hour and 13 minutes. This outage also resulted in an increased backlog which was identified at 8 January 2025 09:07 US/Pacific for a small subset of customer subscriptions using message ordering[1], which extended beyond the unavailability time window. These subscriptions were repaired and mitigated by 8 January 2025 23:09 US/Pacific. We deeply regret the disruption this outage caused for our Google Cloud customers. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s availability. ## Root Cause Cloud Pub/Sub uses a regional database for the metadata state of its storage system, including information about published messages and the order in which those messages were published for ordered delivery. The regional metadata database is on the critical path of most of the Cloud Pub/Sub data plane operations. From 8 January 2025 06:54 to 07:30 US/Pacific, a bad service configuration change, which unintentionally over-restricted the permission to access this database, was rolled out to multiple regions. The issue did not surface in our pre-production environment due to a mismatch in the configuration between the two environments. In addition, the change was mistakenly rolled out to multiple regions within a short time period and did not follow the standard rollout process. This change prevented Cloud Pub/Sub from accessing the regional metadata store, leading to publish, subscribe, and backlog metrics failures and unavailability impact, which was mitigated on 8 January 2025 08:07 US/Pacific. Though the configuration change was rolled back and mitigated on 8 January 2025 08:07 US/Pacific, the database unavailability during the issue exposed a latent bug in the way Cloud Pub/Sub enforces ordered delivery for subscriptions with ordering enabled. In particular, when the database was unavailable for an extended period of time, the metadata pertaining to ordering became inconsistent with the metadata about published messages. This inconsistency prevented the delivery of a subset of messages until the subscriptions were repaired, and they received all backlogged messages in the proper order. Mitigation was completed by 8 January 2025 23:09 US/Pacific. Note that this did not impact ordering or guaranteed delivery. ## Remediation and Prevention Google engineers were alerted to the outage via internal telemetry on 8 January 2025 07:03 US/Pacific, 9 minutes after impact started. The config change that caused the issue was identified and rollback completed by 8 January 2025 08:07 US/Pacific. At 8 January 2025 09:07 US/Pacific, Google engineers were alerted via internal telemetry to the fact that a small subset of ordered subscriptions were unable to consume their backlog and root caused the metadata inconsistency at 8 January 2025 12:20 US/Pacific. Google engineers worked on identifying and repairing all impacted ordered subscriptions, which was completed by 8 January 2025 23:09 US/Pacific. 
Google is committed to preventing a repeat of this issue in the future and is completing the following actions: * Our engineering team is working on implementing stronger enforcement of parity between pre-production and production environments in order to ensure the impact of configuration changes can be caught before changes move to production. ETA: 31 January 2025. * We are reviewing our change management process to ensure that future configuration changes roll out in a progressive fashion aligned with the priority of the change. ETA: 31 January 2025. * We are working on implementing additional monitoring that proactively detects ordering metadata inconsistency. ETA: 31 March 2025. * We are implementing a fix to the Cloud Pub/Sub ordering metadata management bug, which led to undelivered, ordered messages. ETA: 30 June 2025. ## Detailed Description of Impact On Wednesday 8 January 2025 from 06:54 to 08:07 US/Pacific Google Cloud Pub/Sub, Cloud Logging, and BigQuery Data Transfer Service experienced a service outage in europe-west10, asia-south1, europe-west1, us-central1, asia-southeast2, us-east1, us-east5, asia-south2, us-south1, me-central1 regions. Customers publishing from other regions may have also experienced the issue if the message storage policies [2] are set to store and process the messages in the above-mentioned regions. #### Google Cloud Pub/Sub : Customers were unable to publish or subscribe to the messages in the impacted regions. Publishing the messages from other regions may also have been impacted, if they have any of the impacted regions in their message storage policies. Backlog metrics might have been stale or missing. #### Google BigQuery Data Transfer Service : Customers experienced failures with data transfers runs failing to publish to Pub/Sub for a duration of 20 minutes. #### Cloud Logging : All Cloud Logs customers exporting logs to Cloud Pub/Sub experienced a delay in the log export for a duration of 26 minutes. **Appendix:** * [1] https://cloud.google.com/pubsub/docs/ordering * [2] https://cloud.google.com/pubsub/docs/resource-location-restriction#message_storage_policy_overview

    View incident
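
The Pub/Sub report above concerns subscriptions with message ordering enabled. For reference, a minimal sketch of ordered publishing with the google-cloud-pubsub Python client; the project, topic, and ordering key are placeholders:

```python
# Minimal sketch: publish with an ordering key so messages sharing that key
# are delivered in publish order to ordering-enabled subscriptions.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic_path = publisher.topic_path("my-project", "my-topic")

for i in range(3):
    future = publisher.publish(
        topic_path, f"event-{i}".encode("utf-8"), ordering_key="customer-123"
    )
    future.result()  # block until the message ID is returned

# After a publish failure, ordered publishing for that key pauses until resumed:
# publisher.resume_publish(topic_path, ordering_key="customer-123")
```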
  • Incident

    • January 7, 2025 9:18 pm

    The issue with Vertex Gemini API has been resolved for all affected users as of Tuesday, 2025-01-07 21:18 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • January 7, 2025 3:30 pm

    The issue with Apigee has been resolved for all affected users as of Tuesday, 2025-01-07 15:22 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • January 6, 2025 12:24 pm

    The issue with Google Cloud Console, Google Cloud Support has been resolved for all affected users as of Monday, 2025-01-06 11:45 US/Pacific. Only a few chat instances were impacted during the incident. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • December 19, 2024 5:35 pm

    The issue with Vertex Gemini API has been resolved for all affected users as of Thursday, 2024-12-19 17:30 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • December 19, 2024 11:05 am

    # Mini Incident Report We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All Times US/Pacific) **Incident Start:** 19 December, 2024 08:15 **Incident End:** 19 December, 2024 10:55 **Duration:** 2 hours, 40 minutes **Affected Services and Features:** Vertex Gemini API **Regions/Zones:** multi-region:asia **Description:** Google Vertex Gemini API customers sending traffic to gemini models in multi-region:asia experienced an increase in 5xx errors up to 100% for a duration of 2 hours, 40 minutes. From preliminary analysis, the root cause of the incident stemmed from a Vertex service dependency ending up in a bad state due to a configuration process. The situation was mitigated by increasing resources, enabling successful autopilot scaling and reducing customer traffic errors. Additionally, the Vertex service dependency received an update to fix the configuration, preventing recurrence of the incident. **Customer Impact:** Customers sending traffic to gemini models in multi-region:asia experienced an increase in 5xx errors up to 100%.

    View incident
  • Incident

    • December 14, 2024 1:57 pm

    The issue with Mandiant Attack Surface Management has been resolved for all affected projects as of Saturday, 2024-12-14 13:56 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • December 17, 2024 4:16 pm

    The issue with Mandiant Attack Surface Management has been resolved for all affected users as of Tuesday, 2024-12-17 16:15 US/Pacific. We thank you for your patience while we worked on resolving the issue. Thank you for choosing us.

    View incident
  • Incident

    • December 13, 2024 10:28 am

    The issue with Google Cloud Support has been resolved for all affected projects as of Friday, 2024-12-13 10:14 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • December 13, 2024 11:01 am

    The issue with Vertex Gemini API has been resolved for all affected users as of Friday, 2024-12-13 09:00 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • December 13, 2024 6:03 am

    # Mini Incident Report We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All Times US/Pacific) **Incident Start:** 11 December 2024 17:15 **Incident End:** 13 December 2024 06:03 **Duration:** 1 day, 12 hours, 48 minutes **Affected Services and Features:** Vertex Gemini API **Regions/Zones:** multiregion-us **Description:** The Vertex Gemini API experienced elevated latency and errors for the gemini-1.5-flash-002 model in multiregion-us for a duration of 1 day, 12 hours, and 48 minutes. From preliminary analysis, the root cause of the issue was a significant increase in decode processing load stemming from a large spike in response content heavy traffic. Google engineers mitigated the issue by increasing processing capacity and updating a resource allocation configuration to alleviate stress on processing. **Customer Impact:** Customers would have experienced increased latency or errors for the gemini-1.5-flash-002 model in multiregion-us.

    View incident
  • Incident

    • December 9, 2024 5:01 pm

    The issue with Apigee Edge Public Cloud, Apigee Hybrid & Apigee has been resolved for all affected projects as of Monday, 2024-12-09 16:45 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • December 9, 2024 11:50 am

    # Incident Report ## Summary Starting on Monday, 9 December 2024 09:24 US/Pacific, some Google BigQuery customers in the US multi-region encountered failures on 80-90% of requests to the insertAll API, with a ‘5xx’ error code along with increased latency. BigQuery Write API customers also saw an increase in latency for some requests during this time. Also, Dataflow customers may have experienced an increase in latency by 5-15%. The impact lasted for a duration of 2 hours and 16 minutes. To our BigQuery and Dataflow customers who were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. ## Root Cause Dataflow depends on BigQuery to write streaming data to a table. BigQuery uses a streaming backend service to persist data. This streaming backend service in one of the clusters in the US multi-region experienced a high number of concurrent connections caused by a slightly higher than usual customer traffic fluctuation. This backend service enforces a limit on the number of such concurrent connections for flow control and a bug in its mechanism prevented backends from accepting new requests once this limit was reached. Due to reduced bandwidth in the backend servers in the cluster, the regional streaming frontends experienced higher latency and started accumulating more inflight requests. This led to their overload and regional impact to BigQuery and Dataflow streaming customers outside the affected cluster. ## Remediation and Prevention Google engineers were alerted to the outage by our internal monitoring system on Monday, 9 December 2024 at 09:33 US/Pacific and immediately started an investigation. After a thorough investigation, the impacted backend cluster was identified. Initial mitigation attempt focused on reducing server load through traffic throttling. To achieve complete mitigation, our engineers then drained the affected cluster, resulting in immediate and complete recovery. Google is committed to preventing this issue from repeating in the future and is completing the following actions: * Fix the root cause in the backend service to handle surges in concurrent connections and avoid zonal impact. * Improve testing coverage of the backend service to prevent similar issues. * Enhance the ability to detect and automatically mitigate similar cases of zonal impact. * Improve isolation to prevent issues in a particular cluster or availability zone from impacting all users in the region. ## Detailed Description of Impact On Monday 9 December 2024 from 09:24 to 11:40 US/Pacific BigQuery and Dataflow customers experienced increased latency and elevated error rates in US multi-region. ### Google BigQuery * 80-90% of all requests for the insertAll API failed with a ‘5xx’ status code in US multi-region. Tail latency also increased substantially from <100ms to ~30 seconds during this time. * Additionally, AppendRows requests for the Write API saw increased tail latency (99.99%) from <3 seconds to ~30 seconds during this time. ### Cloud Dataflow * 5-15% of Dataflow streaming jobs may have experienced increased latency in us-east1, us-east4, us-west1, us-west2 and us-central1 regions for the duration of the incident.

    View incident
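
The insertAll API referenced above is BigQuery's legacy streaming path. A minimal sketch of a streaming insert via the google-cloud-bigquery Python client, whose insert_rows_json call wraps insertAll; the table ID and row values are placeholders:

```python
# Minimal sketch of a BigQuery streaming insert (tabledata.insertAll).
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"

rows = [{"event": "signup", "ts": "2024-12-09T09:30:00Z"}]

# During the incident above, calls like this returned 5xx errors and needed
# client-side retries; insert_rows_json returns per-row errors on success paths.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print(f"insertAll reported row errors: {errors}")
```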
  • Incident

    • December 9, 2024 9:48 pm

    The issue with Vertex AI Search, Recommendation AI has been resolved for all affected users as of Monday, 2024-12-09 21:48 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • December 7, 2024 2:58 pm

    The issue with Vertex AI Search, Recommendation AI has been resolved for all affected users as of Saturday, 2024-12-07 14:58 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • December 4, 2024 6:28 pm

    # Incident Report ## Summary Starting on 4 December 2024 at 14:30 US/Pacific, Google BigQuery experienced elevated invalid value and internal system errors globally for traffic related to BigQuery and Google Drive integration for 3 hours and 25 minutes. The incident affected users and tasks attempting to export data to Google Drive, resulting in failed export jobs. The impacted users would have encountered “API key not valid” and “Failed to read the spreadsheet” errors for export jobs when accessing Google Drive. This resulted in service unavailability or failing jobs for the duration of this disruption for Google BigQuery. To our BigQuery customers whose business analytics were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. ## Root Cause This disruption of data export functionality was triggered by the deletion of an internal project containing essential API keys. This deletion was an unintended consequence of several contributing factors: - An internally used API key was flagged for Google policy non-compliance and deemed no longer in use, which led to the deletion of the API key. - Unclear Internal Google Project Ownership: The project ownership was not clearly recorded, and thus incorrectly associated with a deprecated service. - Outdated Information: The combination of the perceived lack of recent activity and the incorrect ownership led to the project being mistakenly classified as abandoned and deprecated. ## Remediation and Prevention Google engineers were alerted to the service degradation via a support case on 4 December 2024, at approximately 14:21 US/Pacific when users began experiencing failures in data export operations. Google engineers were also alerted to the service disruption through internal monitoring systems and user reports. Upon investigation, the deletion of the project was identified as the root cause. To mitigate the impact, the project was restored at approximately 15:45 US/Pacific. This action successfully recovered the API keys and over time restored the data export functionality for all affected users. The final error related to this incident was observed at approximately 17:55 US/Pacific, indicating full service recovery. Google is committed to continually improving our technology and operations to prevent service disruptions. We apologize for any inconvenience this incident may have caused and appreciate your understanding. Google is committed to preventing a repeat of this issue in the future and is completing the following actions. - **Remove dependency on API keys for BigQuery integrations with other Google services:** This will eliminate the entire failure mode. - **Implement accidental deletion protection for critical internal resources:** Use mechanisms like project liens to ensure that a critical resource cannot be deleted accidentally. - **Enhance Project Metadata:** We are implementing a process for regular review and validation of project ownership and metadata. This will ensure that critical information about project usage and status is accurate and up-to-date, reducing the risk of incorrect assumptions about project status. - **Strengthen Internal Processes and Access Controls:** We are strengthening our processes for deprecating and deleting projects, including mandatory reviews, impact assessments, and stakeholder approvals.
This will prevent accidental deletion of critical projects and ensure that all potential impacts are thoroughly evaluated before any action is taken. We are also strengthening access controls for project deletion, ensuring that only authorized personnel with appropriate approvals can perform this action. This will add an additional layer of protection against unintended project deletion. ## Detailed Description of Impact Starting on 4 December 2024, Google BigQuery experienced elevated error rates for data export operations to Google Drive globally. Between approximately 14:21 and 18:04 US/Pacific, users attempting to export data from BigQuery to Google Drive encountered failures, resulting in service disruption for this specific functionality. The incident affected all regions and impacted users encountered errors such as "API key not valid," "Failed to read the spreadsheet," or "[Error: 80324028]". Internal error messages further specified the issue as "Dremel returned third-party error from GDRIVE: FAILED_PRECONDITION: Encountered an error while creating temporary directory" with an underlying status of "Http(400) Bad Request, API key not valid. Please pass a valid API key." ### Google BigQuery This disruption specifically impacted users and automated tasks relying on the BigQuery to Google Drive export functionality. Export jobs initiated during this period failed to complete, preventing data transfer and potentially impacting downstream processes and workflows dependent on this data.

    View incident
  • Incident

    • November 28, 2024 4:52 am

    The issue with Vertex AI AutoML Image has been resolved for all affected users as of Thursday, 2024-11-28 04:48 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • November 25, 2024 7:49 pm

    The issue with Looker Studio has been resolved for all affected users as of Monday, 2024-11-25 19:48 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • November 21, 2024 2:03 pm

    The issue with AppSheet has been resolved for all affected users as of Thursday, 2024-11-21 13:16 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • November 21, 2024 10:18 pm

    The issue with Mandiant Attack Surface Management has been resolved for all affected users as of Thursday, 2024-11-21 20:15 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • November 16, 2024 4:14 am

    # Incident Report ## Summary On 16 November 2024 at 00:47 US/Pacific, a combination of fiber failures and a network equipment fault led to reduced network capacity between the asia-southeast2 region and other GCP regions. The failures were corrected and minimum required capacity recovered by 02:13 US/Pacific. To our GCP customers whose businesses were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. ## Root Cause and Impact Google’s global network is designed and built to ensure that any occasional capacity loss events are not noticeable and/or have minimal disruption to customers. We provision several diverse network paths to each region and maintain sufficient capacity buffers based on the measured reliability of capacity in each region. Between 12 November and 16 November, two separate fiber failures occurred near the asia-southeast2 region. These failures temporarily reduced the available network capacity between the asia-southeast2 region and other GCP regions, but did not impact the availability of GCP services in the region. Google engineers were alerted of these failures as soon as they occurred and were working with urgency on remediating these fiber failures but had not yet completed full recovery. On 16 November 2024 at 00:47 US/Pacific, a latent software defect impacted a backbone networking router in the asia-southeast2 region resulting in further reduction of available inter-region capacity and exhausted our reserve network capacity buffers causing multiple Google Cloud services in the region to experience high latency and/or elevated error rates for operations requiring inter-region connectivity. During this time, customers in asia-southeast2, would have experienced issues with managing and monitoring existing resources, creating new resources, and data replication to other regions. To mitigate the impact, Google engineers re-routed Internet traffic away from the asia-southeast2 region to be served from other GCP regions, primarily asia-southeast1 while working in parallel to recover the lost capacity. The faulty backbone networking router was recovered on 16 November 2024 02:13 US/Pacific. This ended the elevated network latency and error rates for most of the impacted GCP services’ operations. Recovery of the first failed fiber was completed on 18 November 08:45 US/Pacific and the second failed fiber was restored at 09:00 US/Pacific on the same day. ## Remediation and Prevention We’re taking the following actions to reduce the likelihood of recurrence and time to mitigate impact of this type of incident in the future: - During the incident, our actions to reroute traffic away from the asia-southeast2 region and recover the faulty backbone networking router took longer than expected as the loss of capacity hindered our visibility of required networking telemetry and functionality of emergency tooling. We’re reviewing these gaps to implement the required improvements to our network observability, emergency tools and incident response playbooks. - Work with our fiber partners in the asia-southeast2 region to ensure our fiber paths between facilities in the region and to submarine cable landing stations are on the most reliable routes available, and have adequate preventative maintenance and repair processes.

    View incident
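
    Purely as an illustration of the capacity-buffer reasoning in the root cause above, here is a minimal sketch. The path names, capacities, and demand figure are hypothetical, not Google's actual topology or numbers; the point is only that two failures can leave the buffer intact while a third exhausts it.

    ```python
    # Hypothetical illustration of inter-region capacity buffers.
    # Path names, capacities (Gbps), and demand are invented for this sketch.
    from dataclasses import dataclass

    @dataclass
    class Path:
        name: str
        capacity_gbps: float
        healthy: bool = True

    def available_capacity(paths: list[Path]) -> float:
        """Total capacity of the paths still carrying traffic."""
        return sum(p.capacity_gbps for p in paths if p.healthy)

    def buffer_remaining(paths: list[Path], demand_gbps: float) -> float:
        """Positive = spare (reserve) capacity; negative = congestion."""
        return available_capacity(paths) - demand_gbps

    paths = [
        Path("fiber-a", 400.0),
        Path("fiber-b", 400.0),
        Path("backbone-router-1", 400.0),
        Path("backbone-router-2", 400.0),
    ]
    demand = 700.0  # steady-state inter-region traffic, Gbps (hypothetical)

    paths[0].healthy = False  # first fiber failure
    paths[1].healthy = False  # second fiber failure
    print(buffer_remaining(paths, demand))  # 100.0 -> reduced buffer, demand still met

    paths[2].healthy = False  # latent router defect removes a third path
    print(buffer_remaining(paths, demand))  # -300.0 -> reserve exhausted, congestion
    ```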
  • Incident

    • November 15, 2024 4:52 pm

    The issue with Google Cloud Support has been resolved for all affected customers as of Friday, 2024-11-15 16:35 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • November 15, 2024 4:09 am

    # Incident Report

    ## Summary

    On 14 November 2024 at 21:57 US/Pacific, our support ticketing system, which handles Google Cloud, Billing, and Workspace customer support, experienced an unexpected issue during a vendor-planned maintenance event, causing the system to become unavailable. Throughout the incident duration of 6 hours and 12 minutes, customers were unable to update existing chat, portal, or email cases. Customers who attempted to create a support case were presented with our backup contact method and were able to receive support through it; this method remained available throughout the outage.

    ## Root Cause

    The outage was triggered by a vendor-initiated change that impacted the performance of our support ticket persistence layer. This update inadvertently caused unavailability, specifically in the query subsystem of our support case management tool. After the configuration change was applied, the subsystem became unresponsive, preventing the processing of any read or write commands. As a result, customers and Google Support were unable to access or update support ticket data, leading to service disruption.

    ## Remediation and Prevention

    Our monitoring systems detected elevated error rates and, at 22:09 US/Pacific, alerted our engineering team, who immediately started an investigation with the vendor. The vendor’s incident team concluded that the query subsystem state would not be resolved by a configuration rollback. The vendor’s engineering team prepared a new update, validated it in a test environment, and applied it to production, returning the system to service on 15 November 2024 at 04:09 US/Pacific.

    We are taking immediate steps with the vendor to prevent a recurrence and improve reliability in the future:

    - A production change freeze for the vendor’s query subsystem is in place until rollout safeguards are sufficient to prevent further impact.
    - We are working with the vendor to improve their change management process to ensure safer rollouts and earlier detection of problems during rollouts.
    - We will perform a review of rollback safety for configuration changes with the vendor to ensure rollback is always possible, reducing recovery time (a hypothetical sketch of such a rollback-safety gate follows this incident entry).

    ## Detailed Description of Impact

    Starting on 14 November 2024 at 21:57 US/Pacific:

    - Customers observed increased latency and required multiple attempts when opening support cases. Customers were able to use the backup case creation process to receive support.
    - Customers were able to send and receive updates to existing support cases by email, but were not able to update cases using the support portal. Support agents were able to send update emails for cases, create proactive bugs, and fill in Contact-Us-Forms (CUFs) on behalf of their customers. However, responding via the support portal was unavailable.
    - Customers with active chat support cases were unable to continue their conversation. Error messages received by customers included options for continuing support via the Contact-Us-Forms (CUFs).
    - All contractual obligations for support requests submitted through the Contact-Us-Forms (CUFs) were fulfilled.

    View incident
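
    The change-freeze and rollback-safety items above lend themselves to a small illustration. The sketch below is hypothetical: the function and field names are invented and do not come from the vendor's tooling; it only shows the general shape of a pre-rollout gate that refuses a configuration change whose rollback has not been exercised.

    ```python
    # Hypothetical pre-rollout gate; names and fields are invented for this sketch.
    from dataclasses import dataclass

    @dataclass
    class ConfigChange:
        change_id: str
        description: str
        rollback_tested: bool        # was a rollback exercised in a test environment?
        freeze_exempt: bool = False  # explicitly approved during a change freeze

    class RolloutBlocked(Exception):
        pass

    def apply_change(change: ConfigChange, change_freeze_active: bool) -> None:
        if change_freeze_active and not change.freeze_exempt:
            raise RolloutBlocked(f"{change.change_id}: production change freeze in effect")
        if not change.rollback_tested:
            raise RolloutBlocked(f"{change.change_id}: rollback path not verified before rollout")
        print(f"applying {change.change_id}: {change.description}")

    # A change whose rollback was never exercised is rejected before it can leave
    # the query subsystem in a state that a rollback cannot repair.
    risky = ConfigChange("cfg-1042", "tune persistence-layer query planner", rollback_tested=False)
    try:
        apply_change(risky, change_freeze_active=False)
    except RolloutBlocked as err:
        print(f"blocked: {err}")
    ```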
  • Incident

    • November 18, 2024 2:29 am

    The issue with Vertex Gemini API has been resolved for all affected users as of Monday, 2024-11-18 02:29 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • November 11, 2024 10:50 am

    The issue with Cloud Security Command Center is believed to be affecting a very small number of projects and our Engineering Team is working on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here. We thank you for your patience while we're working on resolving the issue.

    View incident
  • Incident

    • November 8, 2024 10:33 pm

    The issue with Vertex Gemini API has been resolved for all affected users as of Friday, 2024-11-08 22:32 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • November 8, 2024 9:52 am

    The issue with Google Cloud Support has been resolved for all affected users as of Friday, 2024-11-08 09:08 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • November 8, 2024 12:07 am

    The issue with Mandiant Threat Intelligence has been resolved for all affected users as of Thursday, 2024-11-07 22:07 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • November 7, 2024 8:06 am

    The issue with Chronicle Security has been resolved for all affected users as of Thursday, 2024-11-07 08:00 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • November 6, 2024 6:21 am

    The issue with Looker Studio has been resolved for all affected users as of Wednesday, 2024-11-06 05:30 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • October 29, 2024 3:43 pm

    # Incident Report

    ## Summary

    On Tuesday, 29 October 2024 at 16:21 US/Pacific, a power voltage swell event impacted the Network Point of Presence (PoP) infrastructure in one of the campuses supporting the australia-southeast2 region, causing networking devices to unexpectedly reboot. The impacted network paths were recovered after the reboot and connectivity was restored at 16:43 US/Pacific.

    To our Google Cloud customers whose businesses were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer, and we are taking immediate steps to improve the platform’s performance and resilience.

    ## Impact

    GCP operates multiple PoPs in the australia-southeast2 region to connect Cloud zones to each other and to Google’s global network. Google maintains sufficient capacity to ensure that occasional failures of network capacity are not noticeable to customers, or cause only minimal disruption. On Tuesday, 29 October 2024, two fiber failures occurred near this region. These failures reduced the available inter-region network capacity, but did not impact the availability of GCP services in the region. Later that same day, the PoP hosting 50% of the region’s network capacity experienced a power voltage swell event that caused the networking devices in the PoP to reboot. While those devices rebooted, that networking capacity was temporarily unavailable.

    This rare triple failure resulted in reduced connectivity across zones in the region and caused multiple Google Cloud services to lose connectivity for 21 minutes. Additionally, customers using zones A and C experienced up to 15% increased latency and error rates for 16 minutes due to degraded inter-zone network connectivity while the networking devices recovered from the reboot.

    Google engineers were already working on remediating the two fiber failures when they were alerted to the network disruption caused by the voltage swell via an internal monitoring alert on Tuesday, 29 October 2024 at 16:29 US/Pacific, and they immediately started an investigation. After the devices had completed rebooting, the impacted network paths were recovered and all affected Cloud zones regained full connectivity to the network at 16:43 US/Pacific. The majority of GCP services impacted by the issue recovered shortly thereafter. A few Cloud services experienced longer restoration times because manual actions were required in some cases to complete full recovery.

    **(Updated on 04 December 2024)**

    ## Root cause of device reboots

    Upon review of the power management systems, Google engineers have determined that there is a mismatch in voltage tolerances in the affected network equipment on the site. The affected network racks are designed to tolerate up to ~110% of designed voltage, but the UPS that supplies power to the network equipment is designed to tolerate up to ~120% of designed voltage. The voltage swell event caused a deviation between 110% and 120%, which the networking equipment’s rectifiers detected as outside their allowable range; they powered down in order to protect the equipment (a small worked example of these tolerance bands follows this incident entry).

    We have determined that a voltage regulator for the datacenter-level UPS was enabled, which is a standard setting. This caused an additional boost during power fluctuations, pushing the voltage into the problematic 110%-120% range. The voltage regulator is necessary to ensure sufficient voltage when loads are high, but because the equipment is currently well below its load limit, the boost caused a deviation above expected levels.

    ## Remediation and Prevention

    We are taking the following actions to prevent a recurrence and improve reliability in the future:

    * Review our datacenter power distribution design in the region and implement any recommended additional protections for critical networking devices against voltage swells and sags.
    * As an initial risk reduction measure, reconfigure the datacenter-level UPS voltage regulator; we have also instituted monthly reviews to ensure it is configured to correctly match future site load.
    * Deploy a double conversion UPS in the affected datacenter’s power distribution design for the equipment that failed during this event.
    * Implement changes to network device configuration that reduce the time to recover full network connectivity after a failure.
    * Review and verify that this risk does not exist in other locations, and if it does, proactively perform the above remediations.
    * Root cause and determine corrective actions for GCP services that did not recover quickly after network connectivity was restored.

    ----

    To summarize, the root cause investigation for the network device reboots impacting the australia-southeast2 region has been concluded. Google teams are implementing additional preventative measures to minimize the risk of recurrence. This is the final version of the Incident Report.

    ----

    View incident
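
    To make the ~110% / ~120% mismatch concrete, here is a small worked example. The 240 V nominal figure and the swell amplitude are assumptions for illustration; the report only states the percentage tolerances.

    ```python
    # Worked example of the tolerance mismatch described above.
    # The 240 V nominal and the 275 V swell are assumed values, not from the report.
    NOMINAL_V = 240.0
    RACK_TOLERANCE = 1.10  # network-rack rectifiers shut down above ~110% of nominal
    UPS_TOLERANCE = 1.20   # the upstream UPS rides through up to ~120% of nominal

    rack_limit = NOMINAL_V * RACK_TOLERANCE  # ~264 V
    ups_limit = NOMINAL_V * UPS_TOLERANCE    # ~288 V

    # A swell, boosted further by the UPS voltage regulator, lands between the two
    # limits: the UPS keeps supplying power, but the racks power down to protect
    # themselves, which is what rebooted the networking devices.
    swell_v = 275.0
    print(f"UPS rides through the swell: {swell_v <= ups_limit}")   # True
    print(f"network racks stay powered:  {swell_v <= rack_limit}")  # False
    ```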
  • Incident

    • October 28, 2024 3:09 pm

    The issue with Colab Enterprise has been resolved for all affected users as of Monday, 2024-10-28 15:43 US/Pacific. We thank you for your patience while we worked on resolving the issue.

    View incident
  • Incident

    • November 4, 2024 8:30 pm

    # Mini Incident Report

    We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

    (All times US/Pacific.)

    **Incident Start:** 22 October 2024, 14:08
    **Incident End:** 4 November 2024, 20:30
    **Duration:** 13 days, 6 hours, 22 minutes
    **Affected Services and Features:** Vertex AI Online Prediction (Vertex Model Garden Deployments)
    **Regions/Zones:** All regions except asia-southeast1, europe-west4, us-central1, us-east1, us-east4

    **Description:** Deployment of large models (those that require more than 100 GB of disk) in Vertex AI Online Prediction (Vertex Model Garden Deployments) failed in most regions for up to 13 days, 6 hours, and 22 minutes, starting on Tuesday, 22 October 2024 at 14:08 US/Pacific. From preliminary analysis, the root cause is an internal storage provisioning configuration error introduced as part of a recent change (a hypothetical sketch of this failure mode follows this incident entry). Google engineers mitigated the impact by rolling back the configuration change that caused the issue.

    **Customer Impact:**
    - Customers would have received errors stating “Model server never became ready” while performing deployments during the period of impact.

    **Additional details:**
    - As a workaround, customers were able to deploy in one of the non-impacted regions noted above.

    View incident
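
    The mini report attributes the failures to a storage-provisioning configuration error affecting models that need more than 100 GB of disk. The sketch below is a rough, hypothetical picture of that failure mode; the config structure, names, and numbers are invented and are not Vertex AI's actual provisioning logic.

    ```python
    # Hypothetical sketch of the failure mode; names and numbers are invented.
    from dataclasses import dataclass

    @dataclass
    class ProvisioningConfig:
        max_model_disk_gb: int  # largest disk the serving VM will be provisioned with

    class DeploymentError(Exception):
        pass

    def deploy_model(model_disk_gb: int, config: ProvisioningConfig) -> None:
        if model_disk_gb > config.max_model_disk_gb:
            # The model server never gets enough disk to load its weights, so the
            # readiness check times out -- surfaced to customers as the error below.
            raise DeploymentError("Model server never became ready")
        print(f"deployed model needing {model_disk_gb} GB of disk")

    before_change = ProvisioningConfig(max_model_disk_gb=500)  # large models fit
    after_change = ProvisioningConfig(max_model_disk_gb=100)   # regression from the config change

    deploy_model(250, before_change)      # succeeds
    try:
        deploy_model(250, after_change)   # fails during the incident window
    except DeploymentError as err:
        print(f"deployment failed: {err}")
    ```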
View status →