On June 2, 2026, users experienced degraded sandbox creation, start, and resume performance in the US region. Some sandboxes may also have been terminated unexpectedly due to memory pressure on affected infrastructure nodes. The US cluster was restored to normal operation at 11:39 AM PDT (18:39 UTC), and the incident was fully resolved at 1:11 PM PDT (20:11 UTC).
Users in the US region may have experienced:
Elevated error rates for sandbox operations
Increased latency when creating, starting, or resuming sandboxes
Sandbox terminations caused by memory pressure on affected infrastructure nodes
The incident was caused by exhausted compute capacity in our US cluster. Once demand exceeded what the available node pool could serve, sandbox creation, start, and resume operations began to degrade.
Recovery was prolonged because additional compute capacity was not immediately available in the affected region. During the incident, we introduced temporary rate limits to reduce pressure on the cluster and added capacity as it became available, allowing service to progressively recover until normal operation was restored.
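For readers unfamiliar with the technique, a temporary rate limit of the kind described above is often built as a token bucket. The following is a minimal sketch for illustration only, not the limiter that was deployed during this incident: the class name `TokenBucket`, its parameters, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Illustrative token-bucket limiter (hypothetical, not the production limiter)."""

    rate: float      # tokens refilled per second (assumed value chosen by the operator)
    capacity: float  # maximum burst size, in tokens
    tokens: float = field(init=False)
    updated: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        # Start with a full bucket so an idle caller can burst up to `capacity`.
        self.tokens = self.capacity

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True if a request costing `cost` tokens may proceed at time `now`.

        `now` is a timestamp in seconds from a monotonic clock, for example
        time.monotonic(). It is passed in explicitly to keep the logic testable.
        """
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In a sketch like this, each sandbox create, start, or resume request would call `allow(time.monotonic())` and be rejected or queued when it returns `False`. Lowering `rate` reduces the load admitted to a constrained cluster, and raising it again as capacity is added lets service recover progressively.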
9:23 AM PDT / 16:23 UTC – Incident declared after degraded sandbox start and resume performance was observed. Root cause identified as exhausted compute capacity in the US cluster.
Shortly after – Temporary rate limits were introduced to reduce pressure on the cluster and improve stability.
Throughout the incident – Additional capacity gradually became available and was added to the cluster, allowing sandbox operations to progressively return to normal.
11:39 AM PDT / 18:39 UTC – The US cluster returned to normal operation.
1:11 PM PDT / 20:11 UTC – Incident fully resolved.
During the incident, we took the following actions:
Identified exhausted compute capacity as the source of the degradation
Introduced temporary rate limits to reduce pressure on the cluster
Added capacity to the cluster as it became available
Continued recovery efforts until sandbox operations returned to normal
To reduce the risk of recurrence, we are working on the following:
Improved capacity monitoring – Fixing gaps in memory and utilization metrics so we receive earlier warning before capacity is exhausted.
Multi-region support – Expanding sandbox scheduling across multiple GCP regions to reduce reliance on a single cluster and improve resilience during regional capacity constraints.
Additional instance types – Expanding support for additional node types to reduce dependence on any single capacity pool and improve our ability to acquire capacity during regional shortages.
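As a rough illustration of the earlier-warning idea behind the capacity monitoring item above, the sketch below classifies cluster memory utilization against two thresholds. It is an assumption-laden example rather than a description of our monitoring stack: the `NodeSample` type, the function names, and the 75% and 90% thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NodeSample:
    """One memory reading for a node (hypothetical shape, for illustration)."""

    name: str
    memory_used_gib: float
    memory_total_gib: float


def cluster_utilization(nodes: Iterable[NodeSample]) -> float:
    """Return used memory divided by total memory across all sampled nodes."""
    samples = list(nodes)
    total = sum(n.memory_total_gib for n in samples)
    if total <= 0:
        raise ValueError("no memory capacity reported")
    used = sum(n.memory_used_gib for n in samples)
    return used / total


def capacity_alert(
    nodes: Iterable[NodeSample],
    warn_at: float = 0.75,      # assumed threshold, not a real setting
    critical_at: float = 0.90,  # assumed threshold, not a real setting
) -> str:
    """Classify cluster memory utilization as 'ok', 'warning', or 'critical'."""
    utilization = cluster_utilization(nodes)
    if utilization >= critical_at:
        return "critical"
    if utilization >= warn_at:
        return "warning"
    return "ok"
```

The point of a separate warning level is lead time: an alert that fires well below exhaustion leaves room to request capacity before sandbox operations degrade.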
We apologize for the disruption this caused. We are improving our capacity planning, monitoring, and scaling systems to reduce the likelihood of similar incidents in the future.
Please reach out to our support team if you have any questions.