Degraded sandbox creation

Post-Mortem: Degraded Sandbox Performance – June 2, 2026

Summary

On June 2, 2026, users in the US region experienced degraded performance when creating, starting, and resuming sandboxes. Some sandboxes may also have been terminated unexpectedly because of memory pressure on affected infrastructure nodes. The US cluster was restored to normal operation at 11:39 AM PDT (18:39 UTC), and the incident was fully resolved at 1:11 PM PDT (20:11 UTC).

Impact

Users in the US region may have experienced:

Elevated error rates for sandbox operations
Increased latency when creating, starting, or resuming sandboxes
Sandbox terminations caused by memory pressure on affected infrastructure nodes

Root Cause

The incident was caused by exhausted compute capacity in our US cluster. As demand exceeded the available node pool, sandbox creation, start, and resume operations began to degrade.
Recovery was prolonged because additional compute capacity was not immediately available in the affected region. During the incident, we introduced temporary rate limits to reduce pressure on the cluster and added capacity as it became available, allowing service to progressively recover until normal operation was restored.
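
For illustration, temporary rate limits of the kind described above are often implemented as a token bucket. The sketch below is a minimal example under stated assumptions, not the limiter used during the incident: the class name TokenBucket, its parameters, and the rate and capacity values are all hypothetical.

```python
import time


class TokenBucket:
    """Hypothetical limiter: admits `rate` operations per second on average,
    with bursts of up to `capacity` operations."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens refilled per second
        self.capacity = float(capacity)  # maximum tokens the bucket can hold
        self.clock = clock               # injectable clock, useful for testing
        self.tokens = self.capacity      # start with a full bucket
        self.last = self.clock()

    def allow(self, cost=1.0):
        """Return True and consume `cost` tokens if the operation may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    # Assumed values: 5 sandbox creations per second, bursts of up to 10.
    limiter = TokenBucket(rate=5, capacity=10)
    admitted = sum(limiter.allow() for _ in range(25))
    print(f"admitted {admitted} of 25 immediate requests")
```

Requests rejected by allow() would typically be retried with backoff or returned to the caller as a throttling error, which reduces load on the cluster while capacity is being added.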

Timeline

9:23 AM PDT / 16:23 UTC – Incident declared after degraded sandbox start and resume performance was observed. Root cause identified as exhausted compute capacity in the US cluster.
Shortly after – Temporary rate limits were introduced to reduce pressure on the cluster and improve stability.
Throughout the incident – Additional capacity gradually became available and was added to the cluster, allowing sandbox operations to progressively return to normal.
11:39 AM PDT / 18:39 UTC – The US cluster returned to normal operation.
1:11 PM PDT / 20:11 UTC – Incident fully resolved.

Response

Identified exhausted compute capacity as the source of the degradation
Introduced temporary rate limits to reduce pressure on the cluster
Added capacity to the cluster as it became available
Continued recovery efforts until sandbox operations returned to normal

What We're Doing to Prevent This

Improved capacity monitoring – Fixing gaps in memory and utilization metrics so we receive earlier warning before capacity is exhausted.
Multi-region support – Expanding sandbox scheduling across multiple GCP regions to reduce reliance on a single cluster and improve resilience during regional capacity constraints.
Additional instance types – Expanding support for additional node types to reduce dependence on any single capacity pool and improve our ability to acquire capacity during regional shortages.
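
As a rough illustration of the earlier-warning idea in the first item above, the sketch below classifies cluster-wide memory utilization against two thresholds. It assumes per-node memory figures are already being collected; the NodeMemory type, the function names, and the 75% and 90% thresholds are hypothetical and do not describe the actual monitoring configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeMemory:
    """Hypothetical per-node memory sample."""
    name: str
    used_gib: float
    total_gib: float


def cluster_utilization(nodes):
    """Return the fraction of total cluster memory currently in use."""
    total = sum(n.total_gib for n in nodes)
    if total <= 0:
        raise ValueError("cluster has no memory capacity to measure")
    return sum(n.used_gib for n in nodes) / total


def capacity_level(nodes, warn_at=0.75, critical_at=0.90):
    """Classify utilization as 'ok', 'warning', or 'critical' (assumed thresholds)."""
    utilization = cluster_utilization(nodes)
    if utilization >= critical_at:
        return "critical"
    if utilization >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    sample = [
        NodeMemory("node-a", used_gib=60, total_gib=100),
        NodeMemory("node-b", used_gib=100, total_gib=100),
    ]
    print(capacity_level(sample))
```

A warning level set well below exhaustion leaves time to request additional nodes before sandbox operations start to degrade.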

Conclusion

We apologize for the disruption this caused. We are improving our capacity planning, monitoring, and scaling systems to reduce the likelihood of similar incidents in the future.

Please reach out to our support team if you have any questions.
