bug: flaky E2E test_sandbox_api_crud_and_exec — DEADLINE_EXCEEDED on CreateSandbox #465

@drew

Agent Diagnostic

Investigated the failed job 67691440936 from workflow run #87 (Release Dev) on commit 4878b9b (main branch).

  • Failure: e2e / E2E job, step "Run E2E tests", attempt 1.
  • Test: e2e/python/test_sandbox_api.py::test_sandbox_api_crud_and_exec
  • Error: grpc._channel._InactiveRpcError with StatusCode.DEADLINE_EXCEEDED during CreateSandbox RPC call.
  • Result: 1 failed, 67 passed out of 68 tests. The run was automatically retried (attempt 2).
  • Prior runs: The three preceding Release Dev workflow runs all had passing E2E jobs, which suggests an intermittent failure rather than a regression introduced by the commit.
  • Root cause hypothesis: The CreateSandbox gRPC call times out before the sandbox is provisioned. Possible causes include resource contention on the build-arm64 runner, slow container startup after cluster bootstrap, or an insufficient default timeout in the Python SDK's create method.

Description

Actual behavior: test_sandbox_api_crud_and_exec fails intermittently with a DEADLINE_EXCEEDED gRPC error when entering sandbox(delete_on_exit=True), which invokes client.create_session(spec=self._spec) → self.create(spec=spec) → self._stub.CreateSandbox(...).

Expected behavior: The test should reliably create a sandbox within the configured timeout, or have retry/backoff logic to handle transient delays in sandbox provisioning.
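
To illustrate the retry/backoff option, here is a minimal sketch using only the standard library. The helper name `retry_with_backoff` and its parameters are hypothetical and are not part of the openshell SDK; the caller decides which errors count as transient.

```python
import random
import time


def retry_with_backoff(fn, is_transient, attempts=3, base_delay=1.0,
                       max_delay=10.0, sleep=time.sleep):
    """Call fn(); on a transient error, wait and retry, up to `attempts` calls.

    Hypothetical helper for illustration only, not part of the openshell SDK.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or when the error is not transient.
            if attempt == attempts or not is_transient(exc):
                raise
            # Exponential backoff with jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

In the sandbox fixture, `fn` could wrap `client.create_session(spec=spec)`, and `is_transient` could check for a `grpc.RpcError` whose `code()` is `grpc.StatusCode.DEADLINE_EXCEEDED` or `grpc.StatusCode.UNAVAILABLE`. One caveat: a CreateSandbox call that hits its deadline may still complete on the server, so retrying is only safe if creation is idempotent or leftover sandboxes are cleaned up.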

Reproduction Steps

  1. Observe the failed job logs from Release Dev run #87 ("refactor: switch sandbox-internal RPCs from sandbox_id to name-based lookup"), attempt 1.
  2. The failure occurs at e2e/python/test_sandbox_api.py:29 inside the with sandbox(delete_on_exit=True) context manager.

Environment

Logs

e2e/python/test_sandbox_api.py:29: in test_sandbox_api_crud_and_exec
    with sandbox(delete_on_exit=True) as sb:
python/openshell/sandbox.py:471: in __enter__
    self._session = client.create_session(spec=self._spec)
python/openshell/sandbox.py:206: in create_session
    return SandboxSession(self, self.create(spec=spec))
python/openshell/sandbox.py:193: in create
    response = self._stub.CreateSandbox(
...
E   grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
E       status = StatusCode.DEADLINE_EXCEEDED
E       details = "Deadline Exceeded"
E       debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Deadline Exceeded", grpc_status:4}"
E   >
FAILED e2e/python/test_sandbox_api.py::test_sandbox_api_crud_and_exec
=================== 1 failed, 67 passed in 67.49s (0:01:07) ====================

Possible Mitigations

  • Increase the gRPC deadline for CreateSandbox in the E2E test fixtures or the Python SDK client.
  • Add a warm-up/readiness check after cluster bootstrap before running tests.
  • Add retry logic to the sandbox fixture for transient gRPC errors.
  • Investigate if ARM64 runner resource contention contributes to the timeout.
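
The warm-up/readiness mitigation could look roughly like the polling loop below. This is a sketch under assumptions: `wait_until_ready` and the `probe` callable are hypothetical names, and what counts as "ready" (for example, a cheap RPC against the gateway succeeding) is left to the caller.

```python
import time


def wait_until_ready(probe, timeout=60.0, interval=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True or `timeout` seconds elapse.

    Hypothetical helper for illustration only. Returns True when the probe
    succeeds and False when the timeout is reached first.
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            # Treat probe errors as "not ready yet" and keep polling.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

A session-scoped pytest fixture could call this once after cluster bootstrap and fail fast if it returns False. For the transport layer alone, `grpc.channel_ready_future(channel).result(timeout=...)` blocks until the channel is connected, although a connected channel does not by itself show that sandbox provisioning is warm. On the deadline mitigation, Python gRPC stub methods accept a `timeout` keyword in seconds; whether the SDK's create method exposes it is not shown in this issue.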
