Implement AWS quota-aware queued shard execution (issue #32) #33

Closed

aviggiano wants to merge 15 commits into main from issue-32-aws-queue-shards

Conversation

aviggiano (Collaborator) commented on Feb 12, 2026

Closes #32

Consolidated Status (single source of truth)

This PR now contains the complete implementation and follow-up hardening for AWS quota-aware queued shard execution.

Issue #32 requested queue-based shard draining under quota/capacity pressure, durable shard/run state, retry/backoff behavior, and a global mutex.

What Was Implemented

  1. Queue-backed execution model
  • Per-run SQS queue + DLQ.
  • DynamoDB run/shard state model with explicit shard states.
  • Fixed worker pool (max_parallel_instances) instead of one-EC2-per-shard fan-out.
  2. Orchestration and lock model
  • Global lock with acquire/backoff semantics.
  • Centralized lock orchestration in scripts/benchmark_lock.py.
  • Lock lease renewal during bootstrap and worker execution (see the lock sketch after this list).
  • Fail-closed behavior when lock renewal fails.
  3. Queue/bootstrap reliability
  • Idempotent queue initialization with an explicit launching flow.
  • Fatal-bootstrap recovery that terminalizes non-terminal shard/run metadata.
  • Conservative lock-release guard on bootstrap failure (keeps the lock when safety is uncertain).
  4. Worker/state consistency hardening
  • Atomic/conditional shard claim and transition handling (see the shard-claim sketch after this list).
  • Retry-transition return-code handling fixes.
  • Attempt accounting moved to DynamoDB-backed attempts (not SQS receive count).
  • Completion ordering hardened (status.json is published before the terminal completion transition).
  • Run terminalization path on lock-heartbeat failure.
  5. Completion/docs alignment and metadata
  • Shared completion predicate helper (scripts/run_completion.py) reused by the release/docs paths.
  • Queue/lock/retry/final-count metadata surfaced in manifest/docs.
  • Docs updated for queue mode and mutex behavior.
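
For item 2, the lock model can be pictured as a lease-based lock over DynamoDB. The sketch below is illustrative only: the table name (`benchmark-locks`), key, and attribute names are assumptions and do not necessarily match scripts/benchmark_lock.py.

```python
# Minimal sketch of a lease-based global lock over DynamoDB.
# Table, key, and attribute names are hypothetical.
import time
import uuid

import boto3
from botocore.exceptions import ClientError

LEASE_SECONDS = 300
table = boto3.resource("dynamodb").Table("benchmark-locks")  # hypothetical table


def acquire(lock_id: str) -> str | None:
    """Take the lock if it is free or its previous lease has expired."""
    owner = str(uuid.uuid4())
    now = int(time.time())
    try:
        table.put_item(
            Item={"lock_id": lock_id, "owner": owner, "expires_at": now + LEASE_SECONDS},
            ConditionExpression="attribute_not_exists(lock_id) OR expires_at < :now",
            ExpressionAttributeValues={":now": now},
        )
        return owner
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return None  # another run holds a live lease; caller backs off and retries
        raise


def renew(lock_id: str, owner: str) -> bool:
    """Extend the lease; a failed renewal means stop claiming work (fail closed)."""
    try:
        table.update_item(
            Key={"lock_id": lock_id},
            UpdateExpression="SET expires_at = :exp",
            ConditionExpression="#o = :owner",
            ExpressionAttributeNames={"#o": "owner"},
            ExpressionAttributeValues={
                ":exp": int(time.time()) + LEASE_SECONDS,
                ":owner": owner,
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```

Bootstrap and workers would call renew on a heartbeat; a False return means the lease could no longer be confirmed, so the caller stops claiming new work rather than risking overlap.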

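Item 4's atomic claim amounts to a single conditional DynamoDB update. The sketch below assumes a hypothetical `benchmark-shards` table keyed by (run_id, shard_id); attribute and state names are illustrative, not the exact schema in this PR.

```python
# Sketch of a conditional shard claim: queued -> running exactly once.
import boto3
from botocore.exceptions import ClientError

shards = boto3.resource("dynamodb").Table("benchmark-shards")  # hypothetical table


def claim_shard(run_id: str, shard_id: str, worker_id: str) -> bool:
    """Claim the shard if still queued; duplicate SQS deliveries lose the race."""
    try:
        shards.update_item(
            Key={"run_id": run_id, "shard_id": shard_id},
            UpdateExpression="SET #s = :running, #w = :worker ADD attempts :one",
            ConditionExpression="#s = :queued",
            ExpressionAttributeNames={"#s": "status", "#w": "worker_id"},
            ExpressionAttributeValues={
                ":running": "running",
                ":queued": "queued",
                ":worker": worker_id,
                ":one": 1,
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already claimed or terminal; drop the duplicate message
        raise
```

Because the attempt counter is incremented in the same conditional update, duplicate deliveries cannot inflate it, which is the point of moving attempt accounting off the SQS receive count.
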
Thread Findings -> Resolution

  • Global lock expiry/overlap risk: addressed with lease renewal plus fail-closed behavior.
  • Bootstrap idempotency and stranded shards: addressed via the launch-state flow and recovery logic.
  • Capacity/transient provisioning retries: addressed with retry/backoff/degrade orchestration (see the backoff sketch after this list).
  • Duplicate-delivery/counter-inflation risks: addressed with conditional/transactional state handling.
  • Lock fail-fast semantics: the default lock-acquire timeout is now policy-unbounded (0), though still bounded by the GitHub Actions job runtime.
  • Completion predicate drift: addressed by a shared helper used in the release/docs paths (see the completion sketch after this list).
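
One way to read the retry/backoff/degrade item above is jittered exponential backoff around instance launches on transient capacity errors. The retryable error-code set and delay parameters below are illustrative, not the exact policy implemented in this PR.

```python
# Sketch of capacity-aware retry/backoff around EC2 instance launches.
import random
import time

import boto3
from botocore.exceptions import ClientError

RETRYABLE = {"InsufficientInstanceCapacity", "RequestLimitExceeded", "SpotMaxPriceTooLow"}


def launch_with_backoff(max_attempts: int = 6, base_delay: float = 5.0, **run_args):
    """Retry ec2.run_instances on transient capacity errors with jittered exponential backoff."""
    ec2 = boto3.client("ec2")
    for attempt in range(max_attempts):
        try:
            return ec2.run_instances(**run_args)
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code not in RETRYABLE or attempt == max_attempts - 1:
                raise  # non-transient, or out of attempts: surface the failure
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```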

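The shared completion predicate can be as small as a pure function over shard states; this sketch only illustrates the shape, since the real helper lives in scripts/run_completion.py and its state names may differ.

```python
# Sketch of a run-completion predicate; state names are illustrative.
TERMINAL_STATES = {"succeeded", "failed", "skipped"}


def run_is_complete(shard_states: dict[str, str]) -> bool:
    """A run is complete once every shard has reached a terminal state."""
    return bool(shard_states) and all(
        state in TERMINAL_STATES for state in shard_states.values()
    )


# e.g. run_is_complete({"shard-0": "succeeded", "shard-1": "failed"}) -> True
```
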
Remaining Open Items / Product Decisions

  1. Strict indefinite-pending semantics are still effectively bounded by workflow runtime limits, even with the policy timeout disabled; true indefinite pending would require a durable external request scheduler/queue.
  2. Conservative lock-keep paths on uncertain bootstrap state require operator recovery when automation cannot safely release the lock.

Validation Reported In This PR

  • make terraform-fmt
  • make terraform-validate
  • bash -n fuzzers/_shared/queue_worker.sh fuzzers/_shared/common.sh
  • python3 -m py_compile scripts/benchmark_lock.py scripts/queue_init_run.py scripts/run_completion.py scripts/generate_docs_site.py
  • actionlint .github/workflows/benchmark-run.yml .github/workflows/benchmark-release.yml .github/workflows/benchmark-request.yml

Discussion Cleanup

An earlier comment from aviggiano was marked as outdated.

aviggiano (Collaborator, Author) commented:

Closing as superseded by #37. We are replacing this implementation with a simpler S3-only orchestration approach to reduce dependencies/complexity while preserving required behavior.
