Implement AWS quota-aware queued shard execution (issue #32) #33

Closed

aviggiano wants to merge 15 commits into main from issue-32-aws-queue-shards

Conversation

aviggiano (Collaborator) commented on Feb 12, 2026

Closes #32

Consolidated Status (single source of truth)

This PR now contains the complete implementation and follow-up hardening for AWS quota-aware queued shard execution.

Issue #32 requested queue-based shard draining under quota/capacity pressure, durable shard/run state, retry/backoff behavior, and a global mutex.

What Was Implemented

  1. Queue-backed execution model
  • Per-run SQS queue + DLQ.
  • DynamoDB run/shard state model with explicit shard states.
  • Fixed worker pool (max_parallel_instances) instead of one-EC2-per-shard fan-out.
  2. Orchestration and lock model
  • Global lock with acquire/backoff semantics.
  • Centralized lock orchestration in scripts/benchmark_lock.py.
  • Lock lease renewal during bootstrap and worker execution (see the lock sketch after this list).
  • Fail-closed behavior when lock renewal fails.
  3. Queue/bootstrap reliability
  • Idempotent queue initialization with an explicit launching flow.
  • Fatal-bootstrap recovery that terminalizes non-terminal shard/run metadata.
  • Conservative lock-release guard on bootstrap failure (keeps the lock when safety is uncertain).
  4. Worker/state consistency hardening
  • Atomic/conditional shard claim and transition handling (see the shard-claim sketch after this list).
  • Retry-transition return-code handling fixes.
  • Attempt accounting moved to DynamoDB-backed attempts (not SQS receive count).
  • Completion ordering hardened (status.json is published before the terminal completion transition).
  • Run terminalization path on lock-heartbeat failure.
  5. Completion/docs alignment and metadata
  • Shared completion predicate helper (scripts/run_completion.py) reused by the release/docs paths.
  • Queue/lock/retry/final-count metadata surfaced in manifest/docs.
  • Docs updated for queue mode and mutex behavior.
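
For item 2, the lock model can be pictured as a lease-based lock over DynamoDB. The sketch below is illustrative only: the table name (`benchmark-locks`), key, and attribute names are assumptions and do not necessarily match scripts/benchmark_lock.py.

```python
# Minimal sketch of a lease-based global lock over DynamoDB.
# Table, key, and attribute names are hypothetical.
import time
import uuid

import boto3
from botocore.exceptions import ClientError

LEASE_SECONDS = 300
table = boto3.resource("dynamodb").Table("benchmark-locks")  # hypothetical table


def acquire(lock_id: str) -> str | None:
    """Take the lock if it is free or its previous lease has expired."""
    owner = str(uuid.uuid4())
    now = int(time.time())
    try:
        table.put_item(
            Item={"lock_id": lock_id, "owner": owner, "expires_at": now + LEASE_SECONDS},
            ConditionExpression="attribute_not_exists(lock_id) OR expires_at < :now",
            ExpressionAttributeValues={":now": now},
        )
        return owner
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return None  # another run holds a live lease; caller backs off and retries
        raise


def renew(lock_id: str, owner: str) -> bool:
    """Extend the lease; a failed renewal means stop claiming work (fail closed)."""
    try:
        table.update_item(
            Key={"lock_id": lock_id},
            UpdateExpression="SET expires_at = :exp",
            ConditionExpression="#o = :owner",
            ExpressionAttributeNames={"#o": "owner"},
            ExpressionAttributeValues={
                ":exp": int(time.time()) + LEASE_SECONDS,
                ":owner": owner,
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```

Bootstrap and workers would call renew on a heartbeat; a False return means the lease could no longer be confirmed, so the caller stops claiming new work rather than risking overlap.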

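Item 4's atomic claim amounts to a single conditional DynamoDB update. The sketch below assumes a hypothetical `benchmark-shards` table keyed by (run_id, shard_id); attribute and state names are illustrative, not the exact schema in this PR.

```python
# Sketch of a conditional shard claim: queued -> running exactly once.
import boto3
from botocore.exceptions import ClientError

shards = boto3.resource("dynamodb").Table("benchmark-shards")  # hypothetical table


def claim_shard(run_id: str, shard_id: str, worker_id: str) -> bool:
    """Claim the shard if still queued; duplicate SQS deliveries lose the race."""
    try:
        shards.update_item(
            Key={"run_id": run_id, "shard_id": shard_id},
            UpdateExpression="SET #s = :running, #w = :worker ADD attempts :one",
            ConditionExpression="#s = :queued",
            ExpressionAttributeNames={"#s": "status", "#w": "worker_id"},
            ExpressionAttributeValues={
                ":running": "running",
                ":queued": "queued",
                ":worker": worker_id,
                ":one": 1,
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already claimed or terminal; drop the duplicate message
        raise
```

Because the attempt counter is incremented in the same conditional update, duplicate deliveries cannot inflate it, which is the point of moving attempt accounting off the SQS receive count.
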
Thread Findings -> Resolution

  • Global lock expiry/overlap risk: addressed with lease renewal plus fail-closed behavior.
  • Bootstrap idempotency and stranded shards: addressed via the launch-state flow and recovery logic.
  • Capacity/transient provisioning retries: addressed with retry/backoff/degrade orchestration (see the backoff sketch after this list).
  • Duplicate-delivery/counter-inflation risks: addressed with conditional/transactional state handling.
  • Lock fail-fast semantics: the default lock-acquire timeout is now policy-unbounded (0), though still bounded by the GitHub Actions job runtime.
  • Completion predicate drift: addressed by a shared helper used in the release/docs paths (see the completion sketch after this list).
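
One way to read the retry/backoff/degrade item above is jittered exponential backoff around instance launches on transient capacity errors. The retryable error-code set and delay parameters below are illustrative, not the exact policy implemented in this PR.

```python
# Sketch of capacity-aware retry/backoff around EC2 instance launches.
import random
import time

import boto3
from botocore.exceptions import ClientError

RETRYABLE = {"InsufficientInstanceCapacity", "RequestLimitExceeded", "SpotMaxPriceTooLow"}


def launch_with_backoff(max_attempts: int = 6, base_delay: float = 5.0, **run_args):
    """Retry ec2.run_instances on transient capacity errors with jittered exponential backoff."""
    ec2 = boto3.client("ec2")
    for attempt in range(max_attempts):
        try:
            return ec2.run_instances(**run_args)
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code not in RETRYABLE or attempt == max_attempts - 1:
                raise  # non-transient, or out of attempts: surface the failure
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```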

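The shared completion predicate can be as small as a pure function over shard states; this sketch only illustrates the shape, since the real helper lives in scripts/run_completion.py and its state names may differ.

```python
# Sketch of a run-completion predicate; state names are illustrative.
TERMINAL_STATES = {"succeeded", "failed", "skipped"}


def run_is_complete(shard_states: dict[str, str]) -> bool:
    """A run is complete once every shard has reached a terminal state."""
    return bool(shard_states) and all(
        state in TERMINAL_STATES for state in shard_states.values()
    )


# e.g. run_is_complete({"shard-0": "succeeded", "shard-1": "failed"}) -> True
```
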
Remaining Open Items / Product Decisions

  1. Strict indefinite-pending semantics are still effectively bounded by workflow runtime limits, even with the policy timeout disabled; true indefinite pending would require a durable external request scheduler/queue.
  2. Conservative lock-keep paths on uncertain bootstrap state require operator recovery when automation cannot safely release the lock.

Validation Reported In This PR

  • make terraform-fmt
  • make terraform-validate
  • bash -n fuzzers/_shared/queue_worker.sh fuzzers/_shared/common.sh
  • python3 -m py_compile scripts/benchmark_lock.py scripts/queue_init_run.py scripts/run_completion.py scripts/generate_docs_site.py
  • actionlint .github/workflows/benchmark-run.yml .github/workflows/benchmark-release.yml .github/workflows/benchmark-request.yml

Discussion Cleanup

An earlier comment from aviggiano was marked as outdated.

aviggiano (Collaborator, Author) commented:

Closing as superseded by #37. We are replacing this implementation with a simpler S3-only orchestration approach to reduce dependencies/complexity while preserving required behavior.
