fix: separate infrastructure vs user code error handling#1339
fix: separate infrastructure vs user code error handling#1339TooTallNate merged 11 commits intomainfrom
Conversation
…nd step handler Transient network errors (ECONNRESET, etc.) during infrastructure calls (event listing, event creation) were caught by a shared try/catch that also handles user code errors, incorrectly marking runs as run_failed or steps as step_failed instead of letting the queue redeliver. - runtime.ts: Move infrastructure calls outside the user-code try/catch so errors propagate to the queue handler for automatic retry - step-handler.ts: Same structural separation — only stepFn.apply() is wrapped in the try/catch that produces step_failed/step_retrying - helpers.ts: Add isTransientNetworkError() and update withServerErrorRetry to retry network errors in addition to 5xx responses - helpers.test.ts: Add tests for network error detection and retry
🦋 Changeset detectedLatest commit: fa30093 The changes in this PR will be included in the next version bump. This PR includes changesets to release 16 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Nitro | Express workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Next.js (Turbopack) | Nitro workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Next.js (Turbopack) | Express Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Nitro | Express Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Express | Nitro Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Next.js (Turbopack) | Nitro | Express SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
|
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests🌍 Community Worlds (55 failed)mongodb (3 failed):
redis (2 failed):
turso (50 failed):
Details by Category✅ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
❌ 🌍 Community Worlds
✅ 📋 Other
|
There was a problem hiding this comment.
Pull request overview
This PR refactors the workflow and step runtimes to more explicitly separate “infrastructure” operations (hydration/serialization and event writes) from user code execution, and extends server-retry logic to include transient network failures so backend/network flakiness can be retried instead of incorrectly failing runs/steps.
Changes:
- Refactors step execution flow to split input hydration, user step execution, and step completion into distinct phases.
- Refactors workflow entrypoint flow to split run preparation, workflow replay execution, and run completion into distinct phases.
- Adds transient network error detection and retries to
withServerErrorRetry, plus corresponding unit tests.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/core/src/runtime/step-handler.ts | Restructures step execution phases and error handling boundaries. |
| packages/core/src/runtime/helpers.ts | Adds transient network error detection + expands withServerErrorRetry retry conditions. |
| packages/core/src/runtime/helpers.test.ts | Adds tests covering network error retry behavior and isTransientNetworkError. |
| packages/core/src/runtime.ts | Restructures workflow execution phases and moves run completion outside user-code try/catch. |
| .changeset/fix-infra-error-handling.md | Records patch release note for infra vs user error handling changes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
Redundant with undici RetryAgent which already handles 5xx retries and network error retries at the HTTP dispatcher level.
VaguelySerious
left a comment
There was a problem hiding this comment.
LGTM overall, there's a few cases I think we should do different though (if what Claude says applies)
… as fatal, deduplicate check - Remove withThrottleRetry (undici RetryAgent handles 429s) - WorkflowRuntimeError in infrastructure setup now produces run_failed/step_failed instead of retrying forever via queue - Deduplicate redundant WorkflowAPIError.is check in step-handler.ts
VaguelySerious
left a comment
There was a problem hiding this comment.
LGTM assuming e2e tests pass
Summary
Transient network errors (ECONNRESET, etc.) during infrastructure calls (event listing, event creation) were caught by a shared try/catch that also handles user code errors, incorrectly marking runs as
run_failedor steps asstep_failedinstead of letting the queue redeliver the message for retry.This PR structurally separates infrastructure calls from user code execution and removes redundant in-process retry wrappers, relying on undici's
RetryAgentand queue-level redelivery for transient error recovery.Changes
Structural error separation (
runtime.ts,step-handler.ts)world.runs.get(),getAllWorkflowRunEvents(),world.events.create(),getEncryptionKeyForRun(), input hydration, result dehydration) are outside the user-code try/catch. Errors propagate to the queue handler for automatic retry.runWorkflow()/stepFn.apply()(user code) is wrapped in the try/catch that producesrun_failed/step_failed.WorkflowRuntimeErrorin infrastructure setup (data integrity issues like missingstartedAt) now producesrun_failed/step_failedinstead of retrying forever via the queue.WorkflowAPIError5xx/410 from user code catch, preventing step attempt consumption on infra errors.Removed in-process retry wrappers (
helpers.ts)withServerErrorRetryremoved — redundant with undici'sRetryAgentwhich retries 5xx at the HTTP dispatcher level.withThrottleRetryremoved — redundant with undici'sRetryAgentconfigured withretryAfter: truefor 429 handling.isTransientNetworkErrorremoved — no longer needed withoutwithServerErrorRetry.Test updates
serverError5xxRetryWorkflowe2e test and its fault injection helpers fromworkbench/example/workflows/99_e2e.ts— this test specifically validatedwithServerErrorRetrybehavior.withServerErrorRetry,isTransientNetworkError, andwithThrottleRetry.