Description
Problem
The `stream.on("error", ...)` handler in `TaskHubGrpcWorker.internalRunWorker()` (`packages/durabletask-js/src/worker/task-hub-grpc-worker.ts`, line 384) only logs the error; it neither cleans up the stream nor attempts to reconnect:

```typescript
stream.on("error", (err: Error) => {
  if (this._stopWorker) {
    return;
  }
  WorkerLogs.streamErrorInfo(this._logger, err);
  // ← No cleanup, no retry
});
```
In contrast, the adjacent `stream.on("end", ...)` handler (line 370) correctly calls `removeAllListeners()`, `destroy()`, and `_createNewClientAndRetry()`.
Root Cause
In Node.js with `@grpc/grpc-js`, gRPC stream errors, especially transport-level failures such as `UNAVAILABLE` or abrupt network disconnections, may emit an `"error"` event without a subsequent `"end"` event. When this happens, the worker logs the error and then silently stops processing work items forever, because no recovery path is ever triggered.
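The asymmetry is easy to reproduce with a plain Node.js stream standing in for the gRPC call stream (a hypothetical illustration, not the worker's actual code): destroying a stream with an error emits `"error"` (and `"close"`), but never `"end"`, so recovery logic wired only to `"end"` never runs.

```typescript
import { PassThrough } from "node:stream";

// Track which lifecycle events actually fire after a transport-style failure.
const events: string[] = [];
const stream = new PassThrough();
stream.on("end", () => events.push("end"));
stream.on("close", () => events.push("close"));
stream.on("error", () => events.push("error"));

// Simulate an abrupt transport failure (e.g. UNAVAILABLE).
stream.destroy(new Error("transport failure"));

setImmediate(() => {
  // "error" fires; "end" never does.
  console.log(events.join(","));
});
```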
The `stop()` method (line 424) already accounts for this asymmetry by listening for `"end"`, `"close"`, or `"error"`, which confirms the developers know that `"error"` can fire alone.
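As a rough sketch (assumed shape, not the library's actual code), the pattern `stop()` reportedly uses can be expressed as a helper that resolves on whichever terminal event fires first:

```typescript
import { EventEmitter } from "node:events";

// Resolve once any terminal stream event fires, since "error" can
// arrive without "end" ever being emitted.
function waitForStreamClose(stream: EventEmitter): Promise<string> {
  return new Promise((resolve) => {
    for (const ev of ["end", "close", "error"]) {
      stream.once(ev, () => resolve(ev));
    }
  });
}

// Demonstration: an "error" alone is enough to settle the promise.
const s = new EventEmitter();
const done = waitForStreamClose(s);
s.emit("error", new Error("boom"));
done.then((ev) => console.log(ev));
```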
Proposed Fix
Add stream cleanup and retry logic to the error handler, mirroring the `"end"` handler pattern:
- Call `stream.removeAllListeners()` to prevent double recovery if both events fire
- Add a no-op `stream.on("error", () => {})` guard to prevent unhandled-error crashes from stale events
- Call `stream.destroy()` to clean up the stream
- Call `_createNewClientAndRetry()` to reconnect
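The steps above can be sketched in a self-contained form. A `PassThrough` stands in for the gRPC stream, and `MockWorker` is a hypothetical stand-in for `TaskHubGrpcWorker` (the names `_stopWorker` and `_createNewClientAndRetry` are taken from the issue; the surrounding code is assumed):

```typescript
import { PassThrough } from "node:stream";

class MockWorker {
  _stopWorker = false;
  retries = 0;

  attach(stream: PassThrough): void {
    stream.on("error", (err: Error) => {
      if (this._stopWorker) {
        return;
      }
      console.error("stream error:", err.message); // stand-in for WorkerLogs.streamErrorInfo
      stream.removeAllListeners();   // prevent double recovery if "end" also fires
      stream.on("error", () => {});  // no-op guard against stale late errors
      stream.destroy();              // release the broken stream
      this._createNewClientAndRetry(); // reconnect
    });
  }

  _createNewClientAndRetry(): void {
    this.retries += 1;
  }
}

const worker = new MockWorker();
const s = new PassThrough();
worker.attach(s);
s.emit("error", new Error("UNAVAILABLE")); // triggers cleanup + reconnect
s.emit("error", new Error("stale event")); // swallowed by the no-op guard
console.log(worker.retries);               // reconnect triggered exactly once
```

Because `removeAllListeners()` runs before the no-op guard is attached, a late duplicate `"error"` (or a trailing `"end"`) cannot trigger a second reconnect or crash the process.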
Impact
Severity: High. Affected workers silently stop processing orchestration, activity, and entity work items after a transport-level gRPC error. This can happen in production when network connectivity is temporarily lost, a load balancer resets connections, or the sidecar restarts. The worker appears healthy (no crash, no error logged at error level) but is effectively dead.