REST fallback server-streaming reads can become non-cancellable and ignore the effective transport timeout, leaving Firestore reads stuck on batchGetDocuments #7959
Description
Summary
I'm filing this in googleapis/google-cloud-node because the npm metadata for google-gax still points to archived repos (googleapis/gax-nodejs and googleapis/google-cloud-node-core). If there is now a better active home for the REST fallback transport layer, please route this there.
We reproduced a defect chain in Firestore REST fallback reads where a single silent server-streaming stall can become sticky:
- Firestore DocumentReference.get() in REST mode uses server-streaming batchGetDocuments
- the fallback transport deadline/timeout is not effectively enforced at the underlying fetch layer
- the returned REST streaming object does not abort the underlying fetch when cancelled
- this leaves the Firestore-side client checked out indefinitely if the stream never receives a first byte or terminal error
This reproduces in minimal form on:
- google-gax@4.6.1 + @google-cloud/firestore@7.11.6
- google-gax@5.0.6 + @google-cloud/firestore@8.3.0
- Node and Bun
What we confirmed
A minimal repro without real Firestore credentials shows:
- fallback unary timeout/deadline can land in the ignored third arg
- fallback stream timeout/deadline can land in the ignored third arg
- returned REST stream cancel() does not abort the underlying fetch signal
- a silent stall can leave the Firestore-side client checked out indefinitely
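To illustrate the "ignored third arg" finding, here is a minimal sketch of how a deadline can be dropped when caller and callee disagree on argument order. Every name here (`transportStub`, `callerWithGrpcStyleOrder`, `dispatched`) is hypothetical and chosen for illustration only. This is not google-gax source, and the exact signatures involved are an assumption.

```typescript
// Hypothetical illustration only; not google-gax code.
type Callback = (err: Error | null, response?: string) => void;

interface DispatchRecord {
  request: object;
  timeoutMs: number | undefined;
}
const dispatched: DispatchRecord[] = [];

// A transport stub that reads its options from the SECOND argument
// and never looks at the third one.
function transportStub(
  request: object,
  options: Record<string, unknown> | undefined,
  _ignored: unknown,
  callback: Callback,
): void {
  const deadline = options?.["deadline"];
  const timeoutMs =
    deadline instanceof Date ? deadline.getTime() - Date.now() : undefined;
  dispatched.push({ request, timeoutMs });
  callback(null, "ok");
}

// A caller that assumes a (request, metadata, options, callback) order,
// so the deadline ends up in the argument the stub ignores.
function callerWithGrpcStyleOrder(
  request: object,
  deadline: Date,
  callback: Callback,
): void {
  const metadata: Record<string, unknown> = {};
  transportStub(request, metadata, { deadline }, callback);
}

callerWithGrpcStyleOrder({ name: "doc" }, new Date(Date.now() + 1000), () => {});
// The stub saw no deadline, so no transport timeout would be applied.
console.log(dispatched[0]?.timeoutMs); // undefined
```

In this sketch nothing fails at the call site; the deadline is silently discarded, which matches the symptom of a request with no effective transport timeout.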
In traced soak evidence from our production incident, the live stall point was:
- request sent
- fallback fetch dispatched
- no first response byte
- no stream end
- no client release
This is a runtime-agnostic REST fallback problem. Under a silent network stall, Node and Bun both showed the pending REST fallback fetch staying alive for at least 180s with no resolve/reject.
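The "no resolve/reject within a window" observation can be checked with a small helper that reports whether a promise is still pending after a given time. This is a sketch under assumptions: `stalledFetch` is a hypothetical stand-in for a fetch whose peer never sends a byte, `observe` is a hypothetical helper, and the window is shortened from 180s so the example finishes quickly.

```typescript
// Hypothetical stand-in: a request that never settles unless its signal aborts.
function stalledFetch(signal?: AbortSignal): Promise<string> {
  return new Promise<string>((_resolve, reject) => {
    signal?.addEventListener("abort", () => reject(new Error("aborted")), {
      once: true,
    });
  });
}

type Outcome = "resolved" | "rejected" | "pending";

// Reports how the promise settled, or "pending" if it has not settled
// within windowMs. The timer is cleared when the promise settles first.
function observe(p: Promise<unknown>, windowMs: number): Promise<Outcome> {
  return new Promise<Outcome>((resolve) => {
    const timer = setTimeout(() => resolve("pending"), windowMs);
    p.then(
      () => {
        clearTimeout(timer);
        resolve("resolved");
      },
      () => {
        clearTimeout(timer);
        resolve("rejected");
      },
    );
  });
}

// With no abort signal wired in, the stalled call is still pending
// when the observation window closes.
observe(stalledFetch(), 50).then((outcome) => console.log(outcome));
```

Against a real endpoint the same helper could wrap the actual read, with the window set to whatever stall duration is being investigated.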
Why this matters
A transient network blackhole can become sticky far beyond the original event:
- the read is not effectively transport-timed out
- the returned stream is not effectively transport-cancellable
- the Firestore layer keeps the client checked out
Local mitigation we carry
We currently patch google-gax@4.6.1 locally to:
- extract the effective timeout/deadline correctly
- enforce the transport timeout with an abort controller
- wire REST stream cancel to abort the underlying fetch
That mitigates the issue for our pinned stack, but the upstream gap appears to remain.
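The shape of that mitigation can be sketched as follows. This is not the actual patch: `callWithDeadline`, `FetchLike`, `CancellableCall`, and `stalledFetch` are hypothetical names, and the sketch only assumes that the underlying request honors an `AbortSignal` (as the standard `fetch(url, { signal })` option does).

```typescript
// Hypothetical shapes for illustration; not the google-gax API.
type FetchLike = (signal: AbortSignal) => Promise<string>;

interface CancellableCall {
  result: Promise<string>;
  cancel(): void;
}

// Enforces a transport timeout with an AbortController and exposes a
// cancel() that aborts the same underlying request.
function callWithDeadline(doFetch: FetchLike, timeoutMs: number): CancellableCall {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const result = doFetch(controller.signal).finally(() => clearTimeout(timer));
  return {
    result,
    cancel: () => controller.abort(),
  };
}

// Hypothetical stand-in for a request that stalls until it is aborted.
function stalledFetch(signal: AbortSignal): Promise<string> {
  return new Promise<string>((_resolve, reject) => {
    signal.addEventListener("abort", () => reject(new Error("aborted")), {
      once: true,
    });
  });
}

// The deadline path: the stalled call is rejected once the timer fires.
const timed = callWithDeadline(stalledFetch, 50);
timed.result.catch((err: Error) => console.log("timed out:", err.message));

// The cancel path: the caller aborts long before the deadline.
const cancelled = callWithDeadline(stalledFetch, 60_000);
cancelled.cancel();
cancelled.result.catch((err: Error) => console.log("cancelled:", err.message));
```

Routing both the timer and `cancel()` through one controller means either path settles the pending call, so whatever holds the client checked out has a terminal event to release on.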
Requested maintainer feedback
- Is googleapis/google-cloud-node the correct active home for this report, or should it move elsewhere?
- Is the fallback timeout/deadline argument ordering intended?
- Is the returned REST stream cancel path expected to abort the underlying fetch?
- If this belongs in a different repo/package now, could you route it?