REST fallback server-streaming reads can become non-cancellable and ignore the effective transport timeout, leaving Firestore reads stuck on batchGetDocuments #7959

@akshatbaranwal

Summary

I'm filing this in googleapis/google-cloud-node because the npm metadata for google-gax still points to archived repos (googleapis/gax-nodejs and googleapis/google-cloud-node-core). If there is now a better active home for the REST fallback transport layer, please route this there.

We reproduced a defect chain in Firestore REST fallback reads where a single silent server-streaming stall can become sticky:

  1. Firestore DocumentReference.get() in REST mode uses server-streaming batchGetDocuments
  2. the fallback transport deadline/timeout is not effectively enforced at the underlying fetch layer
  3. the returned REST streaming object does not abort the underlying fetch when cancelled
  4. this leaves the Firestore-side client checked out indefinitely if the stream never receives a first byte or terminal error

This reproduces in minimal form on:

  • google-gax@4.6.1 + @google-cloud/firestore@7.11.6
  • google-gax@5.0.6 + @google-cloud/firestore@8.3.0
  • Node and Bun

What we confirmed

A minimal repro that needs no real Firestore credentials shows that:

  • the fallback unary timeout/deadline can land in the third argument, which is ignored
  • the fallback stream timeout/deadline can land in the third argument, which is ignored
  • cancel() on the returned REST stream does not abort the underlying fetch signal
  • a silent stall can leave the Firestore-side client checked out indefinitely
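
To make the "ignored third argument" point concrete, here is a minimal sketch of how a positional mismatch can silently drop a deadline. Every name in it (CallOptions, stubTimeoutMs, effectiveDeadline) is hypothetical and chosen for illustration; it is not the google-gax source, only the shape of the failure as we understand it.

```typescript
// Hypothetical shapes for illustration; not the real google-gax types.
interface CallOptions {
  deadline?: Date;
}

// Hypothetical stub: parameters are (request, options, metadata), and the
// third parameter is never read.
function stubTimeoutMs(
  _request: object,
  options: CallOptions = {},
  _metadata?: unknown,
): number | undefined {
  return options.deadline ? options.deadline.getTime() - Date.now() : undefined;
}

// Hypothetical caller that passes (request, metadata, options) instead, so
// the deadline ends up in the slot the stub ignores. This type-checks.
const deadline = new Date(Date.now() + 5000);
const seenByStub = stubTimeoutMs({ name: "doc" }, {}, { deadline });

// Defensive extraction: accept a deadline from whichever slot carries one.
function effectiveDeadline(...slots: unknown[]): Date | undefined {
  for (const slot of slots) {
    if (typeof slot === "object" && slot !== null && "deadline" in slot) {
      const candidate = (slot as CallOptions).deadline;
      if (candidate instanceof Date) return candidate;
    }
  }
  return undefined;
}

const recovered = effectiveDeadline({}, { deadline });
console.log(seenByStub); // undefined: the deadline never reached the stub
console.log(recovered === deadline); // true
```

Because both slots accept plain objects, the compiler does not flag the swap, which may be why this kind of mismatch can go unnoticed.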

In traced soak evidence from our production incident, the live stall point was:

  • request sent
  • fallback fetch dispatched
  • no first response byte
  • no stream end
  • no client release

This is a runtime-agnostic REST fallback problem. Under a silent network stall, Node and Bun both showed the pending REST fallback fetch staying alive for at least 180s with no resolve/reject.
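
The silent-stall condition can be approximated locally without gax or Firestore. The sketch below assumes a runtime with a global fetch (Node 18+ or Bun) and uses a local TCP server that accepts connections but never writes a byte. The helper names startBlackhole and probeStall are ours, and this is a stand-in for the network condition, not the repro we ran against the library.

```typescript
import * as net from "node:net";

// Start a TCP server that accepts connections and then stays silent.
function startBlackhole(): Promise<{ port: number; close: () => void }> {
  return new Promise((resolve) => {
    const sockets = new Set<net.Socket>();
    const server = net.createServer((socket) => {
      sockets.add(socket);
      socket.on("error", () => {}); // ignore resets during teardown
      socket.on("close", () => sockets.delete(socket));
    });
    server.listen(0, "127.0.0.1", () => {
      const { port } = server.address() as net.AddressInfo;
      resolve({
        port,
        close: () => {
          for (const s of sockets) s.destroy();
          server.close();
        },
      });
    });
  });
}

// Issue a fetch with no timeout of its own and report whether it is still
// pending after observeMs milliseconds.
async function probeStall(observeMs: number): Promise<"pending" | "settled"> {
  const blackhole = await startBlackhole();
  const controller = new AbortController(); // used only for cleanup
  const request = fetch(`http://127.0.0.1:${blackhole.port}/`, {
    signal: controller.signal,
  }).then(
    () => "settled" as const,
    () => "settled" as const,
  );
  let timer: ReturnType<typeof setTimeout> | undefined;
  const observer = new Promise<"pending">((resolve) => {
    timer = setTimeout(() => resolve("pending"), observeMs);
  });
  const outcome = await Promise.race([request, observer]);
  clearTimeout(timer);
  controller.abort(); // tear the request down so the process can exit
  blackhole.close();
  return outcome;
}
```

Calling probeStall(500) and logging the result should show the request still pending when the observation window closes, since nothing in the call path imposes a deadline on it.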

Why this matters

A transient network blackhole can become sticky far beyond the original event:

  • the read is not effectively transport-timed out
  • the returned stream is not effectively transport-cancellable
  • the Firestore layer keeps the client checked out

Local mitigation we carry

We currently patch google-gax@4.6.1 locally to:

  • extract the effective timeout/deadline correctly
  • enforce the transport timeout with an abort controller
  • wire REST stream cancel to abort the underlying fetch

That mitigates the issue for our pinned stack, but the upstream gap appears to remain.
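
The second and third points of that mitigation can be sketched roughly as follows. This is not our actual patch and not the google-gax API: callWithDeadline, CancellableCall, and stalledFetch are hypothetical names, and the fetch implementation is injected so the sketch runs without a network.

```typescript
interface CancellableCall<T> {
  promise: Promise<T>;
  cancel: () => void;
}

// Run a fetch-like call under one AbortController that serves both the
// transport deadline and caller-initiated cancellation.
function callWithDeadline<T>(
  fetchImpl: (url: string, init: { signal: AbortSignal }) => Promise<T>,
  url: string,
  timeoutMs: number,
): CancellableCall<T> {
  const controller = new AbortController();
  const timer = setTimeout(
    () => controller.abort(new Error(`deadline of ${timeoutMs} ms exceeded`)),
    timeoutMs,
  );
  const promise = fetchImpl(url, { signal: controller.signal }).finally(() =>
    clearTimeout(timer),
  );
  return {
    promise,
    cancel: () => controller.abort(new Error("cancelled by caller")),
  };
}

// Stand-in for a stalled transport: never resolves, rejects only on abort.
function stalledFetch(
  _url: string,
  init: { signal: AbortSignal },
): Promise<never> {
  return new Promise((_resolve, reject) => {
    if (init.signal.aborted) {
      reject(init.signal.reason);
      return;
    }
    init.signal.addEventListener("abort", () => reject(init.signal.reason), {
      once: true,
    });
  });
}

// The deadline path: the stalled call rejects after roughly 50 ms.
const timed = callWithDeadline(stalledFetch, "http://example.invalid/", 50);
timed.promise.catch((err: Error) => console.log(err.message));

// The cancel path: the caller aborts long before the deadline fires.
const cancelled = callWithDeadline(stalledFetch, "http://example.invalid/", 60000);
cancelled.promise.catch((err: Error) => console.log(err.message));
cancelled.cancel();
```

Sharing one controller between the timer and cancel() means either path ends the pending request, which is the property the Firestore-side client needs in order to be released.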

Requested maintainer feedback

  1. Is googleapis/google-cloud-node the correct active home for this report, or should it move elsewhere?
  2. Is the fallback timeout/deadline argument ordering intended?
  3. Is the returned REST stream cancel path expected to abort the underlying fetch?
  4. If this belongs in a different repo/package now, could you route it?
