Reloader thread crashes on transient errors, causing silent data staleness #7086

@bzantium

Summary

When TensorBoard's Reloader thread encounters an unhandled exception during a reload cycle (e.g., a transient network error while reading from a remote filesystem like GCS), the thread terminates permanently. TensorBoard's web server continues running, but no new data is ever loaded — the dashboard silently serves stale data with no indication to the user.

Steps to reproduce

  1. Start TensorBoard pointing to a GCS logdir:
    tensorboard --logdir gs://bucket/path --bind_all --load_fast=false
    
  2. Temporarily lose network connectivity (e.g., Wi-Fi disconnect, VPN timeout).
  3. Restore connectivity.

Expected behavior

TensorBoard logs the error and retries on the next reload cycle. Data loading resumes once connectivity is restored.

Actual behavior

The Reloader thread dies with an unhandled exception:

Exception in thread Reloader:
...
google.auth.exceptions.TransportError: ... Failed to resolve 'oauth2.googleapis.com' ...

After this, TensorBoard never reloads data again, even after the network is restored. The only recovery is to restart TensorBoard.
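The failure mode can be reproduced without GCS or a real network outage. The following is a minimal sketch, not TensorBoard code: `make_flaky_reload`, `unguarded_reload_loop`, and `run_demo` are hypothetical names, and a `ConnectionError` stands in for the transport error. It shows that a single failed cycle ends the thread for good, even though later cycles would have succeeded.

```python
import threading
import time


def make_flaky_reload(fail_on_call):
    """Return a hypothetical reload function that fails only on one call."""
    calls = {"n": 0}

    def reload_once():
        calls["n"] += 1
        if calls["n"] == fail_on_call:
            # Stand-in for a transient network error.
            raise ConnectionError("simulated transient network failure")

    return reload_once


def unguarded_reload_loop(reload_once, completed, interval):
    # Same shape as a reload loop with no try/except around its body.
    while True:
        reload_once()
        completed.append(time.time())
        time.sleep(interval)


def run_demo():
    reload_once = make_flaky_reload(fail_on_call=2)
    completed = []
    previous_hook = threading.excepthook
    # Suppress the default "Exception in thread" traceback for this demo only.
    threading.excepthook = lambda args: None
    try:
        thread = threading.Thread(
            target=unguarded_reload_loop,
            args=(reload_once, completed, 0.01),
            name="Reloader",
            daemon=True,
        )
        thread.start()
        # The thread exits as soon as the second cycle raises.
        thread.join(timeout=2.0)
    finally:
        threading.excepthook = previous_hook
    return thread.is_alive(), len(completed)
```

`run_demo()` returns whether the thread is still alive and how many cycles completed. Only the first cycle completes, and the thread is no longer alive afterwards.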

Root cause

The _reload function in data_ingester.py has no exception handling around the reload loop body:

def _reload():
    while True:
        # ... reload logic with no try/except ...
        time.sleep(self._reload_interval)

Any exception propagates out of the loop and kills the Reloader thread (or process), so no further reload cycles run.
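One possible shape for a fix is to catch and log exceptions inside the loop body, so a failed cycle is retried after the normal interval. This is a sketch under assumptions, not a patch against data_ingester.py: `resilient_reload_loop`, `reload_once`, and `stop_event` are hypothetical names introduced here for illustration.

```python
import logging
import threading

logger = logging.getLogger(__name__)


def resilient_reload_loop(reload_once, reload_interval, stop_event):
    """Call reload_once repeatedly, logging failures instead of exiting.

    reload_once: hypothetical callable that performs one reload cycle.
    reload_interval: seconds to wait between cycles.
    stop_event: threading.Event used to end the loop cleanly.
    """
    while not stop_event.is_set():
        try:
            reload_once()
        except Exception:
            # Log with traceback, then fall through to the wait below
            # so the next cycle retries.
            logger.exception(
                "Reload cycle failed; retrying in %s seconds", reload_interval
            )
        # Event.wait acts as a sleep that returns early once stop is requested.
        stop_event.wait(reload_interval)
```

Catching `Exception` rather than `BaseException` leaves `KeyboardInterrupt` and `SystemExit` untouched. Whether some errors should still be fatal (for example, a permanently invalid logdir) is a design question this sketch does not settle.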

Environment

  • TensorBoard 2.20.0
  • macOS (Apple Silicon)
  • Python 3.12
  • Using gcsfs for GCS filesystem support (no TensorFlow installed)
