
2026-04-10 Incident affecting fastpath #396

@hellais

Description

Rough timeline (times in CEST, UTC+2):

  • Fri 10th Apr 2026 @ 18:28 - there is an increase in aggregation queries, which I think are the heaviest query type we serve
  • Fri 10th Apr 2026 @ 18:43 - it looks like what is consuming the most CPU at the moment is the inserts from the fastpath
  • Fri 10th Apr 2026 @ 18:51 - we apply a more aggressive rate limit on API endpoints
  • Fri 10th Apr 2026 @ 20:08 - the problem seems to be the fastpath; this exception appears in the logs:
373715932 10 ERROR fastpath.db Failed Clickhouse insert
Traceback (most recent call last):
  File "/home/fastpath/app/fastpath/db.py", line 180, in _write_rows_to_fastpath
    click_client.execute(sql_insert, rows, settings=settings)
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 376, in execute
    rv = self.process_insert_query(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 604, in process_insert_query
    sample_block = self.receive_sample_block()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/clickhouse_driver/client.py", line 620, in receive_sample_block
    raise packet.exception
clickhouse_driver.errors.ServerException: Code: 202.
DB::Exception: Too many simultaneous queries. Maximum: 100. Stack trace:
  • Fri 10th Apr 2026 @ 20:27 - we modify the max_concurrent_queries setting in ClickHouse
  • Fri 10th Apr 2026 @ 20:42 - fastpath is rebooted and temporarily resumes functionality
  • Fri 10th Apr 2026 @ 23:42 - fastpath gets stuck again
  • Sat 11th Apr 2026 @ 00:01 - data is being stored in S3 as a failover for the fastpath, so we should be able to recover from there
  • Sat 11th Apr 2026 @ 03:01 - the S3 storage format is fixed, as the previous format would truncate measurements to the last entry (see ooni/backend@470b939); better exception logging is added to ooniprobe
  • Sat 11th Apr @ 12:00 - a new fastpath instance is set up to take over from the previous one
  • Sat 11th April @ 14:00 - database load is still very high

TODO: fill in missing timeline

  • Sat 11th Apr @ 18:22 - due to the increased load on our infrastructure and our inability to properly debug the outage, we decide to take the measurement instance offline until Monday, to at least preserve measurement collection
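For reference, max_concurrent_queries is a server-level setting in ClickHouse's config.xml, and its default of 100 matches the limit reported in the Code 202 error above. A sketch of the kind of change applied at 20:27 (the actual value deployed is not recorded here, 200 is illustrative):

```xml
<!-- /etc/clickhouse-server/config.xml (fragment) -->
<!-- Raises the cap that produced "Too many simultaneous queries. Maximum: 100" -->
<clickhouse>
    <max_concurrent_queries>200</max_concurrent_queries>
</clickhouse>
```

Note that raising this cap treats the symptom, not the cause: if the query load itself does not drop, a higher limit mostly buys time before the server saturates again, as the 23:42 recurrence suggests.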
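The traceback shows the fastpath's _write_rows_to_fastpath failing hard as soon as ClickHouse rejects the insert with Code 202. A minimal sketch of one possible mitigation on the client side: retry the insert with exponential backoff when that specific error comes back. All names here are hypothetical (the `execute` callable stands in for clickhouse_driver's Client.execute, and ServerError is a simplified stand-in for clickhouse_driver.errors.ServerException); this is not the fix that was actually deployed.

```python
import time

# ClickHouse server error code seen in the logs above
TOO_MANY_QUERIES = 202


class ServerError(Exception):
    """Simplified stand-in for clickhouse_driver.errors.ServerException."""

    def __init__(self, code: int):
        super().__init__(f"server error {code}")
        self.code = code


def insert_with_backoff(execute, sql, rows, retries=5, base_delay=0.5,
                        sleep=time.sleep):
    """Run execute(sql, rows), retrying on Code 202 with doubling delays.

    Any other error, or exhausting the retry budget, re-raises immediately.
    """
    for attempt in range(retries):
        try:
            return execute(sql, rows)
        except ServerError as exc:
            if exc.code != TOO_MANY_QUERIES or attempt == retries - 1:
                raise  # unrelated failure, or out of retries
            sleep(base_delay * 2 ** attempt)
```

Backoff only helps if the overload is transient; with a sustained query storm like this one, the retries would eventually exhaust and fail the batch anyway, which is where the S3 failover below comes in.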
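The 03:01 entry says the previous S3 storage format "would truncate measurements to the last entry". Without having inspected ooni/backend@470b939, a plausible minimal illustration of that class of bug is writing every measurement in a batch to the same object key, so each write overwrites the previous one. The dict below stands in for an S3 bucket and all names are hypothetical:

```python
def store_buggy(bucket: dict, key: str, measurements: list) -> None:
    # BUG: every iteration writes to the same key, so each upload
    # overwrites the previous one and only the last measurement survives.
    for m in measurements:
        bucket[key] = m


def store_fixed(bucket: dict, key_prefix: str, measurements: list) -> None:
    # One object per measurement: nothing gets overwritten.
    for i, m in enumerate(measurements):
        bucket[f"{key_prefix}/{i}"] = m
```

This matters for the recovery plan at 00:01: if the failover copy in S3 only retained the last entry per batch, replaying it into the fastpath could not fully reconstruct the lost measurements, hence the urgency of the format fix.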
