Stress testing and performance bottleneck investigation #989

Problem

We've observed intermittent performance issues in entity-api, including:

- 502 Bad Gateway (on PUT calls against entity-api)
- 504 Gateway Timeout (when ingest-ui triggers calls to entity-api)
- "Max retries exceeded" (from search-api during reindex when making HTTP calls to entity-api)

These errors indicate that, under load, requests fail to complete within the allowed time window.

Architecture

[Nginx -> uWSGI -> Python Flask App -> Neo4j Driver] (within the same container) -> Neo4j Server (separate container)

Test cases

We are creating [Locust](https://locust.io/) load tests in Python to cover the following scenarios:

- Baseline tests: 10 users making only GET requests (use the top 10 endpoints reported in the [Usage Dashboard](https://ingest.board.hubmapconsortium.org/usage))
- Reindex datasets: reindex 30 datasets via search-api
- Entity creations/updates: 10 new entity creations (POST) and 10 updates (PUT), which trigger a reindex via the search-api queue
- Bulk registration: bulk donor/sample registrations of ~40 entities (simulating TSV rows)
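For the bulk registration scenario, the TSV input could be generated synthetically rather than maintained by hand. The following is a minimal sketch: the column names in `COLUMNS` and the helper `make_tsv` are hypothetical placeholders, not the real bulk-upload template, and would need to be replaced with the headers the registration endpoint actually expects.

```python
import csv
import io

# Hypothetical column names for illustration only; replace them with the
# headers of the real bulk donor/sample registration template.
COLUMNS = ["lab_id", "description"]


def make_tsv(n_rows=40, prefix="stress-test"):
    """Return a TSV string with a header line plus n_rows synthetic rows."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for i in range(1, n_rows + 1):
        # Each row gets a unique, recognizable ID so test data is easy to find later.
        writer.writerow([f"{prefix}-{i:03d}", f"Synthetic row {i} for load testing"])
    return buf.getvalue()
```

Using a recognizable prefix on every row should make it easier to locate and clean up test entities after a run.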

The goal is to stress the system enough to reproduce the 502/504 errors and connection pool exhaustion, so that the real bottlenecks become visible.
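As a starting point, the baseline scenario might look like the locustfile sketched below. It labels each response so that 502s, 504s, and connection-level failures show up as separate failure types in the Locust report. The endpoint paths in `BASELINE_ENDPOINTS` are hypothetical placeholders (the real list should come from the Usage Dashboard), and `classify_status` and `BaselineUser` are names introduced here for illustration. The sketch assumes Locust's usual behavior of reporting status code 0 when no HTTP response was received.

```python
import random

try:
    from locust import HttpUser, task, between
except ImportError:  # Locust not installed; the helper below still imports cleanly
    HttpUser = None

# Hypothetical placeholder paths, NOT taken from entity-api: substitute the
# top 10 GET endpoints reported in the Usage Dashboard.
BASELINE_ENDPOINTS = ["/status", "/entities/EXAMPLE_ID"]

GATEWAY_ERRORS = {502: "bad_gateway", 504: "gateway_timeout"}


def classify_status(status_code):
    """Map a status code to a label for the error types this issue tracks."""
    if status_code == 0:
        # Assumption: Locust uses 0 when the request failed before any response.
        return "connection_error"
    if status_code in GATEWAY_ERRORS:
        return GATEWAY_ERRORS[status_code]
    if status_code >= 400:
        return "other_error"
    return "ok"


if HttpUser is not None:

    class BaselineUser(HttpUser):
        # Each simulated user pauses 1-3 seconds between requests.
        wait_time = between(1, 3)

        @task
        def get_endpoint(self):
            path = random.choice(BASELINE_ENDPOINTS)
            with self.client.get(path, name=path, catch_response=True) as resp:
                label = classify_status(resp.status_code)
                if label == "ok":
                    resp.success()
                else:
                    # The label becomes the failure message in Locust's stats.
                    resp.failure(label)
```

A run with 10 users could then be started with something like `locust -f locustfile.py --headless --users 10 --spawn-rate 10 --host <entity-api base URL>`, where the host is supplied by whoever runs the test.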

Execution plan

- Run the above stress tests and collect metrics
- Fine-tune deployment parameters
- Re-run tests iteratively until stable performance is achieved
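To make "stable performance" checkable between iterations, each run could be reduced to a small summary. The sketch below is one possible way to do that from a list of observed status codes. The function name `summarize_run` and the 1% default failure threshold are assumptions for illustration, not agreed acceptance criteria.

```python
from collections import Counter


def summarize_run(status_codes, max_failure_rate=0.01):
    """Summarize one test run from its observed HTTP status codes.

    A status of 0 (no response received) or any code >= 400 counts as a failure.
    max_failure_rate is a hypothetical threshold; tune it to the team's criteria.
    """
    counts = Counter(status_codes)
    total = sum(counts.values())
    failures = sum(n for code, n in counts.items() if code == 0 or code >= 400)
    rate = failures / total if total else 0.0
    return {
        "total": total,
        "failures": failures,
        "bad_gateway_502": counts.get(502, 0),
        "gateway_timeout_504": counts.get(504, 0),
        "failure_rate": rate,
        # An empty run is never considered stable.
        "stable": total > 0 and rate <= max_failure_rate,
    }
```

Comparing these summaries before and after each parameter change gives a simple record of whether a tuning step helped.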
