Stress testing and performance bottleneck investigation #989

@yuanzhou

Problem

We've observed intermittent performance issues in entity-api including:

  • 502 Bad Gateway (on PUT calls against entity-api)
  • 504 Gateway Timeout (when ingest-ui triggers calls to entity-api)
  • "Max retries exceeded" (from search-api during reindex when making HTTP calls to entity-api)

These errors suggest that, under load, requests do not complete within the allowed time window.

Architecture

[Nginx -> uWSGI -> Python Flask App -> Neo4j Driver] (within the same container) -> Neo4j Server (separate container)

Test cases

We are creating Locust load tests in Python to cover the following scenarios:

  • Baseline tests: 10 users making only GET requests (using the top 10 endpoints reported in the Usage Dashboard)
  • Reindex datasets: reindex 30 datasets via search-api
  • Entity creations/updates: 10 new entity creations (POST) and 10 updates (PUT), each of which triggers a reindex via the search-api queue
  • Bulk registration: bulk donor/sample registrations of ~40 entities (simulating TSV rows)

The goal is to stress the system enough to reproduce 502/504 errors and connection pool exhaustion, revealing the real bottlenecks.

Execution plan

  • Run the above stress tests and collect metrics
  • Fine-tune deployment parameters
  • Re-run tests iteratively until stable performance is achieved
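
To compare runs across tuning iterations, the collected metrics could be reduced to a few numbers per run. The helper below is a sketch under assumptions: it expects raw `(status_code, latency_seconds)` samples (for example, parsed from logs), and the function name and returned field names are hypothetical. Locust also reports similar statistics on its own; this is only for post-processing raw samples.

```python
from collections import Counter
from statistics import quantiles


def summarize(samples):
    """Summarize (status_code, latency_seconds) samples from one test run.

    Requires at least two samples, because statistics.quantiles does.
    """
    samples = list(samples)
    statuses = Counter(status for status, _ in samples)
    latencies = [latency for _, latency in samples]
    # n=100 yields 99 cut points; index 49 is the median, index 94 is p95.
    cuts = quantiles(latencies, n=100, method="inclusive")
    total = len(samples)
    gateway_errors = statuses[502] + statuses[504]
    return {
        "requests": total,
        "p50": cuts[49],
        "p95": cuts[94],
        "gateway_error_rate": gateway_errors / total,
        "by_status": dict(statuses),
    }
```

Tracking `p95` and `gateway_error_rate` after each parameter change gives a simple way to judge whether an iteration moved the system closer to stable performance.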

Status

In Progress
