
refactor(tests): split suites by execution level and speed up CI#1609

Open
hassiebp wants to merge 29 commits into main from codex/split-test-suites-by-directory

Conversation

@hassiebp
Contributor

@hassiebp hassiebp commented Apr 4, 2026

Summary

This PR restructures the test suite around execution level and follows through on the CI and reliability work needed to make that split useful in practice.

  • move deterministic local tests into tests/unit
  • move Langfuse-server-backed tests into tests/e2e
  • keep live provider coverage in one dedicated always-on suite under tests/live_provider
  • keep shared helpers in tests/support
  • split mixed prompt and media coverage so unit, e2e, and live-provider cases live in the right buckets
  • keep OpenAI and LangChain unit coverage local by asserting against the in-memory exporter instead of the Langfuse server roundtrip
  • add a repo-level AGENTS.md and remove CLAUDE.md

CI and test infrastructure

This also updates CI and test plumbing so the new layout stays fast and reliable.

  • run tests/unit on the Python 3.10–3.14 matrix
  • run tests/e2e in 2 mechanical shards selected by scripts/select_e2e_shard.py
  • keep tests/live_provider as one dedicated suite that always runs
  • keep only serial_e2e as the scheduling-specific pytest marker
  • replace hidden e2e_core / e2e_data marker routing with mechanical file sharding
  • keep cancel-in-progress: true
  • bootstrap Langfuse in CI via LANGFUSE_INIT_*
  • remove the Docker image cache path after it proved slower than fresh pulls
  • replace fixed e2e sleeps with bounded retry helpers where read-after-write consistency mattered
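The bounded retry helpers that replace the fixed sleeps can be sketched roughly as follows. This is an illustrative sketch only — the function name, signature, and defaults here are assumptions, not the actual `tests/support/retry.py` API:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_for_result(
    fetch: Callable[[], T],
    is_ready: Callable[[T], bool],
    *,
    timeout: float = 10.0,
    interval: float = 0.25,
) -> T:
    """Poll `fetch` until `is_ready` accepts the result or the deadline expires.

    Unlike a fixed sleep, this returns as soon as the read-after-write
    condition holds, and fails loudly with a TimeoutError if it never does.
    """
    deadline = time.monotonic() + timeout
    result = fetch()
    while not is_ready(result):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)
        result = fetch()
    return result
```

A test would call something like `wait_for_result(lambda: api.list_items(), lambda r: len(r.data) > 0)` instead of `time.sleep(5)` followed by a single read.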

Unit test speedups

The unit-suite speed work keeps coverage intact rather than weakening assertions.

  • replace prompt-cache polling with deterministic queue draining
  • restore the stricter OTEL timing and prompt atexit stress assertions
  • wake score-ingestion and media-upload workers immediately on shutdown via sentinels instead of waiting on queue timeouts
  • add regression coverage for the worker shutdown behavior
  • use --dist worksteal for the unit lane
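The sentinel-driven wake-up described above can be shown in miniature. This is a toy worker, not the SDK's actual consumer classes — the point is that a sentinel object pushed onto the queue wakes a blocking `get()` immediately, so shutdown never waits out a queue timeout:

```python
import queue
import threading

_SHUTDOWN = object()  # unique sentinel placed on the queue to wake the worker


class Worker(threading.Thread):
    def __init__(self) -> None:
        super().__init__(daemon=True)
        self.tasks: "queue.Queue[object]" = queue.Queue()
        self.processed: list = []

    def run(self) -> None:
        while True:
            item = self.tasks.get()  # blocks indefinitely; no polling timeout
            if item is _SHUTDOWN:
                break  # exit immediately instead of draining a timeout
            self.processed.append(item)

    def shutdown(self) -> None:
        self.tasks.put(_SHUTDOWN)  # wakes the blocked get() right away
        self.join()
```

Note that with a single sentinel, only one consumer thread is woken — which is why the review below flags a multi-consumer risk when `threads > 1`.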

Why

The repo already had a mix of local-only, real-server, and live-provider tests, but the old flat tests/ layout made the boundary hard to see and easy to erode. This change makes the split explicit in the filesystem, keeps unit coverage fast, and keeps the expensive e2e and provider surface intentional.

Results

Local measurements after the unit refactor:

  • tests/unit/test_otel.py + tests/unit/test_propagate_attributes.py: 215.08s -> 2.38s
  • full tests/unit run: 360 passed, 2 skipped in 13.42s
  • CI-style unit run with -n auto --dist worksteal: 360 passed, 2 skipped in 5.42s

Validation

  • uv run --frozen pytest -q tests/unit --maxfail=1
  • uv run --frozen pytest -q -n auto --dist worksteal tests/unit --maxfail=1
  • uv run --frozen pytest --collect-only -q tests/unit tests/e2e tests/live_provider
  • uv run --frozen pytest -q tests/unit/test_e2e_sharding.py
  • uv run --frozen pytest --collect-only -q tests/live_provider -m 'live_provider'
  • uv run --frozen ruff check scripts/select_e2e_shard.py tests/conftest.py tests/unit/test_e2e_sharding.py tests/e2e/test_core_sdk.py

CI is the authoritative validation for the real Langfuse-server and live-provider suites.

Disclaimer: Experimental PR review

Greptile Summary

This PR is a large, well-engineered refactoring that splits the flat tests/ directory into tests/unit, tests/e2e, tests/live_provider, and tests/support, and rewires CI to match. It also meaningfully speeds up the unit suite (360 tests in ~5 s) by replacing polling-based waits with sentinel-driven shutdown and deterministic queue draining, and adds proper thread-safety locks to PromptCache. The structural changes are clean and the new retry/sharding helpers are solid.

Key changes:

  • langfuse/_utils/prompt_cache.py: PromptCache and PromptCacheTaskManager are now fully protected by RLock; consumer switches from a 1-second polling loop to a blocking queue.get() with sentinel-based shutdown; add_refresh_prompt_task_if_current() avoids redundant background refreshes.
  • langfuse/_task_manager/media_upload_consumer.py / score_ingestion_consumer.py / media_manager.py: sentinel-driven pause() / signal_shutdown() for immediate worker wake-up on shutdown.
  • scripts/select_e2e_shard.py: greedy weight-based shard assignment that falls back to local AST-based test counting for unknown files.
  • tests/support/retry.py: bounded retry helper for e2e eventual-consistency reads.
  • .github/workflows/ci.yml: unit-tests job runs across Python 3.10–3.14; e2e-tests job runs two mechanical shards + live-provider suite; Langfuse server bootstrapped via LANGFUSE_INIT_* env vars instead of a manual seeder script.
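The greedy weight-based shard assignment presumably works along these lines (a simplified sketch; the real scripts/select_e2e_shard.py additionally estimates weights for unknown files via AST-based test counting):

```python
def assign_shards(weights: dict, num_shards: int) -> list:
    """Greedily place the heaviest files onto the currently lightest shard.

    `weights` maps test file path -> estimated test count (or runtime).
    Sorting by descending weight first keeps shard totals well balanced.
    """
    shards = [[] for _ in range(num_shards)]
    totals = [0] * num_shards
    # Tie-break on path for deterministic output across CI runs.
    for path, weight in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        idx = totals.index(min(totals))  # lightest shard so far
        shards[idx].append(path)
        totals[idx] += weight
    return shards
```

Each CI shard job would then run only `assign_shards(weights, 2)[shard_index]`, so every e2e file lands in exactly one shard.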

Confidence Score: 5/5

Safe to merge — all findings are P2 style/improvement items with no production correctness risk under default configuration

The core source changes (sentinel-based shutdown, RLock thread-safety, prompt cache freshness guard) are sound and well-tested. The multi-consumer sentinel issue only affects threads > 1 which is not the default. The retry edge case requires a specific timing race that is unlikely in practice. All remaining findings are style, robustness improvements, or minor logic nuances — none block correctness on the happy path.

Open P2 findings: tests/support/retry.py (stale-error edge case), langfuse/_utils/prompt_cache.py (multi-consumer shutdown), scripts/select_e2e_shard.py (class-based test counting)

Important Files Changed

Filename Overview
langfuse/_utils/prompt_cache.py Adds RLock thread-safety to PromptCache and PromptCacheTaskManager; switches consumer to blocking queue.get() with sentinel-based shutdown; adds wait_for_idle() and add_refresh_prompt_task_if_current() — multi-consumer sentinel deadlock risk when threads > 1
langfuse/_task_manager/media_manager.py Adds sentinel-based signal_shutdown() to wake MediaManager consumer immediately instead of waiting on queue timeout
langfuse/_task_manager/score_ingestion_consumer.py Adds sentinel-based shutdown to ScoreIngestionConsumer.pause() so the consumer breaks its loop immediately
langfuse/_task_manager/media_upload_consumer.py Calls signal_shutdown() on pause so the media manager wakes up immediately
langfuse/_client/client.py Switches to add_refresh_prompt_task_if_current() to skip redundant background refreshes when cache is already fresh
tests/support/retry.py New bounded retry helper for e2e eventual-consistency reads; stale error can be raised after a not-ready result when the deadline expires
tests/support/utils.py Transparent _RetryingApiProxy retries list/get reads until data is available for e2e tests
tests/support/api_wrapper.py HTTP-level Langfuse API wrapper with built-in retry on 404/NotFoundError payloads for e2e helpers
scripts/select_e2e_shard.py Greedy weight-based shard assignment for e2e files; count_test_functions only counts top-level functions and misses class-based test methods
tests/conftest.py Central fixture file providing InMemorySpanExporter, langfuse_memory_client, and marker routing; contains imports inside mock_init closure
.github/workflows/ci.yml Splits CI into unit-tests (Python 3.10–3.14 matrix) and e2e-tests (sharded + live_provider); unit-tests job uses unpinned actions/checkout@v3 and setup-uv@v7
tests/unit/test_e2e_sharding.py Unit tests verifying shard assignment covers all e2e files exactly once and weight estimation works
tests/unit/test_e2e_support.py Unit tests covering retry helper behavior: NotFound retries, filtered-list retries, disabling retry, and predicate-based waiting
tests/unit/test_prompt.py Unit tests for prompt cache using mocked server calls; contains import inside fixture function
tests/unit/test_prompt_atexit.py Subprocess-based stress tests verifying prompt cache shuts down cleanly at process exit with 10 workers
tests/unit/test_otel.py Comprehensive OTel span tests using InMemorySpanExporter; many imports placed inside test methods rather than at module level
AGENTS.md New repo-level agent instructions replacing CLAUDE.md; no code changes
CLAUDE.md Removed in favour of AGENTS.md; no code impact

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    PR[Pull Request / Push] --> C{Concurrency group}
    C -->|cancel-in-progress| LINT[linting job]
    C --> TC[type-checking job]
    C --> UNIT[unit-tests\nPython 3.10–3.14 matrix]
    C --> E2E[e2e-tests matrix]

    E2E --> S1[Shard 1\nselect_e2e_shard --shard-index 0]
    E2E --> S2[Shard 2\nselect_e2e_shard --shard-index 1]
    E2E --> LP[live_provider suite]

    S1 --> PAR1[pytest -n 4 --dist worksteal\nnot serial_e2e]
    S1 --> SER1[pytest serial_e2e]
    S2 --> PAR2[pytest -n 4 --dist worksteal\nnot serial_e2e]
    S2 --> SER2[pytest serial_e2e]
    LP --> LPT[pytest -n 4 --dist worksteal\ntests/live_provider -m live_provider]

    UNIT --> |pytest -n auto --dist worksteal| UT[tests/unit 360 tests]

    LINT --> GATE[all-tests-passed]
    TC --> GATE
    UNIT --> GATE
    S1 --> GATE
    S2 --> GATE
    LP --> GATE
    GATE --> MERGE[Branch protection check]

Reviews (1): Last reviewed commit: "fix(prompt-cache): avoid redundant refre..."

Greptile also left 3 inline comments on this PR.

Context used:

  • Rule used - Move imports to the top of the module instead of p... (source)

Learnt From
langfuse/langfuse-python#1387

@hassiebp hassiebp changed the title from "[codex] split test suites into execution-level directories" to "refactor(tests): split suites by execution level and speed up CI" on Apr 4, 2026
@hassiebp hassiebp marked this pull request as ready for review April 4, 2026 21:18
@github-actions

github-actions bot commented Apr 4, 2026

@claude review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb98e3ad52

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6fed925e43



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 170beb8201



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: daaf571e78


@hassiebp
Contributor Author

hassiebp commented Apr 7, 2026

@claude review

@hassiebp
Contributor Author

hassiebp commented Apr 7, 2026

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Hooray!



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c38f2a197


Comment on lines +65 to +77
        item.add_marker(pytest.mark.e2e)
        item.add_marker(pytest.mark.live_provider)


@pytest.fixture(autouse=True)
def reset_langfuse_state(request: pytest.FixtureRequest) -> Iterable[None]:
    if request.node.get_closest_marker("live_provider") is not None:
        yield
        return

    LangfuseResourceManager.reset()
    yield
    LangfuseResourceManager.reset()


🔴 The new autouse reset_langfuse_state fixture (tests/conftest.py:65-77) calls LangfuseResourceManager.reset() before and after every non-live_provider test, which shuts down ALL registered resource managers globally — including the one backing the module-level langfuse = Langfuse() instance created at import time in tests/live_provider/test_openai.py:11. When a mixed pytest session runs tests/unit before tests/live_provider, every unit test invalidates that module-level instance; subsequent live_provider calls to langfuse.flush() operate on a shut-down manager, silently dropping all spans and causing live_provider assertions to fail. Fix by limiting the reset fixture's scope to tests/unit only (e.g. skip when the test is not marked unit), or by replacing the module-level client in test_openai.py with a per-test fixture.

Extended reasoning...

What the bug is and how it manifests

The reset_langfuse_state fixture introduced by this PR is declared autouse=True and applies to every test in the session. It correctly short-circuits for tests carrying the live_provider marker (lines 66-68), but it still fires — both before and after — for every unit and e2e test. Each invocation calls LangfuseResourceManager.reset(), which iterates over every key in _instances, calls shutdown() on each resource manager, then clears _instances entirely.

The specific code path that triggers it

  1. pytest collection imports tests/live_provider/test_openai.py at module scope, executing line 11: langfuse = Langfuse(). This creates a LangfuseResourceManager keyed to the environment's public key and registers it in _instances.
  2. The first unit test runs. reset_langfuse_state fires its setup phase, calling LangfuseResourceManager.reset(). This calls shutdown() on the module-level manager (which calls flush(), then stops worker threads), then removes it from _instances.
  3. The langfuse module-level variable in test_openai.py still holds a reference to the now-shut-down Langfuse client. The client's _resources attribute points to the dead LangfuseResourceManager — there is no lazy re-initialization path (confirmed by langfuse/_client/client.py line 325).
  4. When any live_provider test calls langfuse.flush() (which happens dozens of times throughout test_openai.py), it calls force_flush() on the old, already-shut-down tracer provider. Spans recorded by OpenAI instrumentation after the reset go to a newly-created resource manager (created implicitly on the next Langfuse() call from OpenAI instrumentation), but that manager is never explicitly flushed by langfuse.flush().

Why existing safeguards do not prevent it

The fixture correctly guards live_provider tests from resetting each other. However, the guard fires on the test being run, not on all tests in the session. A unit test running before a live_provider test has already fired the global reset, and no mechanism can undo a shutdown after the fact. The Langfuse client does not attempt to reconnect or re-acquire a new resource manager when its stored _resources instance has been shut down.

Impact

Running pytest tests/unit tests/live_provider locally (a natural workflow for a developer iterating on both suites) silently breaks every live_provider test that uses the module-level langfuse object. Assertions like assert len(generation.data) != 0 will fail because spans are never exported. In CI the suites run in separate jobs and separate processes, so the regression is invisible there — making it an insidious local-only failure that can block developer productivity and erode confidence in the test suite.

Step-by-step proof

  1. pytest tests/unit tests/live_provider — collection imports test_openai.py; module-level Langfuse() registers manager M0 in _instances.
  2. First unit test starts; reset_langfuse_state (setup) calls LangfuseResourceManager.reset(): M0 is shut down and _instances is cleared.
  3. Unit test body runs; unit test teardown calls LangfuseResourceManager.reset() again (no-op, _instances already empty).
  4. live_provider test test_openai_chat_completion runs; it calls openai.chat.completions.create(...) which records spans via a new resource manager M1 (created implicitly by the OpenAI instrumentation's internal get_client()).
  5. Test calls langfuse.flush() (the module-level instance) → M0.flush() → M0.tracer_provider.force_flush() on the dead provider → no spans from M1 are exported.
  6. get_api().legacy.observations_v1.get_many() returns empty; assert len(generation.data) != 0 fails.

How to fix it

Restrict reset_langfuse_state to unit tests only — the only suite that genuinely needs a clean slate between tests:

@pytest.fixture(autouse=True)
def reset_langfuse_state(request: pytest.FixtureRequest) -> Iterable[None]:
    if request.node.get_closest_marker('unit') is None:
        yield
        return
    LangfuseResourceManager.reset()
    yield
    LangfuseResourceManager.reset()

Alternatively, move the module-level langfuse = Langfuse() in test_openai.py into a module-scoped or function-scoped fixture so it is always initialized after any reset.

Comment on lines +151 to +152
    with pytest.raises(NotFoundError):
        langfuse.api.dataset_items.get(item.id)


🟡 The final NotFoundError assertion in test_upsert_and_get_dataset_item (lines 151-152) calls langfuse.api.dataset_items.get(item.id) directly with no retry, immediately after a wait_for_result that only guarantees the item has disappeared from the /datasets/{name}/items list endpoint — a different server code path than /dataset-items/{id}. If the server propagates the ARCHIVED status to these two endpoints at slightly different times, the direct get() may not yet raise NotFoundError, causing a spurious test failure. Wrapping the assertion in a retry predicate that loops until NotFoundError is raised would make this consistent with every other assertion in the same test.

Extended reasoning...

What the bug is and how it manifests

The test test_upsert_and_get_dataset_item (tests/e2e/test_datasets.py) archives a dataset item and then uses two sequential checks to confirm the archival has taken effect. First, it calls wait_for_result (lines 142-147) polling langfuse.get_dataset(name) until the item no longer appears in the dataset's item list. Then, immediately afterward (lines 151-152), it calls langfuse.api.dataset_items.get(item.id) with a bare pytest.raises(NotFoundError) and no retry.

The specific code path that triggers it

langfuse.get_dataset(name) ultimately queries the server's GET /datasets/{name}/items list endpoint, which filters results by status = ACTIVE. langfuse.api.dataset_items.get(item.id) queries the server's GET /dataset-items/{id} direct-get endpoint. Although both endpoints read from the same backing database, these are distinct server code paths with their own query logic. In a deployment using async queue-based ingestion (LANGFUSE_INGESTION_QUEUE_DELAY_MS=10 is set in CI), it is possible — if uncommon — for the ARCHIVED status to propagate to the list-filter query before it propagates to the direct-get 404 guard, causing the direct get to return the item rather than raising NotFoundError.

Why existing code does not prevent it

The wait_for_result on lines 142-147 is explicitly scoped to the list endpoint. Once it returns, all we know is that get_dataset() no longer returns the item. There is no guarantee that the direct-get endpoint has also begun returning 404 for the same item ID. The pytest.raises block on lines 151-152 does not retry; it makes a single synchronous call, so any window of inconsistency between the two endpoints results in a test failure.

Addressing the refutation

The refuting verifier argues that a single database write marks the item ARCHIVED and both endpoints share the same backing data, so the scenario requires separate caches — which is speculative. This is a valid point for a simple synchronous system. However, the system under test uses a queue-based ingestion pipeline (LANGFUSE_INGESTION_QUEUE_DELAY_MS=10) that can introduce processing order dependencies. More importantly, the PR's own author demonstrably understands there is an eventual-consistency window here: every single other read assertion in this test was wrapped in wait_for_result, including earlier calls to the same langfuse.api.dataset_items.get() endpoint (lines 102-105 and 119-127). The omission on lines 151-152 is structurally inconsistent with the pattern the PR establishes, regardless of whether the probability of failure is low.

Impact

The consequence is intermittent CI flakiness — the test passes most of the time but can fail spuriously when the direct-get endpoint has not yet started returning 404 for the archived item. This is test-only code, so there is no production correctness risk.

Step-by-step proof

  1. create_dataset_item(..., status=DatasetStatus.ARCHIVED) is called at line 134.
  2. The ingestion queue processes the archival event asynchronously.
  3. wait_for_result (lines 142-147) polls GET /datasets/{name}/items and returns as soon as that endpoint filters out the item.
  4. At this point the direct-get endpoint GET /dataset-items/{id} may have not yet received or processed the same status update through its code path.
  5. langfuse.api.dataset_items.get(item.id) at line 152 returns the item successfully instead of raising NotFoundError.
  6. pytest.raises(NotFoundError) fails with "did not raise".

Recommended fix

Wrap the final assertion in a retry predicate that loops until NotFoundError is raised, analogous to the inverted pattern needed when the terminal condition is an exception rather than a successful result. This makes the assertion consistent with the rest of the test and robust to endpoint propagation timing.
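The retry-until-exception predicate suggested above could look roughly like this. The name `wait_until_raises` is hypothetical, and `NotFoundError` here is a local stand-in for the SDK's exception class:

```python
import time


class NotFoundError(Exception):
    """Stand-in for the SDK's NotFoundError."""


def wait_until_raises(call, exc_type=NotFoundError, *, timeout=10.0, interval=0.25):
    """Poll `call` until it raises `exc_type`; fail if the deadline passes first."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            call()
        except exc_type:
            return  # the archival has propagated to this endpoint
        if time.monotonic() >= deadline:
            raise AssertionError(
                f"{exc_type.__name__} was not raised within {timeout}s"
            )
        time.sleep(interval)
```

The test's final check would then become `wait_until_raises(lambda: langfuse.api.dataset_items.get(item.id))` instead of a single unretried `pytest.raises` call.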
