Optimize CI: consolidate workflows, fix caching, speed up e2e tests (47min → 15min) by vikrantpuppala · Pull Request #772 · databricks/databricks-sql-python

vikrantpuppala · 2026-04-13T13:00:50Z

Summary

Consolidate 3 workflows into 1: Delete integration.yml and daily-telemetry-e2e.yml — coverage workflow already runs all e2e tests. Add push: main trigger. Run all tests (including telemetry) in a single pytest invocation with --dist=loadgroup for xdist_group isolation.
Fix pyarrow cache: Remove cache-path: .venv-pyarrow — poetry always creates .venv, so the cache was never saved ("Path does not exist" error). 3.14 PyArrow jobs dropped from 18min → 3min once cache populated.
Fix 3.14 post-test DNS hang: Add enable_telemetry=False to unit test dummy connection args. Unit tests using server_hostname="foo" triggered real HTTP calls — on protected runners this caused an 8-min process hang. 3.14 unit tests dropped from 17min → 2min.
Better xdist distribution: Split TestPySQLLargeQueriesSuite into 3 separate classes and split lz4 on/off into separate parametrized cases so xdist distributes slow tests across 4 workers.
Use 4 workers: -n 4 instead of -n auto (2 CPUs). E2e tests are network-bound (waiting on warehouse), not CPU-bound.
Reduce test sizes: Large result set tests 300MB → 100MB. test_long_running_query threshold 3min → 1min, starting scale_factor 1 → 50.
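
The effect of splitting slow tests and raising the worker count can be illustrated with a toy scheduling model. This is only a sketch: it assumes each idle worker simply takes the next test in order, which is a simplification and not pytest-xdist's actual scheduler, and every duration below is a hypothetical number chosen for illustration, not a measurement from this PR.

```python
import heapq


def makespan(durations, workers):
    """Wall-clock time when each idle worker takes the next test in order."""
    finish_times = [0.0] * workers
    heapq.heapify(finish_times)
    for duration in durations:
        # The worker that frees up first picks up the next test.
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)


# Hypothetical per-test durations in minutes (illustrative only).
slow_tests = {
    "wide[lz4=False]": 4,
    "wide[lz4=True]": 4,
    "narrow": 3,
    "long_running": 5,
}

# Everything scheduled as one unit: extra workers cannot help.
grouped = makespan([sum(slow_tests.values())], workers=4)

# Scheduled as independent units on 2 and on 4 workers.
split_two = makespan(list(slow_tests.values()), workers=2)
split_four = makespan(list(slow_tests.values()), workers=4)

print(grouped, split_two, split_four)
```

In this model the wall-clock time is bounded below by the longest single unit, which is why splitting a large class or a sequential lz4 on/off loop matters as much as adding workers.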

Results (measured)

Metric	Before	After
E2E workflows per PR	3	1
Coverage wall-clock	47 min	15 min
Integration workflow	40 min	deleted
3.14 unit tests	17m38s	2m46s
3.14 PyArrow tests	18m26s	3m21s
3.14 linting	15m46s	1m27s
Total warehouse compute per PR	~85 min	~15 min
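
As a quick arithmetic check, the speedup factors implied by the table can be computed directly from the reported durations. The helper names below are hypothetical; the input values are copied from the table.

```python
import re


def to_seconds(text):
    """Parse durations such as '17m38s' or '47 min' into seconds."""
    match = re.fullmatch(r"(\d+)\s*m(?:in)?\s*(?:(\d+)s)?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds or 0)


def speedup(before, after):
    """Ratio of the 'before' duration to the 'after' duration."""
    return to_seconds(before) / to_seconds(after)


# Rows copied from the results table above.
measured = [
    ("Coverage wall-clock", "47 min", "15 min"),
    ("3.14 unit tests", "17m38s", "2m46s"),
    ("3.14 PyArrow tests", "18m26s", "3m21s"),
    ("3.14 linting", "15m46s", "1m27s"),
]

for name, before, after in measured:
    print(f"{name}: {speedup(before, after):.1f}x faster")
```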

Test plan

All 34 CI checks pass
Coverage workflow runs all tests including telemetry (870 passed, 25 skipped)
3.14 pyarrow cache saves and hits on subsequent runs
3.14 jobs no longer have post-test DNS hang
LargeQueriesSuite tests distributed across multiple xdist workers

SKIP_COVERAGE_CHECK = CI workflow changes only, no source code coverage impact

This pull request was AI-assisted by Isaac.

Workflow consolidation:
- Delete integration.yml and daily-telemetry-e2e.yml (redundant with coverage workflow, which already runs all e2e tests)
- Add push-to-main trigger to coverage workflow
- Run all tests (including telemetry) in single pytest invocation with --dist=loadgroup to respect xdist_group markers for isolation

Fix pyarrow cache:
- Remove cache-path: .venv-pyarrow from pyarrow jobs. Poetry always creates .venv regardless of the cache-path input, so the cache was never saved ("Path does not exist" error). The cache-suffix already differentiates keys between variants.

Fix 3.14 post-test DNS hang:
- Add enable_telemetry=False to unit test DUMMY_CONNECTION_ARGS that use server_hostname="foo". This prevents FeatureFlagsContext from making real HTTP calls to fake hosts, eliminating ~8min hang from ThreadPoolExecutor threads timing out on DNS on protected runners.

Improve e2e test parallelization:
- Split TestPySQLLargeQueriesSuite into 3 separate classes (TestPySQLLargeWideResultSet, TestPySQLLargeNarrowResultSet, TestPySQLLongRunningQuery) so xdist distributes them across workers instead of all landing on one.

Speed up slow tests:
- Reduce large result set sizes from 300MB to 100MB (still validates large fetches, lz4, chunking, row integrity)
- Start test_long_running_query at scale_factor=50 instead of 1 to skip ramp-up iterations that finish instantly

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
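
The telemetry fix in this commit could look roughly like the sketch below. Only server_hostname="foo" and enable_telemetry=False come from the commit message; the other keys and the dummy_connection_args helper are hypothetical placeholders, and how the real unit tests pass these arguments to the connection is not shown here.

```python
# Shared defaults for unit tests that never talk to a real server.
DUMMY_CONNECTION_ARGS = {
    "server_hostname": "foo",        # fake host, as in the commit message
    "http_path": "dummy_path",       # assumed placeholder
    "access_token": "dummy_token",   # assumed placeholder
    "enable_telemetry": False,       # avoids real HTTP calls to the fake host
}


def dummy_connection_args(**overrides):
    """Return a fresh copy of the defaults with per-test overrides applied."""
    return {**DUMMY_CONNECTION_ARGS, **overrides}


# A test that needs a different host still keeps telemetry disabled.
args = dummy_connection_args(server_hostname="bar")
print(args["server_hostname"], args["enable_telemetry"])
```

Returning a copy keeps one test's overrides from leaking into another, so telemetry stays off unless a test opts in explicitly.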

- Use -n 4 instead of -n auto in coverage workflow. The e2e tests are network-bound (waiting on warehouse), not CPU-bound, so 4 workers on a 2-CPU runner is fine and doubles parallelism.
- Lower test_long_running_query min_duration from 3 min to 1 min. The test validates long-running query completion; 1 minute is sufficient and saves ~4 min per variant.
- Split lz4 on/off loop in test_query_with_large_wide_result_set into separate parametrized test cases so xdist can run them on different workers instead of sequentially in one test.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
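
Why a higher starting scale_factor and a lower min_duration shorten test_long_running_query can be seen in a toy model of the ramp-up loop. This is a sketch under stated assumptions, not the test's actual code: the scale factor is assumed to double after each run that finishes too quickly, and simulated_query is a hypothetical stand-in in which each scale unit costs 0.5 seconds.

```python
def ramp_up(start_scale, min_duration_s, run_query, growth=2):
    """Rerun the query at growing scale factors until one run is long enough.

    Returns the list of scale factors tried, in order.
    """
    tried = []
    scale = start_scale
    while True:
        duration = run_query(scale)
        tried.append(scale)
        if duration >= min_duration_s:
            return tried
        scale *= growth  # assumed growth rule; the real test may differ


def simulated_query(scale_factor):
    """Stand-in for a warehouse query: pretend each scale unit costs 0.5 s."""
    return 0.5 * scale_factor


from_one = ramp_up(1, 60, simulated_query)
from_fifty = ramp_up(50, 60, simulated_query)
print(len(from_one), "runs starting at 1;", len(from_fifty), "runs starting at 50")
```

In this model, starting at 50 skips the early runs that finish almost instantly, and a lower min_duration ends the loop at a smaller final scale factor. Real query durations will not scale this cleanly.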

jprakash-db

Overall LGTM. Thanks for making the changes

Per review feedback from jprakash-db:
- Remove mixin classes (LargeWideResultSetMixin, etc) and inline the test methods directly into the test classes in test_driver.py
- Remove backward-compat LargeQueriesMixin alias (nothing uses it)
- Rename _LargeQueryRowHelper (replaced entirely by inlining)
- Convert large_queries_mixin.py to just a fetch_rows() helper function

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
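
A standalone fetch_rows() helper of the kind this commit describes might look like the sketch below. The signature and body are assumptions, since the actual helper in large_queries_mixin.py is not shown here. The sketch relies only on the DB-API fetchmany() method and uses an in-memory sqlite3 cursor as a stand-in for a warehouse cursor.

```python
import sqlite3


def fetch_rows(cursor, fetchmany_size):
    """Yield rows one at a time, pulling them from the cursor in batches."""
    while True:
        batch = cursor.fetchmany(fetchmany_size)
        if not batch:
            return  # an empty batch means the result set is exhausted
        yield from batch


# Stand-in data source: 10 rows in an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])
cur.execute("SELECT id FROM t ORDER BY id")

# Row-integrity check in the spirit of the large result set tests.
ids = [row[0] for row in fetch_rows(cur, fetchmany_size=3)]
print(ids == list(range(10)))
```

A plain function that takes the cursor as an argument can be called from any test class, which is what makes the mixin classes unnecessary.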

vikrantpuppala temporarily deployed to azure-prod April 13, 2026 13:00 — with GitHub Actions Inactive

vikrantpuppala requested review from jprakash-db and msrathore-db April 13, 2026 13:01

vikrantpuppala temporarily deployed to azure-prod April 13, 2026 15:57 — with GitHub Actions Inactive

vikrantpuppala commented Apr 13, 2026

Comment thread .github/workflows/code-coverage.yml

vikrantpuppala commented Apr 13, 2026

Comment thread .github/workflows/daily-telemetry-e2e.yml Outdated

vikrantpuppala changed the title ~~Optimize CI: consolidate workflows, fix caching, speed up e2e tests~~ Optimize CI: consolidate workflows, fix caching, speed up e2e tests (47min → 15min) Apr 13, 2026

vikrantpuppala requested a review from samikshya-db April 13, 2026 16:23

vikrantpuppala commented Apr 13, 2026

Comment thread .github/workflows/integration.yml Outdated

jprakash-db approved these changes Apr 14, 2026

Comment thread tests/e2e/common/large_queries_mixin.py Outdated

Comment thread tests/e2e/test_driver.py Outdated

Comment thread tests/e2e/common/large_queries_mixin.py Outdated

Comment thread tests/e2e/test_driver.py Outdated

vikrantpuppala temporarily deployed to azure-prod April 14, 2026 06:03 — with GitHub Actions Inactive

vikrantpuppala enabled auto-merge (squash) April 14, 2026 06:20

vikrantpuppala merged commit c46b3a0 into main Apr 14, 2026
34 checks passed

vikrantpuppala merged 3 commits into main from ci/optimize-e2e-and-coverage-v2
