Skip to content

[test] Fix flaky table cleanup #17652

Closed
xiangfu0 wants to merge 1 commit intoapache:masterfrom
xiangfu0:codex/fix-table-tasks-cleanup-flaky
Closed

[test] Fix flaky table cleanup #17652
xiangfu0 wants to merge 1 commit intoapache:masterfrom
xiangfu0:codex/fix-table-tasks-cleanup-flaky

Conversation

@xiangfu0
Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 commented Feb 6, 2026

Problem

Two flaky test failures were observed in CI:

  1. PinotTableRestletResourceTest#testTableTasksCleanupWithNonActiveTasks
  • Intermittent 500 with Failed to delete job ... from queue ... during table deletion.
  1. TruncateDecimalTransformFunctionTest#testTruncateDecimalTransformFunction
  • Intermittent assertion mismatch expected [0.0] but found [-0.0].

Changes

1) Controller test race fix

  • In testTableTasksCleanupWithNonActiveTasks, keep the minion task queue stopped while table deletion runs.
  • Resume queue in a finally block to avoid leaking queue state to other tests.

2) Truncate signed-zero fix

  • In TruncateDecimalTransformFunction, changed the single-argument path (truncate(value)) to use the same truncation logic as truncate(value, 0) via BigDecimal.setScale(0, RoundingMode.DOWN).
  • Added explicit regression coverage in TruncateDecimalTransformFunctionTest for truncate(-0.4) to guarantee canonical 0.0 output (not -0.0).

Why this fixes flakiness

  • The first issue removes a task-state transition race during cleanup.
  • The second issue removes a signed-zero inconsistency in output formatting and aligns truncate(value) with truncate(value, 0) semantics.

Validation

  • ./mvnw -pl pinot-controller -Dtest=PinotTableRestletResourceTest#testTableTasksCleanupWithNonActiveTasks -Dsurefire.failIfNoSpecifiedTests=false test -DskipITs -DskipIntegrationTests
  • Repeated runs of the same controller test (9 consecutive successful loop iterations + 1 additional pass)
  • ./mvnw -pl pinot-controller -Dtest=PinotTableRestletResourceTest#testTableTasksCleanupWithActiveTasks -Dsurefire.failIfNoSpecifiedTests=false test -DskipITs -DskipIntegrationTests
  • ./mvnw -pl pinot-controller -Dtest=PinotTableRestletResourceTest -Dsurefire.failIfNoSpecifiedTests=false test -DskipITs -DskipIntegrationTests
  • ./mvnw -pl pinot-core -Dtest=TruncateDecimalTransformFunctionTest -Dsurefire.failIfNoSpecifiedTests=false test -DskipITs -DskipIntegrationTests
  • Repeated runs of TruncateDecimalTransformFunctionTest#testTruncateDecimalTransformFunction (5 consecutive successful iterations)

@xiangfu0 xiangfu0 requested a review from Copilot February 6, 2026 11:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a race condition in PinotTableRestletResourceTest#testTableTasksCleanupWithNonActiveTasks that caused flaky test failures. The issue occurred when the minion task queue was resumed before table deletion completed, allowing concurrent task transitions to interfere with cleanup operations.

Changes:

  • Move task queue resume logic from before table deletion to after deletion in a finally block
  • Add explanatory comments describing why the queue must remain stopped during deletion
  • Ensure queue resume happens even if deletion fails, preventing test interference

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Feb 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.24%. Comparing base (23c9a59) to head (d2fcb1e).

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #17652   +/-   ##
=========================================
  Coverage     63.24%   63.24%           
  Complexity     1466     1466           
=========================================
  Files          3190     3190           
  Lines        191963   191963           
  Branches      29410    29410           
=========================================
+ Hits         121400   121401    +1     
+ Misses        61059    61052    -7     
- Partials       9504     9510    +6     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.21% <ø> (+<0.01%) ⬆️
java-21 63.20% <ø> (-0.01%) ⬇️
temurin 63.24% <ø> (+<0.01%) ⬆️
unittests 63.23% <ø> (+<0.01%) ⬆️
unittests1 55.57% <ø> (-0.01%) ⬇️
unittests2 34.23% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@xiangfu0 xiangfu0 changed the title [controller] Fix flaky PinotTableRestletResourceTest cleanup race [test] Fix flaky table cleanup and truncate signed-zero tests Feb 6, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

@xiangfu0 xiangfu0 added the flaky-test Tracks a test that intermittently fails label Feb 6, 2026
@xiangfu0 xiangfu0 force-pushed the codex/fix-table-tasks-cleanup-flaky branch from 3d37641 to e3b021a Compare February 7, 2026 21:13
@xiangfu0
Copy link
Copy Markdown
Contributor Author

xiangfu0 commented Feb 7, 2026

Addressed another flake from CI in TableRebalancePauselessIntegrationTest: avoid reusing extra server IDs/data directories across data-provider invocations (and avoid overlapping the baseline server ID).

Validation:

  • ./mvnw -Ppinot-fastdev -pl pinot-integration-tests -am -Dtest=TableRebalancePauselessIntegrationTest -Dsurefire.failIfNoSpecifiedTests=false test
  • Result: 2 tests run, 0 failures, 0 errors.

waitForTaskState(taskName, TaskState.IN_PROGRESS);

// stop the task queue to abort the task
// Stop the task queue to abort the task. Keep it stopped until table deletion is complete to avoid
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test capturing a race condition?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^^ Is this capturing a valid race condition in production code?

} else {
truncated = 0.0d;
}
// Normalize -0.0 to +0.0 for deterministic output and test stability.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? This is an actual production fix, suggest putting it in a separate PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will separate

xiangfu0 added a commit to xiangfu0/pinot that referenced this pull request Feb 11, 2026
Extracted from apache#17652 as a standalone production fix.

- Use allocation-free math truncation in no-scale path
- Preserve NaN/+Inf/-Inf behavior for no-scale truncate
- Normalize -0.0 to canonical +0.0 for deterministic output
- Add regression coverage for signed-zero bit pattern and non-finite values
@xiangfu0
Copy link
Copy Markdown
Contributor Author

Split out the production truncate(value) fix into a dedicated PR: #17677.

That PR includes the non-finite handling guard, allocation-free no-scale truncation path, signed-zero normalization, and expanded unit coverage.

xiangfu0 added a commit to xiangfu0/pinot that referenced this pull request Feb 11, 2026
Keep apache#17652 focused on flaky-test fixes only; production truncate(value) fix is now tracked in apache#17677.
@xiangfu0
Copy link
Copy Markdown
Contributor Author

Updated this PR to remove all net pinot-core truncate changes; it now only contains flaky-test fixes in controller/integration tests.

The production truncate(value) fix remains in dedicated PR #17677.

Validation rerun:

  • ./mvnw -pl pinot-controller -am -Dtest=PinotTableRestletResourceTest#testTableTasksCleanupWithNonActiveTasks -Dsurefire.failIfNoSpecifiedTests=false test
  • ./mvnw -Ppinot-fastdev -pl pinot-integration-tests -am -Dtest=TableRebalancePauselessIntegrationTest -Dsurefire.failIfNoSpecifiedTests=false test

@xiangfu0 xiangfu0 changed the title [test] Fix flaky table cleanup and truncate signed-zero tests [test] Fix flaky table cleanup Feb 14, 2026
@xiangfu0 xiangfu0 force-pushed the codex/fix-table-tasks-cleanup-flaky branch from 7e9c12f to d2fcb1e Compare March 9, 2026 04:12
@xiangfu0 xiangfu0 added testing Related to tests or test infrastructure cleanup Code cleanup or removal of dead code bug Something is not working as expected labels Mar 20, 2026
@xiangfu0 xiangfu0 closed this Apr 6, 2026
@xiangfu0 xiangfu0 deleted the codex/fix-table-tasks-cleanup-flaky branch April 6, 2026 03:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something is not working as expected cleanup Code cleanup or removal of dead code flaky-test Tracks a test that intermittently fails testing Related to tests or test infrastructure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants