OSS test fix#436

Open
pbhandar2 wants to merge 1 commit into main from export-D97184057

Conversation

@pbhandar2
Contributor

Summary:
Fix NvmCacheTest.EvictToNvmGet failing in OSS CI.

```
/home/runner/work/CacheLib/CacheLib/cachelib/allocator/nvmcache/tests/NvmCacheTests.cpp:252: Failure
Expected equality of these values:
  0
  nvm.getNumActiveHandles()
    Which is: 1

I0318 19:45:43.067236 184036 Driver.cpp:84] Driver: finish scheduler
I0318 19:45:43.067307 184036 Driver.cpp:86] Driver: finish scheduler successful
I0318 19:45:43.067613 184036 ThreadPoolJobScheduler.cpp:86] JobScheduler: join threads for reader_pool
I0318 19:45:43.069042 184036 ThreadPoolJobScheduler.cpp:90] JobScheduler: join threads for reader_pool successful
I0318 19:45:43.072589 184036 ThreadPoolJobScheduler.cpp:86] JobScheduler: join threads for writer_pool
I0318 19:45:43.074025 184036 ThreadPoolJobScheduler.cpp:90] JobScheduler: join threads for writer_pool successful
[  FAILED  ] NvmCacheTest.EvictToNvmGet (433 ms)
[----------] 1 test from NvmCacheTest (433 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (433 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NvmCacheTest.EvictToNvmGet

 1 FAILED TEST
```

Two bugs:

1. **Unsigned underflow in loop index.** The loop uses the `i-- > 0` idiom, which already decrements `i` before the body runs (values `nKeys + 99` down to 0). The extra `index = i - 1` double-decremented, so on the last iteration (`i == 0`), `index = (unsigned)0 - 1 = UINT_MAX`. This created a spurious lookup for a nonsensical key and skipped `key1123`.
   * **Fix**: `index = i`.

2. **Race between the async NVM callback and the handle-count check.** When a key is fetched from NVM, the async callback runs `onGetComplete` → `removeFromFillMap` → `~GetCtx` → `wakeUpWaiters` → `baton_.post()`. The baton wakes the test thread, but `GetCtx::it` (the fill handle) is destroyed after the baton post, still inside the same callback. Until that destruction completes, the async thread's handle count is +1. If the test thread reaches `getNumActiveHandles()` before the callback fully returns, it observes the stale +1.
   * **Fix**: call `flushNvmCache()` before checking handle counts. This calls `scheduler_->finish()`, which busy-waits until all Navy jobs (including the callback that destroys `GetCtx::it`) have fully completed.

Reviewed By: byahn0996

Differential Revision: D97184057

@meta-cla bot added the CLA Signed label Mar 20, 2026
@meta-codesync

meta-codesync bot commented Mar 20, 2026

@pbhandar2 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97184057.
