Updates #1

Open
autonull wants to merge 1207 commits into deepstupid:multilevel_research from ben-manes:master
Conversation

@autonull
No description provided.

The lambda declared (k, executor) but applied the captured outer key
instead of k. Functionally identical, but k should be used directly.
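
A minimal sketch of the shape of the fix, with hypothetical names:

  import java.util.concurrent.CompletableFuture;
  import java.util.concurrent.Executor;
  import java.util.function.BiFunction;

  class ShadowedKeyExample {
    static CompletableFuture<String> load(String key, Executor executor) {
      return CompletableFuture.supplyAsync(() -> key, executor);
    }

    static BiFunction<String, Executor, CompletableFuture<String>> loader(String key) {
      // Before: the body captured the outer `key` even though the lambda
      // declared its own parameter `k`; harmless here because the lambda
      // is only applied to that same key, but misleading to readers.
      return (k, executor) -> load(key, executor);
      // After: return (k, executor) -> load(k, executor);
    }
  }
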
Fix advance() to correctly expire timers when nanoTime() crosses from
negative to zero by using separate variables for delta computation
(compensated) and bucket scanning (raw), and changing the compensation
condition from > 0 to >= 0. Fix expire() to clamp delta in long space
before narrowing to int, preventing silent no-op on large deltas. Fix
getExpirationDelay() to clamp async computing entries' negative age to
zero, preventing ~220-year scheduler delays (aedile#49).
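
A hedged sketch of the two clamps, with illustrative names (the real
TimerWheel logic also handles the compensated/raw split):

  class ExpireDelayExample {
    // Clamp in long space before narrowing, so a huge positive delta
    // saturates instead of wrapping to a negative int and becoming a no-op.
    static int toIntDelay(long delayNanos) {
      return (int) Math.min(delayNanos, Integer.MAX_VALUE);
    }

    // Clamp a negative age (an async computing entry whose timestamp sits
    // in the future) to zero rather than letting it reach the scheduler
    // as an absurdly long delay.
    static long toNonNegativeAge(long ageNanos) {
      return Math.max(ageNanos, 0L);
    }
  }
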
Policy methods (weightOf, ageOf, getExpiresAfter, setExpiresAfter)
now check node.getValue() == null before returning metadata, consistent
with containsKey's handling of GC-collected weak/soft value entries.
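
A hedged sketch of the check, using an illustrative Node shape:

  import java.time.Duration;
  import java.util.Optional;

  class PolicyMetadataExample {
    interface Node { Object getValue(); long getAccessTime(); }

    // Treat an entry whose weak/soft value was collected as absent,
    // mirroring containsKey, instead of reporting metadata for a
    // logically removed entry.
    static Optional<Duration> ageOf(Node node, long now) {
      if ((node == null) || (node.getValue() == null)) {
        return Optional.empty();
      }
      return Optional.of(Duration.ofNanos(now - node.getAccessTime()));
    }
  }
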
The async branch wrapped the future-resolution work in
whenCompleteAsync(..., executor()) and thenAcceptAsync(..., executor()),
so the outer comparison stage was submitted to the executor directly.
If the executor rejected the continuation (shutdown, bounded queue
saturated), the dependent stage completed exceptionally but nobody
observed the rejection — the REPLACED notification was silently lost.

Run the nv/ov comparisons synchronously on the future-completer thread
and let notifyRemoval perform the single executor submission. It
already falls back to running the listener inline if the executor
rejects, matching every other notifyRemoval call site. The comparison
itself is a pointer check, so the thread it runs on is irrelevant.
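
A minimal sketch of the submit-with-inline-fallback pattern the fix relies
on (names illustrative):

  import java.util.concurrent.Executor;
  import java.util.concurrent.RejectedExecutionException;

  class NotifyRemovalExample {
    // The single executor hand-off happens here, so a rejection is
    // observed and handled rather than completing an unobserved
    // dependent stage exceptionally.
    static void notifyRemoval(Executor executor, Runnable listenerTask) {
      try {
        executor.execute(listenerTask);
      } catch (RejectedExecutionException e) {
        listenerTask.run();  // deliver the notification inline
      }
    }
  }
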
The sync views (CacheView, LoadingCacheView) lacked writeReplace and
readObject protection, unlike the 8 concrete cache types. A crafted
byte stream could bypass writeReplace and directly deserialize a view
with a null or malicious asyncCache field.

Add SyncViewProxy that wraps the async cache and recreates the sync
view via asyncCache.synchronous() on deserialization. Add readObject
to AbstractCacheView that rejects direct deserialization attempts.
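
A hedged sketch of the proxy pattern (the real classes recreate the view
via asyncCache.synchronous(); the shapes here are simplified):

  import java.io.InvalidObjectException;
  import java.io.ObjectInputStream;
  import java.io.Serializable;

  class SerializationProxyExample {
    static final class SyncViewProxy implements Serializable {
      private static final long serialVersionUID = 1L;
      final Serializable asyncCache;
      SyncViewProxy(Serializable asyncCache) { this.asyncCache = asyncCache; }
      Object readResolve() {
        // Recreate the sync view from the deserialized async cache,
        // e.g. asyncCache.synchronous() in the real code.
        return asyncCache;
      }
    }

    abstract static class AbstractCacheView implements Serializable {
      final Serializable asyncCache;
      AbstractCacheView(Serializable asyncCache) { this.asyncCache = asyncCache; }
      Object writeReplace() { return new SyncViewProxy(asyncCache); }
      private void readObject(ObjectInputStream in) throws InvalidObjectException {
        throw new InvalidObjectException("Proxy required");
      }
    }
  }
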
Port the upstream JCTools fix (d64b9b44) for MpscGrowableArrayQueue.
If buffer allocation fails after the producer index is set to odd
(signaling resize-in-progress), restore it to the original even value
so other producers can resume instead of spinning indefinitely.
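
A hedged sketch of the rollback (the real MpscGrowableArrayQueue logic is
considerably more involved):

  class ResizeRollbackExample {
    volatile long producerIndex;

    // The producer index is set to an odd value to signal a resize in
    // progress; if allocation then throws, restore the even value so
    // other producers can resume instead of spinning indefinitely.
    void resize(long pIndex, int nextCapacity) {
      producerIndex = pIndex + 1;  // odd: resize in progress
      try {
        Object[] nextBuffer = allocate(nextCapacity);
        // ... link nextBuffer, publish the element, advance the index ...
      } catch (OutOfMemoryError e) {
        producerIndex = pIndex;  // restore the original even value
        throw e;
      }
    }

    // Non-static so a test can override it to simulate OOME during resize.
    Object[] allocate(int capacity) {
      return new Object[capacity];
    }
  }
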
- Clamp ceilingPowerOfTwo(int) to 2^30 for inputs exceeding the
  max signed power of two, avoiding a negative return value (see the
  sketch after this list)
- Add inline fallback to AsyncRemovalListener when the executor
  rejects, matching the sync notifyRemoval pattern
- Make MpscGrowableArrayQueue.allocate non-static so tests can
  override it to simulate OOME during resize
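
A hedged sketch of the first item's clamp:

  class PowerOfTwoExample {
    // For positive x: inputs above 2^30 would otherwise shift into the
    // sign bit and produce a negative "power of two".
    static int ceilingPowerOfTwo(int x) {
      int max = 1 << 30;  // largest positive power of two in an int
      return (x >= max) ? max : 1 << -Integer.numberOfLeadingZeros(x - 1);
    }
  }
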
- Add -Pasync[=format] to the JMH Gradle plugin. ap-loader 4.3-13 provides
  the native library cross-platform (macOS x64/arm64, Linux x64/arm64); the
  extraction directory is redirected into the build dir to avoid macOS TCC.
  Only one output format is requested at a time because JMH's async-profiler
  integration drops all samples when multiple are combined.
- Add .github/actions/run-benchmark: runs JMH, uploads results to the gist,
  and (when async-profiler is supported) emits a CPU flame graph SVG via
  FlameGraph.pl into the step summary and as an artifact. Skips the profiler
  on OpenJ9 (Semeru) since async-profiler relies on HotSpot-specific APIs.
- Collapse the six per-benchmark workflow jobs into one matrixed job over
  benchmark x JDK (36 cells); publish now depends on a single job.

The int overload defensively clamps at 1 << 30 but the long overload
silently overflowed to Long.MIN_VALUE for inputs above 1L << 62 (including
Long.MAX_VALUE). Currently unreachable in production since the long
overload is only called with small compile-time constants, but the
asymmetry with the int overload is a latent hazard for future callers.
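
A sketch of the symmetric long clamp, mirroring the int overload above:

  class LongPowerOfTwoExample {
    // Without the clamp, inputs above 1L << 62 (including Long.MAX_VALUE)
    // shift into the sign bit and return Long.MIN_VALUE.
    static long ceilingPowerOfTwo(long x) {
      long max = 1L << 62;  // largest positive power of two in a long
      return (x >= max) ? max : 1L << -Long.numberOfLeadingZeros(x - 1);
    }
  }
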
The inline `argumentProviders.add { ... }` lambda embedded the absolute
output-directory path in the returned arguments, which leaks into the
task's cache key and breaks build-cache relocatability across workspaces.
Replace with a typed `CommandLineArgumentProvider` whose `resultsDir`
field is `@get:Internal`, matching the `codeGenerationTask` pattern and
the project rule (PR #1947).
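
A sketch of the provider shape, written here in Java against Gradle's
public API (the flag names are illustrative, not the project's):

  import java.io.File;
  import java.util.List;
  import org.gradle.api.tasks.Internal;
  import org.gradle.process.CommandLineArgumentProvider;

  class ResultsDirProvider implements CommandLineArgumentProvider {
    private final File resultsDir;

    ResultsDirProvider(File resultsDir) {
      this.resultsDir = resultsDir;
    }

    @Internal  // not an input, so the absolute path stays out of the cache key
    public File getResultsDir() {
      return resultsDir;
    }

    @Override public Iterable<String> asArguments() {
      // Computed at execution time rather than snapshotted as an input,
      // keeping the task relocatable across workspaces.
      return List.of("--results-dir", resultsDir.getAbsolutePath());
    }
  }
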
AbstractCacheView.readObject rejected direct deserialization, but a
crafted byte stream whose class-descriptor chain omits AbstractCacheView
(via TC_NULL superClassDesc) bypassed the guard: Java neither invoked
readObject on the missing ancestor nor readObjectNoData (which was not
declared), leaving the concrete CacheView / LoadingCacheView instance
with a stream-controlled asyncCache field.

Declare readObjectNoData on AbstractCacheView so the hierarchy-skip path
also throws. Add byte-stream crafted tests for both concrete subtypes.
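
A hedged sketch of the double guard (simplified shape):

  import java.io.InvalidObjectException;
  import java.io.ObjectInputStream;
  import java.io.Serializable;

  abstract class GuardedViewExample implements Serializable {
    // Rejects an ordinary direct deserialization attempt.
    private void readObject(ObjectInputStream in) throws InvalidObjectException {
      throw new InvalidObjectException("Proxy required");
    }

    // Rejects the hierarchy-skip path: when the stream's descriptor chain
    // omits this ancestor (TC_NULL superClassDesc), Java invokes
    // readObjectNoData instead of readObject, so the guard must exist
    // here as well.
    private void readObjectNoData() throws InvalidObjectException {
      throw new InvalidObjectException("Proxy required");
    }
  }
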
The two-step pre-check `isCompletedExceptionally() || (isDone() && join()
== null)` had a race window: if the future was cancelled between the two
reads, the subsequent join() threw CancellationException out of put()
instead of treating it as a failed completion. The same hazard applied
to obtrudeException-style forced transitions.

Collapse to a single isDone() check that catches the join exception so
any done-but-not-ready state (failure, cancel, or null value) uniformly
removes the mapping without propagating.
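
A hedged sketch of the collapsed check (illustrative method, not the
actual call site):

  import java.util.concurrent.CancellationException;
  import java.util.concurrent.CompletableFuture;
  import java.util.concurrent.CompletionException;

  class DoneCheckExample {
    // One isDone() read with the join exception caught, so failure,
    // cancellation, and a null value are treated uniformly instead of
    // racing two separate state reads.
    static boolean isDoneButNotReady(CompletableFuture<?> future) {
      if (!future.isDone()) {
        return false;
      }
      try {
        return future.join() == null;
      } catch (CompletionException | CancellationException e) {
        return true;  // remove the mapping without propagating
      }
    }
  }
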
doComputeIfAbsent and remap recorded eviction stats (EXPIRED / COLLECTED)
when a user operation discovered an already-evicted entry, but the sibling
paths in put, remove(key), remove(key, value), and removeNode silently
fired notifyEviction + notifyRemoval without bumping the counter. The
asymmetry meant the same logical event was counted on compute paths but
invisible on direct mutation paths.

Capture oldWeight in EvictContext / RemoveContext under synchronized(node)
and record the eviction alongside notifyRemoval when wasEvicted() is true.
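
A hedged sketch of the shared recording step (StatsCounter.recordEviction
is the real API; the surrounding shape is illustrative):

  import com.github.benmanes.caffeine.cache.RemovalCause;
  import com.github.benmanes.caffeine.cache.stats.StatsCounter;

  class EvictionStatsExample {
    // The weight is captured under synchronized(node) by the caller; the
    // stat is then recorded whenever the cause is an eviction, so direct
    // mutation paths count the same logical event as the compute paths.
    static void recordIfEvicted(StatsCounter stats, RemovalCause cause, int oldWeight) {
      if (cause.wasEvicted()) {
        stats.recordEviction(oldWeight, cause);
      }
    }
  }
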
The sibling refresh paths in LocalLoadingCache.refresh and
LocalAsyncLoadingCache.refresh guard their REPLACED notification with
`if (discard[0] && (newValue != null))`. BoundedLocalCache.refreshIfNeeded
had no such guard on its fall-through branch, so a loader returning null
combined with a concurrent put (which removed the refresh from the
refreshes map before the handler ran) fired notifyRemoval(key, null,
REPLACED). Non-null-tolerant RemovalListener implementations NPE on the
null value, which the cache swallows and logs at WARN.

Gate the fall-through `cause[0] = REPLACED` on `value != null` to match
the sibling paths.

If the user's Weigher or Expiry throws inside remap during the refresh
handler's compute, the exception propagated out of refreshIfNeeded's
handle lambda unlogged and without any stats update. The refreshed
future completed exceptionally, the caller saw null, but the
LoadingCache.refresh javadoc's "exception will be logged (using
System.Logger) and swallowed" contract was silently violated.

Wrap the refresh commit in a try/catch that mirrors the sibling
LocalAsyncLoadingCache.refresh path: log at WARNING, record a load
failure, and return null. The entry's previous value remains in place.
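
A hedged sketch of the guarded commit (hypothetical helper; the real code
inlines this in the refresh handler):

  import java.lang.System.Logger;
  import java.lang.System.Logger.Level;
  import java.util.function.Supplier;

  class RefreshCommitExample {
    static final Logger logger = System.getLogger(RefreshCommitExample.class.getName());

    // A throwing Weigher or Expiry inside the compute is logged and
    // counted as a load failure instead of escaping the handler unseen;
    // returning null leaves the previous value mapped.
    static <V> V commitRefresh(Supplier<V> commit, Runnable recordLoadFailure) {
      try {
        return commit.get();
      } catch (RuntimeException e) {
        logger.log(Level.WARNING, "Exception thrown during refresh", e);
        recordLoadFailure.run();
        return null;
      }
    }
  }
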
The prior note described keyReference as "set once during construction
and never mutated", which is inaccurate: retire() and die() on generated
weak/soft-value nodes write sentinels into the field under
synchronized(node). The note also attributed the storeStoreFence's role
to publishing the newly-constructed reference's non-final fields;
the actual load-bearing role (per the JCStress IntermittentNull test) is
ordering the old reference's ref.clear() against the setRelease of the
new reference so the getValue re-check loop's invariant holds.

A refresh that is preempted by a concurrent write (a put / remove /
compute that discarded the refresh before its handler ran) falls through
to `return currentValue` in the refresh handler's compute lambda. Returning
the same instance previously caused `remap` to apply the normal update
block: `expireAfterUpdate`, `setWeight`, `setAccessTime`, `setVariableTime`,
and — when `exceedsWriteTimeTolerance` holds for slow refreshes —
`setWriteTime`. The entry's expireAfterWrite countdown effectively
restarted at the refresh completion time rather than at the user's put
time, shifting expiration by up to the refresh duration.

Add a package-private `LocalCache.compute` overload that accepts a
`boolean[] preserveTimestamps` reference. The remapping lambda sets it
to `true` on branches where the same-instance return represents a
no-op (not a legitimate same-instance replace). `BoundedLocalCache.remap`
short-circuits after the lambda returns when the flag is set, the new
value is the same instance as the old, and no eviction cause was set,
skipping all metadata updates and the post-compute afterWrite/afterRead
dispatch. `UnboundedLocalCache` ignores the flag (no timestamps to
preserve). `refreshes.remove` remains inside the compute lambda, so the
single-flight invariant is preserved (CHM bin lock still serializes the
claim with the ABA check).

Adopted in the three refresh handlers: `BoundedLocalCache.refreshIfNeeded`,
`LocalLoadingCache.refresh`, and `LocalAsyncLoadingCache.tryComputeRefresh`.
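
A hedged sketch of the short-circuit inside `remap` (illustrative shape;
the real method threads the flag through the generated node types):

  import java.util.function.BiFunction;
  import com.github.benmanes.caffeine.cache.RemovalCause;

  class PreserveTimestampsExample {
    // When the lambda flagged the same-instance return as a no-op and no
    // eviction cause was set, skip every metadata update (weight, access/
    // write/variable time) and the afterWrite/afterRead dispatch.
    static <K, V> V remap(K key, V oldValue, BiFunction<K, V, V> remapping,
        boolean[] preserveTimestamps, RemovalCause[] cause) {
      V newValue = remapping.apply(key, oldValue);
      if (preserveTimestamps[0] && (newValue == oldValue) && (cause[0] == null)) {
        return newValue;  // true no-op: leave the entry untouched
      }
      // ... normal update path: setWeight, setAccessTime, setWriteTime ...
      return newValue;
    }
  }
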
The 0L value doubles as the "unscheduled/cancelled" sentinel and as a
possible output of (now + TOLERANCE) or scheduleAt when the ticker lands
on exactly that value. Bump the result to 1L in that case so the recursion
guard in schedule() stays correct even under an immediate scheduler.
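
A minimal sketch of the bump (TOLERANCE value illustrative):

  class SentinelBumpExample {
    static final long TOLERANCE = 1_000_000_000L;

    // 0L is reserved as the unscheduled/cancelled sentinel, so a computed
    // time that lands exactly on 0L is nudged to 1L.
    static long scheduleTime(long now) {
      long next = now + TOLERANCE;
      return (next == 0L) ? 1L : next;
    }
  }
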
PMD and SpotBugs both emit per-source-set JSON/SARIF files named by
source set (main.json, test.json, etc.), not by module. The flat
`find ... -exec cp {} <dir>/ \;` clobbered same-basename files across
caffeine/guava/jcache/simulator, so only the last-visited module's
findings survived the merge into the Code Scanning upload. Prefix the
destination filename with the module so every module's results are
preserved.

Three defensive branches in BoundedLocalCache were uncovered because
their preconditions can't arise through the normal refresh protocol:

  1. In the refresh completion lambda, the `removed && (currentValue
     == oldValue)` guard at L1401 never sees `removed==true` with a
     mismatched currentValue — concurrent writes always drop the
     refresh token under the same CHM bin lock. Exercise it via
     `replace(..., shouldDiscardRefresh=false)` with a different new
     value.

  2. In `remap`, the `preserveTimestamps[0]` no-op short-circuit also
     checks `newValue == oldValue` and `cause == null`. Production
     callers couple the flag to returning `currentValue` with a null
     cause, so those fallback branches are unreachable. Cover them via
     the package-level `compute(..., preserveTimestamps)` interface
     with a pre-set flag and (a) a mismatched return value, (b) an
     expired entry where the lambda returns the captured old value.

Align the Policy setter with Caffeine.refreshAfterWrite, which has always
enforced duration > 0. Zero-duration refresh is pathological (every read
becomes refresh-eligible) and the asymmetry was unintentional.

Every getIfPresent on an expireAfterAccess cache wrote Node.accessTime
opaquely, and every read on an Expiry cache CAS'd Node.variableTime via
tryExpireAfterRead. Under concurrent reads of the same hot entry, every
write transitioned the Node's cache line from S to M in MESI,
invalidating the line on all other cores and forcing a refetch on the
next read. Opaque and CAS modes avoid the writer's fence but not the
coherence traffic.

Guard setAccessTime and tryExpireAfterRead with a tolerance check: skip
the store if the new value is within EXPIRE_TOLERANCE (1s) of the stored
one. Math.abs is used so the async-completion path (timestamps set to
now + ASYNC_EXPIRY during in-flight loads) still fires when the sentinel
is replaced. Tolerance is bypassed when the configured or per-entry
duration is <= tolerance so sub-second expiration windows still behave
exactly. Trade-off matches the pre-existing writeTime tolerance: entries
may expire up to 1s earlier than the configured duration, never later.

Renamed EXPIRE_WRITE_TOLERANCE to EXPIRE_TOLERANCE to reflect the
broader scope.

HotEntryBenchmark pins every thread to the same key. On Linux x86
(GHA, 4 threads, JDK 25) expireAccess_hot rises from 18.6M to 49.7M
ops/s (2.67x) after the change, matching expireAccess_spread. The
cache-line true-sharing cost is eliminated. ARM shows the same pattern.
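
A hedged sketch of the guard (illustrative Node shape; the real check also
covers tryExpireAfterRead's CAS path):

  class ToleranceGuardExample {
    static final long EXPIRE_TOLERANCE = 1_000_000_000L;  // 1s in nanos

    interface Node { long getAccessTime(); void setAccessTime(long nanos); }

    // Skip the store when the stored timestamp is already within
    // tolerance, avoiding the S-to-M cache-line transition on every read
    // of a hot entry. Math.abs keeps the async sentinel (now +
    // ASYNC_EXPIRY) replaceable, and the tolerance is bypassed for
    // sub-second expiration windows.
    static void setAccessTimeIfNeeded(Node node, long now, long durationNanos) {
      if ((durationNanos <= EXPIRE_TOLERANCE)
          || (Math.abs(now - node.getAccessTime()) > EXPIRE_TOLERANCE)) {
        node.setAccessTime(now);
      }
    }
  }
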
A recent audit caught that put/remove/clear paths weren't recording
eviction stats when the old entry had already been evicted (fixed in
5bb78f0). The bug survived because the ~100 parameterized tests that
already asserted on the eviction notification via

  assertThat(context).notifications().withCause(EXPIRED|COLLECTED|SIZE)
      .contains(...).exclusively();

all skipped the sibling

  assertThat(context).stats().evictions(N).evictionWeight(W);

so only a small handful of tests (~9 across EvictionTest / ReferenceTest /
BoundedLocalCacheTest) exercised the stats counters for eviction-cause
removals. The asymmetry was silent — the cause-based notification and the
stats counters were three observations of the same logical event and
nothing forced them to agree.

Fold both stats checks into the Exclusive.exclusively() subject: when the
cause wasEvicted(), assert evictionCount == entries.length AND
evictionWeight == sum of context.weigher().weigh(k, v) over the entries.
Null values (which the listener may receive on COLLECTED with cleared
weak/soft refs) are treated as weight 0 to avoid NPE in user weighers.

Also fold the symmetric "no evictions" check into ListenerSubject.isEmpty()
and ListenerSubject.hasNoEvictions(): both paths now also assert
evictions(0).evictionWeight(0).

When stats recording is disabled for a parameterization,
StatsSubject.awaitStatistic short-circuits to a no-op, so no-stats runs
are unaffected. This closes the gap for every existing eviction-cause
assertion and prevents future tests from reproducing the same blind spot.

The eviction-recording paths were split between two orderings:
- record-then-notify: evictEntry, doComputeIfAbsent's compute lambda, remap
- notify-then-record: removeNode, put-on-expired/collected,
  remove(key), remove(key, value)

For an executor(Runnable::run) cache with a synchronous RemovalListener that
reads cache.stats(), the same logical eviction event would surface with
evictionCount already incremented at some sites and not yet at others. Not
a documented contract — stats observability inside a listener isn't
guaranteed — but the asymmetry is gratuitous and was caught by review.

Flip the four notify-then-record sites to record-then-notify so every
eviction-recording path uses the same order. Also drop a stray double space
in refreshIfNeeded's containsKey call.