Fix DevTools command completion race #533

Open
OLSander wants to merge 1 commit into chromiumembedded:master from OLSander:fix/devtools-timeout-hardening-v3


Conversation

@OLSander
Contributor

Commit 7ac5c1f fixed a queuedCommands_ memory leak by adding removeQueuedCommand(messageId) at the end of onDevToolsMethodResult. Unfortunately this introduced a race condition that causes DevTools calls to hang indefinitely (or, if the caller chains orTimeout, to complete exceptionally with a TimeoutException).

The race

executeDevToolsMethod chains two async steps:

  1. Submit the DevTools command to CEF → receive the assigned messageId back via a callback
  2. Look up or create a CompletableFuture<String> in queuedCommands_ for that messageId (via thenCompose)

The leak fix assumed that by the time the observer's onDevToolsMethodResult fires, step 2 has already completed and the queued future is bound to the caller. That assumption holds in the common case, but it is not guaranteed. The two callbacks — the message-id callback and the observer result callback — cross the JNI boundary independently and can be scheduled in either order.
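getQueuedCommand itself is not part of the diff; the names, types, and computeIfAbsent-based implementation below are assumptions, sketched only to make the lookup-or-create behavior that both sides rely on concrete:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class DevToolsClientSketch {
    // Maps a DevTools message id to the future the caller will eventually hold.
    private final ConcurrentMap<Integer, CompletableFuture<String>> queuedCommands_ =
            new ConcurrentHashMap<>();

    // Look up the future for a message id, creating it if neither side has yet.
    // Both the consumer (step 2) and the observer call this, so whichever runs
    // first creates the entry and the other side is expected to find it.
    CompletableFuture<String> getQueuedCommand(int messageId) {
        return queuedCommands_.computeIfAbsent(messageId, id -> new CompletableFuture<>());
    }
}
```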

When the result arrives before step 2 runs:

  • Observer fires, calls getQueuedCommand(messageId) → creates a new CompletableFuture F1, completes it, removes it from the map
  • Step 2 runs, calls getQueuedCommand(messageId) → map entry is gone, creates a new CompletableFuture F2
  • F2 is what the caller holds; it is never completed → indefinite hang
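The bad ordering can be reproduced deterministically without JNI involvement. This is a minimal self-contained sketch (getQueuedCommand and the map are reconstructed assumptions, not the actual JCEF code) that runs the observer's steps before the consumer's:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class RaceDemo {
    static final ConcurrentMap<Integer, CompletableFuture<String>> queuedCommands_ =
            new ConcurrentHashMap<>();

    static CompletableFuture<String> getQueuedCommand(int messageId) {
        return queuedCommands_.computeIfAbsent(messageId, id -> new CompletableFuture<>());
    }

    public static void main(String[] args) {
        int messageId = 42;

        // Observer fires first: creates F1, completes it, removes it (the leak fix).
        CompletableFuture<String> f1 = getQueuedCommand(messageId);
        f1.complete("{\"result\":{}}");
        queuedCommands_.remove(messageId);

        // Consumer (step 2) runs afterwards: the entry is gone, so it gets a fresh F2.
        CompletableFuture<String> f2 = getQueuedCommand(messageId);

        System.out.println(f1 == f2);     // false: caller holds a different future
        System.out.println(f2.isDone());  // false: nothing will ever complete F2
    }
}
```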

This was confirmed with an instrumentation build that added lifecycle logging across the Java/native boundary. The observed log sequence for a reproducing Network.setCacheDisabled call matched the scenario above exactly, including a warning "DevTools result arrived before consumer bound queued future."

Fix

Remove the observer-side cleanup. Instead, clean up from the consumer side — specifically from the whenComplete attached in executeDevToolsMethod, using the two-argument Map.remove(key, value) overload to avoid accidentally deleting a newer mapping for a reused message id:

return browser_.executeDevToolsMethod(method, parametersAsJson).thenCompose(messageId -> {
    // Bind to (or create) the queued future before any cleanup can run.
    CompletableFuture<String> queuedCommand = getQueuedCommand(messageId);
    // Remove the mapping only if it still points at this exact future instance.
    return queuedCommand.whenComplete(
            (result, throwable) -> queuedCommands_.remove(messageId, queuedCommand));
});

This ensures cleanup can only happen after the consumer has bound to the specific future instance, eliminating the race while still preventing the original leak.
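The two-argument overload is what makes the cleanup safe against id reuse: Map.remove(key, value) deletes the entry only if the key is currently mapped to that exact value. A small illustration of the contract (plain strings stand in for the futures):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ConditionalRemoveDemo {
    public static void main(String[] args) {
        ConcurrentMap<Integer, String> map = new ConcurrentHashMap<>();
        map.put(42, "old-future");

        // Suppose the message id was reused and now maps to a newer value.
        map.put(42, "new-future");

        // One-arg remove(42) would delete whatever is there, including the
        // newer mapping. Two-arg remove only deletes the value we expect:
        boolean removedStale = map.remove(42, "old-future");

        System.out.println(removedStale);  // false: stale removal was a no-op
        System.out.println(map.get(42));   // new-future: newer mapping intact
    }
}
```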

@OLSander
Contributor Author

OLSander commented Apr 12, 2026

@magreenblatt can you please review this PR?

Also mentioning @S1artie

We have been running into CDP-related timeouts after updating JCEF. This happens rarely, but for some reason the timeouts are much easier to reproduce on a VM with Windows Server 2022. We flagged commit 7ac5c1f as the most likely cause, but finding conclusive proof took some time.

Part of the problem was that we were close to a release, so we had to postpone the investigation and temporarily downgrade JCEF. Another complication was that the AI agents we consulted all concluded that commit 7ac5c1f could not possibly introduce a race.

After instrumenting JCEF with logging we are now convinced that the earlier AI analysis was flawed; the issue was indeed caused by the leak fix. We verified that the changes in this PR prevent the timeouts.

Flawed AI analysis (provided by a coworker, not sure which model they used):
ai-flawed-analysis

New AI conclusion (GPT-5.4):
ai-new-conclusion
