Cancel outstanding threads prior to closing sqlite connection#4886
crtschin wants to merge 2 commits into haskell:master
Conversation
Force-pushed 7b3b875 to 9b542d3
The shake session holds references to the sqlite connection. When the stop signal is given and the scope is exited, the sqlite connections are closed while outstanding threads may still be using them. This leads to use-after-frees. Ensure the session is shut down before leaving the worker scope.
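The ordering problem can be modelled in miniature: if the worker is scoped with `withAsync` inside the `bracket` that owns the connection, the worker is cancelled before the close runs. A hedged sketch with a toy `Conn` type (all names here are hypothetical stand-ins, not the HLS code):

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (withAsync)
import Control.Exception (bracket)
import Control.Monad (forever)
import Data.IORef

-- Toy stand-in for the sqlite connection; reading it after close is the
-- bug being modelled.
newtype Conn = Conn (IORef Bool)

openConn :: IO Conn
openConn = Conn <$> newIORef True

closeConn :: Conn -> IO ()
closeConn (Conn r) = writeIORef r False

useConn :: Conn -> IO ()
useConn (Conn r) = do
  open <- readIORef r
  if open then threadDelay 1000 else error "use-after-close"

main :: IO ()
main = do
  bracket openConn closeConn $ \conn ->
    -- withAsync cancels the worker when its scope exits, so the worker
    -- cannot outlive the bracket that closes the connection.
    withAsync (forever (useConn conn)) $ \_ -> threadDelay 50000
  putStrLn "clean shutdown"
```

If the worker were instead started with a bare `async` and never cancelled, it could still be running `useConn` after `closeConn` fires, which is the race described above.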
Force-pushed 9b542d3 to 054b254
handleServerExceptionOrShutDown also shuts down; is that redundant?
Ah, missed it, nicely spotted. Somewhat, yes. They do the same thing in shutting down the connection, though that one is definitely too late to stop the users of the connection. I'll take this bit into account as well.
fendor left a comment:
Amazing detective work! CI was green on the first attempt; if this fixes the segmentation faults we have been observing on Windows, this would be a huge win!
I think the idea is correct, but the exact location for the shutdown doesn't feel quite right.
Perhaps we need to cancel the worker threads in runWithWorkerThreads? Then a comment could state the invariant to avoid use-after-free.
```haskell
import UnliftIO.Exception
import qualified Colog.Core as Colog
import Control.Concurrent.Async (cancelMany)
```
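For context on the new import: `cancelMany` (from `async` >= 2.2.3) cancels a batch of `Async`s and waits for each of them to finish before returning. A small standalone sketch of tracking and tearing down a group of workers:

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (async, cancelMany, poll)
import Data.Maybe (isJust)

main :: IO ()
main = do
  -- Track the live worker threads so they can all be torn down at once.
  workers <- mapM (\n -> async (threadDelay (n * 1000000))) [1 .. 5 :: Int]
  -- cancelMany throws AsyncCancelled to every Async and waits for each
  -- of them to finish before returning.
  cancelMany workers
  -- poll returns Just for every worker, since they have all completed.
  done <- mapM poll workers
  putStrLn (if all isJust done then "all workers stopped" else "still running")
```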
```diff
   -- Keep this after putMVar ideMVar ide; otherwise shutdown during
   -- initialization could leave handleInit blocked indefinitely on readMVar.
-  untilReactorStopSignal $ forever $ do
+  untilReactorStopSignal $ flip finally (shutdown ide) $ forever $ do
```
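The change hinges on `finally`'s guarantee: the cleanup runs whether the wrapped action returns normally or is torn down by an exception, which is what `flip finally (shutdown ide)` relies on. A minimal base-only sketch:

```haskell
import Control.Exception (ErrorCall (..), finally, throwIO, try)

main :: IO ()
main = do
  -- `flip finally cleanup action` runs cleanup even when the action is
  -- killed by an exception, mirroring the reviewed change.
  r <- try (flip finally (putStrLn "shutdown ran") (throwIO (ErrorCall "stop")))
  putStrLn $ case (r :: Either ErrorCall ()) of
    Left _  -> "loop interrupted"
    Right _ -> "loop completed"
```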
Perhaps this cancellation should happen in runWithWorkerThreads?
Here it looks a bit ad hoc, and it is non-trivial to understand the relation with the hiedb thread.
Great detective work @crtschin! Some threads (ShakeRestart and Sessionloader) in […] If we cancel the shake session before leaving the scope of the body of […] Perhaps a better idea would be to shut down the shake session right before shutting down the DB threads, but after shutting down the ShakeRestart thread and Sessionloader thread.

PS. I was wondering why the problem only happens on Windows. I was wrong: Sessionloader would be called inside the shake session's threads (rule GhcSession), and Sessionloader would call ShakeRestart, hence it creates cyclic dependencies. It is tricky. Perhaps do it in three folds: first stop Sessionloader's worker, then ShakeRestart's worker, then shut down the shake session, and finally close the queues of Sessionloader and ShakeRestart.
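The three-fold order proposed above can be sketched as plain sequencing, with the final queue closing attached via `finally` so it runs even if an earlier step throws. All names are hypothetical stand-ins for the real HLS machinery:

```haskell
import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (async, cancel)
import Control.Exception (finally)

-- Toy model of the proposed three-step teardown.
main :: IO ()
main = do
  loaderWorker  <- async (threadDelay maxBound) -- Sessionloader worker
  restartWorker <- async (threadDelay maxBound) -- ShakeRestart worker
  let shutdownShake = putStrLn "shake session shut down"
      closeQueues   = putStrLn "queues closed"
  -- 1. stop the loader, 2. stop the restarter, 3. shut down the shake
  -- session; closing the queues is guaranteed to run last via finally.
  (cancel loaderWorker >> cancel restartWorker >> shutdownShake)
    `finally` closeQueues
```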
That's fine, no? Following your and @fendor's suggestion, I'm thinking of shutting down the shake session at the following spot:

```diff
diff --git a/ghcide/src/Development/IDE/LSP/LanguageServer.hs b/ghcide/src/Development/IDE/LSP/LanguageServer.hs
index 72720e302..40d34d2ca 100644
--- a/ghcide/src/Development/IDE/LSP/LanguageServer.hs
+++ b/ghcide/src/Development/IDE/LSP/LanguageServer.hs
@@ -354,6 +354,7 @@ handleInit lifecycleCtx env (TRequestMessage _ _ m params) = otTracedHandler "In
 runWithWorkerThreads :: Recorder (WithPriority Session.Log) -> FilePath -> (WithHieDb -> ThreadQueue -> IO ()) -> IO ()
 runWithWorkerThreads recorder dbLoc f = evalContT $ do
   (WithHieDbShield hiedb, threadQueue) <- runWithDb recorder dbLoc
+  -- Shutdown session here, note that shutdown happens bottom->up
   sessionRestartTQueue <- withWorkerQueueSimple (cmapWithPrio Session.LogSessionWorkerThread recorder) "RestartTQueue"
   sessionLoaderTQueue <- withWorkerQueueSimple (cmapWithPrio Session.LogSessionWorkerThread recorder) "SessionLoaderTQueue"
   liftIO $ f hiedb (ThreadQueue threadQueue sessionRestartTQueue sessionLoaderTQueue)
```

So this'd be after the restart and loader threads have closed down. If these threads are closed down, then they shouldn't be accessing the connection anymore. Restarting also wouldn't occur, because the thread that does it is now finished.
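The "shutdown happens bottom->up" note relies on how bracketed resources compose under `evalContT`: whatever is acquired later in the do-block is released first. A self-contained sketch using only `transformers` and base (the `withResource` helper is invented for illustration):

```haskell
import Control.Exception (bracket)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Cont (ContT (..), evalContT)

-- Wrap bracket so it composes in ContT: the resource is released only
-- after the rest of the do-block (the continuation) has finished.
withResource :: String -> ContT r IO String
withResource name = ContT $ \k ->
  bracket (putStrLn ("acquire " ++ name) >> pure name)
          (\_ -> putStrLn ("release " ++ name))
          k

main :: IO ()
main = evalContT $ do
  _db    <- withResource "hiedb"
  _queue <- withResource "worker queue"
  -- Releases run in reverse acquisition order: "worker queue" first,
  -- then "hiedb"; hence "shutdown happens bottom->up" in such code.
  liftIO (putStrLn "body")
```

So a shutdown step placed earliest in the block runs last on exit, which is why the exact line chosen in the diff above matters.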
Forgot to mention, I'm running into the segfaults locally on my Linux machine when running the following command repeatedly: […]. I assume this occurs more often with the above because I'm running with 6 capabilities, and those tests specifically are very quick, leaving a big potential gap between everything being loaded in and the threads being torn down.
From the investigation in #4884 (comment).
I was repeatedly running the ghcide-tests to debug flakiness, and next to the EOF closed-handle reads, I also encountered segfaults. Investigating the latter further via the coredumps showed some sqlite functions, suggesting use-after-free of the sqlite handle.

The problem here is that the connections are spawned and given to numerous async threads, either to handle the request or via the shake session. These async threads outlive the worker thread lifetimes, leading to racy use-after-frees. Solve this by both keeping track of live threads and cancelling those, and additionally issuing a shake session shutdown prior to leaving. The segfaults disappear with these changes.
Edit: Removed the cancellation of the threads that were handling the request, and only included the shake shutdown prior to leaving scope. This was enough to avoid segfaults in my case (running a small portion of the testsuite 100x repeatedly), and stops the tests on CI from failing.