fix: add thread safety locks to perf tracing module#1166
fix: add thread safety locks to perf tracing module#1166kbhujbal wants to merge 1 commit intogoogle:mainfrom
Conversation
The tracing module uses background threads for async device span callbacks but had no synchronization protecting shared mutable state. This resolves the acknowledged TODO(yangmu) for adding locks to protect Timeline.stack and Timeline._last_span, and also protects PerfTracer's timeline dictionaries and DeviceTimeline's thread list from concurrent access. Adds multi-threaded unit tests to verify concurrent safety of span group operations, device span recording, timeline creation, and the full PerfTracer.span() API from multiple threads.
Summary of ChangesHello @kbhujbal, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the robustness of the performance tracing module by implementing thread-safe access to critical shared data structures. The changes prevent potential race conditions and data corruption that could arise from concurrent operations by background threads, thereby improving the reliability of performance tracing in multi-threaded applications. New tests have been added to validate these concurrency fixes. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
Test plan
|
There was a problem hiding this comment.
Code Review
The pull request significantly enhances the thread safety of the performance tracing module by introducing threading.Lock mechanisms to protect shared mutable states within PerfTracer, Timeline, and DeviceTimeline classes. This directly addresses known concurrency issues, particularly concerning Timeline.stack, Timeline._last_span, and the internal dictionaries and thread lists of PerfTracer and DeviceTimeline. The inclusion of comprehensive multi-threaded unit tests provides strong verification for these critical synchronization changes, ensuring the module's robustness in concurrent environments. This is a crucial improvement for the stability and correctness of the tracing functionality.
|
@jiangyangmu this PR addresses your TODO at tunix/perf/trace.py:250 for adding lock protection to stack and _last_span. The implementation uses fine grained locking at two levels : PerfTracer._timelines_lock for the timeline registry and Timeline._lock for per timeline state, with a consistent acquisition order to avoid deadlocks. Happy to iterate on the approach if you have a different design in mind. Would appreciate a review when you get a chance. |
The tracing module uses background threads for async device span callbacks but had no synchronization protecting shared mutable state. This resolves the acknowledged TODO(yangmu) for adding locks to protect Timeline.stack and Timeline._last_span, and also protects PerfTracer's timeline dictionaries and DeviceTimeline's thread list from concurrent access.
Adds multi-threaded unit tests to verify concurrent safety of span group operations, device span recording, timeline creation, and the full PerfTracer.span() API from multiple threads.
Resolves #<issue_number_goes_here>
Reference
Colab Notebook
Checklist