Skip to content

fix: add thread safety locks to perf tracing module#1166

Open
kbhujbal wants to merge 1 commit intogoogle:mainfrom
kbhujbal:fix/thread-safety-perf-trace
Open

fix: add thread safety locks to perf tracing module#1166
kbhujbal wants to merge 1 commit intogoogle:mainfrom
kbhujbal:fix/thread-safety-perf-trace

Conversation

@kbhujbal
Copy link
Copy Markdown

The tracing module uses background threads for async device span callbacks but had no synchronization protecting shared mutable state. This resolves the acknowledged TODO(yangmu) for adding locks to protect Timeline.stack and Timeline._last_span, and also protects PerfTracer's timeline dictionaries and DeviceTimeline's thread list from concurrent access.

Adds multi-threaded unit tests to verify concurrent safety of span group operations, device span recording, timeline creation, and the full PerfTracer.span() API from multiple threads.

Resolves #<issue_number_goes_here>

It's a good idea to open an issue first for discussion.

Reference

Colab Notebook

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and all unit tests pass.
  • I have added all appropriate doc-strings/documentation.
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have signed the Contributor License Agreement.
  • I have followed Contribution Guidelines.

The tracing module uses background threads for async device span
callbacks but had no synchronization protecting shared mutable state.
This resolves the acknowledged TODO(yangmu) for adding locks to protect
Timeline.stack and Timeline._last_span, and also protects PerfTracer's
timeline dictionaries and DeviceTimeline's thread list from concurrent
access.

Adds multi-threaded unit tests to verify concurrent safety of span
group operations, device span recording, timeline creation, and the
full PerfTracer.span() API from multiple threads.
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @kbhujbal, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of the performance tracing module by implementing thread-safe access to critical shared data structures. The changes prevent potential race conditions and data corruption that could arise from concurrent operations by background threads, thereby improving the reliability of performance tracing in multi-threaded applications. New tests have been added to validate these concurrency fixes.

Highlights

  • Thread Safety Implementation: Introduced threading.Lock mechanisms within PerfTracer and Timeline classes to protect shared mutable state, specifically Timeline.stack, Timeline._last_span, PerfTracer._thread_timelines, PerfTracer._device_timelines, and DeviceTimeline._threads from concurrent access issues.
  • Comprehensive Unit Testing: Added a new ThreadSafetyTest class with multiple multi-threaded unit tests to rigorously verify the concurrent safety of span group operations, device span recording, timeline creation, and the overall PerfTracer.span() API.
  • Resolution of TODOs: Addressed an acknowledged TODO for adding locks to protect Timeline.stack and Timeline._last_span, ensuring data integrity in multi-threaded environments.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • tests/perf/trace_test.py
    • Added ThreadSafetyTest class to house new multi-threaded unit tests.
    • Included test_concurrent_span_group_begin_end to verify timeline stack integrity.
    • Added test_concurrent_device_span to ensure no data loss during concurrent device span recording.
    • Implemented test_concurrent_get_or_create_thread_timeline for safe thread timeline creation.
    • Added test_concurrent_get_or_create_device_timeline for safe device timeline creation.
    • Included test_concurrent_span_from_multiple_threads to validate the PerfTracer.span() API.
    • Added test_device_timeline_wait_pending_spans_thread_safe to test safe handling of pending spans.
  • tunix/perf/trace.py
    • Initialized _timelines_lock in PerfTracer to protect timeline dictionaries.
    • Applied _timelines_lock to _get_timelines, _get_or_create_thread_timeline, and _get_or_create_device_timeline methods.
    • Initialized _lock in Timeline to protect stack and _last_span.
    • Wrapped span_group_begin and span_group_end methods with _lock for thread safety.
    • Protected device_span, thread_span_begin, and thread_span_end methods with _lock.
    • Ensured thread-safe appending to _threads and snapshotting _threads in DeviceTimeline methods using _lock.
Activity
  • No human activity (comments, reviews) has been recorded on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@kbhujbal
Copy link
Copy Markdown
Author

  • Adds threading.Lock to Timeline, PerfTracer, and DeviceTimeline to protect shared mutable state (stack, _last_span, timeline dicts, thread list) from concurrent access
  • Resolves the acknowledged TODO(yangmu): add lock to protect stack and _last_span at tunix/perf/trace.py:250
  • Adds 6 multi-threaded unit tests verifying concurrent safety

Test plan

  • Existing trace_test.py tests pass (single-threaded behavior unchanged)
  • New ThreadSafetyTest class passes with 4-10 concurrent threads
  • No deadlocks (lock ordering: PerfTracer._timelines_lockTimeline._lock)

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request significantly enhances the thread safety of the performance tracing module by introducing threading.Lock mechanisms to protect shared mutable states within PerfTracer, Timeline, and DeviceTimeline classes. This directly addresses known concurrency issues, particularly concerning Timeline.stack, Timeline._last_span, and the internal dictionaries and thread lists of PerfTracer and DeviceTimeline. The inclusion of comprehensive multi-threaded unit tests provides strong verification for these critical synchronization changes, ensuring the module's robustness in concurrent environments. This is a crucial improvement for the stability and correctness of the tracing functionality.

@kbhujbal
Copy link
Copy Markdown
Author

@jiangyangmu this PR addresses your TODO at tunix/perf/trace.py:250 for adding lock protection to stack and _last_span. The implementation uses fine grained locking at two levels : PerfTracer._timelines_lock for the timeline registry and Timeline._lock for per timeline state, with a consistent acquisition order to avoid deadlocks. Happy to iterate on the approach if you have a different design in mind. Would appreciate a review when you get a chance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant