
Feat/detailed coordination window diagnostics#3861

Merged
lionakhnazarov merged 9 commits into threshold-network:main from lionakhnazarov:feat/detailed-coordination-window-diagnostics
Feb 5, 2026


@lionakhnazarov
Collaborator

Detailed Coordination Window Diagnostics

Summary

This PR introduces comprehensive diagnostics and metrics tracking for tBTC coordination windows, significantly enhancing observability into the coordination process. The changes add detailed per-window and per-wallet metrics, improve network diagnostics, and expand performance monitoring capabilities.

The PR adds a metrics tracking system for coordination windows that provides:

  • Per-Window Tracking: Each coordination window is tracked with:

    • Window identification (index, coordination block)
    • Timing information (start time, end time, duration, block ranges)
    • Coordination statistics (wallets coordinated, successful/failed counts)
    • Leader distribution across wallets
    • Action type breakdown
    • Fault statistics (by type and culprit)
  • Per-Wallet Details: For each wallet coordinated in a window:

    • Wallet public key hash
    • Leader address
    • Action type
    • Success/failure status
    • Duration
    • Error messages (if failed)
    • Detailed fault information
  • Memory Management: Tracks up to 100 recent windows (~25 hours) with automatic cleanup of older windows to prevent unbounded memory growth
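The per-window tracking and bounded retention described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the type names (`WindowStore`, `WindowMetrics`, `WalletResult`), the method names, and the JSON field names are all hypothetical. Only the limit of 100 retained windows and the kinds of fields recorded come from the description.

```go
package main

import (
	"sync"
	"time"
)

// maxTrackedWindows mirrors the retention limit from the PR description.
const maxTrackedWindows = 100

// WalletResult is a hypothetical per-wallet record for one window.
type WalletResult struct {
	WalletPublicKeyHash string        `json:"walletPublicKeyHash"`
	Leader              string        `json:"leader"`
	ActionType          string        `json:"actionType"`
	Success             bool          `json:"success"`
	Duration            time.Duration `json:"duration"`
	Error               string        `json:"error,omitempty"`
}

// WindowMetrics is a hypothetical per-window record.
type WindowMetrics struct {
	Index             uint64         `json:"index"`
	CoordinationBlock uint64         `json:"coordinationBlock"`
	StartTime         time.Time      `json:"startTime"`
	Wallets           []WalletResult `json:"wallets"`
}

// WindowStore keeps the most recent windows and evicts the oldest ones.
type WindowStore struct {
	mu      sync.Mutex
	windows map[uint64]*WindowMetrics
	order   []uint64 // window indexes in insertion order, oldest first
}

func NewWindowStore() *WindowStore {
	return &WindowStore{windows: make(map[uint64]*WindowMetrics)}
}

// StartWindow registers a window and prunes the oldest entries once the
// store holds more than maxTrackedWindows windows.
func (s *WindowStore) StartWindow(index, coordinationBlock uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.windows[index]; exists {
		return
	}
	s.windows[index] = &WindowMetrics{
		Index:             index,
		CoordinationBlock: coordinationBlock,
		StartTime:         time.Now(),
	}
	s.order = append(s.order, index)
	for len(s.order) > maxTrackedWindows {
		oldest := s.order[0]
		s.order = s.order[1:]
		delete(s.windows, oldest)
	}
}

// RecordWallet attaches a wallet result to a tracked window. It returns
// false when the window is unknown, for example because it was evicted.
func (s *WindowStore) RecordWallet(index uint64, result WalletResult) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	window, ok := s.windows[index]
	if !ok {
		return false
	}
	window.Wallets = append(window.Wallets, result)
	return true
}

// Counts returns the successful and failed wallet counts for a window.
func (s *WindowStore) Counts(index uint64) (successful, failed int, ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	window, ok := s.windows[index]
	if !ok {
		return 0, 0, false
	}
	for _, w := range window.Wallets {
		if w.Success {
			successful++
		} else {
			failed++
		}
	}
	return successful, failed, true
}

// Len returns the number of windows currently retained.
func (s *WindowStore) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.windows)
}
```

Keeping a separate insertion-order slice next to the map makes eviction of the oldest window cheap, and a single mutex is enough as long as every access goes through the store's methods.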

- Updated  to include a new peer for the sepolia network.
- Added timeout handling in  to prevent indefinite hangs.
- Introduced new system metrics: CPU load, RAM utilization, and swap utilization, with corresponding updates to the performance metrics registration.
- Introduced a new  structure to track detailed metrics for individual coordination windows, including timing, success rates, and fault statistics.
- Enhanced the coordination layer to record the start and end of coordination windows, as well as wallet-specific coordination details.
- Added new metrics for coordination windows, including total wallets coordinated, successful, and failed, along with fault tracking.
- Introduced new metrics for redemption actions, including total executions, success, and failure counts, as well as duration tracking.
- Updated the performance metrics registration to include these new redemption metrics.
- Refactored existing code to utilize defined constants for metric names, enhancing consistency and readability.
- Improved error handling in redemption proof submissions to accurately record failure metrics.
- Updated the  and  structures to include JSON tags for improved serialization.
- Introduced a new  structure to capture detailed fault information during coordination.
- Enhanced the  method to include error messages for failed wallet actions.
- Added a new method  to retrieve a summary of coordination window metrics.
- Registered coordination windows as a diagnostic source in the client info for better monitoring.
- Added a mutex and a map to track peers that have already been pinged to avoid duplicate ping tests.
- Updated the connected and disconnected callback functions to manage the pinged peers set, ensuring each unique peer is only pinged once.
- Enhanced disconnection handling to allow re-pinging if a peer reconnects later.
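The last three items (a mutex plus a set of already-pinged peers, updated from the connect and disconnect callbacks) could look roughly like the sketch below. The `pingTracker` type and its method names are hypothetical and stand in for whatever the libp2p provider actually uses; peer IDs are plain strings here to keep the example self-contained.

```go
package main

import "sync"

// pingTracker remembers which peers have already been ping-tested.
// Hypothetical type, for illustration only.
type pingTracker struct {
	mu     sync.Mutex
	pinged map[string]struct{}
}

func newPingTracker() *pingTracker {
	return &pingTracker{pinged: make(map[string]struct{})}
}

// onConnected reports whether a ping test should run for the peer.
// It returns true only the first time a peer is seen since its last
// disconnection.
func (t *pingTracker) onConnected(peerID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, seen := t.pinged[peerID]; seen {
		return false
	}
	t.pinged[peerID] = struct{}{}
	return true
}

// onDisconnected forgets the peer so that a later reconnect triggers
// a fresh ping test.
func (t *pingTracker) onDisconnected(peerID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.pinged, peerID)
}
```

Checking and marking the peer under the same lock matters here: two connection callbacks for the same peer firing concurrently would otherwise both decide to ping.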
Member

@lrsaturnino lrsaturnino left a comment


Good work! Some comments pending clarification.

Comment thread pkg/tbtc/coordination_window_metrics.go Outdated
Comment thread pkg/tbtc/coordination_window_metrics.go Outdated
Comment thread pkg/tbtc/coordination_window_metrics.go Outdated
Comment thread pkg/tbtc/coordination_window_metrics.go Outdated
Comment thread pkg/tbtc/node.go
Comment thread pkg/net/libp2p/libp2p.go Outdated
Comment thread pkg/tbtc/coordination_window_metrics.go Outdated
- Refactored metric increment calls in libp2p to utilize constants for peer connections, disconnections, and ping tests.
- Enhanced coordination window metrics by adding a mutex for safe access to previous window data across goroutines.
- Introduced a cleanup goroutine to ensure the end time of the last coordination window is recorded on shutdown.
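One way to read the last two items together is sketched below: the previous window's data sits behind a mutex, and a cleanup goroutine waits for a shutdown signal and then stamps the end time of the last open window. All names (`windowRecorder`, `watchShutdown`, and so on) are hypothetical, and the shutdown signal is modeled as a plain channel; in a context-based client, `ctx.Done()` could be passed in its place.

```go
package main

import (
	"sync"
	"time"
)

// windowTimes holds the timing data of one coordination window.
// Hypothetical type, for illustration only.
type windowTimes struct {
	Index uint64
	Start time.Time
	End   time.Time // zero until the window has been closed
}

// windowRecorder guards the most recent window with a mutex so that the
// coordination loop and the cleanup goroutine can both touch it safely.
type windowRecorder struct {
	mu       sync.Mutex
	previous *windowTimes
}

// start closes the previous window, if it is still open, and opens a new one.
func (r *windowRecorder) start(index uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	now := time.Now()
	if r.previous != nil && r.previous.End.IsZero() {
		r.previous.End = now
	}
	r.previous = &windowTimes{Index: index, Start: now}
}

// finalize records the end time of the last window if it is still open.
func (r *windowRecorder) finalize() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.previous != nil && r.previous.End.IsZero() {
		r.previous.End = time.Now()
	}
}

// ended reports whether the most recent window has an end time.
func (r *windowRecorder) ended() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.previous != nil && !r.previous.End.IsZero()
}

// watchShutdown starts the cleanup goroutine. The returned channel is
// closed once the last window has been finalized.
func (r *windowRecorder) watchShutdown(stop <-chan struct{}) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		<-stop
		r.finalize()
	}()
	return done
}
```

Without the cleanup step, the final window before shutdown would never receive an end time, because an end time is otherwise written only when the next window starts.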
lrsaturnino previously approved these changes Feb 3, 2026
Member

@lrsaturnino lrsaturnino left a comment


LGTM

- Updated the coordinationExecutor to return a partial result containing leader and faults information when a follower's routine fails, allowing for better metric recording.
…seconds for improved performance tracking.

- Added detailed comments to clarify the behavior of CPU utilization sampling and the prevention of double-recording in coordination window metrics.
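The partial-result change to the coordinationExecutor mentioned above follows a common Go pattern: return whatever was learned alongside the error, so the caller can still record it. The sketch below illustrates only that pattern. The types, the `coordinate` function, and the fault label are hypothetical stand-ins, not the executor's real signature.

```go
package main

import (
	"errors"
	"fmt"
)

// fault and coordinationResult are hypothetical, simplified types.
type fault struct {
	Culprit string
	Type    string
}

type coordinationResult struct {
	Wallet string
	Leader string
	Faults []fault
}

var errFollowerFailed = errors.New("follower routine failed")

// coordinate stands in for the executor. When the follower routine fails,
// it returns BOTH an error and a partial result carrying the leader and
// the faults observed so far, instead of a nil result.
func coordinate(
	wallet, leader string,
	followerErr error,
	observed []fault,
) (*coordinationResult, error) {
	result := &coordinationResult{Wallet: wallet, Leader: leader, Faults: observed}
	if followerErr != nil {
		return result, fmt.Errorf("coordination failed: %w", followerErr)
	}
	return result, nil
}

// windowCounters is a hypothetical aggregate for one coordination window.
type windowCounters struct {
	Successful   int
	Failed       int
	FaultsByType map[string]int
}

func newWindowCounters() *windowCounters {
	return &windowCounters{FaultsByType: make(map[string]int)}
}

// record counts the outcome and, if any result is available (full or
// partial), the faults it carries.
func (c *windowCounters) record(result *coordinationResult, err error) {
	if err != nil {
		c.Failed++
	} else {
		c.Successful++
	}
	if result == nil {
		return
	}
	for _, f := range result.Faults {
		c.FaultsByType[f.Type]++
	}
}
```

With a nil result on failure, the failed attempt would still be counted, but its leader and fault details would be lost to the window metrics.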
Member

@lrsaturnino lrsaturnino left a comment


LGTM

@lionakhnazarov lionakhnazarov merged commit d04ce46 into threshold-network:main Feb 5, 2026
29 checks passed

2 participants