
telemetry: submit BGP session status onchain per user #3487

Open
juan-malbeclabs wants to merge 8 commits into main from jo/3465-4

Conversation


@juan-malbeclabs commented on Apr 8, 2026

Resolves: #3465

Summary of Changes

  • Adds a BGP status submitter to the telemetry daemon that periodically checks BGP session state in the BGP VRF namespace and submits `Up`/`Down` status onchain for each activated user on the device
  • Adds `BGPStatus` type and `SetUserBGPStatus` executor instruction to the Go serviceability SDK
  • Fixes a gap where a disappeared tunnel interface (e.g., after an ungraceful daemon kill) would silently skip the `Down` submission — now correctly transitions from `Up` to `Down` when the tunnel is gone but the user remains activated onchain
  • Introduces a `CachingFetcher` in the telemetry daemon that deduplicates `GetProgramData` RPC calls across the BGP status submitter, ledger peer discovery, and collector within a 5s TTL window using `singleflight`, mirroring the pattern in the client daemon
  • Adds Prometheus metrics for both the caching layer and onchain submission outcomes
  • Adds an e2e test (`TestE2E_UserBGPStatus`) verifying the full `Up` → `Down` lifecycle

Diff Breakdown

| Category    | Files | Lines (+/-) | Net  |
|-------------|-------|-------------|------|
| Core logic  | 4     | +526 / -0   | +526 |
| Scaffolding | 5     | +161 / -7   | +154 |
| Tests       | 4     | +770 / -0   | +770 |
| Docs        | 1     | +4 / -0     | +4   |

Mostly new code: a self-contained submitter package, a shared caching layer, and their tests.


Testing Verification

  • Unit tests cover all tick-loop branches: session established, session down with and without grace period, tunnel not found with last status Up/Down, periodic refresh, in-flight dedup, and submission retry
  • `TestE2E_UserBGPStatus` verified end-to-end: connects a client, waits for BGP session to reach `Established` in vrf1, confirms `BGPStatusUp` appears onchain, then kills doublezerod ungracefully and confirms `BGPStatusDown` appears onchain within 60s

Base automatically changed from jo/3465-3 to main on April 8, 2026 at 16:42
@juan-malbeclabs marked this pull request as draft on April 8, 2026 at 18:12
@juan-malbeclabs marked this pull request as ready for review on April 8, 2026 at 21:01
@juan-malbeclabs force-pushed the jo/3465-4 branch 2 times, most recently from 4d1f467 to c6fd942 on April 8, 2026 at 22:04
…r instruction

Adds the BGPStatus enum (Unknown/Up/Down) with JSON serialization and
the SetUserBGPStatus instruction (code 106) to the Go serviceability SDK.
Introduces the bgpstatus package, which reads BGP neighbor state from
Arista devices and submits SetUserBGPStatus instructions onchain.
Wires it into the telemetry binary with four new flags:
--bgp-status-enable, --bgp-status-interval,
--bgp-status-refresh-interval, and --bgp-status-down-grace-period.
Adds an end-to-end test that starts a devnet with the BGP status
submitter enabled and verifies the onchain state reflects the actual
BGP session status. Also extends DeviceTelemetrySpec with the four
BGP status flags and updates user list golden files to include the
new bgp_status column.
- In the bgpstatus submitter, when FindLocalTunnel returns
  ErrLocalTunnelNotFound but the last known onchain status is Up,
  fall through with observedUp=false so the Down transition is submitted.
  Previously the submitter always continued on tunnel-not-found, so a
  clean-disconnect scenario (tunnel interface removed) never triggered
  a Down submission.
- Change the e2e disconnect step to kill doublezerod ungracefully
  (SIGKILL) instead of running "doublezero disconnect": a clean
  disconnect deletes the user account onchain before the submitter can
  record Down, whereas killing the daemon drops the BGP session while
  the user remains activated.
- Add volume cleanup to "make e2e-test-cleanup" so persistent ledger
  volumes from test-keep runs don't carry stale state into subsequent
  runs.
Introduce a CachingFetcher in controlplane/telemetry/internal/serviceability/
that wraps any ProgramDataProvider and deduplicates RPC calls within a 5s
TTL window using sync.RWMutex + singleflight, mirroring the pattern in
client/doublezerod/internal/onchain/fetcher.go.

Wire the BGP status submitter through the cached client so multiple
telemetry components can share a single GetProgramData result per window
instead of each issuing independent RPCs.
Add observability for the BGP status pipeline:

CachingFetcher (controlplane/telemetry/internal/serviceability):
- doublezero_telemetry_programdata_fetch_duration_seconds: RPC latency
- doublezero_telemetry_programdata_fetch_total{result}: fetch outcomes
- doublezero_telemetry_programdata_stale_cache_age_seconds: staleness on error

BGP status submitter (controlplane/telemetry/internal/bgpstatus):
- doublezero_bgpstatus_submissions_total{bgp_status,result}: onchain submissions by status and outcome
- doublezero_bgpstatus_submission_duration_seconds: onchain transaction latency
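Using the metric names above, a scrape of the telemetry daemon might look like the following. The label values and sample numbers are purely illustrative:

```
doublezero_telemetry_programdata_fetch_total{result="success"} 42
doublezero_telemetry_programdata_fetch_total{result="error"} 1
doublezero_bgpstatus_submissions_total{bgp_status="up",result="success"} 5
doublezero_bgpstatus_submissions_total{bgp_status="down",result="success"} 2
```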
Create a single CachingFetcher instance in main() and pass it to all
three consumers — ledger peer discovery, the telemetry collector, and
the BGP status submitter — so they share one GetProgramData result per
TTL window instead of each issuing independent RPCs.

Change Config.ServiceabilityProgramClient from *serviceability.Client
to the ServiceabilityProgramClient interface so the cached wrapper can
be injected there too.
```go
	continue
}

userPK := solana.PublicKeyFromBytes(user.PubKey[:]).String()
```

After a telemetry restart, userState is empty so lastOnchainStatus is zero (BGPStatusUnknown). If a user was previously Up onchain but the tunnel has since disappeared, this branch hits continue because Unknown != Up, and the Down transition is never submitted — the user stays Up onchain indefinitely.

Seed lastOnchainStatus from the onchain user.BgpStatus field (already available on the User struct) when creating the userState entry, or hydrate state from program data before the first tick.


```go
// Slow path: fetch via singleflight so concurrent callers share one RPC.
v, err, _ := f.group.Do("fetch", func() (any, error) {
	// Re-check cache — another goroutine may have refreshed it while we waited.
```

The singleflight.Do callback captures the first caller's ctx. If that caller's context is cancelled while the RPC is in-flight, the fetch fails for all waiters — even if their contexts are still valid. On a cold cache (no stale data to fall back on), one cancelled context causes all concurrent callers to get an error.

Use context.WithoutCancel(ctx) (or a detached context with a timeout) for the RPC call inside the singleflight callback.

```go
// Start launches the submitter in the background and returns a channel that
// receives a fatal error (or is closed on clean shutdown). It mirrors the
// state.Collector.Start pattern.
func (s *Submitter) Start(ctx context.Context, cancel context.CancelFunc) <-chan error {
```

This map grows as users are seen but is never pruned. If a user is deactivated or moved to another device, their entry persists forever — a slow memory leak over long uptimes.

Sweep userState at the end of each tick to remove keys not present in the current programData.Users for this device.



Development

Successfully merging this pull request may close these issues.

PR 2 - Account changes & SDK updates
