telemetry: submit BGP session status onchain per user #3487
juan-malbeclabs wants to merge 8 commits into main
Conversation
Force-pushed from 31debd6 to 971082d
Force-pushed from 4d1f467 to c6fd942
…r instruction
Adds the BGPStatus enum (Unknown/Up/Down) with JSON serialization and the SetUserBGPStatus instruction (code 106) to the Go serviceability SDK.
Introduces the bgpstatus package, which reads BGP neighbor state from Arista devices and submits SetUserBGPStatus instructions onchain. Wires it into the telemetry binary with four new flags: --bgp-status-enable, --bgp-status-interval, --bgp-status-refresh-interval, and --bgp-status-down-grace-period.
Adds an end-to-end test that starts a devnet with the BGP status submitter enabled and verifies the onchain state reflects the actual BGP session status. Also extends DeviceTelemetrySpec with the four BGP status flags and updates user list golden files to include the new bgp_status column.
- In the bgpstatus submitter, when FindLocalTunnel returns ErrLocalTunnelNotFound but the last known onchain status is Up, fall through with observedUp=false so the Down transition is submitted. Previously the submitter always continued on tunnel-not-found, so a clean-disconnect scenario (tunnel interface removed) never triggered a Down submission.
- Change the e2e disconnect step to kill doublezerod ungracefully (SIGKILL) instead of running "doublezero disconnect": a clean disconnect deletes the user account onchain before the submitter can record Down, whereas killing the daemon drops the BGP session while the user remains activated.
- Add volume cleanup to "make e2e-test-cleanup" so persistent ledger volumes from test-keep runs don't carry stale state into subsequent runs.
Introduce a CachingFetcher in controlplane/telemetry/internal/serviceability/ that wraps any ProgramDataProvider and deduplicates RPC calls within a 5s TTL window using sync.RWMutex + singleflight, mirroring the pattern in client/doublezerod/internal/onchain/fetcher.go. Wire the BGP status submitter through the cached client so multiple telemetry components can share a single GetProgramData result per window instead of each issuing independent RPCs.
Add observability for the BGP status pipeline:
CachingFetcher (controlplane/telemetry/internal/serviceability):
- doublezero_telemetry_programdata_fetch_duration_seconds: RPC latency
- doublezero_telemetry_programdata_fetch_total{result}: fetch outcomes
- doublezero_telemetry_programdata_stale_cache_age_seconds: staleness on error
BGP status submitter (controlplane/telemetry/internal/bgpstatus):
- doublezero_bgpstatus_submissions_total{bgp_status,result}: onchain submissions by status and outcome
- doublezero_bgpstatus_submission_duration_seconds: onchain transaction latency
Create a single CachingFetcher instance in main() and pass it to all three consumers — ledger peer discovery, the telemetry collector, and the BGP status submitter — so they share one GetProgramData result per TTL window instead of each issuing independent RPCs. Change Config.ServiceabilityProgramClient from *serviceability.Client to the ServiceabilityProgramClient interface so the cached wrapper can be injected there too.
Force-pushed from 43ab70b to cfc6729
    continue
}

userPK := solana.PublicKeyFromBytes(user.PubKey[:]).String()
After a telemetry restart, userState is empty so lastOnchainStatus is zero (BGPStatusUnknown). If a user was previously Up onchain but the tunnel has since disappeared, this branch hits continue because Unknown != Up, and the Down transition is never submitted — the user stays Up onchain indefinitely.
Seed lastOnchainStatus from the onchain user.BgpStatus field (already available on the User struct) when creating the userState entry, or hydrate state from program data before the first tick.
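The suggested fix is a one-line change at the point where the state entry is created. A sketch of the seeding approach, with the surrounding types reduced to minimal stand-ins (state, ensureState, and the field names are illustrative; user.BgpStatus comes from the review comment):

```go
package main

import "fmt"

type BGPStatus int

const (
	BGPStatusUnknown BGPStatus = iota
	BGPStatusUp
	BGPStatusDown
)

// User stands in for the onchain user account, which already carries
// the last recorded BgpStatus.
type User struct {
	PubKey    string
	BgpStatus BGPStatus
}

type state struct{ lastOnchainStatus BGPStatus }

// ensureState returns the tracked state for a user, seeding a new entry
// from the onchain BgpStatus field rather than the zero value (Unknown),
// so transitions are still detected after a telemetry restart.
func ensureState(userState map[string]*state, user User) *state {
	st, ok := userState[user.PubKey]
	if !ok {
		st = &state{lastOnchainStatus: user.BgpStatus}
		userState[user.PubKey] = st
	}
	return st
}

func main() {
	// Simulated restart: in-memory state is empty, but the user is Up onchain
	// and the tunnel has since disappeared.
	userState := map[string]*state{}
	st := ensureState(userState, User{PubKey: "user-1", BgpStatus: BGPStatusUp})

	observedUp := false // FindLocalTunnel returned ErrLocalTunnelNotFound
	if !observedUp && st.lastOnchainStatus == BGPStatusUp {
		fmt.Println("submit Down") // transition fires even after restart
		st.lastOnchainStatus = BGPStatusDown
	}
}
```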
// Slow path: fetch via singleflight so concurrent callers share one RPC.
v, err, _ := f.group.Do("fetch", func() (any, error) {
	// Re-check cache — another goroutine may have refreshed it while we waited.
The singleflight.Do callback captures the first caller's ctx. If that caller's context is cancelled while the RPC is in-flight, the fetch fails for all waiters — even if their contexts are still valid. On a cold cache (no stale data to fall back on), one cancelled context causes all concurrent callers to get an error.
Use context.WithoutCancel(ctx) (or a detached context with a timeout) for the RPC call inside the singleflight callback.
// Start launches the submitter in the background and returns a channel that
// receives a fatal error (or is closed on clean shutdown). It mirrors the
// state.Collector.Start pattern.
func (s *Submitter) Start(ctx context.Context, cancel context.CancelFunc) <-chan error {
This map grows as users are seen but is never pruned. If a user is deactivated or moved to another device, their entry persists forever — a slow memory leak over long uptimes.
Sweep userState at the end of each tick to remove keys not present in the current programData.Users for this device.
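The sweep is a straightforward mark-and-delete over the map at the end of each tick. A minimal sketch (the state value type and function names are illustrative):

```go
package main

import "fmt"

// pruneStale drops state entries whose user pubkey is absent from the
// current program data for this device, preventing unbounded growth.
func pruneStale(userState map[string]int, currentUsers map[string]struct{}) {
	for pk := range userState {
		if _, ok := currentUsers[pk]; !ok {
			delete(userState, pk)
		}
	}
}

func main() {
	// Per-user submitter state accumulated across ticks.
	userState := map[string]int{"user-a": 1, "user-b": 2, "user-c": 3}
	// This tick's program data: user-b was deactivated or moved devices.
	current := map[string]struct{}{"user-a": {}, "user-c": {}}

	pruneStale(userState, current) // end-of-tick sweep
	fmt.Println(len(userState))    // 2
}
```

Deleting from a Go map while ranging over it is safe for keys the range has already visited, so the sweep needs no intermediate slice.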
Resolves: #3465
Summary of Changes
Diff Breakdown
Mostly new code: a self-contained submitter package, a shared caching layer, and their tests.
Key files
Testing Verification