feat: structured benchmark result output via BENCH_RESULT_OUTPUT #3195
Conversation
emit full benchmark run metadata (config, tags, metrics, block range, spamoor stats) as JSON when BENCH_RESULT_OUTPUT is set. consumed by external matrix runner for table generation.
Claude finished @chatton's task in 1m 57s — Code Review
Overall, this is a clean and well-structured addition. The defer-based emission pattern is consistent and safe, and the schema is self-describing. A few issues are worth fixing before merge, ranging from a functional gap to minor nits.
| # | Issue | Severity |
|---|---|---|
| 1 | New config fields missing from JSON output | Medium — functional gap |
| 2 | evRethAttrs collected but not emitted | Medium — silent data loss |
| 3 | BENCH_OBJECTIVE read from env inside builder | Low — inconsistency |
| 4 | 7× JSON unmarshal per span for resource attrs | Low — clean-up |
| 5 | cfg.log called inconsistently, missing new fields | Low — observability |
| 6 | Negative env int values not validated | Low — fail-fast |
| 7 | float64 count type unexplained | Nit |
| 8 | Schema version policy undocumented | Nit |
The fetchResourceAttrs timeout issue (item from CodeRabbit / @chatton) is tracked for follow-up — no action needed here.
📝 Walkthrough

Adds structured JSON "run result" emission for benchmarks, expands benchConfig with spammer parameters, captures final spamoor metrics and resource attributes, refactors trace/stat helpers, and integrates result emission into multiple spamoor/gasburner benchmark tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test as Benchmark Test
    participant SpammerAPI as Spamoor API
    participant Traces as VictoriaTraces
    participant FS as Filesystem
    Test->>SpammerAPI: run workload / waitForSpamoorDone
    SpammerAPI-->>Test: final sent/failed counters
    Test->>Traces: collect spans & resource attrs
    Traces-->>Test: spans + resource attributes
    Test->>Test: buildRunResult(result, wallClock, spans, attrs, spamoorStats)
    Test->>FS: emitRunResult -> write JSON file (BENCH_RESULT_OUTPUT)
    FS-->>Test: write success/failure
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed, ❌ 1 failed (inconclusive)
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Codecov Report — ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##             main    #3195    +/-  ##
=========================================
+ Coverage   61.43%   61.45%   +0.01%
=========================================
  Files         120      120
  Lines       12470    12470
=========================================
+ Hits         7661     7663       +2
+ Misses       3949     3948       -1
+ Partials      860      859       -1
```
- Deduplicate overhead/reth-rate computation: move stats-based helpers to helpers.go, make span-based wrappers delegate to them - Fix sub-millisecond precision loss in engine span timings by using microsecond-based float division instead of integer truncation - Add spamoor stats to TestGasBurner for consistency with other tests
- Add MaxPending, Rebroadcast, BaseFee, TipFee to benchConfig - Fix ERC20 test hardcoding max_wallets=200 instead of using cfg - Replace all hardcoded spamoor params with cfg fields across tests
- Add resourceAttrs struct with host, OS, and service fields - Extract attributes from VictoriaTraces LogsQL span data via resourceAttrCollector interface - Include host metadata in structured benchmark result JSON
Move emitRunResult into a deferred closure in all three test functions. If the test fails after metrics are collected, the structured JSON is still written. If it fails before result data exists, the defer is a no-op.
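The deferred-closure pattern described here can be sketched with hypothetical stand-in types (`benchResult` and `emit` are illustrative; `emit` counts emissions instead of writing a file):

```go
package main

import "fmt"

// benchResult is a hypothetical stand-in for the collected metrics.
type benchResult struct{ populated bool }

var emitted int

func emit(r *benchResult) { emitted++ }

// runBenchmark mirrors the deferred-closure pattern: the defer runs even
// if the test bails out mid-way, and is a no-op when no result data
// exists yet.
func runBenchmark(failEarly bool) {
	var res *benchResult
	defer func() {
		if res == nil || !res.populated {
			return // failed before metrics existed: no-op
		}
		emit(res)
	}()

	if failEarly {
		return // simulates a failure before collection
	}
	res = &benchResult{populated: true}
}

func main() {
	runBenchmark(true)  // fails before data exists: defer is a no-op
	runBenchmark(false) // full run: result is emitted once
	fmt.Println(emitted)
}
```

Because the closure reads `res` at defer-execution time (not at defer-registration time), metrics collected any time before the function returns are still emitted.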
Remove the 3-second sleep before requireSpammersRunning that caused all transactions to be mined before the measurement window started, leaving SteadyState at 0s. Also add deferred emitRunResult, configurable spamoor params, and spamoorStats collection to match the other benchmark tests.
… mode In external mode the sequencer reports spans as "ev-node" (not the test-specific name like "ev-node-erc20"), so trace queries returned zero spans. Store service names on env: local mode uses the test-specific name, external mode defaults to "ev-node"/"ev-reth" with BENCH_EVNODE_SERVICE_NAME/BENCH_EVRETH_SERVICE_NAME overrides.
fetchResourceAttrs only needs one span but was streaming the full result set from VictoriaTraces. Add limit=1 to the LogsQL query to avoid wasting bandwidth on long-lived instances with many spans.
Actionable comments posted: 3
🧹 Nitpick comments (1)
test/e2e/benchmark/helpers.go (1)
390-395: Avoid silent metric disappearance on “zero-count” paths. At Line 390 / Line 394 / Line 413, returning `ok=false` drops the metric downstream with no diagnostic signal (field omitted in JSON/log paths). For matrix comparisons, this can make runs harder to interpret. Consider surfacing an explicit reason (or a stable placeholder field) so missing values are distinguishable from truly unavailable traces.

Also applies to: 413-414
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/benchmark/helpers.go` around lines 390 - 395, The current metric-extraction code returns ok=false whenever a metric exists but has Count==0 (e.g., the checks around produce, execute and stats[spanExecuteTxs]), which causes the metric to be dropped downstream and indistinguishable from a truly missing metric; change the logic in the function that inspects stats (the blocks referencing produce, execute, stats and spanExecuteTxs) so that instead of returning ok=false for zero-count cases you return a stable placeholder value (e.g., NaN or a sentinel like -1) or set ok=true plus an explicit reason/flag (e.g., zeroCount=true or reason="zero_count") so downstream JSON/loggers always include the field and can differentiate “zero count” from “no data.” Ensure the change updates all similar branches (including the checks at lines handling produce/execute and the block noted at 413–414) and preserves the original non-zero behavior.
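One way the suggested placeholder could look — `spanMetric` and `fromCount` are hypothetical names, not from the codebase:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// spanMetric keeps zero-count metrics present in the JSON output with an
// explicit reason, instead of dropping them via ok=false, so "zero
// count" and "no data" stay distinguishable downstream.
type spanMetric struct {
	Value  float64 `json:"value"`
	Reason string  `json:"reason,omitempty"` // e.g. "zero_count"
}

// fromCount converts a raw count/sum pair into a metric that is always
// serialized, even when count == 0.
func fromCount(count, sum float64) spanMetric {
	if count == 0 {
		return spanMetric{Value: 0, Reason: "zero_count"}
	}
	return spanMetric{Value: sum / count}
}

func main() {
	out, _ := json.Marshal(map[string]spanMetric{
		"produce": fromCount(0, 0),
		"execute": fromCount(4, 10),
	})
	fmt.Println(string(out))
}
```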
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/benchmark/config.go`:
- Around line 50-53: The BENCH_* knobs can be negative because envInt allows -1;
update the benchConfig construction (where MaxPending, Rebroadcast, BaseFee,
TipFee are set) to validate or clamp values returned by envInt so negatives are
rejected or set to sane defaults—e.g., ensure MaxPending >= 0, Rebroadcast >= 0,
BaseFee >= 0, TipFee >= 0—and return an error (or panic) from the bench config
constructor (the function that builds benchConfig) on invalid inputs so
misconfiguration fails fast before hitting Spamoor.
In `@test/e2e/benchmark/suite_test.go`:
- Around line 246-248: collectTraces currently populates tr.evRethAttrs but
buildRunResult only includes traces.evNodeAttrs in the serialized
BENCH_RESULT_OUTPUT, so evRethAttrs is never exported; to fix, update
buildRunResult (or the serialization path that produces BENCH_RESULT_OUTPUT) to
include tr.evRethAttrs alongside tr.evNodeAttrs (e.g., add a field for
evRethAttrs in the RunResult/traces struct and serialize it), and ensure any
consumers expecting the output are updated to read the new field; touch the
functions buildRunResult and the traces/RunResult type so evRethAttrs is carried
through to output.
In `@test/e2e/benchmark/traces.go`:
- Around line 266-304: fetchResourceAttrs can block forever because it uses the
caller context (t.Context()) for the HTTP request; wrap the incoming ctx in a
short child timeout and use that for the request to ensure the optional
resource-attrs query cannot hang the whole test. Modify fetchResourceAttrs to
create a child context with a small timeout (e.g. 5s) via
context.WithTimeout(ctx, 5*time.Second) and defer cancel(), then pass that child
context to http.NewRequestWithContext / http.DefaultClient.Do so the request and
response read will be cancelled if it stalls; ensure the cancel is deferred
immediately after creating the timed context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0ceebf01-d369-4248-87a4-dbfcd7f438df
📒 Files selected for processing (11)
- test/e2e/benchmark/config.go
- test/e2e/benchmark/gasburner_test.go
- test/e2e/benchmark/helpers.go
- test/e2e/benchmark/result.go
- test/e2e/benchmark/run_result.go
- test/e2e/benchmark/spamoor_defi_test.go
- test/e2e/benchmark/spamoor_erc20_test.go
- test/e2e/benchmark/spamoor_smoke_test.go
- test/e2e/benchmark/spamoor_state_pressure_test.go
- test/e2e/benchmark/suite_test.go
- test/e2e/benchmark/traces.go
🚧 Files skipped from review as they are similar to previous changes (4)
- test/e2e/benchmark/spamoor_erc20_test.go
- test/e2e/benchmark/spamoor_defi_test.go
- test/e2e/benchmark/gasburner_test.go
- test/e2e/benchmark/run_result.go
```go
MaxPending:  envInt("BENCH_MAX_PENDING", 50_000),
Rebroadcast: envInt("BENCH_REBROADCAST", 0),
BaseFee:     envInt("BENCH_BASE_FEE", 20),
TipFee:      envInt("BENCH_TIP_FEE", 2),
```
Reject negative BENCH_* load knobs up front.
envInt accepts -1 here, so max_pending, rebroadcast, and the fee fields can go negative and only fail much later inside Spamoor. Validate or clamp these values when building benchConfig so misconfigured runs fail fast.
Possible fix

```diff
+func envNonNegativeInt(key string, fallback int) int {
+	n := envInt(key, fallback)
+	if n < 0 {
+		return fallback
+	}
+	return n
+}
+
-	MaxPending:  envInt("BENCH_MAX_PENDING", 50_000),
-	Rebroadcast: envInt("BENCH_REBROADCAST", 0),
-	BaseFee:     envInt("BENCH_BASE_FEE", 20),
-	TipFee:      envInt("BENCH_TIP_FEE", 2),
+	MaxPending:  envNonNegativeInt("BENCH_MAX_PENDING", 50_000),
+	Rebroadcast: envNonNegativeInt("BENCH_REBROADCAST", 0),
+	BaseFee:     envNonNegativeInt("BENCH_BASE_FEE", 20),
+	TipFee:      envNonNegativeInt("BENCH_TIP_FEE", 2),
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
func envNonNegativeInt(key string, fallback int) int {
	n := envInt(key, fallback)
	if n < 0 {
		return fallback
	}
	return n
}

MaxPending:  envNonNegativeInt("BENCH_MAX_PENDING", 50_000),
Rebroadcast: envNonNegativeInt("BENCH_REBROADCAST", 0),
BaseFee:     envNonNegativeInt("BENCH_BASE_FEE", 20),
TipFee:      envNonNegativeInt("BENCH_TIP_FEE", 2),
```
```go
if rac, ok := e.traces.(resourceAttrCollector); ok {
	tr.evNodeAttrs = rac.fetchResourceAttrs(ctx, e.evNodeServiceName)
	tr.evRethAttrs = rac.fetchResourceAttrs(ctx, e.evRethServiceName)
}
```
evRethAttrs is fetched but never reaches the artifact.
collectTraces fills tr.evRethAttrs here, but buildRunResult only serializes traces.evNodeAttrs (test/e2e/benchmark/run_result.go:166-172). Right now the second lookup is dead work and the reth-side host metadata is silently dropped from BENCH_RESULT_OUTPUT.
```go
func (v *victoriaTraceProvider) fetchResourceAttrs(ctx context.Context, serviceName string) *resourceAttrs {
	end := time.Now()
	query := fmt.Sprintf(`_stream:{resource_attr:service.name="%s"}`, serviceName)
	baseURL := strings.TrimRight(v.queryURL, "/")
	url := fmt.Sprintf("%s/select/logsql/query?query=%s&start=%s&end=%s&limit=1",
		baseURL,
		neturl.QueryEscape(query),
		v.startTime.Format(time.RFC3339Nano),
		end.Format(time.RFC3339Nano))

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		v.t.Logf("warning: failed to create resource attrs request: %v", err)
		return nil
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		v.t.Logf("warning: failed to fetch resource attrs: %v", err)
		return nil
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		v.t.Logf("warning: unexpected status %d fetching resource attrs", resp.StatusCode)
		return nil
	}

	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for scanner.Scan() {
		line := scanner.Bytes()
		if len(line) == 0 {
			continue
		}
		attrs := extractResourceAttrs(line)
		return &attrs
	}
	return nil
}
```
Give the optional resource-attrs query its own timeout.
collectTraces passes t.Context() into this method, so a stalled VictoriaTraces endpoint can keep http.DefaultClient.Do blocked until the whole test times out. That makes optional host metadata capable of hanging the benchmark and delaying artifact emission after span collection already succeeded.
Possible fix

```diff
 func (v *victoriaTraceProvider) fetchResourceAttrs(ctx context.Context, serviceName string) *resourceAttrs {
+	ctx, cancel := context.WithTimeout(ctx, 15*time.Second)
+	defer cancel()
+
 	end := time.Now()
 	query := fmt.Sprintf(`_stream:{resource_attr:service.name="%s"}`, serviceName)
```
will address in follow up
@chatton, understood — thanks for the heads-up! Would you like me to open a GitHub issue to track adding the context.WithTimeout guard to fetchResourceAttrs so it doesn't get lost?
🧠 Learnings used
Learnt from: CR
Repo: evstack/ev-node PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-20T13:39:38.827Z
Learning: Applies to **/*.go : Use context.Context for cancellation in Go
Learnt from: alpe
Repo: evstack/ev-node PR: 3131
File: block/internal/syncing/syncer_backoff_test.go:241-245
Timestamp: 2026-03-06T09:40:36.029Z
Learning: In evstack/ev-node, the module declares go 1.25.6. Since Go 1.22, loop variables are per-iteration by default, so loop variable capture is not a concern. Do not flag or fix loop variable capture in this codebase for any Go files; this rule applies to all Go files in the repository, not just the specific test file.
Summary

- `run_result.go` with structured JSON output for benchmark runs, written to `BENCH_RESULT_OUTPUT` when set
- `emitRunResult` in TestGasBurner, TestERC20Throughput, and TestDeFiSimulation

Context

Supports the external-mode matrix runner in the benchmarks repo. Each matrix entry sets `BENCH_RESULT_OUTPUT` to collect structured results that can be aggregated into comparison tables. This output will be used across many runs with different permutations of configurations to generate markdown tables.