
Eval: wire aikit-sdk token usage and TokenUsageLine into artifacts #118

@aroff

Summary

FastSkill eval already defines input_tokens / output_tokens on CaseResult and CaseSummary, but the eval runner always sets them to None. aikit-sdk on goaikit/aikit main (which includes the token usage aggregation work, e.g. PR #60) now exposes:

  • RunResult.token_usage: Option<TokenUsage> with aggregated input_tokens, output_tokens, optional totals, cache, and reasoning fields.
  • AgentEventPayload::TokenUsageLine { usage, source, raw_agent_line_seq } on the streaming callback when RunOptions::emit_token_usage_events is true (default).

This issue tracks updating gofastskill/fastskill so eval runs record that data in artifacts and traces.

Dependency: latest aikit-sdk

  • Requirement: Bump the resolved aikit-sdk revision to the current tip of https://github.com/goaikit/aikit branch main (run cargo update -p aikit-sdk and commit Cargo.lock).
  • Reference SHA (verify at implementation time): main at e007c0182feefa87ba6a1d405826ae420953ca63 (must be at or after the token-usage merge; e.g. commit 72a69c7d517bb26a1e443d51b273ab126b04e1ce and descendants).
  • [dependencies] in Cargo.toml already uses aikit-sdk = { git = "...", branch = "main" }; the important part is refreshing Cargo.lock so CI and local builds pick up the new APIs.

Code changes (FastSkill)

1. src/eval/trace.rs — handle new event variant

  • Extend TracePayload with a variant for token usage (e.g. embed usage + source as JSON-friendly structs, or reuse types re-exported from aikit-sdk if appropriate).
  • Update agent_events_to_trace to match AgentEventPayload::TokenUsageLine so trace.jsonl includes per-step usage when the SDK emits it.
  • Ensure existing consumers (count_raw_json_events, checks on trace) remain correct: token lines should not be counted as raw_json tool commands unless you intentionally change that definition.
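The trace changes above can be sketched roughly as follows. This is a minimal, std-only illustration and not the actual FastSkill or aikit-sdk code: the TokenUsage, AgentEventPayload, and TracePayload definitions are simplified stand-ins, the RawJson variant and all field types (u64, String) are assumptions, and the real implementation would presumably derive serde traits for trace.jsonl.

```rust
// Stand-in for aikit-sdk's TokenUsage; the real struct has more fields
// (totals, cache, reasoning) and its field types are assumed here.
#[derive(Debug, Clone, PartialEq)]
struct TokenUsage {
    input_tokens: u64,
    output_tokens: u64,
}

// Simplified stand-in for the SDK event enum. Only TokenUsageLine is named
// in this issue; RawJson is a hypothetical placeholder for existing variants.
#[derive(Debug, Clone, PartialEq)]
enum AgentEventPayload {
    RawJson(String),
    TokenUsageLine {
        usage: TokenUsage,
        source: String,
        raw_agent_line_seq: u64,
    },
}

// Simplified stand-in for the trace payload with the new token usage variant.
#[derive(Debug, Clone, PartialEq)]
enum TracePayload {
    RawJson(String),
    TokenUsage { usage: TokenUsage, source: String },
}

// Convert SDK events into trace entries, including the new variant.
fn agent_events_to_trace(events: &[AgentEventPayload]) -> Vec<TracePayload> {
    events
        .iter()
        .map(|event| match event {
            AgentEventPayload::RawJson(line) => TracePayload::RawJson(line.clone()),
            AgentEventPayload::TokenUsageLine { usage, source, .. } => {
                TracePayload::TokenUsage {
                    usage: usage.clone(),
                    source: source.clone(),
                }
            }
        })
        .collect()
}

// Count only raw JSON entries; token usage entries are deliberately excluded.
fn count_raw_json_events(trace: &[TracePayload]) -> usize {
    trace
        .iter()
        .filter(|payload| matches!(payload, TracePayload::RawJson(_)))
        .count()
}
```

Matching on the variant explicitly in count_raw_json_events keeps the existing raw_json counts stable even when the stream now interleaves token usage lines.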

2. src/eval/runner.rs — populate CaseResult from RunResult

  • When run_agent_events returns Ok(run_result), map run_result.token_usage into CaseResult:
    • Set input_tokens / output_tokens from TokenUsage when it is Some; leave them as None otherwise.
    • Optionally extend CaseResult (and its serde JSON) with optional fields for total_tokens, cache, and reasoning if the product needs full detail in result.json; otherwise document that only input/output are stored in the flat fields.
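A minimal sketch of that mapping, assuming simplified stand-in types: the real CaseResult in src/eval/artifacts.rs and TokenUsage in aikit-sdk have more fields, the u64 field types are an assumption, and apply_token_usage is a hypothetical helper name.

```rust
// Stand-in for aikit-sdk's TokenUsage (field types assumed).
#[derive(Debug, Clone, PartialEq)]
struct TokenUsage {
    input_tokens: u64,
    output_tokens: u64,
}

// Stand-in for the token-related part of CaseResult.
#[derive(Debug, Default, PartialEq)]
struct CaseResult {
    input_tokens: Option<u64>,
    output_tokens: Option<u64>,
}

// Hypothetical helper: copy aggregated usage into the flat result fields.
// When the SDK reports no usage, both fields become None.
fn apply_token_usage(result: &mut CaseResult, usage: Option<&TokenUsage>) {
    result.input_tokens = usage.map(|u| u.input_tokens);
    result.output_tokens = usage.map(|u| u.output_tokens);
}
```

In the runner this would be called with something like run_result.token_usage.as_ref() on the Ok path, so runs without usage data keep today's None behaviour.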

3. eval report / eval score (if applicable)

  • Ensure the human-readable report shows token totals when they are present.
  • Confirm the eval score re-read path still deserializes result.json after any schema extension (new fields must be backward-compatible Option values so older artifacts still load).
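For the report side, one possible shape is a small formatter that prints a token line only when at least one count is present. This is a sketch under assumptions: format_token_line and the output wording are hypothetical and not the existing report format.

```rust
// Hypothetical report helper: returns None when no token data was recorded,
// so the report can skip the line for older or usage-less artifacts.
fn format_token_line(input: Option<u64>, output: Option<u64>) -> Option<String> {
    match (input, output) {
        (None, None) => None,
        (i, o) => {
            // Render a missing count as "n/a" rather than 0 to avoid
            // implying that zero tokens were used.
            let show = |v: Option<u64>| v.map_or_else(|| "n/a".to_string(), |n| n.to_string());
            Some(format!("tokens: input={} output={}", show(i), show(o)))
        }
    }
}
```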

4. Tests

  • Update or add CLI/integration tests under tests/cli/eval_tests.rs (or unit tests with a stub runner) so a fake RunResult with token_usage produces non-null token fields in result.json.
  • If the trace format gains a new payload type, add a small serialization test in eval/trace.rs tests.

Acceptance criteria

  • Cargo.lock pins aikit-sdk to a recent main revision that includes TokenUsage / TokenUsageLine.
  • fastskill eval run writes non-null input_tokens / output_tokens in per-case result.json when the agent stream includes usage the SDK can aggregate.
  • trace.jsonl records token usage events without breaking existing checks that rely on raw_json command counts.
  • Docs: short note in skill or repo docs that eval artifacts may include token usage (optional follow-up in gofastskill/skill references/eval.md).

References

  • aikit-sdk: TokenUsage, RunResult::token_usage, AgentEventPayload::TokenUsageLine, aggregate_token_usage in aikit-sdk/src/runner.rs.
  • fastskill eval: src/eval/runner.rs, src/eval/trace.rs, src/eval/artifacts.rs (CaseResult).
