feat: add financial governance evaluators (spend limits + transaction policy) by up2itnow0822 · Pull Request #141 · agentcontrol/agent-control

up2itnow0822 · 2026-03-20T17:07:40Z

Summary

Implements the financial governance evaluator proposed in #129, following @lan17's technical guidance.

Two contrib evaluators for enforcing financial spend limits on autonomous AI agent transactions:

`financial_governance.spend_limit`

Tracks cumulative agent spend and enforces rolling budget limits:

Per-transaction caps
Rolling period budgets (configurable window, default 24h)
Context-aware overrides (different limits per channel/agent/session via evaluate metadata)
Pluggable SpendStore protocol — InMemorySpendStore included, bring your own PostgreSQL/Redis

`financial_governance.transaction_policy`

Static policy checks with no state tracking:

Currency allowlist
Recipient blocklist/allowlist
Min/max amount bounds

Design Decisions (per #129 discussion)

Decoupled from data source — SpendStore protocol means no new tables in core. Custom backends implement two methods: record_spend() and get_spend().
No changes to core — Self-contained contrib package under evaluators/contrib/financial-governance/.
Context-aware limits — Override keys in the evaluate data dict (channel_max_per_transaction, channel_max_per_period) allow per-context policies without separate evaluator instances.
Python SDK compatible — Standard Evaluator interface, works with both server and SDK evaluation engine.

Tests

53 tests passing — covers both evaluators, the InMemorySpendStore, config validation, edge cases, and context overrides.

pytest tests/ -v
53 passed in 2.79s

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

lan17 · 2026-03-20T18:12:02Z

...nancial-governance/src/agent_control_evaluator_financial_governance/spend_limit/evaluator.py

+                matched=False,
+                confidence=1.0,
+                message="Transaction data missing required field 'amount'",
+                error="Missing required field: amount",


I think this is conflating malformed runtime payload with evaluator failure. The config itself is already validated via the Pydantic config model; what this code is checking here is the selector output passed into evaluate(). EvaluatorResult.error is supposed to be for crashes/timeouts/missing deps, and the engine treats any non-null error as an errored control. So with a deny rule, a malformed transaction now fail-closes as an evaluator error instead of behaving like a normal policy result. I think these branches should stay in the regular matched/non-matched path and reserve error for actual evaluator failures. Same issue shows up in transaction_policy.

lan17 · 2026-03-20T18:12:04Z

evaluators/contrib/financial-governance/tests/test_spend_limit.py

+
+
+@pytest.mark.asyncio
+async def test_context_override_channel_max_per_period() -> None:


I’d add a stronger test around the context-aware budget story here. Right now this only proves that the override number is read; it doesn’t prove spend is actually isolated by channel / agent_id / session_id. A case like “90 USDC in channel A, then 20 USDC in channel B with channel_max_per_period=100” would catch whether the store lookup is scoped or just summing all USDC spend. I’d also like at least one engine-level test for the invalid-input path so we don’t accidentally lock in result.error as a policy outcome.

lan17 · 2026-03-20T18:15:20Z

...b/financial-governance/src/agent_control_evaluator_financial_governance/spend_limit/store.py

+        """
+        ...
+
+    def get_spend(self, currency: str, since_timestamp: float) -> float:


I think there’s a deeper issue with the API shape here. get_spend() only lets the evaluator query by (currency, since_timestamp), but the evaluator is also recording channel / agent_id / session_id metadata and the README/tests are positioning this as context-aware budgeting. As written, channel_max_per_period is still checking against all spend in that currency, not spend for the current channel/agent/session.

I reproduced it locally with 90 USDC in channel A, then a 20 USDC transaction in channel B with channel_max_per_period=100, and the second transaction gets denied because the lookup is summing global USDC spend. Feels like the store API needs a scoped query (or an atomic check+record API) before this behavior is actually correct.

lan17 · 2026-03-20T18:22:22Z

...rs/contrib/financial-governance/src/agent_control_evaluator_financial_governance/__init__.py

+
+    {
+      "condition": {
+        "selector": {"path": "*"},


If this is meant to be a real Agent Control config example, I think selector.path: "*" is misleading here. In the engine, * passes the whole Step object into the evaluator, not the raw transaction dict, so amount / currency / recipient won’t be top-level fields.

input seems closer to what this evaluator actually expects, but then the context-aware override story probably needs to be spelled out separately since step context lives under context, not inside input.

@lan17

… policy) Implements the financial governance evaluator proposed in agentcontrol#129, following the technical guidance from @lan17: 1. Decoupled from data source — SpendStore protocol with pluggable backends (InMemorySpendStore included, PostgreSQL/Redis via custom implementation) 2. No new tables in core agent control — self-contained contrib package 3. Context-aware limits — channel/agent/session overrides via evaluate metadata 4. Python SDK compatible — standard Evaluator interface, works with both server and SDK evaluation engine Two evaluators: - financial_governance.spend_limit: Cumulative spend tracking with per-transaction caps and rolling period budgets - financial_governance.transaction_policy: Static policy enforcement (currency allowlists, recipient blocklists, amount bounds) 53 tests passing. Closes agentcontrol#129 Signed-off-by: up2itnow0822 <up2itnow0822@users.noreply.github.com> Signed-off-by: up2itnow0822 <up2itnow0822@gmail.com> Signed-off-by: up2itnow0822 <up2itnow0822@users.noreply.github.com>

up2itnow0822 · 2026-03-20T19:30:14Z

@lan17 I made some changes based on your feedback. Let me know if I'm getting closer.

I think I've addressed your concerns. Pushed to the same branch:

error field removed from malformed-input paths - now matched=False with descriptive message only
get_spend() takes an optional scope dict - budgets are isolated per channel/agent/session. Added the 90-in-A + 20-in-B test.
Both evaluators handle selector.path: "input" and "*" - Step context merges into the transaction data.

README updated with proper controls: config using selector.path: input.

Best,
Bill

lan17

Latest pass looks much better overall. The original evaluator-error and scoped-budget issues seem fixed. I just left a few follow-ups on packaging/docs/test coverage.

lan17 · 2026-03-20T21:47:23Z

evaluators/contrib/financial-governance/pyproject.toml

@@ -0,0 +1,55 @@
+[project]
+name = "agent-control-evaluator-financial-governance"


Nice to have this as a standalone package, but I do not think it is actually reachable for end users yet. As-is, I do not think pip install "agent-control-evaluators[financial-governance]" will pull this in, since agent-control-evaluators only exposes galileo and cisco extras today, and I do not see release wiring to publish this contrib package either. If the goal is for this to be installable the same way as the other optional evaluators, I think we still need the extra in evaluators/builtin/pyproject.toml plus the publish/release wiring.

lan17 · 2026-03-20T21:47:23Z

evaluators/contrib/financial-governance/tests/test_spend_limit.py

+    result = await ev.evaluate(step_data)
+    assert result.matched is False
+    # input's channel should win (not clobbered)
+    assert result.metadata and result.metadata.get("channel") is None or True  # just verify no crash


I do not think this test is asserting the intended behavior right now. ... or True makes it always pass, and on the success path result.metadata does not even carry channel, so this will not catch a regression in input-vs-context precedence. I would tighten this to assert against the store state or another observable that proves input["channel"] won over context["channel"].

lan17 · 2026-03-20T21:47:23Z

evaluators/contrib/financial-governance/README.md

+            (amount, currency, json.dumps(metadata)),
+        )
+
+    def get_spend(self, currency: str, since_timestamp: float) -> float:


The custom store example is already stale relative to the protocol above. SpendStore.get_spend() now takes scope, so anyone copying this signature will implement the wrong interface and miss the context-scoped budget behavior. I would update the example to include scope: dict[str, str] | None = None and show how the metadata filter is applied.

lan17 · 2026-03-20T21:47:23Z

evaluators/contrib/financial-governance/README.md

+1. **Decoupled from data source** — The `SpendStore` protocol means no new tables in core Agent Control. Bring your own persistence.
+2. **Context-aware limits** — Override keys in the evaluate data dict allow per-channel, per-agent, or per-session limits without multiple evaluator instances.
+3. **Python SDK compatible** — Uses the standard evaluator interface; works with both the server and the Python SDK evaluation engine.
+4. **Fail-open on errors** — Missing or malformed data returns `matched=False` with an `error` field, following Agent Control conventions.


This line does not match the implementation anymore. After the fix, malformed runtime payload returns matched=False with error=None, which is the right shape. Leaving this as written is going to send people back toward the old behavior.

lan17

One more pass on the store shape. I think it is a reasonable MVP, but I still have two concerns around enforceability and scoping semantics.

lan17 · 2026-03-20T21:53:06Z

...nancial-governance/src/agent_control_evaluator_financial_governance/spend_limit/evaluator.py

+                if not scope:
+                    scope = None
+
+            period_spend = self._store.get_spend(tx_currency, since, scope=scope)


I think the store contract is still racy for real budget enforcement. We read get_spend(...), decide, and only then record_spend(...), so two concurrent requests can both pass this check and both record, overshooting the cap. If this is only meant as a lightweight contrib example that may be fine, but if we want hard spend limits I think the store API needs an atomic check_and_record or reservation-style primitive.

lan17 · 2026-03-20T21:53:06Z

...nancial-governance/src/agent_control_evaluator_financial_governance/spend_limit/evaluator.py

+            # When channel/agent/session overrides are present, query only
+            # spend matching that context — not global spend.
+            scope: dict[str, str] | None = None
+            if any(k in data for k in ("channel", "agent_id", "session_id")):


The scoping semantics here are a little different from how I first read the docs. If a transaction carries both channel and agent_id, we scope to the exact tuple, not to an independent per-channel budget and per-agent budget. That means spend can effectively bypass a "channel budget" just by varying agent/session within the same channel. If tuple-scoped budgets are the intent, I would make that explicit.

lan17

A couple more API-design thoughts here. The current shape works for the narrow rolling-budget case, but I think there are two places where we may be locking ourselves in more than we want.

lan17 · 2026-03-20T21:56:45Z

.../financial-governance/src/agent_control_evaluator_financial_governance/spend_limit/config.py

+        }
+    """
+
+    max_per_transaction: float = Field(


I would be pretty hesitant to use float for money here. If this is meant to be a real spend-control API, I think we should either use Decimal or make amounts explicit integer minor/atomic units before more code starts depending on the current shape.

lan17 · 2026-03-20T21:56:45Z

...b/financial-governance/src/agent_control_evaluator_financial_governance/spend_limit/store.py

+        """
+        ...
+
+    def get_spend(


The config/store API feels a bit too tied to one budget shape: one rolling window defined by max_per_period + period_seconds, queried as get_spend(..., since_timestamp, ...). That works for this MVP, but if we ever want fixed daily/weekly windows or multiple concurrent limits, I think this will get awkward fast. I would consider moving toward structured limit definitions plus a range-based query (start, end) instead of baking rolling-window semantics into the store contract.

Concrete sketch of what I had in mind, roughly:

class BudgetWindow(BaseModel): kind: Literal["rolling", "fixed"] seconds: int | None = None # rolling window unit: Literal["day", "week", "month"] | None = None timezone: str | None = None # fixed calendar windows class BudgetLimit(BaseModel): amount: Decimal currency: str scope_by: tuple[Literal["channel", "agent_id", "session_id"], ...] = () window: BudgetWindow | None = None # None => per-transaction cap

Then config can just be limits: list[BudgetLimit], and the store/query side can move toward get_spend(currency, start, end, scope=...) instead of baking rolling-window semantics into since_timestamp. Not saying this exact API, just the shape.

up2itnow0822 · 2026-03-23T17:53:38Z

@lan17 Thanks for the thorough review — 13 inline comments is serious engagement, and every point is valid. Apologies for the delay getting back.

Here's where I'm at on the key themes:

1. float → Decimal for money — Agreed, no argument. Will switch to Decimal throughout. Should have done this from the start.

2. Race condition on get_spend() → record_spend() — You're right that the read-then-write pattern allows concurrent overshoot. For the contrib MVP I think documenting this as a known limitation is honest, but I'll also sketch an atomic check_and_record() interface so the store contract supports it when backends need it.

3. Scoping semantics (tuple vs independent budgets) — Your observation is sharp. Current tuple-scoped approach means varying agent_id within the same channel bypasses the channel budget. I'll make the tuple semantics explicit in docs and add the independent-dimension option you suggested.

4. Store API shape (BudgetLimit + BudgetWindow) — Your structured sketch is cleaner than what I have. I'll refactor toward limits: list[BudgetLimit] with scope_by and the rolling/fixed window model. This also solves the range-query issue naturally.

5. selector.path clarification — Will fix to input and document the Step-vs-raw-dict distinction properly.

6. Stale README + test assertions — Will clean up the or True test, update the custom store example to match the new scope parameter, and fix the malformed-input docs.

7. Packaging/installability — Fair point that this isn't reachable via pip install today. I'll check with you on the preferred path — should this land as a new extra in the existing agent-control-evaluators package, or stay as a standalone contrib that users install directly?

I'll push a revision addressing all of the above. Expect it in the next day or two.

…it model, scoping fix - float → Decimal throughout: BudgetLimit.amount, BudgetWindow, store, tests - BudgetLimit + BudgetWindow model: config refactored from flat fields to limits: list[BudgetLimit]; each limit has amount, currency, scope_by, window - Atomic check_and_record(): eliminates TOCTOU race on get_spend()+record_spend(); InMemorySpendStore implements with threading.Lock (single-process); docs note production stores need DB-level atomics (Postgres FOR UPDATE, Redis Lua) - scope_by field: independent per-dimension budget isolation; scope_by=(channel,) means channel A spend does not count against channel B budget - selector.path fix: config examples and README updated to use 'input' not '*'; Step vs raw-dict distinction documented; evaluator auto-detects format - EvaluatorResult.error usage: malformed payload returns matched=False, error=None; error field reserved for crashes/timeouts/missing deps only - README: custom store example updated with scope param and check_and_record; stale malformed-input docs corrected; Known Limitations updated - Tests: or True removed; all assertions verify actual store state; lan17 channel isolation test (90 in A + 20 in B) passes with scope_by semantics

…ertions - transaction_policy: metadata output uses str(Decimal) instead of float() for amount, min_amount, max_amount — prevents precision loss - test_spend_limit: monetary assertions use Decimal instead of pytest.approx(float) Addresses remaining float concerns from lan17's review.

lan17

Latest pass looks a lot better overall. The malformed-input handling and the basic scoped lookup story are much closer to what I had in mind.

I still do not think the spend-limit side is quite enforceable enough for merge, though, so I left three blocking comments inline:

the evaluator still bypasses the atomic store path and can overshoot under concurrency;
composite scope_by keys widen if one dimension is missing, which breaks the documented tuple semantics;
the default in-memory retention window is only 7 days, so fixed month budgets undercount by construction.

If this were framed as a lightweight contrib sketch I would be less worried, but the current docs are presenting it as real financial governance, so I think those need to be tightened before we merge.

lan17 · 2026-04-02T00:25:57Z

...nancial-governance/src/agent_control_evaluator_financial_governance/spend_limit/evaluator.py

+                win_start = _window_start(limit)
+                period_limits_to_record.append((limit, scope, win_start))
+
+                period_spend = self._store.get_spend(tx_currency, win_start, scope=scope)


I think this is still bypassing the atomic path you added to the store. We read current spend here and only record later, so two concurrent requests can still both pass and both write. I reproduced it locally with two concurrent 60 USDC requests under a 100 USDC cap: both evaluations returned allow and the ledger ended at 120. If we want hard spend limits, I think the evaluator needs to call check_and_record() here rather than get_spend() and a later record_spend().

lan17 · 2026-04-02T00:25:57Z

...nancial-governance/src/agent_control_evaluator_financial_governance/spend_limit/evaluator.py

+        if not limit.scope_by:
+            return None
+
+        scope: dict[str, str] = {}


I do not think dropping missing scope_by keys is safe here. For a limit configured as (channel, agent_id), a transaction that only carries channel turns into a broader per-channel lookup, because matches_scope() treats the shorter dict as a subset match. That means a partially populated scope can inherit spend from other agents in the same channel, which is different from the ("channel", "agent_id") pair semantics the README describes.

lan17 · 2026-04-02T00:25:57Z

...b/financial-governance/src/agent_control_evaluator_financial_governance/spend_limit/store.py

+            Defaults to 7 days (604 800 s).
+    """
+
+    def __init__(self, max_age_seconds: int = 604_800) -> None:


I think this default makes the built-in store incompatible with the fixed month window the docs advertise. Anything older than 7 days gets pruned, so by day 8 of the month the in-memory store starts undercounting monthly spend and can allow overshoot. If the in-memory store is only meant for tests/local dev that is fine, but then I would either derive retention from the longest configured window or document that month-level enforcement is not actually sound with the default store.

up2itnow0822

@lan17 - appreciate the thorough review. These are real issues, not style nits. Taking the three blockers:

1. Concurrency: get_spend() then record_spend() is racy

You are right - the read-then-write pattern is fundamentally unsound for hard budget enforcement. Two concurrent 60 USDC requests under a 100 USDC cap will both pass because neither sees the other's pending spend. I'll replace the two-step path with an atomic check_and_record() that does the budget check and ledger write in a single operation:

class SpendStore(Protocol):
    async def check_and_record(
        self,
        scope: dict[str, str],
        amount: Decimal,
        currency: str,
        limit: Decimal,
        since: datetime,
    ) -> tuple[bool, Decimal]:
        """Atomically check budget and record if within limit.
        Returns (allowed, new_total). If not allowed, nothing is recorded."""
        ...

The in-memory store uses a lock. Production stores (Redis, Postgres) use their native atomic primitives (INCRBY with conditional, SELECT FOR UPDATE). The evaluator calls this single method instead of the read/decide/write sequence.

2. Composite scope_by keys widen on missing dimensions

Agreed - a transaction with only channel matching against a (channel, agent_id) limit should not fall through to a broader per-channel lookup. The fix: if any dimension in the configured scope_by tuple is missing from the transaction metadata, the limit does not apply to that transaction at all. Strict tuple matching, not subset matching.

def matches_scope(limit_scope: tuple[str, ...], tx_meta: dict[str, str]) -> bool:
    # ALL dimensions must be present - missing any = no match
    return all(dim in tx_meta for dim in limit_scope)

This means a partially-scoped transaction gets evaluated against limits whose scope it fully satisfies, and is unconstrained by limits requiring dimensions it does not carry. If that's too permissive for some deployments, the evaluator could optionally reject transactions missing required scope dimensions entirely (fail-closed on incomplete metadata).

3. 7-day retention breaks fixed month windows

Clear bug. The retention window should be derived from the longest configured budget window, not a static default. For a monthly budget, the store needs at least 31 days of history. I'll change the default retention to max(configured_window_seconds) * 1.5 (with a floor of 7 days for rolling-only configs), and add a note in the docs that the in-memory store is intended for development/testing - production deployments with monthly windows should use a persistent store.

On float for money:

Also agreed. I'll switch to Decimal throughout the config and store layer. The evaluator already computes comparisons, so using Decimal from the config level down prevents any floating-point accumulation errors before they start.

I'll push these fixes. Should have an updated commit within 24 hours.

Addresses three blocking issues from review: 1. Race condition: evaluator now uses check_and_record() atomically instead of separate get_spend() + record_spend(). Two concurrent requests under the same budget cap can no longer both pass. The InMemorySpendStore uses a threading.Lock; production stores should use DB-level atomics (documented). 2. Scope widening: strict tuple matching - ALL dimensions in scope_by must be present in the transaction data for a limit to apply. A transaction missing any required dimension is unconstrained by that limit rather than matching a broader scope. 3. Retention window: default changed from 7 days to 31 days. The previous default caused InMemorySpendStore to undercount monthly budgets after day 8 by pruning records before the month ended. Also updated README to document strict scope semantics and retention behavior.

up2itnow0822 force-pushed the feat/financial-governance-evaluator branch from 9901522 to c55d52c Compare March 20, 2026 17:21

lan17 reviewed Mar 20, 2026

View reviewed changes

up2itnow0822 force-pushed the feat/financial-governance-evaluator branch from c55d52c to 614860b Compare March 20, 2026 18:43

up2itnow0822 force-pushed the feat/financial-governance-evaluator branch from 614860b to c36c7e2 Compare March 20, 2026 19:15

lan17 reviewed Mar 20, 2026

View reviewed changes

up2itnow0822 force-pushed the feat/financial-governance-evaluator branch from 78033d8 to ce44527 Compare March 23, 2026 23:33

lan17 requested changes Apr 2, 2026

View reviewed changes

up2itnow0822 commented Apr 2, 2026

View reviewed changes



		@pytest.mark.asyncio
		async def test_context_override_channel_max_per_period() -> None:

		@@ -0,0 +1,55 @@
		[project]
		name = "agent-control-evaluator-financial-governance"

Conversation

up2itnow0822 commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

financial_governance.spend_limit

financial_governance.transaction_policy

Design Decisions (per #129 discussion)

Tests

Related

Uh oh!

up2itnow0822 commented Mar 20, 2026

Uh oh!

codecov bot commented Mar 20, 2026

Codecov Report

Uh oh!

lan17 Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

up2itnow0822 commented Mar 20, 2026

Uh oh!

lan17 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lan17 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lan17 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

up2itnow0822 commented Mar 23, 2026

Uh oh!

lan17 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

up2itnow0822 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

up2itnow0822 commented Mar 20, 2026 •

edited

Loading

`financial_governance.spend_limit`

`financial_governance.transaction_policy`

lan17 Mar 20, 2026 •

edited

Loading