## Problem statement
When developing extension models, AI agents frequently believe the model is correct after unit tests pass, but subtle corner-case bugs survive into pushes and extension publishes. These bugs are typically discovered only during manual smoke testing against live APIs — after the code has already been committed or published. Examples from real development sessions:
- Content-Type mismatches: the v2 API required `application/vnd.api+json` but the model used `application/json` — unit tests with stubbed fetch didn't catch this
- Stale bundle caching: source fixes weren't reflected at runtime because `.swamp/bundles/*.js` wasn't cleared — agents didn't know about this caching layer
- API validation quirks: Honeycomb boards require `type: "flexible"` in the body, not just a `name` — only discovered during live create
- `delete_protected` defaults: Honeycomb creates environments with `delete_protected: true`, making delete fail unless update is called first
- Read-only resource guards: attempting create/update/delete on read-only resources like `dataset-definitions` or `auth` should be rejected before making API calls
These are the kinds of issues that unit tests with mocked responses can't catch, but a structured smoke-test protocol would.
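To make the first failure mode concrete, here is a minimal sketch of why a stubbed-fetch unit test passes while the live API rejects the same request. The URL, function names, and the synchronous fetch stand-ins are all hypothetical; only the mismatched media types come from the examples above.

```typescript
// Model code under test: builds the request with the wrong media type.
function createBoardRequest(name: string) {
  return {
    method: "POST",
    // Bug: the v2 API wants application/vnd.api+json.
    headers: { "Content-Type": "application/json" } as Record<string, string>,
    body: JSON.stringify({ name }),
  };
}

// A typical unit-test stub returns success without inspecting headers at all.
function stubFetch(_url: string, _init: ReturnType<typeof createBoardRequest>) {
  return { ok: true, status: 201 };
}

// A header-aware stand-in for the live API rejects the wrong media type.
function liveFetch(_url: string, init: ReturnType<typeof createBoardRequest>) {
  return init.headers["Content-Type"] === "application/vnd.api+json"
    ? { ok: true, status: 201 }
    : { ok: false, status: 415 };
}

const req = createBoardRequest("smoke-test-board");
const unitResult = stubFetch("https://api.example.com/boards", req);
const liveResult = liveFetch("https://api.example.com/boards", req);
console.log(unitResult.status, liveResult.status); // 201 415
```

The stub never looks at `init.headers`, so no amount of assertion on its return value can surface the mismatch — only a real (or header-validating) endpoint can.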
## Proposed solution
A `swamp-smoke-test` skill that agents can invoke (or that hooks trigger automatically) before `git push`, `swamp extension push`, or similar publish actions. The skill would:
- Discover the extension's method surface: Parse the model to enumerate all methods × resource types × argument combinations
- Generate a smoke-test plan: For each method, identify:
  - Safe read-only operations (GET/list) that can run against live APIs without side effects
  - CRUD cycle candidates: resources that can be safely created, updated, and deleted (with unique test names to avoid collisions)
  - Error-path tests: missing required args, read-only resource rejection, invalid auth
  - Corner cases specific to the API: required fields beyond `name`, default flags that block deletion, etc.
- Execute the plan: Run each test via `swamp model method run`, verify success/failure matches expectations
- Report results: Produce a structured summary table (method × resource × result) suitable for PR descriptions
- Clean up: Ensure all created test resources are deleted, even if intermediate steps fail
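The create → update → delete step with guaranteed cleanup could be sketched as follows. The `MethodRunner` interface, the method names, and the result shape are placeholders for illustration, not the real swamp API; the unique-name scheme and the cleanup-in-`finally` guarantee are the points being demonstrated.

```typescript
// Hypothetical adapter around `swamp model method run`.
interface MethodRunner {
  run(method: string, args: Record<string, unknown>): { ok: boolean; status: number };
}

interface SmokeResult {
  method: string;
  resource: string;
  result: "pass" | "fail";
}

function crudCycle(runner: MethodRunner, resource: string): SmokeResult[] {
  // Unique name so the test never collides with pre-existing resources.
  const name = `smoke-test-${resource}-${Date.now()}`;
  const results: SmokeResult[] = [];
  const record = (method: string, ok: boolean) =>
    results.push({ method, resource, result: ok ? "pass" : "fail" });

  const created = runner.run("create", { resource, name });
  record("create", created.ok);
  try {
    if (created.ok) {
      record("update", runner.run("update", { resource, name, description: "smoke" }).ok);
    }
  } finally {
    // Cleanup runs even if an intermediate step throws or fails.
    if (created.ok) {
      record("delete", runner.run("delete", { resource, name }).ok);
    }
  }
  return results;
}

// Demo against a trivially-succeeding fake runner.
const demo = crudCycle({ run: () => ({ ok: true, status: 200 }) }, "board");
console.log(demo);
```

The returned `SmokeResult[]` is also the natural input for the method × resource × result summary table.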
## Key design considerations
- The skill should be API-aware but not API-specific — it reads the model's method schemas and resource registry to generate tests, rather than hard-coding per-service knowledge
- It should never touch pre-existing resources — all created resources use unique names (e.g. `smoke-test-{resource}-{timestamp}`)
- It should handle permission errors gracefully — a 401 on `slos` because the key lacks permission is not a test failure; it's an expected constraint
- Bundle cache clearing (`.swamp/bundles/`) should be part of the pre-test setup
- The skill could optionally integrate with git hooks to block pushes when smoke tests fail
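The graceful-permission-handling rule above amounts to a three-way classification of each test outcome. A minimal sketch (function name and status-code mapping are assumptions, not a spec):

```typescript
type Outcome = "pass" | "expected-constraint" | "fail";

// expectedOk is true for happy-path tests, false for error-path tests
// (missing required args, writes to read-only resources, invalid auth).
function classify(status: number, expectedOk: boolean): Outcome {
  // The key lacking permission is a constraint of the environment,
  // not a bug in the model — report it separately, never as a failure.
  if (status === 401 || status === 403) return "expected-constraint";
  const ok = status >= 200 && status < 300;
  return ok === expectedOk ? "pass" : "fail";
}

console.log(classify(200, true));  // "pass"
console.log(classify(401, true));  // "expected-constraint" — key lacks permission
console.log(classify(200, false)); // "fail" — e.g. a read-only resource accepted a write
```

Distinguishing `expected-constraint` from `fail` keeps the summary table honest: a scoped API key shouldn't turn a green run red, but a read-only guard that silently succeeds should.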
## Alternatives considered
- Manual smoke testing: Current approach — works but is tedious, error-prone, and depends on the agent remembering to do it
- Enhanced unit tests: Better mocks could catch some issues, but can't catch Content-Type mismatches, bundle caching, or API validation quirks that only surface with real HTTP calls
- CI-based integration tests: Would require live API credentials in CI, which adds secret management complexity
## Additional context
This was motivated by developing the `@bixu/honeycomb` extension, where multiple bugs survived unit tests and were only caught during manual smoke testing sessions. The pattern of "agent thinks it's done → smoke test reveals bugs → fix → re-test" repeated across several sessions. A skill that codifies this testing protocol would catch these issues earlier and more consistently.