
Observer User Guide

This guide is for people who need to set up Observer, understand how the pieces fit together, and run real verification flows without reverse-engineering the repository.

It is not a design spec.

If you want platform rationale, read OBSERVER.md. If you want normative detail, read the files under specs. If you want to get a project working, stay here.

If you want the gentlest possible on-ramp first, open QUICKSTART.html. It asks two practical questions and gives you a copy-paste starter command.

If You Are New, Start Here

If Observer currently feels confusing, use this stripped-down mental model first.

Observer is not mainly asking you to "run a script".

Observer wants you to make the verification surface explicit.

That usually means:

  1. expose real targets through a provider
  2. derive inventory from those targets
  3. write a suite against that inventory
  4. run the suite and keep the report

If you only need one short sentence, use this one:

Observer turns a verification setup into an explicit contract pipeline.

If you want Observer to lay down a working starting tree for you, run one of these first:

observer new shell-proxy demo --proxy python --subject-family rust
observer new shell-proxy demo --proxy javascript --subject-family rust
observer new shell-proxy demo --proxy python --subject-family c --strategy single-test
observer new shell-proxy demo --proxy python --subject-family rust --strategy staged-product

The default scaffold is multi-test, because that is the preferred everyday pattern.

That command shape matters. It separates two concerns explicitly:

  1. the proxy/orchestration language such as Python
  2. the subject family under test such as Rust, C, or Python

Today the built-in shell proxies are:

  • python
  • javascript

Today the built-in subject families are:

  • rust
  • c
  • python
  • generic

Scaffold Model

The scaffold command is trying to model the real adoption shape directly.

It is not asking only one question.

It is asking two questions:

  1. what language is acting as the shell-orchestration proxy
  2. what subject family is actually being verified

That distinction matters because these are usually not the same thing.

In many real systems, including the GLC-style shape discussed later in this guide, the proxy language exists because it is better at orchestration, process control, JSON handling, or filesystem inspection than plain shell.

That does not mean the proxy language is the semantic thing being certified.

Examples:

  • --proxy python --subject-family rust: Python orchestrates checks against a Rust product surface
  • --proxy javascript --subject-family c: JavaScript orchestrates checks against a C product surface
  • --proxy python --subject-family python: Python is still just the proxy layer unless the target identities are actually about externally visible Python product behavior
  • --proxy javascript --subject-family generic: JavaScript orchestrates checks against a broader CLI, protocol, workflow, or artifact surface

If you want one sentence for the model, use this one:

Proxy language is how the checks are expressed. Subject family is what those checks are about.

Why This Is Better Than new python

observer new python sounds like Observer thinks the project is mainly about Python.

That is too weak for the real pattern.

The stronger model is:

  • external shell-proxy pattern
  • explicit proxy backend
  • explicit subject family

That is why the primary command is now observer new shell-proxy ....

The hidden observer new python ... alias still exists for compatibility, but it is not the teaching surface anymore.

Scaffold Command Reference

The main command is:

observer new shell-proxy <path> --proxy <python|javascript> --subject-family <rust|c|python|generic>

Useful flags:

  • --strategy single-test|multi-test|staged-product
  • --proxy python|javascript
  • --subject-family rust|c|python|generic
  • --provider <id>
  • --name <display-name>
  • --force
  • --dry-run

Defaults:

  • strategy defaults to multi-test
  • provider id defaults to the proxy name
  • subject family defaults to rust

What Each Strategy Means

single-test means:

  • one externally meaningful proof unit
  • only use it when there is truly one contract worth naming

multi-test means:

  • several granular proof units over one product surface
  • this is the normal default because it gives better inventory and better failure localization

staged-product means:

  • several verification concerns that should certify together
  • the generated tree uses a GLC-shaped teaching decomposition: unit, golden, dict

What The Proxy Changes

Changing --proxy changes the orchestration surface and the generated provider host implementation.

Today that means:

  • python: generated provider hosts and helper code are Python
  • javascript: generated provider hosts and helper code are JavaScript

The intent is not to make Python and JavaScript look like the subject under test.

The intent is to let teams choose the orchestration surface that best handles process glue, JSON, files, and small control logic.

What The Subject Family Changes

Changing --subject-family changes the teaching language around the generated tree.

Today it affects:

  • generated README guidance
  • subject-family wording in the scaffolded project
  • how the scaffold tells you to interpret the toy implementation

Current subject families mean:

  • rust: think crate, binary, or generated artifact surface
  • c: think library, binary, or generated artifact surface
  • python: think module, CLI, or generated artifact surface
  • generic: think product, protocol, workflow, or artifact surface

Today most subject families still mainly adjust wording.

The main exception is the Rust multi-test starter, which now generates a worked Cargo proof starter with native Rust tests plus an external shell-proxy Observer stage.

That is intentional. The command shape stays stable while subject-specialized starter content grows incrementally.

Scaffold Matrix

You can think of the built-in scaffolds as a matrix.

Rows are proxy backends.

Columns are strategies.

Subject family is an interpretation layer over that matrix.

proxy backend      single-test     multi-test      staged-product
python             yes             yes             yes
javascript         yes             yes             yes

And subject family is currently:

subject-family     affects generated wording and usage guidance
rust               yes
c                  yes
python             yes
generic            yes

What Gets Generated

All shell-proxy scaffolds generate the same broad artifact story:

  • provider host source
  • observer config
  • suite files
  • Makefile shortcuts
  • a vendored Observer host SDK for the chosen proxy language

single-test and multi-test generate a suite-first tree.

staged-product generates a product-first tree with:

  • unit
  • golden
  • dict
  • product.json

The toy code is intentionally replaceable.

The important part is the contract shape, target naming, and stage separation.

How To Choose Quickly

Use this shortcut.

Choose --proxy python when:

  • your team already uses Python for glue code
  • you want a small orchestration surface with direct filesystem and JSON handling

Choose --proxy javascript when:

  • your team already uses Node-based tooling
  • you want orchestration close to existing JS toolchains

Choose --subject-family rust when:

  • the real thing under test is mostly a Rust crate, binary, or generated Rust-oriented artifact surface

Choose --subject-family c when:

  • the real thing under test is mostly a C library, binary, or generated low-level artifact surface

Choose --subject-family python when:

  • the real thing under test is a Python product surface, while still keeping target names about externally visible behavior rather than about the proxy wrapper itself

Choose --subject-family generic when:

  • the real thing under test is broader than one implementation language, such as a CLI, wire protocol, package surface, or workflow artifact graph

Example Recipes

Start with a Python shell proxy over a Rust product surface:

observer new shell-proxy demo --proxy python --subject-family rust
cd demo
make report

Start with a JavaScript shell proxy over a C product surface:

observer new shell-proxy demo --proxy javascript --subject-family c
cd demo
make report

Start with the staged product teaching shape:

observer new shell-proxy demo --proxy python --subject-family rust --strategy staged-product
cd demo
make certify

Inspect the file plan without writing anything:

observer new shell-proxy demo --proxy javascript --subject-family generic --dry-run

The Three Rules To Remember

If you forget everything else, remember these three rules.

Rule 1: The target should name the real thing being verified

Good:

  • compiler/emits-canonical-json
  • cli/help-shows-subcommands
  • package/wheel-has-license

Bad:

  • python-wrapper-ran
  • run-all-checks
  • integration-script

Rule 2: Inventory is a contract, not a cache

tests.inv is not a throwaway implementation detail.

It is the explicit list of targets Observer believes exist.

If the target set changes, that is a real verification change.

Rule 3: A report is evidence, not just console exhaust

When Observer emits JSONL, that output is not noise.

It is the machine-readable proof that later commands use for:

  • cubes
  • compares
  • views
  • product certification rollup

Which Path Should I Choose?

Use this decision guide.

If you want Observer to generate the starting tree for one of these paths, use observer new shell-proxy <path> --proxy <name> --subject-family <name> --strategy <name>.

Choose the single-test strategy when

  • there is only one externally meaningful proof unit
  • the target still names a real product contract
  • splitting further would be fake granularity

Choose the multi-test strategy when

  • one system exposes several distinct behaviors
  • you want failure localization
  • you want better compare and analytics artifacts
  • you want inventory to describe the real surface cleanly

Choose the staged-product strategy when

  • several verification areas must pass together
  • one area is unit-like, another is corpus-like, another is consistency-like
  • release health is a product question, not one suite question

The Fastest Useful Reading Order

If you want to get to a working mental model quickly, read in this order:

  1. this guide through the proxy-language sections
  2. ../examples/python-proxy-pattern/single-test-strategy/README.md
  3. ../examples/python-proxy-pattern/multi-test-strategy/README.md
  4. ../examples/python-proxy-pattern/staged-product/README.md
  5. one runnable starter such as ../lib/python/starter/README.md or ../lib/rust/starter/README.md

What Observer Is

Observer is a deterministic verification platform built around explicit contracts and derived artifacts.

The core flow is:

  1. a provider exposes tests or workflow targets
  2. Observer derives canonical inventory from that provider
  3. a suite selects from that inventory and declares expectations
  4. Observer runs the suite and emits a structured report
  5. optional derived artifacts such as cubes, compares, compare indexes, and HTML views are produced from that report
  6. optional product certification combines multiple stages into one product verdict

If you remember nothing else, remember this pipeline:

provider -> inventory -> suite -> report -> cube/compare/view -> product

The rest of the tool is built around that shape.

When To Use Which Surface

Observer has a few distinct working layers.

  • Provider: how a language or workflow exposes executable targets to Observer.
  • Inventory: the canonical list of runnable targets.
  • Suite: the expectations you want enforced.
  • Report: the machine-readable record of one execution.
  • Analytics: derived artifacts such as cubes, compares, compare indexes, and HTML views.
  • Product: an ordered, multi-stage certification contract above suites.

Use them like this:

  • If you are onboarding a new language integration, start with a provider and inventory.
  • If you already have inventory, write a suite and run it.
  • If you need artifact history or build-to-build comparison, derive cubes and compares.
  • If release health depends on multiple verification areas, define a product and use certify.

Primary Pattern: Proxy-Language Verification

One usage pattern deserves to be made explicit because it is likely to be the main way many teams adopt Observer.

The language you use to author tests is often not the real subject under test.

That language is frequently just the most convenient control surface for expressing verification against something else.

Examples:

  • Python tests that verify a CLI's behavior
  • Python tests that verify generated files or package outputs
  • Python tests that verify a service contract or protocol exchange
  • Python tests that verify compiler output or workflow results

In that model:

  • Python is the authoring surface
  • Observer is the verification platform
  • the real subject is the product behavior being checked

This is the preferred pattern.

The scaffold command above is simply the productized version of this pattern.

It makes the two layers explicit at generation time instead of making the reader infer them later.

Beginner Translation Of The Pattern

If the phrase "proxy-language verification" sounds abstract, translate it into ordinary language like this:

  • Python is the pen
  • the product behavior is the thing you are writing about
  • Observer is the notebook that keeps the record straight

Or even more simply:

  • Python is how you say the test
  • the product behavior is what the test means

That is the whole idea.

You are not using Python because you want to prove "Python worked".

You are using Python because it is a convenient way to express checks against something else.

Anti-Pattern: Wrapper Script As The Test Subject

The wrong shape looks like this:

  1. write one Python script that orchestrates a lot of work
  2. register that script as one Observer test target
  3. treat exit 0 from that script as the proof that verification succeeded

Why this is weak:

  • Observer only sees the wrapper, not the underlying verification units
  • failure localization is poor
  • target identity becomes vague or meaningless
  • inventory becomes coarse and unhelpful
  • analytics and compare artifacts lose useful granularity
  • product certification ends up composed from blobs instead of explicit proofs

You still get execution, but you do not get a strong verification model.

Side-By-Side: Wrong Versus Right

Wrong:

Observer target -> run_release_checks.py -> many hidden checks -> one exit code

Right:

Observer target -> cli/help-shows-subcommands
Observer target -> package/wheel-has-license
Observer target -> compiler/rejects-bad-input

In the wrong shape, the real verification surface is hidden inside one program.

In the right shape, Observer can see the actual proof units.

Preferred Pattern: Granular Proxy Tests

The better shape is:

  1. use a host language such as Python to author granular tests
  2. let each test correspond to one real behavior of the underlying system
  3. expose those tests through the Observer provider boundary
  4. derive inventory from those granular targets
  5. run suites and product stages against those explicit targets

That gives Observer meaningful units such as:

  • cli/help-shows-subcommands
  • compiler/emits-canonical-json
  • package/wheel-contains-license
  • api/rejects-missing-token

Those are much better verification targets than something like:

  • python-wrapper-ran

The practical test is simple:

Ask this question for every target name:

If this target fails, will the name tell me what product contract regressed?

If the answer is no, the target is probably too coarse.
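That question can be turned into a rough automated heuristic. The sketch below is illustrative only and not part of Observer: the list of vague words and the area/behavior naming convention are assumptions of this example, and real judgment still belongs to a reviewer.

```python
# Heuristic check for coarse target names. Illustrative only: the vague-word
# list and the area/behavior naming shape are assumptions of this sketch,
# not rules enforced by Observer.
VAGUE_WORDS = {"wrapper", "smoke", "integration", "run-all", "script", "all"}

def looks_coarse(target: str) -> bool:
    """Return True when a target name probably will not explain a failure."""
    # Prefer the area/behavior shape, e.g. "compiler/emits-canonical-json".
    if "/" not in target:
        return True
    area, _, behavior = target.partition("/")
    words = set(behavior.replace("_", "-").split("-"))
    return not behavior or bool(words & VAGUE_WORDS)

for name in ["compiler/emits-canonical-json", "python-wrapper-ran", "cli/run-all"]:
    print(name, "coarse" if looks_coarse(name) else "ok")
```

A check like this fits naturally in a pre-review script, but it only catches the obvious offenders.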

How This Maps To GLC-Shaped Work

The GLC-shaped lesson is not "pick Python".

The lesson is:

  • keep the orchestration layer separate from the semantic subject
  • decompose the verification surface into maintained proof units and proof stages
  • certify the product from explicit stage contracts rather than from one orchestration blob

That is exactly why observer new shell-proxy is modeled the way it is.

The command is trying to preserve that structure at the moment the tree is created.

What To Optimize For

When you use a proxy language, optimize for these properties:

  • target identity should name the real thing being verified
  • each target should represent one meaningful proof unit
  • assertions should be about product behavior, not wrapper-script survival
  • provider output should expose a useful target set, not one orchestration blob
  • reports and compares should tell you what changed in the product surface

Naming Targets So Humans Can Understand Them

Good target names are one of the biggest usability wins in Observer.

Use names that answer this question:

What real thing did we just verify?

Prefer names like:

  • compiler/emits-canonical-json
  • compiler/rejects-malformed-input
  • package/writes-license-metadata
  • api/rejects-missing-token
  • cli/version-reports-build-stamp

Avoid names like:

  • wrapper
  • smoke
  • integration
  • python-script
  • run-all

Those names tell you almost nothing once a report, compare, or product stage fails.

Good Example

Good:

  • Python test: package/metadata-has-license
  • Python test: package/wheel-imports-cleanly
  • Python test: cli/version-reports-build-stamp

Each test uses Python as a scripting medium, but each target refers to a real product contract.

Bad Example

Bad:

  • Python script: run_release_checks.py
  • one Observer target runs it
  • exit code is treated as the only meaningful signal

That shape hides the real verification surface inside the wrapper.

Why This Pattern Matters To Observer

Observer becomes much more valuable when it sees the real verification topology.

Granular proxy-language tests improve:

  • inventory quality
  • failure localization
  • report usefulness
  • analytics fidelity
  • compare clarity
  • product certification composition

This is especially important for Python, shell, and other scripting-friendly integrations. Those languages should usually be treated as verification media, not as the semantic subject of the test, unless the scripting language itself is what you are actually trying to verify.

For copy-pasteable examples of this pattern, see ../examples/python-proxy-pattern.

The Easiest Practical Starting Point

If you want the least confusing way to begin, do this:

  1. start with ../examples/python-proxy-pattern/multi-test-strategy/README.md
  2. run make list
  3. run make inventory
  4. inspect tests.inv
  5. run make report
  6. read .observer/report.default.jsonl

That path is short, granular, and close to the real usage model this guide is recommending.
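When you reach step 6, it helps to remember the report is JSONL: one JSON object per line. The exact report schema is defined by the specs, not by this guide, so the field names "target" and "status" in the sketch below are assumptions; inspect your own .observer/report.default.jsonl and adapt them.

```python
import json

def summarize_report(lines):
    """Tally results per status from JSONL report lines.

    Assumption: each line is one JSON object carrying "target" and
    "status" fields. The real Observer report schema may differ; check
    the actual file and adjust the field names.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        status = record.get("status", "unknown")
        counts[status] = counts.get(status, 0) + 1
    return counts

sample = [
    '{"target": "ledger/accepts-deposit", "status": "pass"}',
    '{"target": "ledger/rejects-overdraft", "status": "pass"}',
    '{"target": "ledger/reports-balance", "status": "fail"}',
]
print(summarize_report(sample))  # {'pass': 2, 'fail': 1}
```

The point of reading the raw file once is to internalize that each line is an independent machine-readable record, which is exactly what cube and compare consume later.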

Prerequisites

Observer itself is a Rust workspace. The main binary is published as frogfish-observer, but it installs an executable named observer.

Typical prerequisites are:

  • Rust and Cargo for the Observer CLI itself
  • whatever toolchain your provider host needs
  • a POSIX shell for the repo-owned starter Makefiles

Install the CLI from crates.io:

cargo install frogfish-observer

Or run it from the repository root during development:

cargo run -q -p frogfish-observer -- --help

The Minimal Working Set

For a basic project, you usually need four files:

observer.toml
tests.inv
tests.obs
.observer/

What they mean:

  • observer.toml: provider configuration
  • tests.inv: canonical inventory derived from a provider
  • tests.obs: suite expectations
  • .observer/: generated reports, hashes, and local derived artifacts

Only observer.toml and tests.obs are normally authored by hand.

tests.inv is usually generated.

.observer/ should generally be treated as working output, not as hand-edited source.

Recommended Repository Layout

This is a good default shape for one provider-backed verification area:

your-project/
  observer.toml
  tests.obs
  tests.inv
  .observer/
  build/
  src/

If your project has several verification areas, keep each one local to the thing it verifies:

your-project/
  unit/
    observer.toml
    tests.inv
    tests.obs
    .observer/
  workflow/
    tests.obs
    .observer/
  product.json

That second shape is what product certification is for: each stage stays local, and product.json ties them together.

First Real Setup: Rust Starter

The quickest way to understand Observer is to run a starter that already works.

Use the runnable Rust starter in this repository:

lib/rust/starter/

Its important files are:

lib/rust/starter/
  Cargo.toml
  Makefile
  observer.toml
  tests.inv
  tests.obs
  src/
  expected.default.jsonl
  expected.inventory.sha256
  expected.suite.sha256

What each file does:

  • Cargo.toml: builds the Rust provider host
  • Makefile: wraps the common Observer flows
  • observer.toml: tells Observer how to invoke the provider host
  • tests.inv: canonical inventory for the provider targets
  • tests.obs: the expectations to enforce
  • expected.*: checked-in verification artifacts used by the starter's make verify

Step 1: Build the provider host

cd lib/rust/starter
make build

This builds the provider binary that Observer will call for list and run operations.
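The provider protocol itself is defined by the specs, not by this guide, so the sketch below is only the shape to keep in mind: a host that answers a list request with machine-readable target ids and answers run requests per target. Every concrete detail here (the argv shape, the JSON field names) is hypothetical.

```python
import json
import sys

# Toy provider host sketch. Hypothetical: the real Observer provider
# protocol (argv shape, JSON field names) is defined by the specs, not
# here. The point is only the separation of "list" from "run".
TARGETS = {
    "ledger/accepts-deposit": lambda: 0,
    "ledger/rejects-overdraft": lambda: 0,
}

def handle(argv):
    if argv[:1] == ["list"]:
        return json.dumps({"targets": sorted(TARGETS)})
    if argv[:1] == ["run"] and len(argv) == 2 and argv[1] in TARGETS:
        return json.dumps({"target": argv[1], "exit": TARGETS[argv[1]]()})
    return json.dumps({"error": "unknown request"})

if __name__ == "__main__":
    print(handle(sys.argv[1:]))
```

Whatever the real wire format is, the host's job stays the same: expose granular targets on list, execute exactly one named target on run.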

Step 2: Inspect raw provider discovery

make list

This writes the raw provider output to .observer/provider-list.json.

Use this when you need to answer the question: "is the provider itself exposing the targets I think it is?"

Step 3: Derive canonical inventory

make inventory
cat tests.inv

This is the first major Observer contract.

Inventory is the explicit execution surface. Once inventory exists, suite execution no longer depends on fuzzy runtime discovery.

Step 4: Inspect provider configuration

The starter's observer.toml looks like this:

version = "0"

[providers.rust]
command = "./build/target/debug/ledger-observer-host"
cwd = "."
inherit_env = false

Important fields:

  • command: the provider host executable
  • cwd: working directory for that provider
  • inherit_env = false: makes the host less dependent on ambient machine state

If the provider cannot be found or behaves differently from machine to machine, check this file first.

Step 5: Run the suite with human output

make run

This runs:

observer run --inventory tests.inv --suite tests.obs --config observer.toml --surface simple --ui rich --report none --color never --show-output all

Use this mode when you are working interactively and want readable operator feedback.

Step 6: Emit a machine-readable report

make report

This writes:

.observer/report.default.jsonl

That JSONL report is what later commands consume.

Step 7: Verify hashes and golden report artifacts

make verify

This checks:

  • inventory hash
  • suite hash
  • report JSONL

The point is not only that the run passes. The point is that the contracts and generated evidence are stable.

Python As A Proxy Language

Python is a particularly important example of the pattern above because many teams reach for it first.

The preferred Python model is not:

  • write one Python wrapper script
  • let Observer run it
  • call the job done if the script exits zero

The preferred Python model is:

  • write Python tests with the Observer Python integration
  • use those tests to verify real product behaviors at granular scope
  • let Python act as the scripting proxy for the thing you actually care about

If you want a one-line rule for teams:

Do not ask Observer to verify that a Python wrapper script ran. Ask Observer to verify the real behaviors that the Python tests are checking.

That means a Python-based provider is often best when the real subject is:

  • a binary interface
  • a package or install surface
  • a network interaction
  • a generated artifact
  • a workflow with precise observable checkpoints

Python is just the control language. The product contract is still the center.
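To make that concrete, here is what one granular proxy check looks like in plain Python. The Observer Python integration API is not shown in this guide, so this sketch uses nothing but the standard library; the interpreter's own --version flag stands in for your product binary so the example runs anywhere.

```python
import subprocess
import sys

def cli_version_reports_interpreter() -> bool:
    """One granular proxy check: Python is only the pen here.

    The subject is external CLI behavior. The interpreter's --version
    flag is a stand-in product binary so this sketch is runnable
    anywhere; in a real setup this would invoke your product CLI, and
    the check would map to one named Observer target such as
    cli/version-reports-build-stamp.
    """
    result = subprocess.run(
        [sys.executable, "--version"], capture_output=True, text=True
    )
    banner = (result.stdout + result.stderr).strip()
    return result.returncode == 0 and banner.startswith("Python")

print(cli_version_reports_interpreter())
```

Note what the assertion is about: an externally visible banner from a separate process, not the survival of the script itself.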

Python Team Checklist

If you are reviewing a Python-based Observer setup, check these five things.

  1. Are target names about product behavior rather than about the wrapper script?
  2. Does the provider expose several meaningful targets instead of one orchestration target?
  3. Does tests.inv look like a useful public execution contract?
  4. Would a failed report line tell an operator what actually regressed?
  5. Could the same targets participate cleanly in a later product certification stage?

If most answers are no, the setup is probably still in the wrapper-script anti-pattern.

Python Pattern In One Sentence Each

Single-test strategy:

  • use when there is one real proof unit and its name is still meaningful

Multi-test strategy:

  • use when one product surface can be decomposed into several real proofs

Staged product strategy:

  • use when several verification areas must pass together as one product verdict

See the concrete examples here:

For the Python integration and runnable examples, see ../lib/python/README.md and ../lib/python/HOWTO.md.

Understanding The Main Files

Before going file by file, here is the plain-English version.

  • observer.toml tells Observer how to find the provider
  • tests.inv tells Observer what targets exist
  • tests.obs tells Observer what should be true about those targets
  • .observer/ stores the evidence generated by the run

That is the basic working set.

observer.toml

This defines providers.

Use it to answer:

  • which providers exist
  • how to invoke them
  • which working directory they run in
  • whether environment inheritance is allowed

tests.inv

This is canonical inventory. It is the explicit list of runnable targets.

In normal workflows, generate it with:

observer derive-inventory --config observer.toml --provider rust > tests.inv

If inventory changes unexpectedly, treat that as a meaningful contract change rather than just build noise.
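The starter enforces this by pinning expected.inventory.sha256. The sketch below shows the underlying idea in its plainest form: pin a digest, fail loudly on drift. Note the assumption: observer hash-inventory may canonicalize the inventory before hashing, so its digest can differ from a raw file hash like this one.

```python
import hashlib
import tempfile
from pathlib import Path

def inventory_drifted(inventory: Path, pinned: Path) -> bool:
    """Compare the raw sha256 of an inventory file against a pinned digest.

    Sketch only: observer hash-inventory may canonicalize the inventory
    before hashing, so its digest can differ from a plain file hash.
    The shape of the check (pin a digest, fail loudly on drift) is the
    point, not the exact hashing rule.
    """
    current = hashlib.sha256(inventory.read_bytes()).hexdigest()
    expected = pinned.read_text().split()[0]
    return current != expected

# Demonstrate with temporary stand-in files.
with tempfile.TemporaryDirectory() as d:
    inv = Path(d) / "tests.inv"
    pin = Path(d) / "expected.inventory.sha256"
    inv.write_text("ledger/accepts-deposit\n")
    pin.write_text(hashlib.sha256(inv.read_bytes()).hexdigest() + "\n")
    print(inventory_drifted(inv, pin))  # False: inventory matches the pin
```

In practice you would pin the output of observer hash-inventory rather than a raw file hash, but the failure mode you are guarding against is identical.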

tests.obs

This is the suite.

The Rust starter uses the simple suite surface:

test prefix: "ledger/" timeoutMs: 1000: expect exit = 0.

test "ledger/rejects-overdraft" timeoutMs: 1000: [
	expect exit = 0.
	expect out contains "denied overdraft".
].

Use the simple surface when you mainly need expectation-based test verification.

Use the full surface when you need richer workflow logic, branching, extraction, publication, or more complex verification flows.

.observer/

This is where local generated artifacts usually go.

Common contents include:

  • report JSONL
  • current hashes
  • provider discovery output
  • generated HTML views
  • derived analytics

Keep it local to the verification area you are working in.

The Commands You Will Actually Use

If you feel lost, treat these as the normal six:

  1. derive-inventory
  2. hash-inventory
  3. hash-suite
  4. run
  5. cube
  6. view

Everything else is either deeper validation, product-level composition, or operator convenience.

derive-inventory

Use when you need to convert provider output into canonical inventory.

observer derive-inventory --config observer.toml --provider rust > tests.inv

hash-inventory

Use when inventory should be treated as a stable contract.

observer hash-inventory --inventory tests.inv

hash-suite

Use when the suite itself is part of the contract you want to pin.

observer hash-suite --suite tests.obs --surface simple

run

Use for normal suite execution.

Interactive operator mode:

observer run --inventory tests.inv --suite tests.obs --config observer.toml --surface simple --ui rich --report none

Machine-readable mode:

observer run --inventory tests.inv --suite tests.obs --config observer.toml --surface simple --ui off --report jsonl > .observer/report.default.jsonl

doctor

Use before a run when you suspect setup problems.

Examples:

observer doctor --inventory tests.inv --suite tests.obs --surface simple
observer doctor --config observer.toml --provider rust

doctor is the command to reach for when you are not sure whether the problem is in the provider, config, inventory, or suite wiring.

cube

Use when one report should become a derived build artifact.

observer cube --report .observer/report.default.jsonl --out .observer/build.cube.json

compare

Use when you want one build compared against another.

observer compare --cube build-a.cube.json --cube build-b.cube.json --out compare.json

view

Use when you need a self-contained HTML artifact for local inspection or sharing.

observer view --cube .observer/build.cube.json --out .observer/build.html

Product Certification

Use product certification when one product is only healthy if several verification stages pass together.

You can author product inputs either as canonical JSON or as TOML that lowers mechanically into the same canonical product model.

The example product file in this repository looks like this:

{
  "k": "observer_product",
  "v": "0",
  "product_id": "demo",
  "product_label": "Demo Product",
  "certification_rule": "all_pass",
  "stages": [
    {
      "stage_id": "unit",
      "runner": {
        "k": "observer_suite",
        "cwd": "unit",
        "suite": "tests.obs",
        "inventory": "tests.inv",
        "surface": "simple",
        "mode": "default"
      }
    },
    {
      "stage_id": "workflow",
      "runner": {
        "k": "observer_suite",
        "cwd": "workflow",
        "suite": "tests.obs",
        "surface": "full",
        "mode": "default"
      }
    }
  ]
}

The important design rule is that each stage runs from its own declared working directory.

That lets a product pull together verification areas that would otherwise remain scattered shell glue.
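Before handing a product file to certify, a cheap smoke check of its shape can save a confusing run. The sketch below checks only the fields visible in the example above; the full normative schema lives in the specs, so treat everything beyond those field names as an assumption of this example.

```python
import json

def check_product(doc: dict) -> list:
    """Collect obvious shape problems in a canonical product document.

    Field names come from the example product file in this guide; the
    full normative schema lives in the specs, so treat this as a smoke
    check, not validation. Other certification rules may exist beyond
    all_pass.
    """
    problems = []
    if doc.get("k") != "observer_product":
        problems.append("k must be observer_product")
    if doc.get("certification_rule") != "all_pass":
        problems.append("unrecognized certification_rule")
    stages = doc.get("stages", [])
    if not stages:
        problems.append("no stages declared")
    seen = set()
    for stage in stages:
        sid = stage.get("stage_id")
        if not sid or sid in seen:
            problems.append(f"bad or duplicate stage_id: {sid!r}")
        seen.add(sid)
        if "runner" not in stage:
            problems.append(f"stage {sid!r} has no runner")
    return problems

good = {
    "k": "observer_product",
    "certification_rule": "all_pass",
    "stages": [{"stage_id": "unit", "runner": {"k": "observer_suite"}}],
}
print(check_product(good))  # []
```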

Product stages can also import a child product as one explicit proof stage through the observer_product runner. In TOML authoring, that shape is naturally expressed with a [subproduct.<id>] stanza.

Run product certification with:

observer certify --product product.json --ui off --report jsonl > .observer/product.default.jsonl

Then derive analytics from the product report:

observer cube-product --report .observer/product.default.jsonl --root . --out .observer/analytics-product

CMake Model Certification

Observer also has a first product slice for CMake-constructed products.

Use it when CMake already defines construction truth and Observer should certify that surface.

Current shape:

  1. configure and build with CMake so the File API reply exists
  2. lower the CMake model
  3. hash the lowered model if needed
  4. certify the product stage or derive analytics from the resulting report

Core commands:

observer lower-cmake-model --build out/build/debug --out .observer/cmake-model.json
observer hash-cmake-model --model .observer/cmake-model.json

The repo-owned example is under:

tests/cmake-model/observer/

If you are trying to understand the current CMake slice, start there instead of from the spec.

Recipes

If you want the easiest beginner path, start with the first two recipes only.

They cover most first-time adoption problems.

Recipe: start from zero with a provider-backed project

  1. create observer.toml with one provider definition
  2. make sure the provider host can answer list and run
  3. derive tests.inv
  4. write tests.obs
  5. run observer doctor
  6. run the suite interactively
  7. emit a JSONL report
  8. add hash checks or golden report checks if the flow is meant to stay stable

Short version:

provider -> inventory -> suite -> report

Recipe: figure out why nothing is running

  1. run the provider host directly
  2. run observer derive-inventory
  3. inspect tests.inv
  4. run observer doctor
  5. verify the suite actually selects targets present in inventory

In most cases, the issue is one of:

  • provider host path is wrong
  • provider host cwd is wrong
  • provider emits a different target than the suite expects
  • inventory was not regenerated after a provider change

This is the first debugging loop to memorize.
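Step 5 of that loop is a plain set difference, and it is worth scripting. The sketch below assumes tests.inv lists one target id per non-empty line; the real inventory format may carry more structure, so adjust the parsing after looking at your own file.

```python
def missing_targets(suite_targets, inventory_text: str) -> set:
    """Targets the suite names that inventory does not contain.

    Assumption: this treats inventory as one target id per non-empty
    line. The real tests.inv format may carry more structure; the
    set-difference debugging step is the point, not the parser.
    """
    inventory = {
        line.strip() for line in inventory_text.splitlines() if line.strip()
    }
    return set(suite_targets) - inventory

inv = "ledger/accepts-deposit\nledger/rejects-overdraft\n"
suite = ["ledger/rejects-overdraft", "ledger/reports-balance"]
print(missing_targets(suite, inv))  # {'ledger/reports-balance'}
```

If this set is non-empty, the suite will select nothing for those names no matter how correct the provider is.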

Recipe: produce a shareable artifact from one run

  1. emit report JSONL
  2. derive a cube
  3. render an HTML view

Commands:

observer run --inventory tests.inv --suite tests.obs --config observer.toml --surface simple --ui off --report jsonl > .observer/report.default.jsonl
observer cube --report .observer/report.default.jsonl --out .observer/build.cube.json
observer view --cube .observer/build.cube.json --out .observer/build.html

Recipe: compare two builds

  1. derive one cube per build
  2. compare those cubes
  3. render the compare HTML

Commands:

observer compare --cube build-a.cube.json --cube build-b.cube.json --out compare.json
observer view --compare compare.json --out compare.html

How To Think About Output Modes

Observer intentionally separates human output from machine output.

  • Human UI goes to stderr when practical.
  • Machine-readable artifacts go to stdout or explicit files.

That means a command like this is normal:

observer run --inventory tests.inv --suite tests.obs --report jsonl > report.jsonl

You still see human progress, but stdout remains clean enough to capture the structured report.

If you want only the machine artifact, use --ui off.
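The convention is easy to verify for yourself with a toy emitter. The snippet below is not Observer; it is a stand-in program that follows the same split, so you can see that redirecting stdout captures only the structured record while progress still reaches the terminal via stderr.

```python
import json
import subprocess
import sys

# A toy emitter that follows the same convention Observer uses: human
# progress on stderr, machine-readable JSONL on stdout. Redirecting
# stdout to a file therefore captures only the structured record.
EMITTER = r"""
import json, sys
print("running 1 target...", file=sys.stderr)                # human progress
print(json.dumps({"target": "demo/ok", "status": "pass"}))   # machine record
"""

result = subprocess.run(
    [sys.executable, "-c", EMITTER], capture_output=True, text=True
)
print("stdout:", result.stdout.strip())
print("stderr:", result.stderr.strip())
```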

Troubleshooting

If you are stuck, work from left to right through the pipeline:

  1. provider
  2. inventory
  3. suite
  4. report
  5. derived artifacts

Do not jump straight to the product layer if the provider and inventory are not already trustworthy.

observer says it cannot find a provider

Check:

  • observer.toml path to the provider binary
  • cwd for that provider
  • whether the binary was built at all

The suite does not match any targets

Check:

  • whether tests.inv was regenerated after provider changes
  • whether suite target names actually match inventory entries
  • whether the suite surface is correct (simple versus full)

The run works locally but not in CI

Check:

  • whether provider config relies on inherited environment
  • whether generated paths differ by machine
  • whether you are treating derived artifacts as canonical when they are actually volatile

I am not sure whether the problem is config, provider, inventory, or suite

Run:

observer doctor --config observer.toml --inventory tests.inv --suite tests.obs --surface simple

Then work backwards from the first concrete finding.

Where To Go Next

If you are teaching a team, this is the simplest recommendation to give them:

  1. read the top of this guide through the Python sections
  2. copy ../examples/python-proxy-pattern/multi-test-strategy
  3. rename the targets so they describe your real product behaviors
  4. only add product stages after the target surface is already clean

That sequence avoids the most common failure mode: building a large wrapper-script blob and only later discovering that Observer cannot see the verification surface clearly.

The important habit is to treat Observer as a contract pipeline, not as a magical test launcher. Once you do that, the folder structure and command flow become much easier to reason about.