Desktest is a CLI tool for automated end-to-end testing of Linux desktop applications using LLM-powered computer use agents. It spins up a disposable Docker container with a virtual desktop, deploys your app, and runs an agent that interacts with it like a real user: clicking, typing, and reading the screen. Deterministic programmatic checks then validate correctness.
Warning: Desktest is beta software under active development. APIs, task schema, and CLI flags may change between releases.
Copy-paste the following prompt into Claude Code (or any coding agent) to install desktest and set up the agent skill:
Install the desktest CLI by running
curl -fsSL https://raw.githubusercontent.com/Edison-Watch/desktest/master/install.sh | sh. Then copy `skills/desktest-skill.md` from the desktest repo (https://raw.githubusercontent.com/Edison-Watch/desktest/master/skills/desktest-skill.md) to `~/.claude/skills/desktest/SKILL.md` so you have context on how to use it.
Testing & Execution
- Structured JSON task definitions with schema validation
- OSWorld-style agent loop: observe (screenshot + accessibility tree) → think → act (PyAutoGUI) → repeat
- Programmatic evaluation: file comparison, command output checks, file existence, exit codes
- Three validation modes: LLM-only, programmatic-only, or hybrid (both must pass)
- Test suites: run a directory of tests with aggregated results
Observability & Debugging
- Live monitoring dashboard: real-time web UI to watch agent actions as they happen
- Video recording: ffmpeg captures every test session
- Trajectory logging: step-by-step JSONL logs with screenshots and accessibility trees
- Interactive mode: step through agent actions one at a time for debugging
Extensibility
- Custom Docker images: bring your own image for apps with complex dependencies
- Attach mode: connect to an already-running container for integration with external orchestration
- CI integration: run tests in GitHub Actions, Cirrus CI, EC2 Mac, and other CI environments
- Remote monitoring: access the dashboard and VNC from another machine via SSH or direct network access
- QA mode (`--qa`): agent reports application bugs it encounters as structured markdown reports
- Slack notifications: send QA bug reports to Slack channels via Incoming Webhooks
Build deterministic regression tests by watching an LLM agent explore your app, then converting the trajectory into a replayable script:
1. EXPLORE → desktest run task.json --monitor # LLM agent explores your app (watch live!)
2. REVIEW → desktest review desktest_artifacts/ # Inspect trajectory in web viewer
3. CODIFY → desktest codify trajectory.jsonl --overwrite task.json # Generate script + update task JSON
4. REPLAY → desktest run task.json --replay # Deterministic replay (no LLM, no API costs)
5. CI → Run codified tests on every commit
Step 4 detail: `--replay` switches to fully deterministic execution: the codified PyAutoGUI script drives the app directly with zero LLM calls and zero API costs. The same evaluator metrics validate the result. Without `--replay`, the LLM agent runs as normal (useful for re-recording).
Use desktest as the eyes for your coding agent. You watch the test live, then hand off investigation to your coding agent (e.g. Claude Code) via the CLI-friendly logs command:
1. RUN → desktest run task.json --monitor # Watch the agent live in the browser
2. DIAGNOSE → desktest logs desktest_artifacts/ # Hand off to your coding agent for analysis
   desktest logs desktest_artifacts/ --steps 3-7 # Or drill into specific steps
3. FIX → Coding agent reads the logs, diagnoses the issue, and fixes the code
4. RERUN → desktest run task.json --monitor # Verify the fix
--monitor is designed for human eyes (real-time web dashboard), while logs is designed for agent consumption (structured terminal output). Together they close the loop between observing a failure and fixing it.
Desktest can shell out to locally-installed coding agent CLIs instead of calling LLM APIs directly:
`desktest run task.json --provider claude-cli`: uses your existing Claude Code subscription. Each step shells out to `claude -p`, saves trajectory screenshots and accessibility trees as files, and instructs Claude to read them via its Read tool. See docs/claude-cli-provider.md.
`desktest run task.json --provider codex-cli`: uses your existing ChatGPT login or CODEX_API_KEY. Screenshots are passed directly as `-i` flags (native image input), and accessibility trees are embedded inline in the prompt. See docs/codex-cli-provider.md.
To run tests (Linux, the default):
- Linux or macOS host
- Docker daemon running (Docker Desktop, OrbStack, Colima, etc.)
- An LLM API key (OpenAI, Anthropic, or compatible), or a CLI-based provider: Claude Code (`--provider claude-cli`) or Codex CLI (`--provider codex-cli`). No key is needed in `--replay` mode.
To run tests (macOS apps, planned):
- Apple Silicon Mac (M1 or later) running macOS 13+
- Tart installed (`brew install cirruslabs/cli/tart`)
- A Tart golden image with TCC permissions pre-configured (Accessibility + Screen Recording)
- An LLM API key (same as Linux)
- 2-VM limit: Apple's macOS SLA and Virtualization.framework permit a maximum of 2 macOS VMs running simultaneously per physical Mac. This limits parallel test execution for suite runs. See macOS Support for details, including Apple TOS compliance.
To run tests (Windows apps, planned):
- Windows VM support is planned but not yet designed. Expected to use QEMU/libvirt or Hyper-V with Windows VMs, RDP or VNC for display access, and UI Automation APIs for accessibility. Details TBD.
To build from source (optional):
- Rust toolchain (`cargo`)
- Git
- Xcode Command Line Tools (for the macOS a11y helper binary; macOS only)
Run desktest doctor to verify your setup.
# One-line install (downloads pre-built binary)
curl -fsSL https://raw.githubusercontent.com/Edison-Watch/desktest/master/install.sh | sh
# Or build from source
git clone https://github.com/Edison-Watch/desktest.git
cd desktest
make install_cli
# Validate a task file
desktest validate elcalc-test.json
# Run a single test
desktest run elcalc-test.json
# Run a test suite
desktest suite tests/
# Interactive debugging (starts container, prints VNC info, pauses)
desktest interactive elcalc-test.json
# Step-by-step mode (pause after each agent action)
desktest interactive elcalc-test.json --step

desktest [OPTIONS] <COMMAND>
Commands:
run Run a single test from a task JSON file
suite Run all *.json task files in a directory
interactive Start container and pause for debugging
attach Attach to an existing running container
validate Check task JSON against schema without running
codify Convert trajectory to deterministic Python replay script
review Generate interactive HTML trajectory viewer
logs View trajectory logs in the terminal (supports --steps N, N-M, or N,M,X-Y)
monitor Start a persistent monitor server for multi-phase runs
doctor Check that all prerequisites are installed and configured
update Update desktest to the latest release from GitHub
Options:
--config <FILE> Config JSON file (optional; API key can come from env vars)
--output <DIR> Output directory for results (default: ./test-results/)
--debug Enable debug logging
--verbose Include full LLM responses in trajectory logs
--record Enable video recording
--monitor Enable live monitoring web dashboard
--monitor-port <PORT> Port for the monitoring dashboard (default: 7860)
--resolution <WxH> Display resolution (e.g., 1280x720, 1920x1080, or preset: 720p, 1080p)
--artifacts-dir <DIR> Directory for trajectory logs, screenshots, and a11y snapshots
--qa Enable QA mode: agent reports app bugs during testing
--with-bash Allow the agent to run bash commands inside the container (disabled by default)
--provider <PROVIDER> LLM provider: anthropic, openai, openrouter, cerebras, gemini, claude-cli, codex-cli, custom
--model <MODEL> LLM model name (overrides config file)
--api-key <KEY> API key for the LLM provider (prefer env vars to avoid shell history exposure)
Tests are defined in JSON files. Here's a complete example that tests a calculator app:
{
"schema_version": "1.0", // Required: task schema version
"id": "elcalc-addition", // Unique test identifier
"instruction": "Using the calculator app, compute 42 + 58.", // What the agent should do
"completion_condition": "The calculator display shows 100 as the result.", // Success criteria (optional)
"app": {
"type": "appimage", // How to deploy the app (see App Types below)
"path": "./elcalc-2.0.3-x86_64.AppImage"
},
"evaluator": {
"mode": "llm", // Validation mode: "llm", "programmatic", or "hybrid"
"llm_judge_prompt": "Does the calculator display show the number 100 as the result? Answer pass or fail."
},
"timeout": 120 // Max seconds before the test is aborted
}

The optional completion_condition field lets you define the success criteria separately from the task instruction. When present, it's appended to the instruction sent to the agent and rendered as a collapsible section in the review and live dashboards.
See examples/ for more examples including folder deploys and custom Docker images.
| Type | Description |
|---|---|
| `appimage` | Deploy a single AppImage file |
| `folder` | Deploy a directory with an entrypoint script |
| `docker_image` | Use a pre-built custom Docker image |
| `vnc_attach` | Attach to an existing running desktop (see Attach Mode) |
| `macos_tart` | (Planned) macOS app in a Tart VM (see macOS Support) |
| `macos_native` | (Planned) macOS app on the host desktop, no VM isolation (see macOS Support) |
| `windows` | (Planned) Windows app in a VM; details TBD |
Electron apps: add `"electron": true` to your app config to use the `desktest-desktop:electron` image with Node.js pre-installed. See examples/ELECTRON_QUICKSTART.md.
| Metric | Description |
|---|---|
| `file_compare` | Compare a container file against an expected file (exact or normalized) |
| `file_compare_semantic` | Parse and compare structured files (JSON, YAML, XML, CSV) |
| `command_output` | Run a command, check stdout (contains, equals, regex) |
| `file_exists` | Check if a file exists (or doesn't) in the container |
| `exit_code` | Run a command, check its exit code |
| `script_replay` | Run a Python replay script, check for REPLAY_COMPLETE + exit 0 |
Add --monitor to any run or suite command to launch a real-time web dashboard that streams the agent's actions as they happen:
# Watch a single test live
desktest run task.json --monitor
# Watch a test suite with progress tracking
desktest suite tests/ --monitor
# Use a custom port
desktest run task.json --monitor --monitor-port 8080

Open http://localhost:7860 in your browser to see:
- Live step feed: screenshots, agent thoughts, and action code appear as each step completes
- Test info header: test ID, instruction, VNC link, and max steps
- Suite progress: progress bar showing completed/total tests during suite runs
- Status indicator: pulsing dot shows connection state (live vs disconnected)
The dashboard uses the same UI as desktest review: a sidebar with step navigation and a main panel with screenshot/thought/action details. The difference is that steps stream in via Server-Sent Events (SSE) instead of being loaded from a static file.
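For readers who want to consume the step stream programmatically rather than in a browser, here is a minimal sketch of parsing SSE `data:` lines into step payloads. The event payload shape (`step`, `thought` fields) is an assumption about the dashboard's internal format, not a documented API.

```python
import json

def parse_sse(stream_lines):
    """Collect SSE events into JSON payloads.

    Per the SSE format, `data:` lines accumulate until a blank line
    terminates the event. Payload field names here are hypothetical.
    """
    events, data_buf = [], []
    for line in stream_lines:
        if line.startswith("data:"):
            data_buf.append(line[5:].strip())
        elif line == "" and data_buf:  # blank line ends the event
            events.append(json.loads("\n".join(data_buf)))
            data_buf = []
    return events

# Example: two streamed steps, as a dashboard like this might emit
raw = [
    'data: {"step": 1, "thought": "open the app"}',
    "",
    'data: {"step": 2, "thought": "click the 4 key"}',
    "",
]
for event in parse_sse(raw):
    print(event["step"], event["thought"])
```

A real client would read these lines from the HTTP response of the dashboard's event-stream endpoint (whose path is not documented here) instead of a hard-coded list.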
Add --qa to any run, suite, or attach command to enable bug reporting. The agent will complete its task as normal, but also watch for application bugs and report them as markdown files:
# Run a test with QA bug reporting
desktest run task.json --qa
# QA mode in a test suite
desktest suite tests/ --qa

When --qa is enabled:
- The agent gains a `BUG` command to report application bugs it discovers
- Bash access is automatically enabled for diagnostic investigation (log files, process state, etc.)
- Bug reports are written to `desktest_artifacts/bugs/BUG-001.md`, `BUG-002.md`, etc.
- Each report includes: summary, description, screenshot reference, accessibility tree state
- The agent continues its task after reporting; multiple bugs can be found per run
- Bug count is included in `results.json` and the test output
Optionally send bug reports to Slack as they're discovered. Add an integrations section to your config JSON:
{
"integrations": {
"slack": {
"webhook_url": "https://hooks.slack.com/services/T.../B.../xxx",
"channel": "#qa-bugs"
}
}
}

Or set the DESKTEST_SLACK_WEBHOOK_URL environment variable (takes precedence over config). The channel field is optional, since webhooks already target a default channel. Notifications are fire-and-forget and never block the test.
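To make the fire-and-forget semantics concrete, here is a sketch of what such a notifier could look like. This is an illustrative client, not desktest's actual code; it only relies on Slack's documented Incoming Webhook payload (`text`, optional `channel`).

```python
import json
import urllib.request

def slack_payload(bug_id, summary, channel=None):
    """Build an Incoming Webhook payload for a QA bug report.

    Slack webhooks accept a JSON body with a `text` field; `channel`
    is optional because each webhook targets a default channel.
    """
    payload = {"text": f"*{bug_id}*: {summary}"}
    if channel:
        payload["channel"] = channel
    return payload

def notify(webhook_url, payload, timeout=3):
    """Fire-and-forget POST: any failure is swallowed so a broken
    webhook can never block or fail the test run."""
    try:
        req = urllib.request.Request(
            webhook_url,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=timeout)
    except Exception:
        pass  # notification errors are intentionally ignored

print(slack_payload("BUG-001", "Save dialog crashes on empty filename", "#qa-bugs"))
```

The short timeout plus the blanket `except` is what makes this safe to call inline from a test loop.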
Developer writes task.json
        │
        ▼
┌──────────────┐
│ desktest CLI │  validate / run / suite / interactive
└──────┬───────┘
       │
       ▼
┌──────────────────────────────────┐
│         Docker Container         │
│  ┌──────┐  ┌──────┐  ┌────────┐  │
│  │ Xvfb │  │ XFCE │  │ x11vnc │  │
│  └──┬───┘  └──┬───┘  └────────┘  │
│     │ virtual desktop            │
│  ┌──┴─────────┴───┐              │
│  │    Your App    │              │
│  └────────────────┘              │
└────────┬─────────────────────────┘
         │ screenshot + a11y tree
         ▼
┌────────────────────┐
│  LLM Agent Loop    │  observe → think → act → repeat
│  (PyAutoGUI code)  │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│     Evaluator      │  programmatic checks / LLM judge / hybrid
└─────────┬──────────┘
          │
          ▼
results.json + recording.mp4 + trajectory.jsonl
Each test run produces:
test-results/
results.json # Structured test results (always)
desktest_artifacts/
recording.mp4 # Video of the test session (with --record)
trajectory.jsonl # Step-by-step agent log (always)
agent_conversation.json # Full LLM conversation (always)
step_001.png # Screenshot per step (always)
step_001_a11y.txt # Accessibility tree per step (always)
bugs/ # Bug reports (with --qa)
BUG-001.md # Individual bug report (with --qa)
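Since trajectory.jsonl is a plain JSONL file, it is easy to post-process. Here is a minimal sketch that prints a one-line-per-step summary; the record field names (`step`, `thought`, `action`) are assumptions about the log schema, so inspect your own trajectory.jsonl to confirm them.

```python
import json
from pathlib import Path

def summarize_trajectory(path):
    """Return one summary line per step in a trajectory JSONL file.

    Field names below are hypothetical; adjust to the actual schema.
    """
    lines = []
    for raw in Path(path).read_text().splitlines():
        if not raw.strip():
            continue
        rec = json.loads(raw)
        lines.append(f"step {rec.get('step', '?')}: {rec.get('action', '')}")
    return lines

# Demo with a synthetic two-step log
records = [
    {"step": 1, "thought": "find the button", "action": "pyautogui.click(100, 200)"},
    {"step": 2, "thought": "type the result", "action": "pyautogui.write('42')"},
]
Path("demo_trajectory.jsonl").write_text(
    "\n".join(json.dumps(r) for r in records) + "\n"
)
for line in summarize_trajectory("demo_trajectory.jsonl"):
    print(line)
```

The same pattern works for feeding selected steps to a coding agent, which is exactly what `desktest logs --steps N-M` does from the CLI side.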
| Code | Meaning |
|---|---|
| 0 | Test passed |
| 1 | Test failed |
| 2 | Configuration error |
| 3 | Infrastructure error |
| 4 | Agent error |
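In CI it is often useful to turn these codes into readable log lines while still propagating the original status. A small wrapper sketch (the mapping comes straight from the table above; the wrapper itself is illustrative, not part of desktest):

```shell
#!/bin/sh
# Map a desktest exit code to a human-readable verdict for CI logs.
explain_exit() {
  case "$1" in
    0) echo "Test passed" ;;
    1) echo "Test failed" ;;
    2) echo "Configuration error" ;;
    3) echo "Infrastructure error" ;;
    4) echo "Agent error" ;;
    *) echo "Unknown exit code: $1" ;;
  esac
}

# Typical CI usage (desktest invocation shown commented out):
#   desktest run task.json; code=$?
#   explain_exit "$code"
#   exit "$code"
explain_exit 0
explain_exit 2
```

Re-exiting with the captured code keeps the CI job's pass/fail status identical to desktest's own verdict.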
| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | OpenAI API key |
| `ANTHROPIC_API_KEY` | Anthropic API key |
| `OPENROUTER_API_KEY` | OpenRouter API key |
| `CEREBRAS_API_KEY` | Cerebras API key |
| `GEMINI_API_KEY` | Gemini API key |
| `CODEX_API_KEY` | Codex CLI API key (alternative to ChatGPT login) |
| `LLM_API_KEY` | Fallback API key for any provider |
| `DESKTEST_SLACK_WEBHOOK_URL` | Slack Incoming Webhook URL for QA bug notifications (overrides config) |
| `GITHUB_TOKEN` | GitHub token (used by `desktest update`) |