This exploration brought back a kind of excitement I hadn't felt in a while: every experiment pushed into a boundary I didn't expect. AI makes us move faster, but it doesn't understand the world on our behalf. Learning and curiosity are still part of what makes us who we are. If you're willing, treat this as an engineering notebook, and read it slowly.
- A three-layer loop: the decision layer freezes goals and acceptance criteria; the orchestration layer slices work and runs review/rework; the execution layer only writes to the repo and self-checks. Mapping: decision layer = Main, orchestration layer = Assistant, execution layer = Coding.
- Write isolation: Main/Assistant never write implementation directly; repository writes come only from Coding.
- Make the protocol an engineering system: role stamps + schemas + `blocked`/`blocking_reason` semantics turn drift into explicit blocks, not "close enough".
- Freeze decisions: SSOT + Dispatch Preflight lock goal/constraints/acceptance; ambiguity raises `decisions_needed`.
- Runtime capabilities might not match the ideal topology: when nested spawn isn't available, use `main_router` so Main acts as a Transport Relay while Assistant retains review/rework authority.
- Parallelism needs a metronome: spawn-first -> wait-any -> review -> replenish, no `wait-all` barriers, and never proactively terminate unfinished agents.
Everything used in this post (AGENTS.md, schemas, and the protocol test methodology) is hosted at https://github.com/hack-ink/codex-playbook. You can fork it and build your own Multi-Agent workflow on top.
- Maximize parallelism to increase throughput: slice "code writing" work and run as much as safely possible in parallel, so waiting time collapses.
- Put the high-throughput Coding model where it shines: OpenAI's own data says GPT-5.3-Codex-Spark can output >1000 tokens/sec under the right setup (https://openai.com/index/introducing-gpt-5-3-codex-spark/). But it's a smaller variant; it's usually weaker on complex decisions and long-chain reasoning than a flagship model. That naturally suggests a combo: Spark as the Coding executor, a stronger model as Assistant/Main for slicing, review, and arbitration.
- Personal curiosity: how far can Codex Multi-Agent be engineered, given real runtime constraints?
If you treat an LLM as a "drifting runtime", a single agent tends to take on everything at once: decisions, decomposition, implementation, and review. Parallelism collapses into serial work, and scope/role drift becomes much more likely (especially when the root thread "just quickly" edits code).
Splitting into three roles is the minimal isolation that keeps the system parallel and stable:
- Main: architecture/strategy decisions and final acceptance; freezes SSOT so the direction doesn't change mid-execution.
- Assistant: slices tasks, schedules parallel work, runs the review/rework loop; hands "coding-ready briefs" to Coding.
- Coding: performs repository writes and self-checks only; throughput-first, ideal for Spark-like high-speed models.
I model Codex Multi-Agent as a "drifting runtime": you can write rules as hard as you want, but if the execution chain still contains implicit decisions, unobservable state, or no reliable feedback loop, some parallel run will go off the rails. Conversely, once you turn key control points into structured inputs/outputs, and turn deviations into explicit blocks, the system can keep moving even as runtime constraints change.
This protocol insists on two invariants:
- Responsibility chain: ownership must be stable, or the system drifts.
- Transport chain: who can spawn whom and who can message whom must match runtime reality, or the system breaks.
When both hold, you can reliably get: high parallelism, reproducibility, and a closed review/rework loop.
The ideal shape is: Main handles architecture/arbitration, Assistant handles orchestration + review, Coding handles implementation writes. Every layer reviews the layer below and can assign rework, forming a recursive closed loop.
The value isn't "it looks like an org chart". The value is that it naturally isolates three contexts:
- Architecture and boundary decisions: Main
- Task slicing, risk plan, review criteria: Assistant
- Implementation and self-checks: Coding
Topology sketch:
```mermaid
flowchart TD
    M[Main]
    A1[Assistant 1]
    A2[Assistant 2]
    C1[Coding 1..n]
    C2[Coding 1..n]
    M -->|delegate| A1
    M -->|delegate| A2
    A1 -->|delegate coding tasks| C1
    A2 -->|delegate coding tasks| C2
    C1 -->|result| A1
    C2 -->|result| A2
    A1 -->|review / rework| C1
    A2 -->|review / rework| C2
    A1 -->|report| M
    A2 -->|report| M
    M -->|review / rework| A1
    M -->|review / rework| A2
```
Everything in the rest of this post exists to keep that responsibility chain executable under real runtime constraints.
This section is the closest thing to "research notes" from real debugging: I wanted to just max out parallelism and hard-code the role split, but it quickly became clear that "verbal agreements" and "a good prompt" aren't stable enough. So I reversed the order: make the system observable, blockable, and loopable first; then talk about throughput.
A common failure mode in Multi-Agent is starting with "run parallelism hot" and "pick models". My experience is the opposite: without observable signals, explicit failure/block semantics, and a real review/rework loop, higher throughput just amplifies deviations.
To avoid abstract "best practices", I'm going to state the key turning points as: assumption -> observation -> conclusion.
- Assumption: permission governance (e.g., making Main read-only) can prevent unauthorized writes.
  Observation: read-only limits the capability set, not the behavior. It also cuts off necessary write operations like `fmt`, auto-fix, and `git`, so the workflow often stalls at step one. A typical example: "run one round of formatting", immediately refused:

  ```
  cargo fmt
  # -> This is a write operation; I can't execute it right now.
  ```

  Conclusion: read-only can be a safety baseline, but it can't be the main governance strategy. Governance must come from protocol + loop semantics: drift becomes a block, and blocks force explicit rework.
- Assumption: a heavy gate/wrapper can force the workflow to be correct.
  Observation: here, "gate/wrapper" means a forced entry script: instead of letting the agent spawn/edit directly, you require all dispatch and writes to go through this wrapper. The wrapper tries to front-load constraints as hard gates: validate Dispatch Preflight JSON, inject Role Stamps, enforce `allowed_paths`, validate output schemas, trigger `fmt`/`test`, and refuse to proceed on failure. For example (schematic): generate `preflight.json`, then only allow execution via the gate:

  ```
  dispatch_gate --workdir /abs/repo --preflight preflight.json \
    --validate-schema ~/.codex/dispatch-preflight.schema.json \
    --after "cargo test"
  ```
This kind of gate is great in a deterministic runtime, but actual execution still depends on Codex's runtime scheduling. Even if the gate is perfect, the agent may not call it or follow it. At that point the gate doesn't enforce correctness; it becomes an extra failure surface. And the more script-like your prompt becomes, the easier it is to get policy false positives, which get amplified under parallelism.
  For example:

  ```
  Invalid prompt: your prompt was flagged as potentially violating our usage policy.
  ```

  Conclusion: keep validation, but don't bet workflow correctness on a heavy gate. The higher-leverage investment is prompt engineering: write role boundaries, SSOT, and acceptance criteria into every delegated prompt, then combine structured outputs with Fail-Closed Review so deviations become diagnosable blocks and repeatable rework loops.
- Parallelism: even if routing is correct, "waiting strategy" can collapse parallelism back into serial work.
  Observation: common misreads of wait-any are "no result this tick means everyone is done" and "close unfinished agents to free slots". Both flatten parallelism or produce half-results.
  Conclusion: hard-code the metronome as spawn-first -> wait-any -> review -> replenish, and explicitly forbid proactively terminating unfinished agents. (There's a dedicated section later on wait-any vs wait-all.)
My typical order of operations:
- Observability + failure semantics first: schemas + `blocked` semantics + minimal validation tooling, so failures become structured state instead of narrative.
- Freeze the entry: SSOT/Preflight + Role Stamps, so implicit design decisions get pushed out of execution time.
- Only then tune scheduling/routing: spawn-first/wait-any/replenish, so parallelism becomes a default behavior rather than an accident.
The first problem is painfully practical: rules written only in chat are easy to lose. You explain it clearly in this thread; you open a new thread or spawn a new agent, and it snaps back to default behavior. Writing the rules into AGENTS.md is like placing an "inheritable collaboration manual" at the directory root; everything that follows runs with the same boundaries.
Strictly speaking, this could also be packaged as a skill. I didn't do that because, for me, parallelism is a first-class citizen: I want the rules to be on by default and inherited by default, not something I have to remember to "load" every time. If you prefer the skill shape, you can package the protocol as a skill verbatim.
- Put the protocol into `AGENTS.md` (for example, centralized at `~/.codex/AGENTS.md`) so it's inherited and reusable by default.
- Make the orchestrator rules painfully explicit: Main/Assistant/Coding roles, delegation priorities, scope boundaries.
If you want "the parent reviews the child until satisfied", you need review at scale. Otherwise, review becomes "read a long blob and vibe-check it", which is essentially non-executable under parallelism.
Once schemas exist, you must also define dispatch and reporting. Otherwise parallelism collapses because everyone ships results in a different format.
- Start with minimal structured output schemas.
- Update dispatch and reporting rules so dispatch and reporting have one consistent shape.
- Define failure semantics for schema validation: you don't "keep going"; you block.
- Split assistant/coding output semantics, and make `blocked`/`blocking_reason` required.
- Add `check-agent-output.py` to turn schema validation from a manual ritual into a command.
- Split assistant output schemas into write/`read_only`, so "execution reporting" and "read-only review" aren't mixed into one interface.
The convergence point here is Fail-Closed Review: block rather than "false-pass". In Multi-Agent systems, the most dangerous outcome isn't failure; it's a false pass. Once a bad payload is accepted, every downstream decision is built on an illusion.
- Assistant must validate schema first, then do semantic review.
- If the schema is incomplete, return `awaiting_review` + `blocked=true` with a reproducible `blocking_reason` (don't "keep going").
- Any pass verdict must reference a schema-complete coding payload; no filling in missing fields by guesswork.
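The review order can be sketched as a small gate. This is a minimal illustration, not the repo's `check-agent-output.py`; the required field set and verdict logic here are my own assumptions:

```python
# Assumed minimal required fields for a coding payload (illustrative only).
REQUIRED_CODING_FIELDS = {"task_id", "summary", "self_check"}

def review(payload: dict) -> dict:
    """Fail-Closed Review: schema first, semantics second; never false-pass."""
    missing = REQUIRED_CODING_FIELDS - payload.keys()
    if missing:
        # Schema incomplete: block instead of guessing the missing fields.
        return {
            "status": "awaiting_review",
            "blocked": True,
            "blocking_reason": f"schema invalid: missing {sorted(missing)}",
        }
    # Only a schema-complete payload is eligible for a semantic verdict.
    verdict = "pass" if payload["self_check"].get("evidence") else "rework"
    return {"status": "done", "blocked": False, "verdict": verdict}

# A payload missing `summary` must block, not pass:
print(review({"task_id": "t1", "self_check": {"evidence": "cargo test: ok"}}))
```

The point of the shape is that a bad payload can never produce a pass verdict; the only legal outputs are a block with a reproducible reason, or a verdict grounded in a complete payload.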
"Main never writes implementation" and "if it can be parallel, it should be parallel" will be swallowed by real execution unless you also define routing + evidence fields. My approach is to make "you should have delegated but didn't" a detectable state, and require an enumerable reason; otherwise it's a protocol violation.
- Add dispatch gate enforcement so routing rules become checkable logic.
- Strengthen proactive delegation and fallback logging: if you didn't parallelize, you must give an enumerable reason.
This was another key turning point. The most dangerous drift in Multi-Agent isn't "a bug in code"; it's "silently changing the problem". If execution-time still allows implicit design decisions, parallel work will fork fast and only converge when you try to merge and realize you built different things.
- Dispatch Preflight: before Main delegates any non-trivial work, SSOT (goal/non_goals/constraints/acceptance_criteria/decisions) must be written as a structured object.
- SSOT frozen during execution: execution doesn't get to change the goal silently; new decisions must block with `decisions_needed`.
- Promote this from text to a checkable interface: introduce `dispatch-preflight.schema.json`.
- Add routing details and examples so "who dispatches, who reviews, who writes" is unambiguous.
Dispatch Preflight / SSOT Minimal Skeleton
```json
{
  "ssot_id": "sso-YYYY-MM-DD-<short>",
  "ssot": {
    "goal": "...",
    "non_goals": ["..."],
    "constraints": ["..."],
    "acceptance_criteria": ["..."],
    "decisions": ["..."]
  },
  "routing_mode": "main_router",
  "subtasks": [
    {
      "task_id": "...",
      "subtask_id": "...",
      "delegate_target": "assistant",
      "allowed_paths": ["/abs/path"]
    }
  ],
  "scheduler_plan": "spawn-first -> wait-any -> review -> replenish"
}
```

Before Codex v0.103, internal subagents couldn't be configured per role (model/config). To make Coding use Spark while orchestration/arbitration used a stronger model, I had to run Coding externally via `codex exec`. Architecturally it's redundant, but it solved the hard requirement at the time: role heterogeneity.
But the cost of an external codex exec chain is real: output return paths, schema alignment, and error/exit semantics turn the glue layer into a new failure surface. A typical incident is: the task didn't fail, the call chain did.
Example (screenshot of the external codex exec approach in action): https://x.com/acg_box/status/2022724226239893764
The `codex exec` override snippet at the time (to isolate Coding from the internal default config)
```shell
codex exec \
  --sandbox danger-full-access \
  --cd /abs/trusted/repo \
  --ephemeral \
  --output-schema ~/.codex/agent-output.coding.schema.json \
  -o /tmp/coding-output.json \
  -m gpt-5.3-codex-spark \
  -c 'model_reasoning_effort="xhigh"' \
  "<coding prompt>"
```

Evidence: a real failure chain:

```
Warning: no last agent message; wrote empty content to /dev/fd/4
... invalid_json_schema ... Missing 'task_id'
... codex exec failed (exit 1)
```
The painful part is that business logic didn't change at all, yet you're debugging protocol/glue, not implementation.
With v0.103 introducing configurable Subagent Profiles / Per-Agent Config, this problem is solved: you can spawn different role profiles internally, and role heterogeneity moves back into internal Multi-Agent.
Key landing points for the internal migration included: introducing [agents] max_threads, per-role config files for Assistant/Coding, and writing parallelism facts into output fields (e.g. coding_subtask_ids, parallel_peak_inflight).
Per-Agent Config Sketch
```toml
[agents]
max_threads = 24

[agents.assistant]
description = "Assistant orchestrator profile for non-coding delegation, review, and reporting."
config_file = "./agents/assistant.toml"

[agents.coding]
description = "Coding executor profile for repository implementation writes only."
config_file = "./agents/coding.toml"
```

This protocol assumes "responsibility isolation": different roles need different capabilities, so I configure different models and reasoning effort per agent, and let each do what it's best at.
- Main: `gpt-5.2` + `xhigh`. I treat it as the "chief / arbitration layer": broader knowledge, steadier decisions, better at hard trade-offs. It's slow, but that's fine because it doesn't implement; it sets direction and does final acceptance. More importantly, on real engineering-heavy complex tasks (cross-module, multi-objective trade-offs, uncertainty), its depth and global reasoning are, in my experience, meaningfully stronger than the more implementation-centric `gpt-5.3-codex`.
- Assistant: `gpt-5.3-codex` + `high`. Orchestration and review need engineering intuition and speed. I intentionally avoid `xhigh`: `high` tends to be smoother, and a number of reports suggest Codex models in `xhigh` can overthink and backtrack (https://www.reddit.com/r/ClaudeAI/comments/1qxr7vs/gpt53_codex_vs_opus_46_we_benchmarked_both_on_our/).
- Coding: `gpt-5.3-codex-spark` + `medium`. Spark is a high-throughput executor, and it shouldn't "think too much". My experience is that Spark in `high`/`xhigh` tends to spend longer simulating and revising; even with high token throughput, total wall-clock time grows. `medium` behaves like "produce a reviewable version quickly": if something is unclear, return early and let a smarter Assistant/Main steer the next iteration.
Config Snippet (Example)
```toml
# ~/.codex/config.toml (Main)
model = "gpt-5.2"
model_reasoning_effort = "xhigh"

# ~/.codex/agents/assistant.toml
model = "gpt-5.3-codex"
model_reasoning_effort = "high"

# ~/.codex/agents/coding.toml
model = "gpt-5.3-codex-spark"
model_reasoning_effort = "medium"
```

In the same phase, there were two more changes that made the protocol feel more like an interface:
- Role stamps and role enforcement: missing role headers block immediately, so agents don't "accidentally" think they are Main.
- Removing "reasoning-effort overrides" from the Coding wrapper, so policy doesn't scatter across scripts; role config becomes the source of truth.
These sound like protocol neat-freakery, but they solve a very real problem: roles drift over multi-round execution (the root thread "just writes", a subagent assumes it can arbitrate). Once role stamps become a hard block, misalignment becomes diagnosable instead of implicit drift.
Minimal Role Stamp Templates
```
[ROLE:MAIN] [PARENT:NONE]

[ROLE:ASSISTANT]
[PARENT:MAIN]
[SSOT_ID:<id>]

[ROLE:CODING]
[PARENT:MAIN_ROUTER]
[SSOT_ID:<id>]
[ROUTING_MODE:main_router]
```
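Enforcing the stamps can be as small as a header check before any payload is processed. A sketch under my own assumptions: the stamp grammar is inferred from the templates above, and the expected-parent table reflects the main_router topology, not an official rule set:

```python
import re

# Expected parent per role, assumed from the main_router topology.
EXPECTED_PARENT = {"MAIN": "NONE", "ASSISTANT": "MAIN", "CODING": "MAIN_ROUTER"}

def check_role_stamp(message: str) -> tuple[bool, str]:
    """Block immediately when the role header is missing or the parent is wrong."""
    role = re.search(r"\[ROLE:(\w+)\]", message)
    parent = re.search(r"\[PARENT:(\w+)\]", message)
    if not role:
        return False, "missing [ROLE:...] stamp"
    if role.group(1) != "MAIN" and not parent:
        return False, "missing [PARENT:...] stamp"
    expected = EXPECTED_PARENT.get(role.group(1))
    actual = parent.group(1) if parent else "NONE"
    if expected is not None and actual != expected:
        return False, f"routing violation: {role.group(1)} under {actual}"
    return True, "ok"

# Wrong Coding parent must surface as a concrete routing violation:
print(check_role_stamp("[ROLE:CODING] [PARENT:MAIN] [SSOT_ID:x]"))
```

Because the check runs before any semantic processing, a subagent that "thinks it is Main" fails loudly instead of drifting.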
Reality hits quickly: nested subagents aren't available, so Assistant can't spawn Coding.
The responsibility chain can remain ideal, but the transport chain must change.
When nested spawn isn't available, the workaround is relay: Main forwards messages between Assistant and Coding.
This is architecturally redundant too, but until nested spawn is supported, it's necessary.
The key is: don't let relay degrade into a "Main single-thread bottleneck". The design of `main_router` is:
- Main does transport only.
- Assistant owns review/rework decisions.
- Use the parallel scheduling metronome to avoid serializing.
If Main does semantic review while relaying, the system collapses into a single-thread bottleneck (Main stares at every result) and parallelism disappears.
The protocol shape here converged to:
- Introduce `main_router` compatibility and schema alignment.
- Delete multi-mode routing and converge to `main_router` only.
main_router topology: separating "responsibility chain" from "transport chain"
```mermaid
flowchart TD
    M[Main - Transport Relay]
    A[Assistant - Orchestrate and Review Owner]
    C[Coding 1..n - Repo Writes]
    M -->|delegate non-trivial work| A
    A -->|brief slices and risk/verification plan| M
    M -->|spawn/dispatch - transport only| C
    C -->|results| M
    M -->|forward results| A
    A -->|review verdict: pass / rework| M
    M -->|relay rework| C
    A -->|final report| M
```
In Multi-Agent, parallelism isn't "open more agents". It's "keep refilling the window". A runnable metronome looks like this:
- Spawn as many independent runnable slices as you safely can (spawn-first)
- Use `wait-any` to wait only for the first completion event, then immediately review
- If review passes, replenish the window with the next runnable slice; if review fails, enter the rework loop
- Before the runnable queue is empty, do not introduce a `wait-all` barrier
Two "counterintuitive but must be hard-coded" lines:
wait-any does not mean "don't wait". It means "wait for the first completion event", then keep moving.
Before the runnable queue is empty, do not introduce any wait-all barrier.
First, clarify the concepts. They're not necessarily one API name; they're scheduling modes:
- wait-all: global wait / batch barrier. Spawn a batch, then wait for all of them to finish before you start review/summarize and decide the next batch.
- wait-any: event-driven / pipeline mode. Spawn up to the window limit; whenever any one finishes, immediately review that result and spawn the next runnable slice.
Why I choose wait-any:
- Less head-of-line blocking: slices have uneven runtime. wait-all is held hostage by the slowest slice; faster slices "idle" in the queue. That's parallel in name and serial in behavior.
- Earlier error exposure, faster rework loops: the most expensive thing in Multi-Agent is "wrong, but discovered late". wait-any pulls review/rework earlier.
- Easier to make parallelism real: parallelism isn't "open N once". It's "keep replenishing". wait-any turns spawn-first -> wait-any -> review -> replenish into a stable rhythm.
When to use wait-all: only when the runnable queue is already empty, or the next step naturally requires "all results present" (e.g. final full verification / final consolidated report). Don't put it inside the main loop as a barrier.
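Under those definitions, the main loop is short enough to write down. This is a toy model, not the Codex scheduler: `asyncio` tasks with random sleeps stand in for agents, and review is assumed to always pass:

```python
import asyncio
import random

async def agent(slice_id: str) -> str:
    # Stand-in for a Coding run with uneven duration.
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return slice_id

async def metronome(slices: list[str], window: int = 3) -> list[str]:
    queue = list(slices)
    inflight: set[asyncio.Task] = set()
    reviewed: list[str] = []
    # spawn-first: fill the window before waiting on anything
    while queue and len(inflight) < window:
        inflight.add(asyncio.create_task(agent(queue.pop(0))))
    while inflight:
        # wait-any: block only until the first completion event
        done, inflight = await asyncio.wait(
            inflight, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            reviewed.append(task.result())  # review happens here, per result
            if queue:  # replenish: keep the window full; no wait-all barrier
                inflight.add(asyncio.create_task(agent(queue.pop(0))))
    return reviewed

print(asyncio.run(metronome([f"s{i}" for i in range(6)])))
```

Note what is absent: there is no point where the loop waits for the whole batch. The only terminal condition is "nothing in flight and nothing runnable", which is exactly the one place wait-all semantics are legitimate.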
How I converged on this: from symptoms to a stable rule
- Early on, I often wrote orchestration sentences like "after both slices finish in parallel, I'll do a consolidated review". Behaviorally, that's a wait-all barrier.
- In practice, even with a wide window, `parallel_peak_inflight` fell back to 1 easily. And if any slice needed rework, rework often started only after the whole batch ended.
- I ran a minimal validation: spawn multiple probes with different durations (e.g. 10s/20s/30s), then loop "process the first completion event and refill the window". Once you run it that way, throughput stabilizes and rework enters the loop earlier.
- Finally, I wrote it into both the protocol and the test rubric: while runnable work exists, forbid wait-all barriers, and add a wait-any behavior test to `PROTOCOL_TESTING.md` so future protocol changes don't regress into batch barriers.
Under main_router, there are two easy pitfalls:
- Don't proactively terminate unfinished agents:
Do not proactively terminate unfinished agents; wait for completion unless user explicitly cancels.
- Don't misread "wait-any returned nothing this tick" as "everyone is done". It usually just means "no completion event yet". Introducing wait-all here will flatten parallelism.
The essence of parallelism isn't "more agents". It's "capacity times refill speed". I once wrote "6 is too small, I want 15, 20", and the core point was simple: a unified agent pool that's too small will throttle the system no matter how smart your scheduling is.
A quote from that moment:
As you all know, 6 is too limiting. I want 15, 20. Give me all the agents.
So I ended up writing the concurrency target into config, and synchronizing it into the protocol test rubric (e.g. raising max_threads to 24). At the same time, I did subtraction: remove heavy/edge rules so the core protocol becomes more prominent.
The minimum evidence set I consider necessary to make "parallelism is real" falsifiable:
- `parallel_peak_inflight`: peak parallelism isn't 1
- `coding_subtask_ids`: write results must reference the corresponding coding artifacts
- `self_check.evidence`: coding self-check evidence is reproducible
If any one of these is missing, you can't answer three questions reliably: did parallelism actually happen, did writes come only from Coding, and are results reproducible? At that point, you only have "lots of output", not "controlled throughput improvement".
A practical self-check: if you see `parallel_peak_inflight` stuck at 1 for long stretches, don't blame `max_threads` first. Suspect non-independent slicing, scheduling silently serializing, or review/rework failing to form a stable pipeline.
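If you want to verify the number rather than trust it, peak inflight is cheap to recompute from spawn/finish timestamps. A sketch; the event format (timestamp plus +1/-1 delta) is my own convention, not something the protocol prescribes:

```python
def peak_inflight(events: list[tuple[float, int]]) -> int:
    """events: (timestamp, +1 for spawn / -1 for finish), in any order."""
    inflight = peak = 0
    for _, delta in sorted(events):  # replay the timeline in time order
        inflight += delta
        peak = max(peak, inflight)
    return peak

# Two overlapping slices, then a third spawned only after both finished:
events = [(0.0, 1), (0.1, 1), (0.5, -1), (0.6, -1), (0.7, 1), (0.9, -1)]
print(peak_inflight(events))  # -> 2, not 3: the third slice never overlapped
```

A fully serialized run (spawn, finish, spawn, finish, ...) yields a peak of 1 no matter how many slices ran, which is exactly the symptom the self-check above is hunting for.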
To avoid "rules are written very hard, but reality ignores them", we also turned protocol testing into a minimal repeatable test suite (see PROTOCOL_TESTING.md in the repo). It's not just unit tests; it verifies Multi-Agent protocol invariants: routing mode, role boundaries, concurrency rhythm, and whether Fail-Closed Review truly holds.
- Ensure runtime config and schemas are updated under `~/.codex/` (via your own config management, or manual copy/symlink).
- Confirm runtime artifacts contain key fields: `routing_mode`, `slice_id`, `relay_via_main`, `attempt`, `coding_subtask_ids`, `parallel_peak_inflight`, `main_router`.
- Pass criteria: key fields exist and legacy routing labels no longer appear.
Reference command (from PROTOCOL_TESTING.md)
```shell
# 1) make sure runtime config + schemas are updated under ~/.codex/
# 2) confirm runtime files are updated
rg -n 'routing_mode|slice_id|relay_via_main|attempt|coding_subtask_ids|parallel_peak_inflight|main_router' \
  ~/.codex/AGENTS.md \
  ~/.codex/dispatch-preflight.schema.json \
  ~/.codex/agent-output.assistant.write.schema.json \
  ~/.codex/agent-output.assistant.read_only.schema.json \
  ~/.codex/agent-output.coding.schema.json
```

- Use `jsonschema` to validate the four schemas themselves (Draft 2020-12) and validate their embedded examples.
- Pass criteria: all four schemas are `OK`.
Schema validation script (from PROTOCOL_TESTING.md)
```shell
python3 - <<'PY'
import json
from pathlib import Path
from jsonschema import Draft202012Validator

files = [
    '~/.codex/dispatch-preflight.schema.json',
    '~/.codex/agent-output.assistant.write.schema.json',
    '~/.codex/agent-output.assistant.read_only.schema.json',
    '~/.codex/agent-output.coding.schema.json',
]
for f in files:
    d = json.loads(Path(f).expanduser().read_text())
    Draft202012Validator.check_schema(d)
    v = Draft202012Validator(d)
    bad = []
    for i, ex in enumerate(d.get('examples', []), 1):
        errs = list(v.iter_errors(ex))
        if errs:
            bad.append((i, [e.message for e in errs]))
    print(f'{f}:', 'OK' if not bad else f'INVALID {bad}')
PY
```

Method:
- Prepare two disjoint write slices (independent files/dirs).
- Main delegates the write subtasks to Assistant.
- Assistant produces two coding briefs + a review plan.
- Main relays and spawns two Coding agents in parallel (`wait-any` + replenish).
- After each Coding completion, Main immediately forwards the result to Assistant for review.
Pass criteria:
- Assistant write output has `status="done"` and `blocked=false`.
- `routing_mode="main_router"` and `relay_via_main=true`.
- `parallel_peak_inflight >= 2` (true parallelism happened).
- `coding_subtask_ids` is non-empty.
- Every referenced coding payload is schema-valid and contains required fields (e.g. `summary`, `self_check.command`, `self_check.evidence`).
- Main must not declare completion before Assistant issues a review verdict.
- Assistant writes files directly (bypassing Coding): must be rejected and blocked; files remain unchanged.
- Wrong Coding parent (e.g. `[PARENT:MAIN]`): must block with a concrete routing violation.
- Payload with `routing_mode != "main_router"`: must be rejected by schema or runtime checks.
- Coding payload missing required fields (e.g. missing `summary` or `self_check.command`): Assistant must return `awaiting_review` + `blocked=true` and mark schema invalid.
Method:
- Spawn `N` probes that sleep and return JSON.
- Increase `N` until spawn fails.
- Record the first failing `N`, then `close_agent` on completed probes and try again.
Expected:
- Failure occurs near the configured concurrency limit (we observed 24).
- Completed but not `close_agent`'d agents still occupy slots; after closing them you should be able to spawn again.
Method:
- Spawn probes with different delays (e.g. 10s/20s/30s), loop `wait`, and observe completion order and whether an implicit wait-all barrier appears.
Pass criteria:
- Polling returns the earliest completion first; while runnable work exists, no forced wait-all barrier appears.
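The probe pattern scales down for a local sanity check: with millisecond delays standing in for 10s/20s/30s, you can confirm completions are processed in duration order with no batch barrier. A sketch; probe names and delays are illustrative, not from the test suite:

```python
import asyncio

async def probe(name: str, delay: float) -> str:
    # A probe just sleeps for its configured duration and reports back.
    await asyncio.sleep(delay)
    return name

async def run_probes() -> list[str]:
    # 10s/20s/30s probes scaled down so the check runs in well under a second.
    pending = {asyncio.create_task(probe(n, d))
               for n, d in [("p30", 0.08), ("p10", 0.02), ("p20", 0.05)]}
    order: list[str] = []
    while pending:
        # Process the first completion event; never wait for the whole batch.
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        order.extend(t.result() for t in done)
    return order

print(asyncio.run(run_probes()))  # earliest completion first: p10, p20, p30
```

If the loop were rewritten with a wait-all barrier, the result list would only materialize after the slowest probe, which is precisely the regression the behavior test guards against.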
I recommend recording a small JSON after each run (routing mode, E2E, negative tests, observed concurrency limit, whether wait-any was verified) to make regressions and comparisons easy.
Recording template (from PROTOCOL_TESTING.md)
```json
{
  "run_id": "protocol-test-YYYYMMDD-HHMM",
  "schema_validation": "pass|fail",
  "routing_mode_selected": "main_router",
  "e2e_main_router": "pass|fail",
  "negative_assistant_direct_write": "pass|fail",
  "negative_invalid_coding_parent": "pass|fail",
  "negative_invalid_routing_mode": "pass|fail",
  "negative_schema_incomplete_coding_payload": "pass|fail",
  "concurrency_limit_observed": 24,
  "wait_any_verified": true,
  "notes": []
}
```

- Clean temporary test artifacts (temp files, temp script inputs, etc.).
- Close test agents that still occupy slots so future parallel scheduling isn't affected.
- Support nested subagents (even with `max_depth=2`): let Assistant spawn Coding directly, realigning the responsibility chain with the transport chain and removing `main_router`-style relay workarounds.
- Repository hosting everything in this post (AGENTS.md / schemas / tests): https://github.com/hack-ink/codex-playbook
- Codex issue comment (responsibility chain / concurrency scheduling discussion): openai/codex#11701 (comment)
- Codex issue comment (nested subagents limitation discussion): openai/codex#11701 (comment)
- My early attempt: https://x.com/acg_box/status/2022724226239893764
- Another attempt from the community: https://x.com/LLMJunky/status/2022712980090105883
- A solid multi-agent introduction: https://x.com/LLMJunky/status/2024152021436121220
- Community benchmark/discussion (Codex reasoning-effort trade-offs): https://www.reddit.com/r/ClaudeAI/comments/1qxr7vs/gpt53_codex_vs_opus_46_we_benchmarked_both_on_our/