AgentOS — Open-Source TypeScript AI Agent Runtime with Cognitive Memory, HEXACO Personality, and Runtime Tool Forging

85.6% on LongMemEval-S at $0.0090/correct, +1.4 above Mastra OM gpt-4o (84.23%) · 70.2% on LongMemEval-M (1.5M-token variant), the only open-source library on the public record above 65% on M with publicly reproducible methodology · 16 LLM providers · 8 neuroscience-backed memory mechanisms · Apache-2.0


Benchmarks · Website · Docs · npm · Discord · Blog


AgentOS is an open-source TypeScript runtime for AI agents that remember, adapt, and write their own tools.

When an agent encounters a sub-task no existing tool covers, it generates a TypeScript function with a Zod-described schema, sends it through an LLM judge, and on approval runs it in a hardened node:vm sandbox. The new tool joins the catalog for the rest of the session. When a multi-agent team hits a capability gap, the manager calls spawn_specialist and the LLM judge reviews the synthesized agent spec before it joins the live roster.

The runtime carries the parts of an agent that should outlive a single chat completion: persistent cognitive memory (Ebbinghaus decay, retrieval-induced forgetting, reconsolidation, source-confidence decay) grounded in published cognitive-science literature, optional HEXACO personality vectors that bias retrieval and routing, six multi-agent orchestration strategies, streaming guardrails, a voice pipeline, and one dispatch interface across 21 providers (16 LLM plus 5 image/video). Apache-2.0.
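To make the first of those mechanisms concrete, here is a minimal sketch of an Ebbinghaus-style retention curve, R(t) = e^(-t/S). The function name and parameters are illustrative, not the AgentOS API.

```typescript
// Illustrative only: exponential retention R(t) = e^(-t/S), where S is a
// stability constant in hours. Names here are hypothetical, not AgentOS code.
function retention(hoursSinceEncoding: number, stabilityHours: number): number {
  return Math.exp(-hoursSinceEncoding / stabilityHours);
}

// A frequently-rehearsed fact (high stability) outlives a one-off mention.
const casual = retention(48, 12);    // decays fast
const rehearsed = retention(48, 96); // decays slowly
```

Retrieval mechanisms like reconsolidation then act by adjusting the stability term rather than the raw score.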

100+ first-party extensions (channel adapters, tool packs, guardrail packs) and 88 curated SKILL.md skills auto-discover at startup through their respective registries: a host pulls a curated index and the runtime wires every tool, guardrail, channel, and skill without manual registration. The auto-loader is the same surface that runtime-forged tools join: an agent that invents a function in session N can promote it (with judge approval and SkillExporter) into a SKILL.md that the registry picks up on the next process start. Forging is how the surface grows mid-run; auto-discovery is how it ships as a first-class capability afterward.

On benchmarks: 85.6% on LongMemEval-S at $0.0090 per correct answer (gpt-4o reader, +1.4 points above Mastra's published 84.23%, 0.4 points behind Emergence.ai's 86% closed-source SaaS SOTA); 70.2% on LongMemEval-M (1.5M-token haystacks, 500 sessions per question), the only open-source library on the public record above 65% on M with publicly reproducible methodology. Per-case run JSONs and single-CLI reproduction ship in agentos-bench.


Install

npm install @framers/agentos

import { agent } from '@framers/agentos';

const tutor = agent({
  provider: 'anthropic',
  instructions: 'You are a patient CS tutor.',
  personality: { openness: 0.9, conscientiousness: 0.95 },
  memory: { types: ['episodic', 'semantic'], working: { enabled: true } },
});

const session = tutor.session('student-1');
await session.send('Explain recursion with an analogy.');
await session.send('Can you expand on that?'); // remembers context

Full quickstart · Examples cookbook · API reference


Emergent Design

"So we and our elaborately evolving computers may meet each other halfway."

— Philip K. Dick, The Android and the Human, 1972

Three things accumulate across an AgentOS session and compose into behavior:

  1. Memory. What was said, what was decided, what was retrieved.
  2. Tool surface. Starts at whatever was registered. Can grow when an agent forges a new function mid-decision and the judge approves it.
  3. Personality (optional). A HEXACO trait vector that biases retrieval, specialist routing, and decision-making.

Each is configurable and observable; none crosses into "emergent agent" on its own. The composition is the interesting part.

Runtime Tool Forging

When an agent encounters a sub-task that no available tool covers, it generates a TypeScript function with a Zod-described input and output schema. A separate LLM call evaluates the forged function against the agent's stated intent and either approves or rejects it. Approved functions execute in a hardened node:vm sandbox with strict defaults (5-second wall clock, 128 MB heap-delta budget, eval / require / process banned, fetch / fs / crypto allowlist-empty by default). Approved tools join a discoverable index keyed by name and signature; subsequent turns invoke them via call_forged_tool(name, args). First forge costs full LLM tokens; reuse costs tens of tokens. Sandbox internals, isolation tradeoffs (node:vm vs queued isolated-vm for the hosted multi-tenant tier), and the full safety policy are in the emergent capabilities docs.

The pattern the runtime supports: an agent forges a tool mid-decision, the judge approves it, that turn invokes it, and a few turns later a different specialist agent in the same session invokes the same tool because the index made it findable. Promoted tools can be exported as SKILL.md skills via SkillExporter and join the auto-discovery surface on the next process start.
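A minimal sketch of the execution step, assuming the judge has already approved the forged source. The real sandbox also enforces the 128 MB heap-delta budget and the allowlists described above; this shows only the empty-by-default global surface and the wall-clock timeout.

```typescript
import { createContext, runInContext } from 'node:vm';

// Sketch only: run a judge-approved function body in a bare vm context.
// Nothing but `args` and `result` is in scope: no require, process, or fetch.
function runForged(source: string, args: unknown): unknown {
  const sandbox = createContext({ args, result: undefined });
  // 5-second wall-clock budget, matching the default described above.
  runInContext(`result = (${source})(args);`, sandbox, { timeout: 5000 });
  return sandbox.result;
}

const sum = runForged('(x) => x.a + x.b', { a: 2, b: 3 }); // 5
```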

HEXACO Personality (optional)

Personality is opt-in. The runtime behaves identically with or without a trait vector, and most production deployments do not pass one.

// Personality-neutral (most production agents)
const support = agent({
  provider: 'openai',
  instructions: 'Resolve customer tickets.',
  memory: { types: ['episodic', 'semantic'] },
});

// Opt-in HEXACO (when persona consistency across sessions matters)
const coach = agent({
  provider: 'openai',
  instructions: "Long-running career coach. Hold the user accountable to their stated goals across weekly check-ins; flag drift, push back on excuses, escalate when goals shift.",
  personality: {
    conscientiousness: 0.9,    // won't let goals drift between sessions
    honestyHumility: 0.85,     // won't tell the user what they want to hear
    emotionality: 0.3,         // stays steady when the user is reactive
  },
  memory: { types: ['episodic', 'semantic'] },
});

When a vector is supplied, the kernel weights retrieval, specialist routing, and tool selection by the trait values. Same agent, same prompt, same tools: a high-Openness leader and a high-Conscientiousness leader produce measurably different decision sequences. Personality lives in the kernel, not in the prompt — prompt-only personality dissolves under context pressure while kernel-encoded bias persists. The vector remains editable, inspectable, and removable on consent.
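As a sketch of what "weights retrieval by trait values" can mean, here is a hypothetical scoring function. The candidate fields and the weighting scheme are invented for illustration and are not AgentOS internals.

```typescript
// Hypothetical trait-biased retrieval score. 0.5 is the neutral trait value;
// a neutral vector leaves the similarity-only ranking unchanged.
interface Candidate { similarity: number; novelty: number; riskiness: number }
interface Traits { openness?: number; conscientiousness?: number }

function biasedScore(c: Candidate, t: Traits): number {
  const o = t.openness ?? 0.5;
  const con = t.conscientiousness ?? 0.5;
  // High Openness up-weights novel memories; high Conscientiousness
  // down-weights risky ones.
  return c.similarity + (o - 0.5) * c.novelty - (con - 0.5) * c.riskiness;
}
```

Because the bias lives in the scoring function rather than the prompt, it applies on every retrieval regardless of context pressure.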


Memory Benchmarks

gpt-4o reader, gpt-4o-2024-08-06 judge, full N=500 across every row. Cross-provider numbers are excluded from the tables because their public methodology disclosures don't admit reproduction.

LongMemEval-S (115K tokens, 50 sessions)

| System | Accuracy | $/correct | p50 latency |
| --- | --- | --- | --- |
| EmergenceMem Internal | 86.0% | not published | 5,650 ms |
| AgentOS (canonical-hybrid + reader-router) | 85.6% | $0.0090 | 3,558 ms |
| Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published |
| Supermemory gpt-4o | 81.6% | not published | not published |
| EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms |
| Zep (self / independent reproduction) | 71.2% / 63.8% | not published | not published |

+1.4 points above Mastra OM. EmergenceMem Internal posts 86.0% (0.4 above) but doesn't publish per-case results or a reproducible CLI; among open-source libraries with single-CLI reproduction at gpt-4o, 85.6% is the highest publicly reproducible score we are aware of. p50 latency is 3,558 ms vs EmergenceMem's published median of 5,650 ms.

Cross-provider numbers omitted from the table (different reader and/or undisclosed judge): Mastra OM 94.87% (gpt-5-mini + gemini-2.5-flash observer), agentmemory 96.2% (Claude Opus 4.6), MemMachine 93.0% (GPT-5-mini), Hindsight 91.4% (unspecified backbone).

LongMemEval-M (1.5M tokens, 500 sessions)

M's haystacks exceed every production context window; most vendors only publish on S.

| System | Accuracy | License |
| --- | --- | --- |
| LongMemEval paper, GPT-4o round Top-10 (paper's best) | 72.0% | open repo |
| AgentBrain | 71.7% | closed-source SaaS |
| LongMemEval paper, GPT-4o session Top-5 | 71.4% | open repo |
| AgentOS (sem-embed + reader-router + Top-5) | 70.2% | Apache-2.0 |
| LongMemEval paper, GPT-4o round Top-5 | 65.7% | open repo |
| Mem0 v3, Mastra, Hindsight, Zep, EmergenceMem, Supermemory, Letta | not published | — |

At matched Top-5 retrieval, +4.5 above the round-level paper baseline (65.7%) and 1.2 below the session-level (71.4%); the paper's overall strongest GPT-4o result is 72.0% at Top-10. Of open-source libraries with publicly reproducible runs, AgentOS is the only one above 65% on M.

Full leaderboard → · Run JSONs → · Transparency audit → · LongMemEval paper (Wu et al., ICLR 2025, Table 3)

Methodology stack: bootstrap 95% CIs at 10k Mulberry32 resamples (seed 42), per-benchmark judge-FPR probes (S 1%, M 2%, LOCOMO 0%), per-case run JSONs, single-CLI reproduction. The transparency audit covers what the headline numbers don't: LOCOMO's ~6.4% answer-key error rate, the LongMemEval-S context-window confound, and the Mem0-vs-Zep comparison gaming case study, alongside which vendors disclose which methodology dimensions.
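The bootstrap procedure itself is standard and can be sketched directly. This mirrors the stated parameters (Mulberry32 PRNG, 10k resamples, fixed seed), but the function names are ours, not the agentos-bench API.

```typescript
// Mulberry32: a small, seedable 32-bit PRNG, so every resample run is
// byte-for-byte reproducible at a fixed seed.
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap 95% CI over per-case correctness flags (1 = correct).
function bootstrapCI95(correct: number[], resamples = 10_000, seed = 42): [number, number] {
  const rand = mulberry32(seed);
  const n = correct.length;
  const means = new Array<number>(resamples);
  for (let i = 0; i < resamples; i++) {
    let sum = 0;
    for (let j = 0; j < n; j++) sum += correct[(rand() * n) | 0]; // resample with replacement
    means[i] = sum / n;
  }
  means.sort((x, y) => x - y);
  return [means[Math.floor(resamples * 0.025)], means[Math.floor(resamples * 0.975)]];
}
```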


Ecosystem

| Package | Role |
| --- | --- |
| @framers/agentos | Core runtime: GMI agents, cognitive memory, multi-agent orchestration, guardrails, voice, 21 providers (16 LLM plus 5 image/video). Apache 2.0. |
| @framers/agentos-extensions | 100+ first-party extensions and templates: channel adapters, tool packs, integrations, guardrail packs. |
| @framers/agentos-extensions-registry | Discovery + auto-loader layer for the extensions catalog. Hosts pull the index without pulling every implementation; the runtime resolves and registers packs at startup. |
| @framers/agentos-skills | 88 curated SKILL.md skills covering common tasks. |
| @framers/agentos-skills-registry | Discovery + auto-loader layer for the skills catalog. Also the surface where promoted forged tools land after SkillExporter. |
| @framers/agentos-bench | Open benchmark harness. Bootstrap 95% CIs at 10k resamples, judge false-positive-rate probes, per-case run JSONs at fixed seed. MIT (the rest of AgentOS is Apache 2.0). |
| @framers/sql-storage-adapter | Cross-platform SQL persistence: SQLite, Postgres, IndexedDB, Capacitor SQLite. |
| paracosm | AI agent swarm simulation engine that uses AgentOS as its substrate. |
| wunderland | Sister project (preview): batteries-included CLI plus daemon over the AgentOS extension and skill registries. 28-command CLI, 5-tier security, 8 agent presets, step-up HITL. github.com/jddunn/wunderland. MIT. |

Extensions and skills auto-load at startup. The runtime walks each registry plus any user-supplied paths, resolves each pack's createExtensionPack(context) factory or SKILL.md frontmatter, and registers tools, guardrails, channels, and skills without manual wiring. Capability gating and HITL approval gates apply to side-effecting installs. See extensions architecture for the full loading model.
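To show the shape the loader resolves, here is a hypothetical createExtensionPack factory. The interface fields are guesses for illustration and may not match the published @framers/agentos-extensions types.

```typescript
// Hypothetical pack shape; field names are illustrative, not the real types.
interface ToolDef {
  name: string;
  run: (args: Record<string, unknown>) => unknown;
}
interface ExtensionPack {
  name: string;
  tools: ToolDef[];
}

// The auto-loader calls the factory with a runtime-supplied context at startup
// and registers everything the returned pack declares.
function createExtensionPack(context: { log: (msg: string) => void }): ExtensionPack {
  context.log('registering weather pack');
  return {
    name: 'weather-pack',
    tools: [
      { name: 'get_weather', run: ({ city }) => `forecast for ${String(city)}` },
    ],
  };
}
```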


📄 Technical Whitepaper · Coming Soon

The full architecture and benchmark methodology, written for engineers and researchers who want a citable PDF instead of scrolling docs. Cognitive memory pipeline, classifier-driven dispatch, HEXACO personality modulation, runtime tool forging, full LongMemEval-S/M and LOCOMO benchmark methodology with confidence interval math, judge-FPR probes, per-stage retention metrics, and reproducibility recipes.

| Covers | What's inside |
| --- | --- |
| Architecture | Generalized Mind Instances, IngestRouter / MemoryRouter / ReadRouter, 8 cognitive mechanisms with primary-source citations |
| Benchmarks | LongMemEval-S 85.6%, LongMemEval-M 70.2%, vendor landscape, confidence interval methodology, judge FPR probes, full transparency stack |
| Reproducibility | Per-case run JSONs at --seed 42, single-CLI reproduction, Apache-2.0 bench at github.com/framersai/agentos-bench |

Join Discord for the announcement → · Read the benchmarks now →


Classifier-Driven Memory Pipeline

Most memory libraries retrieve on every query. AgentOS gates memory through three LLM-as-judge classifiers in a single shared pass, so trivial queries skip retrieval entirely and the rest get the right architecture and reader per category.

User query
    │
    ▼ Stage 1: QueryClassifier (gpt-5-mini, ~$0.0001/query)
    │    T0=none ─────► answer from context, skip retrieval
    │    T1+=needs memory
    ▼ Stage 2: MemoryRouter      → canonical-hybrid · OM-v10 · OM-v11
    ▼ Stage 3: ReaderRouter      → gpt-4o (TR/SSU) · gpt-5-mini (SSA/SSP/KU/MS)
    ▼
Grounded answer

Stages 2 and 3 reuse the Stage 1 classification, so the full pipeline costs one classifier call per query, not three. The T0 / no-memory gate is the novel piece: removing retrieval entirely for greetings and small talk saves the embedding + rerank + reader cost on a substantial fraction of typical agent traffic.
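The gating logic reduces to a small dispatch. This toy version stubs the classifier call and simplifies the architecture/reader mapping, which in the real pipeline is per-category (TR/SSU vs SSA/SSP/KU/MS), not per-tier.

```typescript
// Toy version of the tiered gate; the classifier stub stands in for the
// gpt-5-mini call, and the routing map is simplified for illustration.
type Tier = 'T0' | 'T1' | 'T2' | 'T3';

function handle(query: string, classify: (q: string) => Tier): string {
  const tier = classify(query); // one classifier call; Stages 2-3 reuse it
  if (tier === 'T0') return 'answered from context (retrieval skipped)';
  const architecture = tier === 'T1' ? 'canonical-hybrid' : 'OM-v11';
  const reader = tier === 'T3' ? 'gpt-4o' : 'gpt-5-mini';
  return `retrieve via ${architecture}, read with ${reader}`;
}
```

The T0 branch is the cost saver: greetings and small talk exit before any embedding, rerank, or reader call happens.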

| Primitive | Source | Decision |
| --- | --- | --- |
| QueryClassifier | @framers/agentos/query-router | T0/none vs T1/simple vs T2/moderate vs T3/complex |
| MemoryRouter | @framers/agentos/memory-router | canonical-hybrid vs observational-memory-v10 vs v11 |
| ReaderRouter | @framers/agentos/memory-router | gpt-4o vs gpt-5-mini per category |

Cognitive Pipeline docs → · Architecture deep dive → · Beyond RAG →


Why AgentOS

| vs. | AgentOS differentiator |
| --- | --- |
| LangChain / LangGraph | Cognitive memory (8 neuroscience-backed mechanisms), HEXACO personality, runtime tool forging |
| Vercel AI SDK | Multi-agent teams (6 strategies), 7 vector backends, guardrails, voice/telephony |
| CrewAI / Mastra | Unified orchestration (DAGs + graphs + missions), personality-driven routing, published reproducible numbers on LongMemEval-S (85.6%) and LongMemEval-M (70.2%) with full methodology disclosure |

Full framework comparison →


Key Features

| Category | Highlights |
| --- | --- |
| LLM Providers | 16: OpenAI, Anthropic, Gemini, Groq, Ollama, OpenRouter, Together, Mistral, xAI, Claude/Gemini CLI, + 5 image/video |
| Cognitive Memory | 8 mechanisms: reconsolidation, retrieval-induced forgetting, involuntary recall, FOK, gist extraction, schema encoding, source decay, emotion regulation |
| HEXACO Personality | 6 traits modulate memory, retrieval bias, response style |
| RAG Pipeline | 7 vector backends · 4 retrieval strategies · GraphRAG · HyDE · Cohere rerank-v3.5 |
| Multi-Agent Teams | 6 coordination strategies · shared memory · inter-agent messaging · HITL gates |
| Orchestration | workflow() DAGs · AgentGraph cycles · mission() goal-driven planning · checkpointing |
| Guardrails | 5 security tiers · 6 packs (PII, ML classifiers, topicality, code safety, grounding, content policy) |
| Emergent Capabilities | Runtime tool forging · 4 self-improvement tools · tiered promotion · skill export |
| Voice & Telephony | ElevenLabs, Deepgram, Whisper · Twilio, Telnyx, Plivo |
| Channels | 37 platform adapters (Telegram, Discord, Slack, WhatsApp, webchat, ...) |
| Observability | OpenTelemetry · usage ledger · cost guard · circuit breaker |

Multi-Agent in 6 Lines

import { agency } from '@framers/agentos';

const team = agency({
  strategy: 'graph',
  agents: {
    researcher: { provider: 'anthropic', instructions: 'Find relevant facts.' },
    writer:     { provider: 'openai',    instructions: 'Summarize clearly.',  dependsOn: ['researcher'] },
    reviewer:   { provider: 'gemini',    instructions: 'Check accuracy.',     dependsOn: ['writer'] },
  },
});

const result = await team.generate('Compare TCP vs UDP for game networking.');

Strategies: sequential · parallel · debate · review-loop · hierarchical · graph. With strategy: 'hierarchical' + emergent: { enabled: true }, the manager LLM gets a spawn_specialist tool that mints new sub-agents at runtime when the static roster doesn't cover a sub-task. agency() is for single-request multi-agent coordination — for long-running world simulations or per-turn parallel agent loops, build your own orchestration with agent() + the lower-level primitives. Multi-agent docs → · Hierarchical + emergent → · Scope guide →


See It In Action

🌀 Paracosm — AI Agent Swarm Simulation

Define any scenario as JSON. Run it with AI commanders that have different HEXACO personalities. Same starting conditions, different decisions, divergent civilizations. Built on AgentOS.

npm install paracosm

Live Demo · GitHub · npm


Configure API Keys

Three layers, highest priority first:

// 1. Inline on the call (per-tenant, per-test, per-customer)
generateText({ apiKey: 'sk-customer', prompt: '...' });

// 2. Module-level default — set once at boot, no .env needed
import { setDefaultProvider } from '@framers/agentos';
setDefaultProvider({ provider: 'openai', apiKey: process.env.MY_OWN_KEY });

// 2b. Reorder the env-var auto-detect chain instead (when you keep multiple keys)
import { setProviderPriority } from '@framers/agentos';
setProviderPriority(['anthropic', 'openai', 'ollama']);

# 3. Environment variable auto-detect chain (default order)
#    OpenRouter → OpenAI → Anthropic → Gemini → Groq → Together → Mistral
#    → xAI → claude CLI → gemini CLI → Ollama → image providers
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GEMINI_API_KEY=AIza...

# Comma-separated keys auto-rotate with quota detection
export OPENAI_API_KEY=sk-key1,sk-key2,sk-key3
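Comma-separated rotation amounts to a cursor over the key list. This is an illustrative sketch of the concept; the real resolver's quota detection and rotation policy live in the credential docs.

```typescript
// Illustrative only: parse a comma-separated key list and advance the cursor
// when the caller detects a quota error. Names here are ours, not the runtime's.
function makeKeyRotator(envValue: string) {
  const keys = envValue.split(',').map((k) => k.trim()).filter(Boolean);
  let i = 0;
  return {
    current: () => keys[i % keys.length],
    rotateOnQuota: () => keys[++i % keys.length], // e.g. called on a 429
  };
}
```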

Full credential resolution + default models per provider →


API Surfaces

  • agent(): lightweight stateful agent. Prompts, sessions, personality, hooks, tools, memory.
  • agency(): multi-agent teams + full runtime. Emergent tooling, guardrails, RAG, voice, channels, HITL.
  • generateText() / streamText() / generateObject() / generateImage() / generateVideo() / generateMusic() / performOCR() / embedText(): low-level multi-modal helpers with native tool calling.
  • workflow() / AgentGraph / mission(): three orchestration authoring APIs over one graph runtime.

Provider fallback is an explicit opt-in via agent({ fallbackProviders: [...] }) (or buildFallbackChain() for programmatic chains). Defaults to off — the runtime never silently retries against a different provider unless you configured a chain.
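Behaviorally, an opt-in chain is an ordered try/catch loop. This generic sketch shows the concept; the helper name is ours, not the library's.

```typescript
// Generic sketch of the explicit fallback concept: providers are tried in the
// configured order, and only a configured chain ever crosses providers.
async function withFallback<T>(
  providers: string[],
  call: (provider: string) => Promise<T>,
): Promise<T> {
  let lastError: unknown = new Error('no providers configured');
  for (const provider of providers) {
    try {
      return await call(provider); // first success wins
    } catch (err) {
      lastError = err; // then, and only then, try the next provider
    }
  }
  throw lastError;
}
```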

Full API reference → · High-Level API guide →


Documentation & Community


Contributing

git clone https://github.com/framersai/agentos.git && cd agentos
pnpm install && pnpm build && pnpm test

Contributing Guide · We use Conventional Commits.


License

Apache 2.0


Built by Manic Agency LLC · Frame.dev · Wilds.ai
