AgentOS — Open-Source TypeScript AI Agent Runtime with Cognitive Memory, HEXACO Personality, and Runtime Tool Forging

85.6% on LongMemEval-S at $0.0090/correct, +1.4 above Mastra OM gpt-4o (84.23%) · 70.2% on LongMemEval-M (1.5M-token variant), the only open-source library on the public record above 65% on M with publicly reproducible methodology · 16 LLM providers · 8 neuroscience-backed memory mechanisms · Apache-2.0


Benchmarks · Website · Docs · npm · Discord · Blog


AgentOS is an open-source TypeScript runtime for AI agents that remember, adapt, and write their own tools.

When an agent encounters a sub-task no existing tool covers, it generates a TypeScript function with a Zod-described schema, sends it through an LLM judge, and on approval runs it in a hardened node:vm sandbox. The new tool joins the catalog for the rest of the session. When a multi-agent team hits a capability gap, the manager calls spawn_specialist and the LLM judge reviews the synthesized agent spec before it joins the live roster.

The runtime carries the parts of an agent that should outlive a single chat completion: persistent cognitive memory (Ebbinghaus decay, retrieval-induced forgetting, reconsolidation, source-confidence decay) grounded in published cognitive-science literature, optional HEXACO personality vectors that bias retrieval and routing, six multi-agent orchestration strategies, streaming guardrails, a voice pipeline, and one dispatch interface across 21 providers (16 LLM plus 5 image/video). Apache-2.0.
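To make the first of those mechanisms concrete, here is a minimal sketch of an Ebbinghaus-style retention curve, R(t) = e^(-t/S). The function name and parameters are illustrative, not the AgentOS API.

```typescript
// Illustrative only: exponential retention R(t) = e^(-t/S), where S is a
// stability constant in hours. Names here are hypothetical, not AgentOS code.
function retention(hoursSinceEncoding: number, stabilityHours: number): number {
  return Math.exp(-hoursSinceEncoding / stabilityHours);
}

// A frequently-rehearsed fact (high stability) outlives a one-off mention.
const casual = retention(48, 12);    // decays fast
const rehearsed = retention(48, 96); // decays slowly
```

Retrieval mechanisms like reconsolidation then act by adjusting the stability term rather than the raw score.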

100+ first-party extensions (channel adapters, tool packs, guardrail packs) and 88 curated SKILL.md skills auto-discover at startup through their respective registries: a host pulls a curated index and the runtime wires every tool, guardrail, channel, and skill without manual registration. The auto-loader is the same surface that runtime-forged tools join: an agent that invents a function in session N can promote it (with judge approval and SkillExporter) into a SKILL.md that the registry picks up on the next process start. Forging is how the surface grows mid-run; auto-discovery is how it ships as a first-class capability afterward.

On benchmarks: 85.6% on LongMemEval-S at $0.0090 per correct answer (gpt-4o reader, +1.4 points above Mastra's published 84.23%, 0.4 points behind Emergence.ai's 86% closed-source SaaS SOTA); 70.2% on LongMemEval-M (1.5M-token haystacks, 500 sessions per question), the only open-source library on the public record above 65% on M with publicly reproducible methodology. Per-case run JSONs and single-CLI reproduction ship in agentos-bench.


Install

npm install @framers/agentos

import { agent } from '@framers/agentos';

const tutor = agent({
  provider: 'anthropic',
  instructions: 'You are a patient CS tutor.',
  personality: { openness: 0.9, conscientiousness: 0.95 },
  memory: { types: ['episodic', 'semantic'], working: { enabled: true } },
});

const session = tutor.session('student-1');
await session.send('Explain recursion with an analogy.');
await session.send('Can you expand on that?'); // remembers context

Full quickstart · Examples cookbook · API reference


Emergent Design

"So we and our elaborately evolving computers may meet each other halfway."

— Philip K. Dick, The Android and the Human, 1972

Three things accumulate across an AgentOS session and compose into behavior:

  1. Memory. What was said, what was decided, what was retrieved.
  2. Tool surface. Starts at whatever was registered. Can grow when an agent forges a new function mid-decision and the judge approves it.
  3. Personality (optional). A HEXACO trait vector that biases retrieval, specialist routing, and decision-making.

Each is configurable and observable; none crosses into "emergent agent" on its own. The composition is the interesting part.

Runtime Tool Forging

When an agent encounters a sub-task that no available tool covers, it generates a TypeScript function with a Zod-described input and output schema. A separate LLM call evaluates the forged function against the agent's stated intent and either approves or rejects it. Approved functions execute in a hardened node:vm sandbox with strict defaults (5-second wall clock, 128 MB heap-delta budget, eval / require / process banned, fetch / fs / crypto allowlist-empty by default). Approved tools join a discoverable index keyed by name and signature; subsequent turns invoke them via call_forged_tool(name, args). First forge costs full LLM tokens; reuse costs tens of tokens. Sandbox internals, isolation tradeoffs (node:vm vs queued isolated-vm for the hosted multi-tenant tier), and the full safety policy are in the emergent capabilities docs.

The pattern the runtime supports: an agent forges a tool mid-decision, the judge approves it, that turn invokes it, and a few turns later a different specialist agent in the same session invokes the same tool because the index made it findable. Promoted tools can be exported as SKILL.md skills via SkillExporter and join the auto-discovery surface on the next process start.
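A minimal sketch of the execution step, assuming the judge has already approved the forged source. The real sandbox also enforces the 128 MB heap-delta budget and the allowlists described above; this shows only the empty-by-default global surface and the wall-clock timeout.

```typescript
import { createContext, runInContext } from 'node:vm';

// Sketch only: run a judge-approved function body in a bare vm context.
// Nothing but `args` and `result` is in scope: no require, process, or fetch.
function runForged(source: string, args: unknown): unknown {
  const sandbox = createContext({ args, result: undefined });
  // 5-second wall-clock budget, matching the default described above.
  runInContext(`result = (${source})(args);`, sandbox, { timeout: 5000 });
  return sandbox.result;
}

const sum = runForged('(x) => x.a + x.b', { a: 2, b: 3 }); // 5
```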

HEXACO Personality (optional)

Personality is opt-in. The runtime behaves identically with or without a trait vector, and most production deployments do not pass one.

// Personality-neutral (most production agents)
const support = agent({
  provider: 'openai',
  instructions: 'Resolve customer tickets.',
  memory: { types: ['episodic', 'semantic'] },
});

// Opt-in HEXACO (when persona consistency across sessions matters)
const coach = agent({
  provider: 'openai',
  instructions: "Long-running career coach. Hold the user accountable to their stated goals across weekly check-ins; flag drift, push back on excuses, escalate when goals shift.",
  personality: {
    conscientiousness: 0.9,    // won't let goals drift between sessions
    honestyHumility: 0.85,     // won't tell the user what they want to hear
    emotionality: 0.3,         // stays steady when the user is reactive
  },
  memory: { types: ['episodic', 'semantic'] },
});

When a vector is supplied, the kernel weights retrieval, specialist routing, and tool selection by the trait values. Same agent, same prompt, same tools: a high-Openness leader and a high-Conscientiousness leader produce measurably different decision sequences. Personality lives in the kernel, not in the prompt — prompt-only personality dissolves under context pressure while kernel-encoded bias persists. The vector remains editable, inspectable, and removable on consent.
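As a sketch of what "weights retrieval by trait values" can mean, here is a hypothetical scoring function. The candidate fields and the weighting scheme are invented for illustration and are not AgentOS internals.

```typescript
// Hypothetical trait-biased retrieval score. 0.5 is the neutral trait value;
// a neutral vector leaves the similarity-only ranking unchanged.
interface Candidate { similarity: number; novelty: number; riskiness: number }
interface Traits { openness?: number; conscientiousness?: number }

function biasedScore(c: Candidate, t: Traits): number {
  const o = t.openness ?? 0.5;
  const con = t.conscientiousness ?? 0.5;
  // High Openness up-weights novel memories; high Conscientiousness
  // down-weights risky ones.
  return c.similarity + (o - 0.5) * c.novelty - (con - 0.5) * c.riskiness;
}
```

Because the bias lives in the scoring function rather than the prompt, it applies on every retrieval regardless of context pressure.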


Memory Benchmarks

gpt-4o reader, gpt-4o-2024-08-06 judge, full N=500 across every row. Cross-provider numbers are excluded from the tables because their public methodology disclosures don't admit reproduction.

LongMemEval-S (115K tokens, 50 sessions)

| System | Accuracy | $/correct | p50 latency |
| --- | --- | --- | --- |
| EmergenceMem Internal | 86.0% | not published | 5,650 ms |
| AgentOS (canonical-hybrid + reader-router) | 85.6% | $0.0090 | 3,558 ms |
| Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published |
| Supermemory gpt-4o | 81.6% | not published | not published |
| EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms |
| Zep (self / independent reproduction) | 71.2% / 63.8% | not published | not published |

+1.4 points above Mastra OM. EmergenceMem Internal posts 86.0% (0.4 above) but doesn't publish per-case results or a reproducible CLI; among open-source libraries with single-CLI reproduction at gpt-4o, 85.6% is the highest publicly reproducible score we are aware of. p50 latency is 3,558 ms vs EmergenceMem's published median of 5,650 ms.

Cross-provider numbers omitted from the table (different reader and/or undisclosed judge): Mastra OM 94.87% (gpt-5-mini + gemini-2.5-flash observer), agentmemory 96.2% (Claude Opus 4.6), MemMachine 93.0% (GPT-5-mini), Hindsight 91.4% (unspecified backbone).

LongMemEval-M (1.5M tokens, 500 sessions)

M's haystacks exceed every production context window; most vendors only publish on S.

| System | Accuracy | License |
| --- | --- | --- |
| LongMemEval paper, GPT-4o round Top-10 (paper's best) | 72.0% | open repo |
| AgentBrain | 71.7% | closed-source SaaS |
| LongMemEval paper, GPT-4o session Top-5 | 71.4% | open repo |
| AgentOS (sem-embed + reader-router + Top-5) | 70.2% | Apache-2.0 |
| LongMemEval paper, GPT-4o round Top-5 | 65.7% | open repo |
| Mem0 v3, Mastra, Hindsight, Zep, EmergenceMem, Supermemory, Letta | not published | — |

At matched Top-5 retrieval, +4.5 above the round-level paper baseline (65.7%) and 1.2 below the session-level (71.4%); the paper's overall strongest GPT-4o result is 72.0% at Top-10. Of open-source libraries with publicly reproducible runs, AgentOS is the only one above 65% on M.

Full leaderboard → · Run JSONs → · Transparency audit → · LongMemEval paper (Wu et al., ICLR 2025, Table 3)

Methodology stack: bootstrap 95% CIs at 10k Mulberry32 resamples (seed 42), per-benchmark judge-FPR probes (S 1%, M 2%, LOCOMO 0%), per-case run JSONs, single-CLI reproduction. The transparency audit covers what the headline numbers don't: LOCOMO's ~6.4% answer-key error rate, the LongMemEval-S context-window confound, and the Mem0-vs-Zep comparison gaming case study, alongside which vendors disclose which methodology dimensions.
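The bootstrap procedure itself is standard and can be sketched directly. This mirrors the stated parameters (Mulberry32 PRNG, 10k resamples, fixed seed), but the function names are ours, not the agentos-bench API.

```typescript
// Mulberry32: a small, seedable 32-bit PRNG, so every resample run is
// byte-for-byte reproducible at a fixed seed.
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap 95% CI over per-case correctness flags (1 = correct).
function bootstrapCI95(correct: number[], resamples = 10_000, seed = 42): [number, number] {
  const rand = mulberry32(seed);
  const n = correct.length;
  const means = new Array<number>(resamples);
  for (let i = 0; i < resamples; i++) {
    let sum = 0;
    for (let j = 0; j < n; j++) sum += correct[(rand() * n) | 0]; // resample with replacement
    means[i] = sum / n;
  }
  means.sort((x, y) => x - y);
  return [means[Math.floor(resamples * 0.025)], means[Math.floor(resamples * 0.975)]];
}
```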


Ecosystem

| Package | Role |
| --- | --- |
| @framers/agentos | Core runtime: GMI agents, cognitive memory, multi-agent orchestration, guardrails, voice, 21 providers (16 LLM plus 5 image/video). Apache 2.0. |
| @framers/agentos-extensions | 100+ first-party extensions and templates: channel adapters, tool packs, integrations, guardrail packs. |
| @framers/agentos-extensions-registry | Discovery + auto-loader layer for the extensions catalog. Hosts pull the index without pulling every implementation; the runtime resolves and registers packs at startup. |
| @framers/agentos-skills | 88 curated SKILL.md skills covering common tasks. |
| @framers/agentos-skills-registry | Discovery + auto-loader layer for the skills catalog. Also the surface where promoted forged tools land after SkillExporter. |
| @framers/agentos-bench | Open benchmark harness. Bootstrap 95% CIs at 10k resamples, judge false-positive-rate probes, per-case run JSONs at fixed seed. MIT (the rest of AgentOS is Apache 2.0). |
| @framers/sql-storage-adapter | Cross-platform SQL persistence: SQLite, Postgres, IndexedDB, Capacitor SQLite. |
| paracosm | AI agent swarm simulation engine that uses AgentOS as its substrate. |
| wunderland | Sister project (preview): batteries-included CLI plus daemon over the AgentOS extension and skill registries. 28-command CLI, 5-tier security, 8 agent presets, step-up HITL. github.com/jddunn/wunderland. MIT. |

Extensions and skills auto-load at startup. The runtime walks each registry plus any user-supplied paths, resolves each pack's createExtensionPack(context) factory or SKILL.md frontmatter, and registers tools, guardrails, channels, and skills without manual wiring. Capability gating and HITL approval gates apply to side-effecting installs. See extensions architecture for the full loading model.
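To show the shape the loader resolves, here is a hypothetical createExtensionPack factory. The interface fields are guesses for illustration and may not match the published @framers/agentos-extensions types.

```typescript
// Hypothetical pack shape; field names are illustrative, not the real types.
interface ToolDef {
  name: string;
  run: (args: Record<string, unknown>) => unknown;
}
interface ExtensionPack {
  name: string;
  tools: ToolDef[];
}

// The auto-loader calls the factory with a runtime-supplied context at startup
// and registers everything the returned pack declares.
function createExtensionPack(context: { log: (msg: string) => void }): ExtensionPack {
  context.log('registering weather pack');
  return {
    name: 'weather-pack',
    tools: [
      { name: 'get_weather', run: ({ city }) => `forecast for ${String(city)}` },
    ],
  };
}
```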


📄 Technical Whitepaper · Coming Soon

The full architecture and benchmark methodology, written for engineers and researchers who want a citable PDF instead of scrolling docs. Cognitive memory pipeline, classifier-driven dispatch, HEXACO personality modulation, runtime tool forging, full LongMemEval-S/M and LOCOMO benchmark methodology with confidence interval math, judge-FPR probes, per-stage retention metrics, and reproducibility recipes.

| Covers | What's inside |
| --- | --- |
| Architecture | Generalized Mind Instances, IngestRouter / MemoryRouter / ReadRouter, 8 cognitive mechanisms with primary-source citations |
| Benchmarks | LongMemEval-S 85.6%, LongMemEval-M 70.2%, vendor landscape, confidence interval methodology, judge FPR probes, full transparency stack |
| Reproducibility | Per-case run JSONs at --seed 42, single-CLI reproduction, Apache-2.0 bench at github.com/framersai/agentos-bench |

Join Discord for the announcement → · Read the benchmarks now →


Classifier-Driven Memory Pipeline

Most memory libraries retrieve on every query. AgentOS gates memory through three LLM-as-judge classifiers in a single shared pass, so trivial queries skip retrieval entirely and the rest get the right architecture and reader per category.

User query
    │
    ▼ Stage 1: QueryClassifier (gpt-5-mini, ~$0.0001/query)
    │    T0=none ─────► answer from context, skip retrieval
    │    T1+=needs memory
    ▼ Stage 2: MemoryRouter      → canonical-hybrid · OM-v10 · OM-v11
    ▼ Stage 3: ReaderRouter      → gpt-4o (TR/SSU) · gpt-5-mini (SSA/SSP/KU/MS)
    ▼
Grounded answer

Stages 2 and 3 reuse the Stage 1 classification, so the full pipeline costs one classifier call per query, not three. The T0 / no-memory gate is the novel piece: removing retrieval entirely for greetings and small talk saves the embedding + rerank + reader cost on a substantial fraction of typical agent traffic.
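The gating logic reduces to a small dispatch. This toy version stubs the classifier call and simplifies the architecture/reader mapping, which in the real pipeline is per-category (TR/SSU vs SSA/SSP/KU/MS), not per-tier.

```typescript
// Toy version of the tiered gate; the classifier stub stands in for the
// gpt-5-mini call, and the routing map is simplified for illustration.
type Tier = 'T0' | 'T1' | 'T2' | 'T3';

function handle(query: string, classify: (q: string) => Tier): string {
  const tier = classify(query); // one classifier call; Stages 2-3 reuse it
  if (tier === 'T0') return 'answered from context (retrieval skipped)';
  const architecture = tier === 'T1' ? 'canonical-hybrid' : 'OM-v11';
  const reader = tier === 'T3' ? 'gpt-4o' : 'gpt-5-mini';
  return `retrieve via ${architecture}, read with ${reader}`;
}
```

The T0 branch is the cost saver: greetings and small talk exit before any embedding, rerank, or reader call happens.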

| Primitive | Source | Decision |
| --- | --- | --- |
| QueryClassifier | @framers/agentos/query-router | T0/none vs T1/simple vs T2/moderate vs T3/complex |
| MemoryRouter | @framers/agentos/memory-router | canonical-hybrid vs observational-memory-v10 vs v11 |
| ReaderRouter | @framers/agentos/memory-router | gpt-4o vs gpt-5-mini per category |

Cognitive Pipeline docs → · Architecture deep dive → · Beyond RAG →


Why AgentOS

| vs. | AgentOS differentiator |
| --- | --- |
| LangChain / LangGraph | Cognitive memory (8 neuroscience-backed mechanisms), HEXACO personality, runtime tool forging |
| Vercel AI SDK | Multi-agent teams (6 strategies), 7 vector backends, guardrails, voice/telephony |
| CrewAI / Mastra | Unified orchestration (DAGs + graphs + missions), personality-driven routing, published reproducible numbers on LongMemEval-S (85.6%) and LongMemEval-M (70.2%) with full methodology disclosure |

Full framework comparison →


Key Features

| Category | Highlights |
| --- | --- |
| LLM Providers | 16: OpenAI, Anthropic, Gemini, Groq, Ollama, OpenRouter, Together, Mistral, xAI, Claude/Gemini CLI, + 5 image/video |
| Cognitive Memory | 8 mechanisms: reconsolidation, retrieval-induced forgetting, involuntary recall, FOK, gist extraction, schema encoding, source decay, emotion regulation |
| HEXACO Personality | 6 traits modulate memory, retrieval bias, response style |
| RAG Pipeline | 7 vector backends · 4 retrieval strategies · GraphRAG · HyDE · Cohere rerank-v3.5 |
| Multi-Agent Teams | 6 coordination strategies · shared memory · inter-agent messaging · HITL gates |
| Orchestration | workflow() DAGs · AgentGraph cycles · mission() goal-driven planning · checkpointing |
| Guardrails | 5 security tiers · 6 packs (PII, ML classifiers, topicality, code safety, grounding, content policy) |
| Emergent Capabilities | Runtime tool forging · 4 self-improvement tools · tiered promotion · skill export |
| Voice & Telephony | ElevenLabs, Deepgram, Whisper · Twilio, Telnyx, Plivo |
| Channels | 37 platform adapters (Telegram, Discord, Slack, WhatsApp, webchat, ...) |
| Observability | OpenTelemetry · usage ledger · cost guard · circuit breaker |

Multi-Agent in 6 Lines

import { agency } from '@framers/agentos';

const team = agency({
  strategy: 'graph',
  agents: {
    researcher: { provider: 'anthropic', instructions: 'Find relevant facts.' },
    writer:     { provider: 'openai',    instructions: 'Summarize clearly.',  dependsOn: ['researcher'] },
    reviewer:   { provider: 'gemini',    instructions: 'Check accuracy.',     dependsOn: ['writer'] },
  },
});

const result = await team.generate('Compare TCP vs UDP for game networking.');

Strategies: sequential · parallel · debate · review-loop · hierarchical · graph. With strategy: 'hierarchical' + emergent: { enabled: true }, the manager LLM gets a spawn_specialist tool that mints new sub-agents at runtime when the static roster doesn't cover a sub-task. agency() is for single-request multi-agent coordination — for long-running world simulations or per-turn parallel agent loops, build your own orchestration with agent() + the lower-level primitives. Multi-agent docs → · Hierarchical + emergent → · Scope guide →


See It In Action

🌀 Paracosm — AI Agent Swarm Simulation

Define any scenario as JSON. Run it with AI commanders that have different HEXACO personalities. Same starting conditions, different decisions, divergent civilizations. Built on AgentOS.

npm install paracosm

Live Demo · GitHub · npm


Configure API Keys

Three layers, highest priority first:

// 1. Inline on the call (per-tenant, per-test, per-customer)
generateText({ apiKey: 'sk-customer', prompt: '...' });

// 2. Module-level default — set once at boot, no .env needed
import { setDefaultProvider } from '@framers/agentos';
setDefaultProvider({ provider: 'openai', apiKey: process.env.MY_OWN_KEY });

// 2b. Reorder the env-var auto-detect chain instead (when you keep multiple keys)
import { setProviderPriority } from '@framers/agentos';
setProviderPriority(['anthropic', 'openai', 'ollama']);

# 3. Environment variable auto-detect chain (default order)
#    OpenRouter → OpenAI → Anthropic → Gemini → Groq → Together → Mistral
#    → xAI → claude CLI → gemini CLI → Ollama → image providers
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GEMINI_API_KEY=AIza...

# Comma-separated keys auto-rotate with quota detection
export OPENAI_API_KEY=sk-key1,sk-key2,sk-key3
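Comma-separated rotation amounts to a cursor over the key list. This is an illustrative sketch of the concept; the real resolver's quota detection and rotation policy live in the credential docs.

```typescript
// Illustrative only: parse a comma-separated key list and advance the cursor
// when the caller detects a quota error. Names here are ours, not the runtime's.
function makeKeyRotator(envValue: string) {
  const keys = envValue.split(',').map((k) => k.trim()).filter(Boolean);
  let i = 0;
  return {
    current: () => keys[i % keys.length],
    rotateOnQuota: () => keys[++i % keys.length], // e.g. called on a 429
  };
}
```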

Full credential resolution + default models per provider →


API Surfaces

  • agent(): lightweight stateful agent. Prompts, sessions, personality, hooks, tools, memory.
  • agency(): multi-agent teams + full runtime. Emergent tooling, guardrails, RAG, voice, channels, HITL.
  • generateText() / streamText() / generateObject() / generateImage() / generateVideo() / generateMusic() / performOCR() / embedText(): low-level multi-modal helpers with native tool calling.
  • workflow() / AgentGraph / mission(): three orchestration authoring APIs over one graph runtime.

Provider fallback is an explicit opt-in via agent({ fallbackProviders: [...] }) (or buildFallbackChain() for programmatic chains). Defaults to off — the runtime never silently retries against a different provider unless you configured a chain.
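Behaviorally, an opt-in chain is an ordered try/catch loop. This generic sketch shows the concept; the helper name is ours, not the library's.

```typescript
// Generic sketch of the explicit fallback concept: providers are tried in the
// configured order, and only a configured chain ever crosses providers.
async function withFallback<T>(
  providers: string[],
  call: (provider: string) => Promise<T>,
): Promise<T> {
  let lastError: unknown = new Error('no providers configured');
  for (const provider of providers) {
    try {
      return await call(provider); // first success wins
    } catch (err) {
      lastError = err; // then, and only then, try the next provider
    }
  }
  throw lastError;
}
```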

Full API reference → · High-Level API guide →


Documentation & Community


Contributing

git clone https://github.com/framersai/agentos.git && cd agentos
pnpm install && pnpm build && pnpm test

Contributing Guide · We use Conventional Commits.


License

Apache 2.0


Built by Manic Agency LLC · Frame.dev · Wilds.ai
