feat: RuntimeState event bus integration with checkpoint/resume by greysonlalonde · Pull Request #5241 · crewAIInc/crewAI

greysonlalonde · 2026-04-02T21:38:38Z

Summary

Pass RuntimeState as optional third arg to event bus handlers
RuntimeState.checkpoint(dir) writes timestamped JSON snapshots
Crew.from_checkpoint(path) restores and resumes via kickoff()
_get_execution_start_index skips tasks with existing output
Convert CrewStructuredTool, StandardPromptResult, SystemPromptResult, TokenCalcHandler to BaseModel
CrewAgentExecutorMixin uses Field(exclude=True) for back-references

Test plan

Real LLM execution: checkpoint after task 1, restore, resume skips task 1 and runs task 2
371 core tests pass
Backwards compatible: 2-arg event handlers still work

Note

Medium Risk
Medium risk because it changes core execution plumbing (event emission, executor serialization, and resume logic) and alters JSON serialization shapes for LLM/tool/executor objects, which could affect backward compatibility and runtime behavior.

Overview
Adds first-class checkpoint/resume support by introducing crewai.state.runtime.RuntimeState (with pluggable BaseProvider + default filesystem JsonProvider) and an EventRecord that captures event relationships during execution.
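
A minimal sketch of the pluggable-provider idea described above, assuming a storage interface with one write and one read method. `CheckpointProvider`, `JsonFileProvider`, `checkpoint`, and `restore` are hypothetical names for illustration only; the PR's actual `BaseProvider` and `JsonProvider` signatures are not shown on this page and may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Protocol


class CheckpointProvider(Protocol):
    """Hypothetical storage interface; not the PR's BaseProvider."""

    def write(self, data: str, directory: str) -> str: ...

    def read(self, path: str) -> str: ...


class JsonFileProvider:
    """Writes each snapshot to a new timestamped .json file on disk."""

    def write(self, data: str, directory: str) -> str:
        target_dir = Path(directory)
        target_dir.mkdir(parents=True, exist_ok=True)
        # Microsecond-resolution UTC timestamp keeps snapshot names sortable.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        path = target_dir / f"{stamp}.json"
        path.write_text(data, encoding="utf-8")
        return str(path)

    def read(self, path: str) -> str:
        return Path(path).read_text(encoding="utf-8")


def checkpoint(state: dict, directory: str, provider: CheckpointProvider) -> str:
    """Serialize state to JSON and hand it to the provider; returns the path."""
    return provider.write(json.dumps(state), directory)


def restore(path: str, provider: CheckpointProvider) -> dict:
    """Load a snapshot previously written by checkpoint()."""
    return json.loads(provider.read(path))
```

Keeping serialization separate from storage means a different backend only has to implement the two provider methods.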

Integrates runtime state with the event system: the event bus now tracks/records emitted events into the active RuntimeState, can auto-register entities, and supports handlers that optionally accept a third state argument while remaining compatible with 2-arg handlers.

Enables restoring and resuming runs via Crew.from_checkpoint() / Flow.from_checkpoint() / BaseAgent.from_checkpoint(), including rebuilding event scope, rehydrating agent executors/message history, and skipping already-completed tasks when resuming.

Refactors several runtime objects to be Pydantic models (BaseAgentExecutor, CrewAgentExecutor, CrewStructuredTool, prompt result types, token tracking/callbacks) and adjusts LLM/tool serialization to structured dicts; also updates CI/pre-commit uv versions and bumps dependencies (e.g., litellm, openai) plus adds aiofiles for async checkpoint IO.

Reviewed by Cursor Bugbot for commit 167b609. Bugbot is set up for automated code reviews on this repo.

…ider pattern
- Move runtime_state.py to state/runtime.py
- Add acheckpoint async method using aiofiles
- Introduce BaseProvider protocol and JsonProvider for pluggable storage
- Add aiofiles dependency to crewai package
- Use PrivateAttr for provider on RootModel

iris-clawd

Full Review: RuntimeState event bus integration with checkpoint/resume

This is a substantial PR (~1.5K additions, 42 files). Overall the architecture is solid — RuntimeState + EventRecord + pluggable providers is a clean design. A few concerns worth addressing:

✅ What looks good

EventRecord data structure — directed graph with O(1) lookups, typed edges, automatic wiring via add(). Clean design, well-tested (423-line test file).
BaseAgentExecutor refactor — Converting CrewAgentExecutorMixin from a plain class to a Pydantic BaseModel is the right move for serialization. The Field(exclude=True) for back-references avoids circular serialization.
Provider abstraction — BaseProvider protocol with sync/async methods is clean. JsonProvider is a sensible default.
3-arg handler backward compat — Event handlers can accept 2 or 3 args. inspect.signature dispatch is pragmatic.
Resume logic in Crew: _get_execution_start_index skips completed tasks by finding the first task whose output is None, which is straightforward and correct.
Test coverage — 423 lines of EventRecord tests covering edge wiring, symmetry, traversal, serialization roundtrips, and RuntimeState integration. Solid.

⚠️ Concerns

1. inspect.signature() on every sync handler call (hot path)

In _call_handlers and is_call_handler_safe, every handler invocation does inspect.signature(handler). This is the event emission hot path — could fire hundreds of times per crew execution. Consider caching the parameter count at registration time (@crewai_event_bus.on) instead.

2. New dependency: aiofiles~=24.1.0

This adds a runtime dependency to the core package for async file I/O in JsonProvider. Given that checkpointing is opt-in and the sync path uses plain open(), is the async path critical enough to justify a new dependency? An alternative: use asyncio.to_thread(Path.read_text, ...) for the async provider methods.
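
A minimal sketch of that alternative, assuming the async provider methods only need whole-file reads and writes. `aread_text` and `awrite_text` are hypothetical names; `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
from pathlib import Path


async def aread_text(path: str) -> str:
    """Read a file without blocking the event loop, using a worker thread."""
    return await asyncio.to_thread(Path(path).read_text, encoding="utf-8")


async def awrite_text(path: str, data: str) -> None:
    """Write a file in a worker thread; the byte count is discarded."""
    await asyncio.to_thread(Path(path).write_text, data, encoding="utf-8")
```

The trade-off is that each call occupies a thread from the default executor, which is usually acceptable for occasional checkpoint writes.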

3. register_entity uses id(entity) for dedup

Python's id() can be reused after garbage collection. If an entity is GC'd and a new one allocated at the same address, it won't get registered. In practice this is unlikely during a single execution, but it's a subtle footgun. Consider using the entity's id field (UUID) instead of id().
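
A minimal sketch of deduplicating on a stable UUID field instead of `id()`, assuming every entity carries such a field. `Entity` and `Registry` are hypothetical stand-ins, not the PR's classes.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Stand-in for an agent, crew, or flow that carries its own UUID."""

    name: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)


class Registry:
    """Tracks registered entities by UUID, which is never reused."""

    def __init__(self) -> None:
        self._seen: set[uuid.UUID] = set()
        self.entities: list[Entity] = []

    def register(self, entity: Entity) -> bool:
        """Register the entity once; return False if it was already known."""
        if entity.id in self._seen:
            return False
        self._seen.add(entity.id)
        self.entities.append(entity)
        return True
```

Unlike a memory address, the UUID also survives serialization, so the same key can be used to seed the seen-set after a restore.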

4. CrewStructuredTool → BaseModel migration

The args_schema: Any and func: Any typing loses the previous type safety. Could these be typed more precisely? e.g., args_schema: type[BaseModel] | None and func: Callable[..., Any] | None.

5. LLM serialization change: str → dict

Changing _serialize_llm_ref to return dict instead of str is a breaking change for anyone consuming serialized agent/crew JSON. The _validate_llm_ref handles dict → LLM on deserialization, which is good, but this needs to be called out in release notes.

6. _restore_runtime matches agents by role string

task.agent is re-linked by matching agent.role == task.agent.role. If two agents share the same role (unusual but possible), this could mis-link. Consider matching on agent id (UUID) instead.
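
A minimal sketch of re-linking by id rather than by role, assuming both the restored agents and the task's deserialized agent copy carry the same id. `Agent`, `Task`, and `relink_agents` are hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    id: str
    role: str


@dataclass
class Task:
    description: str
    agent: Optional[Agent] = None


def relink_agents(tasks: list[Task], agents: list[Agent]) -> None:
    """Point each task at the crew's agent instance with the same id."""
    by_id = {agent.id: agent for agent in agents}
    for task in tasks:
        if task.agent is not None:
            # Fall back to the deserialized copy if no crew agent matches.
            task.agent = by_id.get(task.agent.id, task.agent)
```

With two agents that share a role, a role-based lookup would pick whichever comes first, while the id-based lookup stays unambiguous.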

7. mypy: disable-error-code="union-attr,arg-type" added to two files

Both crew_agent_executor.py and experimental/agent_executor.py get blanket mypy suppressions. These are large files — would be better to use inline # type: ignore on specific lines rather than file-wide suppression.

🔍 Minor nits

StandardPromptResult and SystemPromptResult converted from TypedDict to BaseModel with get()/__getitem__/__contains__ methods — these duck-type as dicts for backward compat, which is clever but should be documented.
TokenCalcHandler.__hash__ = object.__hash__ — this is needed because BaseModel changes hash behavior, but it's non-obvious. A comment explaining why would help.
The EventRecord.descendants() uses queue.pop(0) (O(n)) — use collections.deque for proper BFS.
Empty __init__.py files for state/ and state/provider/ — fine, just noting.
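
A minimal sketch of the `collections.deque` suggestion for the BFS nit above, assuming the graph is a plain mapping from node id to child ids. This `descendants` function is a hypothetical stand-in and does not reproduce `EventRecord.descendants()`.

```python
from collections import deque


def descendants(children: dict[str, list[str]], start: str) -> list[str]:
    """Return all nodes reachable from start, in breadth-first order."""
    seen = {start}
    order: list[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()  # O(1), unlike list.pop(0)
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```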

Summary

The core design is strong and the test coverage is good. Main things I'd want addressed before merge:

Cache handler param count at registration (perf)
Consider dropping aiofiles dep (use asyncio.to_thread instead)
Match entities by UUID not role string in _restore_runtime
Remove file-wide mypy suppressions

None of these are show-stoppers, but #1 and #3 could cause real issues at scale.

- Return len(tasks) from _get_execution_start_index when all tasks complete, preventing full re-execution of finished checkpoints
- Add _get_execution_start_index call to _aexecute_tasks so async resume skips completed tasks like the sync path does
- Cache inspect.signature results per handler to avoid repeated introspection on every event emission
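
A minimal sketch of the start-index behavior described in this commit message, assuming a task counts as complete when its output is not None. `Task` and `get_execution_start_index` are hypothetical stand-ins, not the method in `crew.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    output: Optional[str] = None


def get_execution_start_index(tasks: list[Task]) -> int:
    """Return the index of the first task without output.

    Returns len(tasks) when every task is complete, so that
    tasks[start:] is empty and nothing is re-executed.
    """
    for index, task in enumerate(tasks):
        if task.output is None:
            return index
    return len(tasks)
```

Returning an integer in every case (0 for a fresh run, `len(tasks)` for a finished one) lets the caller slice without a None check.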

- Bump uv-pre-commit from 0.9.3 to 0.11.3 to support relative exclude-newer values in pyproject.toml
- Use checkpoint_kickoff_event_id to detect resume, preventing second kickoff() from skipping tasks or suppressing events

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.

cursor · 2026-04-04T15:26:49Z

lib/crewai/src/crewai/agents/agent_builder/base_agent.py

+    if isinstance(value, dict):
+        from crewai.llm import LLM
+
+        return LLM(**value)


LLM deserialization always creates LLM, losing provider type

Medium Severity

_validate_llm_ref always reconstructs a litellm-based LLM from a dict, even if the original object was a different BaseLLM subclass (e.g., OpenAICompletion). After a checkpoint/restore cycle, the LLM provider type is silently changed, which alters runtime behavior. The same issue applies to _validate_executor_ref, which always creates a CrewAgentExecutor even if the original was an AgentExecutor.

Additional Locations (1)

lib/crewai/src/crewai/agents/agent_builder/base_agent.py#L82-L88
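
One possible way to preserve the concrete class across a checkpoint/restore cycle is to store a type tag alongside the serialized fields. The sketch below assumes a small registry of known classes; `BaseLLM`, `DefaultLLM`, `NativeProviderLLM`, `LLM_REGISTRY`, and the `llm_type` key are all hypothetical and do not reflect crewAI's actual classes or serialization format.

```python
from dataclasses import asdict, dataclass


@dataclass
class BaseLLM:
    model: str


@dataclass
class DefaultLLM(BaseLLM):
    """Stand-in for the default, catch-all LLM class."""


@dataclass
class NativeProviderLLM(BaseLLM):
    """Stand-in for a provider-specific subclass."""


# Registry mapping the stored tag back to a concrete class.
LLM_REGISTRY = {cls.__name__: cls for cls in (DefaultLLM, NativeProviderLLM)}


def serialize_llm(llm: BaseLLM) -> dict:
    """Dump the fields plus a tag naming the concrete class."""
    return {"llm_type": type(llm).__name__, **asdict(llm)}


def deserialize_llm(data: dict) -> BaseLLM:
    """Rebuild the original class; untagged dicts fall back to DefaultLLM."""
    payload = dict(data)
    cls = LLM_REGISTRY.get(payload.pop("llm_type", ""), DefaultLLM)
    return cls(**payload)
```

The same tagging approach would apply to the executor reference mentioned in the comment.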

cursor · 2026-04-04T15:26:49Z

lib/crewai/src/crewai/agents/agent_builder/base_agent.py

-    return getattr(value, "model", str(value))
+        return {"model": value}
+    result: dict[str, Any] = value.model_dump()
+    return result


Checkpoint files may contain plaintext API credentials

Medium Severity

_serialize_llm_ref calls value.model_dump() which serializes all LLM fields, potentially including api_key, api_base, and other credentials. Checkpoint JSON files written by RuntimeState.checkpoint() could contain sensitive secrets in plaintext on the filesystem.
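
A minimal sketch of one mitigation: dropping sensitive keys from the dumped dict before it is written. The key set and the `redact` helper are assumptions for illustration; which fields count as secrets depends on the real LLM model, and any dropped value would have to be supplied again on restore (for example from the environment).

```python
from typing import Any

# Assumption: field names taken from the review comment above.
SENSITIVE_KEYS = frozenset({"api_key", "api_base"})


def redact(value: Any, sensitive: frozenset = SENSITIVE_KEYS) -> Any:
    """Return a copy of value with sensitive dict keys removed, recursively."""
    if isinstance(value, dict):
        return {
            key: redact(item, sensitive)
            for key, item in value.items()
            if key not in sensitive
        }
    if isinstance(value, list):
        return [redact(item, sensitive) for item in value]
    return value
```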

cursor · 2026-04-04T15:26:49Z

lib/crewai/src/crewai/events/event_bus.py

+        set_last_event_id(event.event_id)
+
+        if self._runtime_state is not None:
+            self._runtime_state.event_record.add(event)


Concurrent emit calls race on shared event_record

Low Severity

_prepare_event is called from the emitting thread without any locking, and self._runtime_state.event_record.add(event) mutates the shared nodes dict and modifies neighboring nodes' edge lists. When multiple threads call emit() concurrently, these compound mutations can race, potentially producing an inconsistent event record in the checkpoint.
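
A minimal sketch of guarding the compound mutation with a lock, assuming the record is a dict of nodes with parent and child links. This `EventRecord` is a simplified hypothetical stand-in, not the PR's class.

```python
import threading
from typing import Optional


class EventRecord:
    """Simplified event graph whose add() is safe to call from many threads."""

    def __init__(self) -> None:
        self.nodes: dict[str, dict] = {}
        self._lock = threading.Lock()

    def add(self, event_id: str, parent_id: Optional[str] = None) -> None:
        # Inserting the node and updating the parent's edge list happen
        # under one lock, so readers never see a half-wired node.
        with self._lock:
            self.nodes[event_id] = {"parent": parent_id, "children": []}
            if parent_id is not None and parent_id in self.nodes:
                self.nodes[parent_id]["children"].append(event_id)
```

A checkpoint writer would need to take the same lock while serializing to get a consistent snapshot.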

greysonlalonde added 2 commits April 3, 2026 04:19

feat: pass RuntimeState through event bus, add .checkpoint(directory)

6627845

feat: pass RuntimeState through event bus, add .checkpoint() and .fro…

cf241d8

…m_checkpoint()

github-actions bot added the size/L label Apr 2, 2026

feat: convert executor/tools/prompts to BaseModel, enable checkpoint …

2e1f882

…resume via kickoff()

github-actions bot added the size/XL label Apr 3, 2026

greysonlalonde added 3 commits April 3, 2026 12:21

fix: preserve kickoff_event_id on resume, verbose already works

743ebed

refactor: make CrewAgentExecutorMixin a proper base class with Fields…

9ab85e6

… instead of PrivateAttr properties

Merge branch 'main' into chore/runtime-state-event-bus

5179b41

greysonlalonde changed the title ~~feat: runtime state event bus~~ feat: RuntimeState event bus integration with checkpoint/resume Apr 3, 2026

greysonlalonde added 9 commits April 3, 2026 12:46

Merge branch 'main' into chore/runtime-state-event-bus

2c4914b

feat: type executor fields, auto-register entities in event bus, conv…

6504e39

…ert TokenProcess to BaseModel

fix: TokenCalcHandler hashability, test MinimalExecutor as instance

78fbe45

fix: type remaining Any fields on CrewAgentExecutor

de9f121

fix: use spec= on test mocks for typed executor fields

0b980db

fix: replace object.__new__ and MinimalExecutor subclass with proper …

3a08e95

…construction

fix: validate entity_type tag before auto-registering in emit()

caaccd7

fix: mypy errors in streaming.py and core.py

2e1525f

refactor: move RuntimeState to runtime_state.py, type _runtime_state …

1ed6646

…on event bus

github-code-quality bot found potential problems Apr 3, 2026

lib/crewai/src/crewai/__init__.py (dismissed)

github-code-quality bot found potential problems Apr 3, 2026

lib/crewai/src/crewai/state/provider/core.py (dismissed)

lib/crewai/src/crewai/state/provider/core.py (dismissed)

feat: add EventRecord to RuntimeState checkpoints

c653d41

github-code-quality bot found potential problems Apr 3, 2026

lib/crewai/tests/test_event_record.py (fixed)

greysonlalonde added 3 commits April 3, 2026 23:02

fix: suppress duplicate lifecycle events on checkpoint resume

5ace0bf

feat: mid-task checkpoint resume and executor refactor

6dc9f46

refactor: generic from_checkpoint with provider, full LLM serialization

191053c

github-code-quality bot found potential problems Apr 3, 2026

lib/crewai/src/crewai/state/provider/core.py (dismissed)

lib/crewai/src/crewai/state/provider/core.py (dismissed)

greysonlalonde added 3 commits April 4, 2026 02:07

fix: guard register_entity when RuntimeState is None

fb8b59d

fix: add BaseAgentExecutor to model_rebuild chain

f9d58d4

fix: use spec= on mocks for typed executor fields

fba5605

greysonlalonde marked this pull request as ready for review April 3, 2026 20:01

Merge branch 'main' into chore/runtime-state-event-bus

81e51f0

cursor bot reviewed Apr 3, 2026

lib/crewai/src/crewai/crew.py (outdated, resolved)

lib/crewai/src/crewai/crew.py (resolved)

lib/crewai/src/crewai/events/utils/handlers.py (resolved)

iris-clawd reviewed Apr 3, 2026

greysonlalonde and others added 2 commits April 4, 2026 04:11

Merge branch 'main' into chore/runtime-state-event-bus

dc2904d

cursor bot reviewed Apr 3, 2026

lib/crewai/src/crewai/crew.py (resolved)

lib/crewai/src/crewai/crew.py (resolved)

greysonlalonde added 4 commits April 4, 2026 21:29

fix: restore checkpoint_train flag during checkpoint resume

d769469

Merge branch 'chore/runtime-state-event-bus' of https://github.com/cr…

88d9984

…ewAIInc/crewAI into chore/runtime-state-event-bus

refactor: use lru_cache for handler param count

c4bbb03

cursor bot reviewed Apr 4, 2026

lib/crewai/src/crewai/events/utils/handlers.py (resolved)

lib/crewai/src/crewai/events/event_bus.py (resolved)

greysonlalonde added 3 commits April 4, 2026 22:04

fix: bump litellm to ~=1.83.0 and openai to ~=2.30.0

70fc701

litellm 1.83.0 fixes CVE-2026-35029 (proxy config privilege escalation) and CVE-2026-35030 (proxy JWT auth bypass), and is the first release after the supply chain incident. Bump openai to 2.x to satisfy litellm's dependency.

fix: handle unhashable partial handlers in param count cache

fac186a

fix: register entities in aemit like emit does

686cff6

cursor bot reviewed Apr 4, 2026

lib/crewai/src/crewai/events/event_bus.py (outdated, resolved)

lib/crewai/src/crewai/events/event_bus.py (outdated, resolved)

lib/crewai/src/crewai/agents/agent_builder/base_agent.py (resolved)

greysonlalonde added 2 commits April 4, 2026 22:18

ci: bump uv from 0.8.4 to 0.11.3 in all workflows

0079c70

fix: share event metadata setup between emit and aemit

da5a890

Extract _prepare_event to set previous_event_id, triggered_by_event_id, emission_sequence, parent/child scoping, and event_record tracking. Both emit and aemit now call it, fixing aemit's missing metadata.

cursor bot reviewed Apr 4, 2026

lib/crewai/src/crewai/events/event_bus.py (resolved)

greysonlalonde added 2 commits April 4, 2026 22:27

cleanup: remove redundant _registered_entity_ids class annotation

3f447f2

fix: seed _registered_entity_ids from restored RuntimeState

0c228b4

cursor bot reviewed Apr 4, 2026

lib/crewai/src/crewai/crew.py (resolved)

lib/crewai/src/crewai/agents/crew_agent_executor.py (resolved)

lib/crewai/src/crewai/events/event_bus.py (outdated, resolved)

greysonlalonde added 2 commits April 4, 2026 22:37

fix: return 0 instead of None when checkpoint resumes from first task

e0fc321

fix: avoid duplicating LLM hooks on checkpoint restore

6e7afb7

cursor bot reviewed Apr 4, 2026

lib/crewai/src/crewai/events/event_bus.py (resolved)

lib/crewai/src/crewai/experimental/agent_executor.py (outdated, resolved)

lib/crewai/src/crewai/task.py (outdated, resolved)

greysonlalonde added 4 commits April 4, 2026 23:00

fix: resolve mypy errors from openai 2.x type changes

055d146

fix: skip adding crew-owned agents as top-level RuntimeState entities

5c243b7

fix: return state messages by reference, not copy

b46e965

fix: restore event scope stack from checkpoint event record

167b609

Replay the event record during _restore_runtime to rebuild _event_id_stack with correct event IDs. Remove manual push_event_scope calls from task and crew resume paths that used task UUIDs instead of event IDs.

cursor bot reviewed Apr 4, 2026

2 participants
