
Cortex consolidate — Performance Report #13

@darval

Plugin: cortex (v3.9.1, per skill path)
Date: 2026-04-15
Environment: macOS 25.4.0 (Darwin), Mac M2 Max, 64 GB RAM

Summary

A full consolidate({decay, compress, cls, memify}) run on a memory
store of ~66K memories took 35 min 50 s (2,150,496 ms). Most of
the work appears to be O(N) per-row updates across stages that could
plausibly be batched. One stage scanned but produced nothing, and the
homeostatic stage reported health_score: 0.0 and skipped scaling.

I'd like to suggest a few low-risk optimizations and — more usefully —
some per-stage telemetry that would let maintainers (and operators)
see where time actually goes on real stores.

Pre-run state

{
  "total_memories": 65949,
  "episodic_count": 25319,
  "semantic_count": 40630,
  "active_count":   65610,
  "archived_count": 339,
  "protected_count": 3088,
  "avg_heat": 0.6746,
  "total_entities": 10770,
  "total_relationships": 34735,
  "active_triggers": 1456,
  "last_consolidation": "2026-04-13T01:20:15Z",   // ~2 days prior
  "has_vector_search": true
}

Full-run result

{
  "decay":       { "memories_decayed": 62522, "metabolic_updates": 65937,
                   "entities_decayed": 10770, "total_memories": 65949 },
  "plasticity":  { "ltp": 17, "ltd": 33092, "edges_updated": 33109 },
  "pruning":     { "edges_pruned": 32024, "entities_archived": 718 },
  "compression": { "compressed_to_gist": 213, "compressed_to_tag": 0,
                   "protected_skipped": 3088, "semantic_skipped": 38502 },
  "cls":         { "patterns_found": 0, "new_semantics_created": 0,
                   "skipped_inconsistent": 0, "skipped_duplicate": 0,
                   "causal_edges_found": 0 },
  "memify":      { "pruned": 0, "strengthened": 0, "reweighted": 1345 },
  "cascade":     { "advanced": 503, "transitions": [... 503 entries ...] },
  "homeostatic": { "scaling_applied": false, "health_score": 0.0 },
  "duration_ms": 2150496
}

Observations

1. Most work is row-by-row across large result sets

  • Decay touched 62,522 memories (94.8 % of the store) plus 10,770
    entities. If this is implemented as per-row UPDATE statements
    inside a loop, it is the most plausible source of the bulk of the
    35 minutes.
  • Plasticity updated 33,109 edges (17 LTP + 33,092 LTD).
  • Pruning deleted 32,024 edges.

These three stages together account for ~138K row operations. A single
set-based UPDATE ... SET heat = heat * :decay WHERE ... (and the
analogous UPDATE / DELETE for plasticity and pruning) is typically
orders of magnitude faster than per-row application logic.
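To make the set-based idea concrete, here is a minimal sketch against a hypothetical memories(id, heat, protected) table in SQLite; the plugin's actual schema and storage engine are unknown to me, so every name here is an assumption:

```python
# Sketch only: assumes a SQLite store with a memories(id, heat, protected)
# table. One statement replaces a per-row UPDATE loop over ~62K rows.
import sqlite3

def decay_batch(conn: sqlite3.Connection, factor: float) -> int:
    """Apply heat decay to all unprotected memories in a single statement.

    Returns the number of rows updated; the per-row loop this replaces
    would issue one UPDATE (and often one transaction) per memory.
    """
    cur = conn.execute(
        "UPDATE memories SET heat = heat * ? WHERE protected = 0",
        (factor,),
    )
    conn.commit()
    return cur.rowcount
```

The same shape applies to plasticity (one UPDATE over the LTD edge set) and pruning (one DELETE with the prune predicate in the WHERE clause).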

2. CLS scanned ~25K episodic memories and produced zero output

patterns_found: 0, new_semantics_created: 0,
skipped_inconsistent: 0, skipped_duplicate: 0, causal_edges_found: 0

A stage that finds nothing after a 2-day gap on a 25K-episodic store
is either miscalibrated (threshold too strict) or doing expensive
work with no observable effect. Either way, it's a candidate for
early-exit when the input set hasn't changed meaningfully since the
last run.
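The early-exit guard could be as small as this (the function name, inputs, and the threshold of 50 are all illustrative, not from the plugin):

```python
# Hypothetical guard: skip the CLS scan entirely when too few episodic
# memories were created or accessed since the last consolidation run.
def should_run_cls(new_since_last: int, accessed_since_last: int,
                   min_delta: int = 50) -> bool:
    """Return True only when the input set changed meaningfully."""
    return (new_since_last + accessed_since_last) >= min_delta
```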

3. Homeostatic stage reports health_score: 0.0, scaling skipped

"scaling_applied": false, "health_score": 0.0 looks like either a
divide-by-zero / empty-input guard short-circuiting, or a metric that
legitimately is zero but whose name suggests a bug. Worth a glance.

4. Compression is heavily skipped

protected_skipped: 3088, semantic_skipped: 38502, with only
compressed_to_gist: 213. Skips are ~99 % of the candidate set. If
those skip checks are cheap, this is fine — but if the skip path
re-reads each memory's content/flags from the DB, that's another
O(N) tax.
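If the store is SQL-backed, one way to keep the skip path cheap is to fold the protected/semantic checks into the candidate query itself, so skipped rows are never fetched at all. A sketch with assumed column names (protected, kind, heat) and an assumed heat threshold:

```python
# Sketch: express the skip conditions as WHERE clauses so the ~41K
# skipped memories never leave the database. Column names and the
# 0.1 heat cutoff are assumptions, not the plugin's actual predicate.
import sqlite3

def compression_candidates(conn: sqlite3.Connection) -> list[int]:
    """Return ids of memories eligible for gist compression."""
    rows = conn.execute(
        "SELECT id FROM memories "
        "WHERE protected = 0 AND kind = 'episodic' AND heat < 0.1"
    ).fetchall()
    return [r[0] for r in rows]
```

With indexes on protected/kind this turns the skip check from N row reads into one index scan.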

5. No per-stage duration in the result payload

Today the response includes only duration_ms (total). Everything
above is inferred from row counts. Per-stage timings would make
bottleneck identification mechanical instead of speculative.

Suggested telemetry (the most useful change, probably)

Return duration_ms per stage in the existing result dict:

{
  "decay":       { ..., "duration_ms": 1234567 },
  "plasticity":  { ..., "duration_ms": 234567 },
  "pruning":     { ..., "duration_ms": 45678 },
  "compression": { ..., "duration_ms": 12345 },
  "cls":         { ..., "duration_ms": 456789 },   // scanned but empty
  "memify":      { ..., "duration_ms": 6789 },
  "cascade":     { ..., "duration_ms": 3456 },
  "homeostatic": { ..., "duration_ms": 12 },
  "duration_ms":  2150496
}
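On the implementation side, a small context manager would be enough to populate those fields; this sketch assumes the stages write into a shared result dict (timed_stage is a hypothetical name, not an existing API):

```python
# Sketch of per-stage timing: wrap each stage body so its duration_ms
# lands in the stage's entry of the result dict shown above.
import time
from contextlib import contextmanager

@contextmanager
def timed_stage(result: dict, stage: str):
    start = time.perf_counter()
    stage_result: dict = result.setdefault(stage, {})
    try:
        yield stage_result
    finally:
        stage_result["duration_ms"] = int((time.perf_counter() - start) * 1000)

# Usage inside consolidate (illustrative):
#   with timed_stage(result, "decay") as r:
#       r["memories_decayed"] = run_decay()
```

The finally block means a stage that raises still records its duration, which is exactly when you want the number most.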

If it's useful, I'd be happy to collect and share per-run timing
stats from my installation so you have at least one real-world
large-store data point to tune against.

Additional fields that would be valuable if cheap to collect:

Field                                        Why
rows_scanned vs rows_modified per stage      Catches the CLS-style "scanned 25K, produced 0" case
db_query_count, db_tx_count per stage        Surfaces N+1 patterns directly
avg_work_per_memory_ms                       Normalizes across memory-store sizes
peak_memory_mb                               Flags when in-memory structures should be streamed
delta_vs_last_run_ms                         Drift signal: "this run was 2× slower than last"

Suggested optimizations (in rough order of likely payoff)

  1. Batch the decay update. One SQL statement over the candidate
    set instead of per-row updates. Same for plasticity LTD and pruning.
  2. Cooldown window on decay. Skip memories whose last_decayed
    timestamp is within the configured cooldown — on a 2-day gap this
    may not help, but on daily runs it should dramatically shrink the
    working set.
  3. Early-exit CLS when the episodic-memory delta (new / accessed
    since last run) is below some minimum. The current run scanned the
    whole 25K to produce zero patterns.
  4. Instrument the *_skipped counters in compression so it's
    clear whether the skip check itself is the expense or only the
    (rare) non-skip path.
  5. Fix or explain health_score: 0.0 + scaling_applied: false.
    Either the metric name is misleading (rename) or the scaling path
    is silently disabled (fix).
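Suggestions 1 and 2 combine naturally into a single statement; a sketch assuming a last_decayed column storing epoch seconds (again, the schema is my guess, not the plugin's):

```python
# Sketch: batched decay with a cooldown window. Rows decayed within the
# last cooldown_s seconds are excluded by the WHERE clause, so on daily
# runs the working set shrinks to recently untouched memories.
import sqlite3
import time

def decay_with_cooldown(conn: sqlite3.Connection, factor: float,
                        cooldown_s: int) -> int:
    """Decay only rows whose last_decayed falls outside the cooldown."""
    now = int(time.time())
    cur = conn.execute(
        "UPDATE memories SET heat = heat * ?, last_decayed = ? "
        "WHERE protected = 0 AND last_decayed < ?",
        (factor, now, now - cooldown_s),
    )
    conn.commit()
    return cur.rowcount
```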

Offer to help

If any of the above is actionable and you'd like:

  • a real-world workload for benchmarking (my ~66K-memory store before
    and after a patched run),
  • per-stage timings once the telemetry lands, or
  • a PR for the telemetry change itself (adding duration_ms per
    stage is a small, contained change),

happy to do it. Let me know what's most useful.

Repro

Tool: mcp__plugin_cortex_cortex__consolidate
Args: { decay: true, compress: true, cls: true, memify: true, deep: false }
Store size at run: 65949 memories, 40630 semantic, 34735 relationships
Time since last consolidate: ~2 days
Wall-clock duration: 2,150,496 ms (35 min 50 s)

Labels: bug, enhancement