feat: add TemporalFS CHASM Archetype #1

Draft
moedash wants to merge 70 commits into main from moe/temporal-fs

moedash commented Mar 19, 2026

What changed?

Introduces the TemporalFS CHASM archetype — a virtual filesystem built on top of the CHASM framework, backed by temporal-fs (PebbleDB). This PR includes:

  • Proto definitions: state.proto, tasks.proto, request_response.proto, service.proto with generated Go bindings for the TemporalFS gRPC service.
  • Core archetype implementation: Filesystem root component, state machine (Create/Archive/Delete transitions), CHASM library registration, dynamic config, search attributes, and FX wiring into service/history.
  • Pluggable storage: FSStoreProvider interface with PebbleStoreProvider (OSS default) using deterministic FNV-1a partition IDs. SaaS overrides this with CDSStoreProvider.
  • gRPC handler: All 18 filesystem RPCs implemented via temporal-fs ByID APIs — CreateFilesystem, Getattr, Lookup, ReadDir, Mkdir, CreateFile, WriteChunks, ReadChunks, CreateSnapshot, Setattr, Truncate, Unlink, Rmdir, Rename, Link, Symlink, Readlink, Mknod, Statfs.
  • Background tasks: ChunkGC (runs garbage collection and reschedules), QuotaCheck (reads FS metrics and warns on quota breach), ManifestCompact (placeholder).
  • Bug fixes during development: deterministic partition IDs (was using a reset-on-restart counter), store resource leak fixes in handler error paths, unsigned integer underflow guard in GC stats, complete gRPC error mapping, N+1 query fix in ReadDir via ReadDirPlusByID, nil-safety in state machine, FX lifecycle hook to close PebbleDB on shutdown.
  • CI changes: Configure GOPRIVATE/GONOSUMCHECK and git credentials across all workflows for the private temporal-fs dependency (using moedash/temporal-fs fork for CI access).
  • Architecture documentation in docs/architecture/temporalfs.md.

Why?

TemporalFS provides MVCC-snapshotted, append-friendly filesystem storage as a first-class CHASM archetype. This enables use cases like AI research agents that need transactional file operations with snapshot-based time travel — activities can write files, take snapshots, and later read from any historical snapshot for reproducibility. The pluggable FSStoreProvider allows SaaS to swap in CDS-backed storage without changing the core logic.

How did you test it?

  • built
  • added new unit test(s)
    • Component & state machine tests (lifecycle states, transitions, search attributes)
    • Task executor tests (ChunkGC with real PebbleDB, QuotaCheck, validation)
    • Handler tests for all 18 RPCs (including error paths)
    • PebbleStoreProvider partition isolation and cross-instance stability
  • added new functional test(s)
    • Integration test: full lifecycle (create → write → read → getattr → snapshot)
    • Research agent scenario: 3-iteration agent with workspace dirs, research notes, and MVCC snapshot time-travel verification
    • Crash recovery tests: failpoint injection verifying atomicity and retry correctness
    • Real integration test using FunctionalTestBase with CHASM-enabled Temporal server exercising the full stack (FX → PebbleStoreProvider → tfs.FS)

Potential risks

  • Private dependency: The temporal-fs module is sourced from moedash/temporal-fs fork via go.mod replace directive. Before merging to temporalio/temporal main, this must be switched to temporalio/temporal-fs and the CI secret (GO_PRIVATE_TOKEN) must be configured in the upstream repo.
  • PebbleDB resource lifecycle: PebbleStoreProvider opens a single PebbleDB instance per history service. If the FX lifecycle hook fails to fire (e.g., ungraceful shutdown), the PebbleDB lock file may persist and block restart.
  • Partition ID stability: FNV-1a hash of namespaceID+filesystemID is deterministic but not collision-resistant for adversarial inputs. This is acceptable for internal partition routing but should not be exposed as a user-facing identifier.
  • go.mod/go.sum churn: 80 additions / 41 deletions in go.mod due to the new temporal-fs dependency tree. Review for unintended transitive dependency upgrades.

moedash added 30 commits March 18, 2026 11:28
Define proto files for the TemporalFS archetype following the activity
archetype pattern:
- state.proto: FilesystemState, FilesystemConfig, FSStats, FilesystemStatus enum
- tasks.proto: ChunkGCTask, ManifestCompactTask, QuotaCheckTask
- request_response.proto: Request/response types for all FS operations
- service.proto: TemporalFSService gRPC service with routing annotations

Generated Go bindings in chasm/lib/temporalfs/gen/temporalfspb/.

Implement the TemporalFS archetype following the activity pattern:
- filesystem.go: Root component with lifecycle state and search attributes
- statemachine.go: State transitions (Create, Archive, Delete)
- library.go: CHASM library registration with component and tasks
- config.go: Dynamic config and default filesystem configuration
- search_attributes.go: FilesystemStatus search attribute
- handler.go: gRPC handler with CreateFilesystem, GetFilesystemInfo,
  ArchiveFilesystem implemented; FS operations stubbed
- tasks.go: ChunkGC, ManifestCompact, QuotaCheck task executors (stubs)
- fx.go: FX module for history service wiring
- errors.go: Shared error definitions

Wire TemporalFS HistoryModule into service/history/fx.go.

Define the pluggable FSStoreProvider interface, the sole extension point
for SaaS to provide a WalkerStore implementation. Includes FSStore and
FSBatch interfaces for key-value operations.

Provide InMemoryStoreProvider as a development/testing placeholder.
The production OSS implementation will use PebbleDB once the temporal-fs
module is integrated as a dependency.

Wire FSStoreProvider into the FX module with InMemoryStoreProvider as default.

Replace InMemoryStoreProvider with PebbleStoreProvider backed by
temporal-fs. The handler now uses temporal-fs APIs for Getattr
(StatByID), ReadChunks (ReadAtByID), WriteChunks (WriteAtByID),
CreateSnapshot, and CreateFilesystem (tfs.Create). Operations
requiring inode-based directory access (Lookup, ReadDir, Mkdir, etc.)
remain stubbed until temporal-fs exposes those APIs.

ChunkGC executor now opens the FS store and runs f.RunGC() to process
tombstones and delete orphaned chunks, then reschedules itself.
QuotaCheck executor reads FS metrics to update stats and warns when
size quota is exceeded. ManifestCompact remains a placeholder since
compaction operates at the PebbleDB shard level.

Tests cover:
- Filesystem lifecycle state (RUNNING/ARCHIVED/DELETED mapping)
- Terminate sets status to DELETED
- SearchAttributes returns status attribute
- StateMachineState with nil and non-nil state
- TransitionCreate with custom config, defaults, and zero GC interval
- TransitionArchive/Delete with valid and invalid source states
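
The transition rules these tests exercise can be pictured as a small lookup table from target status to permitted source statuses. The sketch below is illustrative only: the type names, the `transition` helper, and in particular which source states are valid for each transition are assumptions, since the commit message does not spell them out.

```go
package main

import (
	"errors"
	"fmt"
)

// Status mirrors the RUNNING/ARCHIVED/DELETED lifecycle named in the tests.
type Status int

const (
	StatusUnspecified Status = iota
	StatusRunning
	StatusArchived
	StatusDeleted
)

// ErrInvalidTransition is a hypothetical error for a rejected transition.
var ErrInvalidTransition = errors.New("invalid transition")

// allowedFrom maps a target status to the source statuses it may be reached
// from. ASSUMPTION: these rules are guesses made for illustration.
var allowedFrom = map[Status][]Status{
	StatusRunning:  {StatusUnspecified},             // Create
	StatusArchived: {StatusRunning},                 // Archive
	StatusDeleted:  {StatusRunning, StatusArchived}, // Delete
}

// transition returns the new status, or the unchanged status and an error
// when cur is not a permitted source for target.
func transition(cur, target Status) (Status, error) {
	for _, s := range allowedFrom[target] {
		if s == cur {
			return target, nil
		}
	}
	return cur, fmt.Errorf("%w: %d -> %d", ErrInvalidTransition, cur, target)
}

func main() {
	s, err := transition(StatusUnspecified, StatusRunning)
	fmt.Println(s, err)
	s, err = transition(s, StatusArchived)
	fmt.Println(s, err)
	_, err = transition(StatusDeleted, StatusArchived)
	fmt.Println(errors.Is(err, ErrInvalidTransition))
}
```

Keeping the rules in a table makes the valid and invalid source states easy to enumerate in table-driven tests.
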
Tests cover:
- ChunkGC/ManifestCompact/QuotaCheck Validate (RUNNING only)
- ChunkGC Execute with real PebbleDB store and GC rescheduling
- ChunkGC Execute with zero interval (no rescheduling)
- QuotaCheck Execute initializes stats
- FS metrics tracking on open instance

Tests cover:
- openFS/createFS helpers with real PebbleDB
- createFS default chunk size fallback
- inodeToAttr conversion
- mapFSError nil passthrough
- Getattr on root inode
- ReadChunks/WriteChunks round-trip
- CreateSnapshot returns valid txn ID
- All stubbed methods return errNotImplemented

Tests cover:
- Full lifecycle: create FS → write → read → getattr → snapshot
- PebbleStoreProvider partition isolation across filesystems
- PebbleStoreProvider Close releases all resources

Documents the internal architecture following the Scheduler archetype
pattern in docs/architecture/. Covers component tree, state machine,
tasks, pluggable storage (FSStoreProvider), gRPC service RPCs, FX
wiring, and configuration defaults.

Switch from local replace directive to the published v1.0.0 release
at github.com/temporalio/temporal-fs.

These planning documents (1-pager, PRD, design doc) should remain
as local working files, not committed to the repository.

The per-shard PebbleDB map was a leaky abstraction from the SaaS
per-shard Walker model. Since all handler operations use shardID=0
and PrefixedStore already provides key isolation between filesystem
executions, a single PebbleDB instance is sufficient.

Wire all 14 stub methods to temporal-fs ByID APIs: Lookup,
Setattr, Truncate, Mkdir, Unlink, Rmdir, Rename, ReadDir,
Link, Symlink, Readlink, CreateFile, Mknod, Statfs.

Add proper mapFSError with full error mapping to gRPC service
errors. Remove errNotImplemented.

Replace TestStubsReturnNotImplemented with 15 real tests
covering Lookup, Setattr, Truncate, Mkdir, Unlink, Rmdir,
Rename, ReadDir, Link, Symlink, Readlink, CreateFile, Mknod,
and Statfs handler methods.

- All 14 previously stubbed RPCs are now implemented with ByID methods
- Storage diagram shows CDSStoreProvider (SaaS) instead of placeholder
- Add CDSStoreProvider description with link to saas-temporal CDS doc

The moedash/temporal fork's CI needs access to temporal-fs. Since
moedash/temporal-fs is accessible to the fork's CI, add a replace
directive to source the module from there instead of temporalio/temporal-fs.

Add replace directive in go.mod to source temporal-fs from
moedash/temporal-fs. Configure git credentials and GOPRIVATE/GONOSUMCHECK
in CI workflows so go mod download can fetch the private module.

Composite actions can't access secrets directly, so add a
go-private-token input and pass it from all calling workflows.

PebbleStoreProvider used an in-memory counter for partition IDs that
reset on restart, causing FS data to map to wrong PrefixedStore prefixes.
Replace with deterministic FNV-1a hash of namespaceID+filesystemID so
partition IDs are stable across restarts. Add test for cross-instance
stability.
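
A minimal sketch of the deterministic partition ID idea, using the standard library's FNV-1a implementation. The function name, the 64-bit width, and the NUL separator between the two IDs are assumptions for illustration; the commit message only states that the ID is an FNV-1a hash of namespaceID+filesystemID.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionID derives a stable partition ID from the namespace and
// filesystem IDs. Because it depends only on its inputs, the same
// filesystem maps to the same partition after a process restart.
// ASSUMPTION: 64-bit FNV-1a with a NUL separator; the PR may differ.
func partitionID(namespaceID, filesystemID string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(namespaceID)) // hash.Hash writes never return an error
	h.Write([]byte{0})           // keeps ("ab","c") distinct from ("a","bc")
	h.Write([]byte(filesystemID))
	return h.Sum64()
}

func main() {
	a := partitionID("ns-1", "fs-1")
	b := partitionID("ns-1", "fs-1") // same inputs, e.g. a second provider instance
	c := partitionID("ns-1", "fs-2")
	fmt.Println(a == b, a == c)
}
```

A cross-instance stability test reduces to the `a == b` comparison above: two independently constructed providers must agree on the ID without sharing any state.
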
openFS and createFS leaked stores on error paths — the store was returned
to callers but never closed on failure. Now both methods close the store
internally on error and return only (*tfs.FS, error). Callers no longer
receive the store since f.Close() handles the store lifecycle.

Also add named constants for Statfs virtual capacity magic numbers and
wrap store-level errors through mapFSError for consistent error mapping.
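
The ownership rule described above (close the store on failure, hand it to the filesystem on success) can be sketched as follows. The `store` and `fsHandle` types and the `open` callback are hypothetical stand-ins for the real FSStore and *tfs.FS; only the shape of the error path is the point.

```go
package main

import (
	"errors"
	"fmt"
)

// store and fsHandle are hypothetical stand-ins for the FSStore and *tfs.FS types.
type store struct{ closed bool }

func (s *store) Close() error { s.closed = true; return nil }

type fsHandle struct{ s *store }

// Close closes the filesystem and the store it owns.
func (f *fsHandle) Close() error { return f.s.Close() }

// openFS opens a filesystem on s. On failure it closes s itself, so the
// caller only ever deals with (*fsHandle, error) and cannot leak the store.
func openFS(s *store, open func(*store) (*fsHandle, error)) (*fsHandle, error) {
	f, err := open(s)
	if err != nil {
		// errors.Join drops a nil Close error and keeps the open error.
		return nil, errors.Join(err, s.Close())
	}
	return f, nil
}

func openOK(s *store) (*fsHandle, error)   { return &fsHandle{s: s}, nil }
func openFail(s *store) (*fsHandle, error) { return nil, errors.New("superblock missing") }

func main() {
	bad := &store{}
	_, err := openFS(bad, openFail)
	fmt.Println(err, bad.closed)

	good := &store{}
	f, _ := openFS(good, openOK)
	fmt.Println(good.closed) // still open, now owned by f
	_ = f.Close()
	fmt.Println(good.closed)
}
```
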
QuotaCheckTaskExecutor silently swallowed GetStore/Open errors, returning
nil (success) so the task was never retried. Now returns the error for
retry by the task framework.

ChunkGCTaskExecutor ignored f.Close() errors. Now logs a warning.

- Fix PebbleStoreProvider description to reflect FNV-1a partition IDs
- Add WAL integration section covering walEngine, stateTracker, flusher,
  and recovery pipeline
- Clarify handler store lifecycle (openFS/createFS close on error)

- fx.go: Add fx.Lifecycle hook to close PebbleStoreProvider on shutdown,
  preventing PebbleDB resource leak.
- statemachine.go: Nil-check FilesystemState in SetStateMachineState to
  prevent panic on zero-value Filesystem.
- tasks.go: Log warning on f.Close() error in quotaCheckTaskExecutor
  (was silently ignored).

ReadDir was calling ReadDirByID then StatByID for every entry (N+1
queries). Now uses ReadDirPlusByID which returns embedded inode data
from the dir_scan keys, falling back to StatByID only for hardlinked
files where the inode isn't embedded. Also modernize Statfs min() usage.
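
To illustrate the N+1 fix, here is a rough sketch of resolving attributes from a directory scan that already embeds inode data, with a per-entry lookup only as a fallback. The `attr` and `dirEntry` types and the `stat` callback are hypothetical; they are not the temporal-fs API.

```go
package main

import "fmt"

// attr and dirEntry are hypothetical stand-ins for temporal-fs types.
type attr struct {
	InodeID uint64
	Size    int64
}

type dirEntry struct {
	Name    string
	InodeID uint64
	Attr    *attr // nil when the inode is not embedded (e.g. a hardlinked file)
}

// resolveAttrs returns one attr per entry, calling stat only for entries
// whose attributes were not embedded in the directory scan.
func resolveAttrs(entries []dirEntry, stat func(inodeID uint64) (attr, error)) ([]attr, error) {
	out := make([]attr, 0, len(entries))
	for _, e := range entries {
		if e.Attr != nil {
			out = append(out, *e.Attr)
			continue
		}
		a, err := stat(e.InodeID)
		if err != nil {
			return nil, fmt.Errorf("stat %q: %w", e.Name, err)
		}
		out = append(out, a)
	}
	return out, nil
}

func main() {
	statCalls := 0
	stat := func(id uint64) (attr, error) {
		statCalls++
		return attr{InodeID: id}, nil
	}
	entries := []dirEntry{
		{Name: "a.txt", InodeID: 2, Attr: &attr{InodeID: 2, Size: 10}},
		{Name: "b.txt", InodeID: 3, Attr: &attr{InodeID: 3, Size: 20}},
		{Name: "hardlink", InodeID: 2}, // not embedded, needs a lookup
	}
	attrs, err := resolveAttrs(entries, stat)
	fmt.Println(len(attrs), statCalls, err)
}
```

With three entries of which one lacks embedded data, the fallback runs once instead of three times.
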
ErrClosed and ErrVersionMismatch were not mapped, causing raw internal
errors to leak to clients. ErrLockConflict was also unmapped. Now:
- ErrLockConflict → FailedPrecondition
- ErrClosed, ErrVersionMismatch → Unavailable
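
The mapping above can be sketched with `errors.Is`, which also matches wrapped errors. To stay within the standard library, gRPC codes appear here as plain strings and the sentinel errors are local stand-ins for the temporal-fs ones; the `Internal` fallback for unknown errors is an assumption, not something the commit states.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the temporal-fs sentinel errors named in the commit message.
var (
	ErrLockConflict    = errors.New("lock conflict")
	ErrClosed          = errors.New("closed")
	ErrVersionMismatch = errors.New("version mismatch")
)

// mapCode returns the name of the gRPC status code an error maps to.
func mapCode(err error) string {
	switch {
	case err == nil:
		return "OK"
	case errors.Is(err, ErrLockConflict):
		return "FailedPrecondition"
	case errors.Is(err, ErrClosed), errors.Is(err, ErrVersionMismatch):
		return "Unavailable"
	default:
		return "Internal" // ASSUMPTION: fallback for unmapped errors
	}
}

func main() {
	wrapped := fmt.Errorf("read chunk: %w", ErrClosed)
	fmt.Println(mapCode(nil))
	fmt.Println(mapCode(ErrLockConflict))
	fmt.Println(mapCode(wrapped))
}
```
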
ChunkCount is uint64 and ChunksDeleted could exceed it (e.g., when
stats drift from actual FS state or on first GC run with zero-init
stats). The subtraction would wrap to a massive value, permanently
corrupting the persisted CHASM stats. Now clamps to zero.
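
A small sketch of the underflow and the clamp. The helper name `subClamp` is hypothetical; the variable names follow the commit message.

```go
package main

import "fmt"

// subClamp returns a-b, or zero when b > a, so the unsigned subtraction
// cannot wrap around to a huge value.
func subClamp(a, b uint64) uint64 {
	if b > a {
		return 0
	}
	return a - b
}

func main() {
	var chunkCount uint64 = 3    // persisted stat, possibly stale or zero-initialized
	var chunksDeleted uint64 = 5 // what the GC run actually removed

	fmt.Println(chunkCount - chunksDeleted)          // wraps around
	fmt.Println(subClamp(chunkCount, chunksDeleted)) // clamped to zero
}
```
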
moedash added 28 commits March 20, 2026 23:06
120+ research topics across science, tech, policy, and medicine domains.
Five template-based markdown generators produce deterministic content
for each workflow step (sources, summary, fact-check, report, review).

DemoStore wraps a shared PebbleDB with manifest management for tracking
workflows. The Temporal workflow chains 5 activities (WebResearch,
Summarize, FactCheck, FinalReport, PeerReview), each writing files and
creating MVCC snapshots through TemporalFS. Random failures are injected
per-activity with attempt-aware seeding so retries can succeed.
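
One way to read "attempt-aware seeding": the random source for each failure decision is seeded from the run seed, the workflow, the activity, and the attempt number, so a retry draws a fresh value while the whole run stays reproducible. The function below is a sketch under that assumption; its name, parameters, and hashing scheme are not taken from the demo code.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// shouldFail decides deterministically whether to inject a failure for one
// activity attempt. Including the attempt in the seed means a retry is not
// doomed to repeat the previous attempt's outcome.
func shouldFail(baseSeed int64, workflowID, activity string, attempt int, failureRate float64) bool {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d|%s|%s|%d", baseSeed, workflowID, activity, attempt)
	rng := rand.New(rand.NewSource(int64(h.Sum64())))
	return rng.Float64() < failureRate
}

func main() {
	for attempt := 1; attempt <= 4; attempt++ {
		fmt.Println(attempt, shouldFail(42, "wf-7", "Summarize", attempt, 0.5))
	}
}
```

In this sketch a failure rate of 1.0 fails every attempt; how the demo bounds injected failures at that rate is not shown in the commit message.
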
Runner starts N workflows via Temporal SDK with semaphore-based
concurrency control and atomic stat counters. Dashboard renders a
live ANSI terminal TUI at 200ms refresh with progress bar, throughput
metrics, and a 12-line color-coded activity feed.
main.go provides run/report/browse subcommands. Report generates a
self-contained HTML file with dark theme, stat cards, workflow table,
and expandable filesystem explorer. README covers usage, demo script,
architecture, and file structure.

Activities now open the FS and verify prior step's files exist BEFORE
injecting failures. On retry, each activity logs the number of files
from the previous step and the last snapshot name, proving TemporalFS
durability across failures. Retries are counted in real-time via shared
RunStats so the dashboard shows them as they happen.

Store workflow results (retries, status) in the manifest after each
workflow completes. The HTML report now shows a "Retries Survived"
stat card and per-workflow retry badges (yellow) and status badges
(green/red) in the workflow table.

Builds the binary, starts Temporal dev server, runs workflows, lists
them in Temporal, browses a filesystem, generates the HTML report,
and opens it in the browser. Supports --workflows, --concurrency,
--failure-rate, and --seed flags. Cleans up on exit.

- Add --continuous flag: runs workflows indefinitely until Ctrl+C,
  auto-opens Temporal UI, and generates HTML report on shutdown
- Dashboard shows animated cycling bar with "∞" in continuous mode
- Runner supports labeled break loop for graceful cancellation
- Update run-demo.sh with --continuous flag support
- Update README with run-demo.sh usage, continuous mode docs
- Add .gitignore for demo binaries and generated artifacts

The "tfs: not found" errors were caused by stale Temporal activity
tasks from previous runs being delivered to the new worker against
a fresh PebbleDB. Fix by:

1. Use unique task queue per run (research-demo-<timestamp>) to
   isolate each demo run from previous Temporal server state
2. Pre-create all FS partitions before starting workflows to
   ensure superblocks exist before any activity executes
3. Simplify openFS to two return values (removed diagnostic mutex and
   post-close verification that were added during investigation)
4. Remove unused createOrOpenFS method

Tested: 200 workflows, 50 concurrent, failure-rate=1.0, 0 errors.

When --no-dashboard is used, nobody reads from runner.EventCh.
After the buffer fills, goroutines block on channel writes and
can't finish, causing wg.Wait() to hang on shutdown.

Fix: spawn a goroutine to drain events when dashboard is disabled.

When the user presses Ctrl+C in continuous mode, in-flight workflows
are no longer waited on. Previously these were counted as "failed"
because run.Get() returned a context-cancelled error. Now we detect
context cancellation and exclude them from the failure count.

Activities now emit started/retrying/completed events to the shared
EventCh, giving the live dashboard real-time visibility into each
workflow step. Removed --no-dashboard from run-demo.sh so the TUI
shows by default during the demo.

- Pipe Temporal SDK and Go log output to <data-dir>/demo.log so the
  live dashboard isn't buried in log lines
- Fix dashboard box drawing: add visibleLen/boxLine helpers that
  auto-pad lines to exact box width, ignoring ANSI escape codes
- Reduce progress bar from 40 to 30 chars to fit 66-char box
- Make all EventCh sends in runOne non-blocking (select/default) to
  prevent goroutines from deadlocking when the channel buffer fills,
  which was holding semaphore slots and stopping new workflows
- Set WorkflowIDConflictPolicy to TERMINATE_EXISTING so stale
  workflows from previous runs don't cause ExecuteWorkflow failures
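
The non-blocking send from the list above uses Go's `select` with a `default` case: if the channel buffer is full, the event is dropped instead of blocking the sender. The `event` type and `trySend` helper are illustrative names.

```go
package main

import "fmt"

// event is a stand-in for whatever the dashboard consumes.
type event struct{ msg string }

// trySend enqueues ev if there is room and reports whether it was accepted.
// It never blocks, so a slow or absent reader cannot stall the sender.
func trySend(ch chan<- event, ev event) bool {
	select {
	case ch <- ev:
		return true
	default:
		return false // buffer full: drop the event
	}
}

func main() {
	ch := make(chan event, 2)
	for i := 0; i < 4; i++ {
		fmt.Println(i, trySend(ch, event{msg: fmt.Sprintf("step %d", i)}))
	}
}
```

Dropping events trades dashboard completeness for liveness, which suits a display-only feed where a goroutine holding a semaphore slot must never wait on the UI.
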
Include the task queue name (which contains a per-run timestamp) in
workflow IDs instead of using TERMINATE_EXISTING conflict policy.
This prevents workflows from previous runs being terminated while
ensuring each run's IDs are globally unique.
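
As a rough illustration, deriving workflow IDs from the per-run task queue name might look like this. The `research-demo-` prefix comes from the earlier commit; the timestamp format, the `-wf-` infix, and both function names are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// taskQueueName builds a per-run task queue name.
// ASSUMPTION: the timestamp is rendered as Unix nanoseconds.
func taskQueueName(start time.Time) string {
	return fmt.Sprintf("research-demo-%d", start.UnixNano())
}

// workflowID embeds the task queue name, so IDs from different runs never
// collide and no conflict policy is needed to displace older workflows.
func workflowID(taskQueue string, index int) string {
	return fmt.Sprintf("%s-wf-%d", taskQueue, index)
}

func main() {
	tq := taskQueueName(time.Now())
	fmt.Println(tq)
	fmt.Println(workflowID(tq, 0))
}
```
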
- Update all import paths from temporal-fs to temporal-zfs
- Rename tfs import alias to tzfs throughout
- Bump dependency to v1.2.0 (module path updated in upstream repo)

Replace all occurrences of "TemporalFS" with "TemporalZFS" in comments,
strings, and documentation across chasm/lib/temporalfs/ and tests/.
Generated protobuf code, proto definitions, package names, directory
paths, and import paths are intentionally left unchanged.

Rename directories, packages, and all references:
- chasm/lib/temporalfs → chasm/lib/temporalzfs
- temporalfspb → temporalzfspb
- docs/architecture/temporalfs.md → temporalzfs.md
- tests/temporalfs_test.go → temporalzfs_test.go
- temporal-fs → temporal-zfs in docs, CI GOPRIVATE/GONOSUMCHECK
- TemporalFS → TemporalZFS in architecture docs
- /tmp/tfs-demo → /tmp/tzfs-demo in demo scripts and README

Proto .pb.go files had stale raw descriptors from the temporalfs→temporalzfs
rename. Also fix unused loop variable in run-demo.sh (SC2034).

The research-agent-demo is an example app, not library code. Exclude
it from forbidigo (time.Now), errcheck, and revive rules that apply
to the chasm/lib package.