Skip to content

In-process virtual filesystem with hard quota for Python

License

Notifications You must be signed in to change notification settings

nightmarewalker/D-MemFS

Repository files navigation

D-MemFS

An in-process virtual filesystem with hard quota enforcement for Python.

Python 3.11+ Tests License: MIT Zero dependencies (runtime)

Languages: English | Japanese


Why MFS?

MemoryFileSystem gives you a fully isolated filesystem-like workspace inside a Python process.

  • Hard quota (MFSQuotaExceededError) to reject oversized writes before OOM
  • Hierarchical directories and multi-file operations (import_tree, copy_tree, move)
  • File-level RW locking + global structure lock for thread-safe operations
  • Free-threaded Python compatible (PYTHON_GIL=0) — stress-tested under 50-thread contention
  • Async wrapper (AsyncMemoryFileSystem) powered by asyncio.to_thread
  • Zero runtime dependencies (standard library only)

This is useful when io.BytesIO is too primitive (single buffer), and OS-level RAM disks/tmpfs are impractical (permissions, container policy, Windows driver friction).


Installation

pip install D-MemFS

Requirements: Python 3.11+


Quick Start

from dmemfs import MemoryFileSystem, MFSQuotaExceededError

mfs = MemoryFileSystem(max_quota=64 * 1024 * 1024)

mfs.mkdir("/data")
with mfs.open("/data/hello.bin", "wb") as f:
    f.write(b"hello")

with mfs.open("/data/hello.bin", "rb") as f:
    print(f.read())  # b"hello"

print(mfs.listdir("/data"))
print(mfs.is_file("/data/hello.bin"))  # True

try:
    with mfs.open("/huge.bin", "wb") as f:
        f.write(bytes(512 * 1024 * 1024))
except MFSQuotaExceededError as e:
    print(e)

API Highlights

MemoryFileSystem

  • open(path, mode, *, preallocate=0, lock_timeout=None)
  • mkdir, remove, rmtree, rename, move, copy, copy_tree
  • listdir, exists, is_dir, is_file, walk, glob
  • stat, stats, get_size
  • export_as_bytesio, export_tree, iter_export_tree, import_tree

Constructor parameters:

  • max_quota (default 256 MiB): byte quota for file data
  • max_nodes (default None): optional cap on total node count (files + directories). Raises MFSNodeLimitExceededError when exceeded.
  • default_storage (default "auto"): storage backend for new files — "auto" / "sequential" / "random_access"
  • promotion_hard_limit (default None): byte threshold above which Sequential→RandomAccess auto-promotion is suppressed (None uses the built-in 512 MiB limit)
  • chunk_overhead_override (default None): override the per-chunk overhead estimate used for quota accounting

Note: The BytesIO returned by export_as_bytesio() is outside quota management. Exporting large files may consume significant process memory beyond the configured quota limit.

Supported binary modes: rb, wb, ab, r+b, xb

MemoryFileHandle

  • read, write, seek, tell, truncate, flush, close
  • file-like capability checks: readable, writable, seekable

flush() is intentionally a no-op (compatibility API for file-like integrations).

stat() return (MFSStatResult)

size, created_at, modified_at, generation, is_dir

  • Supports both files and directories
  • For directories: size=0, generation=0, is_dir=True

Text Mode

D-MemFS natively operates in binary mode. For text I/O, use MFSTextHandle:

from dmemfs import MemoryFileSystem, MFSTextHandle

mfs = MemoryFileSystem()
mfs.mkdir("/data")

# Write text
with mfs.open("/data/hello.bin", "wb") as f:
    th = MFSTextHandle(f, encoding="utf-8")
    th.write("こんにちは世界\n")
    th.write("Hello, World!\n")

# Read text line by line
with mfs.open("/data/hello.bin", "rb") as f:
    th = MFSTextHandle(f, encoding="utf-8")
    for line in th:
        print(line, end="")

MFSTextHandle is a thin, bufferless wrapper. It encodes on write() and decodes on read() / readline(). Unlike io.TextIOWrapper, it introduces no buffering issues when used with MemoryFileHandle.


Use Case Tutorials

ETL Staging

Stage data through raw → processed → output directories:

from dmemfs import MemoryFileSystem

mfs = MemoryFileSystem(max_quota=16 * 1024 * 1024)
mfs.mkdir("/raw")
mfs.mkdir("/processed")

raw_data = b"id,name,value\n1,foo,100\n2,bar,200\n"
with mfs.open("/raw/data.csv", "wb") as f:
    f.write(raw_data)

with mfs.open("/raw/data.csv", "rb") as f:
    data = f.read()

with mfs.open("/processed/data.csv", "wb") as f:
    f.write(data.upper())

mfs.rmtree("/raw")  # cleanup staging

Archive-like Operations

Store, list, and export multiple files as a tree:

from dmemfs import MemoryFileSystem

mfs = MemoryFileSystem()
mfs.import_tree({
    "/archive/doc1.bin": b"Document 1",
    "/archive/doc2.bin": b"Document 2",
    "/archive/sub/doc3.bin": b"Document 3",
})

print(mfs.listdir("/archive"))  # ['doc1.bin', 'doc2.bin', 'sub']

snapshot = mfs.export_tree(prefix="/archive")  # dict of {path: bytes}

SQLite Snapshot

Serialize an in-memory SQLite DB into MFS and restore it later:

import sqlite3
from dmemfs import MemoryFileSystem

mfs = MemoryFileSystem()
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
conn.execute("INSERT INTO t VALUES (1, 'hello')")
conn.commit()

with mfs.open("/snapshot.db", "wb") as f:
    f.write(conn.serialize())
conn.close()

with mfs.open("/snapshot.db", "rb") as f:
    raw = f.read()
restored = sqlite3.connect(":memory:")
restored.deserialize(raw)
rows = restored.execute("SELECT * FROM t").fetchall()  # [(1, 'hello')]

Concurrency and Locking Notes

  • Path/tree operations are guarded by _global_lock.
  • File access is guarded by per-file ReadWriteLock.
  • lock_timeout behavior:
    • None: block indefinitely
    • 0.0: try-lock (fail immediately with BlockingIOError)
    • > 0: timeout in seconds, then BlockingIOError
  • Current ReadWriteLock is non-fair: under sustained read load, writers can starve.

Operational guidance:

  • Keep lock hold duration short
  • Set an explicit lock_timeout in latency-sensitive code paths
  • walk() and glob() provide weak consistency: each directory level is snapshotted under _global_lock, but the overall traversal is NOT atomic. Concurrent structural changes may produce inconsistent results.

Async Usage

from dmemfs import AsyncMemoryFileSystem

async def run() -> None:
    mfs = AsyncMemoryFileSystem(max_quota=64 * 1024 * 1024)
    await mfs.mkdir("/a")
    async with await mfs.open("/a/f.bin", "wb") as f:
        await f.write(b"data")
    async with await mfs.open("/a/f.bin", "rb") as f:
        print(await f.read())

Benchmarks

Minimal benchmark tooling is included:

  • MFS vs io.BytesIO vs PyFilesystem2 (MemoryFS) vs tempfile
  • Cases: many-small-files and stream write/read
  • Optional report output to benchmarks/results/

Note: As of setuptools 82 (February 2026), pyfilesystem2 fails to import due to a known upstream issue (#597). Benchmark results including PyFilesystem2 were measured with setuptools ≤ 81 and are valid as historical comparison data.

Run:

uvx --with-requirements requirements.txt --with-editable . python benchmarks/compare_backends.py --save-md auto --save-json auto

See BENCHMARK.md for details.

Latest benchmark snapshot:


Testing and Coverage

Test execution and dev flow are documented in TESTING.md.

Typical local run:

uv pip compile requirements.in -o requirements.txt
uvx --with-requirements requirements.txt --with-editable . pytest tests/ -v --timeout=30 --cov=dmemfs --cov-report=xml --cov-report=term-missing

CI (.github/workflows/test.yml) runs tests with coverage XML generation.


API Docs Generation

API docs can be generated as Markdown (viewable on GitHub) using pydoc-markdown:

uvx --with pydoc-markdown --with-editable . pydoc-markdown '{
  loaders: [{type: python, search_path: [.]}],
  processors: [{type: filter, expression: "default()"}],
  renderer: {type: markdown, filename: docs/api_md/index.md}
}'

Or as HTML using pdoc (local browsing only):

uvx --with-requirements requirements.txt pdoc dmemfs -o docs/api

Compatibility and Non-Goals

  • Core open() is binary-only (rb, wb, ab, r+b, xb). Text I/O is available via the MFSTextHandle wrapper.
  • No symlink/hardlink support — intentionally omitted to eliminate path traversal loops and structural complexity (same rationale as pathlib.PurePath).
  • No direct pathlib.Path / os.PathLike API — MFS paths are virtual and must not be confused with host filesystem paths. Accepting os.PathLike would allow third-party libraries or a plain open() call to silently treat an MFS virtual path as a real OS path, potentially issuing unintended syscalls against the host filesystem. All paths must be plain str with POSIX-style absolute notation (e.g. "/data/file.txt").
  • No kernel filesystem integration (intentionally in-process only)

Auto-promotion behavior:

  • By default (default_storage="auto"), new files start as SequentialMemoryFile and auto-promote to RandomAccessMemoryFile when random writes are detected.
  • Promotion is one-way (no downgrade back to sequential).
  • Use default_storage="sequential" or "random_access" to fix the backend at construction; use promotion_hard_limit to suppress auto-promotion above a byte threshold.
  • Storage promotion temporarily doubles memory usage for the promoted file. The quota system accounts for this, but process-level memory may spike briefly.

Security note: In-memory data may be written to physical disk via OS swap or core dumps. MFS does not provide memory-locking (e.g., mlock) or secure erasure. Do not rely on MFS alone for sensitive data isolation.


Exception Reference

Exception Typical cause
MFSQuotaExceededError write/import/copy would exceed quota
MFSNodeLimitExceededError node count would exceed max_nodes (subclass of MFSQuotaExceededError)
FileNotFoundError path missing
FileExistsError creation target already exists
IsADirectoryError file operation on directory
NotADirectoryError directory operation on file
BlockingIOError lock timeout or open-file conflict
io.UnsupportedOperation mode mismatch / unsupported operation
ValueError invalid mode/path/seek/truncate arguments

Testing with pytest

D-MemFS ships a pytest plugin that provides an mfs fixture:

# conftest.py — register the plugin explicitly
pytest_plugins = ["dmemfs._pytest_plugin"]

Note: The plugin is not auto-discovered. Users must declare it in conftest.py to opt in.

# test_example.py
def test_write_read(mfs):
    mfs.mkdir("/tmp")
    with mfs.open("/tmp/hello.txt", "wb") as f:
        f.write(b"hello")
    with mfs.open("/tmp/hello.txt", "rb") as f:
        assert f.read() == b"hello"

Development Notes

Design documents (Japanese):

These documents are written in Japanese and serve as internal design references.


Performance Summary

Key results from the included benchmark (300 small files × 4 KiB, 16 MiB stream, 2 GiB large stream):

Case MFS (ms) BytesIO (ms) tempfile (ms)
small_files_rw 34 5 164
stream_write_read 64 51 17
random_access_rw 24 53 27
large_stream_write_read 1 438 7 594 1 931
many_files_random_read 777 163 4 745

MFS incurs a small overhead on tiny-file workloads but delivers significantly better performance on large streams and random-access patterns compared with BytesIO. See BENCHMARK.md and benchmark_current_result.md for full data.

Note: tempfile results above were measured with the system temp directory on a RAM disk. On a physical SSD/HDD, tempfile performance will be substantially slower.


License

MIT License