fix(security): bound pickle metadata reads in metadata extraction #712

Open
mldangelo wants to merge 5 commits into main from codex/fix-dos-vulnerability-in-pickle-extraction
Conversation

Member

@mldangelo mldangelo commented Mar 16, 2026

Motivation

  • Prevent unbounded memory reads during metadata extraction of pickle files, which previously allowed a DoS by calling f.read() on arbitrarily large .pkl files.
  • Ensure metadata path respects existing file-size safeguards without changing scanner behavior for normal files.

Description

  • Add a bounded-read guard in PickleScanner.extract_metadata() using a new config key max_metadata_pickle_read_size (default 10 MiB), raising a ValueError when the file or read exceeds the limit.
  • Read at most max_metadata_pickle_read_size + 1 bytes and surface a clear extraction_error when the limit is exceeded to avoid large allocations while preserving opcode analysis for small files.
  • Add a regression test test_pickle_metadata_enforces_read_limit to tests/test_metadata_extractor.py that verifies oversized pickle metadata extraction is rejected.
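The guard described above can be sketched as a standalone, simplified function (names and messages are illustrative; the real implementation lives in PickleScanner.extract_metadata() and also performs opcode analysis on the bounded data):

```python
import os

DEFAULT_LIMIT = 10 * 1024 * 1024  # 10 MiB default, as in the PR

def extract_pickle_metadata(path: str, max_read_size: int = DEFAULT_LIMIT) -> dict:
    """Bounded-read guard, simplified: validate limit, check size, cap the read."""
    metadata: dict = {}
    try:
        if max_read_size <= 0:
            raise ValueError(
                f"Invalid pickle metadata read limit: {max_read_size} (must be greater than 0)"
            )
        file_size = os.path.getsize(path)
        if file_size > max_read_size:
            raise ValueError(
                f"Pickle metadata read limit exceeded: {file_size} bytes > {max_read_size}"
            )
        with open(path, "rb") as f:
            # Read at most limit + 1 bytes: one spare byte detects an
            # over-limit stream without buffering the whole file.
            data = f.read(max_read_size + 1)
        if len(data) > max_read_size:
            raise ValueError(
                f"Pickle metadata read limit exceeded: read {len(data)} bytes > {max_read_size}"
            )
        metadata["pickle_size"] = len(data)
        # Protocol 2+ pickles start with the PROTO opcode (0x80) + version byte.
        if len(data) >= 2 and data[:1] == b"\x80":
            metadata["pickle_protocol"] = data[1]
    except (OSError, ValueError) as exc:
        metadata["extraction_error"] = str(exc)
    return metadata
```

An oversized file or a non-positive limit both surface as an extraction_error instead of triggering a large allocation, which matches the behavior the regression test below asserts.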

Testing

  • Ran formatting and linting: uv run ruff format modelaudit/ tests/ (reformatted 2 files) and uv run ruff check --fix modelaudit/ tests/ (passed).
  • Type checking: uv run mypy modelaudit/ (passed with no issues).
  • Added regression unit test and validated behavior with a focused runtime check (small Python snippet asserting PickleScanner({'max_metadata_pickle_read_size':64}).extract_metadata() returns an extraction_error for a 128-byte file), which passed.
  • Full test suite uv run pytest -n auto -m "not slow and not integration" --maxfail=1 was attempted but encountered unrelated, pre-existing failures in other tests; the change-specific checks and linters passed.

Codex Task

Summary by CodeRabbit

  • New Features

    • Added a configurable read limit for pickle metadata extraction (default 10 MB). The extractor now requires a positive limit and rejects files whose metadata exceeds the configured cap.
  • Tests

    • Added parameterized tests validating read-limit enforcement for valid, zero, negative, and oversized cases, and updated a pickle-related test to use a simpler pickle stream.

Contributor

coderabbitai bot commented Mar 16, 2026

Walkthrough

Adds a configurable read-size cap for pickle metadata extraction: validates the configured limit (>0), checks file size against it, performs a capped read (limit + 1) to detect over-read, infers protocol from the read bytes, and raises ValueError when limits are exceeded.

Changes

Pickle Scanner — modelaudit/scanners/pickle_scanner.py
  • Introduce max_metadata_read_size (default 10 MB) with validation (>0): stat the file size and enforce the limit, read at most limit + 1 bytes to detect overflow, set pickle_size to the actual read length, detect the protocol from the read bytes, and raise ValueError when limits are exceeded.

Tests — Metadata Extractor — tests/test_metadata_extractor.py
  • Add parameterized test test_pickle_metadata_enforces_read_limit (cases: valid limit, zero, negative) and integrate read-limit enforcement into the existing dangerous-opcodes test. Update an existing safe-data test to use protocol 0 pickled data.

Sequence Diagram

sequenceDiagram
    participant Scanner as PickleScanner
    participant Config as Configuration
    participant FileIO as File I/O
    participant Validator as Validator
    participant Metadata as MetadataExtractor

    Scanner->>Config: read max_metadata_read_size
    Config-->>Scanner: return limit

    Scanner->>FileIO: stat file -> size
    FileIO-->>Scanner: return size

    Scanner->>Validator: ensure limit > 0
    Validator-->>Scanner: ok / raise ValueError

    Scanner->>Validator: compare file size <= limit
    alt file size > limit
        Validator-->>Scanner: raise ValueError
    else file size <= limit
        Scanner->>FileIO: read up to (limit + 1) bytes
        FileIO-->>Scanner: return pickle_data

        Scanner->>Validator: ensure len(pickle_data) <= limit
        alt read exceeded limit
            Validator-->>Scanner: raise ValueError
        else within limit
            Scanner->>Scanner: detect protocol from pickle_data
            Scanner->>Metadata: extract metadata from limited data
            Metadata-->>Scanner: return metadata
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble bytes with careful paws,
A limit set to guard the laws.
Read one extra, just to peep,
If it spills—no secrets to keep.
Hops and metadata, tidy and neat.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name — Status — Explanation
Description Check — ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
Title Check — ✅ Passed — The title directly and clearly describes the main security fix: introducing bounded/capped reads for pickle metadata to prevent denial-of-service attacks from unbounded memory reads.
Docstring Coverage — ✅ Passed — Docstring coverage is 100.00%, which meets the required threshold of 80.00%.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 6129-6148: The code allows non-positive max_metadata_read_size
which causes f.read(-1) to read the entire file and bypass the guard; change the
logic in the pickle metadata read block (where max_metadata_read_size,
get_file_size, open(file_path) and pickle_data are used) to reject non-positive
values or coerce them to a safe default (e.g., 10*1024*1024) before reading;
specifically, validate max_metadata_read_size > 0 at the start of the try block
and if it is <= 0 raise a ValueError (or set it to the documented default) so
that f.read is always called with a bounded positive length and the subsequent
length checks remain effective.
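The hazard this comment flags is real Python behavior: a negative size passed to a binary file's read() means "read until EOF", so a non-positive configured limit would silently disable the cap. A quick standalone check (file name and sizes are illustrative):

```python
import os
import tempfile

# f.read(-1) behaves like f.read(): it slurps the entire file,
# which is why a non-positive limit must be rejected before reading.
path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
with open(path, "wb") as f:
    f.write(b"x" * 1024)

with open(path, "rb") as f:
    data = f.read(-1)  # negative size: read until EOF

print(len(data))  # → 1024
```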

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2bc940b8-4f9f-4bb4-9f56-2b5dbb99ec89

📥 Commits

Reviewing files that changed from the base of the PR and between d9fe283 and b639d2d.

📒 Files selected for processing (2)
  • modelaudit/scanners/pickle_scanner.py
  • tests/test_metadata_extractor.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_metadata_extractor.py`:
- Around line 469-489: The parametrized test
test_pickle_metadata_enforces_read_limit currently only asserts failure cases;
add a valid success case (e.g., limit 256) to the ("limit", "expected_error")
matrix and update assertions so that when limit is the success value the test
asserts that PickleScanner({"max_metadata_pickle_read_size":
limit}).extract_metadata(...) does not contain "extraction_error" (i.e., assert
"extraction_error" not in metadata), using the same tmp_path pkl_file setup and
referencing PickleScanner and the max_metadata_pickle_read_size config key.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4e0768a7-e299-4842-8114-1b30f84ba28b

📥 Commits

Reviewing files that changed from the base of the PR and between b639d2d and d3ff557.

📒 Files selected for processing (2)
  • modelaudit/scanners/pickle_scanner.py
  • tests/test_metadata_extractor.py

Comment on lines +469 to +489
```python
@pytest.mark.parametrize(
    ("limit", "expected_error"),
    [
        (64, "read limit exceeded"),
        (0, "must be greater than 0"),
        (-1, "must be greater than 0"),
    ],
)
def test_pickle_metadata_enforces_read_limit(self, tmp_path: Path, limit: int, expected_error: str) -> None:
    """Ensure pickle metadata extraction rejects oversized and invalid read limits."""
    from modelaudit.scanners.pickle_scanner import PickleScanner

    pkl_file = tmp_path / "oversized.pkl"
    pkl_file.write_bytes(b"x" * 128)

    scanner = PickleScanner({"max_metadata_pickle_read_size": limit})
    metadata = scanner.extract_metadata(str(pkl_file))

    assert "extraction_error" in metadata
    assert expected_error in metadata["extraction_error"]
```

🧹 Nitpick | 🔵 Trivial

Add a success-path case to this limit matrix.

This parametrization only checks failing paths. Add one valid limit case (e.g., 256) and assert metadata extraction succeeds without extraction_error to prevent over-restrictive regressions.

Proposed test refinement

```diff
 @pytest.mark.parametrize(
     ("limit", "expected_error"),
     [
         (64, "read limit exceeded"),
         (0, "must be greater than 0"),
         (-1, "must be greater than 0"),
+        (256, None),
     ],
 )
-def test_pickle_metadata_enforces_read_limit(self, tmp_path: Path, limit: int, expected_error: str) -> None:
+def test_pickle_metadata_enforces_read_limit(
+    self, tmp_path: Path, limit: int, expected_error: str | None
+) -> None:
@@
-    assert "extraction_error" in metadata
-    assert expected_error in metadata["extraction_error"]
+    if expected_error is None:
+        assert "extraction_error" not in metadata
+        assert metadata.get("pickle_size") == 128
+    else:
+        assert "extraction_error" in metadata
+        assert expected_error in metadata["extraction_error"]
```
Based on learnings "Preserve or strengthen security detections; test both benign and malicious samples when adding scanner/feature changes".


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 6885-6895: The code reads max_metadata_read_size from self.config
but only rejects non-positive values and allows arbitrarily large sizes; change
the logic around max_metadata_read_size in pickle_scanner.py (the variable
max_metadata_read_size, the config.get call) so that after parsing you enforce a
hard upper bound of 10 * 1024 * 1024 (10 MiB) — e.g., validate > 0 then apply
max_metadata_read_size = min(max_metadata_read_size, 10 * 1024 * 1024) (or
replace oversized values with the cap) to ensure metadata reads cannot exceed 10
MiB.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 38d7bd2d-53f6-42de-9347-8bad4c7dc0e8

📥 Commits

Reviewing files that changed from the base of the PR and between d3ff557 and 93c38ad.

📒 Files selected for processing (1)
  • modelaudit/scanners/pickle_scanner.py

Comment on lines +6885 to +6895
```python
max_metadata_read_size = int(self.config.get("max_metadata_pickle_read_size", 10 * 1024 * 1024))

try:
    import pickle
    import pickletools
    from io import BytesIO

    if max_metadata_read_size <= 0:
        raise ValueError(
            f"Invalid pickle metadata read limit: {max_metadata_read_size} (must be greater than 0)"
        )
```

⚠️ Potential issue | 🟠 Major

Enforce a hard 10 MiB upper bound for max_metadata_pickle_read_size.

The new validation rejects non-positive values, but any very large positive config still weakens the metadata DoS protection.

🔧 Proposed fix

```diff
             if max_metadata_read_size <= 0:
                 raise ValueError(
                     f"Invalid pickle metadata read limit: {max_metadata_read_size} (must be greater than 0)"
                 )
+            hard_cap = 10 * 1024 * 1024
+            if max_metadata_read_size > hard_cap:
+                raise ValueError(
+                    f"Invalid pickle metadata read limit: {max_metadata_read_size} (must be <= {hard_cap})"
+                )
```
As per coding guidelines "Cap archive member reads to 10 MB for metadata validation to prevent memory spikes on large pickles".
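An alternative to raising on oversized configs is the clamp variant the prompt below mentions. A minimal standalone sketch (the helper name bounded_limit is illustrative, not part of the codebase):

```python
HARD_CAP = 10 * 1024 * 1024  # 10 MiB hard ceiling, per the coding guideline

def bounded_limit(configured: int) -> int:
    """Validate a configured read limit, then clamp it to the hard cap."""
    if configured <= 0:
        raise ValueError(
            f"Invalid pickle metadata read limit: {configured} (must be greater than 0)"
        )
    # min() silently caps oversized values instead of rejecting them.
    return min(configured, HARD_CAP)
```

With this in place, even a misconfigured limit of, say, 10**12 bytes still bounds metadata reads to 10 MiB; the trade-off versus raising is that the user gets no error signal that their configured value was ignored.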


1 participant