fix: enforce decompression limits for compressed tar wrappers#709

Open
mldangelo wants to merge 5 commits into main from codex/fix-vulnerability-in-tar-scanner-routing

Conversation


@mldangelo mldangelo commented Mar 16, 2026

Motivation

  • Recent routing changes caused compound compressed TAR filenames (e.g. .tar.gz, .tar.bz2, .tar.xz) to be handled directly by TarScanner, bypassing the CompressedScanner wrapper-level decompression size/ratio guardrails and enabling potential compression-bomb DoS during scanning.
  • The change restores wrapper-level protections for compressed TAR wrappers while keeping TarScanner behavior and per-entry limits intact.

Description

  • Added wrapper-level decompression limits and configuration support to TarScanner by introducing DEFAULT_MAX_DECOMPRESSED_BYTES, DEFAULT_MAX_DECOMPRESSION_RATIO, and COMPRESSED_TAR_EXTENSIONS, and by reading compressed_max_decompressed_bytes and compressed_max_decompression_ratio from the scanner config (modelaudit/scanners/tar_scanner.py).
  • Implemented pre-extraction checks that compute total_member_size and compressed_size for compressed-TAR inputs and emit a Compressed Wrapper Decompression Limits check and short-circuit the scan when limits are exceeded.
  • Kept existing per-entry extraction limits and nested-TAR scanning logic intact and re-used the existing compressed_* config keys so operator settings remain consistent.
  • Added regression tests to assert compressed-TAR ratio and decompressed-size enforcement and imported CheckStatus in tests/scanners/test_tar_scanner.py to validate failed checks.
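
The pre-extraction check described above can be sketched as follows. This is a minimal illustration based only on the PR description: the constant values, the suffix-based gating, and the tuple return shape are assumptions; the actual scanner in modelaudit/scanners/tar_scanner.py emits a "Compressed Wrapper Decompression Limits" check on the scan result rather than returning a tuple.

```python
import os
import tarfile

# Hypothetical defaults mirroring the names in the PR description;
# the real values in tar_scanner.py may differ.
DEFAULT_MAX_DECOMPRESSED_BYTES = 1 * 1024 * 1024 * 1024  # 1 GiB
DEFAULT_MAX_DECOMPRESSION_RATIO = 100.0
COMPRESSED_TAR_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz")


def preflight_compressed_tar(
    path: str,
    max_bytes: int = DEFAULT_MAX_DECOMPRESSED_BYTES,
    max_ratio: float = DEFAULT_MAX_DECOMPRESSION_RATIO,
) -> tuple[bool, str]:
    """Compare projected decompressed size against wrapper-level limits.

    Sketch only: returns (ok, reason) instead of recording a scan check.
    """
    if not path.lower().endswith(COMPRESSED_TAR_EXTENSIONS):
        return True, "not a compressed tar wrapper"
    compressed_size = os.path.getsize(path)
    with tarfile.open(path, "r:*") as tar:
        total_member_size = sum(m.size for m in tar.getmembers())
    if total_member_size > max_bytes:
        return False, f"decompressed size {total_member_size} exceeds {max_bytes}"
    if compressed_size > 0 and total_member_size / compressed_size > max_ratio:
        ratio = total_member_size / compressed_size
        return False, f"decompression ratio {ratio:.1f} exceeds {max_ratio}"
    return True, "within limits"
```

A highly compressible payload (e.g. a megabyte of zeros gzipped down to a kilobyte or so) trips the ratio check, which is exactly the compression-bomb shape the guardrail targets.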

Testing

  • Ran targeted TAR tests: uv run pytest tests/scanners/test_tar_scanner.py -k 'decompression_ratio_limit or decompressed_size_limit' and both new checks passed.
  • Verified related compressed-wrapper routing tests: uv run pytest tests/scanners/test_compressed_scanner.py -k 'compound_tar_wrappers' and they passed.
  • Ran code quality and typing checks: uv run ruff format, uv run ruff check --fix, and uv run mypy modelaudit/ which all passed.
  • Ran the full non-integration unit suite (uv run pytest -n auto -m "not slow and not integration"); it completed but stopped on a pre-existing, unrelated failure in tests/utils/helpers/test_secure_hasher.py::TestErrorHandling::test_hash_permission_denied, which is not caused by these changes.

Codex Task

Summary by CodeRabbit

  • New Features

    • Enhanced TAR scanning: detects compressed TAR wrappers, preflight-streams archives, and enforces configurable decompressed-size and decompression-ratio limits; improves format detection to recognize TAR content even when disguised.
    • Core routing updated to prefer the TAR scanner when TAR content is detected.
  • Tests

    • Added tests for decompression-ratio and decompressed-size enforcement, padding handling, content-based detection of compressed TARs, and end-to-end routing of disguised compressed TARs.


coderabbitai bot commented Mar 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b952a1bf-541d-4a48-a041-5aeb23d19c40

📥 Commits

Reviewing files that changed from the base of the PR and between 31faa50 and 099d45c.

📒 Files selected for processing (5)
  • modelaudit/core.py
  • modelaudit/scanners/tar_scanner.py
  • modelaudit/utils/file/detection.py
  • tests/scanners/test_tar_scanner.py
  • tests/utils/file/test_filetype.py

Walkthrough

Adds compressed-TAR detection and preflight decompression-safety checks: new module defaults and per-instance thresholds, magic-byte detection for gzip/bzip2/xz wrappers, streamed estimation of decompressed TAR size, and short-circuiting scan results when decompressed-size or decompression-ratio limits are exceeded.
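
The streamed estimation mentioned above might look like the following sketch. The function name and return shape are hypothetical (the real _preflight_tar_archive integrates with the scanner's check machinery); it walks members incrementally with tar.next() so limits are enforced without materializing the full member list:

```python
import os
import tarfile


def stream_estimate(path: str, max_bytes: int, max_ratio: float) -> bool:
    """Incrementally accumulate the estimated uncompressed TAR stream size,
    aborting as soon as a size or ratio limit is crossed.

    Sketch only; names and return value are illustrative, not the
    actual modelaudit code.
    """
    compressed_size = os.path.getsize(path)
    total = 1024  # end-of-archive trailer: two zero-filled 512-byte blocks
    with tarfile.open(path, "r:*") as tar:
        while (member := tar.next()) is not None:
            total += 512  # every member carries a 512-byte header block
            if member.isfile():
                # file data is padded up to full 512-byte blocks
                total += ((member.size + 511) // 512) * 512
            if total > max_bytes:
                return False
            if compressed_size > 0 and total / compressed_size > max_ratio:
                return False
    return True
```

Because the checks run inside the loop, a TAR bomb is rejected after the first offending member instead of after a full member enumeration.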

Changes

Cohort / File(s) Summary
TAR Scanner core
modelaudit/scanners/tar_scanner.py
Adds DEFAULT_MAX_DECOMPRESSED_BYTES and DEFAULT_MAX_DECOMPRESSION_RATIO; per-instance max_decompressed_bytes/max_decompression_ratio; magic headers (_GZIP_MAGIC, _BZIP2_MAGIC, _XZ_MAGIC); _detect_compressed_tar_wrapper, _finalize_tar_stream_size, _add_compressed_wrapper_limit_check, and _preflight_tar_archive; integrates preflight checks into _scan_tar_file to short-circuit on limit failures.
File format detection util
modelaudit/utils/file/detection.py
Adds _is_tar_archive(path) and integrates TAR detection into detect_file_format so TAR (including compressed TARs) can be identified by content earlier in detection flow.
Core scanner mapping
modelaudit/core.py
Adds format-to-scanner mapping entry for "tar" to route TAR-formatted archives to the tar scanner.
TarScanner tests
tests/scanners/test_tar_scanner.py
Adds numerous tests for compressed-TAR wrappers: enforcement of decompression-ratio and decompressed-size limits across gzip/bzip2/xz; padding-accounting; streaming-based preflight; detection-by-content vs suffix; and integration with core.scan_file. Updates imports to include Literal, CheckStatus, and core.
File-detection tests
tests/utils/file/test_filetype.py
Adds test_detect_file_format_disguised_compressed_tar_by_content to ensure content-based detection classifies a disguised compressed TAR correctly and reports outer compression codec.

Sequence Diagram

sequenceDiagram
    actor Caller
    participant Scanner as TarScanner
    participant File as Compressed TAR File
    participant Detector as Wrapper Detector
    participant Streamer as TAR Streamer/Reader
    participant Validator as Decompression Validator

    Caller->>Scanner: scan(path)
    Scanner->>File: read header bytes / open file
    Scanner->>Detector: _detect_compressed_tar_wrapper(path/header)
    alt compressed wrapper detected
        Scanner->>File: open compressed stream
        File->>Streamer: stream TAR headers/records
        Streamer->>Scanner: consumed_size, entry_count (streamed)
        Scanner->>Validator: _finalize_tar_stream_size(consumed_size)
        Validator-->>Scanner: estimated_decompressed_size
        Scanner->>Validator: compute ratio = decompressed / compressed_size
        rect rgba(255, 99, 71, 0.5)
        Validator->>Scanner: compare to max_decompressed_bytes & max_decompression_ratio
        alt any limit exceeded
            Validator-->>Scanner: limit exceeded (record check)
            Scanner->>Caller: return short-circuit FAILED result with check
        else within limits
            Validator-->>Scanner: limits OK
            Scanner->>Streamer: continue full extraction/scan per-entry
            Scanner->>Caller: return full scan results
        end
        end
    else no compressed wrapper
        Scanner->>Streamer: standard TAR streaming and per-entry scanning
        Scanner->>Caller: return scan results
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nudge the tar, I sniff the wrap,

Gzip, BZ2, XZ — I mind the gap.
I stream the headers, tally the size,
I flag the bloat before it flies.
A cautious hop keeps extractions wise.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 76.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title clearly and accurately describes the main change, enforcing decompression limits for compressed TAR wrappers, which directly addresses the vulnerability described in the PR objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/tar_scanner.py`:
- Line 210: The current total_member_size calculation only sums member.size
(visible payload) and misses TAR overhead; update the logic in tar_scanner.py
(replace the sum(...) that sets total_member_size) to compute the full TAR
stream size by iterating members and for each member adding a 512-byte header,
plus the file data rounded up to 512-byte blocks ((member.size + 511) // 512 *
512) for regular files (use member.isfile()), and for non-file entries count
just the 512-byte header; after the loop add the archive EOF trailer of 1024
bytes (two 512-byte blocks) to the total to get the correct wrapper budget.
- Around line 232-234: The decompression checks rely on filename suffix
(COMPRESSED_TAR_EXTENSIONS) but tarfile.open(..., "r:*") at the earlier tar
extraction step will transparently decompress payloads regardless of name;
update the logic to detect compression from file content instead of
lower_path.endswith by probing the file header bytes (e.g., check for gzip
0x1f8b, bzip2 "BZh", xz "\xfd7zXZ", lzma signatures) before running the
decompression-ratio and size checks so renamed compressed archives are treated
the same; locate the code around tarfile.open(...) (used at line ~208) and the
variables compressed_size, lower_path and the COMPRESSED_TAR_EXTENSIONS check
and replace the suffix check with a small helper that reads the first few bytes
of the path/fileobj to set an is_compressed flag and gate the decompression
validations on that flag.

In `@tests/scanners/test_tar_scanner.py`:
- Around line 515-549: Parameterize the failing tests
(test_scan_tar_gz_enforces_decompression_ratio_limit and
test_scan_tar_gz_enforces_decompressed_size_limit) over all supported
compressed-TAR suffixes (e.g., ".tar.gz", ".tar.bz2", ".tar.xz") and create an
additional within-limit positive case that asserts CheckStatus.PASSED; for each
suffix produce the archive with the matching tarfile mode (w:gz, w:bz2, w:xz),
instantiate TarScanner with the same config keys
(compressed_max_decompression_ratio and compressed_max_decompressed_bytes), run
scanner.scan, and assert the check with name "Compressed Wrapper Decompression
Limits" fails for the malicious inputs and passes for the benign input. Ensure
you reuse TarScanner, result.checks and CheckStatus in the assertions so both
fail and pass paths are covered for every compressed wrapper format.
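
A sketch of the suggested parameterization, using stand-alone helpers to build per-suffix archives and measure their decompression ratio. TarScanner itself is not imported here, and names like make_archive are illustrative:

```python
import io
import os
import tarfile

# Suffix-to-tarfile-mode mapping per the reviewer's suggestion.
SUFFIX_MODES = {".tar.gz": "w:gz", ".tar.bz2": "w:bz2", ".tar.xz": "w:xz"}


def make_archive(directory: str, stem: str, suffix: str, payload: bytes) -> str:
    """Create a single-member compressed TAR using the mode matching the suffix."""
    path = os.path.join(directory, stem + suffix)
    with tarfile.open(path, SUFFIX_MODES[suffix]) as tar:
        info = tarfile.TarInfo("payload.bin")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return path


def decompression_ratio(path: str) -> float:
    """Total member payload bytes divided by on-disk compressed size."""
    compressed = os.path.getsize(path)
    with tarfile.open(path, "r:*") as tar:
        total = sum(m.size for m in tar.getmembers())
    return total / compressed if compressed else 0.0
```

In a pytest module these helpers could back an @pytest.mark.parametrize over SUFFIX_MODES, asserting that the "Compressed Wrapper Decompression Limits" check fails for a zero-filled bomb payload and passes (CheckStatus.PASSED) for a small incompressible one, covering both paths for every wrapper format.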

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d203050f-15d4-4688-932a-e7f780a05731

📥 Commits

Reviewing files that changed from the base of the PR and between d9fe283 and 712e4b4.

📒 Files selected for processing (2)
  • modelaudit/scanners/tar_scanner.py
  • tests/scanners/test_tar_scanner.py


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelaudit/scanners/tar_scanner.py (1)

236-304: ⚠️ Potential issue | 🟠 Major

Enforce decompression limits during member iteration, not after getmembers().

Line 237's getmembers() fully enumerates and loads all TAR members into memory before the entry count check at line 238. This causes immediate resource exhaustion on TAR bombs—the damage occurs inside getmembers() before the check can abort. The decompression limit checks (lines 264–303) execute even later, after the members list is already materialized.

Refactor to use tar.next() for incremental member iteration: checks should execute during iteration, aborting before substantial resource consumption. Ensure testing includes malicious TAR bomb samples alongside benign archives to verify the fix prevents DoS.

🔧 Suggested refactor direction
 with tarfile.open(path, "r:*") as tar:
-    members = tar.getmembers()
+    members: list[tarfile.TarInfo] = []
+    compression_codec = self._detect_compressed_tar_wrapper(path)
+    compressed_size = os.path.getsize(path)
+    total_member_size = 1024 if compression_codec is not None else 0
+
+    while True:
+        member = tar.next()
+        if member is None:
+            break
+        members.append(member)
+        if len(members) > self.max_entries:
+            # existing Entry Count Limit Check failure path
+            return result
+        if compression_codec is not None:
+            total_member_size += 512
+            if member.isfile():
+                total_member_size += ((member.size + 511) // 512) * 512
+            if total_member_size > self.max_decompressed_bytes:
+                # existing Compressed Wrapper Decompression Limits failure path
+                return result
+            if compressed_size > 0 and (total_member_size / compressed_size) > self.max_decompression_ratio:
+                # existing Compressed Wrapper Decompression Limits failure path
+                return result
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/tar_scanner.py` around lines 236 - 304, The current code
calls tar.getmembers() (in tarfile.open(...)) which materializes the entire
member list and can OOM on TAR bombs; instead iterate the archive incrementally
using tar.next() (or tar.__iter__()) and perform the checks as you stream
members: maintain a running entry_count and running total_member_size
(use/adjust _estimate_tar_stream_size to compute per-member size or compute from
TarInfo.size), abort and add the existing "Entry Count Limit Check" or
"Compressed Wrapper Decompression Limits" results as soon as entry_count >
self.max_entries, total_member_size > self.max_decompressed_bytes, or
(total_member_size / compressed_size) > self.max_decompression_ratio, and return
immediately; remove the getmembers() usage and any code paths that assume a full
members list so the scanner enforces limits during iteration rather than after
materializing members.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@modelaudit/scanners/tar_scanner.py`:
- Around line 236-304: The current code calls tar.getmembers() (in
tarfile.open(...)) which materializes the entire member list and can OOM on TAR
bombs; instead iterate the archive incrementally using tar.next() (or
tar.__iter__()) and perform the checks as you stream members: maintain a running
entry_count and running total_member_size (use/adjust _estimate_tar_stream_size
to compute per-member size or compute from TarInfo.size), abort and add the
existing "Entry Count Limit Check" or "Compressed Wrapper Decompression Limits"
results as soon as entry_count > self.max_entries, total_member_size >
self.max_decompressed_bytes, or (total_member_size / compressed_size) >
self.max_decompression_ratio, and return immediately; remove the getmembers()
usage and any code paths that assume a full members list so the scanner enforces
limits during iteration rather than after materializing members.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: de7eb2a9-c741-4810-ad68-e32230253c67

📥 Commits

Reviewing files that changed from the base of the PR and between 712e4b4 and 31faa50.

📒 Files selected for processing (2)
  • modelaudit/scanners/tar_scanner.py
  • tests/scanners/test_tar_scanner.py
