fix: enforce decompression limits for compressed tar wrappers#709

Open
mldangelo wants to merge 5 commits into main from codex/fix-vulnerability-in-tar-scanner-routing

Conversation


@mldangelo mldangelo commented Mar 16, 2026

Motivation

  • Recent routing changes caused compound compressed TAR filenames (e.g. .tar.gz, .tar.bz2, .tar.xz) to be handled directly by TarScanner, bypassing the CompressedScanner wrapper-level decompression size/ratio guardrails and enabling potential compression-bomb DoS during scanning.
  • The change restores wrapper-level protections for compressed TAR wrappers while keeping TarScanner behavior and per-entry limits intact.

Description

  • Added wrapper-level decompression limits and configuration support to TarScanner by introducing DEFAULT_MAX_DECOMPRESSED_BYTES, DEFAULT_MAX_DECOMPRESSION_RATIO, and COMPRESSED_TAR_EXTENSIONS, and by reading compressed_max_decompressed_bytes and compressed_max_decompression_ratio from the scanner config (modelaudit/scanners/tar_scanner.py).
  • Implemented pre-extraction checks that compute total_member_size and compressed_size for compressed-TAR inputs and emit a Compressed Wrapper Decompression Limits check and short-circuit the scan when limits are exceeded.
  • Kept existing per-entry extraction limits and nested-TAR scanning logic intact and re-used the existing compressed_* config keys so operator settings remain consistent.
  • Added regression tests to assert compressed-TAR ratio and decompressed-size enforcement and imported CheckStatus in tests/scanners/test_tar_scanner.py to validate failed checks.
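
The pre-extraction check described above can be sketched as follows. This is a minimal illustration based only on the PR description: the constant values, the suffix-based gating, and the tuple return shape are assumptions; the actual scanner in modelaudit/scanners/tar_scanner.py emits a "Compressed Wrapper Decompression Limits" check on the scan result rather than returning a tuple.

```python
import os
import tarfile

# Hypothetical defaults mirroring the names in the PR description;
# the real values in tar_scanner.py may differ.
DEFAULT_MAX_DECOMPRESSED_BYTES = 1 * 1024 * 1024 * 1024  # 1 GiB
DEFAULT_MAX_DECOMPRESSION_RATIO = 100.0
COMPRESSED_TAR_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz")


def preflight_compressed_tar(
    path: str,
    max_bytes: int = DEFAULT_MAX_DECOMPRESSED_BYTES,
    max_ratio: float = DEFAULT_MAX_DECOMPRESSION_RATIO,
) -> tuple[bool, str]:
    """Compare projected decompressed size against wrapper-level limits.

    Sketch only: returns (ok, reason) instead of recording a scan check.
    """
    if not path.lower().endswith(COMPRESSED_TAR_EXTENSIONS):
        return True, "not a compressed tar wrapper"
    compressed_size = os.path.getsize(path)
    with tarfile.open(path, "r:*") as tar:
        total_member_size = sum(m.size for m in tar.getmembers())
    if total_member_size > max_bytes:
        return False, f"decompressed size {total_member_size} exceeds {max_bytes}"
    if compressed_size > 0 and total_member_size / compressed_size > max_ratio:
        ratio = total_member_size / compressed_size
        return False, f"decompression ratio {ratio:.1f} exceeds {max_ratio}"
    return True, "within limits"
```

A highly compressible payload (e.g. a megabyte of zeros gzipped down to a kilobyte or so) trips the ratio check, which is exactly the compression-bomb shape the guardrail targets.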

Testing

  • Ran targeted TAR tests: uv run pytest tests/scanners/test_tar_scanner.py -k 'decompression_ratio_limit or decompressed_size_limit' and both new checks passed.
  • Verified related compressed-wrapper routing tests: uv run pytest tests/scanners/test_compressed_scanner.py -k 'compound_tar_wrappers' and they passed.
  • Ran code quality and typing checks: uv run ruff format, uv run ruff check --fix, and uv run mypy modelaudit/ which all passed.
  • Ran the full non-integration unit suite (uv run pytest -n auto -m "not slow and not integration"); it completed but stopped on a pre-existing, unrelated failure in tests/utils/helpers/test_secure_hasher.py::TestErrorHandling::test_hash_permission_denied, which is not caused by these changes.

Codex Task

Summary by CodeRabbit

  • New Features

    • Enhanced TAR scanning: detects compressed TAR wrappers, preflight-streams archives, and enforces configurable decompressed-size and decompression-ratio limits; improves format detection to recognize TAR content even when disguised.
    • Core routing updated to prefer the TAR scanner when TAR content is detected.
  • Tests

    • Added tests for decompression-ratio and decompressed-size enforcement, padding handling, content-based detection of compressed TARs, and end-to-end routing of disguised compressed TARs.


coderabbitai bot commented Mar 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b952a1bf-541d-4a48-a041-5aeb23d19c40

📥 Commits

Reviewing files that changed from the base of the PR and between 31faa50 and 099d45c.

📒 Files selected for processing (5)
  • modelaudit/core.py
  • modelaudit/scanners/tar_scanner.py
  • modelaudit/utils/file/detection.py
  • tests/scanners/test_tar_scanner.py
  • tests/utils/file/test_filetype.py

Walkthrough

Adds compressed-TAR detection and preflight decompression-safety checks: new module defaults and per-instance thresholds, magic-byte detection for gzip/bzip2/xz wrappers, streamed estimation of decompressed TAR size, and short-circuiting scan results when decompressed-size or decompression-ratio limits are exceeded.
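
The streamed estimation mentioned above might look like the following sketch. The function name and return shape are hypothetical (the real _preflight_tar_archive integrates with the scanner's check machinery); it walks members incrementally with tar.next() so limits are enforced without materializing the full member list:

```python
import os
import tarfile


def stream_estimate(path: str, max_bytes: int, max_ratio: float) -> bool:
    """Incrementally accumulate the estimated uncompressed TAR stream size,
    aborting as soon as a size or ratio limit is crossed.

    Sketch only; names and return value are illustrative, not the
    actual modelaudit code.
    """
    compressed_size = os.path.getsize(path)
    total = 1024  # end-of-archive trailer: two zero-filled 512-byte blocks
    with tarfile.open(path, "r:*") as tar:
        while (member := tar.next()) is not None:
            total += 512  # every member carries a 512-byte header block
            if member.isfile():
                # file data is padded up to full 512-byte blocks
                total += ((member.size + 511) // 512) * 512
            if total > max_bytes:
                return False
            if compressed_size > 0 and total / compressed_size > max_ratio:
                return False
    return True
```

Because the checks run inside the loop, a TAR bomb is rejected after the first offending member instead of after a full member enumeration.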

Changes

Cohort / File(s) Summary
TAR Scanner core
modelaudit/scanners/tar_scanner.py
Adds DEFAULT_MAX_DECOMPRESSED_BYTES and DEFAULT_MAX_DECOMPRESSION_RATIO; per-instance max_decompressed_bytes/max_decompression_ratio; magic headers (_GZIP_MAGIC, _BZIP2_MAGIC, _XZ_MAGIC); _detect_compressed_tar_wrapper, _finalize_tar_stream_size, _add_compressed_wrapper_limit_check, and _preflight_tar_archive; integrates preflight checks into _scan_tar_file to short-circuit on limit failures.
File format detection util
modelaudit/utils/file/detection.py
Adds _is_tar_archive(path) and integrates TAR detection into detect_file_format so TAR (including compressed TARs) can be identified by content earlier in detection flow.
Core scanner mapping
modelaudit/core.py
Adds format-to-scanner mapping entry for "tar" to route TAR-formatted archives to the tar scanner.
TarScanner tests
tests/scanners/test_tar_scanner.py
Adds numerous tests for compressed-TAR wrappers: enforcement of decompression-ratio and decompressed-size limits across gzip/bzip2/xz; padding-accounting; streaming-based preflight; detection-by-content vs suffix; and integration with core.scan_file. Updates imports to include Literal, CheckStatus, and core.
File-detection tests
tests/utils/file/test_filetype.py
Adds test_detect_file_format_disguised_compressed_tar_by_content to ensure content-based detection classifies a disguised compressed TAR correctly and reports outer compression codec.

Sequence Diagram

sequenceDiagram
    actor Caller
    participant Scanner as TarScanner
    participant File as Compressed TAR File
    participant Detector as Wrapper Detector
    participant Streamer as TAR Streamer/Reader
    participant Validator as Decompression Validator

    Caller->>Scanner: scan(path)
    Scanner->>File: read header bytes / open file
    Scanner->>Detector: _detect_compressed_tar_wrapper(path/header)
    alt compressed wrapper detected
        Scanner->>File: open compressed stream
        File->>Streamer: stream TAR headers/records
        Streamer->>Scanner: consumed_size, entry_count (streamed)
        Scanner->>Validator: _finalize_tar_stream_size(consumed_size)
        Validator-->>Scanner: estimated_decompressed_size
        Scanner->>Validator: compute ratio = decompressed / compressed_size
        rect rgba(255, 99, 71, 0.5)
        Validator->>Scanner: compare to max_decompressed_bytes & max_decompression_ratio
        alt any limit exceeded
            Validator-->>Scanner: limit exceeded (record check)
            Scanner->>Caller: return short-circuit FAILED result with check
        else within limits
            Validator-->>Scanner: limits OK
            Scanner->>Streamer: continue full extraction/scan per-entry
            Scanner->>Caller: return full scan results
        end
        end
    else no compressed wrapper
        Scanner->>Streamer: standard TAR streaming and per-entry scanning
        Scanner->>Caller: return scan results
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nudge the tar, I sniff the wrap,

Gzip, BZ2, XZ — I mind the gap.
I stream the headers, tally the size,
I flag the bloat before it flies.
A cautious hop keeps extractions wise.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 76.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title clearly and accurately describes the main change, enforcing decompression limits for compressed TAR wrappers, which directly addresses the vulnerability described in the PR objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/tar_scanner.py`:
- Line 210: The current total_member_size calculation only sums member.size
(visible payload) and misses TAR overhead; update the logic in tar_scanner.py
(replace the sum(...) that sets total_member_size) to compute the full TAR
stream size by iterating members and for each member adding a 512-byte header,
plus the file data rounded up to 512-byte blocks ((member.size + 511) // 512 *
512) for regular files (use member.isfile()), and for non-file entries count
just the 512-byte header; after the loop add the archive EOF trailer of 1024
bytes (two 512-byte blocks) to the total to get the correct wrapper budget.
- Around line 232-234: The decompression checks rely on filename suffix
(COMPRESSED_TAR_EXTENSIONS) but tarfile.open(..., "r:*") at the earlier tar
extraction step will transparently decompress payloads regardless of name;
update the logic to detect compression from file content instead of
lower_path.endswith by probing the file header bytes (e.g., check for gzip
0x1f8b, bzip2 "BZh", xz "\xfd7zXZ", lzma signatures) before running the
decompression-ratio and size checks so renamed compressed archives are treated
the same; locate the code around tarfile.open(...) (used at line ~208) and the
variables compressed_size, lower_path and the COMPRESSED_TAR_EXTENSIONS check
and replace the suffix check with a small helper that reads the first few bytes
of the path/fileobj to set an is_compressed flag and gate the decompression
validations on that flag.

In `@tests/scanners/test_tar_scanner.py`:
- Around line 515-549: Parameterize the failing tests
(test_scan_tar_gz_enforces_decompression_ratio_limit and
test_scan_tar_gz_enforces_decompressed_size_limit) over all supported
compressed-TAR suffixes (e.g., ".tar.gz", ".tar.bz2", ".tar.xz") and create an
additional within-limit positive case that asserts CheckStatus.PASSED; for each
suffix produce the archive with the matching tarfile mode (w:gz, w:bz2, w:xz),
instantiate TarScanner with the same config keys
(compressed_max_decompression_ratio and compressed_max_decompressed_bytes), run
scanner.scan, and assert the check with name "Compressed Wrapper Decompression
Limits" fails for the malicious inputs and passes for the benign input. Ensure
you reuse TarScanner, result.checks and CheckStatus in the assertions so both
fail and pass paths are covered for every compressed wrapper format.
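
A sketch of the suggested parameterization, using stand-alone helpers to build per-suffix archives and measure their decompression ratio. TarScanner itself is not imported here, and names like make_archive are illustrative:

```python
import io
import os
import tarfile

# Suffix-to-tarfile-mode mapping per the reviewer's suggestion.
SUFFIX_MODES = {".tar.gz": "w:gz", ".tar.bz2": "w:bz2", ".tar.xz": "w:xz"}


def make_archive(directory: str, stem: str, suffix: str, payload: bytes) -> str:
    """Create a single-member compressed TAR using the mode matching the suffix."""
    path = os.path.join(directory, stem + suffix)
    with tarfile.open(path, SUFFIX_MODES[suffix]) as tar:
        info = tarfile.TarInfo("payload.bin")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return path


def decompression_ratio(path: str) -> float:
    """Total member payload bytes divided by on-disk compressed size."""
    compressed = os.path.getsize(path)
    with tarfile.open(path, "r:*") as tar:
        total = sum(m.size for m in tar.getmembers())
    return total / compressed if compressed else 0.0
```

In a pytest module these helpers could back an @pytest.mark.parametrize over SUFFIX_MODES, asserting that the "Compressed Wrapper Decompression Limits" check fails for a zero-filled bomb payload and passes (CheckStatus.PASSED) for a small incompressible one, covering both paths for every wrapper format.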

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d203050f-15d4-4688-932a-e7f780a05731

📥 Commits

Reviewing files that changed from the base of the PR and between d9fe283 and 712e4b4.

📒 Files selected for processing (2)
  • modelaudit/scanners/tar_scanner.py
  • tests/scanners/test_tar_scanner.py


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelaudit/scanners/tar_scanner.py (1)

236-304: ⚠️ Potential issue | 🟠 Major

Enforce decompression limits during member iteration, not after getmembers().

Line 237's getmembers() fully enumerates and loads all TAR members into memory before the entry count check at line 238. This causes immediate resource exhaustion on TAR bombs—the damage occurs inside getmembers() before the check can abort. The decompression limit checks (lines 264–303) execute even later, after the members list is already materialized.

Refactor to use tar.next() for incremental member iteration: checks should execute during iteration, aborting before substantial resource consumption. Ensure testing includes malicious TAR bomb samples alongside benign archives to verify the fix prevents DoS.

🔧 Suggested refactor direction
 with tarfile.open(path, "r:*") as tar:
-    members = tar.getmembers()
+    members: list[tarfile.TarInfo] = []
+    compression_codec = self._detect_compressed_tar_wrapper(path)
+    compressed_size = os.path.getsize(path)
+    total_member_size = 1024 if compression_codec is not None else 0
+
+    while True:
+        member = tar.next()
+        if member is None:
+            break
+        members.append(member)
+        if len(members) > self.max_entries:
+            # existing Entry Count Limit Check failure path
+            return result
+        if compression_codec is not None:
+            total_member_size += 512
+            if member.isfile():
+                total_member_size += ((member.size + 511) // 512) * 512
+            if total_member_size > self.max_decompressed_bytes:
+                # existing Compressed Wrapper Decompression Limits failure path
+                return result
+            if compressed_size > 0 and (total_member_size / compressed_size) > self.max_decompression_ratio:
+                # existing Compressed Wrapper Decompression Limits failure path
+                return result
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/tar_scanner.py` around lines 236 - 304, The current code
calls tar.getmembers() (in tarfile.open(...)) which materializes the entire
member list and can OOM on TAR bombs; instead iterate the archive incrementally
using tar.next() (or tar.__iter__()) and perform the checks as you stream
members: maintain a running entry_count and running total_member_size
(use/adjust _estimate_tar_stream_size to compute per-member size or compute from
TarInfo.size), abort and add the existing "Entry Count Limit Check" or
"Compressed Wrapper Decompression Limits" results as soon as entry_count >
self.max_entries, total_member_size > self.max_decompressed_bytes, or
(total_member_size / compressed_size) > self.max_decompression_ratio, and return
immediately; remove the getmembers() usage and any code paths that assume a full
members list so the scanner enforces limits during iteration rather than after
materializing members.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@modelaudit/scanners/tar_scanner.py`:
- Around line 236-304: The current code calls tar.getmembers() (in
tarfile.open(...)) which materializes the entire member list and can OOM on TAR
bombs; instead iterate the archive incrementally using tar.next() (or
tar.__iter__()) and perform the checks as you stream members: maintain a running
entry_count and running total_member_size (use/adjust _estimate_tar_stream_size
to compute per-member size or compute from TarInfo.size), abort and add the
existing "Entry Count Limit Check" or "Compressed Wrapper Decompression Limits"
results as soon as entry_count > self.max_entries, total_member_size >
self.max_decompressed_bytes, or (total_member_size / compressed_size) >
self.max_decompression_ratio, and return immediately; remove the getmembers()
usage and any code paths that assume a full members list so the scanner enforces
limits during iteration rather than after materializing members.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: de7eb2a9-c741-4810-ad68-e32230253c67

📥 Commits

Reviewing files that changed from the base of the PR and between 712e4b4 and 31faa50.

📒 Files selected for processing (2)
  • modelaudit/scanners/tar_scanner.py
  • tests/scanners/test_tar_scanner.py
