fix: enforce decompression limits for compressed tar wrappers #709
Conversation

No actionable comments were generated in the recent review. 🎉
Walkthrough

Adds compressed-TAR detection and preflight decompression-safety checks: new module defaults and per-instance thresholds, magic-byte detection for gzip/bzip2/xz wrappers, streamed estimation of decompressed TAR size, and short-circuiting scan results when decompressed-size or decompression-ratio limits are exceeded.
Sequence Diagram

```mermaid
sequenceDiagram
    actor Caller
    participant Scanner as TarScanner
    participant File as Compressed TAR File
    participant Detector as Wrapper Detector
    participant Streamer as TAR Streamer/Reader
    participant Validator as Decompression Validator
    Caller->>Scanner: scan(path)
    Scanner->>File: read header bytes / open file
    Scanner->>Detector: _detect_compressed_tar_wrapper(path/header)
    alt compressed wrapper detected
        Scanner->>File: open compressed stream
        File->>Streamer: stream TAR headers/records
        Streamer->>Scanner: consumed_size, entry_count (streamed)
        Scanner->>Validator: _finalize_tar_stream_size(consumed_size)
        Validator-->>Scanner: estimated_decompressed_size
        Scanner->>Validator: compute ratio = decompressed / compressed_size
        rect rgba(255, 99, 71, 0.5)
            Validator->>Scanner: compare to max_decompressed_bytes & max_decompression_ratio
            alt any limit exceeded
                Validator-->>Scanner: limit exceeded (record check)
                Scanner->>Caller: return short-circuit FAILED result with check
            else within limits
                Validator-->>Scanner: limits OK
                Scanner->>Streamer: continue full extraction/scan per-entry
                Scanner->>Caller: return full scan results
            end
        end
    else no compressed wrapper
        Scanner->>Streamer: standard TAR streaming and per-entry scanning
        Scanner->>Caller: return scan results
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/tar_scanner.py`:
- Line 210: The current total_member_size calculation only sums member.size
(visible payload) and misses TAR overhead; update the logic in tar_scanner.py
(replace the sum(...) that sets total_member_size) to compute the full TAR
stream size by iterating members and for each member adding a 512-byte header,
plus the file data rounded up to 512-byte blocks ((member.size + 511) // 512 *
512) for regular files (use member.isfile()), and for non-file entries count
just the 512-byte header; after the loop add the archive EOF trailer of 1024
bytes (two 512-byte blocks) to the total to get the correct wrapper budget.
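The block accounting described above can be sketched as a small helper; the function name is illustrative (the PR's own code sets `total_member_size` inline), but the 512-byte header, block-padded payload, and 1024-byte EOF trailer match the on-disk TAR framing:

```python
import tarfile


def estimate_tar_stream_size(members: list[tarfile.TarInfo]) -> int:
    """Estimate the size of the uncompressed TAR stream for these members.

    Every entry contributes one 512-byte header block; regular-file payloads
    are rounded up to whole 512-byte blocks; the archive ends with a
    1024-byte EOF trailer (two zero blocks).
    """
    total = 0
    for member in members:
        total += 512  # header block for every entry type
        if member.isfile():
            total += ((member.size + 511) // 512) * 512  # padded payload
    return total + 1024  # end-of-archive trailer
```

Note this estimates the logical TAR stream; writers such as Python's `tarfile` additionally pad finished archives to 10240-byte records, so the figure is a lower bound on the on-disk size of an uncompressed archive.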
- Around line 232-234: The decompression checks rely on filename suffix
(COMPRESSED_TAR_EXTENSIONS) but tarfile.open(..., "r:*") at the earlier tar
extraction step will transparently decompress payloads regardless of name;
update the logic to detect compression from file content instead of
lower_path.endswith by probing the file header bytes (e.g., check for gzip
0x1f8b, bzip2 "BZh", xz "\xfd7zXZ", lzma signatures) before running the
decompression-ratio and size checks so renamed compressed archives are treated
the same; locate the code around tarfile.open(...) (used at line ~208) and the
variables compressed_size, lower_path and the COMPRESSED_TAR_EXTENSIONS check
and replace the suffix check with a small helper that reads the first few bytes
of the path/fileobj to set an is_compressed flag and gate the decompression
validations on that flag.
In `@tests/scanners/test_tar_scanner.py`:
- Around line 515-549: Parameterize the failing tests
(test_scan_tar_gz_enforces_decompression_ratio_limit and
test_scan_tar_gz_enforces_decompressed_size_limit) over all supported
compressed-TAR suffixes (e.g., ".tar.gz", ".tar.bz2", ".tar.xz") and create an
additional within-limit positive case that asserts CheckStatus.PASSED; for each
suffix produce the archive with the matching tarfile mode (w:gz, w:bz2, w:xz),
instantiate TarScanner with the same config keys
(compressed_max_decompression_ratio and compressed_max_decompressed_bytes), run
scanner.scan, and assert the check with name "Compressed Wrapper Decompression
Limits" fails for the malicious inputs and passes for the benign input. Ensure
you reuse TarScanner, result.checks and CheckStatus in the assertions so both
fail and pass paths are covered for every compressed wrapper format.
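A self-contained sketch of that parametrization; it exercises only the archive construction and ratio math, since wiring in the project's `TarScanner` and its config keys depends on code not shown here:

```python
import io
import tarfile

import pytest


@pytest.mark.parametrize(
    ("suffix", "mode"),
    [(".tar.gz", "w:gz"), (".tar.bz2", "w:bz2"), (".tar.xz", "w:xz")],
)
def test_wrapper_ratio_exceeds_limit(tmp_path, suffix, mode):
    # Build a highly compressible archive for each supported wrapper.
    archive = tmp_path / f"bomb{suffix}"
    with tarfile.open(archive, mode) as tar:
        info = tarfile.TarInfo("zeros.bin")
        info.size = 1 << 20  # 1 MiB of zeros compresses dramatically
        tar.addfile(info, io.BytesIO(b"\x00" * info.size))
    # The decompressed payload dwarfs the compressed file for every codec.
    with tarfile.open(archive, "r:*") as tar:
        decompressed = sum(m.size for m in tar.getmembers())
    ratio = decompressed / archive.stat().st_size
    assert ratio > 10
```

In the real test, the assertion would instead check that the "Compressed Wrapper Decompression Limits" check fails for this input and passes for a benign, low-ratio archive.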
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d203050f-15d4-4688-932a-e7f780a05731
📒 Files selected for processing (2)
modelaudit/scanners/tar_scanner.py
tests/scanners/test_tar_scanner.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelaudit/scanners/tar_scanner.py (1)
236-304: ⚠️ Potential issue | 🟠 Major

Enforce decompression limits during member iteration, not after `getmembers()`.

Line 237's `getmembers()` fully enumerates and loads all TAR members into memory before the entry count check at line 238. This causes immediate resource exhaustion on TAR bombs: the damage occurs inside `getmembers()` before the check can abort. The decompression limit checks (lines 264-303) execute even later, after the members list is already materialized.

Refactor to use `tar.next()` for incremental member iteration: checks should execute during iteration, aborting before substantial resource consumption. Ensure testing includes malicious TAR bomb samples alongside benign archives to verify the fix prevents DoS.

🔧 Suggested refactor direction
```diff
 with tarfile.open(path, "r:*") as tar:
-    members = tar.getmembers()
+    members: list[tarfile.TarInfo] = []
+    compression_codec = self._detect_compressed_tar_wrapper(path)
+    compressed_size = os.path.getsize(path)
+    total_member_size = 1024 if compression_codec is not None else 0
+
+    while True:
+        member = tar.next()
+        if member is None:
+            break
+        members.append(member)
+        if len(members) > self.max_entries:
+            # existing Entry Count Limit Check failure path
+            return result
+        if compression_codec is not None:
+            total_member_size += 512
+            if member.isfile():
+                total_member_size += ((member.size + 511) // 512) * 512
+            if total_member_size > self.max_decompressed_bytes:
+                # existing Compressed Wrapper Decompression Limits failure path
+                return result
+            if compressed_size > 0 and (total_member_size / compressed_size) > self.max_decompression_ratio:
+                # existing Compressed Wrapper Decompression Limits failure path
+                return result
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/tar_scanner.py` around lines 236 - 304, The current code calls tar.getmembers() (in tarfile.open(...)) which materializes the entire member list and can OOM on TAR bombs; instead iterate the archive incrementally using tar.next() (or tar.__iter__()) and perform the checks as you stream members: maintain a running entry_count and running total_member_size (use/adjust _estimate_tar_stream_size to compute per-member size or compute from TarInfo.size), abort and add the existing "Entry Count Limit Check" or "Compressed Wrapper Decompression Limits" results as soon as entry_count > self.max_entries, total_member_size > self.max_decompressed_bytes, or (total_member_size / compressed_size) > self.max_decompression_ratio, and return immediately; remove the getmembers() usage and any code paths that assume a full members list so the scanner enforces limits during iteration rather than after materializing members.
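A standalone sketch of that streaming loop using only the standard library; the function name, return shape, and threshold defaults are illustrative, not the project's actual API:

```python
import os
import tarfile


def scan_with_limits(
    path: str,
    max_entries: int = 10_000,
    max_decompressed_bytes: int = 1 << 30,
    max_ratio: float = 100.0,
) -> tuple[bool, str]:
    """Stream TAR members via tar.next(), aborting as soon as a limit trips."""
    compressed_size = os.path.getsize(path)
    total = 1024  # budget for the end-of-archive trailer
    count = 0
    with tarfile.open(path, "r:*") as tar:
        while True:
            member = tar.next()  # reads one header at a time
            if member is None:
                break
            count += 1
            if count > max_entries:
                return False, "entry count limit exceeded"
            total += 512  # header block
            if member.isfile():
                total += ((member.size + 511) // 512) * 512  # padded payload
            if total > max_decompressed_bytes:
                return False, "decompressed size limit exceeded"
            if compressed_size > 0 and total / compressed_size > max_ratio:
                return False, "decompression ratio limit exceeded"
    return True, "ok"
```

Because the checks run inside the loop, a TAR bomb is rejected after its first oversized member's header is read, before the member list is ever materialized.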
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: de7eb2a9-c741-4810-ad68-e32230253c67
📒 Files selected for processing (2)
modelaudit/scanners/tar_scanner.py
tests/scanners/test_tar_scanner.py
Motivation
- Compressed TAR archives (`.tar.gz`, `.tar.bz2`, `.tar.xz`) can be handled directly by `TarScanner`, bypassing the `CompressedScanner` wrapper-level decompression size/ratio guardrails and enabling potential compression-bomb DoS during scanning.
- Keeps existing `TarScanner` behavior and per-entry limits intact.

Description
- Hardens `TarScanner` by introducing `DEFAULT_MAX_DECOMPRESSED_BYTES`, `DEFAULT_MAX_DECOMPRESSION_RATIO`, and `COMPRESSED_TAR_EXTENSIONS`, and by reading `compressed_max_decompressed_bytes` and `compressed_max_decompression_ratio` from the scanner config (`modelaudit/scanners/tar_scanner.py`).
- Computes `total_member_size` and `compressed_size` for compressed-TAR inputs, emits a `Compressed Wrapper Decompression Limits` check, and short-circuits the scan when limits are exceeded.
- Reuses the `compressed_*` config keys so operator settings remain consistent.
- Adds tests asserting `CheckStatus` in `tests/scanners/test_tar_scanner.py` to validate failed checks.

Testing
- Ran `uv run pytest tests/scanners/test_tar_scanner.py -k 'decompression_ratio_limit or decompressed_size_limit'` and both new checks passed.
- Ran `uv run pytest tests/scanners/test_compressed_scanner.py -k 'compound_tar_wrappers'` and it passed.
- Ran `uv run ruff format`, `uv run ruff check --fix`, and `uv run mypy modelaudit/`, which all passed.
- Ran the broader suite (`uv run pytest -n auto -m "not slow and not integration"`), which completed but stopped on a pre-existing, unrelated failure, `tests/utils/helpers/test_secure_hasher.py::TestErrorHandling::test_hash_permission_denied`, not caused by these changes.

Codex Task
Summary by CodeRabbit
New Features
Tests