fix(telemetry): preserve model refs while stripping secrets#717
fix(telemetry): preserve model refs while stripping secrets#717
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughTelemetry payloads were changed to stop emitting raw file paths, URLs, and hashed identifiers. New helpers assemble coarse properties (e.g., path_type, file_extension, domain, model_name). Telemetry recorders and tests were updated to use the new privacy-preserving payload shape. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/telemetry.py`:
- Around line 544-551: The code coerces issue.get("location", "") to str,
turning None into the string "None" and causing _hash_identifier to produce
synthetic IDs; change issue_location to keep the original value (e.g.,
issue_location = issue.get("location")) and only convert to string or call
_hash_identifier / _extract_model_name / _classify_path when the value is
non-null/non-empty (e.g., if issue_location: location_str = str(issue_location);
use location_str for classification but call _hash_identifier(location_str) only
when issue_location is truthy, otherwise set location_identifier/model_name to
None). Apply the same fix to the other block around the references to
_hash_identifier/_classify_path/_extract_model_name noted at 572-576.
In `@tests/test_telemetry.py`:
- Around line 370-372: Add a return type annotation "-> None" to the test method
signature for test_path_and_url_fields_use_hashed_identifiers_only so the method
reads "def test_path_and_url_fields_use_hashed_identifiers_only(self) -> None:";
update only the function signature to include the annotation to satisfy the
repository typing rules for tests.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6352dee5-7c86-42e4-86fb-b473b01282c1
📒 Files selected for processing (3)
CHANGELOG.mdmodelaudit/telemetry.pytests/test_telemetry.py
tests/test_telemetry.py
Outdated
| def test_path_and_url_fields_use_hashed_identifiers_only(self): | ||
| """Path and URL fields should not include raw sensitive values.""" | ||
| mock_posthog = MagicMock() |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Add -> None return annotation on this test method.
This keeps test typing consistent with repository rules.
✏️ Proposed fix
- def test_path_and_url_fields_use_hashed_identifiers_only(self):
+ def test_path_and_url_fields_use_hashed_identifiers_only(self) -> None:As per coding guidelines: "tests/**/*.py: Use type hints -> None on all test methods and tmp_path: Path / monkeypatch: pytest.MonkeyPatch on test parameters".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_telemetry.py` around lines 370 - 372, Add a return type annotation
"-> None" to the test method signature for
test_path_and_url_fields_use_hashed_identifiers_only so the method reads "def
test_path_and_url_fields_use_hashed_identifiers_only(self) -> None:"; update
only the function signature to include the annotation to satisfy the repository
typing rules for tests.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_telemetry.py`:
- Around line 273-274: The test method
test_scan_completed_uses_top_level_results_schema is missing a return type
annotation; update its definition to include "-> None" (i.e., change the
function signature for test_scan_completed_uses_top_level_results_schema to
declare a None return type) to comply with the tests/**/*.py typing guideline.
- Around line 335-336: The test method definition
test_scan_started_includes_per_path_fields is missing a return type annotation;
update its signature to include the explicit return annotation "-> None" (i.e.,
change "def test_scan_started_includes_per_path_fields(self):" to "def
test_scan_started_includes_per_path_fields(self) -> None:") so the test follows
the project's typing guideline for tests.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c05982f2-472b-459c-82de-24148c75fb41
📒 Files selected for processing (3)
CHANGELOG.mdmodelaudit/telemetry.pytests/test_telemetry.py
# Conflicts: # CHANGELOG.md
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CHANGELOG.md`:
- Line 84: Update the CHANGELOG entry that currently reads "- **telemetry:**
simplify telemetry payloads to omit file path identifiers" to accurately reflect
the implementation: if the code replaces paths with hashed values, change the
wording to "hash file path identifiers in telemetry payloads" or "replace raw
file path identifiers with hashed values"; if the code actually removes them,
keep "omit file path identifiers"; ensure the changelog text matches the
implementation and mentions "telemetry" and "file path identifiers" so it's easy
to locate and verify.
Resolve the changelog conflict, keep missing issue locations out of telemetry detail fields, and apply the touched test annotations from review. Co-authored-by: Codex <noreply@openai.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_telemetry.py (1)
292-307: 🛠️ Refactor suggestion | 🟠 MajorUse
tmp_pathfixtures instead of hard-coded host paths.These regressions bake in
/tmp/...and/home/user/...literals, which keeps the local-path expectations non-deterministic and violates the test-fixture rule. Build the local samples undertmp_path, passstr(tmp_path / "..."), and type the added parameter astmp_path: Path.As per coding guidelines:
tests/**/*.py: Use deterministic fixtures only in tests; never reference host paths like/etc/passwd; create all targets undertmp_path; use type hintstmp_path: Pathon test parameters.Also applies to: 366-367, 398-399
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_telemetry.py` around lines 292 - 307, The test embeds host-specific paths in the `results` dict (assets paths like "/tmp/a.pkl", "/tmp/b.zip" and issue locations), which should be replaced with deterministic tmp_path fixtures; add a `tmp_path: Path` parameter to the test signature, construct file paths using `str(tmp_path / "a.pkl")`, `str(tmp_path / "b.zip")` (and any other occurrences noted around the test, e.g., the similar literals at the other occurrences), and update both the "assets" entries and "issues" location fields to use those tmp_path-based strings so the test no longer references hard-coded host paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/telemetry.py`:
- Around line 377-383: The _build_url_properties function currently uses helpers
(_extract_domain and _extract_model_name) that return parsed.netloc, which can
include userinfo and leak credentials; change these helpers to derive
domain/model_name values from parsed.hostname and include parsed.port only if
present (i.e., reconstruct host[:port] from parsed.hostname and parsed.port
instead of using parsed.netloc), and update _build_url_properties to use the
revised helpers; also add a regression test that passes a credential-bearing URL
(e.g., https://user:pass@example.com/...) and asserts that neither domain nor
model_name contain userinfo.
---
Outside diff comments:
In `@tests/test_telemetry.py`:
- Around line 292-307: The test embeds host-specific paths in the `results` dict
(assets paths like "/tmp/a.pkl", "/tmp/b.zip" and issue locations), which should
be replaced with deterministic tmp_path fixtures; add a `tmp_path: Path`
parameter to the test signature, construct file paths using `str(tmp_path /
"a.pkl")`, `str(tmp_path / "b.zip")` (and any other occurrences noted around the
test, e.g., the similar literals at the other occurrences), and update both the
"assets" entries and "issues" location fields to use those tmp_path-based
strings so the test no longer references hard-coded host paths.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8abcee8c-3540-41c4-b9e9-e5f4dab74f9d
📒 Files selected for processing (3)
CHANGELOG.mdmodelaudit/telemetry.pytests/test_telemetry.py
Resolve changelog conflict and scrub credential-bearing URL hosts from telemetry. Co-authored-by: Codex <noreply@openai.com>
|
Superseded by the corrected comment below. |
|
Addressed the outstanding review feedback in 2c945ea:
Local QA:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/telemetry.py`:
- Around line 369-375: The telemetry still emits raw identifiers—specifically
the extracted model_name and domain—so update the telemetry builders to avoid
leaking that data: in _build_path_properties (which calls _extract_model_name
and _classify_path) stop returning the raw "model_name" (either drop the key
entirely or replace it with a deterministic non-reversible hash like SHA256 of
the extracted name) and ensure any place that extracts a host/domain from URLs
returns a hashed or omitted "domain" instead of the literal host; likewise
update record_scan_started to not emit raw model_names but to emit the hashed
versions (or omit the field) by mapping each entry through the same hashing
utility to keep telemetry consistent across _build_path_properties and the code
paths referenced around the record_scan_started/model_names usage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 105f2e41-1ca3-4dc4-8b13-cfc3b51940e9
📒 Files selected for processing (3)
CHANGELOG.mdmodelaudit/telemetry.pytests/test_telemetry.py
| def _build_path_properties(self, path: str) -> dict[str, Any]: | ||
| """Build coarse telemetry properties for a file path.""" | ||
| return { | ||
| "path_type": self._classify_path(path), | ||
| "file_extension": Path(path).suffix.lower(), | ||
| "model_name": self._extract_model_name(path), | ||
| } |
There was a problem hiding this comment.
Drop or hash model_name/domain as well.
These payloads still send raw identifiers even though file_path/url are gone: model_name is the literal basename / repo slug / URL leaf, and domain is the literal host. That still leaks user- or infra-specific data from paths and URLs (for example, local artifact names or internal download hosts), so the privacy fix is incomplete. The same treatment is also needed for record_scan_started()'s model_names field.
✏️ Proposed fix
def _build_path_properties(self, path: str) -> dict[str, Any]:
"""Build coarse telemetry properties for a file path."""
return {
"path_type": self._classify_path(path),
"file_extension": Path(path).suffix.lower(),
- "model_name": self._extract_model_name(path),
}
def _build_url_properties(self, url: str) -> dict[str, Any]:
"""Build coarse telemetry properties for a download URL."""
return {
- "domain": self._extract_domain(url),
"path_type": self._classify_path(url),
- "model_name": self._extract_model_name(url),
}
@@
issue_details.append(
{
"type": issue_type,
"severity": severity,
"location_type": self._classify_path(issue_location) if issue_location else "unknown",
- "model_name": self._extract_model_name(issue_location) if issue_location else None,
}
)Also applies to: 377-383, 542-549
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelaudit/telemetry.py` around lines 369 - 375, The telemetry still emits
raw identifiers—specifically the extracted model_name and domain—so update the
telemetry builders to avoid leaking that data: in _build_path_properties (which
calls _extract_model_name and _classify_path) stop returning the raw
"model_name" (either drop the key entirely or replace it with a deterministic
non-reversible hash like SHA256 of the extracted name) and ensure any place that
extracts a host/domain from URLs returns a hashed or omitted "domain" instead of
the literal host; likewise update record_scan_started to not emit raw
model_names but to emit the hashed versions (or omit the field) by mapping each
entry through the same hashing utility to keep telemetry consistent across
_build_path_properties and the code paths referenced around the
record_scan_started/model_names usage.
Keep secret-stripped model references in telemetry so scanned remote models remain identifiable while credentials, query strings, fragments, and local directory paths stay out of payloads. Co-authored-by: Codex <noreply@openai.com>
Use hashed identifiers for telemetry file and URL fields instead of raw values.
Update the telemetry regression tests and changelog wording.
Summary by CodeRabbit
Bug Fixes
Tests
Documentation