Skip to content

fix(telemetry): preserve model refs while stripping secrets#717

Open
typpo wants to merge 6 commits intomainfrom
codex/telemetry-file-hashes
Open

fix(telemetry): preserve model refs while stripping secrets#717
typpo wants to merge 6 commits intomainfrom
codex/telemetry-file-hashes

Conversation

@typpo
Copy link
Contributor

@typpo typpo commented Mar 16, 2026

Use hashed identifiers for telemetry file and URL fields instead of raw values.
Update the telemetry regression tests and changelog wording.

Summary by CodeRabbit

  • Bug Fixes

    • Telemetry now omits raw file paths, URLs, and direct location identifiers; events include only coarse path/URL metadata (type, extension, domain) and model name when resolvable, with unknown locations reported as such.
  • Tests

    • Tests updated to assert absence of raw path/URL values and identifiers, validate unknown-location behavior, and add JSON checks preventing sensitive leakage.
  • Documentation

    • Changelog updated to note the privacy-preserving telemetry change.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Telemetry payloads were changed to stop emitting raw file paths, URLs, and hashed identifiers. New helpers assemble coarse properties (e.g., path_type, file_extension, domain, model_name). Telemetry recorders and tests were updated to use the new privacy-preserving payload shape.

Changes

Cohort / File(s) Summary
Telemetry core
modelaudit/telemetry.py
Removed _hash_identifier() and hashlib usage. Added _build_path_properties(path) and _build_url_properties(url) and helpers (_extract_url_host). Refactored telemetry recorders to omit raw file_path/url and identifier fields; emit coarse fields like path_type, file_extension, domain, and model_name.
Tests
tests/test_telemetry.py
Updated tests to use tmp_path fixtures and to expect omission of raw paths/URLs and identifier fields. Adjusted assertions for scan-started/completed payloads, added an issue with location=None (expect location_type == "unknown", model_name is None), and added JSON checks to ensure no sensitive strings are serialized.
Changelog
CHANGELOG.md
Added Unreleased → Fixed entry noting telemetry no longer emits raw file path or URL identifiers.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I nibble paths and tuck them tight,
Raw routes hidden out of sight,
I hop and build a gentler trace,
Small fields remain, no names to chase,
A rabbit's hush—telemetry made light.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The PR title mentions 'preserve model refs' and 'strip secrets', but the actual changes remove hashing and raw identifiers altogether—not preserve/strip them in the traditional sense. Revise the title to better reflect the actual intent: e.g., 'fix(telemetry): omit raw file paths and URLs from payloads' or 'fix(telemetry): replace identifiers with coarse properties'.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/telemetry-file-hashes

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/telemetry.py`:
- Around line 544-551: The code coerces issue.get("location", "") to str,
turning None into the string "None" and causing _hash_identifier to produce
synthetic IDs; change issue_location to keep the original value (e.g.,
issue_location = issue.get("location")) and only convert to string or call
_hash_identifier / _extract_model_name / _classify_path when the value is
non-null/non-empty (e.g., if issue_location: location_str = str(issue_location);
use location_str for classification but call _hash_identifier(location_str) only
when issue_location is truthy, otherwise set location_identifier/model_name to
None). Apply the same fix to the other block around the references to
_hash_identifier/_classify_path/_extract_model_name noted at 572-576.

In `@tests/test_telemetry.py`:
- Around line 370-372: Add a return type annotation "-> None" to the test method
signature for test_path_and_url_fields_use_hashed_identifiers_only so the method
reads "def test_path_and_url_fields_use_hashed_identifiers_only(self) -> None:";
update only the function signature to include the annotation to satisfy the
repository typing rules for tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6352dee5-7c86-42e4-86fb-b473b01282c1

📥 Commits

Reviewing files that changed from the base of the PR and between d9fe283 and 26588e7.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • modelaudit/telemetry.py
  • tests/test_telemetry.py

Comment on lines 370 to 372
def test_path_and_url_fields_use_hashed_identifiers_only(self):
"""Path and URL fields should not include raw sensitive values."""
mock_posthog = MagicMock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Add -> None return annotation on this test method.

This keeps test typing consistent with repository rules.

✏️ Proposed fix
-    def test_path_and_url_fields_use_hashed_identifiers_only(self):
+    def test_path_and_url_fields_use_hashed_identifiers_only(self) -> None:

As per coding guidelines: "tests/**/*.py: Use type hints -> None on all test methods and tmp_path: Path / monkeypatch: pytest.MonkeyPatch on test parameters".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_telemetry.py` around lines 370 - 372, Add a return type annotation
"-> None" to the test method signature for
test_path_and_url_fields_use_hashed_identifiers_only so the method reads "def
test_path_and_url_fields_use_hashed_identifiers_only(self) -> None:"; update
only the function signature to include the annotation to satisfy the repository
typing rules for tests.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_telemetry.py`:
- Around line 273-274: The test method
test_scan_completed_uses_top_level_results_schema is missing a return type
annotation; update its definition to include "-> None" (i.e., change the
function signature for test_scan_completed_uses_top_level_results_schema to
declare a None return type) to comply with the tests/**/*.py typing guideline.
- Around line 335-336: The test method definition
test_scan_started_includes_per_path_fields is missing a return type annotation;
update its signature to include the explicit return annotation "-> None" (i.e.,
change "def test_scan_started_includes_per_path_fields(self):" to "def
test_scan_started_includes_per_path_fields(self) -> None:") so the test follows
the project's typing guideline for tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c05982f2-472b-459c-82de-24148c75fb41

📥 Commits

Reviewing files that changed from the base of the PR and between 26588e7 and 1d911a5.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • modelaudit/telemetry.py
  • tests/test_telemetry.py

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.md`:
- Line 84: Update the CHANGELOG entry that currently reads "- **telemetry:**
simplify telemetry payloads to omit file path identifiers" to accurately reflect
the implementation: if the code replaces paths with hashed values, change the
wording to "hash file path identifiers in telemetry payloads" or "replace raw
file path identifiers with hashed values"; if the code actually removes them,
keep "omit file path identifiers"; ensure the changelog text matches the
implementation and mentions "telemetry" and "file path identifiers" so it's easy
to locate and verify.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9743f77e-460c-423f-944e-a75319921bd0

📥 Commits

Reviewing files that changed from the base of the PR and between 1d911a5 and 8f68169.

📒 Files selected for processing (1)
  • CHANGELOG.md

Resolve the changelog conflict, keep missing issue locations out of telemetry detail fields, and apply the touched test annotations from review.

Co-authored-by: Codex <noreply@openai.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_telemetry.py (1)

292-307: 🛠️ Refactor suggestion | 🟠 Major

Use tmp_path fixtures instead of hard-coded host paths.

These regressions bake in /tmp/... and /home/user/... literals, which keeps the local-path expectations non-deterministic and violates the test-fixture rule. Build the local samples under tmp_path, pass str(tmp_path / "..."), and type the added parameter as tmp_path: Path.

As per coding guidelines: tests/**/*.py: Use deterministic fixtures only in tests; never reference host paths like /etc/passwd; create all targets under tmp_path; use type hints tmp_path: Path on test parameters.

Also applies to: 366-367, 398-399

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_telemetry.py` around lines 292 - 307, The test embeds
host-specific paths in the `results` dict (assets paths like "/tmp/a.pkl",
"/tmp/b.zip" and issue locations), which should be replaced with deterministic
tmp_path fixtures; add a `tmp_path: Path` parameter to the test signature,
construct file paths using `str(tmp_path / "a.pkl")`, `str(tmp_path / "b.zip")`
(and any other occurrences noted around the test, e.g., the similar literals at
the other occurrences), and update both the "assets" entries and "issues"
location fields to use those tmp_path-based strings so the test no longer
references hard-coded host paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/telemetry.py`:
- Around line 377-383: The _build_url_properties function currently uses helpers
(_extract_domain and _extract_model_name) that return parsed.netloc, which can
include userinfo and leak credentials; change these helpers to derive
domain/model_name values from parsed.hostname and include parsed.port only if
present (i.e., reconstruct host[:port] from parsed.hostname and parsed.port
instead of using parsed.netloc), and update _build_url_properties to use the
revised helpers; also add a regression test that passes a credential-bearing URL
(e.g., https://user:pass@example.com/...) and asserts that neither domain nor
model_name contain userinfo.

---

Outside diff comments:
In `@tests/test_telemetry.py`:
- Around line 292-307: The test embeds host-specific paths in the `results` dict
(assets paths like "/tmp/a.pkl", "/tmp/b.zip" and issue locations), which should
be replaced with deterministic tmp_path fixtures; add a `tmp_path: Path`
parameter to the test signature, construct file paths using `str(tmp_path /
"a.pkl")`, `str(tmp_path / "b.zip")` (and any other occurrences noted around the
test, e.g., the similar literals at the other occurrences), and update both the
"assets" entries and "issues" location fields to use those tmp_path-based
strings so the test no longer references hard-coded host paths.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8abcee8c-3540-41c4-b9e9-e5f4dab74f9d

📥 Commits

Reviewing files that changed from the base of the PR and between 8f68169 and 844ff89.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • modelaudit/telemetry.py
  • tests/test_telemetry.py

@mldangelo-oai mldangelo-oai changed the title fix(telemetry): hash file paths in payloads fix(telemetry): omit raw file paths and URLs from payloads Mar 20, 2026
Resolve changelog conflict and scrub credential-bearing URL hosts from telemetry.

Co-authored-by: Codex <noreply@openai.com>
@mldangelo-oai
Copy link
Contributor

mldangelo-oai commented Mar 20, 2026

Superseded by the corrected comment below.

Copy link
Contributor

Addressed the outstanding review feedback in 2c945ea:

  • Updated the PR title and changelog to match the actual behavior: omit raw paths/URLs, not hash them.
  • Switched the new telemetry regressions to tmp_path-backed fixtures.
  • Fixed credential-bearing URL scrubbing so domain/model_name are derived from hostname[:port], not netloc, and added a regression for https://user:pass@example.com/....

Local QA:

  • uv run ruff format modelaudit/ tests/
  • uv run ruff check modelaudit/ tests/
  • uv run ruff format --check modelaudit/ tests/
  • uv run mypy modelaudit/ tests/
  • uv run pytest tests/test_telemetry.py -q
  • uv run pytest -n auto -m "not slow and not integration" --maxfail=1
  • direct smoke confirmed no user:pass@ leak in captured telemetry properties

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/telemetry.py`:
- Around line 369-375: The telemetry still emits raw identifiers—specifically
the extracted model_name and domain—so update the telemetry builders to avoid
leaking that data: in _build_path_properties (which calls _extract_model_name
and _classify_path) stop returning the raw "model_name" (either drop the key
entirely or replace it with a deterministic non-reversible hash like SHA256 of
the extracted name) and ensure any place that extracts a host/domain from URLs
returns a hashed or omitted "domain" instead of the literal host; likewise
update record_scan_started to not emit raw model_names but to emit the hashed
versions (or omit the field) by mapping each entry through the same hashing
utility to keep telemetry consistent across _build_path_properties and the code
paths referenced around the record_scan_started/model_names usage.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 105f2e41-1ca3-4dc4-8b13-cfc3b51940e9

📥 Commits

Reviewing files that changed from the base of the PR and between 844ff89 and 2c945ea.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • modelaudit/telemetry.py
  • tests/test_telemetry.py

Comment on lines +369 to +375
def _build_path_properties(self, path: str) -> dict[str, Any]:
"""Build coarse telemetry properties for a file path."""
return {
"path_type": self._classify_path(path),
"file_extension": Path(path).suffix.lower(),
"model_name": self._extract_model_name(path),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Drop or hash model_name/domain as well.

These payloads still send raw identifiers even though file_path/url are gone: model_name is the literal basename / repo slug / URL leaf, and domain is the literal host. That still leaks user- or infra-specific data from paths and URLs (for example, local artifact names or internal download hosts), so the privacy fix is incomplete. The same treatment is also needed for record_scan_started()'s model_names field.

✏️ Proposed fix
     def _build_path_properties(self, path: str) -> dict[str, Any]:
         """Build coarse telemetry properties for a file path."""
         return {
             "path_type": self._classify_path(path),
             "file_extension": Path(path).suffix.lower(),
-            "model_name": self._extract_model_name(path),
         }

     def _build_url_properties(self, url: str) -> dict[str, Any]:
         """Build coarse telemetry properties for a download URL."""
         return {
-            "domain": self._extract_domain(url),
             "path_type": self._classify_path(url),
-            "model_name": self._extract_model_name(url),
         }
@@
                 issue_details.append(
                     {
                         "type": issue_type,
                         "severity": severity,
                         "location_type": self._classify_path(issue_location) if issue_location else "unknown",
-                        "model_name": self._extract_model_name(issue_location) if issue_location else None,
                     }
                 )

Also applies to: 377-383, 542-549

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/telemetry.py` around lines 369 - 375, The telemetry still emits
raw identifiers—specifically the extracted model_name and domain—so update the
telemetry builders to avoid leaking that data: in _build_path_properties (which
calls _extract_model_name and _classify_path) stop returning the raw
"model_name" (either drop the key entirely or replace it with a deterministic
non-reversible hash like SHA256 of the extracted name) and ensure any place that
extracts a host/domain from URLs returns a hashed or omitted "domain" instead of
the literal host; likewise update record_scan_started to not emit raw
model_names but to emit the hashed versions (or omit the field) by mapping each
entry through the same hashing utility to keep telemetry consistent across
_build_path_properties and the code paths referenced around the
record_scan_started/model_names usage.

@mldangelo-oai mldangelo-oai changed the title fix(telemetry): omit raw file paths and URLs from payloads fix(telemetry): preserve model refs while stripping secrets Mar 20, 2026
Keep secret-stripped model references in telemetry so scanned remote models remain identifiable while credentials, query strings, fragments, and local directory paths stay out of payloads.

Co-authored-by: Codex <noreply@openai.com>
@mldangelo-oai mldangelo-oai marked this pull request as ready for review March 20, 2026 17:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants