fix(telemetry): preserve model refs while stripping secrets by typpo · Pull Request #717 · promptfoo/modelaudit

typpo · 2026-03-16T21:21:32Z

Use hashed identifiers for telemetry file and URL fields instead of raw values.
Update the telemetry regression tests and changelog wording.

Summary by CodeRabbit

Bug Fixes
- Telemetry now omits raw file paths, URLs, and direct location identifiers; events include only coarse path/URL metadata (type, extension, domain) and model name when resolvable, with unknown locations reported as such.
Tests
- Tests updated to assert absence of raw path/URL values and identifiers, validate unknown-location behavior, and add JSON checks preventing sensitive leakage.
Documentation
- Changelog updated to note the privacy-preserving telemetry change.

coderabbitai · 2026-03-16T21:21:54Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Telemetry payloads were changed to stop emitting raw file paths, URLs, and hashed identifiers. New helpers assemble coarse properties (e.g., path_type, file_extension, domain, model_name). Telemetry recorders and tests were updated to use the new privacy-preserving payload shape.

Changes

Cohort / File(s)	Summary
Telemetry core `modelaudit/telemetry.py`	Removed `_hash_identifier()` and `hashlib` usage. Added `_build_path_properties(path)` and `_build_url_properties(url)` and helpers (`_extract_url_host`). Refactored telemetry recorders to omit raw `file_path`/`url` and identifier fields; emit coarse fields like `path_type`, `file_extension`, `domain`, and `model_name`.
Tests `tests/test_telemetry.py`	Updated tests to use `tmp_path` fixtures and to expect omission of raw paths/URLs and identifier fields. Adjusted assertions for scan-started/completed payloads, added an issue with `location=None` (expect `location_type == "unknown"`, `model_name is None`), and added JSON checks to ensure no sensitive strings are serialized.
Changelog `CHANGELOG.md`	Added Unreleased → Fixed entry noting telemetry no longer emits raw file path or URL identifiers.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I nibble paths and tuck them tight,
Raw routes hidden out of sight,
I hop and build a gentler trace,
Small fields remain, no names to chase,
A rabbit's hush—telemetry made light.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The PR title mentions 'preserve model refs' and 'strip secrets', but the actual changes remove hashing and raw identifiers altogether—not preserve/strip them in the traditional sense.	Revise the title to better reflect the actual intent: e.g., 'fix(telemetry): omit raw file paths and URLs from payloads' or 'fix(telemetry): replace identifiers with coarse properties'.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch codex/telemetry-file-hashes

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/telemetry.py`:
- Around line 544-551: The code coerces issue.get("location", "") to str,
turning None into the string "None" and causing _hash_identifier to produce
synthetic IDs; change issue_location to keep the original value (e.g.,
issue_location = issue.get("location")) and only convert to string or call
_hash_identifier / _extract_model_name / _classify_path when the value is
non-null/non-empty (e.g., if issue_location: location_str = str(issue_location);
use location_str for classification but call _hash_identifier(location_str) only
when issue_location is truthy, otherwise set location_identifier/model_name to
None). Apply the same fix to the other block around the references to
_hash_identifier/_classify_path/_extract_model_name noted at 572-576.

In `@tests/test_telemetry.py`:
- Around line 370-372: Add a return type annotation "-> None" to the test method
signature for test_path_and_url_fields_use_hashed_identifiers_only so the method
reads "def test_path_and_url_fields_use_hashed_identifiers_only(self) -> None:";
update only the function signature to include the annotation to satisfy the
repository typing rules for tests.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6352dee5-7c86-42e4-86fb-b473b01282c1

📥 Commits

Reviewing files that changed from the base of the PR and between d9fe283 and 26588e7.

📒 Files selected for processing (3)

CHANGELOG.md
modelaudit/telemetry.py
tests/test_telemetry.py

modelaudit/telemetry.py

coderabbitai · 2026-03-16T21:25:20Z

tests/test_telemetry.py

+    def test_path_and_url_fields_use_hashed_identifiers_only(self):
+        """Path and URL fields should not include raw sensitive values."""
        mock_posthog = MagicMock()


🛠️ Refactor suggestion | 🟠 Major

Add -> None return annotation on this test method.

This keeps test typing consistent with repository rules.

✏️ Proposed fix

- def test_path_and_url_fields_use_hashed_identifiers_only(self): + def test_path_and_url_fields_use_hashed_identifiers_only(self) -> None:

As per coding guidelines: "tests/**/*.py: Use type hints -> None on all test methods and tmp_path: Path / monkeypatch: pytest.MonkeyPatch on test parameters".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/test_telemetry.py` around lines 370 - 372, Add a return type annotation "-> None" to the test method signature for test_path_and_url_fields_use_hashed_identifiers_only so the method reads "def test_path_and_url_fields_use_hashed_identifiers_only(self) -> None:"; update only the function signature to include the annotation to satisfy the repository typing rules for tests.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_telemetry.py`:
- Around line 273-274: The test method
test_scan_completed_uses_top_level_results_schema is missing a return type
annotation; update its definition to include "-> None" (i.e., change the
function signature for test_scan_completed_uses_top_level_results_schema to
declare a None return type) to comply with the tests/**/*.py typing guideline.
- Around line 335-336: The test method definition
test_scan_started_includes_per_path_fields is missing a return type annotation;
update its signature to include the explicit return annotation "-> None" (i.e.,
change "def test_scan_started_includes_per_path_fields(self):" to "def
test_scan_started_includes_per_path_fields(self) -> None:") so the test follows
the project's typing guideline for tests.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c05982f2-472b-459c-82de-24148c75fb41

📥 Commits

Reviewing files that changed from the base of the PR and between 26588e7 and 1d911a5.

📒 Files selected for processing (3)

CHANGELOG.md
modelaudit/telemetry.py
tests/test_telemetry.py

tests/test_telemetry.py

# Conflicts: # CHANGELOG.md

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.md`:
- Line 84: Update the CHANGELOG entry that currently reads "- **telemetry:**
simplify telemetry payloads to omit file path identifiers" to accurately reflect
the implementation: if the code replaces paths with hashed values, change the
wording to "hash file path identifiers in telemetry payloads" or "replace raw
file path identifiers with hashed values"; if the code actually removes them,
keep "omit file path identifiers"; ensure the changelog text matches the
implementation and mentions "telemetry" and "file path identifiers" so it's easy
to locate and verify.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9743f77e-460c-423f-944e-a75319921bd0

📥 Commits

Reviewing files that changed from the base of the PR and between 1d911a5 and 8f68169.

📒 Files selected for processing (1)

CHANGELOG.md

CHANGELOG.md

Resolve the changelog conflict, keep missing issue locations out of telemetry detail fields, and apply the touched test annotations from review. Co-authored-by: Codex <noreply@openai.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/test_telemetry.py (1)
292-307: 🛠️ Refactor suggestion | 🟠 Major

Use tmp_path fixtures instead of hard-coded host paths.

These regressions bake in /tmp/... and /home/user/... literals, which keeps the local-path expectations non-deterministic and violates the test-fixture rule. Build the local samples under tmp_path, pass str(tmp_path / "..."), and type the added parameter as tmp_path: Path.

As per coding guidelines: tests/**/*.py: Use deterministic fixtures only in tests; never reference host paths like /etc/passwd; create all targets under tmp_path; use type hints tmp_path: Path on test parameters.

Also applies to: 366-367, 398-399
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_telemetry.py` around lines 292 - 307, The test embeds
host-specific paths in the `results` dict (assets paths like "/tmp/a.pkl",
"/tmp/b.zip" and issue locations), which should be replaced with deterministic
tmp_path fixtures; add a `tmp_path: Path` parameter to the test signature,
construct file paths using `str(tmp_path / "a.pkl")`, `str(tmp_path / "b.zip")`
(and any other occurrences noted around the test, e.g., the similar literals at
the other occurrences), and update both the "assets" entries and "issues"
location fields to use those tmp_path-based strings so the test no longer
references hard-coded host paths.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/telemetry.py`:
- Around line 377-383: The _build_url_properties function currently uses helpers
(_extract_domain and _extract_model_name) that return parsed.netloc, which can
include userinfo and leak credentials; change these helpers to derive
domain/model_name values from parsed.hostname and include parsed.port only if
present (i.e., reconstruct host[:port] from parsed.hostname and parsed.port
instead of using parsed.netloc), and update _build_url_properties to use the
revised helpers; also add a regression test that passes a credential-bearing URL
(e.g., https://user:pass@example.com/...) and asserts that neither domain nor
model_name contain userinfo.

---

Outside diff comments:
In `@tests/test_telemetry.py`:
- Around line 292-307: The test embeds host-specific paths in the `results` dict
(assets paths like "/tmp/a.pkl", "/tmp/b.zip" and issue locations), which should
be replaced with deterministic tmp_path fixtures; add a `tmp_path: Path`
parameter to the test signature, construct file paths using `str(tmp_path /
"a.pkl")`, `str(tmp_path / "b.zip")` (and any other occurrences noted around the
test, e.g., the similar literals at the other occurrences), and update both the
"assets" entries and "issues" location fields to use those tmp_path-based
strings so the test no longer references hard-coded host paths.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8abcee8c-3540-41c4-b9e9-e5f4dab74f9d

📥 Commits

Reviewing files that changed from the base of the PR and between 8f68169 and 844ff89.

📒 Files selected for processing (3)

CHANGELOG.md
modelaudit/telemetry.py
tests/test_telemetry.py

modelaudit/telemetry.py

Resolve changelog conflict and scrub credential-bearing URL hosts from telemetry. Co-authored-by: Codex <noreply@openai.com>

mldangelo-oai · 2026-03-20T16:54:54Z

Superseded by the corrected comment below.

mldangelo-oai · 2026-03-20T16:55:12Z

Addressed the outstanding review feedback in 2c945ea:

Updated the PR title and changelog to match the actual behavior: omit raw paths/URLs, not hash them.
Switched the new telemetry regressions to tmp_path-backed fixtures.
Fixed credential-bearing URL scrubbing so domain/model_name are derived from hostname[:port], not netloc, and added a regression for https://user:pass@example.com/....

Local QA:

uv run ruff format modelaudit/ tests/
uv run ruff check modelaudit/ tests/
uv run ruff format --check modelaudit/ tests/
uv run mypy modelaudit/ tests/
uv run pytest tests/test_telemetry.py -q
uv run pytest -n auto -m "not slow and not integration" --maxfail=1
direct smoke confirmed no user:pass@ leak in captured telemetry properties

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/telemetry.py`:
- Around line 369-375: The telemetry still emits raw identifiers—specifically
the extracted model_name and domain—so update the telemetry builders to avoid
leaking that data: in _build_path_properties (which calls _extract_model_name
and _classify_path) stop returning the raw "model_name" (either drop the key
entirely or replace it with a deterministic non-reversible hash like SHA256 of
the extracted name) and ensure any place that extracts a host/domain from URLs
returns a hashed or omitted "domain" instead of the literal host; likewise
update record_scan_started to not emit raw model_names but to emit the hashed
versions (or omit the field) by mapping each entry through the same hashing
utility to keep telemetry consistent across _build_path_properties and the code
paths referenced around the record_scan_started/model_names usage.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 105f2e41-1ca3-4dc4-8b13-cfc3b51940e9

📥 Commits

Reviewing files that changed from the base of the PR and between 844ff89 and 2c945ea.

📒 Files selected for processing (3)

CHANGELOG.md
modelaudit/telemetry.py
tests/test_telemetry.py

coderabbitai · 2026-03-20T17:01:09Z

modelaudit/telemetry.py

+    def _build_path_properties(self, path: str) -> dict[str, Any]:
+        """Build coarse telemetry properties for a file path."""
+        return {
+            "path_type": self._classify_path(path),
+            "file_extension": Path(path).suffix.lower(),
+            "model_name": self._extract_model_name(path),
+        }


⚠️ Potential issue | 🟠 Major

Drop or hash model_name/domain as well.

These payloads still send raw identifiers even though file_path/url are gone: model_name is the literal basename / repo slug / URL leaf, and domain is the literal host. That still leaks user- or infra-specific data from paths and URLs (for example, local artifact names or internal download hosts), so the privacy fix is incomplete. The same treatment is also needed for record_scan_started()'s model_names field.

✏️ Proposed fix

def _build_path_properties(self, path: str) -> dict[str, Any]: """Build coarse telemetry properties for a file path.""" return { "path_type": self._classify_path(path), "file_extension": Path(path).suffix.lower(), - "model_name": self._extract_model_name(path), } def _build_url_properties(self, url: str) -> dict[str, Any]: """Build coarse telemetry properties for a download URL.""" return { - "domain": self._extract_domain(url), "path_type": self._classify_path(url), - "model_name": self._extract_model_name(url), } @@ issue_details.append( { "type": issue_type, "severity": severity, "location_type": self._classify_path(issue_location) if issue_location else "unknown", - "model_name": self._extract_model_name(issue_location) if issue_location else None, } )

Also applies to: 377-383, 542-549

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@modelaudit/telemetry.py` around lines 369 - 375, The telemetry still emits raw identifiers—specifically the extracted model_name and domain—so update the telemetry builders to avoid leaking that data: in _build_path_properties (which calls _extract_model_name and _classify_path) stop returning the raw "model_name" (either drop the key entirely or replace it with a deterministic non-reversible hash like SHA256 of the extracted name) and ensure any place that extracts a host/domain from URLs returns a hashed or omitted "domain" instead of the literal host; likewise update record_scan_started to not emit raw model_names but to emit the hashed versions (or omit the field) by mapping each entry through the same hashing utility to keep telemetry consistent across _build_path_properties and the code paths referenced around the record_scan_started/model_names usage.

Keep secret-stripped model references in telemetry so scanned remote models remain identifiable while credentials, query strings, fragments, and local directory paths stay out of payloads. Co-authored-by: Codex <noreply@openai.com>

fix: hash telemetry file paths

26588e7

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

fix: remove telemetry path identifiers

1d911a5

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

tests/test_telemetry.py Outdated Show resolved Hide resolved

tests/test_telemetry.py Outdated Show resolved Hide resolved

Merge remote-tracking branch 'origin/main' into audit/pr717

8f68169

# Conflicts: # CHANGELOG.md

coderabbitai bot reviewed Mar 18, 2026

View reviewed changes

CHANGELOG.md Outdated Show resolved Hide resolved

Merge origin/main into codex/telemetry-file-hashes

844ff89

Resolve the changelog conflict, keep missing issue locations out of telemetry detail fields, and apply the touched test annotations from review. Co-authored-by: Codex <noreply@openai.com>

coderabbitai bot reviewed Mar 20, 2026

View reviewed changes

modelaudit/telemetry.py Show resolved Hide resolved

mldangelo-oai changed the title ~~fix(telemetry): hash file paths in payloads~~ fix(telemetry): omit raw file paths and URLs from payloads Mar 20, 2026

Merge origin/main into codex/telemetry-file-hashes

2c945ea

Resolve changelog conflict and scrub credential-bearing URL hosts from telemetry. Co-authored-by: Codex <noreply@openai.com>

coderabbitai bot reviewed Mar 20, 2026

View reviewed changes

mldangelo-oai changed the title ~~fix(telemetry): omit raw file paths and URLs from payloads~~ fix(telemetry): preserve model refs while stripping secrets Mar 20, 2026

fix(telemetry): preserve scrubbed model references

1e68413

Keep secret-stripped model references in telemetry so scanned remote models remain identifiable while credentials, query strings, fragments, and local directory paths stay out of payloads. Co-authored-by: Codex <noreply@openai.com>

mldangelo-oai marked this pull request as ready for review March 20, 2026 17:07

Conversation

typpo commented Mar 16, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mldangelo-oai commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mldangelo-oai commented Mar 20, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

typpo commented Mar 16, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 16, 2026 •

edited

Loading

mldangelo-oai commented Mar 20, 2026 •

edited

Loading