Use AST-based hashing for semantic file change detection#259

Open

Copilot wants to merge 4 commits into main from copilot/use-ast-hashing-for-file-changes

Conversation


Copilot AI commented Mar 19, 2026

File hashing used raw content bytes, so comment edits, whitespace changes, and reformatting all triggered unnecessary re-indexing. This replaces raw hashing with AST-based hashing for the 25+ languages already supported by ast-grep.

How it works

Whitespace is normalized away automatically, since it is not represented in the tree, and comment nodes are skipped during traversal. The canonical representation is the sequence of node kinds and leaf text values, which is then blake3-hashed as before.

# Same hash — only comments differ
compute_semantic_file_hash(b"# old\ndef f(): pass", Path("x.py"))
compute_semantic_file_hash(b"# new\ndef f(): pass", Path("x.py"))

# Same hash — only formatting differs
compute_semantic_file_hash(b"def add(a,b):\n    return a+b", Path("x.py"))
compute_semantic_file_hash(b"def add(a, b):\n    return a + b", Path("x.py"))

# Different hash — operator changed
compute_semantic_file_hash(b"return a + b", Path("x.py"))
compute_semantic_file_hash(b"return a * b", Path("x.py"))

Non-semantic files (markdown, plain text, etc.) and AST parse failures fall back to raw content hashing.
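
This flow can be sketched with Python's stdlib ast module as a single-language stand-in for ast-grep (sha256 stands in for blake3; the helper names below are illustrative, not the PR's actual functions):

```python
import ast
import hashlib
from pathlib import Path


def _walk(node: ast.AST, parts: list[str]) -> None:
    # Record every node type; whitespace and comments never appear in
    # Python's AST, so they cannot influence the hash.
    parts.append(type(node).__name__)
    # Keep leaf values (names, constants) so renames and literal edits
    # still change the hash.
    if isinstance(node, ast.Name):
        parts.append(node.id)
    elif isinstance(node, ast.Constant):
        parts.append(repr(node.value))
    for child in ast.iter_child_nodes(node):
        _walk(child, parts)


def semantic_hash(content: bytes, path: Path) -> str:
    """AST-based hash with raw-content fallback (sha256 stands in for blake3)."""
    # path is unused in this sketch; the real helper uses it for
    # language detection across the 25+ ast-grep languages.
    try:
        tree = ast.parse(content.decode("utf-8", errors="replace"))
    except SyntaxError:
        return hashlib.sha256(content).hexdigest()  # fallback: raw bytes
    parts: list[str] = []
    _walk(tree, parts)
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()
```

With this sketch, the three example pairs above behave as described: comment-only and formatting-only edits hash identically, while an operator change does not.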

Changes

  • discovery.py: Add compute_semantic_file_hash() with internal helpers _compute_ast_hash, _walk_ast_nodes, _get_semantic_language. Update DiscoveredFile.__init__, from_path, and file_hash property to use it.
  • indexing_service.py: Replace get_blake_hash(content_bytes) with compute_semantic_file_hash(content_bytes, path) in _process_discovery_batch.
  • test_semantic_hashing.py: 23 unit tests covering comment invariance, formatting invariance, semantic change detection, language detection, fallback behavior, and DiscoveredFile integration.

Notes

  • No new dependencies — uses existing ast_grep_py (tree-sitter).
  • pyo3_runtime.PanicException from unsupported languages doesn't inherit Exception, so the fallback catches BaseException (re-raising KeyboardInterrupt/SystemExit).
  • Docstring changes do produce different hashes since docstrings are AST expression nodes, not comments.
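
The PanicException handling described above follows this pattern, shown here as a minimal self-contained sketch (the wrapper name and callable argument are illustrative):

```python
def run_ast_parse_safely(parse):
    """Run a parser callable, returning None instead of raising on failure."""
    try:
        return parse()
    except (KeyboardInterrupt, SystemExit):
        # User/interpreter shutdown must propagate.
        raise
    except BaseException:
        # pyo3_runtime.PanicException inherits from BaseException, not
        # Exception, so a plain `except Exception` would miss it.
        return None
```
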
Original prompt

This section details the original issue you should resolve

<issue_title>Use AST-based hashing for semantic file change detection</issue_title>
<issue_description>## Description

Implement semantic-aware file hashing using AST (Abstract Syntax Tree) instead of raw file contents for better change detection.

Current Approach

  • Files hashed based on raw content
  • Formatting changes trigger unnecessary re-indexing
  • Comments and whitespace changes cause false positives

Proposed Approach

For Semantic Files (discovery.py:165, 182)

Use ast() to compute hash for semantic files like Python, JavaScript, etc.

Benefits:

  • Detect actual semantic changes
  • Ignore formatting, comments, whitespace
  • More intelligent change detection
  • Reduce unnecessary re-indexing

Implementation:

# Hash AST structure instead of raw bytes
ast_hash = hash_ast(parse_file(path))

Considerations

  • Language-specific AST parsers (tree-sitter already available)
  • Fallback to content hash for non-semantic files
  • Performance impact of parsing
  • Cache parsed ASTs when possible

Impact

  • Fewer unnecessary re-indexes
  • Better performance
  • Smarter change detection
  • More efficient resource usage

Source

  • Files: discovery.py:165, discovery.py:182
  • Branch: 003-our-aim-to</issue_description>

<agent_instructions>@copilot we need a new assessment on this issue based on the codebase's current state following major refactoring

If you assess the path to implementation is straight forward, then go ahead, complete it and submit a PR. Otherwise, submit a plan.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@bashandbone @copilot we need a new assessment on this issue based on the codebase's current state following major refactoring


Use ast-grep to parse supported language files (25+ languages) and
compute hashes from the canonical AST representation instead of raw
file bytes. This ignores comments, whitespace, and formatting changes
so that only genuine semantic modifications trigger re-indexing.

Changes:
- Add compute_semantic_file_hash() and helpers to discovery.py
- Update DiscoveredFile.__init__, from_path, and file_hash property
- Update indexing_service._process_discovery_batch to use semantic hashing
- Add comprehensive unit tests (23 tests) for semantic hashing

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement AST-based hashing for semantic file change detection" to "Use AST-based hashing for semantic file change detection" on Mar 19, 2026
Copilot AI requested a review from bashandbone March 19, 2026 19:50
@bashandbone bashandbone marked this pull request as ready for review March 19, 2026 22:43
Copilot AI review requested due to automatic review settings March 19, 2026 22:43
Copilot AI left a comment

Pull request overview

Introduces semantic-aware file hashing to reduce unnecessary re-indexing by hashing an AST-derived canonical form (ignoring comments/formatting) for supported languages, with a fallback to raw content hashing when AST parsing isn’t available.

Changes:

  • Add compute_semantic_file_hash() and AST-walking helpers in core/discovery.py, and wire it into DiscoveredFile hashing.
  • Update indexing discovery batching to use semantic hashing when deciding whether a file needs reindexing.
  • Add unit tests covering comment/whitespace invariance, semantic-change detection, language detection, fallback behavior, and DiscoveredFile integration.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/codeweaver/core/discovery.py Adds AST-based hashing and updates DiscoveredFile hashing behavior.
src/codeweaver/engine/services/indexing_service.py Uses semantic hashing when determining whether files have changed.
tests/unit/core/test_semantic_hashing.py Adds unit tests validating semantic hashing behavior and integration.


    return get_blake_hash(canonical)
except (KeyboardInterrupt, SystemExit):
    raise
except BaseException:
Copilot AI Mar 19, 2026

_compute_ast_hash catches BaseException, which will also swallow MemoryError/GeneratorExit and other non-recoverable errors (falling back to a content hash instead of surfacing the failure). If the goal is specifically to handle pyo3_runtime.PanicException from ast-grep, consider catching that type (or Exception plus that type) and/or explicitly re-raising MemoryError/GeneratorExit similar to how KeyboardInterrupt/SystemExit are handled.

Suggested change:
- except BaseException:
+ except Exception:

Comment on lines +337 to +344
"""Return the blake3 hash of the file, using AST-based hashing when supported."""
if self._file_hash is not None:
return self._file_hash
if self.path.exists() and self.path.is_file():
content_hash = get_blake_hash(self.path.read_bytes())
content_bytes = self.path.read_bytes()
content_hash = compute_semantic_file_hash(
content_bytes, self.path, ext_category=self.ext_category
)
Copilot AI Mar 19, 2026

DiscoveredFile.file_hash reads via self.path.exists()/read_bytes(), but path is often stored relative to the project root (see set_relative_path usage in from_path and absolute_path). If the current working directory isn’t the project root, this branch can incorrectly treat an existing file as missing and fall back to get_blake_hash(b""). Prefer using self.absolute_path for the existence/read checks here (and when passing the path into compute_semantic_file_hash) to ensure consistent behavior regardless of CWD.

@github-actions

Code Review - AST-based Semantic File Hashing

I've reviewed PR #259 and the implementation looks excellent! This is a well-designed feature that addresses a real problem with smart engineering.

✅ Strengths

1. Sound Architecture

  • Clean separation of concerns with _walk_ast_nodes, _compute_ast_hash, _get_semantic_language, and compute_semantic_file_hash functions
  • Proper fallback strategy: AST parsing → content hash (graceful degradation)
  • Uses existing ast_grep_py dependency (no new dependencies)

2. Robust Error Handling

  • Catches BaseException to handle pyo3_runtime.PanicException (which doesn't inherit from Exception)
  • Re-raises KeyboardInterrupt and SystemExit appropriately
  • Debug logging with exc_info=True for troubleshooting (discovery.py:102, 142)
  • Multiple fallback paths ensure the system remains functional

3. Excellent Test Coverage

  • 23 comprehensive unit tests covering:
    • Comment invariance (hash stays same when only comments change)
    • Formatting invariance (whitespace/spacing changes don't affect hash)
    • Semantic change detection (logic changes produce different hashes)
    • Language detection
    • Fallback behavior
    • DiscoveredFile integration
  • Tests cover both Python and JavaScript
  • Tests verify docstrings ARE considered semantic (correct behavior)

4. Performance Considerations

  • AST parsing happens only for supported semantic files
  • Efficient tree traversal that skips comment nodes
  • Blake3 hashing remains fast

5. Documentation

  • Clear docstrings explaining behavior
  • Updated field descriptions (discovery.py:174)
  • Good code examples in PR description

🔍 Observations

1. Test Failures Are Unrelated
All semantic hashing tests (23/23) are passing in CI. The failing tests are in:

  • tests/e2e/test_user_journeys.py - E2E user journey tests
  • tests/integration/real/test_full_pipeline.py - Integration tests
  • tests/integration/real/test_search_behavior.py - Search behavior tests
  • tests/integration/workflows/test_search_workflows.py - Workflow tests

These appear to be pre-existing flaky tests or issues unrelated to the semantic hashing changes.

2. Implementation Notes

  • The AST hash includes both node kinds and leaf text: f"{kind}:{node.text()}" (discovery.py:71)
  • This ensures semantic content is captured while formatting is normalized
  • Proper handling of empty files (returns None for empty AST)

3. Integration Points

  • DiscoveredFile.__init__ (discovery.py:220)
  • DiscoveredFile.from_path (discovery.py:273)
  • DiscoveredFile.file_hash property (discovery.py:341)
  • IndexingService._process_discovery_batch (indexing_service.py:316)

All integration points properly use the new semantic hashing function.

💡 Minor Suggestions

1. Consider Caching (Future Enhancement)
For large files that are parsed frequently, consider caching the AST or hash. This could be a future optimization if profiling shows it's needed.
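
One way such a cache could look, keyed on content bytes plus file suffix (hypothetical name; the sha256 body is a placeholder for the real AST-hash computation):

```python
import hashlib
from functools import lru_cache


@lru_cache(maxsize=4096)
def cached_semantic_hash(content: bytes, suffix: str) -> str:
    # Placeholder body: a real version would parse the AST and hash the
    # canonical form here; identical (content, suffix) pairs skip that work.
    return hashlib.sha256(content + suffix.encode()).hexdigest()
```

Because the key is the content itself, repeated hashing of unchanged files during discovery would hit the cache rather than re-parse.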

2. Metrics/Telemetry (Future Enhancement)
Track AST parsing failures vs. successes to monitor which languages/files are falling back to content hashing. This could help identify parser issues or unsupported edge cases.

3. Documentation Enhancement
Consider adding a comment explaining why BaseException is needed for pyo3_runtime.PanicException (discovery.py:101). The PR description mentions this, but it would be helpful in the code itself.

🎯 Impact Assessment

Positive Impacts:

  • ✅ Significantly reduces unnecessary re-indexing from formatting changes
  • ✅ Better developer experience (linters, formatters won't trigger re-index)
  • ✅ More efficient resource usage
  • ✅ Supports 25+ languages via ast-grep

Security: No security concerns identified.

Performance: Expected improvement due to fewer unnecessary re-indexing operations. AST parsing adds minimal overhead and only runs for supported semantic files.

Backward Compatibility: Fully compatible. Existing content hashes will differ from new semantic hashes, causing a one-time re-index, which is expected and acceptable.

📋 Verdict

This PR is well-implemented and ready to merge once the unrelated test failures are addressed.

The semantic hashing implementation itself is solid. I recommend:

  1. Investigating the failing E2E and integration tests (they appear unrelated to this PR)
  2. If those tests are flaky or pre-existing issues, consider merging this PR and addressing them separately
  3. Consider adding the minor documentation enhancement mentioned above

Great work on this feature! 🎉

@github-actions

Code Review - PR #259: Use AST-based hashing for semantic file change detection

I've thoroughly reviewed this PR and the implementation looks excellent! Here's my detailed analysis:

✅ Overview

This PR successfully implements semantic-aware file hashing using AST (Abstract Syntax Tree) instead of raw file contents for better change detection. This is a significant improvement that will reduce unnecessary re-indexing when only formatting, comments, or whitespace changes occur.

✅ Implementation Quality

Strengths:

  1. Well-designed architecture: The implementation follows a clean separation of concerns with three helper functions (_walk_ast_nodes, _compute_ast_hash, _get_semantic_language) and one public API (compute_semantic_file_hash).

  2. Robust error handling:

    • Catches BaseException to handle pyo3_runtime.PanicException which doesn't inherit from Exception
    • Properly re-raises KeyboardInterrupt and SystemExit
    • Graceful fallback to content hashing on any failure
    • Good use of debug logging with exc_info=True for troubleshooting
  3. Comprehensive testing: 23 new unit tests cover:

    • Comment invariance ✓
    • Formatting/whitespace invariance ✓
    • Semantic change detection ✓
    • Language detection ✓
    • Fallback behavior ✓
    • DiscoveredFile integration ✓
    • All tests are passing
  4. Leverages existing dependencies: Uses already-included ast-grep-py (tree-sitter) without adding new dependencies.

  5. Documentation: Clear docstrings explaining behavior, especially the important note about docstrings being semantic content (not comments).

✅ Code Changes Analysis

src/codeweaver/core/discovery.py:

  • _walk_ast_nodes(): Recursively traverses AST, skipping comment nodes by checking if "comment" is in the kind name (case-insensitive). Builds a canonical representation of node kinds and leaf text.
  • _compute_ast_hash(): Wraps AST parsing with proper error handling, returns None on failure.
  • _get_semantic_language(): Determines if a file has AST support via extension or ExtCategory.
  • compute_semantic_file_hash(): Public API that attempts AST hashing, falls back to content hashing.
  • Integration points: Updated DiscoveredFile.__init__, from_path, and file_hash property to use the new function.

src/codeweaver/engine/services/indexing_service.py:

  • Line 316: Changed from get_blake_hash(content_bytes) to compute_semantic_file_hash(content_bytes, path) in _process_discovery_batch.

⚠️ Test Failures (Unrelated to this PR)

The CI shows 23 failing tests, but these are NOT caused by this PR:

  • Most failures are from MCP spec validation errors: ValueError: Output schemas must represent object types due to MCP spec limitations
  • Some e2e test failures with KeyError: 'max_file_size'
  • One FileNotFoundError in vector storage
  • All new semantic hashing tests (23/23) are passing

These appear to be pre-existing issues in the test suite, likely from recent refactoring mentioned in the PR description.

💡 Technical Considerations

Good decisions:

  1. Comment detection (line 67): Using "comment" in kind.lower() is simple and effective for catching comment node types across different languages (e.g., "comment", "line_comment", "block_comment").

  2. Docstrings are semantic: The PR correctly notes that docstrings produce different hashes since they're AST expression nodes, not comments. This is the right choice as docstrings are part of the API contract.

  3. UTF-8 decoding with error handling (line 135): Uses errors="replace" which prevents crashes on invalid UTF-8 while still allowing AST hashing to proceed.

  4. Walrus operator usage: Clean use of assignment expressions (:=) for both language and hash checks.
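
The shape of that happy path can be sketched with stub callables (hypothetical names, not the PR's actual signature):

```python
def compute_hash_with_fallback(content, path, get_lang, ast_hash, raw_hash):
    # Walrus assignments keep the happy path to two lines: fall back to
    # raw hashing when language detection or AST hashing yields nothing.
    if (language := get_lang(path)) and (digest := ast_hash(content, language)):
        return digest
    return raw_hash(content)
```
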

🎯 Security & Performance

Security:

  • ✅ No security concerns
  • ✅ Proper exception handling prevents crashes
  • ✅ No exposure of sensitive information

Performance:

  • ✅ AST parsing is only done for supported semantic languages
  • ✅ Graceful fallback ensures no performance regression for non-semantic files
  • ✅ Reduces unnecessary re-indexing, improving overall system performance
  • ✅ Already uses blake3 which is very fast

📊 Impact Assessment

Benefits:

  • Fewer unnecessary re-indexes when developers format code
  • More intelligent change detection
  • Better resource usage
  • Improves developer experience (auto-formatters won't trigger full re-indexing)

Risks:

  • ⚠️ Very low - comprehensive fallback strategy ensures robustness
  • All changes are backward compatible

✅ Recommendation

APPROVE - This PR is ready to merge.

The implementation is well-designed, thoroughly tested, and delivers significant value. The failing tests are unrelated to this PR and should be addressed separately.

Suggestions for future consideration:

  1. Consider caching parsed ASTs if files are hashed multiple times in quick succession
  2. Monitor AST parsing performance in production to validate the benefits

Test Results: All 23 new semantic hashing tests passing ✓
CI Failures: Unrelated pre-existing issues (23 failures in MCP/e2e tests)
Code Quality: Excellent - clean architecture, robust error handling, comprehensive tests
Documentation: Clear and complete

Great work! This is a valuable improvement to the codebase. 🎉

@github-actions

Additional Code Review - PR #259

I've conducted an additional review of the AST-based semantic file hashing implementation. Here are my findings:

✅ Code Quality & Design

Excellent Implementation:

  • Clean abstraction layers: The code properly separates concerns with _walk_ast_nodes, _compute_ast_hash, _get_semantic_language, and compute_semantic_file_hash (discovery.py:59-144)
  • Defensive programming: Proper exception handling with BaseException catch to handle pyo3_runtime.PanicException from unsupported languages
  • Graceful degradation: Falls back to content hashing when AST parsing fails or language is unsupported
  • No new dependencies: Leverages existing ast_grep_py (tree-sitter) dependency

🔍 Technical Observations

1. Comment Detection Algorithm (discovery.py:66-67)

if "comment" in kind.lower():
    return

This simple but effective approach filters out comment nodes, and works across the 25+ supported languages because tree-sitter grammars consistently include "comment" in their comment node kind names.
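
A minimal sketch of that skip logic over a toy dict-based tree (the real code walks ast-grep node objects; the dict shape here is illustrative):

```python
def walk_skip_comments(node: dict, parts: list[str]) -> None:
    # tree-sitter grammars use kind names such as "comment",
    # "line_comment", "block_comment"; the substring check covers them all.
    kind = node["kind"]
    if "comment" in kind.lower():
        return
    children = node.get("children", [])
    if children:
        parts.append(kind)
        for child in children:
            walk_skip_comments(child, parts)
    else:
        # Leaf node: keep kind and text so identifiers and literals
        # contribute to the canonical form.
        parts.append(f"{kind}:{node['text']}")
```
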

2. AST Canonicalization (discovery.py:71, 97)
The canonical representation combines node kinds and leaf text values, ensuring:

  • Whitespace normalization (not represented in AST)
  • Formatting independence
  • Semantic content preservation

3. Integration Points
All file hashing locations properly updated:

  • DiscoveredFile.__init__ (discovery.py:220)
  • DiscoveredFile.from_path (discovery.py:273)
  • DiscoveredFile.file_hash property (discovery.py:341)
  • IndexingService._process_discovery_batch (indexing_service.py:316)

4. Test Coverage (test_semantic_hashing.py)
Comprehensive 285-line test suite with 23 tests covering:

  • Comment invariance (Python & JavaScript)
  • Whitespace/formatting invariance
  • Semantic change detection
  • Docstring handling (correctly treated as semantic)
  • Language detection
  • Fallback behavior
  • DiscoveredFile integration

🎯 Impact Analysis

Benefits:

  • Reduced re-indexing: Formatting changes (prettier, black, eslint) won't trigger re-index
  • Better CI/CD: Automated formatters won't cause false positives
  • Resource efficiency: Fewer unnecessary embedding computations
  • 25+ language support: Python, JavaScript, TypeScript, Go, Rust, Java, C++, etc.

Migration Impact:
Existing file hashes will differ from new semantic hashes, triggering a one-time re-index of all semantic files. This is expected and acceptable behavior.

Performance:

  • AST parsing adds minimal overhead (only for semantic files)
  • Tree-sitter parsers are highly optimized
  • Blake3 hashing remains fast
  • Net performance gain from avoiding unnecessary re-indexing

⚠️ Test Failures (Unrelated to This PR)

The failing tests are NOT related to semantic hashing:

  • test_full_pipeline_index_then_search: MCP schema validation error
  • E2E user journey tests: Config/CLI issues
  • Integration tests: Pre-existing flakiness

All 23 semantic hashing tests pass successfully.

💡 Minor Recommendations

1. Documentation Enhancement
Consider adding inline comment explaining BaseException usage:

except (KeyboardInterrupt, SystemExit):
    raise
except BaseException:  # Catches pyo3_runtime.PanicException which doesn't inherit from Exception
    logger.debug("AST parsing failed for language %s", language_name, exc_info=True)

2. Future Optimization (Optional)
If profiling shows AST parsing as a bottleneck, consider:

  • Caching parsed ASTs for large frequently-accessed files
  • Parallel AST parsing for batch operations

3. Monitoring (Optional)
Track AST parsing success/failure rates to identify:

  • Languages with parser issues
  • Edge cases requiring fallback

✅ Security & Safety

  • No security concerns: Read-only operations, no external inputs
  • Safe error handling: Properly re-raises KeyboardInterrupt and SystemExit
  • Debug logging: Includes exc_info=True for troubleshooting (discovery.py:102, 142)

📊 Code Style & Best Practices

  • Follows project conventions
  • Type hints properly used
  • Docstrings clear and comprehensive
  • Error messages informative
  • Export added to __all__ (discovery.py:425)

🎉 Recommendation

APPROVE with minor suggestion

This is a well-architected, thoroughly tested feature that solves a real problem. The implementation is production-ready. The test failures are unrelated and should be addressed separately.

Suggested action: Merge this PR once the team confirms the failing tests are pre-existing issues.

Excellent work! 🚀

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
