Use AST-based hashing for semantic file change detection#259
Use AST-based hashing for semantic file change detection#259
Conversation
Use ast-grep to parse supported language files (25+ languages) and compute hashes from the canonical AST representation instead of raw file bytes. This ignores comments, whitespace, and formatting changes so that only genuine semantic modifications trigger re-indexing. Changes: - Add compute_semantic_file_hash() and helpers to discovery.py - Update DiscoveredFile.__init__, from_path, and file_hash property - Update indexing_service._process_discovery_batch to use semantic hashing - Add comprehensive unit tests (23 tests) for semantic hashing Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Introduces semantic-aware file hashing to reduce unnecessary re-indexing by hashing an AST-derived canonical form (ignoring comments/formatting) for supported languages, with a fallback to raw content hashing when AST parsing isn’t available.
Changes:
- Add
compute_semantic_file_hash()and AST-walking helpers incore/discovery.py, and wire it intoDiscoveredFilehashing. - Update indexing discovery batching to use semantic hashing when deciding whether a file needs reindexing.
- Add unit tests covering comment/whitespace invariance, semantic-change detection, language detection, fallback behavior, and
DiscoveredFileintegration.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/codeweaver/core/discovery.py |
Adds AST-based hashing and updates DiscoveredFile hashing behavior. |
src/codeweaver/engine/services/indexing_service.py |
Uses semantic hashing when determining whether files have changed. |
tests/unit/core/test_semantic_hashing.py |
Adds unit tests validating semantic hashing behavior and integration. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return get_blake_hash(canonical) | ||
| except (KeyboardInterrupt, SystemExit): | ||
| raise | ||
| except BaseException: |
There was a problem hiding this comment.
_compute_ast_hash catches BaseException, which will also swallow MemoryError/GeneratorExit and other non-recoverable errors (falling back to a content hash instead of surfacing the failure). If the goal is specifically to handle pyo3_runtime.PanicException from ast-grep, consider catching that type (or Exception plus that type) and/or explicitly re-raising MemoryError/GeneratorExit similar to how KeyboardInterrupt/SystemExit are handled.
| except BaseException: | |
| except Exception: |
| """Return the blake3 hash of the file, using AST-based hashing when supported.""" | ||
| if self._file_hash is not None: | ||
| return self._file_hash | ||
| if self.path.exists() and self.path.is_file(): | ||
| content_hash = get_blake_hash(self.path.read_bytes()) | ||
| content_bytes = self.path.read_bytes() | ||
| content_hash = compute_semantic_file_hash( | ||
| content_bytes, self.path, ext_category=self.ext_category | ||
| ) |
There was a problem hiding this comment.
DiscoveredFile.file_hash reads via self.path.exists()/read_bytes(), but path is often stored relative to the project root (see set_relative_path usage in from_path and absolute_path). If the current working directory isn’t the project root, this branch can incorrectly treat an existing file as missing and fall back to get_blake_hash(b""). Prefer using self.absolute_path for the existence/read checks here (and when passing the path into compute_semantic_file_hash) to ensure consistent behavior regardless of CWD.
Code Review - AST-based Semantic File HashingI've reviewed PR #259 and the implementation looks excellent! This is a well-designed feature that addresses a real problem with smart engineering. ✅ Strengths1. Sound Architecture
2. Robust Error Handling
3. Excellent Test Coverage
4. Performance Considerations
5. Documentation
🔍 Observations1. Test Failures Are Unrelated
These appear to be pre-existing flaky tests or issues unrelated to the semantic hashing changes. 2. Implementation Notes
3. Integration Points
All integration points properly use the new semantic hashing function. 💡 Minor Suggestions1. Consider Caching (Future Enhancement) 2. Metrics/Telemetry (Future Enhancement) 3. Documentation Enhancement 🎯 Impact AssessmentPositive Impacts:
Security: No security concerns identified. Performance: Expected improvement due to fewer unnecessary re-indexing operations. AST parsing adds minimal overhead and only runs for supported semantic files. Backward Compatibility: Fully compatible. Existing content hashes will differ from new semantic hashes, causing a one-time re-index, which is expected and acceptable. 📋 VerdictThis PR is well-implemented and ready to merge once the unrelated test failures are addressed. The semantic hashing implementation itself is solid. I recommend:
Great work on this feature! 🎉 |
Code Review - PR #259: Use AST-based hashing for semantic file change detectionI've thoroughly reviewed this PR and the implementation looks excellent! Here's my detailed analysis: ✅ OverviewThis PR successfully implements semantic-aware file hashing using AST (Abstract Syntax Tree) instead of raw file contents for better change detection. This is a significant improvement that will reduce unnecessary re-indexing when only formatting, comments, or whitespace changes occur. ✅ Implementation QualityStrengths:
✅ Code Changes Analysissrc/codeweaver/core/discovery.py:
src/codeweaver/engine/services/indexing_service.py:
|
Additional Code Review - PR #259I've conducted an additional review of the AST-based semantic file hashing implementation. Here are my findings: ✅ Code Quality & DesignExcellent Implementation:
🔍 Technical Observations1. Comment Detection Algorithm (discovery.py:66-67) if "comment" in kind.lower():
returnThis simple but effective approach filters out comment nodes. Works across all 25+ supported languages since tree-sitter parsers consistently name comment nodes. 2. AST Canonicalization (discovery.py:71, 97)
3. Integration Points
4. Test Coverage (test_semantic_hashing.py)
🎯 Impact AnalysisBenefits:
Migration Impact: Performance:
|
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
File hashing used raw content bytes, so comment edits, whitespace changes, and reformatting all triggered unnecessary re-indexing. This replaces raw hashing with AST-based hashing for the 25+ languages already supported by ast-grep.
How it works
The AST normalizes away whitespace (not represented in the tree) and we skip comment nodes during traversal. The canonical representation is the sequence of node kinds and leaf text values, which is then blake3-hashed as before.
Non-semantic files (markdown, plain text, etc.) and AST parse failures fall back to raw content hashing.
Changes
discovery.py: Addcompute_semantic_file_hash()with internal helpers_compute_ast_hash,_walk_ast_nodes,_get_semantic_language. UpdateDiscoveredFile.__init__,from_path, andfile_hashproperty to use it.indexing_service.py: Replaceget_blake_hash(content_bytes)withcompute_semantic_file_hash(content_bytes, path)in_process_discovery_batch.test_semantic_hashing.py: 23 unit tests covering comment invariance, formatting invariance, semantic change detection, language detection, fallback behavior, andDiscoveredFileintegration.Notes
ast_grep_py(tree-sitter).pyo3_runtime.PanicExceptionfrom unsupported languages doesn't inheritException, so the fallback catchesBaseException(re-raisingKeyboardInterrupt/SystemExit).Original prompt
⌨️ Start Copilot coding agent tasks without leaving your editor — available in VS Code, Visual Studio, JetBrains IDEs and Eclipse.