Use AST-based hashing for semantic file change detection#259

Open

Copilot wants to merge 4 commits into main from copilot/use-ast-hashing-for-file-changes

Conversation


Copilot AI commented Mar 19, 2026

File hashing used raw content bytes, so comment edits, whitespace changes, and reformatting all triggered unnecessary re-indexing. This replaces raw hashing with AST-based hashing for the 25+ languages already supported by ast-grep.

How it works

Whitespace is normalized away automatically, since it is not represented in the tree, and comment nodes are skipped during traversal. The canonical representation is the sequence of node kinds and leaf text values, which is then blake3-hashed as before.

# Same hash — only comments differ
compute_semantic_file_hash(b"# old\ndef f(): pass", Path("x.py"))
compute_semantic_file_hash(b"# new\ndef f(): pass", Path("x.py"))

# Same hash — only formatting differs
compute_semantic_file_hash(b"def add(a,b):\n    return a+b", Path("x.py"))
compute_semantic_file_hash(b"def add(a, b):\n    return a + b", Path("x.py"))

# Different hash — operator changed
compute_semantic_file_hash(b"return a + b", Path("x.py"))
compute_semantic_file_hash(b"return a * b", Path("x.py"))

Non-semantic files (markdown, plain text, etc.) and AST parse failures fall back to raw content hashing.
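
This flow can be sketched with Python's stdlib ast module as a single-language stand-in for ast-grep (sha256 stands in for blake3; the helper names below are illustrative, not the PR's actual functions):

```python
import ast
import hashlib
from pathlib import Path


def _walk(node: ast.AST, parts: list[str]) -> None:
    # Record every node type; whitespace and comments never appear in
    # Python's AST, so they cannot influence the hash.
    parts.append(type(node).__name__)
    # Keep leaf values (names, constants) so renames and literal edits
    # still change the hash.
    if isinstance(node, ast.Name):
        parts.append(node.id)
    elif isinstance(node, ast.Constant):
        parts.append(repr(node.value))
    for child in ast.iter_child_nodes(node):
        _walk(child, parts)


def semantic_hash(content: bytes, path: Path) -> str:
    """AST-based hash with raw-content fallback (sha256 stands in for blake3)."""
    # path is unused in this sketch; the real helper uses it for
    # language detection across the 25+ ast-grep languages.
    try:
        tree = ast.parse(content.decode("utf-8", errors="replace"))
    except SyntaxError:
        return hashlib.sha256(content).hexdigest()  # fallback: raw bytes
    parts: list[str] = []
    _walk(tree, parts)
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()
```

With this sketch, the three example pairs above behave as described: comment-only and formatting-only edits hash identically, while an operator change does not.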

Changes

  • discovery.py: Add compute_semantic_file_hash() with internal helpers _compute_ast_hash, _walk_ast_nodes, _get_semantic_language. Update DiscoveredFile.__init__, from_path, and file_hash property to use it.
  • indexing_service.py: Replace get_blake_hash(content_bytes) with compute_semantic_file_hash(content_bytes, path) in _process_discovery_batch.
  • test_semantic_hashing.py: 23 unit tests covering comment invariance, formatting invariance, semantic change detection, language detection, fallback behavior, and DiscoveredFile integration.

Notes

  • No new dependencies — uses existing ast_grep_py (tree-sitter).
  • pyo3_runtime.PanicException from unsupported languages doesn't inherit Exception, so the fallback catches BaseException (re-raising KeyboardInterrupt/SystemExit).
  • Docstring changes do produce different hashes since docstrings are AST expression nodes, not comments.
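
The PanicException handling described above follows this pattern, shown here as a minimal self-contained sketch (the wrapper name and callable argument are illustrative):

```python
def run_ast_parse_safely(parse):
    """Run a parser callable, returning None instead of raising on failure."""
    try:
        return parse()
    except (KeyboardInterrupt, SystemExit):
        # User/interpreter shutdown must propagate.
        raise
    except BaseException:
        # pyo3_runtime.PanicException inherits from BaseException, not
        # Exception, so a plain `except Exception` would miss it.
        return None
```
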
Original prompt

This section details the original issue you should resolve

<issue_title>Use AST-based hashing for semantic file change detection</issue_title>
<issue_description>## Description

Implement semantic-aware file hashing using AST (Abstract Syntax Tree) instead of raw file contents for better change detection.

Current Approach

  • Files hashed based on raw content
  • Formatting changes trigger unnecessary re-indexing
  • Comments and whitespace changes cause false positives

Proposed Approach

For Semantic Files (discovery.py:165, 182)

Use ast() to compute hash for semantic files like Python, JavaScript, etc.

Benefits:

  • Detect actual semantic changes
  • Ignore formatting, comments, whitespace
  • More intelligent change detection
  • Reduce unnecessary re-indexing

Implementation:

# Hash AST structure instead of raw bytes
ast_hash = hash_ast(parse_file(path))

Considerations

  • Language-specific AST parsers (tree-sitter already available)
  • Fallback to content hash for non-semantic files
  • Performance impact of parsing
  • Cache parsed ASTs when possible

Impact

  • Fewer unnecessary re-indexes
  • Better performance
  • Smarter change detection
  • More efficient resource usage

Source

  • Files: discovery.py:165, discovery.py:182
  • Branch: 003-our-aim-to</issue_description>

<agent_instructions>@copilot we need a new assessment on this issue based on the codebase's current state following major refactoring

If you assess the path to implementation is straight forward, then go ahead, complete it and submit a PR. Otherwise, submit a plan.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@bashandbone @copilot we need a new assessment on this issue based on the codebase's current state following major refactoring


Use ast-grep to parse supported language files (25+ languages) and
compute hashes from the canonical AST representation instead of raw
file bytes. This ignores comments, whitespace, and formatting changes
so that only genuine semantic modifications trigger re-indexing.

Changes:
- Add compute_semantic_file_hash() and helpers to discovery.py
- Update DiscoveredFile.__init__, from_path, and file_hash property
- Update indexing_service._process_discovery_batch to use semantic hashing
- Add comprehensive unit tests (23 tests) for semantic hashing

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement AST-based hashing for semantic file change detection" to "Use AST-based hashing for semantic file change detection" on Mar 19, 2026
Copilot AI requested a review from bashandbone March 19, 2026 19:50
@bashandbone bashandbone marked this pull request as ready for review March 19, 2026 22:43
Copilot AI review requested due to automatic review settings March 19, 2026 22:43
Copilot AI left a comment

Pull request overview

Introduces semantic-aware file hashing to reduce unnecessary re-indexing by hashing an AST-derived canonical form (ignoring comments/formatting) for supported languages, with a fallback to raw content hashing when AST parsing isn’t available.

Changes:

  • Add compute_semantic_file_hash() and AST-walking helpers in core/discovery.py, and wire it into DiscoveredFile hashing.
  • Update indexing discovery batching to use semantic hashing when deciding whether a file needs reindexing.
  • Add unit tests covering comment/whitespace invariance, semantic-change detection, language detection, fallback behavior, and DiscoveredFile integration.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/codeweaver/core/discovery.py Adds AST-based hashing and updates DiscoveredFile hashing behavior.
src/codeweaver/engine/services/indexing_service.py Uses semantic hashing when determining whether files have changed.
tests/unit/core/test_semantic_hashing.py Adds unit tests validating semantic hashing behavior and integration.


    return get_blake_hash(canonical)
except (KeyboardInterrupt, SystemExit):
    raise
except BaseException:
Copilot AI Mar 19, 2026

_compute_ast_hash catches BaseException, which will also swallow MemoryError/GeneratorExit and other non-recoverable errors (falling back to a content hash instead of surfacing the failure). If the goal is specifically to handle pyo3_runtime.PanicException from ast-grep, consider catching that type (or Exception plus that type) and/or explicitly re-raising MemoryError/GeneratorExit similar to how KeyboardInterrupt/SystemExit are handled.

Suggested change:
- except BaseException:
+ except Exception:

Comment on lines +337 to +344
"""Return the blake3 hash of the file, using AST-based hashing when supported."""
if self._file_hash is not None:
return self._file_hash
if self.path.exists() and self.path.is_file():
content_hash = get_blake_hash(self.path.read_bytes())
content_bytes = self.path.read_bytes()
content_hash = compute_semantic_file_hash(
content_bytes, self.path, ext_category=self.ext_category
)
Copilot AI Mar 19, 2026

DiscoveredFile.file_hash reads via self.path.exists()/read_bytes(), but path is often stored relative to the project root (see set_relative_path usage in from_path and absolute_path). If the current working directory isn’t the project root, this branch can incorrectly treat an existing file as missing and fall back to get_blake_hash(b""). Prefer using self.absolute_path for the existence/read checks here (and when passing the path into compute_semantic_file_hash) to ensure consistent behavior regardless of CWD.

@github-actions

Code Review - AST-based Semantic File Hashing

I've reviewed PR #259 and the implementation looks excellent! This is a well-designed feature that addresses a real problem with smart engineering.

✅ Strengths

1. Sound Architecture

  • Clean separation of concerns with _walk_ast_nodes, _compute_ast_hash, _get_semantic_language, and compute_semantic_file_hash functions
  • Proper fallback strategy: AST parsing → content hash (graceful degradation)
  • Uses existing ast_grep_py dependency (no new dependencies)

2. Robust Error Handling

  • Catches BaseException to handle pyo3_runtime.PanicException (which doesn't inherit from Exception)
  • Re-raises KeyboardInterrupt and SystemExit appropriately
  • Debug logging with exc_info=True for troubleshooting (discovery.py:102, 142)
  • Multiple fallback paths ensure the system remains functional

3. Excellent Test Coverage

  • 23 comprehensive unit tests covering:
    • Comment invariance (hash stays same when only comments change)
    • Formatting invariance (whitespace/spacing changes don't affect hash)
    • Semantic change detection (logic changes produce different hashes)
    • Language detection
    • Fallback behavior
    • DiscoveredFile integration
  • Tests cover both Python and JavaScript
  • Tests verify docstrings ARE considered semantic (correct behavior)

4. Performance Considerations

  • AST parsing happens only for supported semantic files
  • Efficient tree traversal that skips comment nodes
  • Blake3 hashing remains fast

5. Documentation

  • Clear docstrings explaining behavior
  • Updated field descriptions (discovery.py:174)
  • Good code examples in PR description

🔍 Observations

1. Test Failures Are Unrelated
All semantic hashing tests (23/23) are passing in CI. The failing tests are in:

  • tests/e2e/test_user_journeys.py - E2E user journey tests
  • tests/integration/real/test_full_pipeline.py - Integration tests
  • tests/integration/real/test_search_behavior.py - Search behavior tests
  • tests/integration/workflows/test_search_workflows.py - Workflow tests

These appear to be pre-existing flaky tests or issues unrelated to the semantic hashing changes.

2. Implementation Notes

  • The AST hash includes both node kinds and leaf text: f"{kind}:{node.text()}" (discovery.py:71)
  • This ensures semantic content is captured while formatting is normalized
  • Proper handling of empty files (returns None for empty AST)

3. Integration Points

  • DiscoveredFile.__init__ (discovery.py:220)
  • DiscoveredFile.from_path (discovery.py:273)
  • DiscoveredFile.file_hash property (discovery.py:341)
  • IndexingService._process_discovery_batch (indexing_service.py:316)

All integration points properly use the new semantic hashing function.

💡 Minor Suggestions

1. Consider Caching (Future Enhancement)
For large files that are parsed frequently, consider caching the AST or hash. This could be a future optimization if profiling shows it's needed.
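
One way such a cache could look, keyed on content bytes plus file suffix (hypothetical name; the sha256 body is a placeholder for the real AST-hash computation):

```python
import hashlib
from functools import lru_cache


@lru_cache(maxsize=4096)
def cached_semantic_hash(content: bytes, suffix: str) -> str:
    # Placeholder body: a real version would parse the AST and hash the
    # canonical form here; identical (content, suffix) pairs skip that work.
    return hashlib.sha256(content + suffix.encode()).hexdigest()
```

Because the key is the content itself, repeated hashing of unchanged files during discovery would hit the cache rather than re-parse.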

2. Metrics/Telemetry (Future Enhancement)
Track AST parsing failures vs. successes to monitor which languages/files are falling back to content hashing. This could help identify parser issues or unsupported edge cases.

3. Documentation Enhancement
Consider adding a comment explaining why BaseException is needed for pyo3_runtime.PanicException (discovery.py:101). The PR description mentions this, but it would be helpful in the code itself.

🎯 Impact Assessment

Positive Impacts:

  • ✅ Significantly reduces unnecessary re-indexing from formatting changes
  • ✅ Better developer experience (linters, formatters won't trigger re-index)
  • ✅ More efficient resource usage
  • ✅ Supports 25+ languages via ast-grep

Security: No security concerns identified.

Performance: Expected improvement due to fewer unnecessary re-indexing operations. AST parsing adds minimal overhead and only runs for supported semantic files.

Backward Compatibility: Fully compatible. Existing content hashes will differ from new semantic hashes, causing a one-time re-index, which is expected and acceptable.

📋 Verdict

This PR is well-implemented and ready to merge once the unrelated test failures are addressed.

The semantic hashing implementation itself is solid. I recommend:

  1. Investigating the failing E2E and integration tests (they appear unrelated to this PR)
  2. If those tests are flaky or pre-existing issues, consider merging this PR and addressing them separately
  3. Consider adding the minor documentation enhancement mentioned above

Great work on this feature! 🎉

@github-actions

Code Review - PR #259: Use AST-based hashing for semantic file change detection

I've thoroughly reviewed this PR and the implementation looks excellent! Here's my detailed analysis:

✅ Overview

This PR successfully implements semantic-aware file hashing using AST (Abstract Syntax Tree) instead of raw file contents for better change detection. This is a significant improvement that will reduce unnecessary re-indexing when only formatting, comments, or whitespace changes occur.

✅ Implementation Quality

Strengths:

  1. Well-designed architecture: The implementation follows a clean separation of concerns with three helper functions (_walk_ast_nodes, _compute_ast_hash, _get_semantic_language) and one public API (compute_semantic_file_hash).

  2. Robust error handling:

    • Catches BaseException to handle pyo3_runtime.PanicException which doesn't inherit from Exception
    • Properly re-raises KeyboardInterrupt and SystemExit
    • Graceful fallback to content hashing on any failure
    • Good use of debug logging with exc_info=True for troubleshooting
  3. Comprehensive testing: 23 new unit tests cover:

    • Comment invariance ✓
    • Formatting/whitespace invariance ✓
    • Semantic change detection ✓
    • Language detection ✓
    • Fallback behavior ✓
    • DiscoveredFile integration ✓
    • All tests are passing
  4. Leverages existing dependencies: Uses already-included ast-grep-py (tree-sitter) without adding new dependencies.

  5. Documentation: Clear docstrings explaining behavior, especially the important note about docstrings being semantic content (not comments).

✅ Code Changes Analysis

src/codeweaver/core/discovery.py:

  • _walk_ast_nodes(): Recursively traverses AST, skipping comment nodes by checking if "comment" is in the kind name (case-insensitive). Builds a canonical representation of node kinds and leaf text.
  • _compute_ast_hash(): Wraps AST parsing with proper error handling, returns None on failure.
  • _get_semantic_language(): Determines if a file has AST support via extension or ExtCategory.
  • compute_semantic_file_hash(): Public API that attempts AST hashing, falls back to content hashing.
  • Integration points: Updated DiscoveredFile.__init__, from_path, and file_hash property to use the new function.

src/codeweaver/engine/services/indexing_service.py:

  • Line 316: Changed from get_blake_hash(content_bytes) to compute_semantic_file_hash(content_bytes, path) in _process_discovery_batch.

⚠️ Test Failures (Unrelated to this PR)

The CI shows 23 failing tests, but these are NOT caused by this PR:

  • Most failures are from MCP spec validation errors: ValueError: Output schemas must represent object types due to MCP spec limitations
  • Some e2e test failures with KeyError: 'max_file_size'
  • One FileNotFoundError in vector storage
  • All new semantic hashing tests (23/23) are passing

These appear to be pre-existing issues in the test suite, likely from recent refactoring mentioned in the PR description.

💡 Technical Considerations

Good decisions:

  1. Comment detection (line 67): Using "comment" in kind.lower() is simple and effective for catching comment node types across different languages (e.g., "comment", "line_comment", "block_comment").

  2. Docstrings are semantic: The PR correctly notes that docstrings produce different hashes since they're AST expression nodes, not comments. This is the right choice as docstrings are part of the API contract.

  3. UTF-8 decoding with error handling (line 135): Uses errors="replace" which prevents crashes on invalid UTF-8 while still allowing AST hashing to proceed.

  4. Walrus operator usage: Clean use of assignment expressions (:=) for both language and hash checks.
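
The shape of that happy path can be sketched with stub callables (hypothetical names, not the PR's actual signature):

```python
def compute_hash_with_fallback(content, path, get_lang, ast_hash, raw_hash):
    # Walrus assignments keep the happy path to two lines: fall back to
    # raw hashing when language detection or AST hashing yields nothing.
    if (language := get_lang(path)) and (digest := ast_hash(content, language)):
        return digest
    return raw_hash(content)
```
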

🎯 Security & Performance

Security:

  • ✅ No security concerns
  • ✅ Proper exception handling prevents crashes
  • ✅ No exposure of sensitive information

Performance:

  • ✅ AST parsing is only done for supported semantic languages
  • ✅ Graceful fallback ensures no performance regression for non-semantic files
  • ✅ Reduces unnecessary re-indexing, improving overall system performance
  • ✅ Already uses blake3 which is very fast

📊 Impact Assessment

Benefits:

  • Fewer unnecessary re-indexes when developers format code
  • More intelligent change detection
  • Better resource usage
  • Improves developer experience (auto-formatters won't trigger full re-indexing)

Risks:

  • ⚠️ Very low - comprehensive fallback strategy ensures robustness
  • All changes are backward compatible

✅ Recommendation

APPROVE - This PR is ready to merge.

The implementation is well-designed, thoroughly tested, and delivers significant value. The failing tests are unrelated to this PR and should be addressed separately.

Suggestions for future consideration:

  1. Consider caching parsed ASTs if files are hashed multiple times in quick succession
  2. Monitor AST parsing performance in production to validate the benefits

Test Results: All 23 new semantic hashing tests passing ✓
CI Failures: Unrelated pre-existing issues (23 failures in MCP/e2e tests)
Code Quality: Excellent - clean architecture, robust error handling, comprehensive tests
Documentation: Clear and complete

Great work! This is a valuable improvement to the codebase. 🎉

@github-actions

Additional Code Review - PR #259

I've conducted an additional review of the AST-based semantic file hashing implementation. Here are my findings:

✅ Code Quality & Design

Excellent Implementation:

  • Clean abstraction layers: The code properly separates concerns with _walk_ast_nodes, _compute_ast_hash, _get_semantic_language, and compute_semantic_file_hash (discovery.py:59-144)
  • Defensive programming: Proper exception handling with BaseException catch to handle pyo3_runtime.PanicException from unsupported languages
  • Graceful degradation: Falls back to content hashing when AST parsing fails or language is unsupported
  • No new dependencies: Leverages existing ast_grep_py (tree-sitter) dependency

🔍 Technical Observations

1. Comment Detection Algorithm (discovery.py:66-67)

if "comment" in kind.lower():
    return

This simple but effective approach filters out comment nodes, and works across the 25+ supported languages because tree-sitter grammars consistently include "comment" in their comment node kind names.
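
A minimal sketch of that skip logic over a toy dict-based tree (the real code walks ast-grep node objects; the dict shape here is illustrative):

```python
def walk_skip_comments(node: dict, parts: list[str]) -> None:
    # tree-sitter grammars use kind names such as "comment",
    # "line_comment", "block_comment"; the substring check covers them all.
    kind = node["kind"]
    if "comment" in kind.lower():
        return
    children = node.get("children", [])
    if children:
        parts.append(kind)
        for child in children:
            walk_skip_comments(child, parts)
    else:
        # Leaf node: keep kind and text so identifiers and literals
        # contribute to the canonical form.
        parts.append(f"{kind}:{node['text']}")
```
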

2. AST Canonicalization (discovery.py:71, 97)
The canonical representation combines node kinds and leaf text values, ensuring:

  • Whitespace normalization (not represented in AST)
  • Formatting independence
  • Semantic content preservation

3. Integration Points
All file hashing locations properly updated:

  • DiscoveredFile.__init__ (discovery.py:220)
  • DiscoveredFile.from_path (discovery.py:273)
  • DiscoveredFile.file_hash property (discovery.py:341)
  • IndexingService._process_discovery_batch (indexing_service.py:316)

4. Test Coverage (test_semantic_hashing.py)
Comprehensive 285-line test suite with 23 tests covering:

  • Comment invariance (Python & JavaScript)
  • Whitespace/formatting invariance
  • Semantic change detection
  • Docstring handling (correctly treated as semantic)
  • Language detection
  • Fallback behavior
  • DiscoveredFile integration

🎯 Impact Analysis

Benefits:

  • Reduced re-indexing: Formatting changes (prettier, black, eslint) won't trigger re-index
  • Better CI/CD: Automated formatters won't cause false positives
  • Resource efficiency: Fewer unnecessary embedding computations
  • 25+ language support: Python, JavaScript, TypeScript, Go, Rust, Java, C++, etc.

Migration Impact:
Existing file hashes will differ from new semantic hashes, triggering a one-time re-index of all semantic files. This is expected and acceptable behavior.

Performance:

  • AST parsing adds minimal overhead (only for semantic files)
  • Tree-sitter parsers are highly optimized
  • Blake3 hashing remains fast
  • Net performance gain from avoiding unnecessary re-indexing

⚠️ Test Failures (Unrelated to This PR)

The failing tests are NOT related to semantic hashing:

  • test_full_pipeline_index_then_search: MCP schema validation error
  • E2E user journey tests: Config/CLI issues
  • Integration tests: Pre-existing flakiness

All 23 semantic hashing tests pass successfully.

💡 Minor Recommendations

1. Documentation Enhancement
Consider adding inline comment explaining BaseException usage:

except (KeyboardInterrupt, SystemExit):
    raise
except BaseException:  # Catches pyo3_runtime.PanicException which doesn't inherit from Exception
    logger.debug("AST parsing failed for language %s", language_name, exc_info=True)

2. Future Optimization (Optional)
If profiling shows AST parsing as a bottleneck, consider:

  • Caching parsed ASTs for large frequently-accessed files
  • Parallel AST parsing for batch operations

3. Monitoring (Optional)
Track AST parsing success/failure rates to identify:

  • Languages with parser issues
  • Edge cases requiring fallback

✅ Security & Safety

  • No security concerns: Read-only operations, no external inputs
  • Safe error handling: Properly re-raises KeyboardInterrupt and SystemExit
  • Debug logging: Includes exc_info=True for troubleshooting (discovery.py:102, 142)

📊 Code Style & Best Practices

  • Follows project conventions
  • Type hints properly used
  • Docstrings clear and comprehensive
  • Error messages informative
  • Export added to __all__ (discovery.py:425)

🎉 Recommendation

APPROVE with minor suggestion

This is a well-architected, thoroughly tested feature that solves a real problem. The implementation is production-ready. The test failures are unrelated and should be addressed separately.

Suggested action: Merge this PR once the team confirms the failing tests are pre-existing issues.

Excellent work! 🚀

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
