Add Neo4j indexing and performance documentation #805
Open
Khushi281300 wants to merge 1 commit into OWASP:main
Summary
This PR addresses Issue #622 by introducing systematic indexing for critical Neo4j node properties and providing guidance for performance optimization. These improvements target severe Stop-The-World (STW) pauses observed in production-scale datasets (~9GB).
Technical Evidence
Performance Bottleneck
Inspection of the Neo4j debug.log revealed major STW pauses:
{"time":"2026-03-13 09:51:29.955+0000","level":"WARN","category":"o.n.k.i.c.VmPauseMonitorComponent","message":"Detected VM stop-the-world pause: {pauseTime=11548, gcTime=11703, gcCount=1}"}
Analysis: An 11.5s pause with gcCount=1 indicates a single long JVM garbage collection. The likely cause is heap pressure from large result sets, full label scans, and unindexed graph traversals.
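To check how often such pauses occur, the log can be scanned for the same warning. The following is a minimal sketch, assuming debug.log is written one JSON record per line in the shape of the entry quoted above, and that pauseTime and gcTime are in milliseconds. The function names and the threshold are illustrative and not part of this PR.

```python
import json
import re

# Matches the key=value pairs inside the VmPauseMonitorComponent message.
_PAUSE_FIELDS = re.compile(r"(pauseTime|gcTime|gcCount)=(\d+)")


def parse_vm_pause(line):
    """Return the pause fields of one JSON log line, or None if it is not a pause warning."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # not a JSON-formatted line
    if not isinstance(record, dict):
        return None
    message = record.get("message", "")
    if "stop-the-world pause" not in message:
        return None
    fields = {key: int(value) for key, value in _PAUSE_FIELDS.findall(message)}
    fields["time"] = record.get("time")
    return fields


def long_pauses(lines, threshold_ms=1000):
    """Collect pause records whose pauseTime meets the threshold (assumed to be milliseconds)."""
    pauses = []
    for line in lines:
        parsed = parse_vm_pause(line)
        if parsed is not None and parsed.get("pauseTime", 0) >= threshold_ms:
            pauses.append(parsed)
    return pauses
```

Passing an open file handle for debug.log as `lines` would return every pause at or above the threshold, which gives a simple before/after count once the indexes are in place.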
Before vs After Optimization
Entry Point Lookup
Query example:
Before: Neo4j executes NodeByLabelScan, scanning every :NeoStandard node and creating heavy memory and I/O pressure on large datasets.
After: With indexing enabled, Neo4j performs NodeIndexSeek, improving lookup complexity from O(N) to O(log N) and loading only relevant node records.
Path Traversal
Query example:
Problem: Wildcard traversal expands relationships across many nodes, leading to combinatorial explosion in large graphs.
Solution: Added performance warnings and implemented a Tiered Pruning Strategy so traversals begin from indexed entry points.
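Whether a query really starts from an indexed entry point can be checked from its PROFILE plan. The sketch below assumes the plan is available as nested dictionaries with operatorType and children keys, and that operator names may carry a runtime suffix such as "@neo4j". Both are assumptions about the plan shape, and the sample plans in the usage note are hand-written, not captured output. It is not the Tiered Pruning Strategy itself, only a way to confirm where a plan begins.

```python
def _operator_name(plan):
    # Strip an optional runtime suffix, e.g. "NodeIndexSeek@neo4j" -> "NodeIndexSeek".
    return plan.get("operatorType", "").split("@")[0]


def operator_names(plan):
    """Yield every operator name in the plan tree, depth-first."""
    yield _operator_name(plan)
    for child in plan.get("children") or []:
        yield from operator_names(child)


def leaf_operators(plan):
    """Return the names of the leaf operators, i.e. where execution starts reading data."""
    children = plan.get("children") or []
    if not children:
        return [_operator_name(plan)]
    leaves = []
    for child in children:
        leaves.extend(leaf_operators(child))
    return leaves


def uses_label_scan(plan):
    """True if any operator in the plan is a full label scan."""
    return "NodeByLabelScan" in set(operator_names(plan))
```

For a hand-written plan such as `{"operatorType": "ProduceResults@neo4j", "children": [{"operatorType": "NodeIndexSeek@neo4j", "children": []}]}`, `leaf_operators` returns `["NodeIndexSeek"]` and `uses_label_scan` returns `False`. With the official Python driver, the plan dictionary is assumed to come from the result summary of a query prefixed with PROFILE.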
Proposed Changes
Database Indexing (db.py)
Added index=True to key properties:
- NeoDocument.name – Core lookup for standards and CREs
- NeoCRE.external_id – Primary identifier for CRE mapping
- NeoStandard.section, section_id, subsection – Granular filtering for standards
Performance Documentation (neo4j-indexing.md)
Created a guide covering recommended indexes, use of PROFILE to analyze execution plans, and benchmarking strategies for large graph datasets.
Query Annotations
Added performance notes and profiling instructions in db.py (Gap Analysis queries) and prompt_client.py (AI Mapping Pipeline).
Verification
Optimized lookups improved from O(N) to O(log N) using NodeIndexSeek. Syntax validated with python -m py_compile. The changes are intended to reduce the memory pressure behind the previously observed 11s STW pauses.
Related Issue
Fixes #622
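For reviewers who want to compare the index=True additions listed under Database Indexing with what exists in a running database, the property list can be turned into equivalent Cypher index statements. This is a sketch under assumptions: the labels are taken from the class names in this PR, the idx_ naming convention is hypothetical, and the statement syntax targets Neo4j 4.x or later. It is not what the mapping layer emits when it installs indexes.

```python
# (label, property) pairs copied from the Database Indexing list in this PR.
INDEXED_PROPERTIES = [
    ("NeoDocument", "name"),
    ("NeoCRE", "external_id"),
    ("NeoStandard", "section"),
    ("NeoStandard", "section_id"),
    ("NeoStandard", "subsection"),
]


def create_index_statement(label, prop):
    """Build one idempotent CREATE INDEX statement for a label/property pair."""
    # The index name follows a hypothetical convention: idx_<label>_<property>.
    name = f"idx_{label.lower()}_{prop}"
    return f"CREATE INDEX {name} IF NOT EXISTS FOR (n:{label}) ON (n.{prop})"


def all_index_statements(pairs=INDEXED_PROPERTIES):
    """Return one statement per indexed property, in list order."""
    return [create_index_statement(label, prop) for label, prop in pairs]
```

Printing `all_index_statements()` gives five statements that can be compared against the output of SHOW INDEXES.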