Add Neo4j indexing and performance documentation #805

Open
Khushi281300 wants to merge 1 commit into OWASP:main from Khushi281300:fix-neo4j-indexing-final-622

Summary

This PR addresses Issue #622 by introducing systematic indexing for critical Neo4j node properties and providing guidance for performance optimization. These improvements target severe Stop-The-World (STW) pauses observed in production-scale datasets (~9GB).

Technical Evidence

Performance Bottleneck

Inspection of the Neo4j debug.log revealed major STW pauses:

{"time":"2026-03-13 09:51:29.955+0000","level":"WARN","category":"o.n.k.i.c.VmPauseMonitorComponent","message":"Detected VM stop-the-world pause: {pauseTime=11548, gcTime=11703, gcCount=1}"}

Analysis: The logged pause of 11,548 ms (about 11.5 s), with 11,703 ms attributed to a single garbage collection, points to heavy JVM heap pressure. The likely sources are large result sets, full label scans, and unindexed graph traversals.
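To track these pauses over time, the figures can be pulled out of debug.log programmatically. The following is a minimal sketch that assumes every pause entry is a JSON line shaped like the one quoted above; the function name `parse_vm_pause` is hypothetical and not part of this PR.

```python
import json
import re

# Matches the payload of the VmPauseMonitorComponent warning quoted above.
PAUSE_RE = re.compile(r"pauseTime=(\d+), gcTime=(\d+), gcCount=(\d+)")


def parse_vm_pause(line):
    """Return (timestamp, pause_ms, gc_ms, gc_count), or None for other lines."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # not a JSON-formatted log line
    if not isinstance(record, dict):
        return None
    match = PAUSE_RE.search(record.get("message", ""))
    if match is None:
        return None  # JSON line, but not a stop-the-world warning
    pause_ms, gc_ms, gc_count = (int(group) for group in match.groups())
    return record.get("time"), pause_ms, gc_ms, gc_count


# The log line from this PR description, used as sample input.
SAMPLE = (
    '{"time":"2026-03-13 09:51:29.955+0000","level":"WARN",'
    '"category":"o.n.k.i.c.VmPauseMonitorComponent",'
    '"message":"Detected VM stop-the-world pause: '
    '{pauseTime=11548, gcTime=11703, gcCount=1}"}'
)

if __name__ == "__main__":
    print(parse_vm_pause(SAMPLE))
```

Running the parser over a log captured before and after the index change would give a direct comparison of pause counts and durations.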

Before vs After Optimization

Entry Point Lookup

Query example:

MATCH (n:NeoStandard {name: "StandardName"}) RETURN n

Before: Neo4j executes NodeByLabelScan, scanning every :NeoStandard node and creating heavy memory and I/O pressure on large datasets.
After: With an index on name, Neo4j performs NodeIndexSeek, improving lookup cost from O(N) to O(log N) in the number of :NeoStandard nodes and loading only the matching node records.
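Whether a given query actually switched from NodeByLabelScan to NodeIndexSeek can be checked from its PROFILE output. The sketch below assumes the plan is available as a nested dict with "operatorType" and "children" keys, which is roughly how the Neo4j Python driver exposes a result summary's profile; treat that shape as an assumption and adjust to what your driver version returns. The two sample plans are hand-written for illustration, not captured output, and the helper names are hypothetical.

```python
def operator_types(plan):
    """Yield every operatorType found in a nested plan dict, depth first."""
    yield plan.get("operatorType", "")
    for child in plan.get("children", []):
        yield from operator_types(child)


def uses_index_seek(plan):
    """True if the plan contains an index seek and no full label scan."""
    ops = list(operator_types(plan))
    has_seek = any(op.startswith("NodeIndexSeek") for op in ops)
    has_scan = any(op.startswith("NodeByLabelScan") for op in ops)
    return has_seek and not has_scan


# Hand-written plans mirroring the "Before" and "After" cases described above.
BEFORE_PLAN = {
    "operatorType": "ProduceResults",
    "children": [
        {
            "operatorType": "Filter",
            "children": [{"operatorType": "NodeByLabelScan", "children": []}],
        }
    ],
}
AFTER_PLAN = {
    "operatorType": "ProduceResults",
    "children": [{"operatorType": "NodeIndexSeek", "children": []}],
}

if __name__ == "__main__":
    print("before:", uses_index_seek(BEFORE_PLAN))
    print("after:", uses_index_seek(AFTER_PLAN))
```

The check uses `startswith` because some driver versions may append a suffix to operator names; that too is an assumption to verify against real output.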

Path Traversal

Query example:

MATCH p = (BaseStandard)-[*..20]-(CompareStandard)

Problem: Wildcard traversal expands relationships across many nodes, leading to combinatorial explosion in large graphs.
Solution: Added performance warnings and implemented a Tiered Pruning Strategy so traversals begin from indexed entry points.
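The idea of tiering a traversal can be illustrated outside the database. The sketch below is not the implementation in this PR: it is a generic, in-memory example of trying small hop limits first and only widening the search when nothing is found, so the expensive 20-hop expansion runs as a last resort. The graph, the tier values (2, 4, 20), and the function names are all hypothetical.

```python
from collections import deque


def shortest_hops(graph, start, goal, max_hops):
    """Breadth-first search that stops expanding once max_hops is reached."""
    if start == goal:
        return 0
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # at the cap: do not expand this node further
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return hops + 1
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return None


def tiered_search(graph, start, goal, tiers=(2, 4, 20)):
    """Try each hop cap in turn; return (cap_used, hops) or None."""
    for cap in tiers:
        hops = shortest_hops(graph, start, goal, cap)
        if hops is not None:
            return cap, hops
    return None


# Toy undirected chain A - B - C - D, stored as adjacency lists.
GRAPH = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}

if __name__ == "__main__":
    print(tiered_search(GRAPH, "A", "D"))
```

In Cypher the same effect would come from running the bounded pattern with a small upper hop limit before falling back to `[*..20]`, with both endpoints resolved through indexed lookups first.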

Proposed Changes

Database Indexing (db.py)

Added index=True to key properties:

  • NeoDocument.name – Core lookup for standards and CREs
  • NeoCRE.external_id – Primary identifier for CRE mapping
  • NeoStandard.section, section_id, subsection – Granular filtering for standards
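For reviewers who want to inspect or create the same indexes by hand, the list above can be written as Cypher `CREATE INDEX` statements. The sketch below only generates the statement text. It assumes each model class name is also the node label the index should target, and the index names (`idx_...`) are hypothetical; the PR itself relies on `index=True` in db.py rather than on hand-written DDL.

```python
# Properties marked index=True in this PR, keyed by model class.
# Assumption: each class name is also the Neo4j label to index.
INDEXED_PROPERTIES = {
    "NeoDocument": ["name"],
    "NeoCRE": ["external_id"],
    "NeoStandard": ["section", "section_id", "subsection"],
}


def index_statements(indexed):
    """Yield one single-property CREATE INDEX statement per (label, property)."""
    for label, props in indexed.items():
        for prop in props:
            index_name = f"idx_{label}_{prop}".lower()  # hypothetical naming scheme
            yield (
                f"CREATE INDEX {index_name} IF NOT EXISTS "
                f"FOR (n:{label}) ON (n.{prop})"
            )


if __name__ == "__main__":
    for statement in index_statements(INDEXED_PROPERTIES):
        print(statement)
```

Comparing this list against the output of `SHOW INDEXES` on a populated database is one way to confirm that the indexes declared in the models were actually installed.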

Performance Documentation (neo4j-indexing.md)

Created a guide covering recommended indexes, use of PROFILE to analyze execution plans, and benchmarking strategies for large graph datasets.

Query Annotations

Added performance notes and profiling instructions in db.py (Gap Analysis queries) and prompt_client.py (AI Mapping Pipeline).

Verification

With the indexes in place, entry point lookups are planned as NodeIndexSeek instead of NodeByLabelScan, reducing lookup cost from O(N) to O(log N). Syntax was validated with python -m py_compile. The changes are expected to reduce the memory pressure behind the previously observed 11 s STW pauses.

Related Issue

Fixes #622
