Add Neo4j indexing and performance documentation by Khushi281300 · Pull Request #805 · OWASP/OpenCRE

Khushi281300 · 2026-03-14T19:21:13Z

Summary

This PR addresses Issue #622 by introducing systematic indexing for critical Neo4j node properties and providing guidance for performance optimization. These improvements target severe Stop-The-World (STW) pauses observed in production-scale datasets (~9GB).

Technical Evidence

Performance Bottleneck

Inspection of the Neo4j debug.log revealed major STW pauses:

{"time":"2026-03-13 09:51:29.955+0000","level":"WARN","category":"o.n.k.i.c.VmPauseMonitorComponent","message":"Detected VM stop-the-world pause: {pauseTime=11548, gcTime=11703, gcCount=1}"}

Analysis: An 11.5s pause points to heavy JVM garbage collection. The likely causes are large result sets held in memory, full label scans, and unindexed graph traversals.
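To check how often such pauses occur, the JSON-formatted debug.log can be scanned for the same warning. The following is a minimal sketch, assuming each log line is a standalone JSON object shaped like the entry above and that pauseTime is reported in milliseconds. The function name and threshold are illustrative and not part of this PR.

```python
import json
import re

# Matches the pauseTime field inside the warning message text.
PAUSE_RE = re.compile(r"pauseTime=(\d+)")

# The log entry quoted in this PR description.
SAMPLE_LINE = (
    '{"time":"2026-03-13 09:51:29.955+0000","level":"WARN",'
    '"category":"o.n.k.i.c.VmPauseMonitorComponent",'
    '"message":"Detected VM stop-the-world pause: '
    '{pauseTime=11548, gcTime=11703, gcCount=1}"}'
)


def find_vm_pauses(lines, threshold_ms=1000):
    """Return (timestamp, pause_ms) for each pause warning >= threshold_ms."""
    pauses = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (assumed to be noise)
        if not isinstance(record, dict):
            continue
        match = PAUSE_RE.search(str(record.get("message", "")))
        if match and int(match.group(1)) >= threshold_ms:
            pauses.append((record.get("time"), int(match.group(1))))
    return pauses


if __name__ == "__main__":
    for timestamp, pause_ms in find_vm_pauses([SAMPLE_LINE]):
        print(f"{timestamp}: {pause_ms / 1000:.1f}s pause")
```

In practice the lines would come from iterating over the opened debug.log file rather than from a hard-coded sample.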

Before vs After Optimization

Entry Point Lookup

Query example:

MATCH (n:NeoStandard {name: "StandardName"})

Before: Neo4j executes NodeByLabelScan, scanning every :NeoStandard node and creating heavy memory and I/O pressure on large datasets.
After: With indexing enabled, Neo4j performs NodeIndexSeek, improving lookup complexity from O(N) to O(log N) and loading only relevant node records.
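The difference between the two plans can be illustrated with a rough analogy in plain Python. This sketch does not reproduce Neo4j's index implementation. It only contrasts a linear scan over all names with a binary search over a sorted list, counting the steps each takes. The data set and function names are made up for illustration.

```python
def scan_lookup(names, target):
    """Linear scan, analogous to a label scan: returns (found, comparisons)."""
    comparisons = 0
    for name in names:
        comparisons += 1
        if name == target:
            return True, comparisons
    return False, comparisons


def index_lookup(sorted_names, target):
    """Binary search over sorted names: returns (found, steps)."""
    lo, hi, steps = 0, len(sorted_names), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_names[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_names) and sorted_names[lo] == target
    return found, steps


if __name__ == "__main__":
    # Zero-padded names keep lexicographic order equal to numeric order.
    names = [f"Standard-{i:06d}" for i in range(100_000)]
    target = names[-1]  # worst case for the linear scan
    print("scan :", scan_lookup(names, target))
    print("index:", index_lookup(names, target))
```

The scan touches every entry in the worst case, while the sorted lookup needs a number of steps proportional to the logarithm of the collection size.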

Path Traversal

Query example:

MATCH p = (BaseStandard)-[*..20]-(CompareStandard)

Problem: Wildcard traversal expands relationships across many nodes, leading to combinatorial explosion in large graphs.
Solution: Added performance warnings and implemented a Tiered Pruning Strategy so traversals begin from indexed entry points.
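One way to keep a traversal anchored is to resolve both endpoints by an indexed property before expanding the path. The sketch below builds such a query string. It is not the Tiered Pruning Strategy itself, which this description does not spell out. The function name, the parameter names ($base_name, $compare_name), and the use of the :NeoStandard label on both ends are assumptions. The depth bound is validated and interpolated because, as far as I know, Cypher does not accept a parameter in that position.

```python
def build_path_query(max_depth=20):
    """Build a Cypher query whose endpoints are resolved by name first.

    The endpoint lookups can use an index on `name`; only the
    variable-length expansion between them remains unindexed.
    """
    if isinstance(max_depth, bool) or not isinstance(max_depth, int) or max_depth < 1:
        raise ValueError("max_depth must be a positive integer")
    return (
        "MATCH (base:NeoStandard {name: $base_name}), "
        "(compare:NeoStandard {name: $compare_name}) "
        f"MATCH p = (base)-[*..{max_depth}]-(compare) "
        "RETURN p"
    )


if __name__ == "__main__":
    # A smaller bound limits how far the expansion can fan out.
    print(build_path_query(max_depth=5))
```

Lowering max_depth is the simplest lever against combinatorial growth. The names are still passed as query parameters, so the query text stays the same across lookups.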

Proposed Changes

Database Indexing (`db.py`)

Added index=True to key properties:

NeoDocument.name – Core lookup for standards and CREs
NeoCRE.external_id – Primary identifier for CRE mapping
NeoStandard.section, section_id, subsection – Granular filtering for standards
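For reference, the indexes implied by the list above can also be expressed as Cypher statements. The sketch below generates them from the label and property pairs named in this PR. The index names are hypothetical, and the CREATE INDEX ... IF NOT EXISTS form assumes a reasonably recent Neo4j version. In the PR itself the indexes are declared through index=True on the model properties, not through hand-written statements.

```python
# Label/property pairs taken from the list above.
INDEXED_PROPERTIES = [
    ("NeoDocument", "name"),
    ("NeoCRE", "external_id"),
    ("NeoStandard", "section"),
    ("NeoStandard", "section_id"),
    ("NeoStandard", "subsection"),
]


def index_statements(pairs=INDEXED_PROPERTIES):
    """Return one CREATE INDEX statement per (label, property) pair."""
    return [
        # Index names (label_property, lowercased) are illustrative only.
        f"CREATE INDEX {label.lower()}_{prop} IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{prop})"
        for label, prop in pairs
    ]


if __name__ == "__main__":
    for statement in index_statements():
        print(statement)
```

Running the generated statements by hand can be useful for checking an existing database, where model changes alone may not have created the indexes yet.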

Performance Documentation (`neo4j-indexing.md`)

Created a guide covering recommended indexes, use of PROFILE to analyze execution plans, and benchmarking strategies for large graph datasets.
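When reviewing a PROFILE result programmatically, the main thing to look for is whether any operator in the plan is still a label scan. The sketch below walks a plan represented as a nested dict with operatorType and children keys. That shape is an assumption about how the plan is exposed by the client in use, and the sample plans are hand-written for illustration, not captured output.

```python
def collect_operators(plan):
    """Return the operator names found anywhere in a nested plan dict."""
    operators = []
    stack = [plan]
    while stack:
        node = stack.pop()
        # Some clients append a suffix such as "@neo4j"; strip it if present.
        operators.append(str(node.get("operatorType", "")).split("@")[0])
        stack.extend(node.get("children", []))
    return operators


def uses_label_scan(plan):
    """True if any operator in the plan is a NodeByLabelScan."""
    return "NodeByLabelScan" in collect_operators(plan)


# Hand-written example plans (hypothetical, for illustration only).
PLAN_BEFORE = {
    "operatorType": "ProduceResults@neo4j",
    "children": [
        {
            "operatorType": "Filter@neo4j",
            "children": [{"operatorType": "NodeByLabelScan@neo4j", "children": []}],
        }
    ],
}
PLAN_AFTER = {
    "operatorType": "ProduceResults@neo4j",
    "children": [{"operatorType": "NodeIndexSeek@neo4j", "children": []}],
}

if __name__ == "__main__":
    print("before uses label scan:", uses_label_scan(PLAN_BEFORE))
    print("after uses label scan: ", uses_label_scan(PLAN_AFTER))
```

A check like this could run against a handful of representative queries after index changes, to catch lookups that fall back to a full scan.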

Query Annotations

Added performance notes and profiling instructions in db.py (Gap Analysis queries) and prompt_client.py (AI Mapping Pipeline).

Verification

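With the indexes in place, entry-point lookups should plan as NodeIndexSeek, improving from O(N) to O(log N). Syntax was validated with python -m py_compile, which confirms that the files compile but does not measure performance. The changes are expected to reduce the memory pressure behind the previously observed 11s STW pauses.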

Related Issue

Fixes #622

Fix OWASP#622: Add Neo4j indexing and performance documentation (Clean)

9471b6a

Khushi281300 wants to merge 1 commit into OWASP:main from Khushi281300:fix-neo4j-indexing-final-622
