These utilities prepare smaller, model-friendly chunks of a MERN-stack codebase for LLM-based vulnerability scanning.
```text
/output/repo_id
    file_tree.json                      # raw file/folder structure of the target repo
    vuln_files_selection.json           # OpenAI-picked folders + standalone files
    vuln_file_metadata.json             # per-file metadata & summaries (backend + frontend)
    file_subsets.json                   # GPT-4-clustered file subsets
    subset_pipeline_suggestions.json    # suggested pipelines per subset
    pipeline_outputs/                   # LLM outputs for each pipeline stage
```
```text
get_file_struct.py          # quick console tree printer (no JSON)
get_file_struct_json.py     # writes /data/file_tree.json
select_vuln_files.py        # asks GPT-4 which parts look security-relevant
generate_metadata.py        # builds rich metadata + summaries for each path
group_subsets.py            # clusters files into logical subsets
pipeline_suggester.py       # suggests pipelines per subset
pipeline_executor.py        # executes pipelines on each subset
main.py                     # FastAPI server for pipeline automation
```
```text
OPENAI_API_KEY=<your key>
CODEBASE_PATH=<absolute path to the repo you want to analyse>
REPO_ID=<unique identifier for the repo>
METADATA_MAX_FILES=<n>   # optional – limits the number of files processed by generate_metadata.py
```

Create a `.env` file in the project root or export these variables in your shell.
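The scripts read these variables from the environment. As an illustration of what that lookup amounts to, here is a dependency-free `.env` loader sketch; the project itself may rely on a library such as `python-dotenv`, so treat `load_env_file` as a hypothetical helper:

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: reads KEY=VALUE lines and exports any
    key that is not already set in the environment."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # skip blanks, comments, and malformed lines
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_env_file()
CODEBASE_PATH = os.getenv("CODEBASE_PATH", ".")
REPO_ID = os.getenv("REPO_ID", "default")
```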
This repo uses Poetry:

```bash
poetry install
```
1. Scan the filesystem and build the JSON tree

   ```bash
   poetry run python get_file_struct_json.py   # -> writes data/file_tree.json
   ```
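The kind of recursive walk this step performs can be sketched as below; `EXCLUDE_DIRS`, `build_tree`, and the exact JSON shape are illustrative, not the script's real internals:

```python
import json
import os

# Illustrative skip list; the real EXCLUDE_DIRS lives in the script.
EXCLUDE_DIRS = {"node_modules", ".git", "dist", "build"}

def build_tree(root: str) -> dict:
    """Return a nested dict describing folders and files under `root`,
    skipping excluded directories."""
    tree = {"name": os.path.basename(root) or root, "type": "folder", "children": []}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isdir(path):
            if entry in EXCLUDE_DIRS:
                continue
            tree["children"].append(build_tree(path))
        else:
            tree["children"].append({"name": entry, "type": "file"})
    return tree

if __name__ == "__main__":
    print(json.dumps(build_tree(os.getenv("CODEBASE_PATH", ".")), indent=2))
```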
2. Let GPT-4 decide which folders/files deserve security attention

   ```bash
   poetry run python select_vuln_files.py   # -> writes data/vuln_files_selection.json
   ```
3. Generate per-file metadata & natural-language summaries

   ```bash
   # full run
   poetry run python generate_metadata.py

   # or limit to N files for a cheap dry-run
   METADATA_MAX_FILES=10 poetry run python generate_metadata.py

   # or analyse a different repo root
   poetry run python generate_metadata.py --base /some/other/path
   ```
4. Group files into functional subsets

   ```bash
   poetry run python group_subsets.py   # -> writes data/file_subsets.json
   ```
5. Suggest analysis pipelines for each subset

   ```bash
   poetry run python pipeline_suggester.py   # -> writes data/subset_pipeline_suggestions.json
   ```
6. Execute pipelines on each subset

   ```bash
   poetry run python pipeline_executor.py   # -> writes results under output/REPO_ID_data/pipeline_outputs/
   ```
| Script | What it does | Key outputs |
|---|---|---|
| `get_file_struct.py` | Pretty-prints a depth-limited directory tree to stdout. Handy for a quick visual inspection. | – |
| `get_file_struct_json.py` | Recursively walks the repo (honours `EXCLUDE_DIRS`) and dumps a JSON object representing folders & files. Uses `CODEBASE_PATH` if set. | `data/file_tree.json` |
| `select_vuln_files.py` | Sends the JSON tree to GPT-4 with a prompt asking for potentially vulnerable areas. Stores the returned JSON lists. | `data/vuln_files_selection.json` |
| `generate_metadata.py` | Reads the selection, computes language/LOC/imports per file, calls GPT-4 for a 2-3 sentence summary (cached via SHA-1), and writes a consolidated metadata file. | `data/vuln_file_metadata.json` |
| `group_subsets.py` | Uses GPT-4 to cluster files into logical subsets based on functional connections (data flow, MVC, shared state). | `data/file_subsets.json` |
| `pipeline_suggester.py` | For each subset, asks GPT-4 which vulnerability analysis pipelines should run and stores the suggestions. | `data/subset_pipeline_suggestions.json` |
| `pipeline_executor.py` | Executes the suggested pipelines per subset and persists LLM outputs for each pipeline stage. | `output/REPO_ID_data/pipeline_outputs/` |
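The SHA-1 caching behaviour of `generate_metadata.py` can be sketched as below; `cached_summary` and the cache dict layout are illustrative names, not the script's real API:

```python
import hashlib

def file_sha1(path: str) -> str:
    """Hash a file's full content with SHA-1."""
    with open(path, "rb") as fh:
        return hashlib.sha1(fh.read()).hexdigest()

def cached_summary(path: str, cache: dict, summarise) -> str:
    """Only call the (expensive) summarise() function when the
    file's content hash has changed since the last run."""
    digest = file_sha1(path)
    entry = cache.get(path)
    if entry and entry["sha1"] == digest:
        return entry["summary"]      # unchanged file: reuse, save tokens
    summary = summarise(path)        # new or changed file: re-summarise
    cache[path] = {"sha1": digest, "summary": summary}
    return summary
```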
- If your codebase changes, re-run the scripts in order. `generate_metadata.py` only re-summarises files whose SHA-1 changed, saving tokens.
- Delete files in `/data` to force a full rebuild.
```bash
# Assume .env has OPENAI_API_KEY and CODEBASE_PATH already
poetry run python get_file_struct_json.py && \
poetry run python select_vuln_files.py && \
poetry run python generate_metadata.py && \
poetry run python group_subsets.py && \
poetry run python pipeline_suggester.py && \
poetry run python pipeline_executor.py
```

You now have everything needed to batch code & summaries into LLM-sized chunks for vulnerability analysis.
The FastAPI server (see `main.py`) exposes a single endpoint, `POST /llm/scan`, to automate the entire workflow from your CI/CD pipeline. It runs the standard six-step pipeline described above.
Request body (JSON):

```json
{
  "id": "a_unique_id",        // Arbitrary identifier – persisted as REPO_ID in .env
  "path": "/abs/path/to/repo" // Absolute path to the codebase – persisted as CODEBASE_PATH in .env
}
```

Behavior:
- Updates/creates `.env` with `REPO_ID` and `CODEBASE_PATH`.
- Executes the scripts in this order, aborting on the first failure: `get_file_struct_json.py`, `select_vuln_files.py`, `generate_metadata.py`, `group_subsets.py`, `pipeline_suggester.py`, `pipeline_executor.py`.
- Returns JSON:
  - If `pipeline_executor.py` produced an aggregated summary → `{ "success": true, "results": [...] }`
  - Otherwise → `{ "success": true, "output": "<stdout of last script>" }`
  - If a script fails, the API responds with HTTP 500 (see the error example below).
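The sequential, abort-on-failure execution described above can be sketched as follows; `run_scripts` is an illustrative stand-in for what `main.py` does, not its actual code:

```python
import subprocess
import sys

# Order matters: later scripts consume earlier scripts' outputs.
PIPELINE_SCRIPTS = [
    "get_file_struct_json.py",
    "select_vuln_files.py",
    "generate_metadata.py",
    "group_subsets.py",
    "pipeline_suggester.py",
    "pipeline_executor.py",
]

def run_scripts(scripts):
    """Run each script with the current interpreter; abort on the first
    failure. Returns the combined stdout/stderr of the last script."""
    output = ""
    for script in scripts:
        result = subprocess.run(
            [sys.executable, script], capture_output=True, text=True
        )
        output = result.stdout + result.stderr
        if result.returncode != 0:
            # Mirrors the message shape of the API's HTTP 500 body
            raise RuntimeError(
                f"Script '{script}' failed (exit code {result.returncode})"
            )
    return output
```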
Successful results example:

```json
{
  "success": true,
  "results": [
    {
      "subset_id": "subset-001",
      "pipeline_id": "pipeline_injection",
      "outputs": [
        "subset-001_pipeline_injection_vuln_report.json",
        "subset-001_pipeline_injection_owasp_only.json",
        "subset-001_pipeline_injection_remediation_suggestions.json"
      ]
    }
  ]
}
```

Successful plain-output example:
```json
{
  "success": true,
  "output": "Suggestions written to output/idurar-erp-crm_data/pipeline_outputs/..."
}
```

If a script fails, the API returns HTTP 500 with a body like:
```json
{
  "detail": {
    "message": "Script 'generate_metadata.py' failed (exit code 1)",
    "output": "Traceback …"
  }
}
```

Ensure dependencies are installed:

```bash
poetry install
```

Then launch FastAPI with live-reload:

```bash
poetry run uvicorn xployt_lvl2.main:app --reload
```

By default the docs are available at http://127.0.0.1:8003/docs.
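For calling the endpoint programmatically (e.g. from a CI job) rather than via `curl`, a minimal stdlib-only client might look like this; `build_scan_request` and `trigger_scan` are illustrative helpers, not part of the project:

```python
import json
import urllib.request

def build_scan_request(repo_id, repo_path, host="http://127.0.0.1:8003"):
    """Build the POST request the /llm/scan endpoint expects."""
    body = json.dumps({"id": repo_id, "path": repo_path}).encode()
    return urllib.request.Request(
        f"{host}/llm/scan",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def trigger_scan(repo_id, repo_path):
    """Fire the scan and return the parsed JSON response."""
    with urllib.request.urlopen(build_scan_request(repo_id, repo_path)) as resp:
        return json.loads(resp.read())
```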
```bash
curl -X POST http://127.0.0.1:8003/llm/scan \
  -H "Content-Type: application/json" \
  -d '{
        "path": "E:/PROJECTS/ACADAMIC/Xployt-ai/REPOS/idurar-erp-crm-5"
      }'
```

```bash
poetry run uvicorn xployt_lvl2.main:app --reload
```

```bash
curl -X POST http://127.0.0.1:8003/llm/scan \
  -H "Content-Type: application/json" \
  -d '{
        "path": "E:/PROJECTS/ACADAMIC/Xployt-ai/REPOS/vuln_node_express"
      }'
```

```bash
curl -X POST "http://127.0.0.1:8003/execute-module" \
  -H "Content-Type: application/json" \
  -d '{"path": "E:/PROJECTS/ACADAMIC/Xployt-ai/REPOS/vuln_node_express", "module_number": 5}'
```

```bash
curl -X POST http://127.0.0.1:8003/llm/scan \
  -H "Content-Type: application/json" \
  -d '{
        "path": "E:/PROJECTS/ACADAMIC/Xployt-ai/REPOS/nodejs-goof"
      }'
```

```bash
curl -X POST http://127.0.0.1:8003/llm/scan \
  -H "Content-Type: application/json" \
  -d '{
        "path": "E:/PROJECTS/ACADAMIC/Xployt-ai/REPOS/Zero-Health"
      }'
```