I've created all the necessary files to run the knowledge graph extraction system locally:
- `.env` - Configuration file where you add your Gemini API key
- `requirements.txt` - All Python dependencies
- `example_usage.py` - Complete example showing how to use the system
- `run.py` - Quick start script that runs everything
- `test_setup.py` - Test script to verify your setup
- `setup.sh` - Automated setup script (Unix/Mac)
- `.gitignore` - Git ignore file for common artifacts
- `README.md` - Updated with installation and usage instructions
```bash
# Make setup script executable and run it
chmod +x setup.sh
./setup.sh
```

Or set up manually:

```bash
# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

- Get a Gemini API key from: https://ai.google.dev/
- Edit the `.env` file and replace `your_gemini_api_key_here` with your actual key:

```
GEMINI_API_KEY=your_actual_api_key_here
```
Run the test script to verify everything is working:

```bash
python test_setup.py
```

Then run the quick start script:

```bash
python run.py
```

This will automatically:

- Extract a knowledge graph from sample Wikipedia articles
- Save the visualization as `knowledge_graph.png`
- Save the data as `extracted_knowledge_graph.json`
```python
from example_usage import run_custom_example

# Your own text and entity types
text = "Your custom text here..."
entity_types = ["PERSON", "COMPANY", "LOCATION"]
result = run_custom_example(text, entity_types)
```

The extraction pipeline:

- Splits text into chunks (4096 characters by default)
- Uses an LLM to extract relation triplets: `(subject:type, relation, object:type)`
- Example: `(Mark Zuckerberg:PERSON, founded, Facebook:COMPANY)`
- Evaluates each extracted relation for consistency
- Merges similar entities (e.g., "Mark Zuckerberg" and "Zuckerberg")
- Builds a coherent knowledge graph, avoiding duplicates
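The chunking step can be sketched roughly as follows. This is a simplified illustration, not the project's actual code: `chunk_text` is a hypothetical name, and the real splitter may respect sentence boundaries rather than cutting at fixed offsets.

```python
def chunk_text(text: str, chunk_size: int = 4096, overlap: int = 128) -> list[str]:
    """Split text into fixed-size chunks with a sliding overlap,
    so relations spanning a chunk boundary are not lost."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step back by `overlap` chars
    return chunks
```

The defaults match the `CHUNK_SIZE` and `CHUNK_OVERLAP` settings in `.env`.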
- Entity Types: Customizable (PERSON, COMPANY, LOCATION, etc.)
- Source Linking: Relations linked back to original text passages
- Visualization: Automatic graph visualization with NetworkX
- Export: JSON format for further processing
- Scalable: Works with multiple documents/sources
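Duplicate avoidance during graph building can be illustrated with a minimal sketch. The triplet tuples follow the `(subject:type, relation, object:type)` convention above; `build_graph` is a hypothetical name, not the project's API.

```python
def build_graph(triplets):
    """Collect unique nodes and edges from (subject, relation, object) triplets."""
    nodes, edges = set(), set()
    for subj, rel, obj in triplets:
        nodes.update([subj, obj])
        edges.add((subj, rel, obj))  # set membership skips exact duplicates
    return nodes, edges

triplets = [
    ("Mark Zuckerberg:PERSON", "founded", "Facebook:COMPANY"),
    ("Mark Zuckerberg:PERSON", "founded", "Facebook:COMPANY"),  # duplicate
]
nodes, edges = build_graph(triplets)
```

In the real pipeline the LLM additionally merges near-duplicate entities (e.g. "Zuckerberg" vs. "Mark Zuckerberg"), which plain set membership cannot catch.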
Edit the `.env` file to customize:

```
# Model settings
DEFAULT_MODEL=gemini-2.0-flash-exp
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Processing settings
CHUNK_SIZE=4096
CHUNK_OVERLAP=128
MAX_NEW_TOKENS=4096
TEMPERATURE=0.0
```

After running, you'll get:

- `knowledge_graph.png` - Visual representation of the graph
- `extracted_knowledge_graph.json` - Structured data with all relations
- Console output showing extraction progress and a summary
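Inside the scripts, such settings would typically be read from the environment (e.g. after python-dotenv's `load_dotenv()` has loaded `.env`). A minimal stdlib sketch using the variable names above, with fallback defaults mirroring the values shown:

```python
import os

# Read pipeline settings from the environment; the fallbacks match
# the defaults in .env. Assumes .env has already been loaded.
model = os.environ.get("DEFAULT_MODEL", "gemini-2.0-flash-exp")
chunk_size = int(os.environ.get("CHUNK_SIZE", "4096"))
chunk_overlap = int(os.environ.get("CHUNK_OVERLAP", "128"))
temperature = float(os.environ.get("TEMPERATURE", "0.0"))
```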
- Import errors: run `pip install -r requirements.txt`
- API key issues: check the `.env` file and verify your key at https://ai.google.dev/
- Memory issues: try a smaller `CHUNK_SIZE` in `.env`
- Network issues: check your internet connection for API calls
- Run `python test_setup.py` to diagnose issues
- Check the Colab notebook: https://colab.research.google.com/drive/1st_E7SBEz5GpwCnzGSvKaVUiQuKv3QGZ
- Review the blog post for detailed explanations
- Start with the basic example: `python run.py`
- Try your own text: edit `example_usage.py`
- Customize entity types: modify the `entity_types` list
- Experiment with different models: try local Hugging Face models
- Build GraphRAG applications: use the extracted graphs for retrieval
- First run: May take longer due to model downloads
- Gemini API: Recommended for best results (as per blog post)
- Local models: Possible but may give lower quality results
- Processing time: Depends on text length and API response times
The workflow uses a two-phase approach:

- Extract: the LLM extracts raw triplets from text chunks
- Build: the LLM validates and merges the triplets into a consistent graph

This ensures entity disambiguation and prevents duplicate information while maintaining links to the source material.
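The two phases can be sketched schematically. Here `extract_triplets` and `validate_and_merge` are hypothetical stand-ins for the LLM calls, passed in as callables; the real implementation lives in the project files listed above.

```python
def extract_phase(chunks, extract_triplets):
    """Phase 1: collect raw (subject, relation, object) triplets from every chunk."""
    raw = []
    for chunk in chunks:
        raw.extend(extract_triplets(chunk))
    return raw

def build_phase(raw_triplets, validate_and_merge):
    """Phase 2: validate triplets and merge entity aliases into one graph."""
    return validate_and_merge(raw_triplets)
```

Keeping the phases separate means the expensive extraction pass can run per chunk, while validation sees all triplets at once and can resolve aliases globally.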
Happy knowledge graph building! 🎉