Date: November 23, 2025
Project: KiloCode Codebase Indexing
Model: qwen3-embedding:8b-fp16
Vector Database: Qdrant (local Docker deployment)
This guide walks you through setting up Qdrant vector database to work with KiloCode's codebase indexing feature using the Qwen3-Embedding-8B model. We'll deploy Qdrant (optionally on the same Docker network as Ollama if using custom networks), configure it for 4096-dimensional embeddings (what Qwen3-8B outputs via Ollama), and integrate it with KiloCode.
Prerequisites:
- Ollama running (Docker or native installation)
- Model qwen3-embedding:8b-fp16 already pulled in Ollama
- Docker and Docker Compose installed
- Ubuntu Desktop with GPU access
Contents:
- Docker Network Configuration (Optional)
- Deploy Qdrant with Docker Compose
- Verify Qdrant is Running
- Create the Collection
- Test End-to-End Integration
- Configure KiloCode
- Monitor Initial Indexing
- Understanding the Data Flow
- Performance Expectations
- Troubleshooting
- Maintenance Commands
If you're using a custom Docker network (like ollama-network in my setup), deploying Qdrant on the same network provides benefits:
- Container-to-container communication: Direct internal communication without host routing
- Name resolution: Containers can reference each other by name (e.g., http://qdrant:6333)
- Security: Internal network traffic doesn't expose ports unnecessarily
- Performance: Slightly faster than localhost routing
However, this is entirely optional. If you're running Ollama without a custom Docker network, simply omit the networks section from the docker-compose.yml below. Qdrant will work perfectly fine using localhost connections.
Note: KiloCode (running on your host) will always use localhost:6333 to connect to Qdrant, regardless of Docker network configuration.
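If you plan to use the custom network but haven't created it yet, the compose file below will fail to find it (it is declared as external). A minimal sketch to create it up front, assuming the network name used in this guide:

```shell
# Create the external bridge network referenced by docker-compose.yml
# (skip if ollama-network already exists)
docker network create ollama-network

# Confirm it exists
docker network ls | grep ollama-network
```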
Create a file called docker-compose.yml in your preferred directory:
services:
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    networks:
      - ollama-network
    ports:
      - "6333:6333"   # HTTP API
      - "6334:6334"   # gRPC (optional)
    volumes:
      - qdrant_storage:/qdrant/storage
    restart: unless-stopped
    # Optional: health check (note: the qdrant image may not ship curl;
    # if the check always reports unhealthy, swap in a bash /dev/tcp probe)
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6333/"]
      interval: 30s
      timeout: 10s
      retries: 3

networks:
  ollama-network:
    external: true  # Use existing network

volumes:
  qdrant_storage:
    driver: local

What each section does:
- image: Latest Qdrant version (~200MB download)
- networks: (Optional) Joins your existing ollama-network if you use one; omit this section entirely if you don't use custom Docker networks
- ports:
  - 6333: HTTP API (for KiloCode and curl commands)
  - 6334: gRPC (optional, for advanced use)
- volumes: Persistent storage for your vector data
- restart: Auto-restart on system reboot
- healthcheck: Monitors Qdrant's health status
Note: If you're not using a custom Docker network like ollama-network, simply remove the networks: section from both the qdrant service definition and the top-level networks declaration. Qdrant will work perfectly with localhost connections.
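Before deploying, you can sanity-check the file: docker compose config parses and validates the compose file without starting anything.

```shell
# Validate docker-compose.yml; --quiet suppresses the resolved output
# and only reports errors (non-zero exit on an invalid file)
docker compose config --quiet && echo "compose file OK"
```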
Deploy the container:
docker compose up -d

Expected output (with custom network):
[+] Running 2/2
✔ Network ollama-network Found
✔ Container qdrant Started
Expected output (without custom network):
[+] Running 1/1
✔ Container qdrant Started
Run these verification commands:
# 1. Check container is running
docker ps | grep qdrant
# Expected output:
# qdrant qdrant/qdrant:latest ... Up ... 0.0.0.0:6333->6333/tcp
# 2. Test the API endpoint
curl http://localhost:6333/
# Expected output:
# {"title":"qdrant - vector search engine","version":"1.x.x"}
# 3. (Optional) If using custom Docker network, verify Qdrant joined it
docker network inspect ollama-network | grep -A 5 qdrant
# Should show qdrant in the containers list (skip if not using custom network)
# 4. Open Qdrant dashboard (optional)
# Browse to: http://localhost:6333/dashboard

If all commands succeed, Qdrant is ready!
Create a collection configured for Qwen3's 4096-dimensional embeddings:
curl -X PUT http://localhost:6333/collections/kilocode_codebase \
-H 'Content-Type: application/json' \
-d '{
"vectors": {
"size": 4096,
"distance": "Cosine"
}
}'

Expected response:
{"result":true,"status":"ok","time":0.001234}

What this creates:
- Collection name: kilocode_codebase
- Vector dimensions: 4096 (matches Qwen3-Embedding-8B-FP16 output via Ollama)
- Distance metric: Cosine (best for embeddings)
Verify collection was created:
curl http://localhost:6333/collections
# Should list kilocode_codebase in the result

View collection details:
curl http://localhost:6333/collections/kilocode_codebase | jq
# Shows: vectors_count, indexed_vectors_count, points_count, status

Test that Ollama and Qdrant can work together:
# Generate a test embedding with Qwen3
curl http://localhost:11434/api/embeddings -d '{
"model": "qwen3-embedding:8b-fp16",
"prompt": "def hello_world(): print(\"Hello, World!\")"
}' > test_embedding.json
# Check the embedding was generated
cat test_embedding.json | head -20
# If you have jq installed, verify dimension:
cat test_embedding.json | jq '.embedding | length'
# Should output: 4096

What this tests:
- Ollama responds and generates embeddings
- Qwen3-Embedding-8B model is working
- Output is 4096 dimensions (verified for this model)
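To close the loop, you can push that test embedding into the collection and search for it with Qdrant's points API. A sketch using jq (the point ID 1 and the smoke-test payload are arbitrary choices for this check, not part of KiloCode's schema):

```shell
# Upsert the test embedding as a single point
VEC=$(jq -c '.embedding' test_embedding.json)
curl -X PUT http://localhost:6333/collections/kilocode_codebase/points \
  -H 'Content-Type: application/json' \
  -d "{\"points\": [{\"id\": 1, \"vector\": $VEC, \"payload\": {\"source\": \"smoke-test\"}}]}"

# Search with the same vector; the smoke-test point should come back
# as the top hit with a cosine score of ~1.0
curl -X POST http://localhost:6333/collections/kilocode_codebase/points/search \
  -H 'Content-Type: application/json' \
  -d "{\"vector\": $VEC, \"limit\": 1}" | jq '.result[0]'

# Clean up so the test point doesn't pollute real search results
curl -X POST http://localhost:6333/collections/kilocode_codebase/points/delete \
  -H 'Content-Type: application/json' \
  -d '{"points": [1]}'
```

Deleting the point afterwards keeps the collection empty before KiloCode starts indexing.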
Open KiloCode settings (⚙️ icon in VS Code) and navigate to Codebase Indexing:
Codebase Indexing:
✅ Enable Codebase Indexing: ON
Embedding Provider:
Provider: Ollama
Base URL: http://localhost:11434
Model: qwen3-embedding:8b-fp16
Dimensions: 4096 (must match model output)
Vector Database:
Provider: Qdrant
URL: http://localhost:6333
API Key: (leave empty for local setup)
Collection Name: kilocode_codebase
Search Settings:
Max Search Results: 50
Min Block Size: 100 chars
Max Block Size: 1000 chars

Important configuration notes:

- Model Name Must Match Exactly: qwen3-embedding:8b-fp16
  - Case-sensitive
  - Must include the :8b-fp16 tag
- Dimensions:
  - Must be set to 4096 to match Qwen3-8B output
  - Verify with:
    curl http://localhost:11434/api/embeddings -d '{"model": "qwen3-embedding:8b-fp16", "prompt": "test"}' | jq '.embedding | length'
- No API Key Needed:
  - Both services run locally without authentication
  - Only set API key if you've secured Qdrant (advanced)
- Collection Name:
  - Must match the collection we created: kilocode_codebase
Click Save to start indexing your codebase.
Watch the indexing process:
# Monitor GPU usage (should spike during indexing)
watch -n 1 nvidia-smi
# Monitor container resources
docker stats ollama qdrant
# Check Qdrant collection size (grows as vectors are added)
watch -n 5 'curl -s http://localhost:6333/collections/kilocode_codebase | jq .result.points_count'

Expected behavior during indexing:
- GPU 0 (RTX 4090): VRAM usage increases to ~15GB
- CPU: Spikes as Tree-sitter parses files
- Qdrant: points_count increases as code blocks are indexed
- KiloCode UI: Shows "Indexing" status (yellow indicator)
When complete:
- KiloCode status shows "Indexed" (green indicator)
- Qdrant points_count matches your codebase size
- GPU usage drops back to idle
Here's how the entire system works:
1. KiloCode scans your project files
↓
2. Tree-sitter parses code into semantic blocks (functions, classes, methods)
↓
3. Each code block → Ollama API (http://localhost:11434)
↓
4. Qwen3-Embedding-8B processes text on GPU
↓
5. Returns 4096-dimensional vector
↓
6. KiloCode stores vector in Qdrant (http://localhost:6333)
↓
7. Qdrant indexes vector for fast similarity search
Search flow:
Your search query
↓
Ollama (Qwen3-Embedding-8B)
↓
4096-dimensional query vector
↓
Qdrant similarity search
↓
Top-K most similar code vectors
↓
KiloCode retrieves corresponding code blocks
↓
Results displayed in KiloCode
Key Points:
- Parsing happens locally (Tree-sitter)
- Embeddings generated locally (Ollama + GPU)
- Vectors stored locally (Qdrant)
- No data leaves your machine
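The search path above can be exercised by hand with two calls — embed a query with Ollama, then feed the vector to Qdrant. A sketch (only useful once KiloCode has indexed some points; the query text is just an example):

```shell
# 1. Embed the query text with Qwen3 via Ollama
QVEC=$(curl -s http://localhost:11434/api/embeddings -d '{
  "model": "qwen3-embedding:8b-fp16",
  "prompt": "user authentication logic"
}' | jq -c '.embedding')

# 2. Ask Qdrant for the 5 nearest code-block vectors, payloads included
curl -s -X POST http://localhost:6333/collections/kilocode_codebase/points/search \
  -H 'Content-Type: application/json' \
  -d "{\"vector\": $QVEC, \"limit\": 5, \"with_payload\": true}" | jq '.result'
```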
Indexing Performance:
- Initial build time: Varies by codebase size (GPU-accelerated)
- Bottleneck: GPU embedding generation (not Qdrant)
- VRAM usage: ~15GB (Qwen3) + minimal for OS
Search Performance:
- Query latency: Fast local search (milliseconds)
- Embedding generation: GPU-accelerated (Qwen3)
- Vector search: Fast similarity matching (Qdrant)
- Post-processing: Minimal overhead (KiloCode)
- Consistent latency (local = no network variance)
Quality Metrics:
- High retrieval accuracy observed
- Top results typically very relevant to query
- Semantic understanding finds conceptually similar code
GPU (RTX 4090):
Idle: ~2GB VRAM (OS)
Indexing: ~15GB VRAM (Qwen3 model)
Searching: ~15GB VRAM (Qwen3 model)
Available: ~9GB VRAM (for other tasks)
Qdrant Memory (typical codebase with 10K code blocks):
Vectors: ~160MB (10K × 4096 dims × 4 bytes)
Qdrant: ~240MB (indexes + overhead)
Total: ~400MB RAM
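The vector-storage figure scales linearly with codebase size; a quick sketch of the arithmetic (raw float32 vectors only, before Qdrant's index overhead):

```shell
# Raw vector storage: blocks × dims × 4 bytes (float32)
blocks=10000
dims=4096
bytes=$((blocks * dims * 4))
echo "$((bytes / 1024 / 1024)) MB raw"   # prints "156 MB raw" (~160MB as quoted above)
```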
Disk Storage:
Qwen3 model: ~15GB (one-time)
Qdrant data: ~100-500MB (depends on codebase size)
Docker images: ~200MB (Qdrant)
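To see the actual on-disk size of the Qdrant volume on your machine, Docker's verbose disk-usage report lists each named volume:

```shell
# Per-volume disk usage; look for qdrant_storage in the volumes section
docker system df -v | grep qdrant_storage
```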
Note: For Ollama-specific issues (model not found, embedding generation errors), see 3_QWEN3_OLLAMA_GUIDE.md. For general KiloCode troubleshooting, see README.md or FAQ.md. This section covers Qdrant-specific issues only.
Symptom: KiloCode can't connect to Qdrant
Diagnosis:
# Check Qdrant is running
docker ps | grep qdrant
# Test API endpoint
curl http://localhost:6333/

Fix:
# Restart Qdrant
docker compose restart qdrant
# Check logs for errors
docker logs qdrant --tail 50

Symptom: Error about vector dimensions not matching
This means: Collection was created with wrong dimensions or model output changed
Fix:
# Delete collection
curl -X DELETE http://localhost:6333/collections/kilocode_codebase
# Recreate with correct dimensions (4096)
curl -X PUT http://localhost:6333/collections/kilocode_codebase \
-H 'Content-Type: application/json' \
-d '{
"vectors": {
"size": 4096,
"distance": "Cosine"
}
}'
# Rebuild index in KiloCode (Settings → Rebuild Index)

Symptom: Search doesn't find expected code
Possible causes:
- Index is stale (code changed but not reindexed)
- Files excluded by .gitignore or .kilocode patterns
- Search query too vague
Fix:
# Check what's indexed
curl http://localhost:6333/collections/kilocode_codebase | jq .result.points_count
# Rebuild index in KiloCode
# Settings → Codebase Indexing → Rebuild Index button
# Try more specific search queries
# Example: "authentication middleware" vs "auth"

# View collection info
curl http://localhost:6333/collections/kilocode_codebase | jq
# Count indexed vectors
curl -s http://localhost:6333/collections/kilocode_codebase | jq .result.points_count
# Check collection health
curl http://localhost:6333/collections/kilocode_codebase | jq .result.status
# Should be "green"

# Create backup of Qdrant storage
docker run --rm \
-v qdrant_storage:/data \
-v $(pwd):/backup \
alpine tar czf /backup/qdrant-backup-$(date +%Y%m%d).tar.gz /data
# Backup will be saved in current directory
ls -lh qdrant-backup-*.tar.gz

# Stop Qdrant
docker compose stop qdrant
# Restore backup
docker run --rm \
-v qdrant_storage:/data \
-v $(pwd):/backup \
alpine tar xzf /backup/qdrant-backup-20251123.tar.gz -C /
# Start Qdrant
docker compose start qdrant

Open in your browser:
http://localhost:6333/dashboard
Dashboard features:
- Collection overview
- Vector count and storage
- Search performance metrics
- Collection configuration
# Restart Qdrant only
docker compose restart qdrant
# Restart both Ollama and Qdrant
docker restart ollama
docker compose restart qdrant
# View logs
docker logs qdrant --tail 100 --follow
docker logs ollama --tail 100 --follow

# Pull latest image
docker compose pull qdrant
# Recreate container with new image
docker compose up -d --force-recreate qdrant
# Verify version
curl http://localhost:6333/ | jq .version

If you want to secure Qdrant with an API key:
1. Update docker-compose.yml:
services:
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    networks:
      - ollama-network
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_storage:/qdrant/storage
    restart: unless-stopped
    environment:
      - QDRANT__SERVICE__API_KEY=your-super-secret-key-here  # Add this line

networks:
  ollama-network:
    external: true

volumes:
  qdrant_storage:
    driver: local

2. Recreate container:
docker compose up -d --force-recreate qdrant

3. Update KiloCode settings:
- Add the API key in "Qdrant API Key" field
- Save settings
4. Test with API key:
curl -H "api-key: your-super-secret-key-here" \
http://localhost:6333/collections

- Test search quality: Try natural language queries in KiloCode
  - Example: "user authentication logic"
  - Example: "database connection setup"
  - Example: "error handling patterns"
- Monitor performance: Check the Qdrant dashboard
  - Watch search latency
  - Verify vector count matches expectations
  - Check memory usage
- Adjust settings if needed:
  - Increase "Max Search Results" for more context (20 → 50)
  - Modify "Max Block Size" for larger code blocks (1000 → 1500)
- Set up file watching: KiloCode auto-reindexes changed files
  - Edit a file and save
  - Watch Qdrant points_count update
  - Verify search finds new content
You now have a production-ready local codebase indexing system:
✅ Qwen3-Embedding-8B: State-of-the-art code embeddings (SOTA for consumer GPUs, 80.68 on MTEB Code)
✅ Qdrant: Fast, efficient vector database
✅ 4096 dimensions: Maximum quality (Qwen3-8B output)
✅ Local setup: Complete privacy, no API costs
✅ GPU-accelerated: Fast indexing and search
Your setup delivers:
- Fast local search (milliseconds)
- High retrieval accuracy
- Minimal ongoing electricity costs
- Unlimited searches with no rate limits
Architecture:
KiloCode (VS Code)
↓
Ollama (qwen3-embedding:8b-fp16) → RTX 4090 (15GB VRAM)
↓
Qdrant (kilocode_codebase collection) → ~400MB RAM
Happy coding! 🚀
Document Version: 1.0
Last Updated: November 23, 2025
Author: AI Implementation Guide
Project: KiloCode Codebase Indexing with Qwen3-Embedding-8B + Qdrant