Qdrant Setup Guide for KiloCode Codebase Indexing

Date: November 23, 2025
Project: KiloCode Codebase Indexing
Model: qwen3-embedding:8b-fp16
Vector Database: Qdrant (local Docker deployment)


Introduction

This guide walks you through setting up the Qdrant vector database for KiloCode's codebase indexing feature with the Qwen3-Embedding-8B model. We'll deploy Qdrant (optionally on the same custom Docker network as Ollama), configure it for the 4096-dimensional embeddings that Qwen3-8B outputs via Ollama, and integrate it with KiloCode.

Prerequisites:

  • Ollama running (Docker or native installation)
  • Model qwen3-embedding:8b-fp16 already pulled in Ollama
  • Docker and Docker Compose installed
  • Ubuntu Desktop with GPU access

Table of Contents

  1. Docker Network Configuration (Optional)
  2. Deploy Qdrant with Docker Compose
  3. Verify Qdrant is Running
  4. Create the Collection
  5. Test End-to-End Integration
  6. Configure KiloCode
  7. Monitor Initial Indexing
  8. Understanding the Data Flow
  9. Performance Expectations
  10. Troubleshooting
  11. Maintenance Commands

Docker Network Configuration (Optional)

If you're using a custom Docker network (like ollama-network in my setup), deploying Qdrant on the same network provides benefits:

  • Container-to-container communication: Direct internal communication without host routing
  • Name resolution: Containers can reference each other by name (e.g., http://qdrant:6333)
  • Security: Internal network traffic doesn't expose ports unnecessarily
  • Performance: Slightly faster than localhost routing

However, this is entirely optional. If you're running Ollama without a custom Docker network, simply omit the networks section from the docker-compose.yml below. Qdrant will work perfectly fine using localhost connections.

Note: KiloCode (running on your host) will always use localhost:6333 to connect to Qdrant, regardless of Docker network configuration.


Deploy Qdrant with Docker Compose

Create a file called docker-compose.yml in your preferred directory:

services:
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    networks:
      - ollama-network
    ports:
      - "6333:6333"  # HTTP API
      - "6334:6334"  # gRPC (optional)
    volumes:
      - qdrant_storage:/qdrant/storage
    restart: unless-stopped
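    # Optional: health check. The official image may not ship curl,
    # so this probes the HTTP port with bash instead.
    healthcheck:
      test: ["CMD-SHELL", "bash -c ':> /dev/tcp/127.0.0.1/6333' || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3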

networks:
  ollama-network:
    external: true  # Use existing network

volumes:
  qdrant_storage:
    driver: local

What each section does:

  • image: Latest Qdrant version (~200MB download)
  • networks: (Optional) Joins your existing ollama-network if you use one; omit this section entirely if you don't use custom Docker networks
  • ports:
    • 6333: HTTP API (for KiloCode and curl commands)
    • 6334: gRPC (optional, for advanced use)
  • volumes: Persistent storage for your vector data
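  • restart: unless-stopped brings the container back after crashes, Docker daemon restarts, and system reboots, unless you stopped it manually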
  • healthcheck: Monitors Qdrant's health status

Note: If you're not using a custom Docker network like ollama-network, remove both the networks: key under the qdrant service and the top-level networks: block at the end of the file. Qdrant works fine over localhost connections.

Deploy the container:

docker compose up -d

Expected output (with custom network):

[+] Running 2/2
 ✔ Network ollama-network    Found
 ✔ Container qdrant          Started

Expected output (without custom network):

[+] Running 1/1
 ✔ Container qdrant          Started

Verify Qdrant is Running

Run these verification commands:

# 1. Check container is running
docker ps | grep qdrant

# Expected output:
# qdrant  qdrant/qdrant:latest  ... Up ... 0.0.0.0:6333->6333/tcp

# 2. Test the API endpoint
curl http://localhost:6333/

# Expected output:
# {"title":"qdrant - vector search engine","version":"1.x.x"}

# 3. (Optional) If using custom Docker network, verify Qdrant joined it
docker network inspect ollama-network | grep -A 5 qdrant

# Should show qdrant in the containers list (skip if not using custom network)

# 4. Open Qdrant dashboard (optional)
# Browse to: http://localhost:6333/dashboard

If all commands succeed, Qdrant is ready!


Create the Collection

Create a collection configured for Qwen3's 4096-dimensional embeddings:

curl -X PUT http://localhost:6333/collections/kilocode_codebase \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "size": 4096,
      "distance": "Cosine"
    }
  }'

Expected response:

{"result":true,"status":"ok","time":0.001234}

What this creates:

  • Collection name: kilocode_codebase
  • Vector dimensions: 4096 (matches Qwen3-Embedding-8B-FP16 output via Ollama)
  • Distance metric: Cosine (compares vector direction rather than magnitude, the usual choice for text embeddings)
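Re-running the PUT request against an existing collection returns an error, so a setup script may want to create the collection only when it is missing. The following is a minimal sketch, assuming curl is installed; the function names collection_payload and ensure_collection and the QDRANT_URL/COLLECTION variables are illustrative, not part of Qdrant or KiloCode.

```shell
# Defaults mirror the values used in this guide; override via environment.
QDRANT_URL="${QDRANT_URL:-http://localhost:6333}"
COLLECTION="${COLLECTION:-kilocode_codebase}"

# Build the collection config JSON for a given vector size and distance metric.
collection_payload() {
  printf '{"vectors":{"size":%d,"distance":"%s"}}' "$1" "$2"
}

# Create the collection only when Qdrant reports it as missing (HTTP 404).
# (hypothetical helper; not called automatically)
ensure_collection() {
  local size="$1" status
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    "$QDRANT_URL/collections/$COLLECTION")
  if [ "$status" = "200" ]; then
    echo "collection $COLLECTION already exists"
  elif [ "$status" = "404" ]; then
    curl -s -X PUT "$QDRANT_URL/collections/$COLLECTION" \
      -H 'Content-Type: application/json' \
      -d "$(collection_payload "$size" Cosine)"
    echo
  else
    # 000 means curl could not connect at all
    echo "unexpected HTTP status from Qdrant: $status" >&2
    return 1
  fi
}
```

After sourcing the snippet, run ensure_collection 4096. The function only checks that the collection exists; it does not verify that an existing collection was created with the right vector size.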

Verify collection was created:

curl http://localhost:6333/collections

# Should list kilocode_codebase in the result

View collection details:

curl http://localhost:6333/collections/kilocode_codebase | jq

# Shows: vectors_count, indexed_vectors_count, points_count, status

Test End-to-End Integration

Confirm that Ollama produces embeddings of the size the Qdrant collection expects:

# Generate a test embedding with Qwen3
curl http://localhost:11434/api/embeddings -d '{
  "model": "qwen3-embedding:8b-fp16",
  "prompt": "def hello_world(): print(\"Hello, World!\")"
}' > test_embedding.json

# Check the embedding was generated (the response is one long line,
# so print only the first few hundred bytes)
head -c 300 test_embedding.json

# If you have jq installed, verify dimension:
jq '.embedding | length' test_embedding.json
# Should output: 4096

What this tests:

  • Ollama responds and generates embeddings
  • Qwen3-Embedding-8B model is working
  • Output is 4096 dimensions (verified for this model)
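The check above stops at Ollama. To exercise Qdrant as well, one option is to store a throwaway vector, search for it, and delete it again before KiloCode starts indexing. This is a rough sketch, assuming curl and jq are installed; the helper names (points_url, embed_text, upsert_test_point, search_text, delete_test_point) and the point ID are made up for illustration. The endpoints used are Qdrant's points upsert, search, and delete routes and Ollama's /api/embeddings route from this guide.

```shell
# Defaults mirror the values used in this guide; override via environment.
QDRANT_URL="${QDRANT_URL:-http://localhost:6333}"
COLLECTION="${COLLECTION:-kilocode_codebase}"
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
MODEL="${MODEL:-qwen3-embedding:8b-fp16}"

# Base URL of the points API for the collection.
points_url() {
  printf '%s/collections/%s/points' "$QDRANT_URL" "$COLLECTION"
}

# Embed a text snippet with Ollama and print the vector as a JSON array.
embed_text() {
  jq -n --arg model "$MODEL" --arg prompt "$1" \
      '{model: $model, prompt: $prompt}' \
    | curl -s "$OLLAMA_URL/api/embeddings" -d @- \
    | jq -c '.embedding'
}

# Store one embedded snippet under a numeric point ID.
upsert_test_point() {
  local id="$1" text="$2"
  embed_text "$text" \
    | jq -c --argjson id "$id" --arg text "$text" \
        '{points: [{id: $id, vector: ., payload: {text: $text}}]}' \
    | curl -s -X PUT "$(points_url)?wait=true" \
        -H 'Content-Type: application/json' -d @-
}

# Embed a query and return the three nearest points with their payloads.
search_text() {
  embed_text "$1" \
    | jq -c '{vector: ., limit: 3, with_payload: true}' \
    | curl -s -X POST "$(points_url)/search" \
        -H 'Content-Type: application/json' -d @-
}

# Remove the throwaway point again.
delete_test_point() {
  curl -s -X POST "$(points_url)/delete?wait=true" \
    -H 'Content-Type: application/json' -d "{\"points\":[$1]}"
}
```

A typical round trip after sourcing the snippet would be upsert_test_point 1 'def hello_world(): print("Hello, World!")', then search_text "function that prints a greeting" | jq '.result', then delete_test_point 1. If the search returns the stored snippet as the top hit, both services are working together.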

Configure KiloCode

Open KiloCode settings (⚙️ icon in VS Code) and navigate to Codebase Indexing:

Settings Configuration

Codebase Indexing:
  ✅ Enable Codebase Indexing: ON

Embedding Provider:
  Provider: Ollama
  Base URL: http://localhost:11434
  Model: qwen3-embedding:8b-fp16
  Dimensions: 4096 (must match model output)

Vector Database:
  Provider: Qdrant
  URL: http://localhost:6333
  API Key: (leave empty for local setup)
  Collection Name: kilocode_codebase

Search Settings:
  Max Search Results: 50
  Min Block Size: 100 chars
  Max Block Size: 1000 chars

Critical Configuration Notes

  1. Model Name Must Match Exactly: qwen3-embedding:8b-fp16

    • Case-sensitive
    • Must include the :8b-fp16 tag
  2. Dimensions:

    • Must be set to 4096 to match Qwen3-8B output
    • Verify with: curl http://localhost:11434/api/embeddings -d '{"model": "qwen3-embedding:8b-fp16", "prompt": "test"}' | jq '.embedding | length'
  3. No API Key Needed:

    • Both services run locally without authentication
    • Only set API key if you've secured Qdrant (advanced)
  4. Collection Name:

    • Must match the collection we created: kilocode_codebase

Click Save to start indexing your codebase.


Monitor Initial Indexing

Watch the indexing process:

# Monitor GPU usage (should spike during indexing)
watch -n 1 nvidia-smi

# Monitor container resources
docker stats ollama qdrant

# Check Qdrant collection size (grows as vectors are added)
watch -n 5 'curl -s http://localhost:6333/collections/kilocode_codebase | jq .result.points_count'

Expected behavior during indexing:

  • GPU 0 (RTX 4090): VRAM usage increases to ~15GB
  • CPU: Spikes as Tree-sitter parses files
  • Qdrant: points_count increases as code blocks are indexed
  • KiloCode UI: Shows "Indexing" status (yellow indicator)

When complete:

  • KiloCode status shows "Indexed" (green indicator)
  • Qdrant points_count stops growing (it counts indexed code blocks, not files)
  • GPU usage drops back to idle
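The "points_count stops growing" condition can also be checked from a terminal. The sketch below polls the collection and reports once the count has stayed the same for three consecutive samples. It assumes curl and jq are installed; the function names and the three-sample threshold are arbitrary choices for illustration, and a long pause in indexing would look the same as completion, so the KiloCode status indicator remains the authoritative signal.

```shell
# Defaults mirror the values used in this guide; override via environment.
QDRANT_URL="${QDRANT_URL:-http://localhost:6333}"
COLLECTION="${COLLECTION:-kilocode_codebase}"

# Print the current number of points; prints nothing if it cannot be read.
points_count() {
  curl -s "$QDRANT_URL/collections/$COLLECTION" \
    | jq -r '.result.points_count // empty'
}

# Given the previous sample, the current sample, and the running streak,
# print the new streak: +1 when the samples match, otherwise reset to 0.
next_streak() {
  if [ -n "$2" ] && [ "$1" = "$2" ]; then
    echo $(( $3 + 1 ))
  else
    echo 0
  fi
}

# Poll every N seconds (default 10) until three samples in a row are equal.
wait_until_stable() {
  local interval="${1:-10}" prev="" cur streak=0
  while [ "$streak" -lt 3 ]; do
    sleep "$interval"
    cur=$(points_count)
    streak=$(next_streak "$prev" "$cur" "$streak")
    prev="$cur"
  done
  echo "points_count settled at $prev"
}
```

Run wait_until_stable 15 in a spare terminal after clicking Save in KiloCode; stop it with Ctrl+C if Qdrant is unreachable, since the loop keeps retrying.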

Understanding the Data Flow

Here's how the entire system works:

Indexing Flow

1. KiloCode scans your project files
   ↓
2. Tree-sitter parses code into semantic blocks (functions, classes, methods)
   ↓
3. Each code block → Ollama API (http://localhost:11434)
   ↓
4. Qwen3-Embedding-8B processes text on GPU
   ↓
5. Returns 4096-dimensional vector
   ↓
6. KiloCode stores vector in Qdrant (http://localhost:6333)
   ↓
7. Qdrant indexes vector for fast similarity search

Search Flow

Your search query
   ↓
Ollama (Qwen3-Embedding-8B)
   ↓
4096-dimensional query vector
   ↓
Qdrant similarity search
   ↓
Top-K most similar code vectors
   ↓
KiloCode retrieves corresponding code blocks
   ↓
Results displayed in KiloCode

Key Points:

  • Parsing happens locally (Tree-sitter)
  • Embeddings generated locally (Ollama + GPU)
  • Vectors stored locally (Qdrant)
  • No data leaves your machine

Performance Expectations

For a Typical Project (5K-10K code blocks)

Indexing Performance:

  • Initial build time: Varies by codebase size (GPU-accelerated)
  • Bottleneck: GPU embedding generation (not Qdrant)
  • VRAM usage: ~15GB (Qwen3) + minimal for OS

Search Performance:

  • Query latency: Fast local search (milliseconds)
  • Embedding generation: GPU-accelerated (Qwen3)
  • Vector search: Fast similarity matching (Qdrant)
  • Post-processing: Minimal overhead (KiloCode)
  • Consistent latency (local = no network variance)

Quality Metrics:

  • High retrieval accuracy observed
  • Top results typically very relevant to query
  • Semantic understanding finds conceptually similar code

Resource Usage

GPU (RTX 4090):

Idle:      ~2GB VRAM (OS)
Indexing:  ~15GB VRAM (Qwen3 model)
Searching: ~15GB VRAM (Qwen3 model)
Available: ~9GB VRAM (for other tasks)

Qdrant Memory (typical codebase with 10K code blocks):

Vectors:   ~160MB (10K × 4096 dims × 4 bytes)
Qdrant:    ~240MB (indexes + overhead)
Total:     ~400MB RAM
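The raw vector figure above follows from points × dimensions × 4 bytes (one float32 per dimension). A small sketch for redoing the arithmetic with your own point count; the function names are illustrative, and the estimate covers raw vectors only, not Qdrant's index or payload overhead.

```shell
# Raw float32 vector storage in bytes: points x dimensions x 4 bytes per value.
vector_bytes() {
  echo $(( $1 * $2 * 4 ))
}

# Convert bytes to whole decimal megabytes (1 MB = 1,000,000 bytes), rounded down.
bytes_to_mb() {
  echo $(( $1 / 1000000 ))
}

POINTS="${POINTS:-10000}"  # number of code blocks (example figure from this guide)
DIMS="${DIMS:-4096}"       # embedding dimensions

raw=$(vector_bytes "$POINTS" "$DIMS")
echo "raw vectors: $(bytes_to_mb "$raw") MB for $POINTS points"
# With the defaults this prints: raw vectors: 163 MB for 10000 points
```

Set POINTS to the points_count reported by Qdrant to estimate the footprint of your own index.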

Disk Storage:

Qwen3 model:     ~15GB (one-time)
Qdrant data:     ~100-500MB (depends on codebase size)
Docker images:   ~200MB (Qdrant)

Troubleshooting

Note: For Ollama-specific issues (model not found, embedding generation errors), see 3_QWEN3_OLLAMA_GUIDE.md. For general KiloCode troubleshooting, see README.md or FAQ.md. This section covers Qdrant-specific issues only.


Issue 1: Qdrant connection refused

Symptom: KiloCode can't connect to Qdrant

Diagnosis:

# Check Qdrant is running
docker ps | grep qdrant

# Test API endpoint
curl http://localhost:6333/

Fix:

# Restart Qdrant
docker compose restart qdrant

# Check logs for errors
docker logs qdrant --tail 50

Issue 2: Dimension mismatch error

Symptom: Error about vector dimensions not matching

This means: Collection was created with wrong dimensions or model output changed
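Before deleting anything, it can help to confirm which side is off by comparing the size stored in the collection config with what the model actually returns. A minimal sketch, assuming curl and jq are installed and the collection uses a single unnamed vector (as created in this guide); the function names are illustrative.

```shell
# Defaults mirror the values used in this guide; override via environment.
QDRANT_URL="${QDRANT_URL:-http://localhost:6333}"
COLLECTION="${COLLECTION:-kilocode_codebase}"
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
MODEL="${MODEL:-qwen3-embedding:8b-fp16}"

# Vector size the collection was created with (empty if it cannot be read).
collection_dims() {
  curl -s "$QDRANT_URL/collections/$COLLECTION" \
    | jq -r '.result.config.params.vectors.size // empty'
}

# Length of an embedding the model returns for a short test prompt.
model_dims() {
  curl -s "$OLLAMA_URL/api/embeddings" \
    -d "{\"model\": \"$MODEL\", \"prompt\": \"test\"}" \
    | jq -r '.embedding | length'
}

# Compare the two values: returns 0 on match, 1 on mismatch, 2 if unreadable.
compare_dims() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "could not read one of the values"
    return 2
  elif [ "$1" -eq "$2" ]; then
    echo "match: $1"
  else
    echo "mismatch: collection=$1 model=$2"
    return 1
  fi
}
```

Run compare_dims "$(collection_dims)" "$(model_dims)". A mismatch means the collection needs to be recreated with the model's size; a match suggests the Dimensions value in the KiloCode settings is the one that is wrong.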

Fix:

# Delete collection
curl -X DELETE http://localhost:6333/collections/kilocode_codebase

# Recreate with correct dimensions (4096)
curl -X PUT http://localhost:6333/collections/kilocode_codebase \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "size": 4096,
      "distance": "Cosine"
    }
  }'

# Rebuild index in KiloCode (Settings → Rebuild Index)

Issue 3: Search results seem irrelevant

Symptom: Search doesn't find expected code

Possible causes:

  • Index is stale (code changed but not reindexed)
  • Files excluded by .gitignore or .kilocodeignore patterns
  • Search query too vague

Fix:

# Check what's indexed
curl http://localhost:6333/collections/kilocode_codebase | jq .result.points_count

# Rebuild index in KiloCode
# Settings → Codebase Indexing → Rebuild Index button

# Try more specific search queries
# Example: "authentication middleware" vs "auth"

Maintenance Commands

Check Collection Status

# View collection info
curl http://localhost:6333/collections/kilocode_codebase | jq

# Count indexed vectors
curl -s http://localhost:6333/collections/kilocode_codebase | jq .result.points_count

# Check collection health
curl http://localhost:6333/collections/kilocode_codebase | jq .result.status
# Should be "green"

Backup Qdrant Data

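# Compose prefixes volume names with the project name
# (e.g. myproject_qdrant_storage). Look up the real name first
# and use it in place of qdrant_storage below.
docker volume ls --filter name=qdrant_storage

# Create backup of Qdrant storage
docker run --rm \
  -v qdrant_storage:/data \
  -v "$(pwd)":/backup \
  alpine tar czf "/backup/qdrant-backup-$(date +%Y%m%d).tar.gz" /data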

# Backup will be saved in current directory
ls -lh qdrant-backup-*.tar.gz

Restore from Backup

# Stop Qdrant
docker compose stop qdrant

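# Restore backup (check the real volume name with docker volume ls,
# since Compose may prefix it with the project name, and adjust the date)
docker run --rm \
  -v qdrant_storage:/data \
  -v "$(pwd)":/backup \
  alpine tar xzf /backup/qdrant-backup-20251123.tar.gz -C /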

# Start Qdrant
docker compose start qdrant

View Qdrant Dashboard

Open in your browser:

http://localhost:6333/dashboard

Dashboard features:

  • Collection overview
  • Vector count and storage
  • Search performance metrics
  • Collection configuration

Restart Services

# Restart Qdrant only
docker compose restart qdrant

# Restart both Ollama and Qdrant
docker restart ollama
docker compose restart qdrant

# View logs
docker logs qdrant --tail 100 --follow
docker logs ollama --tail 100 --follow

Update Qdrant

# Pull latest image
docker compose pull qdrant

# Recreate container with new image
docker compose up -d --force-recreate qdrant

# Verify version
curl http://localhost:6333/ | jq .version

Optional: Add Security (For LAN Sharing)

If you want to secure Qdrant with an API key:

1. Update docker-compose.yml:

services:
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    networks:
      - ollama-network
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_storage:/qdrant/storage
    restart: unless-stopped
    environment:
      - QDRANT__SERVICE__API_KEY=your-super-secret-key-here  # Add this line

networks:
  ollama-network:
    external: true

volumes:
  qdrant_storage:
    driver: local

2. Recreate container:

docker compose up -d --force-recreate qdrant

3. Update KiloCode settings:

  • Add the API key in "Qdrant API Key" field
  • Save settings

4. Test with API key:

curl -H "api-key: your-super-secret-key-here" \
  http://localhost:6333/collections

Next Steps After Setup

  1. Test search quality: Try natural language queries in KiloCode

    • Example: "user authentication logic"
    • Example: "database connection setup"
    • Example: "error handling patterns"
  2. Monitor performance: Check the Qdrant dashboard

    • Watch search latency
    • Verify vector count matches expectations
    • Check memory usage
  3. Adjust settings if needed:

    • Increase "Max Search Results" for more context (the configuration above already uses 50)
    • Modify "Max Block Size" for larger code blocks (1000 → 1500)
  4. Set up file watching: KiloCode auto-reindexes changed files

    • Edit a file and save
    • Watch Qdrant points_count update
    • Verify search finds new content

Summary

You now have a production-ready local codebase indexing system:

  • ✅ Qwen3-Embedding-8B: State-of-the-art code embeddings (SOTA for consumer GPUs, 80.68 on MTEB Code)
  • ✅ Qdrant: Fast, efficient vector database
  • ✅ 4096 dimensions: Maximum quality (Qwen3-8B output)
  • ✅ Local setup: Complete privacy, no API costs
  • ✅ GPU-accelerated: Fast indexing and search

Your setup delivers:

  • Fast local search (milliseconds)
  • High retrieval accuracy
  • Minimal ongoing electricity costs
  • Unlimited searches with no rate limits

Architecture:

KiloCode (VS Code)
    ↓
Ollama (qwen3-embedding:8b-fp16) → RTX 4090 (15GB VRAM)
    ↓
Qdrant (kilocode_codebase collection) → ~400MB RAM for 10K code blocks

Happy coding! 🚀


Document Version: 1.0
Last Updated: November 23, 2025
Author: AI Implementation Guide
Project: KiloCode Codebase Indexing with Qwen3-Embedding-8B + Qdrant