Date: November 23, 2025 Project: KiloCode Codebase Indexing Model: qwen3-embedding:8b-fp16
The default Ollama modelfile for qwen3-embedding:8b-fp16 requires NO modification. Unlike text generation models, embedding models don't use parameters like temperature, top_p, or num_ctx. The minimal default modelfile is exactly what's needed for optimal performance.
- Understanding the Default Modelfile
- Why Embedding Models Are Different
- What Ollama Handles Automatically
- What NOT to Add to the Modelfile
- Context Window Details
- Dimension Configuration
- Common Questions (FAQ)
- Summary and Best Practices
- References
When you run ollama show qwen3-embedding:8b-fp16 --modelfile, you see:
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM qwen3-embedding:8b-fp16
FROM /root/.ollama/models/blobs/sha256-9a2dfcc2e867828909456dd52a69e3775b677bdce1816f7cc55f3657393e7e53
TEMPLATE {{ .Prompt }}
This is complete and correct. Here's why:
- FROM: Points to the model weights blob
  - This is the actual FP16 Qwen3-Embedding-8B model
  - ~15 GB of trained neural network parameters
- TEMPLATE: Defines the input format
  - {{ .Prompt }} means "insert the input text here"
  - No special formatting needed for embeddings
  - Ollama handles Qwen3-specific tokens automatically
That's it. Nothing else is needed or beneficial to add.
Purpose: Create new text by predicting the next token
Key Parameters:
PARAMETER temperature 0.8 # Controls randomness/creativity
PARAMETER num_ctx 4096 # Context window size for generation
PARAMETER top_p 0.9 # Nucleus sampling threshold
PARAMETER repeat_penalty 1.1 # Penalizes repetition
PARAMETER stop "<|endoftext|>" # Stop sequences
Process:
- Takes input text
- Generates probability distribution over next possible tokens
- Samples from distribution (using temperature, top_p, etc.)
- Repeats until stop condition or max length
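To make the contrast concrete, here is a toy sketch (not Ollama's actual implementation) of how the temperature parameter reshapes the next-token distribution before sampling:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                        # hypothetical scores for three candidate tokens
cool = softmax_with_temperature(logits, 0.5)    # low temperature: near-greedy, top token dominates
hot = softmax_with_temperature(logits, 2.0)     # high temperature: flatter, more random
print(cool[0] > hot[0])                         # True: low temperature concentrates probability
```

None of this machinery exists in the embedding path, which is why the parameter is meaningless there.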
Purpose: Convert text into fixed-size numerical vectors
Key Parameters:
(none - embeddings are deterministic)
Process:
- Takes input text
- Processes through neural network layers
- Outputs fixed-size vector (e.g., 4096 dimensions)
- Same input ALWAYS produces same output
No sampling, no randomness, no generation = no need for those parameters.
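That determinism is exactly what makes embeddings usable for search: comparing two texts reduces to comparing their vectors, typically with cosine similarity. A minimal sketch, using toy 3-dimensional vectors in place of real 4096-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 0.0, 0.0]   # toy stand-ins for real 4096-dim embeddings
v2 = [0.6, 0.8, 0.0]
print(cosine_similarity(v1, v1))  # identical input -> identical vector -> similarity 1.0
print(cosine_similarity(v1, v2))  # ~0.6
```

Because Ollama returns L2-normalized vectors, the denominator is ~1 and the similarity reduces to a plain dot product.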
According to the official Qwen3-Embedding documentation, these models have specific requirements:
- Special Token Appending
  - Qwen3-Embedding needs <|endoftext|> appended to inputs
  - ✅ Ollama does this automatically via the /api/embeddings endpoint
- Output Normalization
  - Embedding vectors must be normalized (L2 norm = 1)
  - ✅ Ollama handles this internally
- Pooling Strategy
  - Uses "last token" pooling to create embeddings
  - ✅ Built into the model architecture
- Context Window
  - Supports up to 32,000 tokens
  - ✅ Built into the model, no configuration needed
You don't need to add anything to the modelfile to enable these features. Ollama's embedding API endpoint is specifically designed to handle embedding models correctly.
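If you want to sanity-check the normalization yourself, the L2 norm of any returned vector should be ~1.0. A small sketch of both the check and the normalization step (the raw vector here is a placeholder, not real API output):

```python
import math

def l2_norm(vec):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in vec))

def l2_normalize(vec):
    """Scale a vector to unit length, as Ollama does internally for embeddings."""
    n = l2_norm(vec)
    return [x / n for x in vec]

raw = [3.0, 4.0]                # placeholder; a real embedding has 4096 components
unit = l2_normalize(raw)        # [0.6, 0.8]
print(round(l2_norm(unit), 6))  # 1.0
```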
Don't add these parameters - they are irrelevant or counterproductive for embeddings:
# ❌ DON'T ADD THESE ❌
PARAMETER temperature 0.8
# Why: Embeddings are deterministic, no sampling occurs
PARAMETER num_ctx 32000
# Why: Context window is built into the model architecture
PARAMETER top_p 0.9
PARAMETER top_k 50
# Why: No probability distribution sampling happens
PARAMETER repeat_penalty 1.1
# Why: Not generating text, so no repetition to penalize
PARAMETER stop "<|endoftext|>"
# Why: Not generating sequences that need to stop
SYSTEM "You are a helpful assistant"
# Why: Embeddings don't have personalities or roles
Adding these parameters won't cause errors, but they'll be completely ignored by Ollama's embedding endpoint. They're visual clutter that serves no purpose.
Qwen3-Embedding-8B supports 32,000 tokens of input context:
- ~25,000 words
- ~180,000 characters
- Entire chapters of books
- Large code files
Your KiloCode settings:
- Min block size: 100 characters
- Max block size: 1,000 characters
Result: You're using 0.5% of the available context window per code block. There's enormous headroom - you'll never hit the limit.
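The headroom figure follows from simple arithmetic, using the ~180,000-character estimate above:

```python
max_context_chars = 180_000  # the ~180,000-character estimate for 32K tokens
max_block_chars = 1_000      # KiloCode's maximum block size

usage = max_block_chars / max_context_chars
print(f"{usage:.2%} of the context window per block")  # 0.56%
```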
The 32K context window is hardcoded in the model's architecture:
# Inside the model architecture (conceptual)
class Qwen3Embedding:
    def __init__(self):
        self.max_position_embeddings = 32768  # Built-in
        self.hidden_size = 4096
        # ... other architecture details

You can't change it via the modelfile, and you don't need to - it's already optimal.
Qwen3-Embedding-8B outputs 4096 dimensions by default when used through Ollama's API:
curl http://localhost:11434/api/embeddings -d '{
"model": "qwen3-embedding:8b-fp16",
"prompt": "test"
}'
# Returns: array of 4096 floats

This provides:
- 100% maximum model quality (no truncation)
- No configuration needed (works out of the box)
- Simplicity over optimization
- Best accuracy for code search tasks
The model supports flexible dimensions (32-4096) via Matryoshka Representation Learning. However, this is NOT configured in the modelfile. Instead, you specify it per request in the API call (confirm the exact option name against your Ollama version's API documentation):
# Request 512 dimensions instead
curl http://localhost:11434/api/embeddings -d '{
"model": "qwen3-embedding:8b-fp16",
"prompt": "test",
"options": {
"num_embed": 512
}
}'

For KiloCode: The dimension setting goes in KiloCode's configuration UI, not in the Ollama modelfile.
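For intuition, Matryoshka Representation Learning trains the model so that a prefix of the full vector is itself a usable embedding; shrinking a vector is just truncation plus re-normalization. A sketch of that post-processing (illustrative only, with a toy 4-dimensional vector):

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` components of an MRL embedding and re-normalize to unit length."""
    prefix = vec[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

full = [0.5, 0.5, 0.5, 0.5]          # toy stand-in for a 4096-dim vector
small = truncate_embedding(full, 2)  # components ~0.7071 each; unit length again
print(len(small))                    # 2
```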
Q: Should I add PARAMETER num_ctx 32000 to increase the context window?

A: No. The 32K context window is already built into the model. Adding this parameter:
- Won't increase the context limit (already at max)
- Won't improve performance
- Has no effect on embedding generation
Q: Can I set temperature to make the embeddings more creative or varied?

A: This is a misunderstanding of what embeddings are. Embeddings are deterministic mathematical transformations, not creative generation. There's no creativity slider - the embedding of "hello" is always the same vector, by design.
Q: Should I add a SYSTEM prompt to steer the model toward code?

A: No. The model was trained specifically for code embeddings. Adding a system prompt:
- Has no effect (embeddings don't use prompts)
- Might confuse the API endpoint
- Is conceptually wrong - embeddings encode meaning, not follow instructions
Q: Should I use a smaller quantization to save VRAM?

A: You're already using FP16, which is the highest quality. Quantizing further:
- Saves VRAM (useful if you have <16GB)
- Reduces quality by ~0.5-4% depending on quantization
- For your hardware (24GB VRAM), FP16 is optimal
From your setup report, you chose FP16 specifically for maximum quality. Stick with it.
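The trade-off follows from back-of-envelope arithmetic: model weights take roughly (parameter count) × (bytes per parameter), plus runtime overhead. A rough sketch (the bytes-per-parameter figures are approximations):

```python
params = 8e9  # 8 billion parameters

# Approximate storage cost per parameter for common precisions
bytes_per_param = {
    "fp16": 2.0,
    "q8_0": 1.0,
    "q4_0": 0.5,
}

sizes_gib = {q: params * b / (1024 ** 3) for q, b in bytes_per_param.items()}
for quant, gib in sizes_gib.items():
    print(f"{quant}: ~{gib:.1f} GiB")  # fp16 works out to ~14.9 GiB, matching the ~15 GB above
```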
Q: Isn't this modelfile too minimal? Is something missing?

A: No. Embedding models ARE simpler than generation models. Compare:
Generation Model Modelfile (7 lines):
FROM llama3.1:8b
TEMPLATE """[INST] {{ .Prompt }} [/INST]"""
PARAMETER temperature 0.8
PARAMETER num_ctx 4096
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
SYSTEM "You are a helpful assistant"
Embedding Model Modelfile (2 lines):
FROM qwen3-embedding:8b-fp16
TEMPLATE {{ .Prompt }}
The embedding model modelfile is shorter because it does one thing well: convert text to vectors. It doesn't need all the generation control parameters.
- Use the default modelfile - it's perfect as-is
- Test with curl before configuring KiloCode
- Monitor VRAM to ensure model fits comfortably
- Keep the model name exact: qwen3-embedding:8b-fp16
- Let Ollama handle special tokens and normalization
- Don't add generation parameters (temperature, top_p, etc.)
- Don't modify TEMPLATE - keep it as {{ .Prompt }}
- Don't add SYSTEM prompts - embeddings don't use them
- Don't try to set context window - it's already 32K
- Don't overthink it - simpler is better for embedding models
The "minimal" modelfile is actually the optimal configuration. Embedding models are fundamentally simpler than generation models, and their modelfiles should reflect that simplicity. Trust the defaults, they're correct by design.
- Ollama Embedding Documentation: https://ollama.com/blog/embedding-models
- Ollama API Reference: https://github.com/ollama/ollama/blob/main/docs/api.md
- Qwen3-Embedding Official Docs: https://ollama.com/library/qwen3-embedding
- Qwen3-Embedding GitHub: https://github.com/QwenLM/Qwen3-Embedding
- Related Documentation: 2_EMBEDDING_MODEL_SELECTION.md, 1_CODEBASE_INDEXING_FEATURE.md
Document Version: 2.0 Last Updated: November 23, 2025 Purpose: Configuration Guide & FAQ Status: Verified Configuration