Qwen3-Embedding with Ollama: Guide & FAQ

Date: November 23, 2025
Project: KiloCode Codebase Indexing
Model: qwen3-embedding:8b-fp16


Executive Summary

The default Ollama modelfile for qwen3-embedding:8b-fp16 requires NO modification. Unlike text generation models, embedding models don't use parameters like temperature, top_p, or num_ctx. The minimal default modelfile is exactly what's needed for optimal performance.


Table of Contents

  1. Understanding the Default Modelfile
  2. Why Embedding Models Are Different
  3. What Ollama Handles Automatically
  4. What NOT to Add to the Modelfile
  5. Context Window Details
  6. Dimension Configuration
  7. Common Questions (FAQ)
  8. Summary and Best Practices
  9. References

Understanding the Default Modelfile

When you run ollama show qwen3-embedding:8b-fp16 --modelfile, you see:

# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM qwen3-embedding:8b-fp16

FROM /root/.ollama/models/blobs/sha256-9a2dfcc2e867828909456dd52a69e3775b677bdce1816f7cc55f3657393e7e53
TEMPLATE {{ .Prompt }}

This is complete and correct. Here's why:

Component Breakdown

  1. FROM: Points to the model weights blob

    • This is the actual Qwen3-Embedding-8B model stored at FP16 (half) precision
    • ~15GB of trained neural network parameters
  2. TEMPLATE: Defines input format

    • {{ .Prompt }} means "insert the input text here"
    • No special formatting needed for embeddings
    • Ollama handles Qwen3-specific tokens automatically

That's it. Nothing else is needed or beneficial to add.


Why Embedding Models Are Different

Text Generation Models (e.g., Llama, Qwen3-Instruct)

Purpose: Create new text by predicting the next token

Key Parameters:

PARAMETER temperature 0.8       # Controls randomness/creativity
PARAMETER num_ctx 4096          # Context window size for generation
PARAMETER top_p 0.9             # Nucleus sampling threshold
PARAMETER repeat_penalty 1.1    # Penalizes repetition
PARAMETER stop "<|endoftext|>"  # Stop sequences

Process:

  1. Takes input text
  2. Generates probability distribution over next possible tokens
  3. Samples from distribution (using temperature, top_p, etc.)
  4. Repeats until stop condition or max length

Embedding Models (e.g., qwen3-embedding)

Purpose: Convert text into fixed-size numerical vectors

Key Parameters:

(none - embeddings are deterministic)

Process:

  1. Takes input text
  2. Processes through neural network layers
  3. Outputs fixed-size vector (e.g., 4096 dimensions)
  4. Same input ALWAYS produces same output

No sampling, no randomness, no generation = no need for those parameters.
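That determinism is what makes embeddings useful for search: vectors are compared with cosine similarity, and because the same input always yields the same vector, an input scores exactly 1.0 against itself. A minimal sketch using toy vectors (not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical inputs -> identical embeddings -> similarity of exactly 1.0
v = [0.1, -0.3, 0.7]
print(cosine_similarity(v, v))  # 1.0 (up to float rounding)
```

In a real pipeline, `a` and `b` would be 4096-dimensional vectors returned by Ollama's embeddings endpoint.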


What Ollama Handles Automatically

According to the official Qwen3-Embedding documentation, these models have specific requirements:

Requirements (Handled by Ollama)

  1. Special Token Appending

    • Qwen3-embedding needs <|endoftext|> appended to inputs
    • ✅ Ollama does this automatically via the /api/embeddings endpoint
  2. Output Normalization

    • Embedding vectors must be normalized (L2 norm = 1)
    • ✅ Ollama handles this internally
  3. Pooling Strategy

    • Uses "last token" pooling to create embeddings
    • ✅ Built into the model architecture
  4. Context Window

    • Supports up to 32,000 tokens
    • ✅ Built into model, no configuration needed
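For intuition on point 2, L2 normalization just rescales a vector to unit length so that dot products become cosine similarities. A toy sketch (a 2-dim stand-in for a real 4096-dim embedding):

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def l2_normalize(vec):
    """Rescale a vector so its L2 norm is 1 - conceptually what Ollama
    does to embedding outputs internally."""
    norm = l2_norm(vec)
    return [x / norm for x in vec]

raw = [3.0, 4.0]          # toy "embedding", norm = 5
unit = l2_normalize(raw)  # [0.6, 0.8], norm = 1
print(l2_norm(unit))      # 1.0
```

You never need to do this yourself when using Ollama; it is shown here only to demystify the requirement.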

What This Means

You don't need to add anything to the modelfile to enable these features. Ollama's embedding API endpoint is specifically designed to handle embedding models correctly.


What NOT to Add to the Modelfile

Don't add these parameters - they are irrelevant or counterproductive for embeddings:

# ❌ DON'T ADD THESE ❌

PARAMETER temperature 0.8
# Why: Embeddings are deterministic, no sampling occurs

PARAMETER num_ctx 32000
# Why: Context window is built into the model architecture

PARAMETER top_p 0.9
PARAMETER top_k 50
# Why: No probability distribution sampling happens

PARAMETER repeat_penalty 1.1
# Why: Not generating text, so no repetition to penalize

PARAMETER stop "<|endoftext|>"
# Why: Not generating sequences that need to stop

SYSTEM "You are a helpful assistant"
# Why: Embeddings don't have personalities or roles

Adding these parameters won't cause errors, but they'll be completely ignored by Ollama's embedding endpoint. They're visual clutter that serves no purpose.


Context Window Details

Built-In Capacity

Qwen3-Embedding-8B supports 32K (32,768) tokens of input context:

  • ~25,000 words
  • ~180,000 characters
  • Entire chapters of books
  • Large code files

For KiloCode Codebase Indexing

Your KiloCode settings:

  • Min block size: 100 characters
  • Max block size: 1,000 characters

Result: Each code block uses well under 1% of the available context window (roughly 250 tokens out of 32,768). There's enormous headroom - you'll never hit the limit.
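The headroom arithmetic, using the common rule of thumb of ~4 characters per token (an assumption - actual tokenizer counts vary by content):

```python
# Rough headroom estimate: KiloCode's max block size vs the 32K window.
# Assumes ~4 characters per token, a common rule of thumb (actual counts vary).
MAX_BLOCK_CHARS = 1_000
CHARS_PER_TOKEN = 4
CONTEXT_TOKENS = 32_768

block_tokens = MAX_BLOCK_CHARS / CHARS_PER_TOKEN   # ~250 tokens
usage = block_tokens / CONTEXT_TOKENS              # fraction of window used
print(f"~{block_tokens:.0f} tokens per block, {usage:.2%} of the context window")
# → ~250 tokens per block, 0.76% of the context window
```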

Why You Don't Configure It

The 32K context window is hardcoded in the model's architecture:

# Inside the model architecture (conceptual)
class Qwen3Embedding:
    def __init__(self):
        self.max_position_embeddings = 32768  # Built-in
        self.hidden_size = 4096
        # ... other architecture details

You can't change it via modelfile, and you don't need to - it's already optimal.


Dimension Configuration (Matryoshka Support)

Default Behavior

Qwen3-Embedding-8B outputs 4096 dimensions by default when used through Ollama's API:

curl http://localhost:11434/api/embeddings -d '{
  "model": "qwen3-embedding:8b-fp16",
  "prompt": "test"
}'

# Returns: array of 4096 floats

Why 4096 is the Default

This provides:

  • 100% maximum model quality (no truncation)
  • No configuration needed (works out of the box)
  • Simplicity over optimization
  • Best accuracy for code search tasks

Changing Dimensions (If Needed)

The model supports flexible output dimensions (32-4096) via Matryoshka Representation Learning. However, this is NOT configured in the modelfile, and whether Ollama's API accepts a dimension option varies by version (check your version's API docs). The portable approach is to request the full vector and truncate it client-side: keep the first N components, then re-normalize.

For KiloCode: The dimension setting goes in KiloCode's configuration UI, not in the Ollama modelfile.
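If you do need fewer dimensions, Matryoshka-trained models tolerate simple client-side truncation. A sketch (toy 4-dim vector standing in for a real 4096-dim embedding):

```python
import math

def truncate_embedding(vec, dims):
    """Matryoshka-style truncation: keep the first `dims` components,
    then re-normalize so the shorter vector still has unit L2 norm."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]          # stand-in for a 4096-dim embedding
short = truncate_embedding(full, 2)  # keep the first 2 dims
print(short, math.sqrt(sum(x * x for x in short)))
```

Remember that all vectors in the same index must use the same dimension, so pick one before indexing your codebase.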


Common Questions

Q: Should I add PARAMETER num_ctx 32000 to utilize the full context?

A: No. The 32K context window is already built into the model. Adding this parameter:

  • Won't increase the context limit (already at max)
  • Won't improve performance
  • Has no effect on embedding generation

Q: What about temperature for more "creative" embeddings?

A: This is a misunderstanding of what embeddings are. Embeddings are deterministic mathematical transformations, not creative generation. There's no creativity slider - the embedding of "hello" is always the same vector, by design.

Q: Can I add a SYSTEM prompt to make embeddings better for code?

A: No. The model was trained specifically for code embeddings. Adding a system prompt:

  • Has no effect (embeddings don't use prompts)
  • Might confuse the API endpoint
  • Is conceptually wrong - embeddings encode meaning, not follow instructions

Q: Should I quantize the model further (Q8, Q4)?

A: You're already using FP16, which is the highest quality. Quantizing further:

  • Saves VRAM (useful if you have <16GB)
  • Reduces quality by ~0.5-4% depending on quantization
  • For your hardware (24GB VRAM), FP16 is optimal

From your setup report, you chose FP16 specifically for maximum quality. Stick with it.

Q: The modelfile looks "incomplete" compared to Llama models. Is something wrong?

A: No. Embedding models ARE simpler than generation models. Compare:

Generation Model Modelfile (~20 lines):

FROM llama3.1:8b
TEMPLATE """[INST] {{ .Prompt }} [/INST]"""
PARAMETER temperature 0.8
PARAMETER num_ctx 4096
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
SYSTEM "You are a helpful assistant"

Embedding Model Modelfile (~3 lines):

FROM qwen3-embedding:8b-fp16
TEMPLATE {{ .Prompt }}

The embedding model modelfile is shorter because it does one thing well: convert text to vectors. It doesn't need all the generation control parameters.


Summary and Best Practices

✅ Do This

  1. Use the default modelfile - it's perfect as-is
  2. Test with curl before configuring KiloCode
  3. Monitor VRAM to ensure model fits comfortably
  4. Keep model name exact: qwen3-embedding:8b-fp16
  5. Let Ollama handle special tokens and normalization

❌ Don't Do This

  1. Don't add generation parameters (temperature, top_p, etc.)
  2. Don't modify TEMPLATE - keep it as {{ .Prompt }}
  3. Don't add SYSTEM prompts - embeddings don't use them
  4. Don't try to set context window - it's already 32K
  5. Don't overthink it - simpler is better for embedding models

Key Takeaway

The "minimal" modelfile is actually the optimal configuration. Embedding models are fundamentally simpler than generation models, and their modelfiles should reflect that simplicity. Trust the defaults; they're correct by design.


References


Document Version: 2.0
Last Updated: November 23, 2025
Purpose: Configuration Guide & FAQ
Status: Verified Configuration