Qwen3-Embedding with Ollama: Guide & FAQ

Date: November 23, 2025
Project: KiloCode Codebase Indexing
Model: qwen3-embedding:8b-fp16


Executive Summary

The default Ollama modelfile for qwen3-embedding:8b-fp16 requires NO modification. Unlike text generation models, embedding models don't use parameters like temperature, top_p, or num_ctx. The minimal default modelfile is exactly what's needed for optimal performance.


Table of Contents

  1. Understanding the Default Modelfile
  2. Why Embedding Models Are Different
  3. What Ollama Handles Automatically
  4. What NOT to Add to the Modelfile
  5. Context Window Details
  6. Dimension Configuration
  7. Common Questions (FAQ)
  8. Summary and Best Practices
  9. References

Understanding the Default Modelfile

When you run ollama show qwen3-embedding:8b-fp16 --modelfile, you see:

# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM qwen3-embedding:8b-fp16

FROM /root/.ollama/models/blobs/sha256-9a2dfcc2e867828909456dd52a69e3775b677bdce1816f7cc55f3657393e7e53
TEMPLATE {{ .Prompt }}

This is complete and correct. Here's why:

Component Breakdown

  1. FROM: Points to the model weights blob

    • This is the actual Qwen3-Embedding-8B model stored at FP16 (half) precision
    • ~15GB of trained neural network parameters
  2. TEMPLATE: Defines input format

    • {{ .Prompt }} means "insert the input text here"
    • No special formatting needed for embeddings
    • Ollama handles Qwen3-specific tokens automatically

That's it. Nothing else is needed or beneficial to add.


Why Embedding Models Are Different

Text Generation Models (e.g., Llama, Qwen3-Instruct)

Purpose: Create new text by predicting the next token

Key Parameters:

PARAMETER temperature 0.8       # Controls randomness/creativity
PARAMETER num_ctx 4096          # Context window size for generation
PARAMETER top_p 0.9             # Nucleus sampling threshold
PARAMETER repeat_penalty 1.1    # Penalizes repetition
PARAMETER stop "<|endoftext|>"  # Stop sequences

Process:

  1. Takes input text
  2. Generates probability distribution over next possible tokens
  3. Samples from distribution (using temperature, top_p, etc.)
  4. Repeats until stop condition or max length

Embedding Models (e.g., qwen3-embedding)

Purpose: Convert text into fixed-size numerical vectors

Key Parameters:

(none - embeddings are deterministic)

Process:

  1. Takes input text
  2. Processes through neural network layers
  3. Outputs fixed-size vector (e.g., 4096 dimensions)
  4. Same input ALWAYS produces same output

No sampling, no randomness, no generation = no need for those parameters.
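That determinism is what makes embeddings useful for search: vectors are compared with cosine similarity, and because the same input always yields the same vector, an input scores exactly 1.0 against itself. A minimal sketch using toy vectors (not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical inputs -> identical embeddings -> similarity of exactly 1.0
v = [0.1, -0.3, 0.7]
print(cosine_similarity(v, v))  # 1.0 (up to float rounding)
```

In a real pipeline, `a` and `b` would be 4096-dimensional vectors returned by Ollama's embeddings endpoint.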


What Ollama Handles Automatically

According to the official Qwen3-Embedding documentation, these models have specific requirements:

Requirements (Handled by Ollama)

  1. Special Token Appending

    • Qwen3-embedding needs <|endoftext|> appended to inputs
    • ✅ Ollama does this automatically via the /api/embeddings endpoint
  2. Output Normalization

    • Embedding vectors must be normalized (L2 norm = 1)
    • ✅ Ollama handles this internally
  3. Pooling Strategy

    • Uses "last token" pooling to create embeddings
    • ✅ Built into the model architecture
  4. Context Window

    • Supports up to 32,000 tokens
    • ✅ Built into model, no configuration needed
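For intuition on point 2, L2 normalization just rescales a vector to unit length so that dot products become cosine similarities. A toy sketch (a 2-dim stand-in for a real 4096-dim embedding):

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def l2_normalize(vec):
    """Rescale a vector so its L2 norm is 1 - conceptually what Ollama
    does to embedding outputs internally."""
    norm = l2_norm(vec)
    return [x / norm for x in vec]

raw = [3.0, 4.0]          # toy "embedding", norm = 5
unit = l2_normalize(raw)  # [0.6, 0.8], norm = 1
print(l2_norm(unit))      # 1.0
```

You never need to do this yourself when using Ollama; it is shown here only to demystify the requirement.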

What This Means

You don't need to add anything to the modelfile to enable these features. Ollama's embedding API endpoint is specifically designed to handle embedding models correctly.


What NOT to Add to the Modelfile

Don't add these parameters - they are irrelevant or counterproductive for embeddings:

# ❌ DON'T ADD THESE ❌

PARAMETER temperature 0.8
# Why: Embeddings are deterministic, no sampling occurs

PARAMETER num_ctx 32000
# Why: Context window is built into the model architecture

PARAMETER top_p 0.9
PARAMETER top_k 50
# Why: No probability distribution sampling happens

PARAMETER repeat_penalty 1.1
# Why: Not generating text, so no repetition to penalize

PARAMETER stop "<|endoftext|>"
# Why: Not generating sequences that need to stop

SYSTEM "You are a helpful assistant"
# Why: Embeddings don't have personalities or roles

Adding these parameters won't cause errors, but they'll be completely ignored by Ollama's embedding endpoint. They're visual clutter that serves no purpose.


Context Window Details

Built-In Capacity

Qwen3-Embedding-8B supports 32K (32,768) tokens of input context:

  • ~25,000 words
  • ~180,000 characters
  • Entire chapters of books
  • Large code files

For KiloCode Codebase Indexing

Your KiloCode settings:

  • Min block size: 100 characters
  • Max block size: 1,000 characters

Result: Each code block uses well under 1% of the available context window (roughly 250 tokens out of 32,768). There's enormous headroom - you'll never hit the limit.
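The headroom arithmetic, using the common rule of thumb of ~4 characters per token (an assumption - actual tokenizer counts vary by content):

```python
# Rough headroom estimate: KiloCode's max block size vs the 32K window.
# Assumes ~4 characters per token, a common rule of thumb (actual counts vary).
MAX_BLOCK_CHARS = 1_000
CHARS_PER_TOKEN = 4
CONTEXT_TOKENS = 32_768

block_tokens = MAX_BLOCK_CHARS / CHARS_PER_TOKEN   # ~250 tokens
usage = block_tokens / CONTEXT_TOKENS              # fraction of window used
print(f"~{block_tokens:.0f} tokens per block, {usage:.2%} of the context window")
# → ~250 tokens per block, 0.76% of the context window
```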

Why You Don't Configure It

The 32K context window is hardcoded in the model's architecture:

# Inside the model architecture (conceptual)
class Qwen3Embedding:
    def __init__(self):
        self.max_position_embeddings = 32768  # Built-in
        self.hidden_size = 4096
        # ... other architecture details

You can't change it via modelfile, and you don't need to - it's already optimal.


Dimension Configuration (Matryoshka Support)

Default Behavior

Qwen3-Embedding-8B outputs 4096 dimensions by default when used through Ollama's API:

curl http://localhost:11434/api/embeddings -d '{
  "model": "qwen3-embedding:8b-fp16",
  "prompt": "test"
}'

# Returns: array of 4096 floats

Why 4096 is the Default

This provides:

  • 100% maximum model quality (no truncation)
  • No configuration needed (works out of the box)
  • Simplicity over optimization
  • Best accuracy for code search tasks

Changing Dimensions (If Needed)

The model supports flexible output dimensions (32-4096) via Matryoshka Representation Learning. However, this is NOT configured in the modelfile, and whether Ollama's API accepts a dimension option varies by version (check your version's API docs). The portable approach is to request the full vector and truncate it client-side: keep the first N components, then re-normalize.

For KiloCode: The dimension setting goes in KiloCode's configuration UI, not in the Ollama modelfile.
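If you do need fewer dimensions, Matryoshka-trained models tolerate simple client-side truncation. A sketch (toy 4-dim vector standing in for a real 4096-dim embedding):

```python
import math

def truncate_embedding(vec, dims):
    """Matryoshka-style truncation: keep the first `dims` components,
    then re-normalize so the shorter vector still has unit L2 norm."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]          # stand-in for a 4096-dim embedding
short = truncate_embedding(full, 2)  # keep the first 2 dims
print(short, math.sqrt(sum(x * x for x in short)))
```

Remember that all vectors in the same index must use the same dimension, so pick one before indexing your codebase.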


Common Questions

Q: Should I add PARAMETER num_ctx 32000 to utilize the full context?

A: No. The 32K context window is already built into the model. Adding this parameter:

  • Won't increase the context limit (already at max)
  • Won't improve performance
  • Has no effect on embedding generation

Q: What about temperature for more "creative" embeddings?

A: This is a misunderstanding of what embeddings are. Embeddings are deterministic mathematical transformations, not creative generation. There's no creativity slider - the embedding of "hello" is always the same vector, by design.

Q: Can I add a SYSTEM prompt to make embeddings better for code?

A: No. The model was trained specifically for code embeddings. Adding a system prompt:

  • Has no effect (embeddings don't use prompts)
  • Might confuse the API endpoint
  • Is conceptually wrong - embeddings encode meaning, not follow instructions

Q: Should I quantize the model further (Q8, Q4)?

A: You're already using FP16, which is the highest quality. Quantizing further:

  • Saves VRAM (useful if you have <16GB)
  • Reduces quality by ~0.5-4% depending on quantization
  • For your hardware (24GB VRAM), FP16 is optimal

From your setup report, you chose FP16 specifically for maximum quality. Stick with it.

Q: The modelfile looks "incomplete" compared to Llama models. Is something wrong?

A: No. Embedding models ARE simpler than generation models. Compare:

Generation Model Modelfile (~20 lines):

FROM llama3.1:8b
TEMPLATE """[INST] {{ .Prompt }} [/INST]"""
PARAMETER temperature 0.8
PARAMETER num_ctx 4096
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
SYSTEM "You are a helpful assistant"

Embedding Model Modelfile (~3 lines):

FROM qwen3-embedding:8b-fp16
TEMPLATE {{ .Prompt }}

The embedding model modelfile is shorter because it does one thing well: convert text to vectors. It doesn't need all the generation control parameters.


Summary and Best Practices

✅ Do This

  1. Use the default modelfile - it's perfect as-is
  2. Test with curl before configuring KiloCode
  3. Monitor VRAM to ensure model fits comfortably
  4. Keep model name exact: qwen3-embedding:8b-fp16
  5. Let Ollama handle special tokens and normalization

❌ Don't Do This

  1. Don't add generation parameters (temperature, top_p, etc.)
  2. Don't modify TEMPLATE - keep it as {{ .Prompt }}
  3. Don't add SYSTEM prompts - embeddings don't use them
  4. Don't try to set context window - it's already 32K
  5. Don't overthink it - simpler is better for embedding models

Key Takeaway

The "minimal" modelfile is actually the optimal configuration. Embedding models are fundamentally simpler than generation models, and their modelfiles should reflect that simplicity. Trust the defaults; they're correct by design.


References


Document Version: 2.0
Last Updated: November 23, 2025
Purpose: Configuration Guide & FAQ
Status: Verified Configuration