
perf: switch to fastembed ONNX backend + user-selectable model/backend#29

Merged
iamvirul merged 1 commit into main from perf/lazy-import-startup
Feb 26, 2026

Conversation

@iamvirul
Member

Pull Request

Type of Change

  • ✨ New feature
  • ♻️ Refactor
  • 📖 Documentation update
  • 🔧 Chore (build process, CI/CD, dependency updates)
  • ✅ Test improvement

Description

The MCP server was slow to start (~6.6s) because sentence_transformers (and therefore PyTorch) was imported at module level, delaying the first tool call in every Claude Code session.

This PR replaces the default embedding backend with fastembed (ONNX Runtime), bringing startup from ~6.6s down to ~1.25s. It also adds user-selectable backend and model via environment variables.

Related Issues / PRs

Closes #27
Closes #28

Changes Made

  • src/vecgrep/embedder.py — full rewrite with dual-backend lazy loading:
    • VECGREP_BACKEND=onnx (default): fastembed + ONNX Runtime, ~100ms model load
    • VECGREP_BACKEND=torch: sentence-transformers + PyTorch, supports any HF model
    • VECGREP_MODEL: override the default HuggingFace model
    • All heavy imports deferred to first embed() call
    • Registers isuruwijesiri/all-MiniLM-L6-v2-code-search-512 as a custom fastembed ONNX model
  • pyproject.toml — add fastembed>=0.4.0 runtime dependency
  • tests/test_embedder.py — add TestTorchBackend class + fix TestDetectDevice for lazy import pattern
  • README.md — add Configuration section documenting VECGREP_BACKEND and VECGREP_MODEL
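The dual-backend lazy-loading pattern described above can be sketched roughly as follows. The actual embedder.py is not shown on this page, so the class name (LazyEmbedder) and method details are illustrative; only the env var names, the default model id, and the backend split come from the PR.

```python
# Illustrative sketch of the dual-backend lazy-loading embedder.
# Heavy imports (fastembed / sentence-transformers) are deferred to the
# first embed() call so that importing this module stays cheap.
import os

# Default model registered as a custom fastembed ONNX model (per the PR).
DEFAULT_MODEL = "isuruwijesiri/all-MiniLM-L6-v2-code-search-512"


class LazyEmbedder:
    """Resolves backend/model from env vars; loads nothing until embed()."""

    def __init__(self) -> None:
        self.backend = os.environ.get("VECGREP_BACKEND", "onnx")
        self.model_name = os.environ.get("VECGREP_MODEL", DEFAULT_MODEL)
        self._model = None  # populated lazily on first embed()

    def _load(self) -> None:
        if self.backend == "torch":
            # Opt-in PyTorch path: supports any HuggingFace model,
            # but pays the ~2-3s sentence-transformers load cost.
            from sentence_transformers import SentenceTransformer
            self._model = SentenceTransformer(self.model_name)
        else:
            # Default ONNX path via fastembed (~100ms model load).
            from fastembed import TextEmbedding
            self._model = TextEmbedding(model_name=self.model_name)

    def embed(self, texts):
        if self._model is None:
            self._load()  # heavy imports happen here, not at import time
        if self.backend == "torch":
            return self._model.encode(texts, normalize_embeddings=True)
        return list(self._model.embed(texts))
```

Because the env vars are read in `__init__`, switching backends is as simple as `VECGREP_BACKEND=torch vecgrep ...`, with no code changes.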

Testing

  • Unit tests
  • Manual testing (describe steps below)

All 110 existing tests pass. New tests added:

  • TestTorchBackend: validates shape (1, 384) and unit-norm vectors via VECGREP_BACKEND=torch
  • TestDetectDevice: covers cuda/mps/cpu paths by patching torch.cuda.is_available and torch.backends.mps.is_available directly
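The cuda/mps/cpu fallback logic the tests exercise can be sketched like this. The real code patches torch.cuda.is_available and torch.backends.mps.is_available; here the torch module is injected as a parameter so the same branching can be shown (and tested) without PyTorch installed, and both helper names are hypothetical.

```python
# Hypothetical sketch of device detection with the torch module injected,
# mirroring the cuda -> mps -> cpu fallback order the tests cover.
from types import SimpleNamespace


def detect_device(torch_mod) -> str:
    """Return 'cuda', 'mps', or 'cpu' based on availability checks."""
    if torch_mod.cuda.is_available():
        return "cuda"
    if torch_mod.backends.mps.is_available():
        return "mps"
    return "cpu"


def fake_torch(cuda: bool, mps: bool):
    """Build a stand-in for the torch module with fixed availability."""
    return SimpleNamespace(
        cuda=SimpleNamespace(is_available=lambda: cuda),
        backends=SimpleNamespace(mps=SimpleNamespace(is_available=lambda: mps)),
    )
```

Injecting (or patching) the availability checks keeps the tests fast and deterministic on machines without a GPU.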

Startup benchmark:

Before: python -c "import vecgrep.server"  → ~6.6s
After:  python -c "import vecgrep.server"  → ~1.25s  (5× faster)
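A benchmark like the one above can be reproduced with a small timing helper; this is a generic sketch (using a stdlib module as a stand-in, since vecgrep may not be installed where this runs), not the PR's actual measurement script.

```python
# Generic sketch for timing module import cost, the metric benchmarked above.
import importlib
import sys
import time


def time_import(name: str) -> float:
    """Return wall-clock seconds to import `name` from a cold cache."""
    sys.modules.pop(name, None)  # drop any cached copy to force a fresh import
    start = time.perf_counter()
    importlib.import_module(name)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Substitute "vecgrep.server" to reproduce the PR's before/after numbers.
    print(f"import json took {time_import('json') * 1000:.1f} ms")
```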

Checklist

  • My code follows the project's style guidelines (ruff passes)
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation (if applicable)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

perf: switch to fastembed ONNX backend + user-selectable model/backend

- Replace sentence-transformers module-level import with dual-backend
  lazy loading system (fastembed ONNX default, torch opt-in)
- Startup time: ~6.6s → ~1.25s (5× improvement)
- ONNX model load: ~100ms vs ~2-3s for PyTorch on first embed() call
- Register isuruwijesiri/all-MiniLM-L6-v2-code-search-512 as custom
  fastembed model via TextEmbedding.add_custom_model() with ONNX files
  from HuggingFace
- Add VECGREP_BACKEND env var (onnx|torch) for backend selection
- Add VECGREP_MODEL env var for custom HuggingFace model selection
- Add fastembed>=0.4.0 to runtime dependencies
- Update tests to cover torch backend and all device detection paths
- Document new env vars in README

Closes #28
Closes #27
@iamvirul iamvirul self-assigned this Feb 26, 2026
@iamvirul iamvirul merged commit dcfa869 into main Feb 26, 2026
1 check passed
@codecov

codecov bot commented Feb 26, 2026

Codecov Report

❌ Patch coverage is 93.93939% with 2 lines in your changes missing coverage. Please review.

Files with missing lines    Patch %    Lines
src/vecgrep/embedder.py     93.93%     2 Missing ⚠️


@iamvirul iamvirul mentioned this pull request Feb 28, 2026
