research: evaluate grepai vs Morph for semantic code search #875

@christso

Description

Objective

Use AgentV as the benchmark harness to evaluate grepai vs Morph for semantic code search — demonstrating AgentV as an alternative to SWE-Bench for tool evaluation.

Context

Morph claims #1 on SWE-Bench Pro. Rather than relying on external benchmarks, we should design AgentV evals that measure code search tool effectiveness directly. This serves two purposes:

  1. Compare grepai vs Morph on dimensions that matter for agentic workflows
  2. Prove AgentV as a benchmark harness — if we can eval code search tools with AgentV, others can too

grepai (open-source, self-hosted)

  • Go CLI, local vector embeddings + similarity search
  • Swappable backends: embedders (Ollama, OpenAI, LM Studio) and vector stores (GOB, pgvector, Qdrant)
  • Hybrid search: vector similarity + text matching via Reciprocal Rank Fusion (RRF)
  • MCP server mode (mcp-serve) exposes search as native AI agent tools
  • Multi-project workspace support with hierarchical config
  • Research: agentevals-research/research/findings/grepai/README.md
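grepai's hybrid search fuses the vector-similarity ranking and the text-match ranking with Reciprocal Rank Fusion. A minimal sketch of standard RRF scoring — the file names and `k=60` constant are illustrative, not grepai's actual internals:

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse multiple ranked result lists via Reciprocal Rank Fusion.

    rankings: list of ranked lists of doc ids, best first.
    k: smoothing constant; 60 is the value from the original RRF paper.
    """
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: fuse a vector-similarity ranking with a text-match ranking
vector_hits = ["search.go", "index.go", "embed.go"]
text_hits = ["search.go", "config.go", "index.go"]
fused = rrf_fuse([vector_hits, text_hits])
```

Because RRF works on ranks rather than raw scores, it needs no score normalization between the two backends — a file that appears high in both lists dominates the fused ranking.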

Morph (commercial API, YC-backed)

  • WarpGrep: Dedicated search LLM, 8 parallel tool calls per turn, ~3.8 steps to results
  • Fast Apply: Merges edit snippets at 10,500+ tok/s, 98% accuracy
  • Compact: Compresses context 50-70% in <2s
  • Claims #1 on SWE-Bench Pro (BbEval TypeScript Migration), 15.8% cheaper and 22% faster
  • Available as MCP server and liteLLM provider

Eval Design (AgentV as Harness)

Design AgentV eval cases that measure code search tools on real tasks:

  • Retrieval accuracy: Given a natural language query + known-relevant files, does the tool return them? (precision/recall)
  • End-to-end task completion: Agent with grepai MCP vs agent with Morph MCP — which leads to more correct solutions?
  • Latency & cost: Measure wall-clock time and token/compute cost per search across eval runs
  • Context efficiency: How much relevant context does each tool surface vs noise?
  • Privacy tradeoff: Local-only (grepai) vs API-dependent (Morph) — eval with air-gapped constraints
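The retrieval-accuracy case reduces to set-based precision/recall against the known-relevant file list. A minimal sketch of that grading step — the function name and file paths are hypothetical, not part of the AgentV API:

```python
def retrieval_metrics(returned, relevant):
    """Score one retrieval-accuracy eval case.

    returned: files the search tool surfaced for the query.
    relevant: ground-truth files known to be relevant.
    Returns (precision, recall).
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical eval case: tool returned 3 files, 2 of 4 ground-truth files among them
p, r = retrieval_metrics(
    ["pkg/search.go", "pkg/index.go", "cmd/main.go"],
    ["pkg/search.go", "pkg/index.go", "pkg/embed.go", "pkg/store.go"],
)
```

Precision penalizes noise (relevant for the context-efficiency dimension above), while recall catches misses that would derail an end-to-end agent run.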

Non-Goals

  • Reproducing SWE-Bench itself inside AgentV
  • Building code search into AgentV core

Acceptance Signals

  • AgentV eval file(s) that benchmark code search tool effectiveness
  • Comparison results: grepai vs Morph graded by AgentV
  • Writeup on AgentV-as-harness viability for tool evaluation
