research: evaluate grepai vs Morph for semantic code search #875

@christso

Description

Objective

Use AgentV as the benchmark harness to evaluate grepai vs Morph for semantic code search — demonstrating AgentV as an alternative to SWE-Bench for tool evaluation.

Context

Morph claims #1 on SWE-Bench Pro. Rather than relying on external benchmarks, we should design AgentV evals that measure code search tool effectiveness directly. This serves two purposes:

  1. Compare grepai vs Morph on dimensions that matter for agentic workflows
  2. Prove AgentV as a benchmark harness — if we can eval code search tools with AgentV, others can too

grepai (open-source, self-hosted)

  • Go CLI, local vector embeddings + similarity search
  • Swappable backends: embedders (Ollama, OpenAI, LM Studio) and vector stores (GOB, pgvector, Qdrant)
  • Hybrid search: vector similarity + text matching via Reciprocal Rank Fusion (RRF)
  • MCP server mode (mcp-serve) exposes search as native AI agent tools
  • Multi-project workspace support with hierarchical config
  • Research: agentevals-research/research/findings/grepai/README.md
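grepai's hybrid search fuses the vector-similarity ranking and the text-match ranking with Reciprocal Rank Fusion. A minimal sketch of standard RRF scoring — the file names and `k=60` constant are illustrative, not grepai's actual internals:

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse multiple ranked result lists via Reciprocal Rank Fusion.

    rankings: list of ranked lists of doc ids, best first.
    k: smoothing constant; 60 is the value from the original RRF paper.
    """
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: fuse a vector-similarity ranking with a text-match ranking
vector_hits = ["search.go", "index.go", "embed.go"]
text_hits = ["search.go", "config.go", "index.go"]
fused = rrf_fuse([vector_hits, text_hits])
```

Because RRF works on ranks rather than raw scores, it needs no score normalization between the two backends — a file that appears high in both lists dominates the fused ranking.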

Morph (commercial API, YC-backed)

  • WarpGrep: Dedicated search LLM, 8 parallel tool calls per turn, ~3.8 steps to results
  • Fast Apply: Merges edit snippets at 10,500+ tok/s, 98% accuracy
  • Compact: Compresses context 50-70% in <2s
  • Claims #1 on SWE-Bench Pro (BbEval TypeScript Migration), 15.8% cheaper and 22% faster
  • Available as MCP server and liteLLM provider

Eval Design (AgentV as Harness)

Design AgentV eval cases that measure code search tools on real tasks:

  • Retrieval accuracy: Given a natural language query + known-relevant files, does the tool return them? (precision/recall)
  • End-to-end task completion: Agent with grepai MCP vs agent with Morph MCP — which leads to more correct solutions?
  • Latency & cost: Measure wall-clock time and token/compute cost per search across eval runs
  • Context efficiency: How much relevant context does each tool surface vs noise?
  • Privacy tradeoff: Local-only (grepai) vs API-dependent (Morph) — eval with air-gapped constraints
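The retrieval-accuracy case reduces to set-based precision/recall against the known-relevant file list. A minimal sketch of that grading step — the function name and file paths are hypothetical, not part of the AgentV API:

```python
def retrieval_metrics(returned, relevant):
    """Score one retrieval-accuracy eval case.

    returned: files the search tool surfaced for the query.
    relevant: ground-truth files known to be relevant.
    Returns (precision, recall).
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical eval case: tool returned 3 files, 2 of 4 ground-truth files among them
p, r = retrieval_metrics(
    ["pkg/search.go", "pkg/index.go", "cmd/main.go"],
    ["pkg/search.go", "pkg/index.go", "pkg/embed.go", "pkg/store.go"],
)
```

Precision penalizes noise (relevant for the context-efficiency dimension above), while recall catches misses that would derail an end-to-end agent run.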

Non-Goals

  • Reproducing SWE-Bench itself inside AgentV
  • Building code search into AgentV core

Acceptance Signals

  • AgentV eval file(s) that benchmark code search tool effectiveness
  • Comparison results: grepai vs Morph graded by AgentV
  • Writeup on AgentV-as-harness viability for tool evaluation
