Hybrid search: index RDF data into Typesense #252

## Summary

Index RDF data from SPARQL stores into Typesense for hybrid search (keyword + vector), enabling fuzzy matching, relevance ranking, typo tolerance, and semantic search over Linked Data.

## Context

Applications like the [NDE Dataset Register browser](https://github.com/netwerk-digitaal-erfgoed/dataset-register) currently search RDF data with SPARQL `CONTAINS()`, which is plain substring matching: no relevance ranking, no typo tolerance, and no semantic understanding. A dedicated search engine would add all three.

## Approach

### RDF-to-search-index pipeline

1. **Accept RDF triples** as input (e.g. `N3.Store`) — the caller is responsible for fetching data (e.g. via SPARQL `CONSTRUCT`)
2. **Transform** using JSON-LD Framing (W3C standard) to reshape the RDF graph into deterministic JSON documents — the same pattern already used in [`@lde/docgen`](https://github.com/ldengine/lde/blob/main/packages/docgen/src/frame.ts)
3. **Post-process** the framed output:
   - flatten language maps to per-language fields (`title: { nl, en }` → `title_nl`, `title_en`)
   - use `@id` (the resource's URI) as the Typesense document `id`, so that re-indexing the same resource is an upsert
   - flatten other nested structures
4. **Index** documents into Typesense
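
The post-processing step (3) could look roughly like the sketch below. It is an illustration under assumptions, not the package's implementation: the names `toSearchDocument` and `flatten` are hypothetical, nested keys are joined with `_`, a leading `@` is stripped from keywords (so `@id` becomes `id`), and arrays are passed through untouched, which means arrays of nested objects would still need separate handling.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type FlatDocument = { id: string; [field: string]: Json };

// Recursively copy `node` into `out`, joining nested keys with "_".
// A language map such as title: { nl, en } becomes title_nl and title_en.
function flatten(
  node: { [key: string]: Json },
  prefix: string,
  out: { [field: string]: Json },
): void {
  for (const key of Object.keys(node)) {
    if (key === '@context') continue; // the context is not document data
    const value = node[key];
    const name = prefix + key.replace(/^@/, ''); // "@id" -> "id", "@type" -> "type"
    if (typeof value === 'object' && value !== null && !Array.isArray(value)) {
      flatten(value, name + '_', out);
    } else {
      out[name] = value; // primitives and arrays are kept as they are
    }
  }
}

// Turn one framed JSON-LD node into a flat document keyed by its URI.
function toSearchDocument(framed: { [key: string]: Json }): FlatDocument {
  const out: { [field: string]: Json } = {};
  flatten(framed, '', out);
  const id = out.id;
  if (typeof id !== 'string') {
    throw new Error('Framed node has no @id to use as the document id');
  }
  return { ...out, id };
}
```

Because the document `id` is the resource URI, indexing the same resource twice addresses the same Typesense document.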

### Querying

- Hybrid search combining keyword search with vector similarity (using a multilingual embedding model like `paraphrase-multilingual-MiniLM-L12-v2`). Note that Typesense ranks keyword matches with its own text-match score rather than BM25.
- Faceted search with counts (maps directly to Typesense's `facet_by`)
- Per-language field weighting to boost the user's preferred language
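
As a rough sketch of how these three points might translate into one Typesense search request, the function below builds the search parameters as a plain object. `q`, `query_by`, `query_by_weights`, `vector_query` and `facet_by` are Typesense search parameters. Everything else is an assumption made for illustration: the function name, the `embedding` field name, the 3:1 weighting, the `alpha` value, and giving the embedding field a weight of its own so that the number of weights matches the number of `query_by` fields. Check these against the Typesense version in use.

```typescript
interface HybridSearchParams {
  q: string;
  query_by: string;
  query_by_weights: string;
  vector_query: string;
  facet_by: string;
}

// Build search parameters that favour the user's preferred language.
function buildHybridSearchParams(
  query: string,
  preferredLanguage: string,
  languages: string[],
  facets: string[],
): HybridSearchParams {
  // Preferred language first, then the remaining languages in their given order.
  const ordered = [preferredLanguage].concat(
    languages.filter((lang) => lang !== preferredLanguage),
  );
  const fields = ordered.map((lang) => 'title_' + lang);
  // Assumed weighting: preferred language counts three times as much.
  const weights = ordered.map((lang) => (lang === preferredLanguage ? 3 : 1));

  return {
    q: query,
    // Listing the (assumed) auto-embedding field next to text fields requests hybrid search.
    query_by: fields.concat(['embedding']).join(','),
    query_by_weights: weights.concat([1]).join(','),
    // alpha is the share given to the vector ranking; 0.3 is an illustrative value.
    vector_query: 'embedding:([], alpha: 0.3)',
    facet_by: facets.join(','),
  };
}
```

For a Dutch-speaking user, `buildHybridSearchParams('kaarten', 'nl', ['nl', 'en'], ['publisher_name'])` would query `title_nl` before `title_en` and request facet counts for `publisher_name`.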

## Package structure

A single new package `@lde/search-typesense` covering:

- **Collection schema definition** shared between indexing and querying
- **Indexer**: takes RDF triples (e.g. `N3.Store`) + a JSON-LD frame + Typesense connection → frames, post-processes, and indexes documents. SPARQL fetching is the caller's responsibility, keeping the package focused and composable.
- **Document management** with two sync strategies:
  - **Full reindex via collection alias swap** (recommended to start): create new timestamped collection → index all documents → swap [collection alias](https://typesense.org/docs/29.0/api/collection-alias.html) → drop old collection. Zero-downtime, always a clean slate, no stale documents.
  - **Incremental upsert/delete**: upsert by URI (`@id` → Typesense `id`) when a dataset is updated, delete by URI when removed. Useful if freshness requirements increase later.
- **Searcher**: typed query interface with facet support
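
The order of operations in the alias-swap strategy matters: the old collection may only be dropped after the alias points at the new one. The sketch below expresses that order as data rather than calling a Typesense client, so it makes no claims about a client API. The type `SwapStep`, the function `planAliasSwap` and the `<alias>_<timestamp>` naming scheme are hypothetical.

```typescript
type SwapStep =
  | { step: 'create_collection'; collection: string }
  | { step: 'index_documents'; collection: string }
  | { step: 'upsert_alias'; alias: string; collection: string }
  | { step: 'drop_collection'; collection: string };

// Plan a zero-downtime full reindex.
// `currentCollection` is the collection the alias points at now, or null on first run.
function planAliasSwap(
  alias: string,
  currentCollection: string | null,
  now: Date,
): SwapStep[] {
  // Assumed naming scheme: alias name plus a millisecond timestamp.
  const next = alias + '_' + now.getTime();
  const steps: SwapStep[] = [
    { step: 'create_collection', collection: next },
    { step: 'index_documents', collection: next },
    // Searches keep hitting the old collection until this step completes.
    { step: 'upsert_alias', alias: alias, collection: next },
  ];
  if (currentCollection !== null && currentCollection !== next) {
    // Only after the swap is the old collection safe to remove.
    steps.push({ step: 'drop_collection', collection: currentCollection });
  }
  return steps;
}
```

A caller would execute each step with its Typesense client and stop at the first failure; if indexing fails, the alias still points at the old, complete collection.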

The JSON-LD Framing step itself doesn't need a separate package — it's a thin call to `jsonld.frame()` (same as in `@lde/docgen`). Callers provide their own frame definition (project-specific) and the package handles the Typesense-specific parts (field flattening, language map expansion, indexing, querying).
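
For illustration, a caller-supplied frame might look like the sketch below. The vocabulary choice (DCAT and Dublin Core terms) and the selected properties are assumptions, not taken from an existing project frame. Declaring `"@container": "@language"` is what makes the framed output carry language maps such as `title: { nl, en }`, which the package would then expand into per-language fields.

```typescript
// Hypothetical frame for dataset descriptions; passed as the second argument to jsonld.frame().
const datasetFrame = {
  '@context': {
    dcat: 'http://www.w3.org/ns/dcat#',
    dct: 'http://purl.org/dc/terms/',
    // Language containers yield { nl: "...", en: "..." } in the framed output.
    title: { '@id': 'dct:title', '@container': '@language' },
    description: { '@id': 'dct:description', '@container': '@language' },
    publisher: { '@id': 'dct:publisher' },
  },
  // Only resources of this type become top-level documents.
  '@type': 'dcat:Dataset',
};
```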

### Alternatively: multiple packages

If reuse across search engines becomes a goal, the transformation layer (RDF → JSON-LD Framing → flat documents) could be split into a separate `@lde/rdf-to-json` package. That seems premature for now: start with one package and extract if needed.

## Relates to

- `@lde/docgen` — already uses JSON-LD Framing for RDF → JSON transformation

