Hybrid search: index RDF data into Typesense #252
Summary
Index RDF data from SPARQL stores into Typesense for hybrid search (keyword + vector), enabling fuzzy matching, relevance ranking, typo tolerance, and semantic search over Linked Data.
Context
Applications like the NDE Dataset Register browser currently search RDF data via SPARQL CONTAINS() — substring matching with no relevance ranking, typo tolerance, or semantic understanding. A dedicated search engine would significantly improve search quality.
Approach
RDF-to-search-index pipeline
- Accept RDF triples as input (e.g. N3.Store); the caller is responsible for fetching data (e.g. via SPARQL CONSTRUCT)
- Transform using JSON-LD Framing (W3C standard) to reshape the RDF graph into deterministic JSON documents, the same pattern already used in @lde/docgen
- Post-process the framed output: flatten language maps to per-language fields (title: { nl, en } → title_nl, title_en), use @id (the resource's URI) as the Typesense document id for upserts, flatten nested structures
- Index documents into Typesense
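The post-processing step could look roughly like the sketch below. It is a minimal illustration, not an existing @lde API: `flattenDocument`, `isLanguageMap`, and the `languages` parameter are hypothetical names, and it assumes each language map holds a single string per language.

```typescript
// Hypothetical helper, not part of any existing @lde package.
type Framed = Record<string, unknown>;
type FlatValue = string | number | boolean | string[];
type FlatDocument = Record<string, FlatValue>;

// A value counts as a language map when every key is a known language tag
// and every value is a plain string (assumption: one value per language).
function isLanguageMap(
  value: unknown,
  languages: string[],
): value is Record<string, string> {
  if (typeof value !== 'object' || value === null || Array.isArray(value)) {
    return false;
  }
  const entries = Object.entries(value);
  return (
    entries.length > 0 &&
    entries.every(
      ([lang, text]) => languages.includes(lang) && typeof text === 'string',
    )
  );
}

function flattenDocument(
  node: Framed,
  languages: string[],
  prefix = '',
): FlatDocument {
  const out: FlatDocument = {};
  for (const [key, value] of Object.entries(node)) {
    if (key === '@context') continue;
    // '@id' becomes 'id' at the top level and e.g. 'publisher_id' when nested.
    const name = prefix + key.replace(/^@/, '');
    if (isLanguageMap(value, languages)) {
      for (const [lang, text] of Object.entries(value)) {
        out[`${name}_${lang}`] = text;
      }
    } else if (Array.isArray(value)) {
      // Keep primitive items only; arrays of nested nodes are out of scope here.
      out[name] = value
        .filter((item) => typeof item === 'string' || typeof item === 'number')
        .map(String);
    } else if (typeof value === 'object' && value !== null) {
      // Nested node: recurse and prefix its fields with the parent field name.
      Object.assign(out, flattenDocument(value as Framed, languages, `${name}_`));
    } else if (
      typeof value === 'string' ||
      typeof value === 'number' ||
      typeof value === 'boolean'
    ) {
      out[name] = value;
    }
  }
  return out;
}
```

Passing the list of languages explicitly avoids guessing whether an object such as { id: ... } is a language map or a nested resource.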
Querying
- Hybrid search combining BM25 keyword search with vector similarity (using a multilingual embedding model like paraphrase-multilingual-MiniLM-L12-v2)
- Faceted search with counts (maps directly to Typesense's facet_by)
- Per-language field weighting to boost the user's preferred language
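As a rough illustration of how these three points might combine into one request, the sketch below builds a set of Typesense search parameters (q, query_by, query_by_weights, facet_by). The function name, the option names, the `embedding` field name, and the weight values are all assumptions made for the example. It also assumes that the embedding field is listed in query_by and needs its own entry in query_by_weights, which should be checked against the Typesense documentation.

```typescript
// Hypothetical option shape; none of these names come from the proposal.
interface HybridSearchOptions {
  query: string;
  textFields: string[]; // base names before language expansion, e.g. 'title'
  languages: string[]; // all indexed languages, e.g. ['nl', 'en']
  preferredLanguage: string;
  facets: string[];
  embeddingField?: string; // assumed name of the auto-embedding field
  preferredWeight?: number; // assumed boost for the preferred language
}

interface SearchParams {
  q: string;
  query_by: string;
  query_by_weights: string;
  facet_by: string;
}

function buildSearchParams(options: HybridSearchOptions): SearchParams {
  const { embeddingField = 'embedding', preferredWeight = 3 } = options;
  const fields: string[] = [];
  const weights: number[] = [];

  // Expand each base field into its per-language variants (title_nl, title_en)
  // and give the user's preferred language a higher weight.
  for (const field of options.textFields) {
    for (const lang of options.languages) {
      fields.push(`${field}_${lang}`);
      weights.push(lang === options.preferredLanguage ? preferredWeight : 1);
    }
  }

  // Including the embedding field in query_by is what makes the search hybrid.
  // Assumption: it needs a matching weight so both lists have equal length.
  fields.push(embeddingField);
  weights.push(1);

  return {
    q: options.query,
    query_by: fields.join(','),
    query_by_weights: weights.join(','),
    facet_by: options.facets.join(','),
  };
}
```

Keeping the parameter construction in a pure function makes the language weighting easy to unit test without a running Typesense instance.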
Package structure
A single new package @lde/search-typesense covering:
- Collection schema definition shared between indexing and querying
- Indexer: takes RDF triples (e.g. N3.Store) + a JSON-LD frame + Typesense connection → frames, post-processes, and indexes documents. SPARQL fetching is the caller's responsibility, keeping the package focused and composable.
- Document management with two sync strategies:
  - Full reindex via collection alias swap (recommended to start): create new timestamped collection → index all documents → swap collection alias → drop old collection. Zero-downtime, always a clean slate, no stale documents.
  - Incremental upsert/delete: upsert by URI (@id → Typesense id) when a dataset is updated, delete by URI when removed. Useful if freshness requirements increase later.
- Searcher: typed query interface with facet support
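The full-reindex strategy can be sketched as follows. To keep the example self-contained it is written against a small hypothetical `SearchBackend` interface rather than the Typesense client itself. In a real implementation these five methods would presumably wrap Typesense's collection create, document import, alias retrieve/upsert, and collection delete endpoints.

```typescript
type SearchDocument = { id: string } & Record<string, unknown>;

// Hypothetical abstraction over the search engine; not a Typesense API.
interface SearchBackend {
  createCollection(name: string): Promise<void>;
  importDocuments(collection: string, docs: SearchDocument[]): Promise<void>;
  getAliasTarget(alias: string): Promise<string | undefined>;
  setAlias(alias: string, collection: string): Promise<void>;
  dropCollection(name: string): Promise<void>;
}

async function fullReindex(
  backend: SearchBackend,
  alias: string,
  docs: SearchDocument[],
  now: Date = new Date(),
): Promise<string> {
  // 1. Create a new timestamped collection next to the live one.
  const next = `${alias}_${now.getTime()}`;
  await backend.createCollection(next);

  // 2. Index all documents into it; searches still hit the old collection.
  await backend.importDocuments(next, docs);

  // 3. Point the alias at the new collection.
  const previous = await backend.getAliasTarget(alias);
  await backend.setAlias(alias, next);

  // 4. Drop the old collection once nothing refers to it any more.
  if (previous !== undefined && previous !== next) {
    await backend.dropCollection(previous);
  }
  return next;
}
```

If the import in step 2 fails, the alias still points at the previous collection, so readers never see a partially filled index. Cleaning up the abandoned new collection is left out of this sketch.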
The JSON-LD Framing step itself doesn't need a separate package — it's a thin call to jsonld.frame() (same as in @lde/docgen). Callers provide their own frame definition (project-specific) and the package handles the Typesense-specific parts (field flattening, language map expansion, indexing, querying).
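For illustration, a caller-provided frame and a small normalisation helper might look like the sketch below. The DCAT and Dublin Core terms, the `datasetFrame` name, and the `framedNodes` helper are assumptions for the example, not part of the proposal. The helper assumes JSON-LD 1.1 framing output, where multiple matches appear under @graph and a single match is returned as the top-level node.

```typescript
// Example frame: select dcat:Dataset nodes and expose dct:title as a
// language map (title: { nl: ..., en: ... }). Vocabulary choice is assumed.
const datasetFrame = {
  '@context': {
    dcat: 'http://www.w3.org/ns/dcat#',
    dct: 'http://purl.org/dc/terms/',
    title: { '@id': 'dct:title', '@container': '@language' },
    publisher: { '@id': 'dct:publisher' },
  },
  '@type': 'dcat:Dataset',
};

// Hypothetical helper: turn the result of jsonld.frame(input, datasetFrame)
// into a plain array of nodes, whichever of the two output shapes was used.
function framedNodes(
  framed: Record<string, unknown>,
): Record<string, unknown>[] {
  const graph = framed['@graph'];
  if (Array.isArray(graph)) {
    return graph as Record<string, unknown>[];
  }
  // Single match: the node's properties sit next to '@context'.
  const node: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(framed)) {
    if (key !== '@context') node[key] = value;
  }
  return Object.keys(node).length > 0 ? [node] : [];
}
```

Each node returned by such a helper would then go through the Typesense-specific flattening before indexing.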
Alternatively: multiple packages
If reuse across search engines becomes a goal, the transformation layer (SPARQL → JSON-LD Framing → flat documents) could be split into a separate @lde/rdf-to-json package. But this seems premature — start with one package and extract if needed.
Relates to
- @lde/docgen: already uses JSON-LD Framing for RDF → JSON transformation