
Hybrid search: index RDF data into Typesense #252

@ddeboer


Summary

Index RDF data from SPARQL stores into Typesense for hybrid search (keyword + vector), enabling fuzzy matching, relevance ranking, typo tolerance, and semantic search over Linked Data.

Context

Applications like the NDE Dataset Register browser currently search RDF data via SPARQL CONTAINS() — substring matching with no relevance ranking, typo tolerance, or semantic understanding. A dedicated search engine would significantly improve search quality.

Approach

RDF-to-search-index pipeline

  1. Accept RDF triples as input (e.g. N3.Store) — the caller is responsible for fetching data (e.g. via SPARQL CONSTRUCT)
  2. Transform using JSON-LD Framing (W3C standard) to reshape the RDF graph into deterministic JSON documents — the same pattern already used in @lde/docgen
  3. Post-process the framed output: flatten language maps to per-language fields (title: { nl, en } → title_nl, title_en), flatten other nested structures, and use @id (the resource's URI) as the Typesense document id for upserts
  4. Index documents into Typesense
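
A minimal sketch of the post-processing in step 3, to make the target document shape concrete. The function name `flattenFramedNode` and the `Json` type are hypothetical, not part of any existing @lde package. It assumes one framed node per resource, treats language maps as ordinary nested objects (so `title: { nl, en }` falls out as `title_nl`, `title_en`), strips the leading `@` from JSON-LD keywords, and leaves arrays untouched.

```typescript
// Hypothetical helper: flattens one framed JSON-LD node into a flat document.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type FlatDocument = Record<string, Json>;

function flattenFramedNode(node: { [key: string]: Json }): FlatDocument {
  // The resource URI becomes the document id, so it must be present.
  if (typeof node['@id'] !== 'string') {
    throw new Error('Framed node needs a string @id to serve as document id');
  }
  const doc: FlatDocument = {};

  const visit = (value: Json, path: string): void => {
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      // Nested object (including language maps): recurse, joining keys with "_".
      for (const [key, child] of Object.entries(value)) {
        const name = key.replace(/^@/, ''); // "@id" -> "id", "@type" -> "type"
        visit(child, path === '' ? name : `${path}_${name}`);
      }
    } else {
      // Scalars and arrays are copied as-is (assumption: arrays hold scalars).
      doc[path] = value;
    }
  };

  visit(node, '');
  return doc;
}

const example = flattenFramedNode({
  '@id': 'https://example.org/dataset/1',
  title: { nl: 'Kaarten', en: 'Maps' },
  publisher: { '@id': 'https://example.org/org/1', name: { nl: 'Archief' } },
  keyword: ['maps', 'history'],
});
console.log(example);
```

Arrays of nested objects would need an explicit decision (for example, collecting one field per array) that this sketch does not make.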

Querying

  • Hybrid search combining BM25 keyword search with vector similarity (using a multilingual embedding model like paraphrase-multilingual-MiniLM-L12-v2)
  • Faceted search with counts (maps directly to Typesense's facet_by)
  • Per-language field weighting to boost the user's preferred language
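
The three query features above could come together in a single set of Typesense search parameters. The sketch below only builds the parameter object; it does not call Typesense. The function name, the option names, the `embedding` field name and the boost factor of 2 are all hypothetical. It assumes flattened fields named `<field>_<lang>`, and it assumes `query_by_weights` needs one weight per `query_by` entry, including the embedding field, which should be checked against the Typesense documentation.

```typescript
// Hypothetical options for building a hybrid search request.
interface HybridSearchOptions {
  query: string;
  textFields: string[]; // base field names, e.g. ['title']
  languages: string[]; // indexed languages, e.g. ['nl', 'en']
  preferredLanguage: string; // language to boost
  facets: string[]; // fields to facet on
  embeddingField: string; // name of the vector field (assumption)
  alpha: number; // weight of vector vs keyword score
}

function buildHybridSearchParams(o: HybridSearchOptions) {
  // One query field per (field, language) pair; preferred language gets weight 2.
  const fields = o.textFields.flatMap((field) =>
    o.languages.map((lang) => ({
      name: `${field}_${lang}`,
      weight: lang === o.preferredLanguage ? 2 : 1,
    })),
  );
  return {
    q: o.query,
    // Including the embedding field in query_by makes the search hybrid.
    query_by: [...fields.map((f) => f.name), o.embeddingField].join(','),
    // Assumption: one weight per query_by entry, embedding field included.
    query_by_weights: [...fields.map((f) => f.weight), 1].join(','),
    // Empty vector: Typesense embeds the query text itself.
    vector_query: `${o.embeddingField}:([], alpha: ${o.alpha})`,
    facet_by: o.facets.join(','),
  };
}

console.log(
  buildHybridSearchParams({
    query: 'kaarten',
    textFields: ['title'],
    languages: ['nl', 'en'],
    preferredLanguage: 'nl',
    facets: ['publisher', 'format'],
    embeddingField: 'embedding',
    alpha: 0.5,
  }),
);
```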

Package structure

A single new package @lde/search-typesense covering:

  • Collection schema definition shared between indexing and querying
  • Indexer: takes RDF triples (e.g. N3.Store) + a JSON-LD frame + Typesense connection → frames, post-processes, and indexes documents. SPARQL fetching is the caller's responsibility, keeping the package focused and composable.
  • Document management with two sync strategies:
    • Full reindex via collection alias swap (recommended to start): create new timestamped collection → index all documents → swap collection alias → drop old collection. Zero-downtime, always a clean slate, no stale documents.
    • Incremental upsert/delete: upsert by URI (@id → Typesense id) when a dataset is updated, delete by URI when removed. Useful if freshness requirements increase later.
  • Searcher: typed query interface with facet support
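
The full-reindex strategy can be sketched against a small backend interface, which keeps the ordering of the steps visible and testable without a running Typesense server. `IndexBackend` and `fullReindex` are hypothetical names. In a real implementation the methods would presumably wrap the Typesense client's collection create, document import, alias upsert and collection delete calls.

```typescript
// Hypothetical minimal interface over the search engine; not an existing API.
interface IndexBackend {
  createCollection(name: string): Promise<void>;
  importDocuments(collection: string, docs: object[]): Promise<void>;
  getAliasTarget(alias: string): Promise<string | undefined>;
  setAlias(alias: string, collection: string): Promise<void>;
  dropCollection(name: string): Promise<void>;
}

// Full reindex via alias swap: readers keep querying the alias throughout.
async function fullReindex(
  backend: IndexBackend,
  alias: string,
  docs: object[],
  now: Date = new Date(),
): Promise<string> {
  // 1. Create a new timestamped collection.
  const name = `${alias}_${now.getTime()}`;
  await backend.createCollection(name);

  // 2. Index all documents into it while the old collection still serves queries.
  await backend.importDocuments(name, docs);

  // 3. Point the alias at the new collection.
  const previous = await backend.getAliasTarget(alias);
  await backend.setAlias(alias, name);

  // 4. Drop the old collection, if there was one.
  if (previous !== undefined && previous !== name) {
    await backend.dropCollection(previous);
  }
  return name;
}
```

If indexing fails in step 2, the alias still points at the previous collection, so queries are unaffected; the half-filled new collection would need to be cleaned up separately.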

The JSON-LD Framing step itself doesn't need a separate package — it's a thin call to jsonld.frame() (same as in @lde/docgen). Callers provide their own frame definition (project-specific) and the package handles the Typesense-specific parts (field flattening, language map expansion, indexing, querying).
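
For illustration, a caller-provided frame might look like the sketch below. The vocabulary choice (DCAT and Dublin Core terms) and the selected properties are assumptions for a dataset catalogue, not something this proposal prescribes. The `@container: @language` entries make the framed output use language maps (`title: { nl, en }`), which the post-processing step then flattens.

```typescript
// Hypothetical frame for dataset descriptions; properties are illustrative only.
const datasetFrame = {
  '@context': {
    dcat: 'http://www.w3.org/ns/dcat#',
    dct: 'http://purl.org/dc/terms/',
    // Language containers yield { nl: '...', en: '...' } maps in the output.
    title: { '@id': 'dct:title', '@container': '@language' },
    description: { '@id': 'dct:description', '@container': '@language' },
    publisher: { '@id': 'dct:publisher' },
  },
  // Select only resources of this type as top-level documents.
  '@type': 'dcat:Dataset',
};

console.log(JSON.stringify(datasetFrame, null, 2));
```

The frame would be passed as the second argument to jsonld.frame(). Since that function expects a JSON-LD document rather than an N3.Store, the triples would first need converting to JSON-LD.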

Alternatively: multiple packages

If reuse across search engines becomes a goal, the transformation layer (RDF → JSON-LD Framing → flat documents) could be split into a separate @lde/rdf-to-json package. That seems premature for now: start with one package and extract if needed.

Relates to

  • @lde/docgen — already uses JSON-LD Framing for RDF → JSON transformation
