Rewrite VoID queries to avoid QLever memory exhaustion on large datasets #302

@ddeboer

## Problem

On datasets with hundreds of millions of triples (e.g. `dataset_bhi` with 686M triples), several VoID aggregation queries fail with QLever memory allocation errors, even with `--memory-max-size 12G`.

QLever stores its index on disk (the OS caches it into RAM as available). The `--memory-max-size` parameter is a budget for query processing and caching only. On `dataset_bhi`, available memory within this budget shrank as stages progressed: 3.4 GB remained for `subjects.rq` and `object-literals.rq`, dropping to just 787 MB by `entity-properties.rq`. The cause of this shrinkage isn't confirmed — it could be cached query results from earlier stages, memory not fully released after intermediate queries, or fragmentation.

The root cause of the failures themselves is that `FILTER` expressions (`ISBLANK`, `ISLITERAL`) force QLever to materialize the full index scan row by row, then deduplicate in memory. Without a filter, QLever can answer `COUNT(DISTINCT ...)` directly from its sorted permutation indexes with minimal memory.
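The contrast can be sketched as follows (query shapes paraphrased from the failing and succeeding patterns above, not the exact contents of the `.rq` files):

```sparql
# Fails on large datasets: the filter forces a row-by-row scan,
# then an in-memory deduplication of every distinct subject.
SELECT (COUNT(DISTINCT ?s) AS ?count) WHERE {
  ?s ?p ?o .
  FILTER(!ISBLANK(?s))
}

# Succeeds: without the filter, QLever can read the distinct count
# straight off its sorted SPO permutation, using minimal memory.
SELECT (COUNT(DISTINCT ?s) AS ?count) WHERE {
  ?s ?p ?o .
}
```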

See #284 for the full error analysis.

## Failing queries

| Query | Pattern | Memory needed | Available |
| --- | --- | --- | --- |
| `subjects.rq` | `COUNT(DISTINCT ?s)` with `FILTER(!ISBLANK(?s))` | 4.3 GB | 3.4 GB |
| `object-literals.rq` | `COUNT(DISTINCT ?o)` with `FILTER(ISLITERAL(?o))` | 4.3 GB | 3.4 GB |
| `entity-properties.rq` | `COUNT(DISTINCT ?s)`, `COUNT(DISTINCT ?o)` with `GROUP BY ?p` | 5.5 GB | 787 MB |

## Queries that succeed on the same dataset

| Query | Pattern | Why it works |
| --- | --- | --- |
| `properties.rq` | `COUNT(DISTINCT ?p)`, no filter | Few distinct predicates |
| `triples.rq` | `COUNT(*)`, no filter, no distinct | Just a count |
| `object-uris.rq` | `COUNT(DISTINCT ?o)` with `FILTER(ISIRI(?o))` | Fewer distinct IRI objects fit in memory |

## Proposed fixes

### 1. Investigate and tune QLever memory settings

The `@lde/sparql-qlever` `Server` class currently only exposes `--memory-max-size`. QLever also supports `--cache-max-size` (e.g. Wikidata uses `--cache-max-size 15G` alongside `--memory-max-size 20G`). Tuning the cache cap might reserve more of the budget for query execution — but first we need to understand what's consuming the available memory as stages progress.

### 2. Rewrite `subjects.rq`: drop the `ISBLANK` filter

The VoID spec defines `void:distinctSubjects` as the number of distinct subjects — it doesn't require excluding blank nodes. Removing `FILTER(!ISBLANK(?s))` lets QLever answer directly from its SPO permutation index without materialization.
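A sketch of the rewrite (the actual `subjects.rq` may differ in prefixes and variable names):

```sparql
# Current: the filter forces materialization of all non-blank subjects.
SELECT (COUNT(DISTINCT ?s) AS ?distinctSubjects) WHERE {
  ?s ?p ?o .
  FILTER(!ISBLANK(?s))
}

# Proposed: answered directly from the SPO permutation index.
SELECT (COUNT(DISTINCT ?s) AS ?distinctSubjects) WHERE {
  ?s ?p ?o .
}
```

Note that the rewritten count then includes blank-node subjects, which the VoID spec permits.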

### 3. Rewrite `object-literals.rq`: compute by subtraction

Instead of `COUNT(DISTINCT ?o)` with `FILTER(ISLITERAL(?o))`, compute:

- the total `COUNT(DISTINCT ?o)` (no filter — answered from the OPS index, minimal memory)
- minus `COUNT(DISTINCT ?o)` with `FILTER(ISIRI(?o))` (already succeeds on large datasets)
- minus blank-node objects, if needed

This avoids the `ISLITERAL` filter that forces materialization of all distinct literals.
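The subtraction approach could look like this (three separate queries; variable names are illustrative):

```sparql
# Query 1 — total distinct objects, no filter (cheap via permutation index):
SELECT (COUNT(DISTINCT ?o) AS ?totalObjects) WHERE {
  ?s ?p ?o .
}

# Query 2 — distinct IRI objects (this pattern already succeeds on large datasets):
SELECT (COUNT(DISTINCT ?o) AS ?iriObjects) WHERE {
  ?s ?p ?o .
  FILTER(ISIRI(?o))
}

# Query 3 (optional) — distinct blank-node objects, if they must be excluded:
SELECT (COUNT(DISTINCT ?o) AS ?bnodeObjects) WHERE {
  ?s ?p ?o .
  FILTER(ISBLANK(?o))
}
```

The literal count is then computed client-side as `?totalObjects - ?iriObjects - ?bnodeObjects`. The `ISBLANK` filter in query 3 still triggers materialization, but the assumption is that the set of distinct blank-node objects is far smaller than the set of distinct literals.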

### 4. Rewrite `entity-properties.rq`: split into two queries

The dual `COUNT(DISTINCT ?s)`, `COUNT(DISTINCT ?o)` with `GROUP BY ?p` requires deduplicating two columns simultaneously. Splitting it into separate queries — one for `COUNT(DISTINCT ?s) GROUP BY ?p` and one for `COUNT(DISTINCT ?o) GROUP BY ?p` — roughly halves peak memory per query.
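The split could be sketched as (variable names illustrative):

```sparql
# Query 1 — distinct subjects per predicate:
SELECT ?p (COUNT(DISTINCT ?s) AS ?distinctSubjects) WHERE {
  ?s ?p ?o .
}
GROUP BY ?p

# Query 2 — distinct objects per predicate:
SELECT ?p (COUNT(DISTINCT ?o) AS ?distinctObjects) WHERE {
  ?s ?p ?o .
}
GROUP BY ?p
```

The two result sets can then be joined on `?p` client-side when emitting the per-property VoID partitions.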

### 5. Reorder stages

Run the most memory-intensive queries first, before other stages consume available memory.
