Rewrite VoID queries to avoid QLever memory exhaustion on large datasets #302
Description
Problem
On datasets with hundreds of millions of triples (e.g. dataset_bhi with 686M triples), several VoID aggregation queries fail with QLever memory allocation errors even with --memory-max-size 12G.
QLever stores its index on disk (the OS caches it into RAM as available). The --memory-max-size parameter is a budget for query processing and caching only. On dataset_bhi, available memory within this budget shrank as stages progressed: 3.4 GB remained for subjects.rq and object-literals.rq, dropping to just 787 MB by entity-properties.rq. The cause of this shrinkage isn't confirmed — it could be cached query results from earlier stages, memory not fully released after intermediate queries, or fragmentation.
The root cause of the failures themselves is that FILTER expressions (ISBLANK, ISLITERAL) force QLever to materialize the full index scan row-by-row, then deduplicate in memory. Without a filter, QLever can answer COUNT(DISTINCT) directly from its sorted permutation indexes with minimal memory.
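To illustrate the difference (generic triple patterns; the actual query bodies in this repo may differ):

```sparql
# Answerable directly from the sorted SPO permutation:
SELECT (COUNT(DISTINCT ?s) AS ?count) WHERE { ?s ?p ?o }

# Forces a row-by-row scan of the full index plus in-memory deduplication:
SELECT (COUNT(DISTINCT ?s) AS ?count) WHERE {
  ?s ?p ?o
  FILTER(!ISBLANK(?s))
}
```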
See #284 for the full error analysis.
Failing queries
| Query | Pattern | Memory needed | Available |
|---|---|---|---|
| `subjects.rq` | `COUNT(DISTINCT ?s)` with `FILTER(!ISBLANK(?s))` | 4.3 GB | 3.4 GB |
| `object-literals.rq` | `COUNT(DISTINCT ?o)` with `FILTER(ISLITERAL(?o))` | 4.3 GB | 3.4 GB |
| `entity-properties.rq` | `COUNT(DISTINCT ?s)`, `COUNT(DISTINCT ?o)` with `GROUP BY ?p` | 5.5 GB | 787 MB |
Queries that succeed on the same dataset
| Query | Pattern | Why it works |
|---|---|---|
| `properties.rq` | `COUNT(DISTINCT ?p)`, no filter | Few distinct predicates |
| `triples.rq` | `COUNT(*)`, no filter, no distinct | Just a count |
| `object-uris.rq` | `COUNT(DISTINCT ?o)` with `FILTER(ISIRI(?o))` | Fewer distinct IRI objects fit in memory |
Proposed fixes
1. Investigate and tune QLever memory settings
The @lde/sparql-qlever Server class currently only exposes --memory-max-size. QLever also supports --cache-max-size (e.g. Wikidata uses --cache-max-size 15G alongside --memory-max-size 20G). Tuning the cache cap might reserve more of the budget for query execution — but first we need to understand what's consuming the available memory as stages progress.
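A sketch of what a tuned launch could look like, assuming the server is started directly via QLever's `ServerMain` binary (the 10G/2G split is an untested guess to validate, not a recommendation; only the two flag names come from the source):

```shell
# Hypothetical split: keep the overall budget at 12G but cap the
# result cache so more headroom stays reserved for query execution.
ServerMain -i /index/dataset_bhi \
  --memory-max-size 10G \
  --cache-max-size 2G
```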
2. Rewrite subjects.rq: drop the ISBLANK filter
The VoID spec defines void:distinctSubjects as the number of distinct subjects — it doesn't require excluding blank nodes. Removing FILTER(!ISBLANK(?s)) lets QLever answer directly from its SPO permutation index without materialization.
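A sketch of the rewritten query (assuming the current file wraps a single triple pattern; prefixes and the exact result-variable name may differ):

```sparql
# subjects.rq (proposed): count all distinct subjects, including
# blank nodes, so QLever can answer from the SPO permutation.
SELECT (COUNT(DISTINCT ?s) AS ?distinctSubjects) WHERE {
  ?s ?p ?o
}
```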
3. Rewrite object-literals.rq: compute by subtraction
Instead of COUNT(DISTINCT ?o) FILTER(ISLITERAL(?o)), compute:
- Total: `COUNT(DISTINCT ?o)` with no filter (answered from the OPS index with minimal memory)
- Minus: `COUNT(DISTINCT ?o)` with `FILTER(ISIRI(?o))` (already succeeds on large datasets)
- Minus blank-node objects, if the dataset has any
This avoids the ISLITERAL filter that forces materialization of all distinct literals.
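As three separate queries, the subtraction could look like this (variable names are illustrative):

```sparql
# Query 1: total distinct objects (no filter; cheap via the OPS index).
SELECT (COUNT(DISTINCT ?o) AS ?total) WHERE { ?s ?p ?o }

# Query 2: distinct IRI objects (already succeeds on large datasets).
SELECT (COUNT(DISTINCT ?o) AS ?iris) WHERE {
  ?s ?p ?o
  FILTER(ISIRI(?o))
}

# Query 3 (only if blank-node objects occur): distinct blank-node objects.
SELECT (COUNT(DISTINCT ?o) AS ?blanks) WHERE {
  ?s ?p ?o
  FILTER(ISBLANK(?o))
}

# distinct literals = ?total - ?iris - ?blanks
```

Note that queries 2 and 3 still carry filters; the bet is that their distinct-value sets are small enough to deduplicate in memory, unlike the literal set.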
4. Rewrite entity-properties.rq: split into two queries
The dual COUNT(DISTINCT ?s), COUNT(DISTINCT ?o) GROUP BY ?p requires deduplicating two columns simultaneously. Splitting it into two queries — one for COUNT(DISTINCT ?s) GROUP BY ?p and one for COUNT(DISTINCT ?o) GROUP BY ?p — roughly halves the peak memory each query needs, since only one column is deduplicated at a time.
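Sketched as two queries whose results would be joined on `?p` afterwards (variable names are illustrative):

```sparql
# Query A: distinct subjects per predicate.
SELECT ?p (COUNT(DISTINCT ?s) AS ?subjects) WHERE {
  ?s ?p ?o
} GROUP BY ?p

# Query B: distinct objects per predicate.
SELECT ?p (COUNT(DISTINCT ?o) AS ?objects) WHERE {
  ?s ?p ?o
} GROUP BY ?p
```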
5. Reorder stages
Run the most memory-intensive queries first, before other stages consume available memory.