Problem
Pipelines like the Dataset Knowledge Graph reprocess every dataset on each run, even when nothing has changed. This wastes time and resources.
Proposal
Track a per-dataset fingerprint and skip reprocessing when the fingerprint hasn't changed. Change can come from two sources:
- Origin data — the source dataset itself has changed (detectable via ETag, Last-Modified, or content hash).
- Pipeline behaviour — the SPARQL queries or LDE package versions used to process the dataset have changed (detectable via a hash of queries + relevant package versions).
A combined fingerprint of both determines whether reprocessing is needed.
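As a rough illustration of how the two fingerprints could be derived and combined, here is a minimal sketch. The function names, the preference order (ETag, then Last-Modified, then content hash), and the choice of SHA-256 are assumptions for illustration, not part of the proposal.

```python
import hashlib
from typing import Dict, List, Optional


def source_fingerprint(
    etag: Optional[str], last_modified: Optional[str], content: Optional[bytes]
) -> str:
    """Pick the cheapest available signal for origin data changes.

    Assumed preference order: ETag, then Last-Modified, then a content hash.
    """
    if etag:
        return f"etag:{etag}"
    if last_modified:
        return f"last-modified:{last_modified}"
    if content is not None:
        return "sha256:" + hashlib.sha256(content).hexdigest()
    raise ValueError("no ETag, Last-Modified, or content available")


def pipeline_fingerprint(queries: List[str], package_versions: Dict[str, str]) -> str:
    """Hash the SPARQL query texts plus the relevant package versions."""
    digest = hashlib.sha256()
    # Sort so the result does not depend on the order in which inputs are listed.
    for query in sorted(queries):
        digest.update(query.encode("utf-8"))
        digest.update(b"\0")  # separator, so adjacent values cannot run together
    for name, version in sorted(package_versions.items()):
        digest.update(f"{name}@{version}".encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


def combined_fingerprint(source_fp: str, pipeline_fp: str) -> str:
    """Combine both fingerprints; a change in either one changes the result."""
    digest = hashlib.sha256()
    digest.update(source_fp.encode("utf-8"))
    digest.update(b"\0")
    digest.update(pipeline_fp.encode("utf-8"))
    return digest.hexdigest()
```

Storing the two fingerprints separately, in addition to (or instead of) the combined value, would also make it possible to tell which of the two sources triggered a reprocess.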
Storage: provenance in the RDF output
Store processing metadata as provenance on named graphs in the triplestore rather than in a separate file-system cache:
- Named graphs per dataset already exist — they provide a natural key.
- If the triplestore is wiped, the stored fingerprints are wiped with it, so every dataset is reprocessed automatically (no stale file-system cache).
- Metadata is self-describing and queryable with SPARQL.
- Aligns with PROV-O / DCAT provenance patterns.
Per-dataset metadata to store
| Property | Purpose |
| --- | --- |
| Source fingerprint (ETag / content hash) | Detect origin data changes |
| Pipeline fingerprint (hash of queries + package versions) | Detect pipeline behaviour changes |
| Processing timestamp | Provenance / debugging |
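One way the three properties might be written to a dataset's named graph is with a SPARQL Update request. The sketch below only builds the request text. `prov:generatedAtTime` is a real PROV-O property, but the `ex:` namespace, the `ex:sourceFingerprint` and `ex:pipelineFingerprint` predicates, and the choice to use the graph IRI as the subject inside its own graph are hypothetical placeholders; the actual vocabulary is still to be decided.

```python
from datetime import datetime, timezone

PREFIXES = (
    "PREFIX prov: <http://www.w3.org/ns/prov#>\n"
    "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n"
    # Hypothetical namespace, used here for illustration only.
    "PREFIX ex: <https://example.org/provenance#>\n"
)


def escape_literal(value: str) -> str:
    """Escape a value for use inside a double-quoted SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def build_provenance_update(
    graph_iri: str, source_fp: str, pipeline_fp: str, processed_at: datetime
) -> str:
    """Build an update that replaces the stored metadata for one named graph."""
    if any(ch in graph_iri for ch in '<>"{}|^`\\') or any(ch.isspace() for ch in graph_iri):
        raise ValueError(f"unsafe graph IRI: {graph_iri!r}")
    g = f"<{graph_iri}>"
    predicates = ("ex:sourceFingerprint", "ex:pipelineFingerprint", "prov:generatedAtTime")
    # Remove any previous values first so each property stays single-valued.
    deletes = [f"DELETE WHERE {{ GRAPH {g} {{ {g} {p} ?old }} }}" for p in predicates]
    insert = (
        f"INSERT DATA {{ GRAPH {g} {{\n"
        f'  {g} ex:sourceFingerprint "{escape_literal(source_fp)}" ;\n'
        f'    ex:pipelineFingerprint "{escape_literal(pipeline_fp)}" ;\n'
        f'    prov:generatedAtTime "{processed_at.isoformat()}"^^xsd:dateTime .\n'
        f"}} }}"
    )
    return PREFIXES + " ;\n".join(deletes + [insert])


if __name__ == "__main__":
    print(
        build_provenance_update(
            "https://example.org/graph/dataset-1",
            'W/"abc123"',
            "0f3a",
            datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
        )
    )
```

The literal escaping matters because ETag values are themselves quoted strings (for example `W/"abc123"`), which would otherwise break the SPARQL literal.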
Behaviour
On each pipeline run, for each dataset:
- Compute the current combined fingerprint (source + pipeline).
- Query the fingerprint stored on the dataset's named graph.
- If they match, skip reprocessing.
- Otherwise, reprocess and update the stored fingerprint.
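The steps above can be sketched as a small decision loop. This is an illustration under stated assumptions, not the pipeline's actual code: `run_pipeline`, `current_fingerprint`, and `reprocess` are hypothetical names, and a plain dictionary stands in for the fingerprints that would really be read from and written to the named graphs.

```python
from typing import Callable, Dict, Iterable, List


def run_pipeline(
    datasets: Iterable[str],
    current_fingerprint: Callable[[str], str],
    stored: Dict[str, str],
    reprocess: Callable[[str], None],
) -> List[str]:
    """Reprocess only datasets whose fingerprint differs from the stored one.

    `stored` maps a dataset's named graph IRI to its last stored fingerprint;
    a missing entry (for example after the store was wiped) never matches.
    Returns the datasets that were reprocessed.
    """
    reprocessed = []
    for dataset in datasets:
        current = current_fingerprint(dataset)
        if stored.get(dataset) == current:
            continue  # nothing changed: skip
        reprocess(dataset)
        # Record the fingerprint only after reprocessing succeeded, so a
        # failed run is retried on the next pipeline run.
        stored[dataset] = current
        reprocessed.append(dataset)
    return reprocessed


if __name__ == "__main__":
    fingerprints = {"graph:a": "fp-1", "graph:b": "fp-2"}
    store: Dict[str, str] = {}
    noop = lambda dataset: None
    print(run_pipeline(fingerprints, fingerprints.__getitem__, store, noop))  # first run
    print(run_pipeline(fingerprints, fingerprints.__getitem__, store, noop))  # second run
```

In this sketch the first run reprocesses both datasets because nothing is stored yet, and the second run skips both.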