
Skip reprocessing unchanged datasets in pipelines #308

@ddeboer

Description


Problem

Pipelines like the Dataset Knowledge Graph reprocess every dataset on each run, even when nothing has changed. This wastes time and resources.

Proposal

Track a per-dataset fingerprint and skip reprocessing when it hasn't changed. A change can come from two sources:

  1. Origin data — the source dataset itself has changed (detectable via ETag, Last-Modified, or content hash).
  2. Pipeline behaviour — the SPARQL queries or LDE package versions used to process the dataset have changed (detectable via a hash of queries + relevant package versions).

A combined fingerprint of both determines whether reprocessing is needed.
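A minimal sketch of how such a combined fingerprint could be computed. All names, the SHA-256 choice, and the input shapes are assumptions for illustration, not the actual implementation:

```typescript
import { createHash } from 'node:crypto';

// Hypothetical shape for origin-change detection inputs.
interface SourceInfo {
  etag?: string;        // from the HTTP ETag header, if the origin provides one
  contentHash?: string; // fallback: hash of the downloaded data
}

function sourceFingerprint(info: SourceInfo): string {
  // Prefer the cheap ETag; fall back to a content hash.
  return info.etag ?? info.contentHash ?? '';
}

function pipelineFingerprint(
  queries: string[],
  packageVersions: Record<string, string>
): string {
  const hash = createHash('sha256');
  for (const q of queries) hash.update(q);
  // Sort package names so the hash is independent of object insertion order.
  for (const name of Object.keys(packageVersions).sort()) {
    hash.update(`${name}@${packageVersions[name]}`);
  }
  return hash.digest('hex');
}

function combinedFingerprint(
  source: SourceInfo,
  queries: string[],
  versions: Record<string, string>
): string {
  // Either component changing changes the combined digest.
  return createHash('sha256')
    .update(sourceFingerprint(source))
    .update(pipelineFingerprint(queries, versions))
    .digest('hex');
}
```

Hashing queries and versions together keeps the stored metadata to a single comparison, while still reacting to either kind of change.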

Storage: provenance in the RDF output

Store processing metadata as provenance on named graphs in the triplestore rather than in a separate file-system cache:

  • Named graphs per dataset already exist — they provide a natural key.
  • If the triplestore is wiped, reprocessing happens automatically (no stale file-system cache).
  • Metadata is self-describing and queryable with SPARQL.
  • Aligns with PROV-O / DCAT provenance patterns.
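To illustrate the "queryable with SPARQL" point, fetching the stored fingerprints for one dataset could look like this (the `ex:` vocabulary and graph IRI are hypothetical placeholders):

```sparql
# Hypothetical vocabulary; only prov: is a real namespace here.
PREFIX ex: <https://example.org/vocab#>

SELECT ?source ?pipeline
WHERE {
  GRAPH <https://example.org/graph/dataset-123> {
    <https://example.org/graph/dataset-123>
      ex:sourceFingerprint   ?source ;
      ex:pipelineFingerprint ?pipeline .
  }
}
```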

Per-dataset metadata to store

| Property | Purpose |
| --- | --- |
| Source fingerprint (ETag / content hash) | Detect origin data changes |
| Pipeline fingerprint (hash of queries + package versions) | Detect pipeline behaviour changes |
| Processing timestamp | Provenance / debugging |
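Stored as provenance on the dataset's named graph, this might look as follows in TriG. The `ex:` properties and the graph IRI are illustrative placeholders, not a settled vocabulary; `prov:generatedAtTime` is standard PROV-O:

```trig
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <https://example.org/vocab#> .

# One named graph per dataset; metadata attached to the graph itself.
<https://example.org/graph/dataset-123> {
  <https://example.org/graph/dataset-123>
    ex:sourceFingerprint   "W/\"etag-value\"" ;
    ex:pipelineFingerprint "sha256:3f2a..." ;
    prov:generatedAtTime   "2024-01-01T00:00:00Z"^^xsd:dateTime .
}
```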

Behaviour

On each pipeline run, for each dataset:

  1. Compute the current combined fingerprint (source + pipeline).
  2. Query the named graph's stored fingerprint.
  3. If they match, skip reprocessing.
  4. Otherwise, reprocess and update the stored fingerprint.
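The steps above can be sketched as follows. `Store` and all names are hypothetical; a real implementation would back this with SPARQL queries and updates against the triplestore rather than an in-memory map:

```typescript
// Stand-in for the triplestore: read/write the fingerprint stored
// as provenance on a dataset's named graph.
interface Store {
  getFingerprint(graph: string): string | undefined;
  setFingerprint(graph: string, fingerprint: string): void;
}

/**
 * Returns true if the dataset was (re)processed, false if it was skipped.
 */
function runDataset(
  store: Store,
  graph: string,
  currentFingerprint: string, // combined source + pipeline fingerprint
  reprocess: () => void
): boolean {
  const stored = store.getFingerprint(graph);
  if (stored === currentFingerprint) {
    return false; // nothing changed: skip reprocessing
  }
  reprocess();
  store.setFingerprint(graph, currentFingerprint);
  return true;
}
```

Note that a wiped triplestore also wipes the stored fingerprints, so reprocessing happens automatically, which is the behaviour the proposal wants.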
