
Skip reprocessing unchanged datasets in pipelines #308

@ddeboer

Description


Problem

Pipelines like the Dataset Knowledge Graph reprocess every dataset on each run, even when nothing has changed. This wastes time and resources.

Proposal

Track a per-dataset fingerprint and skip reprocessing when it hasn't changed. A change can come from two sources:

  1. Origin data — the source dataset itself has changed (detectable via ETag, Last-Modified, or content hash).
  2. Pipeline behaviour — the SPARQL queries or LDE package versions used to process the dataset have changed (detectable via a hash of queries + relevant package versions).

A combined fingerprint of both determines whether reprocessing is needed.
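A minimal sketch of how such a combined fingerprint could be computed. All names, the SHA-256 choice, and the input shapes are assumptions for illustration, not the actual implementation:

```typescript
import { createHash } from 'node:crypto';

// Hypothetical shape for origin-change detection inputs.
interface SourceInfo {
  etag?: string;        // from the HTTP ETag header, if the origin provides one
  contentHash?: string; // fallback: hash of the downloaded data
}

function sourceFingerprint(info: SourceInfo): string {
  // Prefer the cheap ETag; fall back to a content hash.
  return info.etag ?? info.contentHash ?? '';
}

function pipelineFingerprint(
  queries: string[],
  packageVersions: Record<string, string>
): string {
  const hash = createHash('sha256');
  for (const q of queries) hash.update(q);
  // Sort package names so the hash is independent of object insertion order.
  for (const name of Object.keys(packageVersions).sort()) {
    hash.update(`${name}@${packageVersions[name]}`);
  }
  return hash.digest('hex');
}

function combinedFingerprint(
  source: SourceInfo,
  queries: string[],
  versions: Record<string, string>
): string {
  // Either component changing changes the combined digest.
  return createHash('sha256')
    .update(sourceFingerprint(source))
    .update(pipelineFingerprint(queries, versions))
    .digest('hex');
}
```

Hashing queries and versions together keeps the stored metadata to a single comparison, while still reacting to either kind of change.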

Storage: provenance in the RDF output

Store processing metadata as provenance on named graphs in the triplestore rather than in a separate file-system cache:

  • Named graphs per dataset already exist — they provide a natural key.
  • If the triplestore is wiped, reprocessing happens automatically (no stale file-system cache).
  • Metadata is self-describing and queryable with SPARQL.
  • Aligns with PROV-O / DCAT provenance patterns.
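To illustrate the "queryable with SPARQL" point, fetching the stored fingerprints for one dataset could look like this (the `ex:` vocabulary and graph IRI are hypothetical placeholders):

```sparql
# Hypothetical vocabulary; only prov: is a real namespace here.
PREFIX ex: <https://example.org/vocab#>

SELECT ?source ?pipeline
WHERE {
  GRAPH <https://example.org/graph/dataset-123> {
    <https://example.org/graph/dataset-123>
      ex:sourceFingerprint   ?source ;
      ex:pipelineFingerprint ?pipeline .
  }
}
```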

Per-dataset metadata to store

| Property | Purpose |
| --- | --- |
| Source fingerprint (ETag / content hash) | Detect origin data changes |
| Pipeline fingerprint (hash of queries + package versions) | Detect pipeline behaviour changes |
| Processing timestamp | Provenance / debugging |
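Stored as provenance on the dataset's named graph, this might look as follows in TriG. The `ex:` properties and the graph IRI are illustrative placeholders, not a settled vocabulary; `prov:generatedAtTime` is standard PROV-O:

```trig
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <https://example.org/vocab#> .

# One named graph per dataset; metadata attached to the graph itself.
<https://example.org/graph/dataset-123> {
  <https://example.org/graph/dataset-123>
    ex:sourceFingerprint   "W/\"etag-value\"" ;
    ex:pipelineFingerprint "sha256:3f2a..." ;
    prov:generatedAtTime   "2024-01-01T00:00:00Z"^^xsd:dateTime .
}
```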

Behaviour

On each pipeline run, for each dataset:

  1. Compute the current combined fingerprint (source + pipeline).
  2. Query the named graph's stored fingerprint.
  3. If they match, skip reprocessing.
  4. Otherwise, reprocess and update the stored fingerprint.
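The steps above can be sketched as follows. `Store` and all names are hypothetical; a real implementation would back this with SPARQL queries and updates against the triplestore rather than an in-memory map:

```typescript
// Stand-in for the triplestore: read/write the fingerprint stored
// as provenance on a dataset's named graph.
interface Store {
  getFingerprint(graph: string): string | undefined;
  setFingerprint(graph: string, fingerprint: string): void;
}

/**
 * Returns true if the dataset was (re)processed, false if it was skipped.
 */
function runDataset(
  store: Store,
  graph: string,
  currentFingerprint: string, // combined source + pipeline fingerprint
  reprocess: () => void
): boolean {
  const stored = store.getFingerprint(graph);
  if (stored === currentFingerprint) {
    return false; // nothing changed: skip reprocessing
  }
  reprocess();
  store.setFingerprint(graph, currentFingerprint);
  return true;
}
```

Note that a wiped triplestore also wipes the stored fingerprints, so reprocessing happens automatically, which is the behaviour the proposal wants.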
