Improve plugin documentation (second batch) #987
Conversation
* develop: (55 commits)
  * Update @Schema annotations to not use deprecated attributes.
  * Make sure that Sources are closed, even if they fail.
  * Added Codec where it was missing.
  * Added WorkbenchConfig.dataIntegrationUrl analog to dataManagerUrl and dataPlatformUrl; renamed `dataplatformUrl` to `dataPlatformUrl`
  * Add localBaseUrl to WorkbenchConfig
  * Heading instead of emphasis.
  * Remove duplicated endpoint and replace usages in UI code
  * JSON dataset: `#arrayText` on non-existing properties now returns empty result instead of empty array string.
  * fix module import for webpack
  * refactor: update metadata type from IMetaData to IMetadata in DatasetClearButton
  * install yarn dependencies with frozen lockfile
  * Move clear dataset function into correct place
  * build frontend code first, only call tests if it succeeded
  * fetch explicitly submodules and re-order frontend/backend parts, group them together
  * move rootDir config to jest config file
  * try different approach to call command in workspace folder
  * use correct file name for config script
  * do not use TypeScript for jest config
  * add ts-node package
  * move arg to correct place
  * ...
Feedbacks

Alignment
* Goal is clear (serialize …), but how to use?
* Reference links are redundant with the introduction (exact same links in Overview)

RDF File Dataset
* 4.1 File size check: what is the value, to be specific?

In Memory Dataset
* Nice use cases, examples and explanations

Overall LGTM
That's configurable in the …
That was my writing style. At the beginning, those links are mentioned; at the end, they're just references. So yes, it's "redundant" in the usual "bibliographic" sense (those anchors are used in the text, not dangling links).
Could be useful to mention this parameter in the doc (as well as the default value).
https://jira.eccenca.com/browse/CMEM-7013
This PR adds documentation for the following Silk dataset plugins:
RdfFileDataset.md
RDF file reads RDF data from a local file (or ZIP archive) into the project as an in-memory dataset and, for supported formats, can also write RDF back to a file.
The doc starts with the intended usage window (small/medium files, snapshots for exploration/mapping/linking, simple export) and immediately flags the hard constraint: everything is loaded into memory, so very large files belong in an external store.

It then walks through the data shape and I/O story: single file vs. ZIP input (plus the regex gate for which ZIP entries are considered), dataset output as queryable graph(s), and the graph-selection rule (a named graph only where the chosen format supports it; otherwise the default graph, with the graph parameter ignored for graph-less formats). Configuration notes focus on how to think, not just what to fill in: file/ZIP behavior, format auto-detection (and the "can't detect → error" path), the write restriction (only N-Triples as output), advanced narrowing via an entity list, and ZIP entry filtering via regex.

Behavior is described as a sequence you can predict: size check → parse into an in-memory dataset (default + possibly named graphs) → select graph → serve repeated reads from memory until the underlying file's timestamp changes → reload on next access; the write path serializes as N-Triples only (see the sketch below). It ends with limitations plus "when to use" guidance and concrete examples (simple Turtle, N-Quads with an explicit graph, ZIP with multiple RDF files).
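To make that sequence concrete, here is a minimal Scala sketch of the read/write path built on Apache Jena. This is not Silk's implementation; the `RdfFileSource` class, the `maxSizeBytes` limit, and the timestamp cache are illustrative assumptions modeled on the behavior described above.

```scala
import java.io.{File, FileOutputStream}
import org.apache.jena.query.{Dataset, DatasetFactory}
import org.apache.jena.rdf.model.Model
import org.apache.jena.riot.{Lang, RDFDataMgr, RDFLanguages}

// Illustrative sketch only; class and parameter names are assumptions,
// not Silk's API. Built on Apache Jena to mirror the documented sequence.
class RdfFileSource(file: File, graphUri: Option[String],
                    maxSizeBytes: Long = 100L * 1024 * 1024) {  // assumed limit

  // Cache: (file timestamp, parsed dataset) so repeated reads come from memory.
  private var cached: Option[(Long, Dataset)] = None

  // Format auto-detection from the file name; "can't detect" is an error.
  private def detectLang(): Lang =
    Option(RDFLanguages.filenameToLang(file.getName))
      .getOrElse(sys.error(s"Cannot detect RDF format of ${file.getName}"))

  // Size check, then parse into an in-memory dataset; reload only when
  // the underlying file's timestamp has changed since the last read.
  private def dataset(): Dataset = {
    if (file.length() > maxSizeBytes)
      sys.error(s"${file.getName} exceeds the configured size limit")
    val stamp = file.lastModified()
    cached match {
      case Some((ts, ds)) if ts == stamp => ds
      case _ =>
        val ds = DatasetFactory.create()
        RDFDataMgr.read(ds, file.getAbsolutePath, detectLang())
        cached = Some((stamp, ds))
        ds
    }
  }

  // Graph selection: a named graph only if the format can carry quads;
  // for graph-less formats the graph parameter is ignored.
  def model(): Model = graphUri match {
    case Some(uri) if RDFLanguages.isQuads(detectLang()) => dataset().getNamedModel(uri)
    case _                                               => dataset().getDefaultModel
  }

  // Write path: output is restricted to N-Triples.
  def write(target: File): Unit = {
    val out = new FileOutputStream(target)
    try RDFDataMgr.write(out, model(), Lang.NTRIPLES)
    finally out.close()
  }
}
```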
InMemoryDataset.md
In-memory dataset is a small embedded RDF store that keeps all data in memory and exposes it via SPARQL as a temporary working graph inside workflows.
The doc frames it as a deliberately non-persistent scratch graph: one in-memory RDF model, all reads and writes mediated through a SPARQL endpoint, and an empty state after application restart. Within workflows it's explicitly bidirectional (usable as both source and sink), so upstream components can write entities/links/triples into it and downstream components can query it like a normal SPARQL dataset (entity retrieval, path/type discovery, sampling, etc.), with no file backing at all.

Writing is explained by sink type but unified in effect: the entity sink converts entities to triples, the link sink writes link triples, and the triple sink adds triples directly; all converge into the same single in-memory graph. The one configuration knob ("Clear graph before workflow execution", default true) is treated as the semantic switch: either a fresh, empty graph per run, or a longer-lived in-memory graph shared across runs within the same process (see the sketch below).

Limitations are stated as operational consequences (memory-bound, no persistence, best for small/medium intermediates and prototyping), and the examples reinforce the intended patterns: temporary integration graph, scratch experimentation area, small lookup store.
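A minimal Scala sketch of those semantics, again using Apache Jena as a stand-in for the embedded store (the `InMemoryDatasetSketch` class and its method names are hypothetical, not Silk's API):

```scala
import org.apache.jena.query.QueryExecutionFactory
import org.apache.jena.rdf.model.{Model, ModelFactory}

// Hypothetical sketch, not Silk's implementation: one in-memory graph,
// written to by all sinks and read back via SPARQL, gone on process exit.
class InMemoryDatasetSketch(clearBeforeExecution: Boolean = true) {

  // The single in-memory RDF model; there is no file backing at all.
  private val graph: Model = ModelFactory.createDefaultModel()

  // The one knob: fresh empty graph per run, or keep data across runs
  // within the same process.
  def beginWorkflow(): Unit =
    if (clearBeforeExecution) graph.removeAll()

  // Triple sink: adds triples directly. Entity and link sinks differ only
  // in how they derive triples; all converge into this same graph.
  def addTriple(s: String, p: String, o: String): Unit =
    graph.add(graph.createResource(s), graph.createProperty(p), graph.createResource(o))

  // Source side: serve SPARQL SELECT queries against the working graph.
  def select(sparql: String): List[String] = {
    val exec = QueryExecutionFactory.create(sparql, graph)
    try {
      val results = exec.execSelect()
      val varName = results.getResultVars.get(0)
      val buf = scala.collection.mutable.ListBuffer[String]()
      while (results.hasNext) buf += results.next().get(varName).toString
      buf.toList
    } finally exec.close()
  }
}

// Usage: a temporary integration graph for one workflow run.
object InMemoryDatasetExample extends App {
  val ds = new InMemoryDatasetSketch(clearBeforeExecution = true)
  ds.beginWorkflow()                                        // empty working graph
  ds.addTriple("urn:ex:a", "urn:ex:linkedTo", "urn:ex:b")   // sink side
  println(ds.select("SELECT ?s WHERE { ?s ?p ?o }"))        // source side: List(urn:ex:a)
}
```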
AlignmentDataset.md
Alignment is a write-only dataset that exports link results as Alignment files following the AlignAPI format specification (and the SWJ60 description).
The doc keeps scope tight from the start: it exists to serialize links between entities in a standardized alignment format, not to read entities, run transformations, or do extra processing. It motivates the shape via separation of concerns and interoperability: a focused exporter that produces files consumable by alignment-aware tooling and usable in subsequent workflows. The core mechanics are explained at the link-record level: each link becomes one `<Cell>` with an explicit source URI, target URI, an optional relation (e.g., `=`), and an optional confidence measure (0.0–1.0), and the plugin is responsible for emitting a well-formed file (structure, header/footer, UTF-8). A minimal example anchors how multiple links map to multiple `<Cell>` entries (compare the sketch below), and the references section points to the AlignAPI format spec and the SWJ60 paper for full semantics and edge-case details.
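For illustration, a minimal Scala sketch of such an exporter, hand-rolling the AlignAPI XML. The `AlignmentLink` case class and `writeAlignment` helper are hypothetical, not Silk's API; the element names (`<Alignment>`, `<map>`, `<Cell>`, `<entity1>`, `<entity2>`, `<relation>`, `<measure>`) follow the AlignAPI format.

```scala
import java.io.{File, PrintWriter}
import java.nio.charset.StandardCharsets

object AlignmentExportSketch {

  // Hypothetical link record: source URI, target URI, relation, confidence.
  case class AlignmentLink(source: String, target: String,
                           relation: String = "=", measure: Double = 1.0)

  // Illustrative sketch, not Silk's implementation: serialize links into a
  // single well-formed, UTF-8 Alignment file, one <Cell> per link.
  def writeAlignment(links: Seq[AlignmentLink], target: File): Unit = {
    val out = new PrintWriter(target, StandardCharsets.UTF_8.name())
    try {
      // Header: fixed document structure of the AlignAPI format.
      out.println("""<?xml version="1.0" encoding="utf-8"?>""")
      out.println(
        """<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment"
          |         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">""".stripMargin)
      out.println("  <Alignment>")
      // One <Cell> per link: entity pair, relation, and confidence measure.
      for (link <- links) {
        out.println("    <map><Cell>")
        out.println(s"""      <entity1 rdf:resource="${link.source}"/>""")
        out.println(s"""      <entity2 rdf:resource="${link.target}"/>""")
        out.println(s"      <relation>${link.relation}</relation>")
        out.println(s"""      <measure rdf:datatype="http://www.w3.org/2001/XMLSchema#float">${link.measure}</measure>""")
        out.println("    </Cell></map>")
      }
      // Footer closes the document so the output is always well-formed.
      out.println("  </Alignment>")
      out.println("</rdf:RDF>")
    } finally out.close()
  }

  // Usage: two links map to two <Cell> entries.
  def main(args: Array[String]): Unit =
    writeAlignment(Seq(
      AlignmentLink("http://example.org/a1", "http://example.org/b1", measure = 0.95),
      AlignmentLink("http://example.org/a2", "http://example.org/b2")
    ), new File("links.align.rdf"))
}
```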