To understand the shift from Task-Centric (traditional) to Artifact-Centric (FlowyML) pipelines, we have to look at how the execution engine views the relationship between code and data.
Technically, this isn't just a naming convention; it’s a change in how the Directed Acyclic Graph (DAG) is constructed and how the state is persisted.
1. Declarative Signatures vs. Imperative Sequences
In a task-centric system (like Airflow), you define the order of operations. You essentially write a script that says, "Run preprocess, then run train." The movement of data between them is usually an afterthought: you manually pass S3 paths or local file locations between functions.
In FlowyML (Artifact-Centric), the system builds the DAG by looking at the Input/Output signatures of your steps.
Technical Implementation: When you define a step, you declare: "I produce an artifact named 'train_data' of type Dataset." The Orchestrator looks at another step that says, "I require an input named 'train_data' of type Dataset."
Result: The "edge" in the graph is formed automatically because of a data dependency, not because you wrote step_a >> step_b. If you change an output name, the graph breaks at build-time (checked by the TypeValidator).
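The edge derivation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not FlowyML's implementation: StepSpec and build_edges are hypothetical names invented here, and artifact types are compared as plain strings rather than through the TypeValidator.

```python
from dataclasses import dataclass, field


@dataclass
class StepSpec:
    """Hypothetical step declaration: a name plus typed inputs and outputs."""
    name: str
    inputs: dict = field(default_factory=dict)   # artifact name -> type name
    outputs: dict = field(default_factory=dict)  # artifact name -> type name


def build_edges(steps):
    """Derive DAG edges from data dependencies, failing before anything runs."""
    producers = {}  # artifact name -> (producing step name, declared type)
    for step in steps:
        for artifact, type_name in step.outputs.items():
            if artifact in producers:
                raise ValueError(f"artifact {artifact!r} has two producers")
            producers[artifact] = (step.name, type_name)

    edges = []
    for step in steps:
        for artifact, wanted in step.inputs.items():
            if artifact not in producers:
                raise ValueError(f"{step.name}: no step produces {artifact!r}")
            producer, produced = producers[artifact]
            if produced != wanted:
                raise TypeError(
                    f"{step.name}: {artifact!r} is {produced}, expected {wanted}"
                )
            edges.append((producer, step.name))
    return edges


preprocess = StepSpec("preprocess", outputs={"train_data": "Dataset"})
train = StepSpec(
    "train", inputs={"train_data": "Dataset"}, outputs={"model": "Model"}
)

# No explicit ordering is written anywhere; the edge comes from 'train_data'.
edges = build_edges([preprocess, train])
print(edges)  # [('preprocess', 'train')]
```

Renaming the output of preprocess to anything other than 'train_data' makes build_edges raise before any step executes, which is the build-time failure described above.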
2. The Global Artifact Catalog vs. Manual "Handoffs"
The biggest technical hurdle in task-centric pipelines is the "handoff." You often see code like: pd.read_csv(f"s3://my-bucket/{run_id}/data.csv"). This hardcodes the infrastructure and pathing logic inside your business logic.
In an artifact-centric system, FlowyML uses the Catalog (Registry Pattern):
Unique Identity: Every artifact is registered in the Catalog (via register()) with a content_hash, source_step, and source_run_id.
Discovery: A downstream step doesn't need to know where the model is stored (S3 vs. Azure vs. Local). It asks the Catalog for the artifact by name/version. The CatalogBackend resolves the storage URI and handles the high-level fetching.
Immutability: Each artifact is a record of truth. If the input data hash hasn't changed, the system knows it can skip the task entirely (Caching/Memoization).
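The three properties above (identity, discovery, immutability) can be illustrated with a small in-memory registry. This is a sketch under stated assumptions: InMemoryCatalog, Entry, run_cached, and the mem:// URI scheme are all hypothetical and only mirror the field names mentioned in the text (content_hash, source_step, source_run_id); FlowyML's real Catalog and CatalogBackend may look quite different.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical catalog record, using the field names from the text."""
    name: str
    content_hash: str
    source_step: str
    source_run_id: str
    uri: str


class InMemoryCatalog:
    """Hypothetical registry: callers look artifacts up by name, never by path."""

    def __init__(self):
        self._by_name = {}  # artifact name -> entries, oldest first
        self._blobs = {}    # storage URI -> raw bytes

    def register(self, name, payload, source_step, source_run_id):
        content_hash = hashlib.sha256(payload).hexdigest()
        uri = f"mem://{content_hash}"  # made-up scheme for this sketch
        self._blobs[uri] = payload
        entry = Entry(name, content_hash, source_step, source_run_id, uri)
        self._by_name.setdefault(name, []).append(entry)
        return entry

    def latest(self, name):
        return self._by_name[name][-1]

    def load(self, name):
        # The catalog resolves the URI; the calling step never sees it.
        return self._blobs[self.latest(name).uri]


def run_cached(catalog, cache, step_name, fn, input_name, output_name, run_id):
    """Run fn on the named input unless this step already saw the same hash."""
    key = (step_name, catalog.latest(input_name).content_hash)
    if key in cache:
        return cache[key], True   # cache hit: the step is skipped
    result = fn(catalog.load(input_name))
    entry = catalog.register(output_name, result, step_name, run_id)
    cache[key] = entry
    return entry, False


catalog = InMemoryCatalog()
cache = {}

catalog.register("raw", b"a,b\n1,2\n", "ingest", "run-1")
first, hit1 = run_cached(catalog, cache, "upper", bytes.upper, "raw", "clean", "run-1")

# A second run ingests identical bytes, so the input hash is unchanged.
catalog.register("raw", b"a,b\n1,2\n", "ingest", "run-2")
second, hit2 = run_cached(catalog, cache, "upper", bytes.upper, "raw", "clean", "run-2")
```

Here the second call returns the entry produced by the first run without executing the step again, because the cache key is built from the step name and the input's content hash rather than from a run-specific path.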
3. Automatic Lineage (The "Parents" Concept)
In task-centric systems, if you find a bad model in production, tracing it back to the exact version of the SQL query and the raw CSV that created it is a manual forensic exercise.
In Artifact-Centric FlowyML:
Lineage Tracking: As seen in the CatalogEntry structure, every artifact stores parent_ids.
Technical Flow: When Step B consumes Artifact A, FlowyML automatically records that Artifact A is the parent of whatever Step B produces.
Observability: You can call get_lineage(artifact_id) to get a full recursive tree of every transformation that touched that specific piece of data, from raw ingestion to the final insight.
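The parent-recording and recursive lookup described above can be sketched as follows. The functions record_output and trace_lineage are hypothetical stand-ins written for this illustration (the real get_lineage takes only an artifact_id and its return shape is not shown in the text), and a plain dict plays the role of the catalog. The sketch assumes lineage is acyclic, as it must be in a DAG.

```python
def record_output(entries, artifact_id, source_step, consumed_ids):
    """Store a new artifact and remember which artifacts the step consumed."""
    entries[artifact_id] = {
        "source_step": source_step,
        "parent_ids": list(consumed_ids),
    }


def trace_lineage(entries, artifact_id):
    """Return the artifact and, recursively, every ancestor as a nested dict."""
    entry = entries[artifact_id]
    return {
        "id": artifact_id,
        "source_step": entry["source_step"],
        "parents": [trace_lineage(entries, pid) for pid in entry["parent_ids"]],
    }


entries = {}
record_output(entries, "raw_csv", "ingest", [])
record_output(entries, "features", "preprocess", ["raw_csv"])
record_output(entries, "model", "train", ["features"])

tree = trace_lineage(entries, "model")
```

Walking tree from the top gives model, then features, then raw_csv, so a suspicious model can be traced to its raw input without reading task logs.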
4. Infrastructure as Configuration (flowyml.yaml)
In task-centric code, you often specify cpu=4, memory='16Gi' inside your Python @task decorator. This locks your code to specific hardware.
In Artifact-Centric design, we decouple "What happens" from "Where it happens":
The Code: Pure Python logic defined by inputs and outputs.
The YAML: Defines the Stack. It specifies that the "Model" artifact produced by train_step should be stored in an S3ArtifactStore and that the step should run on a KubernetesOrchestrator.
Benefit: You can run the exact same artifact logic on your local machine (using LocalCatalogBackend) or in large-scale production without changing a single line of Python.
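The separation of "what happens" from "where it happens" can be sketched in Python, with a dict standing in for a parsed configuration file. Everything here is an assumption made for illustration: the dict layout is not the actual flowyml.yaml schema, and LocalStore, FakeBucketStore, make_store, and run are hypothetical names. No real storage service is contacted.

```python
class LocalStore:
    """Hypothetical local store: keeps artifacts in a dict."""

    def __init__(self, root="./artifacts"):
        self.root = root
        self.saved = {}

    def save(self, name, value):
        self.saved[name] = value
        return f"{self.root}/{name}"


class FakeBucketStore:
    """Hypothetical stand-in for an object store; nothing leaves the process."""

    def __init__(self, bucket):
        self.bucket = bucket
        self.saved = {}

    def save(self, name, value):
        self.saved[name] = value
        return f"s3://{self.bucket}/{name}"


def make_store(store_config):
    """Pick a store from configuration; the step code never does this itself."""
    kind = store_config["type"]
    if kind == "local":
        return LocalStore()
    if kind == "bucket":
        return FakeBucketStore(store_config["bucket"])
    raise ValueError(f"unknown artifact store type: {kind!r}")


def train_step(train_data):
    """Pure logic: no paths, no hardware settings, only input and output."""
    return {"mean": sum(train_data) / len(train_data)}


def run(stack, train_data):
    store = make_store(stack["artifact_store"])
    model = train_step(train_data)
    return store.save("model", model)


# Two stacks, one unchanged train_step.
local_stack = {"artifact_store": {"type": "local"}}
prod_stack = {"artifact_store": {"type": "bucket", "bucket": "my-bucket"}}

local_uri = run(local_stack, [1.0, 2.0, 3.0])
prod_uri = run(prod_stack, [1.0, 2.0, 3.0])
```

Only the stack dict differs between the two runs; train_step is identical in both, which is the property the Benefit line describes.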
Summary Comparison

| Metric | Task-Centric | Artifact-Centric (FlowyML) |
| --- | --- | --- |
| Logic Focus | "What do I run?" (Verbs) | "What do I produce?" (Nouns) |
| Data Flow | Manual path passing | Automatic resolution via Catalog |
| Validation | Errors happen at runtime (file not found) | Errors happen at build-time (type mismatch) |
| Debugging | Check the logs of Task X | Inspect the state of Artifact Y |

By focusing on the Artifact, FlowyML treats data as a first-class citizen of the deployment, enabling reproducible machine learning where every result is linked to its origin through its content_hash and recorded parent_ids.