UnicoLab
diff --git a/‎announcement.md‎
Lines changed: 0 additions & 128 deletions b/‎announcement.md‎
diff --git a/‎docs/artifact-centric.md‎
Lines changed: 67 additions & 0 deletions b/‎docs/artifact-centric.md‎
diff --git a/‎flowyml/__init__.py‎
Lines changed: 4 additions & 1 deletion b/‎flowyml/__init__.py‎
diff --git a/‎flowyml/core/__init__.py‎
Lines changed: 4 additions & 1 deletion b/‎flowyml/core/__init__.py‎
diff --git a/‎flowyml/core/pipeline.py‎
Lines changed: 77 additions & 1 deletion b/‎flowyml/core/pipeline.py‎
@@ -0,0 +1,67 @@
+To understand the shift from Task-Centric (traditional) to Artifact-Centric (FlowyML) pipelines, we have to look at how the execution engine views the relationship between code and data.
+
+Technically, this isn't just a naming convention; it’s a change in how the Directed Acyclic Graph (DAG) is constructed and how the state is persisted.
+
+1. Declarative Signatures vs. Imperative Sequences
+In a task-centric system (like Airflow), you define the order of operations. You essentially write a script that says, "Run preprocess, then run train." The movement of data between them is usually an afterthought: you manually pass S3 paths or local file locations between functions.
+
+In FlowyML (Artifact-Centric), the system builds the DAG by looking at the Input/Output signatures of your steps.
+
+Technical Implementation: When you define a step, you declare: "I produce an artifact named 'train_data' of type Dataset." The Orchestrator looks at another step that says, "I require an input named 'train_data' of type Dataset."
+Result: The "edge" in the graph is formed automatically because of a data dependency, not because you wrote step_a >> step_b. If you change an output name, the graph breaks at build-time (checked by the TypeValidator).
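Review note: the signature-driven edge construction described in section 1 can be sketched in a few lines. This is a minimal illustration and not FlowyML's implementation; `StepSpec` and `build_edges` are hypothetical names, and the type check only stands in for whatever TypeValidator actually does.

```python
from dataclasses import dataclass, field


@dataclass
class StepSpec:
    """Hypothetical stand-in for a step's declared signature (not FlowyML's Step)."""

    name: str
    inputs: dict[str, str] = field(default_factory=dict)   # artifact name -> type name
    outputs: dict[str, str] = field(default_factory=dict)  # artifact name -> type name


def build_edges(steps):
    """Derive (producer, consumer, artifact) edges from signatures alone."""
    producers = {}
    for s in steps:
        for artifact, type_name in s.outputs.items():
            if artifact in producers:
                raise ValueError(f"artifact {artifact!r} is produced by more than one step")
            producers[artifact] = (s.name, type_name)

    edges = []
    for s in steps:
        for artifact, wanted in s.inputs.items():
            if artifact not in producers:
                # A renamed or missing output is caught here, before anything runs.
                raise ValueError(f"step {s.name!r} needs {artifact!r}, which no step produces")
            producer, produced = producers[artifact]
            if produced != wanted:
                raise TypeError(f"{artifact!r}: produced as {produced}, consumed as {wanted}")
            edges.append((producer, s.name, artifact))
    return edges


preprocess = StepSpec("preprocess", outputs={"train_data": "Dataset"})
train = StepSpec("train", inputs={"train_data": "Dataset"}, outputs={"model": "Model"})
edges = build_edges([preprocess, train])
print(edges)  # [('preprocess', 'train', 'train_data')]
```

No ordering is written anywhere in the example. The single edge exists only because `train` asks for an artifact that `preprocess` declares, which is the property the doc is describing.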
+2. The Global Artifact Catalog vs. Manual "Handoffs"
+The biggest technical hurdle in task-centric pipelines is the "handoff." You often see code like: pd.read_csv(f"s3://my-bucket/{run_id}/data.csv"). This hardcodes the infrastructure and pathing logic inside your business logic.
+
+In an artifact-centric system, FlowyML uses the Catalog (Registry Pattern):
+
+Unique Identity: Every artifact is registered in the Catalog (via register()) with a content_hash, source_step, and source_run_id.
+Discovery: A downstream step doesn't need to know where the model is stored (S3 vs. Azure vs. Local). It asks the Catalog for the artifact by name/version. The CatalogBackend resolves the storage URI and handles the high-level fetching.
+Immutability: Each artifact is a record of truth. If the input data hash hasn't changed, the system knows it can skip the task entirely (Caching/Memoization).
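Review note: a small sketch of the registry pattern and hash-based skipping from section 2 may help readers. It is an assumption-laden illustration, not the FlowyML Catalog: `Entry`, `InMemoryCatalog`, `run_cached`, and the example URI are all hypothetical, and only the field names `content_hash`, `source_step`, and `source_run_id` come from the doc.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical catalog record; field names mirror the ones the doc mentions."""

    name: str
    content_hash: str
    source_step: str
    source_run_id: str
    uri: str


class InMemoryCatalog:
    """Toy registry: consumers look artifacts up by name, never by storage path."""

    def __init__(self):
        self._entries = {}  # artifact name -> list of Entry, oldest first

    def register(self, name, payload, source_step, source_run_id, uri):
        digest = hashlib.sha256(payload).hexdigest()
        entry = Entry(name, digest, source_step, source_run_id, uri)
        self._entries.setdefault(name, []).append(entry)
        return entry

    def latest(self, name):
        versions = self._entries.get(name)
        if not versions:
            raise KeyError(name)
        return versions[-1]


def run_cached(step_name, func, input_bytes, cache):
    """Skip the step when the same step has already seen the same input hash."""
    key = (step_name, hashlib.sha256(input_bytes).hexdigest())
    if key in cache:
        return cache[key], True
    result = func(input_bytes)
    cache[key] = result
    return result, False


catalog = InMemoryCatalog()
payload = b"a,b\n1,2\n"
entry = catalog.register("train_data", payload, "preprocess", "run-001", "file:///tmp/train_data.csv")

cache = {}
first, first_hit = run_cached("train", len, payload, cache)
second, second_hit = run_cached("train", len, payload, cache)
print(first_hit, second_hit)  # False True
```

The downstream call only ever says `catalog.latest("train_data")`. Where the bytes live is a property of the entry, so switching storage changes the `uri` but not the consuming code.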
+3. Automatic Lineage (The "Parents" Concept)
+In task-centric systems, if you find a bad model in production, tracing it back to the exact version of the SQL query and the raw CSV that created it is a manual forensic exercise.
+
+In Artifact-Centric FlowyML:
+
+Lineage Tracking: As seen in the CatalogEntry structure, every artifact stores parent_ids.
+Technical Flow: When Step B consumes Artifact A, FlowyML automatically records that Artifact A is the parent of whatever Step B produces.
+Observability: You can call get_lineage(artifact_id) to get a full recursive tree of every transformation that touched that specific piece of data, from raw ingestion to the final insight.
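Review note: the recursive walk over parent_ids described in section 3 can be pictured as follows. This is a sketch under assumptions: `lineage_tree` and `ancestors` are hypothetical helpers, the parent mapping is a plain dict rather than CatalogEntry objects, and the real get_lineage may return a different shape.

```python
def lineage_tree(artifact_id, parents_of):
    """Return a nested tree of the artifact and everything it was derived from.

    parents_of maps an artifact id to the list of its parent ids (hypothetical
    stand-in for the parent_ids stored on each catalog entry).
    """
    return {
        "id": artifact_id,
        "parents": [lineage_tree(p, parents_of) for p in parents_of.get(artifact_id, [])],
    }


def ancestors(artifact_id, parents_of):
    """Flat list of every upstream artifact, each listed once."""
    seen = []
    stack = list(parents_of.get(artifact_id, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents_of.get(current, []))
    return seen


# Recorded automatically in the doc's model: consuming an artifact makes it a parent.
parents_of = {
    "model": ["train_data"],
    "train_data": ["raw_csv", "sql_query"],
}

tree = lineage_tree("model", parents_of)
upstream = ancestors("model", parents_of)
print(sorted(upstream))  # ['raw_csv', 'sql_query', 'train_data']
```

The recursive version assumes the lineage graph is acyclic, which follows from artifacts being immutable and only ever pointing at artifacts that already existed.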
+4. Infrastructure as Configuration (flowyml.yaml)
+In task-centric code, you often specify cpu=4, memory='16Gi' inside your Python @task decorator. This locks your code to specific hardware.
+
+In Artifact-Centric design, we decouple "What happens" from "Where it happens":
+
+The Code: Pure Python logic defined by inputs and outputs.
+The YAML: Defines the Stack. It specifies that the "Model" artifact produced by train_step should be stored in an S3ArtifactStore and that the step should run on a KubernetesOrchestrator.
+Benefit: You can run the exact same artifact logic on your local machine (using LocalCatalogBackend) or in large-scale production without changing a single line of Python.
+Summary Comparison
+| Metric | Task-Centric | Artifact-Centric (FlowyML) |
+| --- | --- | --- |
+| Logic Focus | "What do I run?" (Verbs) | "What do I produce?" (Nouns) |
+| Data Flow | Manual path passing | Automatic resolution via Catalog |
+| Validation | Errors happen at runtime (file not found) | Errors happen at build-time (type mismatch) |
+| Debugging | Check the logs of Task X | Inspect the state of Artifact Y |
+| Portability | Hardcoded file paths/infra | Stack-based storage abstraction |
+By focusing on the Artifact, FlowyML treats data as a first-class citizen of the deployment, enabling reproducible machine learning where every result can be traced to its origin through content hashes and recorded parent IDs.
@@ -10,7 +10,7 @@
 
 # Core imports
 from flowyml.core.context import Context, context
-from flowyml.core.step import step, Step
+from flowyml.core.step import step, Step, StepRegistry, get_registered_steps, clear_step_registry
 from flowyml.core.pipeline import Pipeline
 from flowyml.core.executor import Executor, LocalExecutor
 from flowyml.core.cache import CacheStrategy
@@ -176,6 +176,9 @@
     "context",
     "step",
     "Step",
+    "StepRegistry",
+    "get_registered_steps",
+    "clear_step_registry",
     "Pipeline",
     "Executor",
     "LocalExecutor",
 
@@ -1,7 +1,7 @@
 """Core pipeline execution components."""
 
 from flowyml.core.context import Context, context
-from flowyml.core.step import step, Step
+from flowyml.core.step import step, Step, StepRegistry, get_registered_steps, clear_step_registry
 from flowyml.core.pipeline import Pipeline
 from flowyml.core.executor import Executor, LocalExecutor
 from flowyml.core.cache import CacheStrategy
@@ -44,6 +44,9 @@
     # Steps & Pipelines
     "step",
     "Step",
+    "StepRegistry",
+    "get_registered_steps",
+    "clear_step_registry",
     "Pipeline",
     # Execution
     "Executor",
 
@@ -135,6 +135,19 @@ class Pipeline:
         >>> @step(outputs=["model/trained"])
         ... def train(learning_rate: float, epochs: int):
         ...     return train_model(learning_rate, epochs)
+
+        # Option 1: Auto-discover all @step-decorated functions
+        >>> pipeline = Pipeline("my_pipeline", context=ctx, auto_discover=True)
+        >>> result = pipeline.run()
+
+        # Option 2: Concise explicit selection
+        >>> pipeline = Pipeline.from_steps(train, name="my_pipeline", context=ctx)
+
+        # Option 3: Batch add
+        >>> pipeline = Pipeline("my_pipeline", context=ctx)
+        >>> pipeline.add_steps([train])
+
+        # Option 4: Manual add_step (existing, still works)
         >>> pipeline = Pipeline("my_pipeline", context=ctx)
         >>> pipeline.add_step(train)
         >>> result = pipeline.run()
@@ -184,6 +197,7 @@ def __init__(
         project: str | None = None,  # Project name to attach to (deprecated, use project_name)
         project_name: str | None = None,  # Project name to attach to (creates if doesn't exist)
         version: str | None = None,  # If provided, VersionedPipeline is created via __new__
+        auto_discover: bool = False,  # Auto-discover @step-decorated functions
         **kwargs,
     ):
         """Initialize pipeline.
@@ -202,8 +216,10 @@ def __init__(
                 If the project doesn't exist, it will be created automatically.
             version: Optional version string. If provided, a VersionedPipeline
                 instance will be created instead of a regular Pipeline.
+            auto_discover: If True, automatically discover all ``@step``-decorated
+                functions from the global registry at build time. Steps with a
+                matching ``pipeline`` tag are preferred. Defaults to False.
             **kwargs: Additional keyword arguments passed to the pipeline.
-                instance is automatically created instead of a regular Pipeline.
         """
         from flowyml.utils.config import get_config
 
@@ -290,6 +306,7 @@ def __init__(
 
         # State
         self._built = False
+        self._auto_discover = auto_discover
         self.step_groups: list[Any] = []  # Will hold StepGroup objects
         self.control_flows: list[Any] = []  # Store conditional control flows (If, Switch, etc.)
 
@@ -318,6 +335,56 @@ def add_step(self, step: Step) -> "Pipeline":
         self._built = False
         return self
 
+    def add_steps(self, steps: list[Step]) -> "Pipeline":
+        """Add multiple steps to the pipeline at once.
+
+        Args:
+            steps: List of Step instances to add
+
+        Returns:
+            Self for chaining
+
+        Example:
+            >>> pipeline.add_steps([load_data, train_model, evaluate])
+        """
+        for s in steps:
+            self.steps.append(s)
+        self._built = False
+        return self
+
+    @classmethod
+    def from_steps(
+        cls,
+        *steps: Step,
+        name: str,
+        **kwargs,
+    ) -> "Pipeline":
+        """Create a pipeline from an explicit list of steps.
+
+        Convenience constructor that avoids repetitive ``add_step()`` calls
+        while still giving you full control over which steps are included.
+
+        Args:
+            *steps: Step instances to include
+            name: Pipeline name (keyword-only)
+            **kwargs: Additional arguments passed to Pipeline()
+
+        Returns:
+            Configured Pipeline instance
+
+        Example:
+            >>> pipeline = Pipeline.from_steps(
+            ...     load_data,
+            ...     train_model,
+            ...     evaluate,
+            ...     name="training",
+            ...     enable_cache=False,
+            ... )
+        """
+        pipeline = cls(name=name, **kwargs)
+        pipeline.add_steps(list(steps))
+        return pipeline
+
     def add_control_flow(self, control_flow: Any) -> "Pipeline":
         """Add conditional control flow to the pipeline.
 
@@ -397,6 +464,15 @@ def build(self) -> None:
         if self._built:
             return
 
+        # Auto-discover steps from global registry if enabled
+        if self._auto_discover and not self.steps:
+            from flowyml.core.step import get_registered_steps
+
+            discovered = get_registered_steps(pipeline=self.name)
+            if not discovered:
+                discovered = get_registered_steps()
+            self.steps = list(discovered)
+
         # Clear previous DAG
         self.dag = DAG()
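
Review note: the fallback in the `build()` hunk (try steps tagged for this pipeline, then fall back to every registered step) is easy to misread, so here is a standalone sketch of that lookup. It does not import flowyml, and the registry internals are assumptions: `toy_step`, `toy_registered_steps`, and `discover` are hypothetical stand-ins, since the real `flowyml.core.step` implementation is not part of the lines shown here.

```python
_REGISTRY = []  # hypothetical global registry, filled at decoration time


def toy_step(func=None, *, pipeline=None):
    """Toy decorator: records the function and an optional pipeline tag."""

    def decorate(f):
        f.pipeline_tag = pipeline
        _REGISTRY.append(f)
        return f

    # Support both @toy_step and @toy_step(pipeline="...").
    return decorate(func) if func is not None else decorate


def toy_registered_steps(pipeline=None):
    """All registered steps, or only those tagged for the given pipeline."""
    if pipeline is None:
        return list(_REGISTRY)
    return [f for f in _REGISTRY if f.pipeline_tag == pipeline]


def discover(pipeline_name):
    """Same two-stage lookup as the hunk: tagged steps first, then everything."""
    discovered = toy_registered_steps(pipeline=pipeline_name)
    if not discovered:
        discovered = toy_registered_steps()
    return list(discovered)


@toy_step(pipeline="training")
def load():
    return "data"


@toy_step
def report():
    return "report"


tagged = [f.__name__ for f in discover("training")]
fallback = [f.__name__ for f in discover("other")]
print(tagged)    # ['load']
print(fallback)  # ['load', 'report']
```

In this sketch the fallback returns every registered step, including ones tagged for a different pipeline. If the real `get_registered_steps()` behaves the same way, a pipeline whose name matches no tag would silently pick up unrelated steps, so the docstring may want to say so. As in the hunk itself, discovery only matters when no steps were added explicitly.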