Added low-level workflow protocol documentation #5024
Merged
30 commits
b6d1842
Initial commit adding low-level workflow protocol information as deri…
WhitWaldo cec2e1a
Added page for workflow versioning
WhitWaldo 5fd61bc
Added section for Task Activity IDs
WhitWaldo 8fcbca8
Updated state value to include version information
WhitWaldo efc3df6
Added patch versioning and stalled sections to orchestration lifecycle
WhitWaldo c0d20e6
Added missing article
WhitWaldo 1ba6a96
Added documentation about the continuation tokens and task execution IDs
WhitWaldo 687f667
Added more context around task execution IDs
WhitWaldo 3f1ea60
Fleshed out the continuation tokens in management API
WhitWaldo 02a8c6d
Added completion tokens to activity lifecycle
WhitWaldo 96ea383
Added completion token to protocol execution
WhitWaldo 8d3e8ce
Removed bad link
WhitWaldo 11fb8a6
Merge branch 'v1.17' into runtime-workflow-docs
marcduiker cec592f
Merge branch 'v1.17' into runtime-workflow-docs
WhitWaldo 7d06616
Merge branch 'v1.17' into runtime-workflow-docs
marcduiker 5821078
Added rerun/purge per Josh's review
WhitWaldo b40cfb6
Merge remote-tracking branch 'origin/runtime-workflow-docs' into runt…
WhitWaldo 98a26f6
Added re-run and purge descriptions
WhitWaldo fa6937f
Removed text that didn't make sense
WhitWaldo f85153f
Removed unnecessary spaces
WhitWaldo 5fbbf90
Added clarification as to where the SDK's role lies
WhitWaldo c79282f
Fixed line indentation
WhitWaldo 20e8f33
Merge branch 'v1.17' into runtime-workflow-docs
WhitWaldo d5f19d0
Updated to use nomenclature of Dapr Workflows
WhitWaldo 65335ff
Added section clarifying named workflow versioning in addition to pat…
WhitWaldo 980f5df
Clarified that the management APIs are often exposed by the SDKs
WhitWaldo 88cfb97
Clarified that the SDK also retrieves work history (as it's not immed…
WhitWaldo 672332d
Added clarifying verbiage about versioning properties and how a stall…
WhitWaldo 4b2cceb
Update daprdocs/content/en/contributing/protocol-reference/workflow-p…
marcduiker d693142
Merge branch 'v1.17' into runtime-workflow-docs
marcduiker
23 changes: 23 additions & 0 deletions
daprdocs/content/en/contributing/protocol-reference/_index.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| --- | ||
| type: docs | ||
| title: "Protocol Reference" | ||
| linkTitle: "Protocol Reference" | ||
| weight: 100 | ||
| description: > | ||
| Low-level technical documentation of the Dapr runtime protocols and internal mechanics for each building block. | ||
| --- | ||
|
|
||
| This section provides a deep dive into the internal workings of the Dapr runtime. It is intended for maintainers, | ||
| contributors, and anyone interested in the low-level implementation details of Dapr's building blocks. | ||
|
|
||
| Unlike the user-facing API reference, these documents focus on: | ||
| - How the runtime processes requests. | ||
| - Internal state transitions. | ||
| - Interaction with component interfaces. | ||
| - Protocol-level details (gRPC and HTTP). | ||
|
|
||
| ## Building Blocks | ||
|
|
||
| Select a building block to explore its internal protocol and mechanics: | ||
|
|
||
| - [Workflow]({{% ref workflow-protocol %}}) | ||
139 changes: 139 additions & 0 deletions
daprdocs/content/en/contributing/protocol-reference/workflow-protocol/_index.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,139 @@ | ||
| --- | ||
|
|
||
| type: docs | ||
| title: "Workflow Protocol" | ||
| linkTitle: "Workflow" | ||
| weight: 10 | ||
| description: "Low-level description of the Workflow building block internals." | ||
| --- | ||
|
|
||
| This document specifies the Dapr Workflow protocol and runtime contract at a low level. It targets SDK authors building | ||
| Workflow Workers and runtime maintainers evolving the Dapr sidecar’s Workflow Engine. | ||
|
|
||
| ## Overview | ||
| Dapr Workflow implements a sidecar-as-scheduler pattern: the Dapr runtime (sidecar) acts as the Workflow Engine, and | ||
| the application SDK acts as the Workflow Worker. All control and execution traffic flows over gRPC. | ||
|
|
||
| There are two protocol surfaces: | ||
| 1. Management API (standard Dapr gRPC accessible via SDK): | ||
| - Start, terminate, pause, resume, re-run, purge and query workflow instances. | ||
| 2. Execution API (Task Hub Protocol): | ||
| - Worker-facing; used to receive orchestration/activity work items and to report completion (e.g., via | ||
| `TaskHubSidecarService`). | ||
|
|
||
| [//]: # (graph LR) | ||
|
|
||
|
|
||
| [//]: # (Client[Workflow Client] -- Management API --> Sidecar[Dapr Sidecar Engine]) | ||
|
|
||
| [//]: # (Sidecar -- Task Hub Protocol --> Worker[Workflow Worker SDK]) | ||
|
|
||
| [//]: # (Sidecar -- State Store --> Actors[Dapr Actors Backend]) | ||
|
|
||
|
|
||
| ### Key Components | ||
| - **Workflow Engine (Dapr Sidecar)** | ||
|
|
||
| Manages workflow state transitions, history persistence, scheduling of orchestration and activity tasks, and | ||
| reliable delivery semantics. By default, it leverages Dapr Actors as the backend for durable, partitioned execution. | ||
|
|
||
| - **Workflow Worker (Application SDK)** | ||
|
|
||
| Connects to the sidecar, polls for orchestration and activity work items, executes user-defined logic, and returns | ||
| results, failures, and heartbeats to the engine. Orchestration logic must be deterministic; activity logic need not | ||
| be. | ||
|
|
||
| - **Orchestration** | ||
|
|
||
| The deterministic coordinator that defines the workflow. The engine drives orchestrations via history replay to | ||
| rebuild state and schedule outbound tasks (activities, sub-orchestrations, timers, external events). | ||
|
|
||
| - **Activity** | ||
|
|
||
| The atomic unit of work. Activities are executed **at-least-once** and report results or failures back to the engine. | ||
| Idempotency is recommended; task execution identifiers are available on the activity context to help with this. | ||
|
|
||
| - **State Store & Backend** | ||
|
|
||
| Workflow history and state are durably persisted. The engine typically implements a task hub pattern over the chosen | ||
| persistence and uses Dapr Actors as the default reliability substrate. | ||
|
|
||
| ## Execution Model | ||
| Dapr Workflow is based on the Durable Task Framework (DTFx) execution semantics: | ||
|
|
||
| ### Replay-based Execution | ||
| Orchestrators are replayed from their event history to rebuild deterministic state. All nondeterministic operations | ||
| (time, random values, I/O) must be mediated by the engine (e.g., timers, activity calls, external events). | ||
|
|
||
| ### Deterministic Orchestrators | ||
| Orchestrator code must be side-effect free except via engine-mediated effects. Control flow must be reproducible | ||
| during replay. | ||
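
The replay model described above can be illustrated with a minimal sketch. This is not Dapr or SDK code: the generator-based orchestrator, the `replay` helper, and the `(name, result)` history tuples are all assumptions made for illustration. The orchestrator yields the tasks it wants scheduled, and the helper feeds recorded results back in until the history runs out.

```python
from typing import Any, Callable, Generator, List, Tuple

Task = Tuple[str, Any]


# Hypothetical orchestrator: yields (activity_name, input) and receives the result.
def order_workflow(order_id: str) -> Generator[Task, Any, str]:
    reserved = yield ("reserve_stock", order_id)
    charged = yield ("charge_card", order_id)
    return f"{reserved}/{charged}"


def replay(orchestrator: Callable[[Any], Generator], arg: Any,
           history: List[Tuple[str, Any]]) -> Tuple[str, Any]:
    """Feed recorded results back into the orchestrator.

    Returns ("pending", next_task) when history runs out, or
    ("completed", output) when the orchestrator returns.
    """
    gen = orchestrator(arg)
    try:
        task = next(gen)
        for recorded_name, recorded_result in history:
            # The replayed code must request the same task the history recorded.
            if task[0] != recorded_name:
                raise RuntimeError("non-deterministic orchestrator")
            task = gen.send(recorded_result)
        return ("pending", task)
    except StopIteration as done:
        return ("completed", done.value)
```

Because the orchestrator only observes results through `yield`, replaying the same history always reaches the same point, which is the property the determinism rule protects.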
|
|
||
| ### At-least-once Activities, Exactly-once State Commit | ||
| Activities may be delivered more than once. The engine ensures workflow state commits are idempotent and applied | ||
| exactly once. | ||
|
|
||
| ### Sidecar-as-Scheduler | ||
| The sidecar owns scheduling and persists all history/events before dispatching work to workers. Workers are stateless | ||
| executors from the engine’s perspective. | ||
|
|
||
| ## Protocol Surfaces | ||
| 1) Management API (Standard Dapr gRPC) | ||
| - **Start Workflow**: Create and persist an initial history event; return instance metadata. | ||
| - **Terminate / Pause / Resume**: Drive lifecycle transitions through persisted control events. | ||
| - **Query**: Retrieve instance status, history, output, failure details, and custom metadata. | ||
| - **Re-run**: Start a new workflow instance from a history event. | ||
| - **Purge**: Proactively clear workflow history and state. | ||
|
|
||
| > **Note**: See the [Management API specification]({{% ref workflow-protocol-management-api %}}) for exact RPC shapes, | ||
| > error codes, and semantics. | ||
|
|
||
| 2) Execution API (Task Hub Protocol) | ||
| - **Poll for Work**: Workers fetch orchestration and activity work items. | ||
| - **Complete / Fail Work**: Workers report completion results or failures; the engine appends these to history and advances | ||
| orchestration progress. | ||
| - **Heartbeats / Leases**: Optional mechanisms for long-running activities and cooperative rebalancing. | ||
| - **Timers & External Events**: Delivered to orchestrations as history events to keep replay deterministic. | ||
|
|
||
| > **Note**: See the [Execution API specification]({{% ref workflow-protocol-execution-api %}}), which defines the `TaskHubSidecarService` | ||
| > contracts, payload schemas, and sequencing rules. | ||
|
|
||
| ## Request & Runtime Lifecycle | ||
| 1. Start Workflow | ||
| - Client calls `StartWorkflow` via the Management API. | ||
| - Engine persists the initial event (e.g., `ExecutionStarted`) and materializes an instance. | ||
| 2. Orchestrator Execution (Replay-driven) | ||
| - Engine replays orchestration history to rehydrate state. | ||
| - Orchestrator schedules effects (activities, sub-orchestrations, timers) by issuing commands, which the engine | ||
| persists as new history events. | ||
| 3. Activity Dispatch & Execution | ||
| - Engine dispatches activity work items to workers. | ||
| - Worker runs the activity (may be retried and delivered at least once). | ||
| - Worker responds with completion (result) or failure; engine appends to history. | ||
| 4. Timers & External Signals | ||
| - Engine delivers timer fired or external event records as history entries. | ||
| - Orchestrator consumes these deterministically on next replay. | ||
| 5. Progress & Checkpointing | ||
| - Each step appends to the history log and advances orchestration state. | ||
| - The engine safeguards idempotence and exactly-once commit of orchestration state. | ||
| 6. Completion | ||
| - Orchestration returns an output (success) or a failure (exception details). | ||
| - Final state and output are persisted; status queries reflect the terminal state. | ||
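
The lifecycle above can be sketched as an append-only history log. The `Engine` class below is a toy, in-memory stand-in and not the Dapr implementation; `ExecutionStarted` comes from the text, while the other event names, the status strings, and every method name are assumptions chosen for illustration. The point it shows is that a duplicate activity completion is not committed to history a second time.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Instance:
    instance_id: str
    status: str = "RUNNING"  # hypothetical status label
    history: List[Tuple[Any, ...]] = field(default_factory=list)
    output: Any = None


class Engine:
    """Toy in-memory engine; a real engine persists history durably."""

    def __init__(self) -> None:
        self.instances: Dict[str, Instance] = {}

    def start_workflow(self, instance_id: str, name: str, payload: Any) -> Instance:
        inst = Instance(instance_id)
        inst.history.append(("ExecutionStarted", name, payload))
        self.instances[instance_id] = inst
        return inst

    def schedule_task(self, instance_id: str, task_id: int, name: str) -> None:
        self.instances[instance_id].history.append(("TaskScheduled", task_id, name))

    def complete_task(self, instance_id: str, task_id: int, result: Any) -> bool:
        """Append the completion once; report False for a duplicate delivery."""
        inst = self.instances[instance_id]
        if any(e[0] == "TaskCompleted" and e[1] == task_id for e in inst.history):
            return False
        inst.history.append(("TaskCompleted", task_id, result))
        return True

    def complete_workflow(self, instance_id: str, output: Any) -> None:
        inst = self.instances[instance_id]
        inst.history.append(("ExecutionCompleted", output))
        inst.status, inst.output = "COMPLETED", output
```

Scanning the history for an existing completion is the simplest way to show the idea; a production engine would more likely rely on transactional state updates.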
|
|
||
| ## Protocol Principles | ||
| - **GRIEF (GRpc IntErFace)**: All worker/engine and client/engine communication is gRPC. | ||
| - **Replay-based Orchestration**: Determinism enforced through history replay. | ||
| - **At-least-once Activity Delivery**: Activities may re-execute; design for idempotency. | ||
| - **Engine-mediated Effects**: All nondeterminism/time/IO flows through the engine to remain replay-safe. | ||
|
|
||
| ## Documentation Map | ||
| 1. [Management API]({{% ref workflow-protocol-management-api %}}) | ||
| Detailed Dapr gRPC control-plane operations and payloads. | ||
| 2. [Execution API (Task Hub Protocol)]({{% ref workflow-protocol-execution-api %}}) | ||
| `TaskHubSidecarService` worker protocol, work item contracts, result/failure reporting, and sequencing. | ||
| 3. [Orchestration Lifecycle]({{% ref workflow-protocol-orchestration-lifecycle %}}) | ||
| Replay semantics, scheduling, external events, timers, and completion. | ||
| 4. [Activity Lifecycle]({{% ref workflow-protocol-activity-lifecycle %}}) | ||
| Dispatch, retries, idempotency, heartbeat semantics, and failure handling. | ||
| 5. [State & History]({{% ref workflow-protocol-state-and-history %}}) | ||
| History schema, state snapshots, and persistence guarantees. | ||
| 6. [Versioning]({{% ref workflow-protocol-versioning %}}) | ||
| How Dapr handles multiple versions of the same workflow definition. | ||
118 changes: 118 additions & 0 deletions
...ng/protocol-reference/workflow-protocol/workflow-protocol-activity-lifecycle.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,118 @@ | ||
| --- | ||
|
|
||
| type: docs | ||
| title: "Workflow Protocol - Activity Lifecycle" | ||
| linkTitle: "Activity Lifecycle" | ||
| weight: 400 | ||
| description: "Low-level description of the Workflow building block internals." | ||
| --- | ||
|
|
||
| # Activity Lifecycle | ||
|
|
||
| Activities are the basic units of work in a Dapr Workflow. Unlike orchestrations, activities are not replayed and do | ||
| not need to be deterministic. Each scheduled activity is delivered at least once; a retry or redelivery may run it again. | ||
|
|
||
| ## Execution Flow | ||
|
|
||
| 1. **Scheduling**: An orchestration requests an activity by sending a `ScheduleTask` action to the Dapr engine. | ||
| 2. **Work Item Dispatch**: The Dapr engine enqueues an activity task. When an activity worker (SDK) is available, the | ||
| engine sends an `ActivityWorkItem` via the `GetWorkItems` stream. | ||
| 3. **Execution**: The SDK receives the `ActivityWorkItem`, which contains: | ||
| * `name`: The name of the activity to execute. | ||
| * `input`: The input data for the activity. | ||
| * `instance_id`: The ID of the workflow instance that scheduled the activity. | ||
| * `task_id`: A unique identifier for this specific activity execution. | ||
| * `task_execution_id`: A unique identifier for the specific *attempt* of this activity. This is useful for | ||
| implementing idempotency in activity logic. | ||
| * `completion_token`: An opaque token used to correlate the response with this specific work item. | ||
| 4. **Reporting**: After the activity logic finishes, the SDK sends a `CompleteActivityTask` request back to Dapr. | ||
| * **Success**: The SDK provides the serialized output in the `result` field. | ||
| * **Failure**: The SDK provides `failure_details` (error message, type, stack trace). | ||
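
The worker side of steps 3 and 4 can be sketched as follows. The work item fields mirror the list above; the `ActivityResponse` layout, the keys inside `failure_details`, the `run_activity` helper, and the registry dictionary are assumptions for illustration and are not taken from the actual protobuf contract.

```python
import traceback
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class ActivityWorkItem:
    name: str
    input: str
    instance_id: str
    task_id: int
    task_execution_id: str
    completion_token: str


@dataclass(frozen=True)
class ActivityResponse:  # hypothetical shape of the completion payload
    instance_id: str
    task_id: int
    completion_token: str
    result: Optional[str] = None
    failure_details: Optional[dict] = None


def run_activity(item: ActivityWorkItem,
                 registry: Dict[str, Callable[[str], str]]) -> ActivityResponse:
    """Run the named activity and build the response, echoing the token unchanged."""
    try:
        output = registry[item.name](item.input)
        return ActivityResponse(item.instance_id, item.task_id,
                                item.completion_token, result=output)
    except Exception as exc:  # any failure is reported, not raised
        details = {
            "error_type": type(exc).__name__,
            "error_message": str(exc),
            "stack_trace": traceback.format_exc(),
        }
        return ActivityResponse(item.instance_id, item.task_id,
                                item.completion_token, failure_details=details)
```

In this sketch an unregistered activity name surfaces as a failure response too, since the registry lookup raises inside the `try` block.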
|
|
||
| ## Task Execution IDs | ||
|
|
||
| The `task_execution_id` (also known as the Task Execution Key) is a unique, runtime-generated string (typically a UUID) | ||
| that identifies a specific **attempt** to execute an activity task. | ||
|
|
||
| ### Why it matters to the SDK | ||
|
|
||
| While the Workflow SDK is generally stateless between work items, the `task_execution_id` provides critical context | ||
| for the **Activity Worker**: | ||
|
|
||
| 1. **Distributed Idempotency**: If an activity performs a side effect (e.g., charging a credit card), it | ||
| should use the `task_execution_id` as an idempotency key. | ||
| 2. **Distinguishing Retries**: Unlike the `task_id` (which remains constant for a specific step in the workflow), | ||
| the `task_execution_id` **changes every time the engine retries the activity** (e.g., due to a timeout or worker crash). | ||
| 3. **Zombie Detection**: If an activity worker takes too long and the engine times it out and retries on another | ||
| worker, the original worker might eventually finish. By checking the `task_execution_id` against a persistent store | ||
| or external API, the worker can determine if it is a "zombie" whose results are no longer wanted. | ||
|
|
||
| ### Implementation Guidelines for SDKs | ||
|
|
||
| * **Expose to User**: The SDK MUST expose the `task_execution_id` to the activity implementation logic | ||
| (e.g., via an `ActivityContext`). | ||
| * **Do Not Cache**: The SDK should not attempt to cache or reuse this ID across different work items. | ||
| * **Opaque Usage**: The SDK should treat the value as an opaque string. It is generated by the Dapr sidecar when | ||
| the activity is dispatched and is not something the SDK needs to create or parse. | ||
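
As a rough sketch of the idempotency-key usage described above, an activity might pass the ID it reads from its context to a downstream system that deduplicates on that key. The `ActivityContext` shape, the `PaymentLedger` class, and the `charge_card` function are hypothetical; the sketch only shows what happens when the same work item (and therefore the same ID) is delivered twice.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ActivityContext:  # hypothetical context object exposed by an SDK
    task_id: int
    task_execution_id: str


class PaymentLedger:
    """Stand-in for an external system that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self._charges: Dict[str, int] = {}

    def charge_once(self, key: str, amount_cents: int) -> bool:
        if key in self._charges:
            return False  # this key was already applied; do nothing
        self._charges[key] = amount_cents
        return True

    def total(self) -> int:
        return sum(self._charges.values())


def charge_card(ctx: ActivityContext, amount_cents: int, ledger: PaymentLedger) -> str:
    # The ID is treated as an opaque string and used only as a lookup key.
    applied = ledger.charge_once(ctx.task_execution_id, amount_cents)
    return "charged" if applied else "already-charged"
```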
|
|
||
| ## Completion Tokens | ||
|
|
||
| The `completion_token` is an opaque string generated by the Dapr runtime and delivered to the SDK as part of the | ||
| `ActivityWorkItem`. | ||
|
|
||
| ### Purpose and Intent | ||
|
|
||
| 1. **Response Correlation**: The sidecar uses the `completion_token` to reliably match an `ActivityResponse` | ||
| (from `CompleteActivityTask`) to the original task it dispatched. | ||
| 2. **Stateless Tracking**: It allows the sidecar to remain stateless or minimize state lookups when receiving a | ||
| completion, as the token contains (or points to) the necessary context (instance ID, task ID, etc.). | ||
| 3. **Zombie Prevention**: If an activity times out and is retried, the new attempt will have a different | ||
| `completion_token`. If the original "zombie" worker eventually responds with the old token, the sidecar can easily | ||
| identify and ignore the late response. | ||
|
|
||
| ### SDK Implementation Guidelines | ||
|
|
||
| * **Capture**: The SDK MUST capture the `completion_token` from the incoming `ActivityWorkItem`. | ||
| * **Propagation**: The SDK MUST include the exact same `completion_token` in the `ActivityResponse` sent via | ||
| `CompleteActivityTask`. | ||
| * **Opaqueness**: The SDK MUST treat the token as a black box. It should not attempt to parse, modify, or construct | ||
| its own tokens. | ||
| * **Storage**: While the activity is executing, the SDK MUST keep this token in memory (e.g., in the `ActivityContext`). | ||
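
The zombie-prevention behaviour can be sketched from the sidecar's point of view. This is an assumption-laden toy: the real runtime may encode context inside the token rather than keep a lookup table, and the `PendingCompletions` class and its method names are invented for illustration. Each dispatch issues a fresh token, and only the most recent token for a task is accepted.

```python
import secrets
from typing import Dict, Tuple


class PendingCompletions:
    """Tracks the latest completion token issued for each (instance, task)."""

    def __init__(self) -> None:
        self._latest: Dict[Tuple[str, int], str] = {}

    def dispatch(self, instance_id: str, task_id: int) -> str:
        # A new token replaces the one from any earlier attempt.
        token = secrets.token_hex(16)
        self._latest[(instance_id, task_id)] = token
        return token

    def accept(self, instance_id: str, task_id: int, token: str) -> bool:
        """Return True only for the current token; stale or repeated tokens are ignored."""
        key = (instance_id, task_id)
        if self._latest.get(key) != token:
            return False
        del self._latest[key]
        return True
```

A late response from a timed-out worker carries the superseded token, so `accept` returns `False` and the result is dropped without affecting the retried attempt.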
|
|
||
|
|
||
| ## Task Activity IDs | ||
|
|
||
| In the Dapr runtime (specifically when using the Actors backend), activities are represented as actors. Each activity | ||
| execution has a unique **Task Activity ID** (also known as the Activity Actor ID). | ||
|
|
||
| The ID follows a specific pattern: | ||
| `{workflowInstanceID}::{taskID}::{generation}` | ||
|
|
||
| * **workflowInstanceID**: The unique ID of the workflow instance that scheduled the activity. | ||
| * **taskID**: The sequence number of the task within the workflow execution (e.g., 0, 1, 2...). | ||
| * **generation**: A counter that increments if the workflow is restarted or "continued as new". | ||
|
|
||
| This unique ID ensures that activity executions are isolated and can be tracked reliably across retries and restarts. | ||
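
For tooling that inspects these IDs, the pattern above can be composed and split with a small helper. The `TaskActivityId` type and `parse_task_activity_id` function are hypothetical names, and splitting from the right is a design choice made here so that an instance ID that itself contains `::` still parses.

```python
from typing import NamedTuple


class TaskActivityId(NamedTuple):
    workflow_instance_id: str
    task_id: int
    generation: int

    def __str__(self) -> str:
        return f"{self.workflow_instance_id}::{self.task_id}::{self.generation}"


def parse_task_activity_id(raw: str) -> TaskActivityId:
    """Split from the right so '::' inside the instance ID is preserved."""
    instance_id, task_id, generation = raw.rsplit("::", 2)
    return TaskActivityId(instance_id, int(task_id), int(generation))
```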
|
|
||
| ## Retries | ||
|
|
||
| Dapr handles activity retries based on the policy defined in the orchestration (if the SDK supports defining retry | ||
| policies in the `ScheduleTask` action). If an activity fails and a retry policy is in place, the engine will re-enqueue | ||
| the activity task after the specified delay. | ||
|
|
||
| From the activity worker's perspective, a retry is simply a new `ActivityWorkItem` with the same name and input, but | ||
| potentially a different `task_id` (or the same, depending on the backend implementation). | ||
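
How the "specified delay" is derived depends on the retry policy the SDK exposes, which this page does not define. The sketch below assumes a policy with a maximum attempt count, a first delay, a backoff coefficient, and a delay cap; all four field names and the `next_retry_delay` helper are illustrative assumptions, not the Dapr contract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryPolicy:  # hypothetical policy shape
    max_attempts: int
    first_delay_s: float
    backoff_coefficient: float = 1.0
    max_delay_s: float = 60.0


def next_retry_delay(policy: RetryPolicy, failed_attempts: int) -> Optional[float]:
    """Delay before the next attempt, or None once attempts are exhausted.

    `failed_attempts` counts attempts that have already failed (1 or more).
    """
    if failed_attempts >= policy.max_attempts:
        return None
    delay = policy.first_delay_s * policy.backoff_coefficient ** (failed_attempts - 1)
    return min(delay, policy.max_delay_s)
```

An engine following this sketch would re-enqueue the activity after the returned delay and record a terminal failure when the function returns `None`.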
|
|
||
| ## Idempotency | ||
|
|
||
| Because activities might be executed more than once (e.g., if the worker crashes after execution but before reporting | ||
| completion), it is recommended that activity logic be idempotent where possible. | ||
|
|
||
| ## Comparison with Workflows | ||
|
|
||
| | Feature | Orchestration | Activity | | ||
| | :--- | :--- | :--- | | ||
| | **Execution Style** | Replay-based (Deterministic) | Direct execution | | ||
| | **State** | Managed via History Events | No internal workflow state | | ||
| | **Side Effects** | Forbidden (must use activities) | Allowed (IO, Database, etc.) | | ||
| | **Lifetime** | Can be long-running (days/months) | Usually short-lived | | ||
| | **Connectivity** | Connected via `GetWorkItems` | Connected via `GetWorkItems` | | ||
110 changes: 110 additions & 0 deletions
...ibuting/protocol-reference/workflow-protocol/workflow-protocol-execution-api.md
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,110 @@ | ||||||
| --- | ||||||
|
||||||
| --- | |
| --- |