Implement timer origin #1729

@acroca

We've introduced a new origin field to timers. See the proposal: dapr/proposals#104
This field records why a timer was created. So far we support four types:

  • CreateTimer - timers created explicitly
  • WaitForExternalEvent - timers created to track timeout of external events
  • ActivityRetry - timers to manage when an activity will be retried
  • ChildWorkflowRetry - timers to manage when a child workflow will be retried

This change is fairly straightforward except for a small detail around the waitForExternalEvent timers. Until now, some SDKs have not created a timer when no timeout is specified, but we now require that a timer always be created. To keep this new requirement backwards compatible, we need to make sure the absence of those timers is tolerated, because in-flight workflows created before this feature was added won't have them.

I prepared a prompt to give to a coding agent so it can implement this feature, hopefully in one shot:

# Prompt: Implement Timer Origins and Backwards-Compatible Optional Timers in Durable Task SDKs

## Reference implementation

The Go SDK (`durabletask-go`) has already shipped this feature. Use
`task/orchestrator.go` in that repo as the reference whenever the prose
below is ambiguous.

## Background

When a durable workflow creates a timer, the timer has, until now, had no
metadata about **why** it was created. A timer could back:

1. An explicit `CreateTimer(delay)` call.
2. A timeout attached to `WaitForExternalEvent(name, timeout)`.
3. The delay between attempts of an activity under a retry policy.
4. The delay between attempts of a child workflow under a retry policy.

The `origin` field (added to the `CreateTimerAction` and `TimerCreatedEvent`
protobuf messages) makes the cause of every timer explicit. This enables
observability, tracing retry chains across logs, and — as described later in
this document — lets the runtime distinguish a synthetic "wait indefinitely"
timer from every other kind of timer at replay time.

## Protobuf changes

### Pull in the latest protos

Before you change SDK code, update the `durabletask-protobuf` dependency (or
your SDK's vendored protos) to the current `main` branch and regenerate the
language-specific bindings. The feature depends on messages and fields that
only exist in that revision.

### What is new in the protos

Only two message families changed. In each case the field is a `oneof`
called `origin` that is **optional** on the wire (no origin set is a valid
decoded message — see the next section on backwards compatibility).

- **`CreateTimerAction`** (in `orchestrator_actions.proto`) — the action your
  SDK emits when it wants a timer scheduled. A new `origin` oneof with four
  variants was added alongside the existing `fireAt` and `name` fields.
- **`TimerCreatedEvent`** (in `history_events.proto`) — the history event
  your SDK consumes during replay to learn that a timer was previously
  scheduled. The same `origin` oneof was added here so the information
  round-trips through history.

### The four origin variants

Each variant is a distinct protobuf message so new fields can be added per
origin without breaking others. Three carry a payload; one is intentionally
empty:

| Origin message | Payload | Set when your SDK is emitting a timer because… |
|---|---|---|
| `TimerOriginCreateTimer` | none (empty marker message) | The workflow called the explicit `CreateTimer(delay)` API. |
| `TimerOriginExternalEvent` | `name` — event name being waited on | The workflow called `WaitForExternalEvent(name, timeout)` and the SDK needs a timeout timer (or a synthetic indefinite timer — see below). |
| `TimerOriginActivityRetry` | `taskExecutionId` — see stability rule below | The SDK is scheduling the delay between attempts of an activity governed by a retry policy. |
| `TimerOriginChildWorkflowRetry` | `instanceId` — see first-child rule below | The SDK is scheduling the delay between attempts of a child workflow governed by a retry policy. |

### Backwards compatibility of the origin field itself

`origin` is a `oneof` whose variants are all optional on the wire. Two
compatibility rules follow directly from that:

1. **Producer side**: when your SDK builds a `CreateTimerAction` for any
   reason listed in the table above, it **must** set one of the four origin
   variants. Emitting a `CreateTimerAction` with no origin is allowed by the
   wire format but is a bug — downstream consumers rely on origin being
   present for anything this SDK creates post-upgrade.
2. **Consumer side**: when your SDK replays a `TimerCreatedEvent` from
   history, the `origin` field may be unset. Histories produced before this
   feature shipped will have no origin on any of their timer events. Your
   replay code must tolerate `origin == nil/unset` and behave exactly as it
   did before this feature existed.
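The consumer-side rule can be sketched as a type switch that treats an unset origin as an ordinary case rather than an error. This is a minimal illustration: `TimerOrigin` and the four struct types below are hypothetical stand-ins for the generated protobuf bindings, and `describeOrigin` is an invented helper, not part of any SDK.

```go
package main

import "fmt"

// TimerOrigin is a hypothetical stand-in for the generated `origin` oneof.
// A nil TimerOrigin models "origin unset on the wire".
type TimerOrigin interface{ isTimerOrigin() }

type CreateTimerOrigin struct{}
type ExternalEventOrigin struct{ Name string }
type ActivityRetryOrigin struct{ TaskExecutionID string }
type ChildWorkflowRetryOrigin struct{ InstanceID string }

func (CreateTimerOrigin) isTimerOrigin()        {}
func (ExternalEventOrigin) isTimerOrigin()      {}
func (ActivityRetryOrigin) isTimerOrigin()      {}
func (ChildWorkflowRetryOrigin) isTimerOrigin() {}

// describeOrigin labels a replayed timer's origin for logging. An unset
// origin is a valid input and must not be treated as a failure.
func describeOrigin(o TimerOrigin) string {
	switch v := o.(type) {
	case nil:
		return "unset (pre-upgrade history)"
	case CreateTimerOrigin:
		return "create-timer"
	case ExternalEventOrigin:
		return "external-event:" + v.Name
	case ActivityRetryOrigin:
		return "activity-retry:" + v.TaskExecutionID
	case ChildWorkflowRetryOrigin:
		return "child-workflow-retry:" + v.InstanceID
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(describeOrigin(nil))
	fmt.Println(describeOrigin(ExternalEventOrigin{Name: "myEvent"}))
}
```

With real generated bindings the nil check would typically go through the oneof getter for `origin`; the exact accessor names depend on the language and protobuf plugin.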

## Behavior to implement

This is the full list of places in your SDK where you must set an origin on a
`CreateTimerAction`.

### 1. Explicit `CreateTimer(delay)` → `TimerOriginCreateTimer`

Every `CreateTimerAction` emitted by the `CreateTimer` API gets
`TimerOriginCreateTimer{}` (the empty marker). Never leave origin unset on a
new action.

### 2. `WaitForExternalEvent(name, timeout)` → `TimerOriginExternalEvent`

The timer that implements the timeout gets `TimerOriginExternalEvent` with
`name` set to the same event name the workflow is waiting on. This includes
the indefinite-timeout case described in the next section; indefinite waits
are also `TimerOriginExternalEvent`, just with a sentinel `fireAt`.

### 3. Activity retry delay → `TimerOriginActivityRetry`

When a `CallActivity(...)` call is governed by a retry policy and the SDK
needs to wait `nextDelay` before attempting the activity again, the retry
delay timer gets `TimerOriginActivityRetry{ taskExecutionId }`.

**Stability rule**: `taskExecutionId` identifies the **logical activity
call**, not an individual attempt. Generate it once (e.g., a UUID) when the
retry loop starts and reuse the same value for every retry timer produced by
that loop. Every attempt's scheduled `TaskScheduledEvent` must also carry
this same `taskExecutionId`. It's the stable key that lets external tools
stitch a retry chain back together.
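A rough sketch of the stability rule, assuming every attempt fails: the id is generated once before the loop and copied onto each retry timer. `RetryTimer`, `newExecutionID`, and `planRetryTimers` are hypothetical names, and the random hex id is only a stand-in for whatever UUID source your SDK already uses. The sketch does not model replay.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// RetryTimer is a hypothetical, simplified view of a retry-delay timer.
type RetryTimer struct {
	Attempt         int    // the attempt that just failed
	TaskExecutionID string // identifies the logical activity call
}

// newExecutionID returns a random 128-bit id rendered as hex (assumption:
// any unique string is acceptable for the purposes of this sketch).
func newExecutionID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// planRetryTimers simulates a retry loop in which every attempt fails.
// It returns the id generated for the call and one timer per retry delay.
func planRetryTimers(maxAttempts int) (string, []RetryTimer) {
	id := newExecutionID() // generated once, before the loop
	var timers []RetryTimer
	for attempt := 1; attempt < maxAttempts; attempt++ {
		// Reuse the same id on every retry timer; never regenerate it here.
		timers = append(timers, RetryTimer{Attempt: attempt, TaskExecutionID: id})
	}
	return id, timers
}

func main() {
	id, timers := planRetryTimers(3)
	for _, t := range timers {
		fmt.Println(t.Attempt, t.TaskExecutionID == id)
	}
}
```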

### 4. Child workflow retry delay → `TimerOriginChildWorkflowRetry`

When a `CallChildWorkflow(...)` call is governed by a retry policy and the
SDK needs to wait `nextDelay` before spawning a replacement child, the retry
delay timer gets `TimerOriginChildWorkflowRetry{ instanceId }`.

**First-child rule**: `instanceId` must always be the instance ID of the
**first** child scheduled by this call — not the instance ID of the child
that just failed, and not the instance ID about to be scheduled on the next
attempt.

- If the user supplied an explicit instance ID, use it.
- Otherwise compute a deterministic id once (for example from the parent's
  instance ID plus the sequence number at the start of the call) and reuse
  that value on every retry timer belonging to this call.

This gives external systems a single, stable pointer to the whole retry
chain even though subsequent attempts may use different generated instance
IDs.
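One way the first-child rule could look in code is shown below. The `parentID:sequence` format and both function names are assumptions made for illustration; the only property that matters is that the value is computed once per call and then reused.

```go
package main

import "fmt"

// firstChildInstanceID picks the id that every retry timer of one
// CallChildWorkflow call will carry. The derived format is hypothetical.
func firstChildInstanceID(userSupplied, parentID string, seq int) string {
	if userSupplied != "" {
		return userSupplied // an explicit id always wins
	}
	return fmt.Sprintf("%s:%04d", parentID, seq)
}

// retryTimerInstanceIDs returns the instanceId placed on each retry timer
// after the given number of failed attempts. It never varies per attempt.
func retryTimerInstanceIDs(firstID string, failedAttempts int) []string {
	ids := make([]string, 0, failedAttempts)
	for i := 0; i < failedAttempts; i++ {
		ids = append(ids, firstID)
	}
	return ids
}

func main() {
	first := firstChildInstanceID("", "parent-1", 3)
	fmt.Println(first, retryTimerInstanceIDs(first, 2))
}
```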

### Symmetry with the Go reference

In `durabletask-go`, origin assignment for all four cases lives in
`task/orchestrator.go`:

- `CreateTimer` path: `createTimerInternal` sets `TimerOriginCreateTimer`.
- `WaitForSingleEvent` path: `createExternalEventTimerInternal` sets
  `TimerOriginExternalEvent{ name }`.
- `CallActivity` retry path: `CallActivity` stores a freshly-generated UUID
  as `taskExecutionId`, passes it into `internalScheduleTaskWithRetries`,
  which sets `TimerOriginActivityRetry` on every retry delay timer using the
  same id.
- `CallChildWorkflow` retry path: `CallChildWorkflow` captures the first
  child's instance ID (user-provided or deterministically generated once)
  and sets `TimerOriginChildWorkflowRetry` on every retry timer.

Mirror these four assignment points in your SDK. The rest of this document
focuses on the one behavioral change that is easy to miss: the optional
timer for indefinite waits.

## Optional timers: `WaitForExternalEvent(name, timeout < 0)`

This is the trickiest part of the feature. Read it carefully.

### Why it exists

`WaitForExternalEvent(name, timeout)` has three input shapes. Your SDK must
behave as follows:

| Timeout value | Behavior |
|---|---|
| `timeout == 0` | Return an already-canceled task. **No `CreateTimerAction` is emitted.** |
| `timeout > 0`  | Emit a `CreateTimerAction` with `origin = TimerOriginExternalEvent{name}` and `fireAt = now + timeout`. |
| `timeout < 0` (indefinite) | Emit a **synthetic** `CreateTimerAction` with `origin = TimerOriginExternalEvent{name}` and `fireAt` set to the sentinel described below. The timer effectively never fires; it exists purely so the backend has a record that this instance is parked waiting on a named event. |

Call the `timeout < 0` timer the **optional timer**. It is optional in two
senses:

- **It never needs to fire.** The `fireAt` sentinel is so far in the future
  that it is functionally "never". The timer is only ever resolved by the
  matching external event arriving (or by the workflow terminating for some
  other reason). Consumers that match timers by `fireAt` should treat the
  sentinel as "never".
- **It may be absent from pre-upgrade histories.** Earlier releases of your
  SDK (and every other SDK, today) did not emit any timer for
  `timeout < 0`. A workflow that started on the old code path and is being
  replayed on the new code path will have a history that **lacks** the
  optional timer's `TimerCreatedEvent`. Your SDK must tolerate this on
  replay (full algorithm below).

### Required sentinel

Use this exact UTC value as the `fireAt` of every optional timer: `9999-12-31T23:59:59.999999999Z`

Choose a representation in your language that round-trips through
`google.protobuf.Timestamp` without drift (for example, in Go, the reference
implementation uses `time.Date(9999, 12, 31, 23, 59, 59, 999999999, time.UTC)`).
All SDKs and backends must recognize the **exact** value; a sentinel that is
even one nanosecond off will not be detected as optional.

### Recognition rules (do not relax these)

A pending `CreateTimerAction` is **optional** if and only if **all three**
hold:

1. The action is a `CreateTimer` action (not `ScheduleTask`, not
   `CreateChildWorkflow`).
2. Its `origin` is `TimerOriginExternalEvent` (any `name`).
3. Its `fireAt` equals the sentinel above.

A `TimerCreatedEvent` from history is **optional** if and only if its
`origin` is `TimerOriginExternalEvent` and its `fireAt` equals the sentinel.

**Never** match on `fireAt` alone. A workflow is allowed to call
`CreateTimer(farFuture)` with an arbitrary long delay; that produces a
`TimerOriginCreateTimer` timer that happens to fall in year 9999 and is
**not** an optional timer. Similarly, a `WaitForExternalEvent` with a finite
but very long timeout produces a `TimerOriginExternalEvent` timer whose
`fireAt` is **not** the sentinel and is therefore not optional.

### Replay tolerance algorithm

Today your SDK already maps each incoming history event at sequence id `N`
to the pending action it expects at id `N`, and raises a non-determinism
error when they do not match. Patch that matching step as follows.

When a **scheduling-side** history event arrives at sequence id `N` (that is:
`TaskScheduled`, `ChildWorkflowInstanceCreated`, or `TimerCreated`):

1. Look up the pending action at id `N`.
2. **Type mismatch** (e.g., incoming is `TaskScheduled`, pending is
   `CreateTimer`):
   - If the pending action is an optional timer (by the recognition rule
     above), drop it: remove it from the pending map, remove any pending
     task bound to id `N`, then shift every pending action and every pending
     task with id `> N` down by one (rewrite each action's `Id` field), and
     decrement your next-sequence-number counter by one. Retry the lookup at
     id `N` — the shifted-down action now occupies that slot and should
     match the incoming event through the normal path.
   - Otherwise, raise the existing non-determinism error. No change to that
     error path.
3. **Type match, `TimerCreated` only**: both sides are timers, but consider
   the asymmetric case where the pending action is an optional timer and
   the incoming `TimerCreated` is **not** optional (for example, pre-patch
   code emitted a normal `CreateTimer` at the slot your patched code wants
   to use for the optional timer). Apply the same drop-and-shift before the
   normal match. If both sides are optional (post-patch history replaying
   against post-patch code), do nothing special: the normal match path
   handles it and must not shift.

**Completion-side** handlers (`TaskCompleted`, `TimerFired`,
`ChildWorkflowInstanceCompleted`, etc.) do **not** need this shift logic.
Every completion is preceded in history by its corresponding scheduling
event, which is where the shift runs. By the time the completion arrives,
the pending map has already been realigned.
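The matching step above can be sketched roughly as follows. This is a simplified model, not the Go SDK's code: `action`, `replayState`, `dropOptionalAt`, and `matchScheduled` are hypothetical names, action kinds are plain strings, the optional flag is precomputed, and the parallel map of pending tasks is omitted (a real implementation must shift it the same way).

```go
package main

import "fmt"

// action is a hypothetical pending action; Optional is the result of the
// recognition rule (external-event origin plus sentinel fireAt).
type action struct {
	ID       int
	Kind     string // "CreateTimer", "ScheduleTask", "CreateChildWorkflow"
	Optional bool
}

type replayState struct {
	pending map[int]*action
	nextSeq int
}

// dropOptionalAt removes the action at id n, shifts every later action
// down by one, and decrements the sequence counter.
func (s *replayState) dropOptionalAt(n int) {
	delete(s.pending, n)
	shifted := make(map[int]*action, len(s.pending))
	for id, a := range s.pending {
		if id > n {
			a.ID = id - 1
		}
		shifted[a.ID] = a
	}
	s.pending = shifted
	s.nextSeq--
}

// matchScheduled handles a scheduling-side history event at id n.
// incomingOptional only matters when the incoming event is a TimerCreated.
func (s *replayState) matchScheduled(n int, kind string, incomingOptional bool) error {
	for {
		a, ok := s.pending[n]
		if !ok {
			return fmt.Errorf("non-determinism: no pending action at id %d", n)
		}
		// A pending optional timer has no counterpart in history when the
		// incoming event is another type or is a non-optional timer.
		if a.Optional && (a.Kind != kind || !incomingOptional) {
			s.dropOptionalAt(n)
			continue // retry the lookup at the same id
		}
		if a.Kind != kind {
			return fmt.Errorf("non-determinism: expected %s at id %d, got %s", a.Kind, n, kind)
		}
		delete(s.pending, n) // normal match
		return nil
	}
}

func main() {
	// Pre-patch history: the activity was scheduled at id 0, with no timer.
	s := &replayState{pending: map[int]*action{
		0: {ID: 0, Kind: "CreateTimer", Optional: true},
		1: {ID: 1, Kind: "ScheduleTask"},
	}, nextSeq: 2}
	fmt.Println(s.matchScheduled(0, "ScheduleTask", false), s.nextSeq)
}
```

Retrying the lookup in a loop means consecutive optional timers at the same slot are dropped one after another before the normal match runs.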

### What the Go SDK does

If it helps, here are the exact entry points in
`durabletask-go/task/orchestrator.go` that implement the above:

- `externalEventIndefiniteFireAt` — the sentinel constant.
- `isOptionalExternalEventTimerAction` / `isOptionalExternalEventTimerCreatedEvent` — the recognition predicates (encoding the three-part rule for actions and the two-part rule for events).
- `dropOptionalExternalEventTimerAt(atID)` — the drop-and-shift primitive: removes the optional pending action at `atID`, deletes any pending task at `atID`, shifts later pending ids down by one, and decrements the sequence-number counter.
- Call sites: `onTaskScheduled`, `onChildWorkflowScheduled`, and
  `onTimerCreated` (the last one containing the `TimerCreated`-specific
  asymmetric case).

Port this shape. Names will differ per language, but the three concepts
(sentinel, recognition predicate, drop-and-shift primitive) and the three
call sites should all appear.

### What happens if the optional timer leaks to history

If a workflow is mid-wait at replay time and no conflicting event is ever
processed during that slice of replay, the optional `CreateTimerAction` will
be emitted to the backend and written to history as a real
`TimerCreatedEvent` with `origin = TimerOriginExternalEvent` and
`fireAt = sentinel`. That is fine and expected. From then on the history is
self-consistent with post-patch code, later replays match through the normal
path, and the backend's timer queue simply holds an entry due at the
sentinel that never fires.

## Test cases to implement

Mirror every test below in your SDK's test framework. Tests 1–6 cover origin
assignment across the four API surfaces (including retry timers, which
matter just as much as external-event timers). Tests 7–13 cover the optional
timer rule and its replay-compatibility machinery.

### Origin assignment

#### Test 1 — `CreateTimer(delay)` sets `TimerOriginCreateTimer`

Orchestration calls `CreateTimer(delay)`. Assert the resulting
`CreateTimerAction` has `origin` set to `TimerOriginCreateTimer{}` (empty
marker).

#### Test 2 — finite-timeout `WaitForExternalEvent` sets `TimerOriginExternalEvent`

Orchestration calls `WaitForExternalEvent("myEvent", timeout > 0)`. Assert
the resulting `CreateTimerAction` has `origin` set to
`TimerOriginExternalEvent{ name: "myEvent" }` and `fireAt = now + timeout`.

#### Test 3 — activity retry timer sets `TimerOriginActivityRetry`

Orchestration calls `CallActivity(myActivity)` with a retry policy where
`MaxAttempts >= 2`. Activity fails on the first attempt. Assert the retry
delay `CreateTimerAction` has `origin = TimerOriginActivityRetry{ taskExecutionId: <id> }`
where `<id>` matches the `taskExecutionId` attached to the original
`ScheduleTaskAction` for the activity.

#### Test 4 — activity retry `taskExecutionId` is stable across attempts

Configure a retry policy with `MaxAttempts >= 3`. Activity fails on attempts
1 and 2, producing two retry delay timers. Assert both timers carry the
**same** `taskExecutionId` and that it equals the id on the original
`ScheduleTaskAction`.

#### Test 5 — child workflow retry timer sets `TimerOriginChildWorkflowRetry`

Orchestration calls `CallChildWorkflow(myWorkflow)` with a retry policy where
`MaxAttempts >= 2`. Child workflow fails on the first attempt. Assert the
retry delay `CreateTimerAction` has
`origin = TimerOriginChildWorkflowRetry{ instanceId: <firstChildId> }` where
`<firstChildId>` is the instance ID of the first child scheduled.

#### Test 6 — child workflow retry `instanceId` always points to first child

Configure a retry policy with `MaxAttempts >= 3`. Child fails on attempt 1;
timer fires; a second child with a **different** instance ID is spawned;
that child fails on attempt 2. Assert the second retry timer still has
`instanceId = firstChildId`, i.e., it does not track the per-attempt
instance ID.

### Optional timer — happy path

#### Test 7 — indefinite `WaitForExternalEvent` emits the sentinel optional timer

Orchestration calls `WaitForExternalEvent("myEvent", -1)` and blocks. Assert
the single resulting `CreateTimerAction`:

- has `origin = TimerOriginExternalEvent{ name: "myEvent" }`;
- has `fireAt` exactly equal to `9999-12-31T23:59:59.999999999Z`.

#### Test 8 — zero-timeout `WaitForExternalEvent` emits no timer

Orchestration calls `WaitForExternalEvent("myEvent", 0)` when no buffered
event of that name exists. Assert the returned task is already canceled and
that **no** `CreateTimerAction` is produced.

### Optional timer — replay compatibility

Each of these replays a crafted history. Use the same pattern your existing
replay tests use (split into `oldEvents` / `newEvents` or whatever your SDK
calls them).

#### Test 9 — post-patch replay matches the optional timer normally

Orchestration: `WaitForExternalEvent("myEvent", -1)` and then return.

- `oldEvents`: `WorkflowStarted`, `ExecutionStarted`, and a
  `TimerCreated(EventId=0, origin=ExternalEvent{name="myEvent"}, fireAt=sentinel)`.
- `newEvents`: `EventRaised("myEvent", payload)`.

Assert replay produces a single `CompleteWorkflow` action and no
shift-induced errors. This is a negative-control regression test: it guards
against a future change that over-shifts and breaks post-patch histories.

#### Test 10 — pre-patch replay, indefinite wait followed by `CallActivity`

Orchestration: `WaitForExternalEvent("myEvent", -1)` → `CallActivity("A")`
→ return activity result.

- `oldEvents` simulate a pre-patch history: `WorkflowStarted`,
  `ExecutionStarted`, `EventRaised("myEvent")`,
  `TaskScheduled(EventId=0, name="A")`. Note `EventId=0` for the activity:
  the old code path did not reserve a sequence number for the indefinite
  wait.
- `newEvents`: `TaskCompleted(TaskScheduledId=0, result)`.

Assert:

- replay succeeds;
- the final action is a single `CompleteWorkflow`;
- **no** `CreateTimerAction` leaks into the result — the optional timer
  must be dropped, not flushed to history.

#### Test 11 — pre-patch replay, indefinite wait followed by `CallChildWorkflow`

Orchestration: `WaitForExternalEvent("myEvent", -1)` → `CallChildWorkflow("Child")` → return child result.

- `oldEvents`: same shape as Test 10 but with
  `ChildWorkflowInstanceCreated(EventId=0, instanceId="child-1", name="Child")`
  in place of the `TaskScheduled`.
- `newEvents`: `ChildWorkflowInstanceCompleted(TaskScheduledId=0, result)`.

Assert replay completes cleanly. This exercises the **child-workflow
scheduling** branch of the shift, which is a different call site than
Test 10.

#### Test 12 — pre-patch replay, indefinite wait followed by a user `CreateTimer`

Orchestration: `WaitForExternalEvent("myEvent", -1)` → `CreateTimer(5s)` →
return event payload.

- `oldEvents`: pre-patch history with
  `TimerCreated(EventId=0, origin=CreateTimer, fireAt=startTime+5s)` — same
  event type as what the patched code would optionally emit, but a
  non-sentinel `fireAt` and a `CreateTimer` origin.
- `newEvents`: `TimerFired(TimerId=0)`.

Assert replay succeeds and produces a single `CompleteWorkflow`. This is the
asymmetric `TimerCreated`-specific branch: the pending action is an optional
timer, and the incoming `TimerCreated` event has the type a `CreateTimer`
action expects but is **not** optional, so the SDK must distinguish by
origin + `fireAt`-sentinel, not by action type.

#### Test 13 — pre-patch replay, two indefinite waits in sequence

Orchestration: `WaitForExternalEvent("A", -1)` → `CallActivity("ActA")` → `WaitForExternalEvent("B", -1)` → `CallActivity("ActB")` → return.

- `oldEvents`: pre-patch history where `ActA` is scheduled at `EventId=0`,
  `ActA` completes, then `ActB` is scheduled at `EventId=1`.
- `newEvents`: `TaskCompleted` for `ActB`.

Assert replay completes with a single `CompleteWorkflow` and no optional
timers leak into the result. Validates that the drop-and-shift primitive
composes correctly across multiple optional timers in the same replay.

## Key design principles

1. **Origins are metadata only.** They do not affect when a timer fires or
   any runtime scheduling decision. Setting the correct origin is about
   observability and the one backwards-compatibility rule.
2. **Always set an origin** on every `CreateTimerAction` your SDK emits —
   including plain `CreateTimer()` calls, which use the empty
   `TimerOriginCreateTimer{}`. Never leave origin unset on a new action.
3. **Origin is optional on read.** Tolerate `origin == nil` when consuming
   history events from older releases.
4. **Stable ids across retries.** `TimerOriginActivityRetry.taskExecutionId`
   and `TimerOriginChildWorkflowRetry.instanceId` identify the **logical
   operation**, not an individual attempt. Generate once, reuse across every
   retry timer produced by that logical operation.
5. **First-child rule for child retries.** The `instanceId` in
   `TimerOriginChildWorkflowRetry` always points to the first child
   scheduled by the call, not to the child about to be scheduled on the
   next attempt.
6. **Optional timers must round-trip through replay.** Recognize them via
   `origin = TimerOriginExternalEvent` + `fireAt = 9999-12-31T23:59:59.999999999Z`
   (both conditions, exactly). Drop-and-shift when a pre-patch history lacks
   the expected optional timer; match normally when a post-patch history
   contains it. No other kind of timer participates in this shift logic.
