
D2D trigger capability errors #20973

Draft

ettec wants to merge 6 commits into develop from don2don-trigger-capability-errors

Conversation

@ettec
Collaborator

@ettec ettec commented Jan 30, 2026

Change overview:

This change supports the reporting of capability errors when a remote trigger is registered across the DON2DON (D2D) layer. No other changes required to handle the errors produced by trigger registration are included in this PR.

Brief overview of the current situation (before this change) for context:

On RegisterTrigger, the TriggerSubscriber (TS) adds the registration to a list. This list is polled at regular intervals, and for each registration in the list a TriggerRegistration request is sent to the TriggerPublisher (TP). The TS does not wait for any response from the TP; it assumes the registration was successful, so there is currently no way to report registration errors to the workflow DON (where the TS sits).

On receipt of a TriggerRegistration request, the TriggerPublisher will attempt to register to the underlying trigger (if it hasn't already); if an error occurs it is logged, but no indication of the error is sent back to the TS. Note, the TS must send regular TriggerRegistration requests to the TP to keep the registration alive; failure to do so will result in the TP considering the registration dead and unregistering from the underlying trigger.

Summary of the change in this PR

The change is subject to the following constraints:

  • Must be backwards compatible - that is, a DON with this change must continue to work when communicating with a DON without this change.

  • The D2D trigger communication mechanism has been proven in production, so we want to avoid anything that fundamentally changes its behaviour.

This change introduces the concept of a TriggerRegistrationStatus message (it actually sort of existed already but was unused), which is sent by the TP to the TS whenever the TP receives a TriggerRegistrationRequest from the TS. The TriggerRegistrationStatus message includes the latest state of the registration on the underlying trigger, including any registration error if one occurred. As before, if the trigger is not currently successfully registered, receipt of the TriggerRegistrationRequest will cause the TP to attempt a new registration. Also as before, continued receipt of TriggerRegistrationRequests by the TP is necessary to keep a registration live.

On initial registration the TS can now (optionally) wait for TriggerRegistrationStatus messages from the TP; these can be used to determine whether the initial registration attempt was successful. Note, even though the TS will continue to send TriggerRegistrationRequests as part of polling, it does not (currently) have any code that deals with updates to the trigger registration status beyond the initial registration. This could be addressed in a future change; for now we are focused on getting the status back on initial subscription so we can identify most errors, and in particular user errors.

Note, if the TS does not receive sufficient TriggerRegistrationStatus messages during the initial trigger registration phase, it will return the events channel associated with the subscription along with an error indicating that the registration status could not be determined. Essentially this allows the code to behave in the current (pre-change) way if running against an old DON, or if for whatever other reason the status messages are not received.

It is expected that in most, if not all, cases the initial trigger registration status will be returned, so this will be a significant improvement over the current behaviour. At some future point, perhaps after all DONs are migrated, it would be possible to make the wait for initial trigger registration status mandatory; TBD.

Testing:

In addition to new system tests, manual testing was carried out on staging zone b to ensure there are no compatibility issues when running DONs with this change against DONs without it, and to confirm that user registration errors are reported as expected; results are summarised here:

https://docs.google.com/spreadsheets/d/1-SP9lUCmjRQrY-xiufwioUbf5dnvz6zIgoZB-ZEi9GQ/edit?usp=sharing

Deployment:

Testing confirms that the change can be deployed in any order (i.e. workflow node followed by cap node, or vice versa) with no adverse impact.

Once both DONs are updated, the behaviour can be enabled by updating a setting. Here is an example of a CLD PR that updates the setting (in this case to disable the behaviour); to enable the behaviour, change the setting to '10s':

https://github.com/smartcontractkit/chainlink-deployments/pull/12204

@github-actions
Contributor

github-actions Bot commented Jan 30, 2026

I see you updated files related to core. Please run make gocs in the root directory to add a changeset, and include at least one of the following tags in the text:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

@trunk-io

trunk-io Bot commented Jan 30, 2026


@github-actions
Contributor

github-actions Bot commented Feb 4, 2026

CORA - Pending Reviewers

| Codeowners Entry | Overall | Num Files | Owners |
| --- | --- | --- | --- |
| * | 💬 | 5 | @smartcontractkit/foundations, @smartcontractkit/core |
| /core/capabilities/ | 💬 | 22 | @smartcontractkit/keystone, @smartcontractkit/capabilities-team |
| /core/services/workflows | 💬 | 3 | @smartcontractkit/keystone |
| /core/chainlink.Dockerfile | 💬 | 1 | @smartcontractkit/devex-cicd, @smartcontractkit/foundations, @smartcontractkit/core |

Legend: ✅ Approved | ❌ Changes Requested | 💬 Commented | 🚫 Dismissed | ⏳ Pending | ❓ Unknown

For more details, see the full review summary.

@@ -444,9 +445,14 @@ func (e *Engine) runTriggerSubscriptionPhase(ctx context.Context) error {
// no Config needed - NoDAG uses Payload
})
if regErr != nil {
Collaborator Author

@ettec ettec Feb 5, 2026


Note - now that the remote trigger can return an error, the workflow will fail to start when this happens. In the legacy code, since the error from a remote trigger was not reported, the TS would continue to poll the publisher with registration requests, so in theory an inactive trigger could eventually become active should the error be caused by a transient issue.

If we wanted to preserve that behaviour we could differentiate between error types that are considered recoverable/non-recoverable and only fail the workflow in the non-recoverable case (for example, we could consider user errors non-recoverable and system errors recoverable, or even introduce a separate attribute for recoverability on the capability error). For this to work the engine would need to know whether the trigger capability is being run locally or remotely.

However, we already have workflow retry behaviour, so IMO it's acceptable to fail the workflow when a remote trigger returns an error on initial registration and rely on workflow retry to handle transient errors. This makes it consistent with local triggers.

Comment thread core/services/workflows/syncer/v2/grpc_workflow_source.go Outdated
@ettec ettec added the build-publish Build and Publish image to SDLC label Feb 11, 2026
Comment thread core/capabilities/remote/trigger/registration/publisher_registration.go Outdated
Comment thread core/capabilities/remote/trigger_subscriber.go
Comment thread core/capabilities/remote/trigger/registration/subscriber_registration.go Outdated
Comment thread core/capabilities/remote/trigger/registration/publisher_registration.go Outdated
Comment thread core/capabilities/remote/trigger_publisher.go Outdated
Comment thread core/capabilities/remote/trigger/registration/publisher_registration.go Outdated
@github-actions
Contributor

github-actions Bot commented Feb 23, 2026

✅ No conflicts with other open PRs targeting develop

"unicode"
)

// TODO this is duplicated from utils to minimise the PR changeset - will be deduped later
Collaborator Author


Going to address this in a follow-on commit

ettec added 2 commits March 30, 2026 14:21
pickup fix to local trigger capability error reporting
Comment thread core/services/workflows/v2/engine.go Outdated
Comment thread core/services/workflows/v2/engine.go Outdated
Comment thread core/services/workflows/v2/engine.go
}
} else {
e.logger().Errorw("Trigger registration failed", "triggerID", sub.Id, "err", regErr)
e.metrics.With(platform.KeyTriggerID, sub.Id).IncrementRegisterTriggerFailureCounter(gCtx)
Contributor


why are you omitting the platform.KeyCapabilityErrorCode here?
Also, for the capability_executor we used the error type caperrors.Internal.String(), which will help us later in figuring out unclassified errors from the capability actions; it should be the same for failed registrations

Collaborator Author


This relates to the change in the trigger subscriber (previous comment). In the scenario where consensus is not reached on the errors, a capability error with the consensus_failed error code is returned to indicate that this has occurred; the subscription_registration logs the underlying errors so that the underlying cause can be determined, i.e. why we are getting errors on which consensus cannot be reached.

Comment thread system-tests/tests/smoke/cre/v2_evm_capability_test.go Outdated
}
} else if msg.Method == types.TriggerRegistrationStatus {
meta := msg.GetTriggerRegistrationMetadata()
if meta == nil {
Contributor


could this ever happen?

Collaborator Author


Honestly I can't see how, but the same check is in place for types.MethodTriggerEvent; assuming this may have actually happened, I'm erring on the side of caution.

Comment thread core/capabilities/remote/trigger/registration/subscriber_registration.go Outdated
s.mu.Unlock()

if !found {
s.lggr.Warnw("received trigger registration status message for unregistered workflow", "workflowID", SanitizeLogString(meta.WorkflowId), "triggerID", SanitizeLogString(meta.TriggerId), "sender", sender)
Contributor


what's the point in calling SanitizeLogString for WF ID or Trigger ID?

Collaborator Author

@ettec ettec Mar 31, 2026


The sending node (of the trigger registration status msg) could put any old nonsense (including a very long string) in this field, so this protects against that; it's a common pattern in this code.
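A minimal sketch of the protection being described; the real SanitizeLogString lives elsewhere in the repo, so this is only an illustrative approximation (the 256-rune cap and the '*' placeholder are assumptions, not the real values):

```go
package main

import (
	"fmt"
	"unicode"
)

// sanitizeLogString approximates the idea behind SanitizeLogString: a
// remote peer controls these fields, so replace non-printable runes and
// cap the length before logging, preventing log injection and oversized
// log lines. Cap and placeholder are illustrative assumptions.
func sanitizeLogString(s string) string {
	const maxLen = 256
	out := make([]rune, 0, len(s))
	for _, r := range s {
		if unicode.IsPrint(r) {
			out = append(out, r)
		} else {
			out = append(out, '*') // neutralise control characters, newlines, etc.
		}
		if len(out) == maxLen {
			break // cap attacker-controlled length
		}
	}
	return string(out)
}

func main() {
	fmt.Println(sanitizeLogString("workflow-123")) // printable input passes through
	fmt.Println(sanitizeLogString("a\nb"))         // embedded newline neutralised
}
```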

reg.HandleTriggerRegistrationStatusUpdate(sender, msg, cfg.remoteConfig.MinResponsesToAggregate, cfg.remoteConfig.MessageExpiry,
len(cfg.capDonInfo.Members))
} else {
s.lggr.Errorw("received trigger event with unknown method", "method", SanitizeLogString(msg.Method), "sender", sender, "err", SanitizeLogString(msg.ErrorMsg))
Contributor


what's the point in calling SanitizeLogString for method?
(it does make sense for ErrorMsg tho)

Collaborator Author

@ettec ettec Mar 31, 2026


The sending node (of the trigger registration status msg) could put any old nonsense (including a very long string) in this field, so this protects against that; it's a common pattern in this code.

if totalErrorCount >= publisherNodeCount-int(minResponseToAggregate)+1 {
// There is no consensus error, return a generic error message
sr.setRegistrationStatus(types.RegistrationStatus_REGISTRATION_ERROR,
caperrors.NewPublicSystemError(fmt.Errorf("received %d errors, last error %s : %s", totalErrorCount, msg.Error, log.SanitizeLogString(lastErr)), caperrors.Unknown))
Contributor


wouldn't this make all errors that are somehow a bit different become public system errors?
If so, shouldn't you try to wrap the current msg.Error using DeserializeErrorFromString() in a generic error too?

Just thinking on what I did on this PR https://github.com/smartcontractkit/chainlink/pull/21746/changes with the function newRemoteCapabilityExecuteErrorWithMessage

Collaborator Author

@ettec ettec Mar 31, 2026


I think in this scenario, as we are unable to get consensus on the errors, the correct behaviour is to return a public system capability error with the 'ConsensusFailed' error code and log the errors. I've updated the code to do this. (Note, public is acceptable: the cap DON side will have obfuscated any private errors before sending them across.)
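For context, the threshold in the quoted snippet can be checked with a small worked example (the function name is mine, extracted for illustration):

```go
package main

import "fmt"

// consensusImpossible mirrors the check in the quoted snippet: once
// totalErrors >= publisherNodes - minToAggregate + 1, fewer than
// minToAggregate non-error responses can still arrive, so consensus on a
// successful registration can no longer be reached and the subscriber
// should give up and report a consensus-failure error.
func consensusImpossible(totalErrors, publisherNodes, minToAggregate int) bool {
	return totalErrors >= publisherNodes-minToAggregate+1
}

func main() {
	// 4 publisher nodes, 3 matching responses needed to aggregate:
	// after 2 errors only 2 non-error responses remain, fewer than the
	// 3 required, so consensus is impossible.
	fmt.Println(consensusImpossible(2, 4, 3)) // true
	// With only 1 error, 3 non-error responses could still arrive.
	fmt.Println(consensusImpossible(1, 4, 3)) // false
}
```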


Contributor

@bolekk bolekk left a comment


lgtm

triggerResponseCh chan commoncap.TriggerResponse
rawRequest []byte

registrationStatusUpdateCache *messagecache.MessageCache[string, p2ptypes.PeerID]
Contributor


Don't you need a mutex for this?

func (pr *PublisherRegistration) AddRegistrationRequest(ctx context.Context, sender p2ptypes.PeerID, rawRequest []byte,
callerDon commoncap.DON, registrationExpiry time.Duration) {
nowMs := time.Now().UnixMilli()
// First remote any expired messages from the cache, the cache is not expected to be large so this is acceptable
Contributor


typo "remote" -> "remove"


Labels

build-publish Build and Publish image to SDLC


3 participants