Classify TaskCancelledException as client error instead of system error #5273

Draft
ahkcs wants to merge 4 commits into opensearch-project:main from ahkcs:fix/task-cancelled-error-classification

Conversation

ahkcs (Collaborator) commented Mar 26, 2026

Summary

When a user cancels a PPL query via the _tasks/_cancel API, the TaskCancelledException is wrapped by Calcite (RuntimeException → SQLException → TaskCancelledException) and reaches the REST error handler as a generic exception. This results in:

  • HTTP 500 instead of 400
  • Incrementing PPL_FAILED_REQ_COUNT_SYS (system error metric) instead of PPL_FAILED_REQ_COUNT_CUS (client error metric)

Since query cancellation is user-initiated, it should not be classified as a system error.

Changes

  • Add hasCauseOf() helper to walk the exception cause chain
  • Add TaskCancelledException to isClientError() via cause chain check
  • Add unit tests for wrapped and direct TaskCancelledException
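
The cause-chain check listed above can be sketched as follows. This is a minimal illustration, not the code in this PR: `ErrorClassifierSketch`, the `MAX_DEPTH` cap, and the local `TaskCancelledException` stand-in are assumptions made so the example compiles without OpenSearch on the classpath, and `IllegalArgumentException` is only a placeholder for whatever checks `isClientError()` already performs.

```java
import java.sql.SQLException;

// Hypothetical stand-in for OpenSearch's TaskCancelledException, defined here
// only so the sketch compiles without OpenSearch on the classpath.
class TaskCancelledException extends RuntimeException {
    TaskCancelledException(String message) {
        super(message);
    }
}

class ErrorClassifierSketch {
    // Assumed safety cap so a cyclic cause chain cannot loop forever.
    private static final int MAX_DEPTH = 32;

    // Returns true if `t` or any exception in its cause chain is an instance of `type`.
    static boolean hasCauseOf(Throwable t, Class<? extends Throwable> type) {
        Throwable current = t;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            if (type.isInstance(current)) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    // IllegalArgumentException is a placeholder for the checks the real method already performs.
    static boolean isClientError(Exception e) {
        return e instanceof IllegalArgumentException
                || hasCauseOf(e, TaskCancelledException.class);
    }

    public static void main(String[] args) {
        // Mirrors the wrapping described above:
        // RuntimeException -> SQLException -> TaskCancelledException.
        Exception wrapped = new RuntimeException(
                new SQLException(new TaskCancelledException("The task is cancelled.")));
        System.out.println(isClientError(wrapped));                      // true
        System.out.println(isClientError(new RuntimeException("boom"))); // false
    }
}
```

Checking the whole chain rather than only the top-level exception is what lets the wrapped case be recognized, since the outermost type is a plain `RuntimeException`.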

Related

Test plan

  • Unit tests: wrapped TaskCancelledException classified as client error
  • Unit tests: direct TaskCancelledException classified as client error
  • Unit tests: generic RuntimeException still classified as system error

OpenSearch Dashboards (OSD) side log:

server    log   [19:02:02.468] [error][plugins][queryEnhancements] pplSearchStrategy: {
  "error": {
    "reason": "Query cancelled",
    "details": "The task is cancelled.",
    "type": "TaskCancelledException"
  },
  "status": 400
}

When a user cancels a PPL query, the TaskCancelledException is wrapped
by Calcite (RuntimeException → SQLException → TaskCancelledException)
and reaches the REST error handler as a generic exception, resulting in
a 500 status code and incrementing the system error metric. Since query
cancellation is user-initiated, it should be classified as a client
error (400) to avoid inflating system failure metrics.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
github-actions (Contributor) commented

Failed to generate code suggestions for PR

ahkcs added the enhancement and error-experience labels and removed the enhancement label Mar 26, 2026

TaskCancelledException extends OpenSearchException, so
ErrorMessageFactory.unwrapCause() was finding it and creating an
OpenSearchErrorMessage with exception.status() (500), overriding the
400 status passed from the REST handler. Handle TaskCancelledException
before the OpenSearchException check to use the passed-in status for
both the HTTP response and JSON body.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
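
The ordering issue described in this commit message can be illustrated with a small sketch. Everything here is an assumption made for illustration: the stand-in exception classes, the simplified `ErrorMessage` holder, the `int` status (the real code uses a status type from OpenSearch core), and the method shapes in `ErrorMessageFactorySketch`. It is not the plugin's actual `ErrorMessageFactory`.

```java
// Hypothetical stand-ins for the OpenSearch core exception types.
class OpenSearchException extends RuntimeException {
    OpenSearchException(String message) {
        super(message);
    }

    // Default status inherited by subclasses (INTERNAL_SERVER_ERROR).
    int status() {
        return 500;
    }
}

class TaskCancelledException extends OpenSearchException {
    TaskCancelledException(String message) {
        super(message);
    }
}

// Simplified holder for the cause and the status that ends up in the response.
class ErrorMessage {
    final Throwable cause;
    final int status;

    ErrorMessage(Throwable cause, int status) {
        this.cause = cause;
        this.status = status;
    }
}

class ErrorMessageFactorySketch {
    // Returns the first OpenSearchException in the cause chain, or `t` if none is found.
    static Throwable unwrapCause(Throwable t) {
        Throwable current = t;
        for (int depth = 0; current != null && depth < 32; depth++) {
            if (current instanceof OpenSearchException) {
                return current;
            }
            current = current.getCause();
        }
        return t;
    }

    static ErrorMessage createErrorMessage(Throwable e, int status) {
        Throwable cause = unwrapCause(e);
        // Checked first: keep the status chosen by the REST handler (400).
        if (cause instanceof TaskCancelledException) {
            return new ErrorMessage(cause, status);
        }
        // Other OpenSearch exceptions keep reporting their own status.
        if (cause instanceof OpenSearchException) {
            return new ErrorMessage(cause, ((OpenSearchException) cause).status());
        }
        return new ErrorMessage(e, status);
    }
}
```

With the two `instanceof` checks swapped, the cancellation would match the general branch and report 500 again, which is the behavior this commit fixes.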

Signed-off-by: Kai Huang <ahkcs@amazon.com>

ErrorMessage.fetchReason() returns "Invalid Query" for all 400 status
codes. Now that TaskCancelledException returns 400, the reason should
be "Query cancelled" instead of "Invalid Query".

Signed-off-by: Kai Huang <ahkcs@amazon.com>
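
A rough sketch of the reason lookup after this change, under stated assumptions: `ReasonSketch`, the two-argument `fetchReason` signature, the local `TaskCancelledException` stand-in, and the non-400 placeholder string are all hypothetical. Only the "Invalid Query" and "Query cancelled" strings come from this PR.

```java
// Hypothetical stand-in for OpenSearch's TaskCancelledException.
class TaskCancelledException extends RuntimeException {
    TaskCancelledException(String message) {
        super(message);
    }
}

class ReasonSketch {
    // Picks the "reason" field of the error body from the cause and HTTP status.
    static String fetchReason(Throwable cause, int status) {
        // A cancellation is a 400 but not an invalid query, so it gets its own reason.
        if (cause instanceof TaskCancelledException) {
            return "Query cancelled";
        }
        // Placeholder text for the non-400 case; the real wording is not shown in this PR.
        return status == 400 ? "Invalid Query" : "Internal error (placeholder)";
    }

    public static void main(String[] args) {
        System.out.println(fetchReason(new TaskCancelledException("The task is cancelled."), 400));
        System.out.println(fetchReason(new IllegalArgumentException("bad syntax"), 400));
    }
}
```

The first call yields the reason shown in the OSD log above, while other 400 responses keep "Invalid Query".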

ahkcs added the enhancement label Mar 26, 2026
ahkcs (Collaborator, Author) commented Mar 26, 2026

Note: TaskCancelledException inherits INTERNAL_SERVER_ERROR (500) from OpenSearchException by default. Ideally, core would assign it a more appropriate status (e.g., 400) so that all plugins benefit. This PR is a plugin-side workaround to avoid inflating system error metrics for user-initiated cancellations. If core updates the classification in the future, this special-casing can be removed.

ahkcs marked this pull request as draft March 26, 2026 23:13

Labels

enhancement: New feature or request
error-experience: Issues related to how we handle failure cases in the plugin.
