Skip to content

CIVIMM-480: Add webhook stuck recovery with attempt tracking#16

Merged
erawat merged 1 commit intomasterfrom
CIVIMM-480-webhook-stuck-recovery
Mar 10, 2026
Merged

CIVIMM-480: Add webhook stuck recovery with attempt tracking#16
erawat merged 1 commit intomasterfrom
CIVIMM-480-webhook-stuck-recovery

Conversation

@erawat
Copy link
Copy Markdown
Member

@erawat erawat commented Mar 9, 2026

Overview

When a worker crashes (PHP fatal, OOM, server death) during webhook processing, webhooks get stuck in 'processing' status forever with no rollback. This adds stuck recovery with attempt tracking so webhooks are automatically retried (up to 3 attempts) or permanently failed.

Before

  • Stuck webhooks reset to 'new' after 30 minutes with no attempt tracking
  • Could retry infinitely
  • No re-queuing after reset (webhooks stayed in 'new' but never re-entered the queue)

After

  • Stuck detection timeout: 1 day (configurable)
  • Each recovery increments attempt count
  • If attempts < 3: reset to 'new' and re-queued for processing
  • If attempts >= 3: marked as 'permanent_error' (stops retrying)
  • Batch SQL updates (1 SELECT + at most 2 UPDATEs) for performance

Technical Details

  • BAO (SRP): resetStuckWebhooks() split into data-access-only getStuckWebhooks() + batch update methods (batchResetStuckToNew, batchMarkStuckAsPermanentError)
  • Service (orchestration): New recoverStuckWebhooks() in WebhookQueueRunnerService handles decision logic (retry vs permanent error) and re-queuing
  • Optimistic locking: Batch updates include AND status = 'processing' guard to prevent race conditions
  • Tests: 6 unit tests covering query logic, attempt incrementing, permanent error, optimistic locking, and default timeout

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where payment webhooks could become permanently stuck in a 'processing' state if a worker crashed during their execution. The changes introduce a comprehensive recovery system that automatically detects these stuck webhooks, tracks their retry attempts, and either resets them for reprocessing or marks them as permanently failed after a defined number of retries. This significantly improves the reliability and resilience of webhook processing.

Highlights

  • Stuck Webhook Recovery: Implemented a robust recovery mechanism for webhooks that get stuck in 'processing' status due to worker crashes, preventing them from being stuck indefinitely.
  • Attempt Tracking: Introduced attempt tracking for stuck webhooks, allowing them to be retried a configurable number of times (default 3 attempts) before being marked as a permanent error.
  • Configurable Timeout: Changed the default stuck detection timeout from 30 minutes to 1 day, making the system more resilient to temporary processing delays.
  • Performance Optimization: Refactored the recovery process to use batch SQL updates (1 SELECT + at most 2 UPDATEs) for improved performance when handling multiple stuck webhooks.
  • Optimistic Locking: Added optimistic locking to batch update queries to prevent race conditions, ensuring that only webhooks still in 'processing' status are affected by the recovery.
Changelog
  • CRM/Paymentprocessingcore/BAO/PaymentWebhook.php
    • Updated documentation for findByEventId to reference the new recovery service.
    • Replaced the monolithic resetStuckWebhooks method with three distinct, single-responsibility methods: getStuckWebhooks (to retrieve stuck records), batchResetStuckToNew (to reset status and increment attempts), and batchMarkStuckAsPermanentError (to mark as failed).
    • Introduced DEFAULT_STUCK_TIMEOUT_MINUTES constant, setting the default stuck timeout to 1 day (1440 minutes).
    • Implemented optimistic locking in batch update queries (AND status = 'processing') to prevent race conditions.
  • Civi/Paymentprocessingcore/Service/WebhookQueueRunnerService.php
    • Removed direct calls to the old CRM_Paymentprocessingcore_BAO_PaymentWebhook::resetStuckWebhooks() method.
    • Introduced a new private method recoverStuckWebhooks() to centralize the logic for identifying, categorizing (retryable vs. permanent error), and updating stuck webhooks.
    • Updated runAllQueues() and runQueue() to invoke the new recoverStuckWebhooks() method.
    • The recoverStuckWebhooks() method now orchestrates the use of the new BAO batch update methods and re-queues retryable webhooks.
  • api/v3/PaymentWebhookRunner/Run.php
    • Updated the API documentation to reflect the new stuck webhook recovery timeout (1 day) and the addition of attempt tracking.
  • tests/phpunit/CRM/Paymentprocessingcore/BAO/PaymentWebhookTest.php
    • Renamed testResetStuckWebhooksUsesProcessingStartedAt to testGetStuckWebhooksFindsCorrectRecords and adapted it to test the new getStuckWebhooks method.
    • Added testGetStuckWebhooksIgnoresNullProcessingStartedAt to verify webhooks with NULL processing_started_at are not considered stuck.
    • Added testGetStuckWebhooksDefaultTimeoutIsOneDay to confirm the default 1-day timeout.
    • Added testBatchResetStuckToNewIncrementsAttempts to test the new batch reset functionality, including attempt incrementing.
    • Added testBatchMarkStuckAsPermanentError to test marking webhooks as permanently failed.
    • Added testBatchResetStuckToNewOnlyUpdatesProcessingStatus to verify optimistic locking.
Activity
  • No human activity has been recorded on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust mechanism for recovering stuck webhooks, complete with attempt tracking and a distinction between retriable and permanent errors. The refactoring of the PaymentWebhook BAO to separate data access concerns is a great improvement for maintainability and follows the Single Responsibility Principle. The new unit tests are comprehensive and cover the new logic and edge cases effectively.

I have one suggestion regarding a potential performance issue where the stuck webhook recovery logic may be executed multiple times unnecessarily during a single run. Overall, this is a solid enhancement to the webhook processing reliability.

@erawat erawat force-pushed the CIVIMM-480-webhook-stuck-recovery branch 2 times, most recently from 2b0c00a to 34cd504 Compare March 9, 2026 22:57
@erawat
Copy link
Copy Markdown
Member Author

erawat commented Mar 9, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust mechanism for handling stuck webhooks, which is a significant improvement for reliability. The introduction of attempt tracking and a permanent error state prevents infinite loops and provides better visibility into failing webhooks. The code is well-structured, with a clear separation of concerns between the BAO layer for data access and the service layer for business logic. The addition of comprehensive unit tests ensures the new functionality is well-covered. I have a few minor suggestions to further improve code quality and adherence to repository guidelines.

Comment on lines +427 to +439
return \Civi\Api4\PaymentWebhook::get(FALSE)
->addSelect('id', 'processor_type')
->addWhere('processor_type', '=', $processorType)
->addWhere('status', '=', 'new')
->addWhere('attempts', '>', 0)
->addWhere('next_retry_at', 'IS NULL')
->addWhere('processing_started_at', 'IS NOT NULL')
->setLimit($limit)
->execute()
->getArrayCopy();
}

if (empty($webhookIds)) {
/**
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the getStuckWebhooks method, the repository guidelines prefer a manual loop with explicit type checks over fluent methods like getArrayCopy() or iterator_to_array() for processing CiviCRM Api4 results. This approach ensures compliance with strict static analysis (like PHPStan level 9) for typed arrays and maintains consistency with established codebase patterns. Please refactor this to use a foreach loop to allow for explicit type handling.

    $webhooks = [];
    foreach (\Civi\Api4\PaymentWebhook::get(FALSE)
        ->addSelect('id', 'processor_type')
        ->addWhere('processor_type', '=', $processorType)
        ->addWhere('status', '=', 'new')
        ->addWhere('attempts', '>', 0)
        ->addWhere('next_retry_at', 'IS NULL')
        ->addWhere('processing_started_at', 'IS NOT NULL')
        ->setLimit($limit)
        ->execute() as $webhook) {
      $webhooks[] = $webhook;
    }
    return $webhooks;
References
  1. When processing CiviCRM Api4 results, prefer a manual loop with explicit type checks over fluent methods like getArrayCopy() or indexBy(). This approach ensures compliance with strict static analysis (like PHPStan level 9) for typed arrays and maintains consistency with established codebase patterns.

@erawat erawat force-pushed the CIVIMM-480-webhook-stuck-recovery branch from 34cd504 to 18ab40e Compare March 9, 2026 23:13
@erawat erawat force-pushed the CIVIMM-480-webhook-stuck-recovery branch from 18ab40e to 113bafd Compare March 10, 2026 08:17
@erawat erawat merged commit e382172 into master Mar 10, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants