Skip to content

Implemented automatic grpc loop with reconnection#87

Draft
m-olko wants to merge 8 commits intomainfrom
grpc_loop
Draft

Implemented automatic grpc loop with reconnection#87
m-olko wants to merge 8 commits intomainfrom
grpc_loop

Conversation

@m-olko
Copy link
Contributor

@m-olko m-olko commented Mar 17, 2026

This PR

  • adds this new feature

Related Issues

Fixes #1234523

Notes

Follow-up Tasks

How to test

Signed-off-by: Marcin Olko <molko@google.com>
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and reliability of the flagd provider by implementing an automatic gRPC stream reconnection mechanism. It introduces configurable parameters for managing connection retries and deadlines, ensuring that the provider can gracefully recover from network interruptions and maintain a continuous flag synchronization. The changes also refine the internal state management to accurately reflect the connection status, leading to a more stable and resilient flag evaluation system.

Highlights

  • Enhanced gRPC Connection Management: Introduced automatic reconnection logic for gRPC streams with exponential backoff, improving the provider's resilience to transient network issues.
  • Configurable gRPC Parameters: Added new configuration options for gRPC stream deadlines, retry backoff intervals, maximum backoff, grace periods, keep-alive times, and a list of fatal status codes.
  • Dynamic gRPC Service Configuration: Implemented a method to generate gRPC service configuration JSON dynamically, allowing for programmatic control over retry policies.
  • Improved State Handling: Expanded the GrpcSync state machine to include kReconnecting and kFatal states, providing clearer lifecycle management during connection disruptions.
Changelog
  • providers/flagd/src/BUILD
    • Added nlohmann_json as a dependency.
  • providers/flagd/src/configuration.cpp
    • Introduced new environment variables for gRPC stream deadline, retry backoff, max backoff, grace period, keep-alive time, and fatal status codes.
    • Defined default values for these new gRPC configuration parameters.
    • Added a kStatusCodeMap to map gRPC status code strings to integer values.
    • Implemented GetEnvLong helper function for parsing uint64_t environment variables.
    • Added ParseFatalStatusCodes function to parse a comma-separated string of fatal gRPC status codes.
    • Initialized new configuration members in the FlagdProviderConfig constructor.
    • Added getter methods for the new gRPC configuration parameters.
    • Implemented GetServiceConfigJson to generate gRPC service configuration JSON, including retry policies.
    • Added setter methods for the new gRPC configuration parameters.
  • providers/flagd/src/configuration.h
    • Declared new getter and setter methods for gRPC stream deadline, retry backoff, max backoff, grace period, keep-alive time, and fatal status codes.
    • Declared the GetServiceConfigJson method.
    • Added new private member variables to store the gRPC configuration.
  • providers/flagd/src/sync/grpc/grpc_sync.cpp
    • Modified the Init method to use wait_for with a timeout and handle kReconnecting and kFatal states.
    • Updated channel creation to use grpc::CreateCustomChannel with grpc::ChannelArguments for service configuration and keep-alive settings.
    • Added logging for state changes.
    • Refactored WaitForUpdates into an infinite loop for automatic reconnection.
    • Implemented exponential backoff logic for retries.
    • Added logic to check for fatal gRPC status codes and transition to kFatal state.
    • Introduced ClearFlags call on fatal errors or prolonged disconnection.
    • Updated Shutdown to include logging for state changes.
  • providers/flagd/src/sync/grpc/grpc_sync.h
    • Added kReconnecting and kFatal states to the State enum.
    • Introduced init_timed_out_ boolean member to track initialization timeouts.
  • providers/flagd/src/sync/sync.cpp
    • Implemented ClearFlags method to reset current flags and global metadata.
  • providers/flagd/src/sync/sync.h
    • Declared the ClearFlags method.
Activity
  • No human activity (comments, reviews, approvals) has been recorded on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements an automatic gRPC reconnection loop with exponential backoff, which is a significant improvement for the provider's resilience. The changes introduce new configuration options for controlling retry behavior, keep-alive, and fatal status codes. The core logic is in GrpcSync::WaitForUpdates, which has been rewritten to handle connection states and retries. My review includes suggestions to improve code clarity and robustness, such as using more specific exception handling, adopting more idiomatic C++ for JSON construction and collection searching, and ensuring safer integer arithmetic in the backoff calculation.

Signed-off-by: Marcin Olko <molko@google.com>
@m-olko
Copy link
Contributor Author

m-olko commented Mar 17, 2026

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an automatic gRPC loop with reconnection capabilities, which is a significant enhancement for the provider's robustness. The implementation includes configurable retry policies, exponential backoff, and state management for the connection lifecycle. The code is generally well-structured. However, I've found a couple of potential issues in the new logic: one related to a possible integer overflow when configuring gRPC keep-alive, and another concerning the logic for handling the grace period after a disconnection. Both could lead to incorrect behavior under certain conditions.

m-olko added 2 commits March 18, 2026 10:27
Signed-off-by: Marcin Olko <molko@google.com>
Signed-off-by: Marcin Olko <molko@google.com>
@m-olko
Copy link
Contributor Author

m-olko commented Mar 18, 2026

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust gRPC client with automatic reconnection and exponential backoff, which is a significant improvement. The implementation correctly handles various states of the connection lifecycle. I have identified a potential issue with the retry counter that could lead to an integer overflow and have also suggested making the maximum number of retry attempts configurable for better flexibility. Overall, these are solid changes.

Signed-off-by: Marcin Olko <molko@google.com>
@m-olko
Copy link
Contributor Author

m-olko commented Mar 18, 2026

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements an automatic reconnection mechanism for the gRPC sync process in the flagd provider. This is a significant feature that improves the provider's resilience to network issues. The changes include adding numerous configuration options for controlling retry behavior (backoff, deadlines, etc.), implementing the reconnection loop with exponential backoff in GrpcSync, and adding corresponding state management and tests.

The implementation is comprehensive, but I've identified a couple of critical issues related to configuration validation that could lead to busy loops. I've also noted an inconsistency in how invalid status codes are handled between two different setter overloads. My review provides suggestions to address these points.

m-olko added 2 commits March 18, 2026 14:05
Signed-off-by: Marcin Olko <molko@google.com>
…ks for timings

Signed-off-by: Marcin Olko <molko@google.com>
@m-olko
Copy link
Contributor Author

m-olko commented Mar 18, 2026

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a robust automatic reconnection mechanism for the gRPC sync, including exponential backoff, configurable retry policies, and handling of fatal errors. The changes are extensive and well-structured, introducing new configuration options and a state machine to manage the connection lifecycle.

My review focuses on potential race conditions, resource management, and areas for improved configuration flexibility. I've identified a critical issue regarding thread cleanup on initialization failure, a potential integer overflow in the backoff logic, an incorrect test case, and an opportunity to make more retry parameters configurable. Addressing these points will enhance the stability and usability of the new feature.

Signed-off-by: Marcin Olko <molko@google.com>
@m-olko
Copy link
Contributor Author

m-olko commented Mar 18, 2026

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements an automatic reconnection mechanism for the gRPC sync process, which is a great improvement for the provider's resilience. The changes include adding new configuration options for retry behavior, implementing an exponential backoff strategy, and handling fatal errors to prevent endless retry loops. The state machine in GrpcSync has been enhanced to manage reconnecting and fatal states.

I've found a couple of areas for improvement in the retry logic:

  1. The backoff strategy is inconsistent. When stream creation fails, it uses a fixed delay, while a failed established stream uses exponential backoff.
  2. The exponential backoff calculation has a potential integer overflow issue with non-default configurations.

My detailed comments provide suggestions to address these points. Overall, this is a solid implementation of a complex feature.

Comment on lines +206 to +209
lifecycle_cv_.wait_for(
lock, std::chrono::milliseconds(config_.GetRetryBackoffMs()), [this] {
return state_ == State::kShuttingDown || state_ == State::kFatal;
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The retry logic here uses a fixed backoff time when stream creation fails. This is inconsistent with the exponential backoff used when an established stream fails. To make the retry mechanism more robust and consistent, an exponential backoff should be applied here as well.

      int64_t backoff = config_.GetRetryBackoffMs() * (1LL << retry_count);
      if (backoff >= config_.GetRetryBackoffMaxMs()) {
        backoff = config_.GetRetryBackoffMaxMs();
      } else {
        retry_count++;
      }
      lifecycle_cv_.wait_for(lock, std::chrono::milliseconds(backoff), [this] {
        return state_ == State::kShuttingDown || state_ == State::kFatal;
      });

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant