
A115: disable priority LB child policy retention cache#541

Open
apolcyn wants to merge 9 commits into grpc:master from apolcyn:remove_cache

Conversation

@apolcyn
Contributor

@apolcyn apolcyn commented Mar 17, 2026

@apolcyn apolcyn changed the title Disable priority LB child cache A56 update: disable priority LB child cache Mar 17, 2026
@apolcyn apolcyn changed the title A56 update: disable priority LB child cache A56 update: disable priority LB child policy retention cache Mar 17, 2026
@markdroth markdroth changed the title A56 update: disable priority LB child policy retention cache A115: disable priority LB child policy retention cache Mar 18, 2026
Member

@markdroth markdroth left a comment


Thanks for writing this up!

Please let me know if you have any questions.


## Abstract

[A56](https://github.com/grpc/proposal/blob/master/A56-priority-lb-policy.md) describes
Member


This statement is misleading. A56 specifies that the caching mechanism will be used in two cases:

  1. When a higher-priority child becomes reachable, we deactivate the lower-priority children and eventually remove them.
  2. When a child is removed from the LB policy config.

This gRFC is removing case (2) only. We did not discuss removing case (1), and I don't think it makes sense to do so.

Contributor Author


Added these points.

Contributor


Removing the caching only for case (2) is not going to be as simple as setting an integer (at least in Go), as discussed today.

To give a bit more context: we use a balancergroup, a helper that manages a set of LB policies. Callers can add and remove policies from the balancergroup, and it lets them control how state updates from those policies are aggregated. Our caching is implemented in the balancergroup, which is used by several LB policies, including priority, rls, weightedtarget, and clustermanager.

We could add a flavor of the Remove method that supports caching and one that doesn't. But I'm wondering how the other languages implement this caching and how easy/hard it is going to be to support this only for case (2).

Contributor Author


cc @markdroth about this

Member

@markdroth markdroth Mar 18, 2026


I don't think we want to disable caching for case (1). If Go needs more work to do that, then we need to do that work.

FWIW, in C-core, we don't have any shared code for the policies you listed. The priority policy deactivates the child here for case 1 and here for case 2. So all we would need to do to remove this functionality for case 2 is to change the latter to immediately remove the child instead of deactivating it.

I can understand why Go might have used shared code for the priority, weighted_target, and xds_cluster_manager LB policies, since they all have this caching behavior. But I will note that the RLS policy was never supposed to have this caching behavior, so it seems strange to me that you'd also use it there.



[[AA, BB, CC], [DD]] => [priority-0-3 priority-0-4]
[[BB, CC], [AA, DD]] => [priority-0-0 priority-0-1]
[[AA, BB, CC], [DD, EE]] => [priority-0-1 priority-0-2]
[[BB, CC], [AA, DD]] => [priority-0-2 priority-0-3]
Member


I think the algorithm should be sticking with child numbers 1 and 2 in this last step.

I've sent you grpc/grpc#41896 to add tests demonstrating the algorithm's behavior in C-core.

Contributor Author


Hmm, the original text is what Go was doing (I tagged you on an internal doc to illustrate), but maybe that was specific to Go.

Changed to your original suggestion.

Member


In the case you have now in the gRFC, in the third update, the algorithm will use child number 3 for [AA, BB] and reuse child number 2 for [CC, DD], because it was already using that number for [CC] in the second update.

If what you're seeing in Go matches what you currently have in the doc, then I suspect Go's implementation of this algorithm isn't quite right. @easwars

Contributor


Will take a look and get back asap.

Member

@ejona86 ejona86 left a comment


Do we (all of us) have a plan for cluster_manager? Are we going to make a separate gRFC for that?

@markdroth
Member

Do we (all of us) have a plan for cluster_manager? Are we going to make a separate gRFC for that?

I don't think we need any changes there at this point. Arguably, the caching there is a gray area, because I don't think any gRFC ever specified that we would actually have caching there, but all of our implementations actually do. But on the other hand, I don't think the caching in that policy is actually causing any problems right now. So my inclination is to wait until we have a problem before changing that.

@apolcyn apolcyn marked this pull request as ready for review March 19, 2026 19:15

Note this should be done for Java and Go only.

For C-core, we are actually potentially getting benefit from this
behavior due to subchannel sharing, so we're not planning to drop
Member


s/sharing/pooling/, just to be precise.

Contributor Author


done
