Conversation
| .get_permits(permit_sizes) | ||
| .await; | ||
| let permit_provider = if is_broad_search { | ||
| &searcher_context.secondary_search_permit_provider |
There was a problem hiding this comment.
with this logic, short request cannot use the broad search request permits. I don't think this is good.
(do you want a supermarket where people with less than 10 articles cannot go to the regular cashier)
There was a problem hiding this comment.
Makes sense. We just don't want to queue them there if there is some broad search going on.
Before making this change, what do you think about the fact that the permit provider is chosen on the leafs? I would prefer that for a large search, all leaf searches were tagged as "broad" in the root. Currently, the timeout cannot be enforced on the root because we don't know yet what timeout will be applied to its leaf searches. This requires a change in the RPC so I wanted to be sure the the change will be merged upstream before doing it.
|
@rdettai-sk I think we should not merge this PR. At least at the moment. The trouble is that it is a hacky solution (the problem I pointed out, but also others: Do we really want to not use the entire permit pool for a long request if there are no request running? Would a priority system be more suited?), this adds complexity (it has many new property in the configuration for instance) to the project. After the PR is merged, moving away for a better solution would require coordination. I understand this is a rather "unfair": we would happily merge a hack like this if it was scratching our customer's itch. |
|
I agree this solution isn't perfect. I also understand that adding new configurations that are likely to become irrelevant when moving to a better solution creates an undesirable maintenance burden. Do you have plans to work on the search scheduler soon? Do you never run into situations where the searchers become irresponsive because of this? |
Description
Currently, the search permit provider handles only one query at a time. This is good for the overall query latency when queries are relatively short. It is catastrophic in any concurrent setup where some queries are slow, because all the other queries will be starved from resources until the large query completes.
This PR creates a secondary search permit provider that is independently configured. "Fast" leaf queries are routed to one provider and "slow" ones to the other. Query duration is naively inferred from the number of targetted splits, and the threshold for being routed to one provider or the other can be configured using
SearcherConfig.secondary_targeted_split_count_threshold.One limitation of the current implementation of this PR is that for a given root search, various leaf searches might fall on different sides of the threshold. We could fix this by applying the threashold on the root and add the targeted provider in the leaf request.
How was this PR tested?
We ran this setup on our highly concurrent production environment. It helped stabilize query performance drastically.