-
Notifications
You must be signed in to change notification settings - Fork 817
[SOLR-18176] HttpShardHandler query throughput bottleneck from ZooKeeper #4237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| title: Fix query throughput bottleneck caused by uncached ZooKeeper get calls for queries with explicit 'collection' parameter | ||
| type: changed | ||
dsmiley marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| authors: | ||
| - name: Matthew Biscocho | ||
| links: | ||
| - name: SOLR-18176 | ||
| url: https://issues.apache.org/jira/browse/SOLR-18176 | ||
| - name: SOLR-15352 | ||
| url: https://issues.apache.org/jira/browse/SOLR-15352 | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -24,15 +24,19 @@ | |
| import java.util.Set; | ||
| import java.util.concurrent.TimeUnit; | ||
| import org.apache.solr.BaseDistributedSearchTestCase; | ||
| import org.apache.solr.client.solrj.SolrClient; | ||
| import org.apache.solr.client.solrj.jetty.HttpJettySolrClient; | ||
| import org.apache.solr.client.solrj.request.CollectionAdminRequest; | ||
| import org.apache.solr.client.solrj.request.SolrQuery; | ||
| import org.apache.solr.client.solrj.request.UpdateRequest; | ||
| import org.apache.solr.client.solrj.response.QueryResponse; | ||
| import org.apache.solr.cloud.SolrCloudTestCase; | ||
| import org.apache.solr.common.cloud.SolrZKMetricsListener; | ||
| import org.apache.solr.common.params.CommonParams; | ||
| import org.apache.solr.common.params.ShardParams; | ||
| import org.apache.solr.common.util.SimpleOrderedMap; | ||
| import org.apache.solr.common.util.StrUtils; | ||
| import org.apache.solr.embedded.JettySolrRunner; | ||
| import org.junit.BeforeClass; | ||
| import org.junit.Test; | ||
|
|
||
|
|
@@ -707,6 +711,57 @@ private QueryResponse queryWithAsserts(String... q) throws Exception { | |
| return response; | ||
| } | ||
|
|
||
| /** | ||
| * When a node resolves collection state for a collection it doesn't host, queries should use | ||
| * cached state and not make ZK calls on every query. | ||
| */ | ||
| @Test | ||
| public void testDistributedQueryDoesNotReadFromZk() throws Exception { | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @dsmiley I agreed with your comment before and wrote a test. Did some TDD so this definitely failed before the fix. Also learned a bit more. This actually only triggers with multi-collection query or I guess if you have an alias pointing to multiple collections. So this test reflects that creating a collection on a separate node then querying for both. If your node is hosting all collections it doesn't do a lazyCollectionRef get call. Maybe worth adding that to the changelog?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. aaaaah! Aha! You have fixed SOLR-15352 then; wonderful. I'm so glad I hoped / suggested a test... we can learn a lot in the process.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added the link to it in the change log and slightly changed the title |
||
| final String secondColl = "secondColl"; | ||
|
|
||
| // Create a collection on only 1 node so the other node uses LazyCollectionRef for state | ||
| List<JettySolrRunner> jettys = cluster.getJettySolrRunners(); | ||
| CollectionAdminRequest.createCollection(secondColl, "conf", 1, 1) | ||
| .setCreateNodeSet(jettys.get(0).getNodeName()) | ||
| .processAndWait(cluster.getSolrClient(), DEFAULT_TIMEOUT); | ||
| cluster | ||
| .getZkStateReader() | ||
| .waitForState( | ||
| secondColl, | ||
| DEFAULT_TIMEOUT, | ||
| TimeUnit.SECONDS, | ||
| (n, c) -> SolrCloudTestCase.replicasForCollectionAreFullyActive(n, c, 1, 1)); | ||
|
|
||
| try { | ||
| // Node 1 hosts COLLECTION but not secondColl. | ||
| // Send a multi-collection query to trigger LazyCollectionRef get call | ||
| JettySolrRunner nodeWithoutSecondColl = jettys.get(1); | ||
| try (SolrClient client = | ||
| new HttpJettySolrClient.Builder(nodeWithoutSecondColl.getBaseUrl().toString()).build()) { | ||
|
|
||
| String collectionsParameter = COLLECTION + "," + secondColl; | ||
|
|
||
| // Warm up LazyCollectionRef state cache with query | ||
| client.query(COLLECTION, new SolrQuery("q", "*:*", "collection", collectionsParameter)); | ||
|
|
||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this only looks at the metrics on one node, and it's not obvious what node that is. A comment could clarify or actually get it from the pertinent jetty (the one you will query, the coordinator). |
||
| // Get ZK metrics from the coordinator node (the one we're querying) | ||
| SolrZKMetricsListener metrics = | ||
| nodeWithoutSecondColl.getCoreContainer().getZkController().getZkClient().getMetrics(); | ||
| long existsBefore = metrics.getExistsChecks(); | ||
|
|
||
| // Query again and assert that exists call is not made | ||
| client.query(COLLECTION, new SolrQuery("q", "*:*", "collection", collectionsParameter)); | ||
| assertEquals( | ||
| "Query should not cause ZK exists checks as collection state should be cached", | ||
| existsBefore, | ||
| metrics.getExistsChecks()); | ||
| } | ||
| } finally { | ||
| CollectionAdminRequest.deleteCollection(secondColl) | ||
| .processAndWait(cluster.getSolrClient(), DEFAULT_TIMEOUT); | ||
| } | ||
| } | ||
|
|
||
| private int getNumRequests( | ||
| Map<String, List<TrackingShardHandlerFactory.ShardRequestAndParams>> requests) { | ||
| int beforeNumRequests = 0; | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not obvious but we're permitted to put multiple sentences here. I think the sentence you replaced was great as it was. Adding a second sentence clarifying the scope of impact (i.e. under what circumstances this was a problem) is needed too. "queries with an explicit 'collection' parameter" -- uh, this is news to me. Perhaps say (if true): "Happens when Solr does distributed-search over multiple collections, and when the coordinator has no local replica for some of them."