9 changes: 9 additions & 0 deletions changelog/unreleased/SOLR-18176-shardhandler-bottleneck.yml
@@ -0,0 +1,9 @@
title: Fix query throughput bottleneck caused by uncached ZooKeeper get calls for queries with explicit 'collection' parameter
Contributor:
It's not obvious but we're permitted to put multiple sentences here. I think the sentence you replaced was great as it was. Adding a second sentence clarifying the scope of impact (i.e. under what circumstances this was a problem) is needed too. "queries with an explicit 'collection' parameter" -- uh, this is news to me. Perhaps say (if true): "Happens when Solr does distributed-search over multiple collections, and when the coordinator has no local replica for some of them."

type: changed
authors:
- name: Matthew Biscocho
links:
- name: SOLR-18176
url: https://issues.apache.org/jira/browse/SOLR-18176
- name: SOLR-15352
url: https://issues.apache.org/jira/browse/SOLR-15352
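Following the reviewer's suggestion to add a second sentence scoping the impact, a hedged sketch of how the amended entry might read (exact wording hypothetical, still under discussion in the thread):

```yaml
title: >-
  Fix a query throughput bottleneck caused by uncached ZooKeeper reads of
  collection state. Happens when Solr does a distributed search over multiple
  collections and the coordinator hosts no local replica for some of them.
type: changed
```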
@@ -109,12 +109,13 @@ private void withShardsParam(Builder builder, String shardsParam) {
       if (sliceOrUrl.indexOf('/') < 0) {
         // this is a logical shard
         this.slices[i] = sliceOrUrl;
-        replicas[i] =
-            findReplicas(
-                builder,
-                shardsParam,
-                clusterState,
-                clusterState.getCollection(builder.collection).getSlice(sliceOrUrl));
+        DocCollection coll = clusterState.getCollectionOrNull(builder.collection, true);
+        if (coll == null) {
+          throw new SolrException(
+              SolrException.ErrorCode.BAD_REQUEST,
+              "Could not find collection to resolve replicas: " + builder.collection);
+        }
+        replicas[i] = findReplicas(builder, shardsParam, clusterState, coll.getSlice(sliceOrUrl));
       } else {
         // this has urls
         this.replicas[i] = StrUtils.splitSmart(sliceOrUrl, "|", true);
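The fix replaces `clusterState.getCollection(...)` with `getCollectionOrNull(builder.collection, true)`, where the boolean allows a cached collection-state snapshot instead of forcing a ZooKeeper read per query. As a hedged illustration of that pattern only (hypothetical names, not Solr's actual `ClusterState`/`LazyCollectionRef` implementation), a cache-aware lookup might look like:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a cache-aware state lookup: allowCached=true serves a cached
// snapshot and only does the expensive fetch (ZooKeeper, in Solr's case)
// on a miss; allowCached=false always re-fetches.
class StateCache {
  private final Map<String, String> cached = new ConcurrentHashMap<>();
  private int fetches = 0; // counts expensive lookups, for illustration

  // Stands in for the expensive ZooKeeper read.
  private String fetchFromSource(String collection) {
    fetches++;
    return "state-of-" + collection;
  }

  public String getCollectionOrNull(String collection, boolean allowCached) {
    if (allowCached) {
      // Fetch only on a cache miss.
      return cached.computeIfAbsent(collection, this::fetchFromSource);
    }
    String fresh = fetchFromSource(collection);
    cached.put(collection, fresh);
    return fresh;
  }

  public int fetchCount() {
    return fetches;
  }
}
```

The null check after the lookup matters because a cached-state request can legitimately come back empty for an unknown collection, which the patch surfaces as a `BAD_REQUEST` rather than a `NullPointerException`.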
@@ -189,7 +190,12 @@ private void addSlices(
String collectionName,
String shardKeys,
boolean multiCollection) {
-    DocCollection coll = state.getCollection(collectionName);
+    DocCollection coll = state.getCollectionOrNull(collectionName, true);
+    if (coll == null) {
+      throw new SolrException(
+          SolrException.ErrorCode.BAD_REQUEST,
+          "Could not find collection to add slices: " + collectionName);
+    }
Collection<Slice> slices = coll.getRouter().getSearchSlices(shardKeys, params, coll);
ClientUtils.addSlices(target, collectionName, slices, multiCollection);
}
@@ -24,15 +24,19 @@
import java.util.Set;
import java.util.concurrent.TimeUnit;
import org.apache.solr.BaseDistributedSearchTestCase;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.jetty.HttpJettySolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.client.solrj.request.SolrQuery;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.cloud.SolrCloudTestCase;
import org.apache.solr.common.cloud.SolrZKMetricsListener;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ShardParams;
import org.apache.solr.common.util.SimpleOrderedMap;
import org.apache.solr.common.util.StrUtils;
import org.apache.solr.embedded.JettySolrRunner;
import org.junit.BeforeClass;
import org.junit.Test;

@@ -707,6 +711,57 @@ private QueryResponse queryWithAsserts(String... q) throws Exception {
return response;
}

/**
* When a node resolves collection state for a collection it doesn't host, queries should use
* cached state and not make ZK calls on every query.
*/
@Test
public void testDistributedQueryDoesNotReadFromZk() throws Exception {
Contributor Author:
@dsmiley I agreed with your comment before and wrote a test. Did some TDD, so this definitely failed before the fix. Also learned a bit more: this actually only triggers with a multi-collection query, or if you have an alias pointing to multiple collections. So the test reflects that, creating a collection on a separate node and then querying both. If your node hosts all the collections, it doesn't make a lazyCollectionRef get call. Maybe worth adding that to the changelog?

Contributor:
aaaaah! Aha! You have fixed SOLR-15352 then; wonderful. I'm so glad I hoped / suggested a test... we can learn a lot in the process.
Yes the changelog should reflect the scope of the problem, which apparently isn't all queries. This explains why this problem hasn't been identified yet.

Contributor Author:
Added the link to it in the changelog and slightly changed the title.

final String secondColl = "secondColl";

// Create a collection on only 1 node so the other node uses LazyCollectionRef for state
List<JettySolrRunner> jettys = cluster.getJettySolrRunners();
CollectionAdminRequest.createCollection(secondColl, "conf", 1, 1)
.setCreateNodeSet(jettys.get(0).getNodeName())
.processAndWait(cluster.getSolrClient(), DEFAULT_TIMEOUT);
cluster
.getZkStateReader()
.waitForState(
secondColl,
DEFAULT_TIMEOUT,
TimeUnit.SECONDS,
(n, c) -> SolrCloudTestCase.replicasForCollectionAreFullyActive(n, c, 1, 1));

try {
// Node 1 hosts COLLECTION but not secondColl.
// Send a multi-collection query to trigger LazyCollectionRef get call
JettySolrRunner nodeWithoutSecondColl = jettys.get(1);
try (SolrClient client =
new HttpJettySolrClient.Builder(nodeWithoutSecondColl.getBaseUrl().toString()).build()) {

String collectionsParameter = COLLECTION + "," + secondColl;

// Warm up LazyCollectionRef state cache with query
client.query(COLLECTION, new SolrQuery("q", "*:*", "collection", collectionsParameter));

Contributor:
this only looks at the metrics on one node, and it's not obvious what node that is. A comment could clarify or actually get it from the pertinent jetty (the one you will query, the coordinator).

// Get ZK metrics from the coordinator node (the one we're querying)
SolrZKMetricsListener metrics =
nodeWithoutSecondColl.getCoreContainer().getZkController().getZkClient().getMetrics();
long existsBefore = metrics.getExistsChecks();

// Query again and assert that exists call is not made
client.query(COLLECTION, new SolrQuery("q", "*:*", "collection", collectionsParameter));
assertEquals(
"Query should not cause ZK exists checks as collection state should be cached",
existsBefore,
metrics.getExistsChecks());
}
} finally {
CollectionAdminRequest.deleteCollection(secondColl)
.processAndWait(cluster.getSolrClient(), DEFAULT_TIMEOUT);
}
}
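The test's core idea is a before/after counter assertion: snapshot the ZK `exists`-check metric, run the query, and assert the counter did not move. A minimal sketch of that reusable pattern (hypothetical helper, not a Solr API), using a `LongSupplier` in place of `SolrZKMetricsListener`:

```java
import java.util.function.LongSupplier;

// Sketch of the before/after metrics assertion pattern: returns true if
// running 'action' left the counter unchanged.
class MetricsCheck {
  public static boolean unchangedAcross(LongSupplier counter, Runnable action) {
    long before = counter.getAsLong();
    action.run();
    return counter.getAsLong() == before;
  }
}
```

Note the warm-up query in the test above: the first multi-collection query is allowed to populate the lazy state cache (and touch ZK); only the second query is asserted against, which is what makes the before/after comparison meaningful.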

private int getNumRequests(
Map<String, List<TrackingShardHandlerFactory.ShardRequestAndParams>> requests) {
int beforeNumRequests = 0;