Conversation
This reverts commit 5de8cd7.
| String v = uniqueKey.getType().toInternal(params.get(ID)); | ||
| Term t = new Term(uniqueKey.getName(), v); | ||
| docId = searcher.getFirstMatch(t); | ||
| if (docId < 0) { |
There was a problem hiding this comment.
This is a slight change to the existing contract. Previously, if a docId was missing it would throw. Now I propose returning an empty response (more in line with how search works). This is to handle the distributed case more elegantly. It is theoretically a breaking change, although it is perhaps unlikely that someone was relying on this error. One way I could see this being a problem is someone assuming the doc field will be in the response whenever the response code is 200, without null-checking first. I suppose this could be problematic and perhaps warrants more care...
There was a problem hiding this comment.
the change makes sense to me.
| final String originalShardAddr; | ||
| final LukeResponse.FieldInfo originalFieldInfo; | ||
| private Object indexFlags; | ||
| private String indexFlagsShardAddr; |
There was a problem hiding this comment.
The source of the field info and the index flags can actually be different. If a shard has all the documents of a particular field deleted, it will have field info but no index flags; those can theoretically come from yet another shard.
There was a problem hiding this comment.
A different approach might be to simply get "first doc" from the index instead of "first live doc". I don't know if this has other implications. But it would eliminate this edge case. It's funny that we already touched first live doc logic here #4157
There was a problem hiding this comment.
How does the edge case result in difficulty? Meaning... difficulty in the client to interpret the results, difficulty in explaining, difficulty in implementing this? I tend to find it useful to err on the side of optimizing for result explainability.
There was a problem hiding this comment.
Mainly implementation difficulty. To start, this edge case only matters for the shard inconsistency validation so if we don't care about that then there's no point in even tracking the actual shard that contributed the first index flags.
Assuming we do care for a second, the only reason I can think of it being advantageous to read index flags from the first live doc as opposed to simply the first doc is if you did something like this:
- Had some index flags set to x on field f
- Deleted all docs with field f
- Set index flags to y on field f
If you don't delete all docs with field f in step 2 you wouldn't be able to ingest any more docs because the indexing chain's new flags wouldn't match the existing indexed fields. That being said, I don't think I've actually tried this so I am not sure the steps above would provide a valid, writeable index either. That's why I was posing the question whether it matters if the doc is deleted or not.
There was a problem hiding this comment.
If we're not sure it matters, let's not over-complicate this. Do the simple thing: ignore "live docs". We're not boxed in from realizing later we'd like something more complicated.
There was a problem hiding this comment.
I did a little more testing on this and my initial theory that this distinction only exists because we get the first live doc rather than the first doc is incorrect. It turns out a fully deleted field will lack terms even though it does still have schema flags. If a field lacks terms then we can't even get any doc for it (live or not). A workaround for this would be to ignore all fields that don't have terms (that should have terms) but this can get complicated for other reasons. As such, I am leaning toward keeping it the way it is currently.
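For illustration, here is a minimal sketch of the "keep it the way it is" behavior discussed in this thread: the field info comes from the first shard that reports the field, while the index flags come from the first shard that actually has them. It is a simplified stand-in, not the class in the patch. The flags are modeled as a `String` rather than `Object`, the values `flags-x` and `flags-y` are placeholders, and throwing `IllegalStateException` on a mismatch is an assumption about how the validation might surface.

```java
/** Sketch only: simplified per-field merge state; not the patch's actual class. */
final class MergedFieldFlags {
  final String originalShardAddr;     // shard that first reported field info
  private String indexFlags;          // first non-null index flags seen so far
  private String indexFlagsShardAddr; // shard that contributed those flags

  MergedFieldFlags(String shardAddr, String flags) {
    this.originalShardAddr = shardAddr;
    merge(shardAddr, flags);
  }

  /** Folds in one shard's view of the field. */
  void merge(String shardAddr, String flags) {
    if (flags == null) {
      // The shard has field info but no index flags, e.g. every doc with the field was deleted.
      return;
    }
    if (indexFlags == null) {
      indexFlags = flags;
      indexFlagsShardAddr = shardAddr;
    } else if (!indexFlags.equals(flags)) {
      // Assumption: an inconsistency is reported as an error naming both shards.
      throw new IllegalStateException(
          "index flags differ: " + indexFlagsShardAddr + "=" + indexFlags
              + " vs " + shardAddr + "=" + flags);
    }
  }

  String indexFlags() {
    return indexFlags;
  }

  String indexFlagsShardAddr() {
    return indexFlagsShardAddr;
  }

  public static void main(String[] args) {
    MergedFieldFlags merged = new MergedFieldFlags("shardA", null); // field info, no flags
    merged.merge("shardB", "flags-x");                              // flags arrive from another shard
    System.out.println(merged.originalShardAddr + " / " + merged.indexFlagsShardAddr());
  }
}
```

Tracking `indexFlagsShardAddr` separately only pays off if the inconsistency message should name the shard that supplied the flags; without the validation, the extra field could be dropped.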
|
|
||
| // docId is a Lucene-internal integer, not meaningful across shards | ||
| if (reqParams.getInt(DOC_ID) != null) { | ||
| throw new SolrException( |
There was a problem hiding this comment.
The current API returns a single doc in case of show=doc, thus requesting a Lucene docId in distributed mode makes no sense.
|
|
||
| String[] shards = rb.shards; | ||
| if (shards == null || shards.length == 0) { | ||
| return false; |
There was a problem hiding this comment.
The distributed response includes extra fields, so delegating to the non-distributed API for a single-shard collection is a bit annoying API-wise: the client may have to handle a different response structure per, say, collection or cloud.
| // Deal with the chance that the first bunch of terms are in deleted documents. Is there a | ||
| // better way? | ||
| for (int idx = 0; idx < 1000 && postingsEnum == null; ++idx) { | ||
| for (int idx = 0; idx < 1000; ++idx) { |
| * different Lucene FieldInfo (and thus different index flags strings) on each shard. | ||
| */ | ||
| @Test | ||
| public void testInconsistentIndexFlagsAcrossShards() throws Exception { |
There was a problem hiding this comment.
Is this test too brittle and possibly overkill?
There was a problem hiding this comment.
If we are ok sweeping inconsistencies under the rug we can remove the validation the distributed luke handler does and remove this test. If we remove this test without removing the validation some important code paths of the validation won't be hit in tests. I guess I wanted to showcase how you can get into odd states in Solr with the right (or wrong) sequence of allowed operations.
…handler.adoc Co-authored-by: David Smiley <dsmiley@apache.org>
…nto distributed-luke
solr/core/src/java/org/apache/solr/handler/admin/LukeRequestHandler.java
| int idx = nl.indexOf(key, 0); | ||
| if (idx >= 0) { | ||
| Object val = nl.getVal(idx); | ||
| if (val instanceof Long l && l >= Integer.MIN_VALUE && l <= Integer.MAX_VALUE) { |
solr/core/src/test/org/apache/solr/handler/admin/LukeHandlerCloudTest.java
| import org.junit.Test; | ||
|
|
||
| /** Cloud-specific Luke tests that require SolrCloud features like managed schema and Schema API. */ | ||
| public class LukeHandlerCloudTest extends SolrCloudTestCase { |
solr/core/src/test/org/apache/solr/handler/admin/LukeRequestHandlerDistribTest.java
| assertTrue("aggregated maxDoc should be > 0", rsp.getMaxDoc() > 0); | ||
| assertNotNull("deletedDocs should be present", rsp.getDeletedDocs()); | ||
|
|
||
| Map<String, LukeResponse> shardResponses = rsp.getShardResponses(); |
There was a problem hiding this comment.
I kind of question, do we want the distributed search to have shard information? Naturally we want the merged information. Have you considered use of org.apache.solr.common.params.ShardParams#SHARDS_INFO to toggle this?
There was a problem hiding this comment.
This includes index information that comes with the default luke response (as well as with show=index). I didn't want to just omit all those fields that couldn't be easily merged in the distributed mode (version, indexCommit, isCurrent, segmentsFile, directory, ...). I suppose you could but it didn't feel intuitive and that info could be seen as useful.
I think the initial motivation for SHARDS_INFO is a little different. I'm not opposed to it but also don't see a big benefit of adding another flag to the already many flags in luke. This part of the response is cheap anyway.
OTOH one could argue that you may want this info for all replicas not just from a single replica from each shard, although this would be more custom so perhaps not worth the effort. Something like that could be behind a flag...
There was a problem hiding this comment.
I appreciate it could be useful but I suspect many callers don't care about it. I just wrote a program days ago to find out what fields have docValues & terms index. I definitely didn't want per-shard details.
I suspect the admin UI wouldn't want it either, at least not by default. It would add complexity to a UI to display.
It's also kind of noise even if it's "cheap".
https://issues.apache.org/jira/browse/SOLR-8127
Description
Currently, LukeRequestHandler fetches its data from the local shard handling the request. Thus all the statistics it produces are for a single shard only. This is a bit misleading for a multi-shard cloud because an API that exists at the collection level produces a shard-specific response. The solution proposed here distributes the request to one replica from each shard and aggregates the responses whenever possible. I introduce a shards response collection for data that can't be aggregated, or at least cannot be practically aggregated.
Solution
I used Claude Opus 4.6 to generate the implementation with many manual refinements/iterations.
Design Decisions
One-replica-per-shard aggregation
The distributed Luke handler queries one replica from each shard and aggregates the results. An alternative approach would be to query every replica in the cluster, which could be useful for inspecting low-level Lucene characteristics (directory paths, segment files, version numbers). This implementation takes the simpler one-replica-per-shard approach to match how other distributed handlers in Solr work and to keep the scope manageable. Querying all replicas can be considered as a future enhancement if there's community interest.
distrib=false as default
The handler defaults to distrib=false (local mode), preserving backward compatibility since the response structure changes for distrib=true. When distrib=true, the response adds a shards section containing per-shard data that is either not mergeable or difficult to merge. This includes full index metadata (directory, segmentsFile, version, userData, lastModified) and detailed per-field statistics (topTerms, distinct, histogram).
docsAsLong field in LukeResponse
The existing docs field in LukeResponse.FieldInfo is an int. When summing document counts across many shards, this can overflow. A new docsAsLong accessor was added to LukeResponse to handle collection-wide doc counts safely. The original docs field is preserved for backward compatibility.
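To make the overflow concrete, here is a small sketch that sums per-shard doc counts for one field, first into an `int` and then into a `long`. The class and method names are illustrative and are not taken from the patch; only the idea of a `long`-valued accessor comes from the description above.

```java
import java.util.List;

/** Sketch: summing per-shard doc counts for one field. Names are illustrative. */
final class DocCountSum {

  /** Sums into an int; silently wraps around once the total exceeds Integer.MAX_VALUE. */
  static int sumAsInt(List<Integer> perShardDocs) {
    int total = 0;
    for (int docs : perShardDocs) {
      total += docs; // overflow is not detected here
    }
    return total;
  }

  /** Sums into a long, which is the idea behind a docsAsLong-style accessor. */
  static long sumAsLong(List<Integer> perShardDocs) {
    long total = 0L;
    for (int docs : perShardDocs) {
      total += docs; // each int is widened to long before the addition
    }
    return total;
  }

  public static void main(String[] args) {
    // Two hypothetical shards, each holding 1.5 billion docs for the field.
    List<Integer> perShard = List.of(1_500_000_000, 1_500_000_000);
    System.out.println("int sum:  " + sumAsInt(perShard));  // prints -1294967296
    System.out.println("long sum: " + sumAsLong(perShard)); // prints 3000000000
  }
}
```

Each shard's own count fits in an `int`, so only the aggregated total needs the wider type.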
show=doc lucene response has shard-local docFreq
When looking up a document with distrib=true, the response includes both a high-level solr section (stored fields) and a low-level lucene section (per-field index analysis including docFreq). The docFreq values in the lucene section are not aggregated across shards; they reflect only the shard where the document was found. Aggregating docFreq across shards would require additional fan-out requests to every shard for each term, which is expensive, and I'm not sure what the demand is for this kind of logic. I acknowledge that it may feel inconsistent since the docs logic per field does aggregate across shards.
distrib=true with show=doc only supports the id request param, not docId
Lucene document IDs (docId) are internal to each shard's index and have no meaning across shards, i.e. the same docId value refers to different documents on different shards. Only the logical id parameter (using the schema's uniqueKeyField) is supported for distributed doc lookup, since it can be unambiguously resolved across the cluster. Passing docId with distrib=true returns a clear error. Additionally, if the same id is found on multiple shards (indicating index corruption), the handler returns an error rather than silently picking one.
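The rules in this section can be sketched as a small resolution step over the shard responses. This is a hedged illustration, not the handler's code: `owningShard` and `hitsByShard` are hypothetical names, the map is assumed to contain only the shards that found the document, and plain JDK exceptions stand in for the `SolrException` the handler throws. The empty result for "not found" follows the empty-response behavior proposed in the review thread.

```java
import java.util.Map;
import java.util.Optional;

/** Sketch: resolving a distributed show=doc lookup. All names are hypothetical. */
final class DistribDocLookup {

  /**
   * @param docId value of the docId request param, or null if absent
   * @param hitsByShard shard address -> document, for shards that found the id
   * @return the shard that owns the document, or empty if no shard found it
   */
  static Optional<String> owningShard(Integer docId, Map<String, String> hitsByShard) {
    if (docId != null) {
      // docId is a Lucene-internal integer, not meaningful across shards.
      throw new IllegalArgumentException("docId is shard-local; use id with distrib=true");
    }
    if (hitsByShard.size() > 1) {
      // The same id on several shards indicates a corrupt index; do not pick one silently.
      throw new IllegalStateException("id found on multiple shards: " + hitsByShard.keySet());
    }
    return hitsByShard.keySet().stream().findFirst();
  }

  public static void main(String[] args) {
    System.out.println(owningShard(null, Map.of("shard1", "doc")));
    System.out.println(owningShard(null, Map.of()));
  }
}
```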
Single-shard fallback to local mode
When distrib=true is set on a collection with only one shard, the handler falls back to local mode returning the standard local Luke response without the shards field. This matches a common pattern in other Solr handlers. However, this creates a response structure inconsistency: clients parsing a distributed response need to handle two shapes depending on shard count. For a single-shard collection, per-shard index metadata (like directory, segmentsFile) appears at the top level; for multi-shard collections, it appears under shards. This may be annoying for client code that needs to look in two places for the same data. An alternative would be to always use the distributed response structure when distrib=true is explicitly requested, even for single-shard collections. This is an open question worth community input.
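To show what "looking in two places" means for a client, here is a sketch that reads the index directory from either shape. The response is modeled as nested maps. The top-level `index` section with a `directory` entry follows the local Luke response; the layout under `shards` (shard address, then `index`, then `directory`) is an assumption made for illustration, as are the class name and the `local` key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: a client reading "directory" from either response shape. Layout is assumed. */
final class LukeDirectories {

  @SuppressWarnings("unchecked")
  private static Map<String, Object> asMap(Object o) {
    return o instanceof Map ? (Map<String, Object>) o : Map.of();
  }

  /** Returns shard address -> directory, using the key "local" for the non-distributed shape. */
  static Map<String, String> directories(Map<String, Object> rsp) {
    Map<String, String> out = new LinkedHashMap<>();
    if (rsp.containsKey("shards")) {
      // Multi-shard shape: per-shard index metadata sits under "shards".
      for (Map.Entry<String, Object> shard : asMap(rsp.get("shards")).entrySet()) {
        Object dir = asMap(asMap(shard.getValue()).get("index")).get("directory");
        if (dir != null) {
          out.put(shard.getKey(), dir.toString());
        }
      }
    } else {
      // Single-shard fallback: the same metadata sits at the top level.
      Object dir = asMap(rsp.get("index")).get("directory");
      if (dir != null) {
        out.put("local", dir.toString());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> local = Map.of("index", Map.of("directory", "dirA"));
    System.out.println(directories(local));
  }
}
```

If the distributed structure were always used when distrib=true is requested explicitly, the `else` branch would only be needed for distrib=false callers.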
Tests
Added LukeRequestHandlerDistribTest to test this feature. I have also built this locally to ensure it works end to end.
Checklist
Please review the following and check all that apply:
main branch. ./gradlew check.