SOLR-18189: Improve tracking of duplicate Solr docs with content hash #4263
fhuaulme wants to merge 3 commits into apache:main from
Conversation
import java.util.function.Predicate;

/**
 * An implementation of {@link UpdateRequestProcessor} which computes a hash of selected doc values, and uses this hash
"doc values" here could be confusing... as Lucene/Solr has "docValues" and this URP in fact uses that technical mechanism.
  return true; // Doc not found
}

private DocFoundAndOldUserAndSolrVersions getOldUserVersionsFromStored(BytesRef indexedDocId)
IMO we should not support this. It's not unreasonable to demand that someone using this feature enable docValues on the hash field.
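Requiring docValues on the hash field, as suggested above, would mean the schema declares the field something like this (the field name and the choice to leave indexed/stored false are illustrative, not mandated by the patch):

```xml
<!-- Hypothetical schema.xml entry for the content-hash field.
     With docValues="true" the URP can read the stored hash without
     needing a stored/indexed copy of it. -->
<field name="content_hash" type="string"
       indexed="false" stored="false" docValues="true"/>
```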
}

public String computeDocHash(SolrInputDocument doc) {
  List<String> docIncludedFieldNames =
This results in a double navigation of the fields of a document in order to have the hash be insensitive to the order of fields added. But in practice, a given end-to-end pipeline is going to be deterministic. Hmm. I suppose it's no big deal. To improve, you could instead stream SolrInputField, sorted by name, and get the value from each without another lookup. You could also do the whole thing as a stream pipeline without needlessly producing a List. It'd end in a forEach to process the value.
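A rough sketch of the single-pass stream shape suggested above, using a plain Map in place of SolrInputDocument so it runs standalone (the SHA-256 choice, field names, and the excluded-field set are illustrative assumptions, not what the patch does verbatim):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ContentHashSketch {

    // Streams the fields sorted by name, so the hash is insensitive to the
    // order in which fields were added, and ends with the same kind of
    // Base64 encoding the URP applies to the signature bytes.
    public static String computeDocHash(Map<String, Object> doc, Set<String> excluded)
            throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        doc.entrySet().stream()
            .filter(e -> !excluded.contains(e.getKey()))
            .sorted(Map.Entry.comparingByKey())
            .forEach(e -> {
                md.update(e.getKey().getBytes(StandardCharsets.UTF_8));
                md.update(String.valueOf(e.getValue()).getBytes(StandardCharsets.UTF_8));
            });
        return Base64.getEncoder().encodeToString(md.digest());
    }

    public static void main(String[] args) throws Exception {
        Map<String, Object> a = new LinkedHashMap<>();
        a.put("title", "foo");
        a.put("body", "bar");
        Map<String, Object> b = new LinkedHashMap<>();
        b.put("body", "bar");
        b.put("title", "foo");
        // Field order differs, but the sorted stream yields the same hash.
        System.out.println(computeDocHash(a, Set.of("content_hash"))
            .equals(computeDocHash(b, Set.of("content_hash"))));
    }
}
```

Sorting inside the stream avoids building an intermediate List of field names and then re-looking each one up on the document.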
  }
}

private static class DocFoundAndOldUserAndSolrVersions {
Should this be a record? Or simply an Optional of the oldHash?
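For illustration, both suggestions can be combined: a record whose empty Optional replaces the separate "doc found" boolean (the names here are hypothetical, not the class's current fields):

```java
import java.util.Optional;

public class RecordSketch {

    // Hypothetical record form of DocFoundAndOldUserAndSolrVersions:
    // an empty Optional means the doc was not found, so the extra
    // "found" flag disappears.
    record OldDocVersions(Optional<String> oldHash) {
        boolean docFound() {
            return oldHash.isPresent();
        }
    }

    public static void main(String[] args) {
        OldDocVersions missing = new OldDocVersions(Optional.empty());
        OldDocVersions present = new OldDocVersions(Optional.of("abc123=="));
        System.out.println(missing.docFound());  // false
        System.out.println(present.docFound());  // true
    }
}
```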
      .encodeToString(signature); // Makes a base64 hash out of signature value
}

interface OldDocProvider {
realistically, what impls are we going to have?
} finally {
  rsp.addToLog("numAddsExisting", sameCount + differentCount + unknownCount);
  rsp.addToLog("numAddsExistingWithIdentical", sameCount);
  rsp.addToLog("numAddsExistingUnknown", unknownCount);
RE "numAdds" -- let's use a "contentHash" prefix so as to associate these with this URP, and not confuse it with DocBasedVersionConstraintsProcessor, which ought to include similar logs but doesn't yet do so.
IMO adding a new doc shouldn't add any log. I would expect one of these to incorporate the word "drop" or "discard" to reflect an action taken. I understand this URP can be configured to not take that action, albeit that would be in an exploratory / trial situation that ought to be temporary (I imagine).
  this.discardSameDocuments = discardSameDocuments;
}

private boolean validateHash(BytesRef indexedDocId, String newHash) throws IOException {
Perhaps we should align with a similar name in DocBasedVersionConstraintsProcessor, and rename to hashIsAcceptable ?
private static final char SEPARATOR = ','; // Separator for included/excluded fields
static final String CONTENT_HASH_ENABLED_PARAM = "contentHashEnabled";
private List<String> includeFields =
    Collections.singletonList("*"); // Included fields defaults to 'all'
Modern Java prefers List.of.
    Collections.singletonList("*"); // Included fields defaults to 'all'
private List<String> excludeFields = new ArrayList<>();
// No excluded field by default, yet hashFieldName is excluded by default
private String hashFieldName =
arguably we should insist this be provided rather than some auto-magic name
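Making the field name mandatory would look something like the following in a chain configuration (the factory class name and parameter names besides contentHashEnabled are illustrative, not taken from the patch):

```xml
<!-- Hypothetical solrconfig.xml chain with hashFieldName required
     rather than defaulted to an auto-magic name. -->
<updateRequestProcessorChain name="content-hash">
  <processor class="solr.ContentHashUpdateProcessorFactory">
    <bool name="contentHashEnabled">true</bool>
    <str name="hashFieldName">content_hash</str>
    <str name="includeFields">*</str>
    <str name="excludeFields">timestamp</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```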
private boolean validateHash(BytesRef indexedDocId, String newHash) throws IOException {
  assert null != indexedDocId;

  var docFoundAndOldUserVersions = getOldUserVersionsFromStored(indexedDocId);
This is one possible algorithm -- look up the hash from docValues of a doc that we got by first looking up the current doc by ID (a terms index lookup). But this is ~twice as expensive as doing a terms index lookup of the new hash, assuming we find no doc with that hash. If we do find one, the cost is similar, as we then need to fetch the ID via docValues to see if it's the same doc. But we're about to drop the doc then, which isn't the common path. Even "need" there is debatable; it'd be extremely unlikely to find a match by content hash to a different doc ID.
Basically -- the content hash field should have indexed=true XOR docValues=true. We don't need to support both (IMO) -- we can choose what's most efficient for this use-case.
I understand you may not have time to implement another algorithm.
Just an FYI: the postingsFormat (of the terms index) is configurable and there are a number of lovely implementations, including a bloom filter one, which would make a lot of sense for a "lookup" / exact-match-only use-case. Such a choice is up to the user to configure in the schema if they wish.
Oh wait... docs that have been indexed but are not yet visible via the IndexSearcher will not have their content hashes searchable. Albeit in such a situation, the "worst" thing that would happen is we index a doc that could have been dropped/discarded -- a missed optimization opportunity.
A terms lookup on the hash should be much faster than the RTG + double-docValues lookups though.
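The trade-off described above can be sketched with in-memory maps standing in for the terms index and docValues structures; every name below is illustrative, not a Solr/Lucene API:

```java
import java.util.HashMap;
import java.util.Map;

public class LookupStrategies {
    // Stand-ins for index structures (assumptions for illustration):
    static Map<String, Integer> idTerms = new HashMap<>();   // uniqueKey terms index
    static Map<String, Integer> hashTerms = new HashMap<>(); // hash-field terms index
    static Map<Integer, String> hashDV = new HashMap<>();    // hash docValues
    static Map<Integer, String> idDV = new HashMap<>();      // id docValues
    static int lookups; // counts index accesses, to compare the two algorithms

    // Current algorithm: find the doc by ID (terms lookup), then read its
    // stored hash from docValues -- always two index accesses.
    static boolean acceptViaIdLookup(String id, String newHash) {
        lookups++;
        Integer docNum = idTerms.get(id);
        if (docNum == null) return true; // doc not found: accept the update
        lookups++;
        return !newHash.equals(hashDV.get(docNum)); // identical hash -> drop
    }

    // Proposed algorithm: look up the new hash in its terms index. In the
    // common case (no match) that is a single access; on a match, confirm
    // via docValues that it is the same doc ID before dropping.
    static boolean acceptViaHashLookup(String id, String newHash) {
        lookups++;
        Integer docNum = hashTerms.get(newHash);
        if (docNum == null) return true; // no doc with this content hash
        lookups++;
        return !id.equals(idDV.get(docNum)); // same doc, same hash -> drop
    }

    public static void main(String[] args) {
        // One indexed doc: id "1" with hash "h1" at internal doc number 0.
        idTerms.put("1", 0);
        hashTerms.put("h1", 0);
        hashDV.put(0, "h1");
        idDV.put(0, "1");

        lookups = 0;
        System.out.println(acceptViaIdLookup("1", "h2"));   // changed content: accept
        System.out.println(lookups);                        // 2 accesses
        lookups = 0;
        System.out.println(acceptViaHashLookup("1", "h2")); // changed content: accept
        System.out.println(lookups);                        // 1 access
    }
}
```

The common path (new or changed content) is where the hash-first strategy halves the index accesses; only the rare identical-content path needs the second lookup.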
https://issues.apache.org/jira/browse/SOLR-18189
Description
Solr may receive updates containing documents identical to what is already indexed. Depending on the use case, these updates may be unnecessary, so discarding them would be beneficial.
Solution
Content hash detection aims to improve update efficiency by identifying and bypassing redundant document updates. When it detects that a document's content is unchanged, the URP chain skips the unnecessary write, reducing CPU and I/O overhead on nodes by avoiding the work done in downstream URPs.
Tests
Checklist
Please review the following and check all that apply:
- I have developed this patch against the main branch.
- I have run ./gradlew check.