SOLR-18189: Improve tracking of duplicate Solr docs with content hash #4263
fhuaulme wants to merge 3 commits into apache:main from
Conversation
import java.util.function.Predicate;

/**
 * An implementation of {@link UpdateRequestProcessor} which computes a hash of selected doc values, and uses this hash
"doc values" here could be confusing... as Lucene/Solr has "docValues" and this URP in fact uses that technical mechanism.
  return true; // Doc not found
}

private DocFoundAndOldUserAndSolrVersions getOldUserVersionsFromStored(BytesRef indexedDocId)
IMO we should not support this. It's not unreasonable to demand that someone using this feature enable docValues on the hash field.
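Requiring docValues on the hash field, as suggested above, would mean the schema declares the field something like this (the field name and the choice to leave indexed/stored false are illustrative, not mandated by the patch):

```xml
<!-- Hypothetical schema.xml entry for the content-hash field.
     With docValues="true" the URP can read the stored hash without
     needing a stored/indexed copy of it. -->
<field name="content_hash" type="string"
       indexed="false" stored="false" docValues="true"/>
```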
}

public String computeDocHash(SolrInputDocument doc) {
  List<String> docIncludedFieldNames =
This results in a double navigation of the fields of a document in order to have the hash be insensitive to the order of fields added. But in practice, a given end-to-end pipeline is going to be deterministic. Hmm. I suppose it's no big deal. To improve, you could instead stream SolrInputField, sorted by name, and get the value from each without another lookup. You could also do the whole thing as a stream pipeline without needlessly producing a List. It'd end in a forEach to process the value.
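A rough sketch of the single-pass stream shape suggested above, using a plain Map in place of SolrInputDocument so it runs standalone (the SHA-256 choice, field names, and the excluded-field set are illustrative assumptions, not what the patch does verbatim):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ContentHashSketch {

    // Streams the fields sorted by name, so the hash is insensitive to the
    // order in which fields were added, and ends with the same kind of
    // Base64 encoding the URP applies to the signature bytes.
    public static String computeDocHash(Map<String, Object> doc, Set<String> excluded)
            throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        doc.entrySet().stream()
            .filter(e -> !excluded.contains(e.getKey()))
            .sorted(Map.Entry.comparingByKey())
            .forEach(e -> {
                md.update(e.getKey().getBytes(StandardCharsets.UTF_8));
                md.update(String.valueOf(e.getValue()).getBytes(StandardCharsets.UTF_8));
            });
        return Base64.getEncoder().encodeToString(md.digest());
    }

    public static void main(String[] args) throws Exception {
        Map<String, Object> a = new LinkedHashMap<>();
        a.put("title", "foo");
        a.put("body", "bar");
        Map<String, Object> b = new LinkedHashMap<>();
        b.put("body", "bar");
        b.put("title", "foo");
        // Field order differs, but the sorted stream yields the same hash.
        System.out.println(computeDocHash(a, Set.of("content_hash"))
            .equals(computeDocHash(b, Set.of("content_hash"))));
    }
}
```

Sorting inside the stream avoids building an intermediate List of field names and then re-looking each one up on the document.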
  }
}

private static class DocFoundAndOldUserAndSolrVersions {
Should this be a record? Or simply an Optional of the oldHash?
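For illustration, both suggestions can be combined: a record whose empty Optional replaces the separate "doc found" boolean (the names here are hypothetical, not the class's current fields):

```java
import java.util.Optional;

public class RecordSketch {

    // Hypothetical record form of DocFoundAndOldUserAndSolrVersions:
    // an empty Optional means the doc was not found, so the extra
    // "found" flag disappears.
    record OldDocVersions(Optional<String> oldHash) {
        boolean docFound() {
            return oldHash.isPresent();
        }
    }

    public static void main(String[] args) {
        OldDocVersions missing = new OldDocVersions(Optional.empty());
        OldDocVersions present = new OldDocVersions(Optional.of("abc123=="));
        System.out.println(missing.docFound());  // false
        System.out.println(present.docFound());  // true
    }
}
```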
      .encodeToString(signature); // Makes a base64 hash out of signature value
}

interface OldDocProvider {
realistically, what impls are we going to have?
} finally {
  rsp.addToLog("numAddsExisting", sameCount + differentCount + unknownCount);
  rsp.addToLog("numAddsExistingWithIdentical", sameCount);
  rsp.addToLog("numAddsExistingUnknown", unknownCount);
RE "numAdds" -- let's use a "contentHash" prefix so as to associate these with this URP, and not confuse it with DocBasedVersionConstraintsProcessor, which ought to include similar logs but doesn't yet do so.
IMO adding a new doc shouldn't add any log. I would expect one of these to incorporate the word "drop" or "discard" to reflect an action taken. I understand this URP can be configured to not take that action, albeit that would be in an exploratory / trial situation that ought to be temporary (I imagine).
  this.discardSameDocuments = discardSameDocuments;
}

private boolean validateHash(BytesRef indexedDocId, String newHash) throws IOException {
Perhaps we should align with a similar name in DocBasedVersionConstraintsProcessor, and rename to hashIsAcceptable ?
private static final char SEPARATOR = ','; // Separator for included/excluded fields
static final String CONTENT_HASH_ENABLED_PARAM = "contentHashEnabled";
private List<String> includeFields =
    Collections.singletonList("*"); // Included fields defaults to 'all'
Modern Java prefers List.of.
    Collections.singletonList("*"); // Included fields defaults to 'all'
private List<String> excludeFields = new ArrayList<>();
// No excluded field by default, yet hashFieldName is excluded by default
private String hashFieldName =
arguably we should insist this be provided rather than some auto-magic name
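Making the field name mandatory would look something like the following in a chain configuration (the factory class name and parameter names besides contentHashEnabled are illustrative, not taken from the patch):

```xml
<!-- Hypothetical solrconfig.xml chain with hashFieldName required
     rather than defaulted to an auto-magic name. -->
<updateRequestProcessorChain name="content-hash">
  <processor class="solr.ContentHashUpdateProcessorFactory">
    <bool name="contentHashEnabled">true</bool>
    <str name="hashFieldName">content_hash</str>
    <str name="includeFields">*</str>
    <str name="excludeFields">timestamp</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```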
private boolean validateHash(BytesRef indexedDocId, String newHash) throws IOException {
  assert null != indexedDocId;

  var docFoundAndOldUserVersions = getOldUserVersionsFromStored(indexedDocId);
This is one possible algorithm -- look up the hash from docValues of a doc that we got by first looking up the current doc by ID (a terms index lookup). But this is ~twice as expensive as doing a terms index lookup of the new hash, assuming we find no doc with that hash. If we do find one, the cost is similar, as we then need to fetch the ID via docValues to see if it's the same doc. But we're about to drop the doc then, which isn't the common path. Even "need" there is debatable; it'd be extremely unlikely to find a match by content hash to a different doc ID.
Basically -- the content hash field should have indexed=true XOR docValues=true. We don't need to support both (IMO) -- we can choose what's most efficient for this use-case.
I understand you may not have time to implement another algorithm.
Just an FYI: the postingsFormat (of the terms index) is configurable and there are a number of lovely implementations, including a bloom filter one, which would make a lot of sense for a "lookup" / exact-match-only use-case. Such a choice is up to the user to configure in the schema if they wish.
Oh wait... docs that have been indexed but are not yet visible via the IndexSearcher will not have their content hashes searchable. Albeit in such a situation, the "worst" thing that would happen is we index a doc that could have been dropped/discarded -- a missed optimization opportunity.
A terms lookup on the hash should be much faster than the RTG + double-docValues lookups though.
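The trade-off described above can be sketched with in-memory maps standing in for the terms index and docValues structures; every name below is illustrative, not a Solr/Lucene API:

```java
import java.util.HashMap;
import java.util.Map;

public class LookupStrategies {
    // Stand-ins for index structures (assumptions for illustration):
    static Map<String, Integer> idTerms = new HashMap<>();   // uniqueKey terms index
    static Map<String, Integer> hashTerms = new HashMap<>(); // hash-field terms index
    static Map<Integer, String> hashDV = new HashMap<>();    // hash docValues
    static Map<Integer, String> idDV = new HashMap<>();      // id docValues
    static int lookups; // counts index accesses, to compare the two algorithms

    // Current algorithm: find the doc by ID (terms lookup), then read its
    // stored hash from docValues -- always two index accesses.
    static boolean acceptViaIdLookup(String id, String newHash) {
        lookups++;
        Integer docNum = idTerms.get(id);
        if (docNum == null) return true; // doc not found: accept the update
        lookups++;
        return !newHash.equals(hashDV.get(docNum)); // identical hash -> drop
    }

    // Proposed algorithm: look up the new hash in its terms index. In the
    // common case (no match) that is a single access; on a match, confirm
    // via docValues that it is the same doc ID before dropping.
    static boolean acceptViaHashLookup(String id, String newHash) {
        lookups++;
        Integer docNum = hashTerms.get(newHash);
        if (docNum == null) return true; // no doc with this content hash
        lookups++;
        return !id.equals(idDV.get(docNum)); // same doc, same hash -> drop
    }

    public static void main(String[] args) {
        // One indexed doc: id "1" with hash "h1" at internal doc number 0.
        idTerms.put("1", 0);
        hashTerms.put("h1", 0);
        hashDV.put(0, "h1");
        idDV.put(0, "1");

        lookups = 0;
        System.out.println(acceptViaIdLookup("1", "h2"));   // changed content: accept
        System.out.println(lookups);                        // 2 accesses
        lookups = 0;
        System.out.println(acceptViaHashLookup("1", "h2")); // changed content: accept
        System.out.println(lookups);                        // 1 access
    }
}
```

The common path (new or changed content) is where the hash-first strategy halves the index accesses; only the rare identical-content path needs the second lookup.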
https://issues.apache.org/jira/browse/SOLR-18189
Description
Solr may receive updates containing documents identical to what is already indexed. Depending on the use case, these updates may be unnecessary, so discarding them would be beneficial.
Solution
Content hash detection aims to improve update efficiency by identifying and bypassing redundant document updates. When it detects that a document's content is unchanged, the URP chain skips the unnecessary write, reducing CPU and I/O overhead on nodes by avoiding the work done in downstream URPs.
Tests
Checklist
Please review the following and check all that apply:
- I have developed this patch against the main branch.
- I have run ./gradlew check.