
SOLR-18189: Improve tracking of duplicate Solr docs with content hash #4263

Open

fhuaulme wants to merge 3 commits into apache:main from fhuaulme:jira/SOLR-18189
Conversation


@fhuaulme fhuaulme commented Apr 3, 2026

https://issues.apache.org/jira/browse/SOLR-18189

Description

Solr may receive near-identical documents as updates. Depending on the use case, these updates may be unnecessary, so discarding them would be beneficial.

Solution

Content hash detection aims to improve update efficiency by identifying and bypassing redundant document updates. When it detects that a document's content is unchanged, the URP chain skips the write and the work done in downstream URPs, reducing CPU and I/O overhead on the nodes.
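For context, a hypothetical solrconfig.xml wiring of such a URP might look like this (the factory class name comes from the new tests listed below; the parameter names and values are assumptions inferred from field names visible in the diff excerpts, not the PR's final configuration):

```xml
<!-- Hypothetical wiring; parameter names are assumptions based on
     field names seen in this review, not the PR's documented config. -->
<updateRequestProcessorChain name="content-hash-chain">
  <processor class="org.apache.solr.update.processor.ContentHashVersionProcessorFactory">
    <str name="hashFieldName">content_hash_</str>
    <bool name="discardSameDocuments">true</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```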

Tests

  • Unit tests added for URP and URP factory (namely org.apache.solr.update.processor.ContentHashVersionProcessorTest and org.apache.solr.update.processor.ContentHashVersionProcessorFactoryTest).

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide.
  • I have added a changelog entry for my change.


@dsmiley dsmiley left a comment


Partial review...

import java.util.function.Predicate;

/**
* An implementation of {@link UpdateRequestProcessor} which computes a hash of selected doc values, and uses this hash

"doc values" here could be confusing... as Lucene/Solr has "docValues" and this URP in fact uses that technical mechanism.

return true; // Doc not found
}

private DocFoundAndOldUserAndSolrVersions getOldUserVersionsFromStored(BytesRef indexedDocId)

IMO we should not support this. It's not unreasonable to demand that someone using this feature enable docValues on the hash field.

}

public String computeDocHash(SolrInputDocument doc) {
List<String> docIncludedFieldNames =

This results in a double navigation of the document's fields in order to make the hash insensitive to the order in which fields were added. But in practice, a given end-to-end pipeline is going to be deterministic. Hmm. I suppose it's no big deal. To improve, you could instead stream the SolrInputFields, sorting by name, and get the value from them without another lookup. You could also do the whole thing as a stream pipeline without needlessly producing a List. It'd end in a forEach to process the value.
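A minimal sketch of the single-pass approach suggested here, using a plain `Map<String, Object>` to stand in for `SolrInputDocument` (the real code would stream `SolrInputField`s instead; the predicate parameter and digest details are illustrative assumptions, not the PR's actual signature):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Map;
import java.util.function.Predicate;

class OrderInsensitiveHash {
    // Hash the included fields in one pass: sort entries by field name so the
    // hash is insensitive to insertion order, then feed name/value pairs
    // straight into the digest -- no intermediate List is materialized.
    static String computeDocHash(Map<String, Object> doc, Predicate<String> includeField) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            doc.entrySet().stream()
                .filter(e -> includeField.test(e.getKey()))
                .sorted(Map.Entry.comparingByKey()) // order-insensitive
                .forEach(e -> {
                    md.update(e.getKey().getBytes(StandardCharsets.UTF_8));
                    md.update(String.valueOf(e.getValue()).getBytes(StandardCharsets.UTF_8));
                });
            return Base64.getEncoder().encodeToString(md.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // SHA-256 is always available
        }
    }
}
```

The `sorted` step is what buys order-insensitivity; everything else happens as the stream flows, ending in the `forEach` that updates the digest.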

}
}

private static class DocFoundAndOldUserAndSolrVersions {

should this be a record? Or simply an Optional of the oldHash?
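A sketch of the record alternative (the record's name is taken from the diff above; the component names and the `Optional<String>` shape are assumptions — whether the found/not-found flag can collapse into the Optional depends on what callers actually need):

```java
import java.util.Optional;

// Record variant of the private static holder class: immutable, with
// equals/hashCode/toString generated automatically by the compiler.
record DocFoundAndOldUserAndSolrVersions(boolean docFound, Optional<String> oldHash) {
    // Convenience factory for the "doc not found" case.
    static DocFoundAndOldUserAndSolrVersions notFound() {
        return new DocFoundAndOldUserAndSolrVersions(false, Optional.empty());
    }
}
```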

.encodeToString(signature); // Makes a base64 hash out of signature value
}

interface OldDocProvider {

realistically, what impls are we going to have?

} finally {
rsp.addToLog("numAddsExisting", sameCount + differentCount + unknownCount);
rsp.addToLog("numAddsExistingWithIdentical", sameCount);
rsp.addToLog("numAddsExistingUnknown", unknownCount);

RE "numAdds" -- let's use a "contentHash" prefix so as to associate these with this URP, and not confuse them with DocBasedVersionConstraintsProcessor, which ought to include similar logs but doesn't yet do so.

IMO adding a new doc shouldn't add any log. I would expect one of these to incorporate the word "drop" or "discard" to reflect an action taken. I understand this URP can be configured to not take that action, albeit that would be in an exploratory / trial situation that ought to be temporary (I imagine).

this.discardSameDocuments = discardSameDocuments;
}

private boolean validateHash(BytesRef indexedDocId, String newHash) throws IOException {

Perhaps we should align with a similar name in DocBasedVersionConstraintsProcessor, and rename to hashIsAcceptable ?

private static final char SEPARATOR = ','; // Separator for included/excluded fields
static final String CONTENT_HASH_ENABLED_PARAM = "contentHashEnabled";
private List<String> includeFields =
Collections.singletonList("*"); // Included fields defaults to 'all'

modern Java prefers List.of
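For example, both forms produce an immutable single-element list, but `List.of` (Java 9+) is the more idiomatic modern spelling (and additionally rejects null elements):

```java
import java.util.Collections;
import java.util.List;

class IncludeDefaults {
    static final List<String> OLD_STYLE = Collections.singletonList("*");
    static final List<String> NEW_STYLE = List.of("*"); // preferred since Java 9
}
```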

Collections.singletonList("*"); // Included fields defaults to 'all'
private List<String> excludeFields = new ArrayList<>();
// No excluded field by default, yet hashFieldName is excluded by default
private String hashFieldName =

arguably we should insist this be provided rather than some auto-magic name

private boolean validateHash(BytesRef indexedDocId, String newHash) throws IOException {
assert null != indexedDocId;

var docFoundAndOldUserVersions = getOldUserVersionsFromStored(indexedDocId);

This is one possible algorithm -- lookup hash from docValues of a doc that we got by first looking up the current doc by ID (a terms index lookup). But this is ~twice as expensive as doing a terms index lookup of the new hash, assuming we find no doc with that hash. If we do find one, then it's similar, as we then need to fetch the ID via docValues to see if it's the same doc. But we're about to drop the doc then, which isn't the common path. Even "need" there is debatable; it'd be extremely unlikely to find a match by content hash to a different doc ID.

Basically: should the content hash field have indexed=true XOR docValues=true? We don't need to support both (IMO) -- we can choose what's most efficient for this use case.
I understand you may not have time to implement another algorithm.

Just an FYI: the postingsFormat (of the terms index) is configurable and there are a number of lovely implementations, including a bloom filter one, which would make a lot of sense for a "lookup" / exact-match-only use case. Such a choice is up to the user to configure in the schema if they wish.
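The alternative flow can be sketched with a plain `Map` standing in for the terms index (purely illustrative -- the real implementation would do a terms lookup on the hash field, e.g. via `TermsEnum.seekExact`, and then a docValues fetch of the ID on the rare-path match):

```java
import java.util.Map;

class HashFirstLookup {
    // hashToId plays the role of the terms index on the content-hash field.
    // Returns true if the update should be dropped as a duplicate.
    static boolean isDuplicate(Map<String, String> hashToId, String newHash, String docId) {
        String existingId = hashToId.get(newHash); // one "terms lookup" on the new hash
        if (existingId == null) {
            return false; // common path: no existing doc has this content hash
        }
        // Rare path: some doc already carries this hash; confirm it is the same
        // doc ID (a hash match against a different doc would be extremely unlikely).
        return existingId.equals(docId);
    }
}
```

The point of the sketch is the cost profile: the common path is a single lookup on the new hash, versus the current ID-lookup-then-docValues-fetch on every update.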


Oh wait... docs that have been added but are not yet visible via the IndexSearcher will not have their content hashes searchable. Albeit in such a situation, the "worst" thing that would happen is we index a doc that could have been dropped/discarded -- a missed optimization opportunity.

A terms lookup on the hash should be much faster than the RTG + double-docValues lookups though.
