Skip to content

SOLR-18187: Document enrichment with LLMs#4259

Draft
nicolo-rinaldi wants to merge 11 commits intoapache:mainfrom
SeaseLtd:llm-document-enrichment
Draft

SOLR-18187: Document enrichment with LLMs#4259
nicolo-rinaldi wants to merge 11 commits intoapache:mainfrom
SeaseLtd:llm-document-enrichment

Conversation

@nicolo-rinaldi
Copy link
Copy Markdown
Contributor

https://issues.apache.org/jira/browse/SOLR-18187

Description

The goal of this PR is to add a way to integrate LLMs directly into Solr at index time to fill fields that might be useful (e.g., categories, tags, etc.)

Solution

This PR adds LLM-based document enrichment capabilities to Solr's indexing pipeline via a new DocumentEnrichmentUpdateProcessorFactory in the language-models module. The processor allows users to enrich documents at index time by calling an LLM (via https://github.com/langchain4j/langchain4j) with a configurable prompt built from one or more existing document fields (inputFields), and storing the model's response into an output field. The output field can be of different types (i.e., string, text, int, long, float, double, boolean, and date) and can be single-valued or multi-valued. The structured output has been used to adapt to the output field type.

The implementation has taken inspiration from the text-to-vector feature in the same module. This has been done to keep the implementation consistent with conventions already in the language-models module.

Note: this PR was developed with assistance from Claude Code (Anthropic).

Tests

Tests covering configuration validation (missing required params, conflicting params, invalid field types, placeholder mismatches), and processor initialization.

Tests covering single-valued and multi-valued output fields of all supported types, multi-input-field prompts, prompt file loading, error handling (model exceptions, ambiguous/malformed JSON responses, unsupported model types), and skipNullOrMissingFieldValues behaviour. All the supported models have been tested.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide
  • I have added a changelog entry for my change

@github-actions github-actions bot added documentation Improvements or additions to documentation dependencies Dependency upgrades tool:build tests labels Apr 1, 2026
Copy link
Copy Markdown
Contributor

@aruggero aruggero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments

import org.slf4j.LoggerFactory;

/**
* This object wraps a {@link ChatModel} to produce the content of new fields from another.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dev.langchain4j.model.chat.ChatModel

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This object wraps a {@link dev.langchain4j.model.chat.ChatModel} to generate the contents of a field based on other fields specified as input.

(just one field is generated... right?)

private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
private static final long BASE_RAM_BYTES =
RamUsageEstimator.shallowSizeOfInstance(SolrChatModel.class);
// timeout is type Duration
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand what you mean since we discussed this before, but it's not clear to others. I would not specify this here.
A person would just read that the variable TIMEOUT_PARAM is of type Duration while here it is declared as String.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

// timeout is type Duration
private static final String TIMEOUT_PARAM = "timeout";

// the followings are Integer type
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as before

var builder = modelClass.getMethod("builder").invoke(null);
if (params != null) {
/*
* This block of code has the responsibility of instantiate a {@link
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    /*
     * This block of code has the responsibility of instantiate a {@link
     * dev.langchain4j.model.chat.ChatModel} using the params provided. Classes have
     * params of the specific implementation of {@link
     * dev.langchain4j.model.chat.ChatModel}, which is not known beforehand. So we benefit of
     * the design choice in langchain4j that each subclass implementing {@link
     * dev.langchain4j.model.chat.ChatModel} uses setters with the same name of the
     * param.
     */

}
}
}
textToTextModel = (ChatModel) builder.getClass().getMethod("build").invoke(builder);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably I would call it chatModel to maintain the same nomenclature.

public void init(final NamedList<?> args) {
// removeConfigArgs handles both multiple <str name="inputField"> and <arr name="inputField">
// and must be called before toSolrParams() since it mutates args in place
Collection<String> fieldNames = args.removeConfigArgs(INPUT_FIELD_PARAM);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ask Alessandro also if needed...
In case we decide to keep this, I would probably add an example or say in the comment above that this is also supported..

.build();
}

private static JsonSchemaElement toJsonSchemaElement(FieldType fieldType) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that for a first contribution this support is enough... to discuss with Alessandro if we are missing some interesting field types that should be added at this step.
https://solr.apache.org/guide/solr/latest/indexing-guide/field-types-included-with-solr.html#recommended-field-types

promptPlaceholders.add(m.group(1));
}

Set<String> missingInPrompt = new LinkedHashSet<>(fieldNames);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check var names

"prompt is missing placeholders for inputField(s): " + missingInPrompt);
}

Set<String> unknownInPrompt = new LinkedHashSet<>(promptPlaceholders);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check var names

}

Set<String> unknownInPrompt = new LinkedHashSet<>(promptPlaceholders);
unknownInPrompt.removeAll(new HashSet<>(fieldNames));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashSet or LinkedHashSet?

@@ -0,0 +1,8 @@
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to avoid creating all these different dummy model files?
like setting the response in the tests and initiating the model just one time

<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="*" dest="_text_"/>
<field name="vectorised" type="boolean" indexed="true" stored="false" docValues="true" default="false"/>
<field name="enriched" type="boolean" indexed="true" stored="false" docValues="true" default="false"/>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what is this used?

@@ -0,0 +1,62 @@
<?xml version="1.0" ?>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we use the other for everything?

<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<updateRequestProcessorChain name="failingDocumentEnrichmentMultiField">
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Difference with failingDocumentEnrichment?

import org.junit.BeforeClass;
import org.junit.Test;

public class DocumentEnrichmentUpdateProcessorFactoryTest extends TestLanguageModelBase {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test for a not-supported output field type (dense vector is present, but the other fields of the other exception are not)

}

@Test
public void processAtomicUpdate_shouldReplaceExistingEnrichedFieldNotAppend() throws Exception {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot see how this test can check that the value is replaced and this is not the value we had before

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this wasn't working, we should see "enriched content" two times?
What about if this value is the old one and it was not recomputed at all?

// --- multi-field tests ---

@Test
public void processAdd_multipleInputFields_allPresent_shouldEnrichDocument() throws Exception {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this enough to test that both input fields were considered? I would have the same response, even if only one of the two was used.

// --- typed single-valued output field tests ---

@Test
public void processAdd_singleLongOutputField_shouldPopulateValue() throws Exception {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some doubts on the following tests... the assert itself is not able to differentiate the output type.. double and float are the same...

// --- multivalued output field / scalar response test ---

@Test
public void processAdd_multivaluedOutputField_scalarLlmResponse_shouldStoreSingleValue()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

processAdd_multivaluedOutputField_singleValuedLlmResponse_shouldStoreSingleValue

public void processAdd_multivaluedOutputField_scalarLlmResponse_shouldStoreSingleValue()
throws Exception {
// Model returns {"value": "a single string"} for a multivalued output field.
// The scalar falls through the List<?> check and is stored as a single-element value.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmmm a bit confusing... the field is anyway multivalued.. even if it contains just one value

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Dependency upgrades documentation Improvements or additions to documentation tests tool:build

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants