SOLR-18187: Document enrichment with LLMs by nicolo-rinaldi · Pull Request #4259 · apache/solr

nicolo-rinaldi · 2026-04-01T16:11:45Z

https://issues.apache.org/jira/browse/SOLR-18187

Description

The goal of this PR is to integrate LLMs directly into Solr at index time, so that they can populate useful fields (e.g., categories or tags).

Solution

This PR adds LLM-based document enrichment capabilities to Solr's indexing pipeline via a new DocumentEnrichmentUpdateProcessorFactory in the language-models module. The processor lets users enrich documents at index time by calling an LLM (via https://github.com/langchain4j/langchain4j) with a configurable prompt built from one or more existing document fields (inputFields), and storing the model's response in an output field. The output field can be of different types (i.e., string, text, int, long, float, double, boolean, and date) and can be single-valued or multi-valued. The model's structured output is used to make the response match the output field type.
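As a rough illustration of how a prompt could be built from inputFields, here is a minimal sketch. It is not the PR's code: the class name, the method name, and the `{fieldName}` placeholder syntax are assumptions made for this example, and the sketch simply throws when a value is missing.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the code from this PR. */
final class PromptTemplateSketch {

  // Assumed placeholder syntax: {fieldName}. The PR may use a different one.
  private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

  /** Fills every placeholder in the template with the document's value for that field. */
  static String buildPrompt(String template, List<String> inputFields, Map<String, Object> doc) {
    Matcher matcher = PLACEHOLDER.matcher(template);
    StringBuilder prompt = new StringBuilder();
    while (matcher.find()) {
      String field = matcher.group(1);
      if (!inputFields.contains(field)) {
        throw new IllegalArgumentException("Placeholder without a matching input field: " + field);
      }
      Object value = doc.get(field);
      if (value == null) {
        throw new IllegalArgumentException("Document has no value for input field: " + field);
      }
      // quoteReplacement keeps '$' and '\' in field values from being read as group references.
      matcher.appendReplacement(prompt, Matcher.quoteReplacement(value.toString()));
    }
    matcher.appendTail(prompt);
    return prompt.toString();
  }
}
```

Replacing in a single pass means that text inside a substituted field value is never itself treated as a placeholder.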

The implementation is modeled on the text-to-vector feature in the same module, to stay consistent with the conventions already used in the language-models module.

Note: this PR was developed with assistance from Claude Code (Anthropic).

Tests

Tests covering configuration validation (missing required params, conflicting params, invalid field types, placeholder mismatches) and processor initialization.

Tests covering single-valued and multi-valued output fields of all supported types, multi-input-field prompts, prompt file loading, error handling (model exceptions, ambiguous/malformed JSON responses, unsupported model types), and skipNullOrMissingFieldValues behaviour. All the supported models have been tested.
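To give a sense of the type handling these tests exercise, the sketch below converts a raw model answer into a value for each supported output field type. It is a simplification under stated assumptions: the PR relies on structured output rather than plain string parsing, and the class name, method name, and type labels used here are hypothetical.

```java
import java.time.Instant;
import java.util.Date;

/** Illustrative sketch only; not the code from this PR. */
final class OutputCoercionSketch {

  /** Converts a raw string answer to the Java type matching the (assumed) field type label. */
  static Object coerce(String raw, String fieldType) {
    String value = raw.trim();
    switch (fieldType) {
      case "string":
      case "text":
        return value;
      case "int":
        return Integer.valueOf(value);
      case "long":
        return Long.valueOf(value);
      case "float":
        return Float.valueOf(value);
      case "double":
        return Double.valueOf(value);
      case "boolean":
        // Boolean.parseBoolean would silently map any other text to false, so check explicitly.
        if (value.equalsIgnoreCase("true")) return Boolean.TRUE;
        if (value.equalsIgnoreCase("false")) return Boolean.FALSE;
        throw new IllegalArgumentException("Not a boolean: " + value);
      case "date":
        // Expects an ISO-8601 instant such as 2026-04-01T16:11:45Z.
        return Date.from(Instant.parse(value));
      default:
        throw new IllegalArgumentException("Unsupported field type: " + fieldType);
    }
  }
}
```

A malformed answer surfaces as an exception (NumberFormatException, DateTimeParseException, or IllegalArgumentException) instead of being indexed with a wrong value.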

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide.
I have added a changelog entry for my change.

…tUpdateProcessorFactory

- multivalued outputField - outputField different from Str/Text, with numeric, boolean and date

…or documentation

…t with LLMs' module

…rFactory

aruggero

Left some comments

aruggero · 2026-04-03T08:14:56Z

...e-models/src/java/org/apache/solr/languagemodels/documentenrichment/model/SolrChatModel.java

+import org.slf4j.LoggerFactory;
+
+/**
+ * This object wraps a {@link ChatModel} to produce the content of new fields from another.


dev.langchain4j.model.chat.ChatModel

This object wraps a {@link dev.langchain4j.model.chat.ChatModel} to generate the contents of a field based on other fields specified as input.

(just one field is generated... right?)

aruggero · 2026-04-03T08:19:07Z

...e-models/src/java/org/apache/solr/languagemodels/documentenrichment/model/SolrChatModel.java

+  private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
+  private static final long BASE_RAM_BYTES =
+      RamUsageEstimator.shallowSizeOfInstance(SolrChatModel.class);
+  // timeout is type Duration


I understand what you mean because we discussed this before, but it is not clear to others, so I would not specify it here.
A reader would simply see TIMEOUT_PARAM described as a Duration while it is declared here as a String.

aruggero · 2026-04-03T08:19:16Z

...e-models/src/java/org/apache/solr/languagemodels/documentenrichment/model/SolrChatModel.java

+  // timeout is type Duration
+  private static final String TIMEOUT_PARAM = "timeout";
+
+  // the followings are Integer type


aruggero · 2026-04-03T08:39:50Z

...e-models/src/java/org/apache/solr/languagemodels/documentenrichment/model/SolrChatModel.java

+      var builder = modelClass.getMethod("builder").invoke(null);
+      if (params != null) {
+        /*
+         * This block of code has the responsibility of instantiate a {@link


/*
 * This block of code has the responsibility of instantiate a {@link
 * dev.langchain4j.model.chat.ChatModel} using the params provided. Classes have
 * params of the specific implementation of {@link
 * dev.langchain4j.model.chat.ChatModel}, which is not known beforehand. So we benefit of
 * the design choice in langchain4j that each subclass implementing {@link
 * dev.langchain4j.model.chat.ChatModel} uses setters with the same name of the
 * param.
 */
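The reflective builder pattern this comment describes can be illustrated with a small, self-contained sketch. FakeChatModel and buildFromParams are hypothetical names invented for this example, not langchain4j or Solr types. The sketch also picks each setter by the runtime class of the param value, which is a simplification: the PR's code appears to handle some params (such as timeout) by type.

```java
import java.lang.reflect.Method;
import java.util.Map;

/** Illustrative sketch only; not the code from this PR. */
final class ReflectiveBuilderSketch {

  /** Hypothetical stand-in for a model class that exposes a static builder(). */
  public static final class FakeChatModel {
    public final String modelName;
    public final Integer maxRetries;

    private FakeChatModel(Builder builder) {
      this.modelName = builder.modelName;
      this.maxRetries = builder.maxRetries;
    }

    public static Builder builder() {
      return new Builder();
    }

    public static final class Builder {
      private String modelName;
      private Integer maxRetries;

      public Builder modelName(String value) { this.modelName = value; return this; }
      public Builder maxRetries(Integer value) { this.maxRetries = value; return this; }
      public FakeChatModel build() { return new FakeChatModel(this); }
    }
  }

  /** Calls builder(), then one setter per param (setter name == param name), then build(). */
  static Object buildFromParams(Class<?> modelClass, Map<String, Object> params) {
    try {
      Object builder = modelClass.getMethod("builder").invoke(null);
      for (Map.Entry<String, Object> param : params.entrySet()) {
        // Simplification: the setter signature is looked up from the value's runtime class.
        Method setter = builder.getClass().getMethod(param.getKey(), param.getValue().getClass());
        setter.invoke(builder, param.getValue());
      }
      return builder.getClass().getMethod("build").invoke(builder);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Could not build " + modelClass.getName(), e);
    }
  }
}
```

Because the setters are resolved by name at runtime, the calling code does not need to know the concrete model class in advance, at the cost of configuration errors surfacing only when the model is built.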

aruggero · 2026-04-03T08:43:37Z

...e-models/src/java/org/apache/solr/languagemodels/documentenrichment/model/SolrChatModel.java

+          }
+        }
+      }
+      textToTextModel = (ChatModel) builder.getClass().getMethod("build").invoke(builder);


I would probably call it chatModel, to keep the naming consistent.

aruggero · 2026-04-03T09:57:40Z