Status: Alpha
Version: 1.0
Last Validated: 2026-03-11 (b2342851)
Module Path: `src/training/`, `include/training/`
- ARCHITECTURE.md — System architecture, component diagram, data flow
- ROADMAP.md — Feature roadmap and implementation phases
- FUTURE_ENHANCEMENTS.md — Planned features and design constraints
- German Docs — German-language overview
The Training module provides tools for building and maintaining domain-specific AI fine-tuning datasets and adapters within ThemisDB. It is designed around the legal domain (with German-language legal text as the primary target) and consists of three components: an automatic labeler that extracts structured training samples from legal documents using NLP modality detection, an incremental LoRA trainer that fine-tunes language model adapters with checkpoint/resume support, and a knowledge graph enricher that annotates training samples with graph-traversal context (related provisions, case law, and semantically similar documents).
| Interface / File | Role |
|---|---|
| `auto_labeler.cpp` | `LegalAutoLabeler`: structured training sample extraction from legal documents |
| `incremental_lora_trainer.cpp` | Incremental LoRA adapter fine-tuning with checkpoint/resume |
| `knowledge_graph_enricher.cpp` | AQL-based context enrichment via graph traversal |
| `lora_checkpoint_manager.cpp` | LoRA checkpoint management with SHA-256 integrity validation and rotation |
| `modality_parser.cpp` | `ModalityDetector`, `TextClauseExtractor`, `TableExtractor`, `CitationExtractor`, `OCRExtractor` |
| `provenance_tracker.cpp` | Training sample provenance and lineage tracking |
| `lora_data_selection.cpp` | Training data selection, deduplication, and balancing |
| `training_pipeline.cpp` | End-to-end training pipeline orchestrator (`ConfidenceCalibrator`, `ProvenanceTracker` integration) |
In Scope:
- Automated training data labeling from legal documents via NLP modality extraction
- Incremental LoRA adapter fine-tuning with epoch/batch training loop structure
- Checkpoint saving and resume from checkpoint
- Adapter version management (deploy, rollback, list versions)
- Knowledge graph context enrichment via AQL graph traversal queries
- Vector similarity search for finding semantically similar documents
- Confidence-threshold filtering and human-review flagging for low-confidence samples
Out of Scope:
- The NLP modality extractor itself (provided by `analytics/nlp_text_analyzer`)
- Raw LLM base model management or quantization
- Training data storage schema definition (handled by the storage module)
- Serving or inference of trained adapters (handled by the LLM integration layer)
- Distributed/multi-GPU training coordination
Location: `auto_labeler.cpp`

Automatically generates labeled training samples from legal documents stored in ThemisDB. Uses `analytics::NlpTextAnalyzer` to extract deontic modalities (must/shall/may/etc.) from text, then converts each detected modality into an input/output training pair.
Features:
- `labelAll()` — process every document in the configured source collection
- `labelDocument(id)` — label a single document by ID, returns `std::vector<TrainingSample>`
- `labelQuery(aql)` — label documents matching an arbitrary AQL query
- `getLowConfidenceSamples(threshold)` — retrieve samples below the confidence threshold for human review
- `updateSampleConfidence(id, score, reviewer)` — record human review decisions
- Configurable `min_confidence` threshold and `flag_low_confidence` for review queue
- Supports German (`de`) and other `language_code` configurations
- Pimpl pattern for ABI stability
Training Sample Structure:
Each `TrainingSample` contains an `input` (analysis instruction + source text), an `output` (category, deontic logic label, and interpretation), a confidence score from the NLP analyzer, and a `source_id` linking back to the originating document.
Location: `incremental_lora_trainer.cpp`
Manages the full lifecycle of LoRA (Low-Rank Adaptation) adapter training: initial training, incremental fine-tuning on new data, checkpoint save/resume, evaluation, deployment, and rollback.
Features:
- `train(mode, callback)` — run training in `INITIAL` or `INCREMENTAL` mode
- `resumeFromCheckpoint(path, callback)` — load checkpoint and continue training
- `evaluate(adapter_version)` — compute validation loss and accuracy for a stored adapter
- `deployVersion(version, traffic_split)` — route a configurable fraction of traffic to a new adapter version
- `rollbackVersion(target_version)` — revert to a previously deployed adapter
- `listVersions()` — enumerate available adapter versions
- `setHyperparameters(rank, alpha, lr)` — configure LoRA rank, scaling factor, and learning rate
- `setCheckpointing(enabled, steps)` — enable periodic checkpoint saves during training
- Training callback `(epoch, step, loss, message)` for progress monitoring
- Pimpl pattern for ABI stability
Training Configuration (`IncrementalTrainingConfig`):
`adapter_version`, `num_epochs`, `batch_size`, and `source_collection` name for fetching training samples.
Location: `knowledge_graph_enricher.cpp`
Enriches existing TrainingSample records with graph-derived context by executing AQL traversal queries against ThemisDB's graph store. Adds related legal provisions, case law citations, and semantically similar documents to each sample's context.
Features:
- `enrichAll(callback)` — enrich every sample in the target collection
- `enrichSample(sample_id)` — enrich a single sample, returns a `GraphContext`
- `enrichQuery(aql, callback)` — enrich samples matching an AQL query
- `findRelatedProvisions(doc_id, max)` — graph outbound traversal to find referenced provisions
- `findRelatedCaseLaw(doc_id, max)` — graph traversal filtering for `document_type == "case_law"`
- `findSimilarDocuments(doc_id, max)` — cosine similarity search over document embeddings
- `setCustomQuery(name, aql)` — register named custom AQL queries for domain-specific traversals
- Configurable flags: `include_provisions`, `include_case_law`, `include_similar_docs`, `max_related_items`
- Pimpl pattern for ABI stability
`GraphContext` Structure:
`related_provisions` (list of document keys), `case_law` (list of document keys), `similar_documents` (list of document keys), and `context_summary` (human-readable summary string).
LegalAutoLabeler
│
├─ analytics::NlpTextAnalyzer ← extracts deontic modalities from text
├─ Database source collection ← legal documents (AQL query)
└─► TrainingSample[] ── written to training samples collection
KnowledgeGraphEnricher
│
├─ AQL graph traversal ← related provisions, case law
├─ Vector similarity search ← similar documents (embedding index)
└─► GraphContext ── merged into TrainingSample.graph_context
IncrementalLoRATrainer
│
├─ Database training collection ← reads TrainingSample[] (with graph context)
├─ Training loop (epochs/batches)
├─ Checkpoint storage
└─► LoRA adapter versions ── stored and versioned for deployment
- `analytics/nlp_text_analyzer.h` — legal modality extraction (NLP)
- `training/auto_labeler.h` — `LegalAutoLabeler`, `TrainingSample`, `AutoLabelConfig`
- `training/incremental_lora_trainer.h` — `IncrementalLoRATrainer`, `IncrementalTrainingConfig`, `TrainingResult`
- `training/knowledge_graph_enricher.h` — `KnowledgeGraphEnricher`, `EnrichmentConfig`, `GraphContext`
- `<chrono>` — elapsed time tracking in training/enrichment stats
- `<stdexcept>` — exception propagation
- AQL query executor (`QueryEngine`/`executeAql()` from `query/aql_runner.h`) — document ID fetch in `labelAll()`, document text fetch in `labelDocument()`, user-supplied queries in `labelQuery()`. Pass a `QueryEngine*` to `LegalAutoLabeler` at construction time. Pass `nullptr` to run in offline/test mode (no DB access).
- `VectorIndexManager` (from `index/vector_index.h`) — cosine-similarity search for `findSimilarDocuments()`. Wire via `KnowledgeGraphEnricher::setVectorIndex(&vim)`. Pass `nullptr` (default) to run without a vector index.
```cpp
#include "training/auto_labeler.h"
#include "training/incremental_lora_trainer.h"
#include "training/knowledge_graph_enricher.h"
#include "query/query_engine.h"
#include "query/aql_runner.h"

using namespace themis::training;

// --- 0. Obtain a QueryEngine (wired to your RocksDB instance) ---
// RocksDBWrapper db(...); db.open();
// SecondaryIndexManager idx(db);
// QueryEngine engine(db, idx);
//
// Pass &engine to components that need DB access, or nullptr for offline mode.

// --- 1. Auto-label legal documents (DB-connected mode) ---
AutoLabelConfig label_config;
label_config.source_collection = "legal_documents";
label_config.target_collection = "legal_training_samples";
label_config.language_code = "de";
label_config.min_confidence = 0.6f;
label_config.flag_low_confidence = true;

// Pass &engine to enable AQL-based document fetch.
// Omit the third argument (or pass nullptr) for offline/test mode.
LegalAutoLabeler labeler(label_config, "rocksdb://./data", &engine);

LabelingStats stats = labeler.labelAll(
    [](size_t done, size_t total, const std::string& msg) {
        // progress
    }
);
// stats.documents_processed — number of documents fetched and labeled
// stats.samples_created — total training samples generated

// Label only documents matching an arbitrary AQL query:
auto custom_stats = labeler.labelQuery(
    "FOR doc IN legal_documents FILTER doc.jurisdiction == 'DE' RETURN doc._key");

// --- 2. Enrich samples with graph context ---
EnrichmentConfig enrich_config;
enrich_config.include_provisions = true;
enrich_config.include_case_law = true;
enrich_config.include_similar_docs = true;
enrich_config.max_related_items = 5;
enrich_config.similarity_threshold = 0.75f;

// RocksDBWrapper db(...); db.open();
// VectorIndexManager vim(db);
// vim.init("documents", /*dim=*/1536, VectorIndexManager::Metric::COSINE);
// (populate vim with document embeddings, then:)
KnowledgeGraphEnricher enricher(enrich_config, "rocksdb://./data");
enricher.setVectorIndex(&vim);  // wire real cosine-similarity search;
                                // omit (or pass nullptr) for offline/test mode
enricher.enrichAll();

// Enrich a single sample
GraphContext ctx = enricher.enrichSample("sample_001");
// ctx.related_provisions, ctx.case_law, ctx.similar_documents

// Direct vector-similarity query (returns pairs of {doc_id, cosine_score ∈ [0,1]})
auto similar = enricher.findSimilarDocuments("doc_001", /*max_results=*/5);
// similar[0] = {"doc_042", 0.97f}, similar[1] = {"doc_017", 0.84f}, ...

// --- 3. Train a LoRA adapter ---
IncrementalTrainingConfig train_config;
train_config.adapter_version = "v1.0";
train_config.num_epochs = 3;
train_config.batch_size = 16;

IncrementalLoRATrainer trainer(train_config, "rocksdb://./data");
trainer.setHyperparameters(/*rank=*/8, /*alpha=*/16.0f, /*lr=*/1e-4f);
trainer.setCheckpointing(true, /*checkpoint_steps=*/100);

TrainingResult result = trainer.train(
    TrainingMode::INITIAL,
    [](size_t epoch, size_t step, double loss, const std::string& msg) {
        // progress
    }
);

if (result.success) {
    // Deploy with 10% canary traffic
    trainer.deployVersion(result.adapter_id, 0.1f);
    // Evaluate
    TrainingResult eval = trainer.evaluate(result.adapter_id);
    // eval.accuracy, eval.validation_loss
}

// --- 4. Rollback if needed ---
trainer.rollbackVersion("v0.9");
```

Follow these steps to wire `LegalAutoLabeler` to a live ThemisDB instance:
1. Open a RocksDB instance and create the required secondary index manager:

   ```cpp
   RocksDBWrapper::Config db_cfg;
   db_cfg.db_path = "data/themis";
   RocksDBWrapper db(db_cfg);
   db.open();
   SecondaryIndexManager idx(db);
   QueryEngine engine(db, idx);
   ```

2. Construct `LegalAutoLabeler` with the engine pointer:

   ```cpp
   AutoLabelConfig cfg;
   cfg.source_collection = "legal_documents";        // collection to read from
   cfg.target_collection = "legal_training_samples";
   LegalAutoLabeler labeler(cfg, "", &engine);
   ```

3. Populate the source collection. Each document must have a non-null, non-empty `text` field:

   ```cpp
   BaseEntity doc("doc_001");
   doc.setField("text", std::string("Die Behörde muss die Genehmigung erteilen ..."));
   idx.put("legal_documents", doc);
   ```

4. Run labeling — `labelAll()` fetches all document IDs via AQL, then fetches each document's `text` field via a secondary AQL query and produces `TrainingSample` records:

   ```cpp
   auto stats = labeler.labelAll();
   // stats.documents_processed == number of documents found in DB
   // stats.samples_created     == number of training samples generated
   ```

5. Offline / test mode — pass `nullptr` (or omit the third argument) to skip all DB access; the labeler will process zero documents from the collection while still allowing `labelDocument(id)` calls:

   ```cpp
   LegalAutoLabeler offline_labeler(cfg, "");  // engine defaults to nullptr
   ```
Current Status: Alpha
The module provides production-ready AQL-executor integration for document labeling; the remaining components are scaffolded:
- `LegalAutoLabeler`:
  - ✅ `labelAll()` — fetches document IDs from `source_collection` via AQL (`FETCH_ALL_DOCUMENTS`), then fetches each document's `text` field via `FETCH_DOCUMENT_BY_ID` (wired in v1.6.0)
  - ✅ `labelQuery(aql)` — executes the caller-supplied AQL query to obtain document IDs, then labels each document as above (wired in v1.6.0)
  - ✅ `labelDocument(id)` — uses `FETCH_DOCUMENT_BY_ID` AQL when the engine is wired; falls back to hardcoded text in offline/test mode
  - ⏳ DB sample writes (`persistSampleBatch`) — placeholder pending batch-insert wiring (Phase 2)
- `IncrementalLoRATrainer`: actual model weight manipulation, optimizer state, and checkpoint serialization are simulated with placeholder values; integrate with your chosen ML framework (e.g., llama.cpp LoRA APIs, libtorch)
- `KnowledgeGraphEnricher`: AQL graph traversal queries and vector similarity search are defined as commented AQL templates but return empty lists until the query executor is wired in
- Known limitations:
  - Training data must be in ThemisDB; no external dataset connector is provided (use the Ingestion module first)
  - `deployVersion` traffic splitting is a configuration placeholder; production deployment requires a routing-layer update
  - German is the primary tested language; other languages require validation of the `NlpTextAnalyzer` modality configuration
- Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., … Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models. Proceedings of the 10th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2106.09685
- Howard, J., & Ruder, S. (2018). Universal Language Model Fine-Tuning for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 328–339. https://doi.org/10.18653/v1/P18-1031
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Blog. https://openai.com/research/language-unsupervised
- Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. Advances in Neural Information Processing Systems (NeurIPS), 36. https://arxiv.org/abs/2305.14314
- Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., … Levy, O. (2023). LIMA: Less Is More for Alignment. Advances in Neural Information Processing Systems (NeurIPS), 36. https://arxiv.org/abs/2305.11206