Version: 1.0 · Last Updated: 2026-03-11 · Module Path: `src/training/`
## 1. Overview
The Training module provides tools for building and maintaining domain-specific LLM
fine-tuning datasets and LoRA adapters from data stored in ThemisDB. It automates training
sample extraction from domain documents (legal text, technical documentation), manages
the incremental LoRA training lifecycle (train, checkpoint, evaluate, deploy, rollback),
and enriches training samples with knowledge graph context.
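Conceptually, each extracted sample pairs document text with a target completion and carries labeling metadata plus knowledge-graph context. A minimal sketch of that shape, with illustrative field names that are assumptions rather than the module's actual API:

```cpp
#include <string>
#include <vector>

// Illustrative sketch of an extracted training sample; the real
// structures in src/training/ may differ.
struct TrainingSample {
    std::string prompt;                      // text extracted from the source document
    std::string completion;                  // target output for fine-tuning
    double confidence = 0.0;                 // auto-labeler confidence score
    std::vector<std::string> graph_context;  // related entities from the knowledge graph
};
```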
## 2. Design Principles
- **Data-Driven Labeling** – `auto_labeler.cpp` uses ThemisDB's own NLP analytics to extract structured training samples from stored documents without manual annotation.
- **Incremental Training** – `incremental_lora_trainer.cpp` supports initial training and incremental fine-tuning on new data, with checkpoint/resume.
- **Graph-Enriched Context** – `knowledge_graph_enricher.cpp` adds related entities (provisions, citations, similar documents) from the knowledge graph to training samples.
- **Confidence Gating** – low-confidence samples are flagged for human review before being included in training data.
- **Adapter Versioning** – `incremental_lora_trainer.cpp` manages adapter versions with deploy/rollback and configurable traffic splitting.
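The confidence-gating principle can be sketched as a simple partition over per-sample confidence scores. The function name, threshold, and return shape below are assumptions for illustration, not the module's API:

```cpp
#include <utility>
#include <vector>

// Hypothetical confidence gate: scores at or above the threshold are
// accepted into the training set; the rest are queued for human review.
std::pair<std::vector<double>, std::vector<double>>
gate_samples(const std::vector<double>& confidences, double threshold) {
    std::pair<std::vector<double>, std::vector<double>> result;  // {accepted, review}
    for (double c : confidences) {
        (c >= threshold ? result.first : result.second).push_back(c);
    }
    return result;
}
```

A batch with scores `{0.95, 0.4, 0.85}` gated at `0.8` would send two samples to training and one to the review queue.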
## 3. Component Architecture
### 3.1 Key Components
| File | Role |
| --- | --- |
| `auto_labeler.cpp` | `LegalAutoLabeler`: extracts structured training samples from documents |
| `incremental_lora_trainer.cpp` | LoRA adapter training with checkpoint/resume |
| `knowledge_graph_enricher.cpp` | AQL-based context enrichment via graph traversal |
| `lora_checkpoint_manager.cpp` | Checkpoint management with SHA-256 integrity validation and rotation |
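The rotation side of `lora_checkpoint_manager.cpp` could look roughly like the sketch below: keep the newest `max_keep` checkpoints and report the rest for deletion. The zero-padded `ckpt-<step>` naming scheme and the keep-count policy are assumptions; the real manager also validates SHA-256 digests, which is omitted here:

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Hypothetical rotation policy: checkpoint names embed an increasing,
// zero-padded step number (e.g. "ckpt-000100"), so lexicographic order
// matches step order. Returns the names that should be deleted.
std::vector<std::string> checkpoints_to_delete(std::vector<std::string> names,
                                               std::size_t max_keep) {
    std::sort(names.begin(), names.end(), std::greater<>());  // newest first
    if (names.size() <= max_keep) return {};
    return {names.begin() + static_cast<std::ptrdiff_t>(max_keep), names.end()};
}
```

For example, with checkpoints at steps 100, 200, and 300 and a keep-count of 2, only `ckpt-000100` is selected for deletion.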