Analytics Module — Architecture Guide
Version: 1.0
Last Updated: 2026-02-24
Module Path: src/analytics/
1. Overview
The Analytics module extends ThemisDB from a transactional database into an analytical
platform. It provides OLAP multi-dimensional query processing, process mining, NLP text
analysis, time-series forecasting, machine learning integration, anomaly detection, and
complex event processing, all operating over the same data that the transactional engine
manages.
2. Design Principles
- Separation from Query Parsing: the analytics layer receives already-parsed query plans from the query module; it does not parse AQL.
- Columnar Execution: results are processed in columnar batches for CPU cache efficiency and optional Arrow/Parquet export.
- Platform Parity: full implementation on Linux/Unix; on Windows, where native Arrow support is unavailable, a stub that logs an error.
- SIMD Acceleration: AVX2 code paths for the aggregation hot loops.
- Pluggable ML: ML/AI functionality delegates to the llm module; the analytics module orchestrates but does not host model weights.
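To make the columnar execution principle concrete, here is a minimal sketch of a batch stored as one contiguous array per column. `ColumnBatch` and `sumColumn` are hypothetical names chosen for illustration and are not taken from columnar_execution.cpp. The plain loop is only a stand-in: the module is described as using explicit AVX2 paths, which this sketch does not attempt to reproduce.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical columnar batch: one contiguous array per column
// (struct-of-arrays) instead of one struct per row.
struct ColumnBatch {
    std::vector<std::int64_t> region_id;  // grouping column
    std::vector<double> amount;           // measure column

    std::size_t rows() const { return amount.size(); }
};

// Sums one measure column. Only that column's array is read, so the
// loop walks contiguous memory and never touches unrelated columns.
inline double sumColumn(const std::vector<double>& column) {
    double total = 0.0;
    for (double v : column) {
        total += v;
    }
    return total;
}
```

With a row-oriented layout the same sum would pull every field of every row through the cache; here only the `amount` array is read.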
3. Component Architecture
| File | Role |
|---|---|
| olap.cpp | Core OLAP engine: GROUP BY, CUBE, ROLLUP, GROUPING SETS |
| streaming_window.cpp | Streaming windows: tumbling, sliding, session, hopping with watermarks |
| columnar_execution.cpp | Columnar batch processor with AVX2 SIMD |
| jit_aggregation.cpp | JIT-compiled hot-path aggregation dispatch (LLVM-ready) |
| process_mining.cpp | Process discovery (Alpha / Heuristic / Inductive algorithms) |
| process_pattern_matcher.cpp | Conformance checking against reference process models |
| llm_process_analyzer.cpp | LLM-powered process analysis and anomaly explanation |
| nlp_text_analyzer.cpp | Sentiment analysis, entity extraction, modality detection |
| diff_engine.cpp | Dataset diffing and change detection |
| cep_engine.cpp | Complex Event Processing (production-ready) |
| forecasting.cpp | Time-series forecasting (ARIMA, Holt-Winters, EXP_SMOOTHING, ENSEMBLE) |
| anomaly_detection.cpp | Statistical and ML-based anomaly detection |
| arrow_export.cpp | Apache Arrow IPC / Parquet export |
| arrow_flight.cpp | Arrow Flight RPC server/client for remote analytics (in-process + gRPC) |
| analytics_export.cpp | CSV / JSON export |
| distributed_analytics.cpp | Fan-out queries across shards (scatter-gather coordinator) |
| automl.cpp | Automated ML model selection |
| ml_serving.cpp | External ML inference (ONNX Runtime, TensorFlow Serving) |
| model_serving.cpp | In-process named+versioned model registry and online inference pipeline |
| incremental_view.cpp | Incremental materialized view maintenance |
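The OLAP engine row lists CUBE and ROLLUP next to plain GROUP BY. Both are shorthand for a list of grouping sets, and the sketch below shows that expansion under standard SQL semantics. `expandRollup` and `expandCube` are hypothetical helper names, not functions from olap.cpp, and the sketch says nothing about how the engine evaluates the sets.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using GroupingSet = std::vector<std::string>;

// ROLLUP(d1, ..., dn) expands to the n + 1 prefixes of the dimension
// list, from the full list down to the empty set (the grand total).
inline std::vector<GroupingSet> expandRollup(const std::vector<std::string>& dims) {
    std::vector<GroupingSet> sets;
    for (std::size_t len = dims.size() + 1; len-- > 0;) {
        sets.emplace_back(dims.begin(),
                          dims.begin() + static_cast<std::ptrdiff_t>(len));
    }
    return sets;
}

// CUBE(d1, ..., dn) expands to all 2^n subsets of the dimension list.
// The bitmask approach assumes fewer dimensions than bits in std::size_t.
inline std::vector<GroupingSet> expandCube(const std::vector<std::string>& dims) {
    std::vector<GroupingSet> sets;
    const std::size_t n = dims.size();
    for (std::size_t mask = (std::size_t{1} << n); mask-- > 0;) {
        GroupingSet set;
        for (std::size_t i = 0; i < n; ++i) {
            // Bit (n - 1 - i) decides whether dimension i is grouped on.
            if (mask & (std::size_t{1} << (n - 1 - i))) {
                set.push_back(dims[i]);
            }
        }
        sets.push_back(std::move(set));
    }
    return sets;
}
```

For three dimensions, ROLLUP yields 4 grouping sets and CUBE yields 8, which is why CUBE over many dimensions grows expensive quickly.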
┌─────────────────────────────────────────────────────────────────┐
│                  AQL Query Engine (src/query/)                  │
│    executes analytical query plans, calls analytics engines     │
└──────────────────────────┬──────────────────────────────────────┘
                           │ analytical sub-plan
┌──────────────────────────▼──────────────────────────────────────┐
│                      Analytics Dispatcher                       │
│              routes to OLAP / Process / NLP / CEP               │
└──────┬───────────┬──────────────┬────────────────┬──────────────┘
       │           │              │                │
┌──────▼──┐  ┌─────▼──────┐  ┌────▼─────┐  ┌───────▼────────┐
│  OLAP   │  │  Process   │  │   NLP    │  │      CEP       │
│ Engine  │  │   Mining   │  │ Analyzer │  │     Engine     │
│         │  │            │  │          │  │  (full impl)   │
└──────┬──┘  └─────┬──────┘  └────┬─────┘  └───────┬────────┘
       │           │              │                │
┌──────▼───────────▼──────────────▼────────────────▼──────────────┐
│                Columnar Execution Engine (AVX2)                 │
└──────────────────────────┬──────────────────────────────────────┘
                           │
              ┌────────────▼────────────┐
              │ Arrow / Parquet Export  │
              └─────────────────────────┘
4. Data Flow
Query Plan (from query engine)
   │
   ▼
Analytics Dispatcher
   │
   ├─ OLAP path:
   │    executeSimpleGroupBy() / executeCubeQuery() / executeRollupQuery()
   │      → columnar hash aggregation (AVX2)
   │      → result cache
   │      → JSON / Arrow / Parquet
   │
   ├─ Process Mining path:
   │    process_mining.cpp → Alpha/Heuristic/Inductive algorithms
   │      → conformance checker (process_pattern_matcher.cpp)
   │      → LLM explanation (llm_process_analyzer.cpp) [optional]
   │
   ├─ NLP path:
   │    nlp_text_analyzer.cpp → sentiment / entity / modality
   │    (delegates heavy NLP to llm module via plugin interface)
   │
   └─ Diff path:
        diff_engine.cpp → change set with additions/deletions/modifications
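The routing step in the flow above can be sketched as a switch over the kind of sub-plan. Everything here is hypothetical: `AnalyticsPath`, `AnalyticsSubPlan`, and `routeSubPlan` are illustrative names, and the real dispatcher interface is not described in this guide. The sketch only returns the name of the source file that would handle each path.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical plan kinds, one per path in the flow above.
enum class AnalyticsPath { Olap, ProcessMining, Nlp, Diff };

// Hypothetical, already-parsed sub-plan handed over by the query module.
struct AnalyticsSubPlan {
    AnalyticsPath path;
    std::string payload;  // placeholder for the real plan contents
};

// Returns the source file that would handle the sub-plan.
inline std::string routeSubPlan(const AnalyticsSubPlan& plan) {
    switch (plan.path) {
        case AnalyticsPath::Olap:          return "olap.cpp";
        case AnalyticsPath::ProcessMining: return "process_mining.cpp";
        case AnalyticsPath::Nlp:           return "nlp_text_analyzer.cpp";
        case AnalyticsPath::Diff:          return "diff_engine.cpp";
    }
    // Reached only if an out-of-range value was cast to AnalyticsPath.
    throw std::invalid_argument("unknown analytics path");
}
```

Because the dispatcher receives parsed plans, routing needs no knowledge of AQL syntax, which matches the separation from query parsing stated earlier.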
5. Integration Points
| Direction | Module | Interface |
|---|---|---|
| Consumed by | src/query/ | analytics sub-plan execution |
| Uses | src/llm/ | LLM-powered process analysis and NLP |
| Uses | src/storage/ | data access for analytics queries |
| Uses | src/sharding/ | distributed_analytics.cpp fan-out |
| Exports to | clients | Arrow IPC, Parquet, CSV, JSON |
6. Threading & Concurrency Model
- OLAPEngine is thread-safe; concurrent queries use independent Impl state.
- columnar_execution.cpp uses AVX2 on a single thread per batch; parallelism is achieved at the shard fan-out level (distributed_analytics.cpp).
- CEP streaming windows use a dedicated background thread per window group.
- Process mining algorithms are single-threaded per invocation.
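Since parallelism comes from the shard fan-out, each shard can compute a partial aggregate on its own and the coordinator only has to merge them. The sketch below shows that gather step for SUM, COUNT, and AVG. `PartialAggregate`, `ShardResult`, `mergePartials`, and `average` are hypothetical names, not the API of distributed_analytics.cpp, and the scatter step (running shards concurrently) is left out.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical per-group partial state produced by one shard.
struct PartialAggregate {
    double sum = 0.0;
    std::uint64_t count = 0;
};

// Group key -> partial state, as one shard would report it.
using ShardResult = std::unordered_map<std::string, PartialAggregate>;

// Gather step: combines per-shard partials into one state per group key.
inline ShardResult mergePartials(const std::vector<ShardResult>& shards) {
    ShardResult merged;
    for (const ShardResult& shard : shards) {
        for (const auto& [key, partial] : shard) {
            PartialAggregate& target = merged[key];
            target.sum += partial.sum;
            target.count += partial.count;
        }
    }
    return merged;
}

// AVG is derived from the merged sum and count. Averaging the
// per-shard averages would give small shards too much weight.
inline double average(const PartialAggregate& p) {
    return p.count == 0 ? 0.0 : p.sum / static_cast<double>(p.count);
}
```

SUM and COUNT merge by addition, so the merge order does not matter and shards can report in any sequence.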
7. Performance Architecture
| Technique | Where Applied |
|---|---|
| Hash-based aggregation | olap.cpp O(n) average grouping |
| Columnar data layout | columnar_execution.cpp for cache efficiency |
| AVX2 SIMD | Aggregation hot paths |
| Result caching | OLAPEngine caches repeated query results |
| Lazy evaluation | CUBE/ROLLUP skips unused dimension combinations |
| Incremental views | incremental_view.cpp avoids full re-computation |
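Two of the techniques above fit in one short sketch: a single-pass hash aggregation, and an incremental update that adjusts an existing aggregate instead of recomputing it. `GroupState`, `hashGroupBy`, and `applyDelta` are hypothetical names and do not reflect the actual code in olap.cpp or incremental_view.cpp. The sketch covers SUM and COUNT only; aggregates such as MIN or MAX cannot be retracted this simply.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct GroupState {
    double sum = 0.0;
    std::uint64_t count = 0;
};

using GroupMap = std::unordered_map<std::int64_t, GroupState>;

// Full computation: one pass over two parallel columns, one hash
// lookup per row, so O(n) on average.
inline GroupMap hashGroupBy(const std::vector<std::int64_t>& keys,
                            const std::vector<double>& values) {
    GroupMap groups;
    const std::size_t n = keys.size() < values.size() ? keys.size() : values.size();
    for (std::size_t i = 0; i < n; ++i) {
        GroupState& g = groups[keys[i]];
        g.sum += values[i];
        ++g.count;
    }
    return groups;
}

enum class DeltaKind { Insert, Delete };

// Incremental maintenance: applies one inserted or deleted row to an
// existing result without rescanning the base data.
inline void applyDelta(GroupMap& groups, DeltaKind kind,
                       std::int64_t key, double value) {
    if (kind == DeltaKind::Insert) {
        GroupState& g = groups[key];
        g.sum += value;
        ++g.count;
        return;
    }
    auto it = groups.find(key);
    if (it == groups.end()) {
        return;  // nothing to retract
    }
    it->second.sum -= value;
    if (--it->second.count == 0) {
        groups.erase(it);  // last row of the group was deleted
    }
}
```

Applying a delta costs one hash lookup, whereas recomputing the view costs a full pass over the base data.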
8. Security Considerations
- Analytics queries run under the same RBAC/RLS enforcement as regular queries.
- NLP and process analysis via LLM: prompts are sanitized before submission to prevent data exfiltration through prompt injection.
- Parquet/Arrow exports are scoped to the requesting tenant.
9. Configuration
| Parameter | Default | Description |
|---|---|---|
| analytics.olap.result_cache_size | 128 entries | Max OLAP result cache entries |
| analytics.simd.avx2_enabled | auto-detect | Enable AVX2 SIMD for aggregation |
| analytics.export.parquet_enabled | true (Linux) | Enable Parquet export |
| analytics.cep.window_thread_pool_size | 4 | CEP streaming window threads |
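One way to model the "auto-detect" default is an optional flag that falls back to a CPU probe when unset. The sketch below assumes that design. `AnalyticsConfig`, `cpuSupportsAvx2`, and `resolveAvx2` are hypothetical names, and how ThemisDB actually loads these parameters is not covered here. The probe uses the GCC/Clang builtin `__builtin_cpu_supports` on x86 and reports false elsewhere.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical in-memory view of some parameters from the table above.
struct AnalyticsConfig {
    std::size_t olap_result_cache_size = 128;     // analytics.olap.result_cache_size
    std::optional<bool> simd_avx2_enabled;        // empty means auto-detect
    std::size_t cep_window_thread_pool_size = 4;  // analytics.cep.window_thread_pool_size
};

// Probes the CPU for AVX2. Returns false on compilers or architectures
// where this sketch has no probe available.
inline bool cpuSupportsAvx2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

// An explicit setting wins; otherwise fall back to the probe.
inline bool resolveAvx2(const AnalyticsConfig& cfg) {
    return cfg.simd_avx2_enabled.value_or(cpuSupportsAvx2());
}
```

Keeping the flag optional distinguishes "the operator turned AVX2 off" from "no preference was given", which a plain bool cannot express.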
10. Error Handling
| Error Type | Strategy |
|---|---|
| Unsupported operation on Windows | Return structured error; log via spdlog |
| Arrow/Parquet library unavailable | Disable export; log warning at startup |
| LLM timeout in process analysis | Return partial result without LLM explanation |
| OOM during large CUBE computation | Spill intermediate results (planned) |
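The LLM timeout strategy can be sketched with an optional explanation field: the analysis result is always returned, and the explanation is simply absent when the LLM does not answer in time. `ProcessAnalysisResult`, `Explainer`, and `analyzeWithFallback` are hypothetical names. The callable stands in for the call into the llm module and is assumed to return `std::nullopt` on timeout; the real interface may differ.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct ProcessAnalysisResult {
    std::vector<std::string> deviations;     // always populated
    std::optional<std::string> explanation;  // empty if the LLM did not answer
    bool partial = false;                    // true when the explanation is missing
};

// Stand-in for the llm module call; returns std::nullopt on timeout.
using Explainer =
    std::function<std::optional<std::string>(const std::vector<std::string>&)>;

// Returns the analysis even when the explanation step fails, and marks
// the result as partial so callers can tell the difference.
inline ProcessAnalysisResult analyzeWithFallback(std::vector<std::string> deviations,
                                                 const Explainer& explain) {
    ProcessAnalysisResult result;
    result.deviations = std::move(deviations);
    result.explanation = explain(result.deviations);
    result.partial = !result.explanation.has_value();
    return result;
}
```

The caller still receives the conformance findings on a timeout and can decide whether to retry the explanation later.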
11. Known Limitations & Future Work
- CEP engine (cep_engine.cpp) is fully implemented with NFA-based pattern matching, EPL parsing, window management, aggregation, alert dispatch, CDC integration, and stateful checkpointing.
- Windows platform uses stub implementations for most analytics functions.
- Distributed analytics fan-out is experimental; result merging for CUBE across shards is partial.
- AutoML (automl.cpp) is in early development.
12. References
- src/analytics/README.md: module overview
- src/analytics/ROADMAP.md: roadmap and planned features
- src/analytics/FUTURE_ENHANCEMENTS.md: future enhancements
- docs/architecture/architecture_multi_model.md: multi-model data model
- ARCHITECTURE.md (root): full system architecture
- docs/de/analytics/README.md: secondary documentation hub (German)
- docs/de/analytics/olap_guide.md: OLAP usage guide
- docs/de/analytics/cep_guide.md: CEP engine guide
- docs/de/analytics/forecasting_guide.md: forecasting guide