Version: 1.7.0
Status: 🟢 Production-Ready
Last Updated: 2026-03-09
Module Path: src/analytics/
Production-ready for core OLAP, data export, process mining, text analytics, LLM integration, CEP engine, streaming aggregation windows, incremental materialized views, real-time anomaly detection, model serving / online inference pipeline, and predictive analytics / time-series forecasting.
- OLAP engine with GROUP BY, CUBE, ROLLUP, and GROUPING SETS
- Window functions (ROW_NUMBER, SUM OVER, AVG OVER with frame specs)
- Statistical aggregations (COUNT, SUM, AVG, MIN, MAX, STDDEV, VARIANCE, MEDIAN, PERCENTILE)
- Hash-based aggregation with result caching
- Columnar (Arrow) RecordBatch storage always available
- JSON and CSV export (no external dependencies)
- Optional Apache Arrow IPC, Parquet (with compression), and Feather export
- Process mining: Alpha Miner, Heuristic Miner, Inductive Miner
- Conformance checking (token replay and alignment-based)
- Process pattern matcher (graph, vector, behavioral, hybrid similarity)
- NLP text analyzer (tokenization, TF-IDF, NER, sentiment, keyword extraction)
- LLM process analyzer (OpenAI, Anthropic, Azure OpenAI, llama.cpp)
- Diff engine (changefeed-backed git-like diffs)
- SIMD-accelerated aggregations (AVX2)
- Thread-safe OLAPEngine for concurrent queries
- CEP full engine (NFA pattern matching, EPL parser, window+aggregation pipeline, alert dispatch, CDC integration) (analytics/cep_engine.cpp)
- CEP: EPL (Event Processing Language) parser: CREATE RULE … AS, SELECT aggregations (COUNT/SUM/AVG/MIN/MAX/FIRST/LAST/STDDEV/VARIANCE/PERCENTILE/DISTINCT_COUNT/COLLECT/TOPN with AS alias), GROUP BY, parenthesized WINDOW specs with human-readable time units (ms/s/minutes/hours/days), PATTERN WITHIN with time units, ACTION dispatch (alert/webhook/db_write/log/slack/kafka/email), multi-line EPL normalization (analytics/cep_engine.cpp)
- CEP stateful pattern matching with checkpointing: PatternMatcher::serializeState()/restoreState(), RuleEngine::serializeMatcherStates()/restoreMatcherStates(), full NFA partial-match persistence across restarts (analytics/cep_engine.cpp)
- Streaming aggregation windows: TumblingWindow, SlidingWindow, SessionWindow, HoppingWindow with watermark support (analytics/streaming_window.cpp)
- Incremental materialized views with delta-maintenance for all 10 aggregation functions, Welford STDDEV/VARIANCE, COUNT_DISTINCT ref-counting (analytics/incremental_view.cpp)
- Real-time anomaly detection: Z-Score, Modified Z-Score (MAD), IQR, Isolation Forest, LOF, Ensemble with adaptive learning (analytics/anomaly_detection.cpp)
- AutoML integration for automated model selection: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, KNN, Linear Regression with hyperparameter search, feature engineering, ensemble generation, and SHAP-based explanations (analytics/automl.cpp)
- CEP engine-level backpressure handling and buffer management: engine queue depth limit, drop policy, backpressure signal at configurable threshold, Prometheus metrics (analytics/cep_engine.cpp)
- Integration with external ML tools: ONNX Runtime (local inference) and TensorFlow Serving (REST API) via unified MLServingClient abstraction with DataPoint integration and graceful degradation when backends are absent (analytics/ml_serving.cpp)
- Model serving and online inference pipeline: thread-safe named+versioned model registry, online/batch inference, class-probability output, per-model health metrics, serialization round-trip (analytics/model_serving.cpp)
- Predictive analytics and time-series forecasting: LINEAR_REGRESSION, EXP_SMOOTHING, Holt-Winters triple exponential smoothing, ARIMA (AR+I+MA via Yule–Walker), ENSEMBLE with weighted combination; confidence intervals, seasonal decomposition, accuracy metrics (MAE, RMSE, MAPE, sMAPE), model serialization round-trip (analytics/forecasting.cpp)
- Predictive analytics and time-series forecasting (Issue: #1473)
- AutoML integration for automated model selection (Issue: #1485) ✅
- Advanced graph analytics: betweenness centrality, Louvain community detection (Issue: #1475)
- Integration with external ML tools (ONNX Runtime, TensorFlow Serving) (Issue: #1476) ✅
- Multi-language NLP support (beyond English) (Issue: #1478)
- Full morphological lemmatization (Issue: #1479)
- Arrow Flight RPC support for remote analytics: in-process + optional native gRPC transport (Issue: #1472) (analytics/arrow_flight.cpp)
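
The Welford STDDEV/VARIANCE delta-maintenance listed above for incremental materialized views can be illustrated with a small accumulator. This is a minimal sketch, not the code in analytics/incremental_view.cpp: the `WelfordAccumulator` name, its fields, and the `add`/`remove` interface are assumptions made for illustration. It shows how a running mean and sum of squared deviations can be updated per inserted row, and how the same update can be inverted for a deleted row.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical accumulator for illustration; not the ThemisDB type.
struct WelfordAccumulator {
    std::uint64_t count = 0;
    double mean = 0.0;
    double m2 = 0.0;  // running sum of squared deviations from the mean

    // Apply an inserted row (positive delta) using Welford's update.
    void add(double x) {
        ++count;
        const double delta = x - mean;
        mean += delta / static_cast<double>(count);
        m2 += delta * (x - mean);
    }

    // Apply a deleted row (negative delta) by inverting the update above.
    // Assumes x was previously added to this accumulator.
    void remove(double x) {
        if (count == 0) return;
        if (count == 1) {
            *this = WelfordAccumulator{};
            return;
        }
        const double n = static_cast<double>(count);
        const double old_mean = (mean * n - x) / (n - 1.0);
        m2 -= (x - mean) * (x - old_mean);
        if (m2 < 0.0) m2 = 0.0;  // guard against rounding below zero
        mean = old_mean;
        --count;
    }

    // Sample variance (n - 1 denominator); 0 when fewer than two rows.
    double variance() const {
        return count > 1 ? m2 / static_cast<double>(count - 1) : 0.0;
    }

    double stddev() const { return std::sqrt(variance()); }
};
```

Because each update touches only three scalars, a view row can be refreshed from a single delta without rescanning the base data. The inverse update can lose precision after many deletions, so a periodic full recompute may be worth considering.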
(none — all Phase 3 items completed)
- [P] GPU-accelerated OLAP aggregations (CUDA) (Issue: #1469)
- [P] Zero-copy Arrow data transfer optimizations (Issue: #1471)
- CUDA geospatial distance and containment kernels (Target: Q3 2026)
  - Inputs: WGS84 points/polygons, batch-size up to 1e6
  - Outputs: distance matrix + containment bitset
  - Constraints: deterministic FP tolerance ≤ 1e-6
  - Errors: invalid geometry (NaN/Inf coordinates), polygon self-intersection, overflow during Haversine distance
  - Tests: unit + property-based + GPU/CPU parity
  - Perf: ≥ 8x speedup vs CPU baseline on RTX-class GPU
- Federated analytics query dispatch across multiple ThemisDB clusters (Target: Q3 2026)
  - Affected: src/analytics/distributed_analytics.cpp, include/analytics/distributed_analytics.h
  - Expected behavior: scatter-gather with partial failure tolerance; partial results returned if <20% of shards fail
  - Errors: shard unreachable → skip with warning; tenant isolation violation → reject with PERMISSION_DENIED
  - Tests: unit tests for scatter/gather logic + integration tests with mock shards
  - Perf: fan-out latency ≤ 200 ms for 16 shards on LAN
  - Per-tenant data isolation at the SourceRegistry boundary
- SARIMA and Prophet-style forecasting models (Target: Q4 2026)
  - Affected: src/analytics/forecasting.cpp, include/analytics/forecasting.h
  - Expected behavior: extends ForecastMethod enum; fit()/predict() API unchanged
  - Errors: insufficient data for seasonal period (< 2 × seasonality), NaN in input series → structured error
  - Tests: unit tests for fit/predict/evaluate/serialize round-trip; parity vs Python statsmodels reference
  - Perf: SARIMA fit ≤ 5 s for series of length 10 000
  - Confidence intervals and decomposition retained
- AutoML ONNX export and deployment pipeline (Target: Q4 2026)
  - Affected: src/analytics/automl.cpp, include/analytics/automl.h
  - Expected behavior: AutoMLEngine::exportONNX(path) serializes trained model; loadable by MLServingClient
  - Errors: unsupported model type → UNSUPPORTED_OPERATION; serialization failure → structured error with cause
  - Tests: unit test export → load → infer round-trip; ONNX opset compatibility for all supported algorithms
  - Perf: export time ≤ 500 ms for any model trained on ≤ 1M samples
- OLAP engine with GROUP BY, CUBE, ROLLUP, and GROUPING SETS (analytics/olap_engine.cpp)
- Window functions: ROW_NUMBER, SUM OVER, AVG OVER with frame specifications
- Statistical aggregations (COUNT, SUM, AVG, MIN, MAX, STDDEV, VARIANCE, MEDIAN, PERCENTILE)
- Hash-based aggregation with result caching
- Columnar Arrow RecordBatch storage always available
- JSON, CSV, Parquet, and Feather export (analytics/exporters/)
- Process mining: Alpha Miner, Heuristic Miner, Inductive Miner (analytics/process_mining/)
- Conformance checking (token replay and alignment-based)
- NLP text analyzer: tokenization, TF-IDF, NER, sentiment, keyword extraction (analytics/nlp_analyzer.cpp)
- LLM process analyzer with OpenAI, Anthropic, Azure OpenAI, llama.cpp providers
- Diff engine (changefeed-backed git-like diffs, analytics/diff_engine.cpp)
- SIMD-accelerated aggregations (AVX2) in analytics/simd_aggregations.cpp
- Thread-safe OLAPEngine for concurrent queries
- CEP full engine implementation in analytics/cep_engine.cpp
- Streaming aggregation windows (tumbling/sliding/session/hopping) in analytics/streaming_window.cpp
- Incremental materialized views in analytics/incremental_view.cpp
- Columnar execution engine with vectorized operator pipeline (analytics/columnar_execution.cpp)
- LLVM-JIT compilation for hot aggregation paths (analytics/jit_aggregation.cpp): hot-path detection and template-specialised aggregation dispatch; LLVM MCJIT backend reserved behind the THEMIS_HAS_LLVM_JIT compile flag (Issue: #1482)
- Distributed analytics sharding across cluster nodes (Issue: #1483)
- Predictive analytics and time-series forecasting integration (Issue: #1484)
- AutoML integration for automated model selection
- Model serving and online inference pipeline (analytics/model_serving.cpp) (Issue: #1477)
- Unit tests (OLAP, Arrow export, process mining, NLP, diff engine, forecasting)
- Unit test coverage > 80% (test files added for all Phase 2 components; all three Phase 2 test suites active in CI)
- Integration tests (query module, index module, CDC)
- CEP engine integration tests (tests/analytics/test_cep_engine.cpp), including stateful checkpoint lifecycle (StatefulCheckpointPreservesPartialMatches, CheckpointWithNoPartialMatchesIsClean)
- Forecasting unit tests (tests/analytics/test_forecasting.cpp): TimeSeries, all five algorithms, fit/predict/evaluate/decompose, serialize/deserialize, edge cases
- Anomaly detection unit tests (tests/analytics/test_anomaly_detection.cpp): all 6 algorithms, streaming, serialize round-trip
- AutoML unit tests (tests/analytics/test_automl.cpp): classification, regression, feature engineering, ensemble, SHAP, serialize
- Distributed analytics unit tests (tests/analytics/test_distributed_analytics.cpp): shard management, scatter-gather, partial failure
- Process pattern matcher unit tests (tests/analytics/test_process_pattern_matcher.cpp): graph/vector/behavioral/hybrid similarity, conformance
- Arrow export + analytics_export unit tests (tests/analytics/test_arrow_export.cpp): RecordBatch, JSON/CSV, optional Parquet/Feather/IPC, sanitization
- Process mining LLM integration tests (tests/analytics/test_process_mining_llm.cpp): conformance, compliance rules, fraud detection, activity prediction
- Standalone focused test targets registered in tests/CMakeLists.txt for all 14 analytics test files
- All analytics sources registered in cmake/CMakeLists.txt and cmake/ModularBuild.cmake
- Arrow Flight RPC (analytics/arrow_flight.cpp): in-process + optional native gRPC transport (Issue: #1472)
- Performance benchmarks (OLAP, export, process mining, graph, NLP)
- Security audit (LLM API key handling, data export sanitization)
- Documentation complete (API docs, OLAP guide, process mining guide)
- API stability guaranteed for OLAP, export, and process mining
- NLP text analyzer uses rule-based approaches — not suitable as a replacement for full NLP frameworks
- LLM analyzer requires external API keys; responses are non-deterministic
- Arrow-dependent formats (Parquet, Feather, IPC) require the compile-time flag THEMIS_HAS_ARROW
- Graph analytics advanced algorithms (betweenness centrality, Louvain community detection) are now implemented as AQL functions in include/query/functions/graph_extensions.h
- Arrow export format options may expand in v1.7.0 (additive, non-breaking)
- Implementation README: README.md
- Architecture: ARCHITECTURE.md
- Future Enhancements: FUTURE_ENHANCEMENTS.md
- API Documentation: ../../include/analytics/README.md
- Secondary Docs (de): ../../docs/de/analytics/README.md