Analytics Module Roadmap

Version: 1.7.0 | Status: 🟢 Production-Ready | Last Updated: 2026-03-09 | Module Path: src/analytics/

Current Status

Production-ready for core OLAP, data export, process mining, text analytics, LLM integration, CEP engine, streaming aggregation windows, incremental materialized views, real-time anomaly detection, model serving / online inference pipeline, and predictive analytics / time-series forecasting.

Completed ✅

  • OLAP engine with GROUP BY, CUBE, ROLLUP, and GROUPING SETS
  • Window functions (ROW_NUMBER, SUM OVER, AVG OVER with frame specs)
  • Statistical aggregations (COUNT, SUM, AVG, MIN, MAX, STDDEV, VARIANCE, MEDIAN, PERCENTILE)
  • Hash-based aggregation with result caching
  • Columnar (Arrow) RecordBatch storage always available
  • JSON and CSV export (no external dependencies)
  • Optional Apache Arrow IPC, Parquet (with compression), and Feather export
  • Process mining: Alpha Miner, Heuristic Miner, Inductive Miner
  • Conformance checking (token replay and alignment-based)
  • Process pattern matcher (graph, vector, behavioral, hybrid similarity)
  • NLP text analyzer (tokenization, TF-IDF, NER, sentiment, keyword extraction)
  • LLM process analyzer (OpenAI, Anthropic, Azure OpenAI, llama.cpp)
  • Diff engine (changefeed-backed git-like diffs)
  • SIMD-accelerated aggregations (AVX2)
  • Thread-safe OLAPEngine for concurrent queries
  • CEP full engine (NFA pattern matching, EPL parser, window+aggregation pipeline, alert dispatch, CDC integration) (analytics/cep_engine.cpp)
  • CEP: EPL (Event Processing Language) parser: CREATE RULE … AS, SELECT aggregations (COUNT/SUM/AVG/MIN/MAX/FIRST/LAST/STDDEV/VARIANCE/PERCENTILE/DISTINCT_COUNT/COLLECT/TOPN with AS alias), GROUP BY, parenthesized WINDOW specs with human-readable time units (ms/s/minutes/hours/days), PATTERN WITHIN with time units, ACTION dispatch (alert/webhook/db_write/log/slack/kafka/email), multi-line EPL normalization (analytics/cep_engine.cpp)
  • CEP stateful pattern matching with checkpointing: PatternMatcher::serializeState()/restoreState(), RuleEngine::serializeMatcherStates()/restoreMatcherStates(), full NFA partial-match persistence across restarts (analytics/cep_engine.cpp)
  • Streaming aggregation windows: TumblingWindow, SlidingWindow, SessionWindow, HoppingWindow with watermark support (analytics/streaming_window.cpp)
  • Incremental materialized views with delta-maintenance for all 10 aggregation functions, Welford STDDEV/VARIANCE, COUNT_DISTINCT ref-counting (analytics/incremental_view.cpp)
  • Real-time anomaly detection: Z-Score, Modified Z-Score (MAD), IQR, Isolation Forest, LOF, Ensemble with adaptive learning (analytics/anomaly_detection.cpp)
  • AutoML integration for automated model selection: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, KNN, Linear Regression with hyperparameter search, feature engineering, ensemble generation, and SHAP-based explanations (analytics/automl.cpp)
  • CEP engine-level backpressure handling and buffer management: engine queue depth limit, drop policy, backpressure signal at configurable threshold, Prometheus metrics (analytics/cep_engine.cpp)
  • Integration with external ML tools: ONNX Runtime (local inference) and TensorFlow Serving (REST API) via unified MLServingClient abstraction with DataPoint integration and graceful degradation when backends are absent (analytics/ml_serving.cpp)
  • Model serving and online inference pipeline: thread-safe named+versioned model registry, online/batch inference, class-probability output, per-model health metrics, serialization round-trip (analytics/model_serving.cpp)
  • Predictive analytics and time-series forecasting: LINEAR_REGRESSION, EXP_SMOOTHING, Holt-Winters triple exponential smoothing, ARIMA (AR+I+MA via Yule–Walker), ENSEMBLE with weighted combination; confidence intervals, seasonal decomposition, accuracy metrics (MAE, RMSE, MAPE, sMAPE), model serialization round-trip (analytics/forecasting.cpp)
  • Predictive analytics and time-series forecasting (Issue: #1473)
  • AutoML integration for automated model selection (Issue: #1485)
  • Advanced graph analytics: betweenness centrality, Louvain community detection (Issue: #1475)
  • Integration with external ML tools (ONNX Runtime, TensorFlow Serving) (Issue: #1476)
  • Multi-language NLP support (beyond English) (Issue: #1478)
  • Full morphological lemmatization (Issue: #1479)
  • Arrow Flight RPC support for remote analytics: in-process + optional native gRPC transport (Issue: #1472) (analytics/arrow_flight.cpp)
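
The incremental materialized views listed above use Welford's method for STDDEV/VARIANCE. A minimal sketch of the insert path follows. The type and member names (WelfordAccumulator, add, variance, stddev) are hypothetical and are not taken from analytics/incremental_view.cpp. Deletions and the choice between sample and population variance are assumptions left open here.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical accumulator for illustration only; the real view state may differ.
struct WelfordAccumulator {
    std::size_t count = 0;
    double mean = 0.0;
    double m2 = 0.0;  // running sum of squared deviations from the mean

    // Fold one inserted value into the running statistics (Welford's update).
    void add(double x) {
        ++count;
        const double delta = x - mean;               // deviation from the old mean
        mean += delta / static_cast<double>(count);  // update the mean first
        m2 += delta * (x - mean);                    // uses old and new mean
    }

    // Sample variance (n - 1 denominator, an assumption); 0 for fewer than two values.
    double variance() const {
        return count < 2 ? 0.0 : m2 / static_cast<double>(count - 1);
    }

    double stddev() const { return std::sqrt(variance()); }
};
```

Each insert costs O(1) and never rescans the base rows, which is the property that makes the statistic suitable for delta-maintenance.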

In Progress 🚧

(none — all Phase 3 items completed)

Planned Features 📋

Short-term (Next 3-6 months)

  • [P] GPU-accelerated OLAP aggregations (CUDA) (Issue: #1469)
  • [P] Zero-copy Arrow data transfer optimizations (Issue: #1471)

Long-term (6-12 months)

  • CUDA geospatial distance and containment kernels (Target: Q3 2026)
    • Inputs: WGS84 points/polygons, batch-size up to 1e6
    • Outputs: distance matrix + containment bitset
    • Constraints: deterministic FP tolerance ≤ 1e-6
    • Errors: invalid geometry (NaN/Inf coordinates), polygon self-intersection, overflow during Haversine distance
    • Tests: unit + property-based + GPU/CPU parity
    • Perf: ≥ 8x speedup vs CPU baseline on RTX-class GPU
  • Federated analytics query dispatch across multiple ThemisDB clusters (Target: Q3 2026)
    • Affected: src/analytics/distributed_analytics.cpp, include/analytics/distributed_analytics.h
    • Expected behavior: scatter-gather with partial failure tolerance; partial results are returned if fewer than 20% of shards fail
    • Errors: shard unreachable → skip with warning; tenant isolation violation → reject with PERMISSION_DENIED
    • Tests: unit tests for scatter/gather logic + integration tests with mock shards
    • Perf: fan-out latency ≤ 200 ms for 16 shards on LAN
    • Per-tenant data isolation at the SourceRegistry boundary
  • SARIMA and Prophet-style forecasting models (Target: Q4 2026)
    • Affected: src/analytics/forecasting.cpp, include/analytics/forecasting.h
    • Expected behavior: extends ForecastMethod enum; fit()/predict() API unchanged
    • Errors: insufficient data for seasonal period (< 2 × seasonality), NaN in input series → structured error
    • Tests: unit tests for fit/predict/evaluate/serialize round-trip; parity vs Python statsmodels reference
    • Perf: SARIMA fit ≤ 5 s for series of length 10 000
    • Confidence intervals and decomposition retained
  • AutoML ONNX export and deployment pipeline (Target: Q4 2026)
    • Affected: src/analytics/automl.cpp, include/analytics/automl.h
    • Expected behavior: AutoMLEngine::exportONNX(path) serializes trained model; loadable by MLServingClient
    • Errors: unsupported model type → UNSUPPORTED_OPERATION; serialization failure → structured error with cause
    • Tests: unit test export → load → infer round-trip; ONNX opset compatibility for all supported algorithms
    • Perf: export time ≤ 500 ms for any model trained on ≤ 1M samples
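
The partial-failure rule planned for federated query dispatch (partial results when fewer than 20% of shards fail) can be sketched as below. All names (GatherStatus, GatherResult, gather) are hypothetical and are not the API of distributed_analytics.h. The behavior at or above the 20% threshold is not specified in this roadmap, so discarding the results in that case is an assumption.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

enum class GatherStatus { Complete, Partial, Failed };

// Hypothetical result type for illustration only.
struct GatherResult {
    GatherStatus status;
    std::vector<long long> rows;  // rows merged from reachable shards
    std::size_t failedShards;
};

// Each element is one shard's response; std::nullopt models an unreachable shard.
GatherResult gather(const std::vector<std::optional<std::vector<long long>>>& responses) {
    GatherResult out{GatherStatus::Complete, {}, 0};
    for (const auto& r : responses) {
        if (!r) {  // unreachable shard: skip it (a real system would log a warning)
            ++out.failedShards;
            continue;
        }
        out.rows.insert(out.rows.end(), r->begin(), r->end());
    }
    if (out.failedShards == 0) return out;
    // failed / total < 1/5 is checked as failed * 5 < total to avoid rounding.
    if (out.failedShards * 5 < responses.size()) {
        out.status = GatherStatus::Partial;
    } else {
        out.status = GatherStatus::Failed;  // assumption: no partial results at >= 20%
        out.rows.clear();
    }
    return out;
}
```

With 10 shards, one unreachable shard yields a Partial result, while two unreachable shards (exactly 20%) do not qualify under the strict "fewer than" reading.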

Phase 1: Core Analytics Engine (Status: Completed ✅)

  • OLAP engine with GROUP BY, CUBE, ROLLUP, and GROUPING SETS (analytics/olap_engine.cpp)
  • Window functions: ROW_NUMBER, SUM OVER, AVG OVER with frame specifications
  • Statistical aggregations (COUNT, SUM, AVG, MIN, MAX, STDDEV, VARIANCE, MEDIAN, PERCENTILE)
  • Hash-based aggregation with result caching
  • Columnar Arrow RecordBatch storage always available
  • JSON, CSV, Parquet, and Feather export (analytics/exporters/)
  • Process mining: Alpha Miner, Heuristic Miner, Inductive Miner (analytics/process_mining/)
  • Conformance checking (token replay and alignment-based)
  • NLP text analyzer: tokenization, TF-IDF, NER, sentiment, keyword extraction (analytics/nlp_analyzer.cpp)
  • LLM process analyzer with OpenAI, Anthropic, Azure OpenAI, llama.cpp providers
  • Diff engine (changefeed-backed git-like diffs, analytics/diff_engine.cpp)
  • SIMD-accelerated aggregations (AVX2) in analytics/simd_aggregations.cpp
  • Thread-safe OLAPEngine for concurrent queries
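
ROLLUP and CUBE are shorthand for lists of grouping sets, following standard SQL semantics. The sketch below shows that expansion. The function names expandRollup and expandCube are hypothetical and do not describe how analytics/olap_engine.cpp is organized.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using GroupingSet = std::vector<std::string>;

// ROLLUP(c1..cn) expands to the n + 1 prefixes: (c1..cn), (c1..cn-1), ..., ().
std::vector<GroupingSet> expandRollup(const std::vector<std::string>& cols) {
    std::vector<GroupingSet> sets;
    for (std::size_t len = cols.size() + 1; len-- > 0;) {
        sets.emplace_back(cols.begin(),
                          cols.begin() + static_cast<std::ptrdiff_t>(len));
    }
    return sets;
}

// CUBE(c1..cn) expands to all 2^n subsets; assumes n is below the bit width of size_t.
std::vector<GroupingSet> expandCube(const std::vector<std::string>& cols) {
    const std::size_t n = cols.size();
    std::vector<GroupingSet> sets;
    // Count the bitmask down so the full set comes first and the grand total () last.
    for (std::size_t mask = (std::size_t{1} << n); mask-- > 0;) {
        GroupingSet s;
        for (std::size_t i = 0; i < n; ++i) {
            if (mask & (std::size_t{1} << i)) s.push_back(cols[i]);
        }
        sets.push_back(std::move(s));
    }
    return sets;
}
```

For two columns, ROLLUP produces three grouping sets and CUBE produces four. An engine can then evaluate each set as an ordinary GROUP BY and union the results.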

Phase 2: Streaming & Incremental Analytics (Status: Completed ✅)

  • CEP full engine implementation in analytics/cep_engine.cpp
  • Streaming aggregation windows (tumbling/sliding/session/hopping) in analytics/streaming_window.cpp
  • Incremental materialized views in analytics/incremental_view.cpp
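
To illustrate how events map to streaming windows, the sketch below assigns a timestamp to tumbling and hopping windows. It assumes windows are aligned to the epoch, intervals are half-open [start, end), and size and hop are positive. The names Window, tumblingWindow, and hoppingWindows are hypothetical and not taken from analytics/streaming_window.cpp. Watermarks and session windows are not covered.

```cpp
#include <cstdint>
#include <vector>

struct Window {
    std::int64_t start;  // inclusive
    std::int64_t end;    // exclusive
};

// Tumbling: every timestamp belongs to exactly one window of width `size`.
Window tumblingWindow(std::int64_t ts, std::int64_t size) {
    // Floor ts to a multiple of size; the double modulo also handles negative ts.
    const std::int64_t start = ts - (ts % size + size) % size;
    return {start, start + size};
}

// Hopping: windows of width `size` start every `hop`; a timestamp can fall into several.
std::vector<Window> hoppingWindows(std::int64_t ts, std::int64_t size, std::int64_t hop) {
    std::vector<Window> out;
    const std::int64_t lastStart = ts - (ts % hop + hop) % hop;  // latest start <= ts
    // Walk backwards while the window [s, s + size) still contains ts.
    for (std::int64_t s = lastStart; s > ts - size; s -= hop) {
        out.push_back({s, s + size});
    }
    return out;
}
```

With size 10 and hop 5, timestamp 25 lands in the windows starting at 25 and 20, newest first. When hop equals size, the hopping case reduces to the tumbling one.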

Phase 3: Distributed & ML-Augmented Analytics (Status: Completed ✅)

  • Columnar execution engine with vectorized operator pipeline (analytics/columnar_execution.cpp)
  • LLVM-JIT compilation for hot aggregation paths (analytics/jit_aggregation.cpp): hot-path detection and template-specialised aggregation dispatch; LLVM MCJIT backend reserved behind THEMIS_HAS_LLVM_JIT compile flag (Issue: #1482)
  • Distributed analytics sharding across cluster nodes (Issue: #1483)
  • Predictive analytics and time-series forecasting integration (Issue: #1484)
  • AutoML integration for automated model selection
  • Model serving and online inference pipeline (analytics/model_serving.cpp) (Issue: #1477)
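
As a small illustration of the forecasting integration, the sketch below computes a one-step-ahead forecast with simple exponential smoothing, the most basic of the listed methods. The function name expSmoothingForecast and its error handling are hypothetical. The module's own fit()/predict() API in analytics/forecasting.cpp is not reproduced here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// One-step-ahead forecast: level_t = alpha * x_t + (1 - alpha) * level_{t-1},
// initialized with the first observation (a common convention, assumed here).
double expSmoothingForecast(const std::vector<double>& series, double alpha) {
    if (series.empty()) {
        throw std::invalid_argument("series must not be empty");
    }
    if (!(alpha > 0.0 && alpha <= 1.0)) {
        throw std::invalid_argument("alpha must be in (0, 1]");
    }
    double level = series.front();
    for (std::size_t i = 1; i < series.size(); ++i) {
        level = alpha * series[i] + (1.0 - alpha) * level;
    }
    return level;  // flat forecast for the next step
}
```

A larger alpha weights recent observations more heavily; alpha = 1 simply repeats the last value.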

Production Readiness Checklist

  • Unit tests (OLAP, Arrow export, process mining, NLP, diff engine, forecasting)
  • Unit test coverage > 80% (test files added for all Phase 2 components; all three Phase 2 test suites active in CI)
  • Integration tests (query module, index module, CDC)
  • CEP engine integration tests (tests/analytics/test_cep_engine.cpp) — including stateful checkpoint lifecycle (StatefulCheckpointPreservesPartialMatches, CheckpointWithNoPartialMatchesIsClean)
  • Forecasting unit tests (tests/analytics/test_forecasting.cpp) — TimeSeries, all five algorithms, fit/predict/evaluate/decompose, serialize/deserialize, edge cases
  • Anomaly detection unit tests (tests/analytics/test_anomaly_detection.cpp) — all 6 algorithms, streaming, serialize round-trip
  • AutoML unit tests (tests/analytics/test_automl.cpp) — classification, regression, feature engineering, ensemble, SHAP, serialize
  • Distributed analytics unit tests (tests/analytics/test_distributed_analytics.cpp) — shard management, scatter-gather, partial failure
  • Process pattern matcher unit tests (tests/analytics/test_process_pattern_matcher.cpp) — graph/vector/behavioral/hybrid similarity, conformance
  • Arrow export + analytics_export unit tests (tests/analytics/test_arrow_export.cpp) — RecordBatch, JSON/CSV, optional Parquet/Feather/IPC, sanitization
  • Process mining LLM integration tests (tests/analytics/test_process_mining_llm.cpp) — conformance, compliance rules, fraud detection, activity prediction
  • Standalone focused test targets registered in tests/CMakeLists.txt for all 14 analytics test files
  • All analytics sources registered in cmake/CMakeLists.txt and cmake/ModularBuild.cmake
  • Arrow Flight RPC (analytics/arrow_flight.cpp) — in-process + optional native gRPC transport (Issue: #1472)
  • Performance benchmarks (OLAP, export, process mining, graph, NLP)
  • Security audit (LLM API key handling, data export sanitization)
  • Documentation complete (API docs, OLAP guide, process mining guide)
  • API stability guaranteed for OLAP, export, and process mining

Known Issues & Limitations

  • NLP text analyzer uses rule-based approaches — not suitable as a replacement for full NLP frameworks
  • LLM analyzer requires external API keys; responses are non-deterministic
  • Arrow-dependent formats (Parquet, Feather, IPC) require compile-time flag THEMIS_HAS_ARROW
  • Advanced graph analytics algorithms (betweenness centrality, Louvain community detection) are now implemented as AQL functions in include/query/functions/graph_extensions.h

Breaking Changes

  • Arrow export format options may expand in v1.7.0 (additive, non-breaking)

See Also