Skip to content

Latest commit

 

History

History
812 lines (636 loc) · 28.3 KB

File metadata and controls

812 lines (636 loc) · 28.3 KB

ThemisDB Storage Module

Module Purpose

The Storage module provides ThemisDB's persistent data layer, built on RocksDB for high-performance LSM-tree storage with MVCC support. It handles all aspects of data persistence including key-value storage, blob management, backup/recovery, compression, encryption, and transaction coordination.

Relevant Interfaces

Interface / File Role
rocksdb_wrapper.cpp RocksDB wrapper with MVCC, WAL, BlobDB, and async I/O
mvcc_store.cpp Multi-version concurrency control snapshot management
wal_storage.cpp Write-Ahead Log management and replay
backup_manager.cpp Backup creation and point-in-time recovery
pitr_manager.cpp Point-in-time recovery via WAL replay and snapshot restore
storage_engine.cpp Storage engine with dependency injection (storage_engine.h public API)
key_schema.cpp Unified multi-model key encoding (relational/doc/graph/vector/timeseries)
base_entity.cpp Base type for all storage-layer entities
batch_write_optimizer.cpp Adaptive write batching to reduce write amplification
compaction_manager.cpp Manual and scheduled RocksDB compaction control
compression_strategy.cpp Pluggable per-table compression (Snappy, Zstd, LZ4, Brotli)
compressed_storage.cpp Transparent compression/decompression layer
columnar_format.cpp Columnar storage for analytical workloads
blob_backend_filesystem.cpp Local filesystem blob backend
blob_backend_s3.cpp Amazon S3 blob backend
blob_backend_azure.cpp Azure Blob Storage backend
blob_backend_gcs.cpp Google Cloud Storage blob backend (requires THEMIS_ENABLE_GCS)
blob_backend_webdav.cpp WebDAV blob backend
blob_redundancy_manager.cpp RAID-1 mirror redundancy across multiple backends
database_connection_manager.cpp Connection pooling and lifecycle management
disk_space_monitor.cpp Real-time disk quota monitoring and alerting
index_maintenance.cpp Background index rebuild, optimize, and consistency checks
history_manager.cpp Version history and change tracking per key
hlc.cpp Hybrid Logical Clock for causally consistent timestamps
merge_operators.cpp Custom RocksDB merge operators (counters, list appends)
raft_mvcc_bridge.cpp Integration between Raft consensus log and MVCC storage
security_signature.cpp Field-level AES-GCM encryption primitives
security_signature_manager.cpp HMAC-SHA256 tamper detection and signature management
storage_audit_logger.cpp Structured audit trail for all storage operations
tiered_storage.cpp Hot/warm/cold tiered storage with automatic data migration
transaction_retry_manager.cpp Exponential backoff retry for failed transactions
nlp_metadata_extractor.cpp Automatic metadata extraction for ingested documents

Scope

In Scope:

  • LSM-tree key-value storage (RocksDB wrapper and configuration)
  • Multi-model key schema (relational, document, graph, vector, timeseries)
  • Large object storage (BlobDB and external blob backends)
  • Backup and point-in-time recovery (PITR)
  • Compression strategies and columnar formats
  • Field-level encryption and security signatures
  • Index maintenance and optimization
  • Transaction management and retry logic
  • Storage engine abstraction with dependency injection

Out of Scope:

  • Query parsing and execution (handled by query module)
  • Network protocols and APIs (handled by server module)
  • Authentication and authorization (handled by auth module)
  • Specific application logic (handled by higher layers)

Key Components

RocksDBWrapper

Location: rocksdb_wrapper.cpp, ../include/storage/rocksdb_wrapper.h

High-level wrapper around RocksDB TransactionDB providing MVCC, WAL, and BlobDB integration.

Features:

  • MVCC Support: Multi-version concurrency control via RocksDB transactions
  • LSM-Tree Tuning: Configurable memtable size, block cache, compaction
  • BlobDB Integration: Automatic offloading of large values (>4KB by default)
  • WAL Management: Write-ahead logging with separate directory support
  • Multi-Path SSTables: Distribute SSTable files across multiple NVMe drives
  • Read-Only Mode: Safe read access without WAL updates (v1.4.0+)
  • Async I/O: Prefetching for 2-5x scan performance improvement (v1.3.0+)
  • CPU Prefetch Hints: Software prefetch for random access patterns (v1.4.1+)

Thread Safety:

  • Read-safe: Multiple concurrent readers
  • Write-safe: Internal locking for concurrent writes
  • Iterator-safe: Reference counting prevents use-after-free
  • NOT move-safe: Move only during initialization/teardown

Configuration Example:

RocksDBWrapper::Config config;
config.db_path = "./data/rocksdb";
config.wal_dir = "./data/wal";           // Separate WAL directory
config.memtable_size_mb = 512;            // 512MB memtable
config.block_cache_size_mb = 1024;        // 1GB block cache
config.enable_blobdb = true;              // Enable BlobDB
config.blob_size_threshold = 4096;        // Files >4KB go to BlobDB
config.max_background_jobs = 4;           // Compaction threads
config.enable_async_io = true;            // Enable async I/O
config.async_io_readahead_size_mb = 128;  // 128MB readahead

auto db = std::make_unique<RocksDBWrapper>(config);
db->open();

Performance Characteristics:

  • Point reads: 10-50μs (with cache)
  • Sequential scans: 100K-500K keys/sec
  • Writes: 50K-200K ops/sec (with WAL)
  • Write amplification: 10-30x (LSM-tree characteristic)
  • Space amplification: 1.2-2.0x (depends on compaction)

Key Schema System

Location: key_schema.cpp, ../include/storage/key_schema.h

Unified key encoding scheme supporting all data models in a single RocksDB instance.

Key Formats (v1.5.0+):

Relational:  rel:table_name:pk_value
Document:    doc:collection_name:pk_value
Graph Node:  node:pk_value
Graph Edge:  edge:pk_value
Vector:      vec:object_name:pk_value
Timeseries:  ts:series_name:timestamp:pk_value

Secondary Index:  idx:table_name:field_name:field_value:pk_value
Graph Index:      gidx:from_id:edge_type:to_id

Benefits:

  • Single RocksDB instance for all data models
  • Efficient prefix scans per table/collection
  • Natural ordering for range queries
  • Simple backup (entire key space)

Usage:

// Encode a key
std::string key = KeySchema::encodeRelationalKey("users", "user123");

// Decode a key
auto [model, table, pk] = KeySchema::decodeKey(key);

// Iterate over a collection
auto it = db->createIterator();
it->seek(KeySchema::collectionPrefix("users"));
while (it->valid() && it->key().starts_with(KeySchema::collectionPrefix("users"))) {
    // Process document
    it->next();
}

Blob Storage System

BlobStorageManager

Location: See ../include/storage/blob_storage_manager.h

Orchestrates multiple blob storage backends with automatic selection based on size.

Selection Strategy:

< inline_threshold (default: 1KB)        → INLINE (store in RocksDB value)
< rocksdb_blob_threshold (default: 1MB)  → ROCKSDB_BLOB (BlobDB)
>= rocksdb_blob_threshold                → External backend (S3/Azure/Filesystem/WebDAV)

Supported Backends:

Backend Location Use Case
INLINE RocksDB value Small data (<1KB)
ROCKSDB_BLOB BlobDB (.blob files) Medium files (1KB-1MB)
FILESYSTEM Local disk Large files, development
S3 AWS S3 Production, distributed storage
AZURE_BLOB Azure Blob Storage Azure deployments
WEBDAV WebDAV server Custom storage, NextCloud/OwnCloud
GCS Google Cloud Storage GCP deployments

BlobRef Structure:

struct BlobRef {
    std::string blob_id;        // Unique identifier
    BlobStorageType type;       // Storage backend type
    std::string uri;            // Backend-specific URI
    size_t size_bytes;          // Blob size
    std::string sha256_hash;    // Integrity check
    std::string compression;    // Compression algorithm
    std::map<std::string, std::string> metadata;  // Custom metadata
};

Usage:

BlobStorageConfig config;
config.enable_s3 = true;
config.s3_bucket = "my-bucket";
config.inline_threshold_bytes = 1024;
config.rocksdb_blob_threshold_bytes = 1024 * 1024;

BlobStorageManager manager(config);
manager.registerBackend(BlobStorageType::S3, std::make_shared<S3Backend>(s3_config));

// Store blob (automatic backend selection)
std::vector<uint8_t> data = /* ... */;
BlobRef ref = manager.put("blob-123", data);

// Retrieve blob
auto result = manager.get(ref);

Blob Redundancy Manager

Location: blob_redundancy_manager.cpp

RAID-like redundancy across multiple blob backends for high availability.

Redundancy Modes:

  • Mirror (RAID-1): Write to multiple backends, read from any
  • Erasure Coding: Future enhancement for space-efficient redundancy

Example:

BlobRedundancyManager redundancy;
redundancy.addBackend(s3_backend);
redundancy.addBackend(azure_backend);
redundancy.addBackend(filesystem_backend);

// Automatically writes to all backends
redundancy.putWithRedundancy("blob-123", data);

// Reads from first available backend
auto result = redundancy.getWithRedundancy("blob-123");

Storage Engine

Location: storage_engine.cpp, ../include/storage/storage_engine.h

High-level storage abstraction with dependency injection for query evaluation, encryption, and indexing.

Dependency Injection Pattern:

class StorageEngine : public IStorageEngine {
public:
    StorageEngine(
        IExpressionEvaluatorPtr evaluator,    // Query filtering
        IFieldEncryptionPtr encryption,       // Field-level encryption
        IKeyProviderPtr key_provider,         // Key management
        IIndexManagerPtr index_manager        // Index coordination
    );
    
    // Storage operations
    Result<void> put(const std::string& key, const std::string& value);
    Result<std::string> get(const std::string& key);
    Result<void> del(const std::string& key);
    
    // Filter-aware operations
    bool apply_filter(const std::string& filter_expr, const void* context);
    
    // Encryption operations
    std::vector<uint8_t> encrypt_field(const std::string& field_name, 
                                        const std::vector<uint8_t>& plaintext);
};

Production Mode Safety:

export THEMIS_PRODUCTION_MODE=1
# Prevents use of default (no-op) encryption/key providers
# Throws error if insecure defaults are detected

Factory Method (for backward compatibility):

// Creates StorageEngine with default implementations
// WARNING: Not production-safe! Use DI constructor instead.
auto storage = StorageEngine::createDefault();

Backup & Recovery

Backup Manager

Location: backup_manager.cpp, ../include/storage/backup_manager.h

Incremental backup system with versioning and validation.

Features:

  • Incremental backups (only changed SSTables)
  • Full backups on demand
  • Backup verification with checksums
  • Restore to specific backup ID
  • Automatic cleanup of old backups

Usage:

BackupManager backup(db);

// Create incremental backup
backup.createBackup(false);  // false = incremental

// Create full backup
backup.createBackup(true);   // true = full

// List available backups
auto backups = backup.listBackups();

// Restore from backup
backup.restoreFromBackup(backup_id, restore_path);

// Cleanup old backups (keep last N)
backup.cleanupOldBackups(keep_count);

Point-in-Time Recovery (PITR) Manager

Location: pitr_manager.cpp, ../include/storage/pitr_manager.h

Snapshot-based point-in-time recovery with configurable retention.

Features:

  • Automatic snapshot creation
  • Restore to any past timestamp
  • Configurable snapshot retention policy
  • WAL-based recovery between snapshots

Usage:

PITRManager pitr(db);

// Create snapshot
pitr.createSnapshot();

// Restore to timestamp
auto timestamp = std::chrono::system_clock::now() - std::chrono::hours(24);
pitr.restoreToTimestamp(timestamp);

// List available snapshots
auto snapshots = pitr.listSnapshots();

Compression & Optimization

Compression Strategy

Location: compression_strategy.cpp, ../include/storage/compression_strategy.h

Pluggable compression algorithms for different data types.

Supported Algorithms:

  • None: No compression (fast access)
  • Snappy: Fast compression/decompression
  • Zstd: High compression ratio
  • LZ4: Ultra-fast compression
  • Brotli: Web-optimized compression

Per-Table Configuration:

CompressionStrategy strategy;
strategy.setTableCompression("users", CompressionType::SNAPPY);
strategy.setTableCompression("logs", CompressionType::ZSTD);
strategy.setTableCompression("cache", CompressionType::NONE);

Columnar Format

Location: columnar_format.cpp, ../include/storage/columnar_format.h

Columnar storage for analytical workloads.

Benefits:

  • Better compression (similar values together)
  • Faster scans (read only needed columns)
  • Vectorized processing support

Usage:

ColumnarFormat columnar;
columnar.writeColumnar(table_name, rows);
auto result = columnar.scanColumn(table_name, column_name, predicate);

Batch Write Optimizer

Location: batch_write_optimizer.cpp, ../include/storage/batch_write_optimizer.h

Automatic batching of writes to reduce write amplification.

Strategies:

  • Group writes by key prefix
  • Merge operations to same key
  • Delay flushes to accumulate writes
  • Adaptive batch sizes based on load

Security Components

Field Encryption

Location: security_signature.cpp, ../include/storage/security_signature.h

Field-level encryption before storage.

Features:

  • Per-field encryption configuration
  • AES-GCM encryption
  • Key rotation support
  • Selective field encryption

Security Signature Manager

Location: security_signature_manager.cpp, ../include/storage/security_signature_manager.h

Digital signatures for data integrity.

Features:

  • HMAC-SHA256 signatures
  • Signature verification on read
  • Tamper detection
  • Key management integration

Additional Components

Database Connection Manager

Location: database_connection_manager.cpp

Connection pooling and lifecycle management for database connections.

Disk Space Monitor

Location: disk_space_monitor.cpp

Real-time disk space monitoring with quota enforcement.

Index Maintenance

Location: index_maintenance.cpp

Background index rebuilding, optimization, and consistency checks.

Transaction Retry Manager

Location: transaction_retry_manager.cpp

Automatic retry logic for failed transactions with exponential backoff.

Merge Operators

Location: merge_operators.cpp

Custom RocksDB merge operators for efficient counter updates and list appends.

Architecture

Layered Architecture

┌───────────────────────────────────────────────────────────────┐
│                    Storage Engine API                          │
│  (High-level abstraction with dependency injection)           │
└───────────────────────────────────────────────────────────────┘
                              ↓
┌───────────────────────────────────────────────────────────────┐
│                    Key Schema Layer                            │
│  (Multi-model key encoding: rel:, doc:, node:, vec:, etc.)   │
└───────────────────────────────────────────────────────────────┘
                              ↓
┌───────────────────────────────────────────────────────────────┐
│                   RocksDB Wrapper Layer                        │
│  (MVCC, WAL, BlobDB, Transactions, Iterators)                │
└───────────────────────────────────────────────────────────────┘
                              ↓
┌───────────────────────────────────────────────────────────────┐
│                      RocksDB Engine                            │
│  (LSM-Tree, Compaction, MemTable, SSTables)                   │
└───────────────────────────────────────────────────────────────┘
                              ↓
┌────────────────┬────────────────────────┬─────────────────────┐
│  Local Disk    │   BlobDB (.blob files) │  External Blobs     │
│  (SSTables)    │   (large values)       │  (S3/Azure/WebDAV)  │
└────────────────┴────────────────────────┴─────────────────────┘

Data Flow

Write Path:

1. Application writes key-value
2. StorageEngine encrypts fields (if configured)
3. Key Schema encodes key (e.g., "doc:users:user123")
4. RocksDBWrapper writes to memtable
5. WAL logs write for durability
6. Memtable flush to L0 SSTable
7. Background compaction to L1-L6
8. Large values offloaded to BlobDB/S3
9. Backup Manager captures incremental changes

Read Path:

1. Application requests key
2. Key Schema encodes lookup key
3. RocksDBWrapper checks block cache
4. If miss, check memtable
5. If miss, check SSTables (L0-L6)
6. If BlobRef, fetch from blob storage
7. StorageEngine decrypts fields
8. Return value to application

Thread Safety Model

RocksDBWrapper:

  • Multiple concurrent readers (thread-safe)
  • Multiple concurrent writers (internal mutex)
  • Iterator reference counting (prevents use-after-free)
  • Move/copy not thread-safe (initialization only)

BlobStorageManager:

  • Thread-safe backend registration
  • Thread-safe put/get operations
  • Internal mutex for backend map

StorageEngine:

  • Thread-safe operations (delegates to RocksDBWrapper)
  • Dependency injection during construction only

Integration Points

With Core Module

Uses ConcernsContext for observability:

storage.setConcerns(concerns_context);
// Enables logging, tracing, metrics for storage operations

With Query Module

Provides IExpressionEvaluator interface for query filtering:

StorageEngine storage(query_evaluator, encryption, keys, index_manager);
// Query engine evaluates WHERE clauses during scans

With Index Module

Coordinates with index manager for secondary indexes:

storage.put("doc:users:user123", data);
// Automatically updates secondary indexes via IIndexManager

With Security Module

Integrates field-level encryption:

StorageEngine storage(evaluator, field_encryption, key_provider, index_manager);
// Automatically encrypts sensitive fields before storage

API/Usage Examples

Basic Storage Operations

#include "storage/storage_engine.h"
#include "storage/rocksdb_wrapper.h"

// Create RocksDB wrapper
RocksDBWrapper::Config config;
config.db_path = "./data";
config.enable_blobdb = true;

auto db = std::make_unique<RocksDBWrapper>(config);
db->open();

// Create storage engine with dependencies
auto storage = std::make_shared<StorageEngine>(
    expression_evaluator,
    field_encryption,
    key_provider,
    index_manager
);

// Store data
storage->put("doc:users:user123", R"({"name":"Alice","email":"alice@example.com"})");

// Retrieve data
auto result = storage->get("doc:users:user123");
if (result) {
    std::cout << "User data: " << *result << "\n";
}

// Delete data
storage->del("doc:users:user123");

Transaction Example

// Begin transaction
auto tx = db->beginTransaction();

// Write operations
tx->put("account:alice", "balance:1000");
tx->put("account:bob", "balance:500");

// Transfer money
auto alice_balance = tx->get("account:alice");
auto bob_balance = tx->get("account:bob");
tx->put("account:alice", "balance:900");
tx->put("account:bob", "balance:600");

// Commit atomically
tx->commit();

Blob Storage Example

#include "storage/blob_storage_manager.h"

BlobStorageConfig config;
config.enable_s3 = true;
config.s3_bucket = "my-bucket";

BlobStorageManager blob_manager(config);
blob_manager.registerBackend(BlobStorageType::S3, s3_backend);

// Store large file
std::vector<uint8_t> file_data = readFile("document.pdf");
BlobRef ref = blob_manager.put("document-123", file_data);

// Store ref in RocksDB
storage->put("doc:documents:123", ref.serialize());

// Later, retrieve file
auto ref_str = storage->get("doc:documents:123");
BlobRef ref = BlobRef::deserialize(*ref_str);
auto file_data = blob_manager.get(ref);

Backup & Recovery Example

#include "storage/backup_manager.h"

BackupManager backup(db);

// Create daily backups
backup.createBackup(false);  // Incremental

// Disaster recovery
backup.restoreFromBackup(backup_id, "./restore");

Dependencies

Internal Dependencies

  • themis/base/interfaces: Storage interface definitions
  • core/concerns: Logging, tracing, metrics
  • utils/expected: Result types
  • utils/tracing: Tracing utilities

External Dependencies

  • RocksDB (required): LSM-tree storage engine
  • fmt (required): String formatting
  • spdlog (optional): Logging
  • AWS SDK (optional): S3 backend
  • Azure SDK (optional): Azure Blob backend
  • libcurl (optional): WebDAV backend

Build Configuration

# Link storage module
target_link_libraries(my_app themis-storage)

# Dependencies
find_package(RocksDB REQUIRED)
find_package(fmt REQUIRED)

# Optional blob backends
option(THEMIS_ENABLE_S3 "Enable S3 blob backend" ON)
option(THEMIS_ENABLE_AZURE "Enable Azure blob backend" ON)
option(THEMIS_ENABLE_WEBDAV "Enable WebDAV blob backend" ON)

Performance Characteristics

RocksDB Performance

  • Point reads: 10-50μs (cached), 100-500μs (disk)
  • Sequential scans: 100K-500K keys/sec
  • Writes: 50K-200K ops/sec (with WAL), 500K+ ops/sec (no WAL)
  • Transactions: 10K-50K tx/sec

Write Amplification

  • LSM-tree characteristic: 10-30x write amplification
  • Tuning:
    • Larger memtables → less write-amp (fewer flushes)
    • Universal compaction → less write-amp
    • Level-based compaction → better read performance

Space Amplification

  • LSM-tree characteristic: 1.2-2.0x space usage
  • Compaction reduces space over time
  • BlobDB: ~1.0x (minimal overhead)

Memory Usage

  • Memtable: 512MB default (configurable)
  • Block cache: 1GB default (configurable)
  • Write buffers: 2GB total across all CFs
  • Total: ~3.5GB minimum for production

Tuning Recommendations

See PERFORMANCE_TIPS.md for detailed tuning guide.

Known Limitations

  1. LSM-Tree Characteristics

    • High write amplification (10-30x)
    • Compaction CPU overhead
    • Point updates slower than key-value inserts
  2. RocksDB Constraints

    • No distributed transactions (single-node only)
    • No built-in replication (use external replication)
    • Limited secondary index support (manual management)
  3. Blob Storage

    • External backends require network latency
    • No automatic blob migration between backends
    • No erasure coding (mirroring only)
  4. Backup & Recovery

    • Backups are point-in-time (not continuous)
    • Restore requires downtime
    • No online backup verification
  5. Encryption

    • Field-level only (not full-database encryption)
    • No automatic key rotation
    • Performance overhead (5-15% for encryption)
  6. Default Implementations

    • Default StorageEngine factory uses no-op encryption
    • Production mode prevents insecure defaults
    • Must use DI constructor for production

Status

Production Ready (as of v1.5.0)

Stable Features:

  • RocksDB wrapper with MVCC
  • Multi-model key schema
  • BlobDB and external blob storage
  • Backup and PITR
  • Compression strategies
  • Transaction support

⚠️ Beta Features:

  • Columnar storage format
  • WebDAV blob backend
  • Automatic index maintenance
  • Async I/O optimizations

🔬 Experimental:

  • Distributed transactions (Raft-based)
  • Erasure coding for blob storage
  • GPU-accelerated compression
  • Tiered storage (hot/warm/cold)

Related Documentation

Quick Links

Contributing

When contributing to the storage module:

  1. Maintain thread safety guarantees
  2. Add performance tests for new features
  3. Update key schema documentation for new data types
  4. Test backup/recovery with new features
  5. Consider write/space amplification impact

For detailed contribution guidelines, see CONTRIBUTING.md.

See Also

Scientific References

  1. O'Neil, P., Cheng, E., Gawlick, D., & O'Neil, E. (1996). The Log-Structured Merge-Tree (LSM-tree). Acta Informatica, 33(4), 351–385. https://doi.org/10.1007/s002360050048

  2. Dong, S., Callaghan, M., Galanis, L., Borthakur, D., Savor, T., & Strum, M. (2017). Optimizing Space Amplification in RocksDB. Proceedings of the 8th Biennial Conference on Innovative Data Systems Research (CIDR). https://www.cidrdb.org/cidr2017/papers/p82-dong-cidr17.pdf

  3. Rosenblum, M., & Ousterhout, J. K. (1992). The Design and Implementation of a Log-Structured File System. ACM Transactions on Computer Systems, 10(1), 26–52. https://doi.org/10.1145/146941.146943

  4. Reed, D. P. (1978). Naming and Synchronization in a Decentralized Computer System (Doctoral dissertation, MIT). https://dspace.mit.edu/handle/1721.1/14965

  5. Graefe, G. (2010). A Survey of B-Tree Locking Techniques. ACM Transactions on Database Systems, 35(3), 16:1–16:26. https://doi.org/10.1145/1806907.1806908