Backup & Recovery System Documentation

Overview

ThemisDB's Backup & Recovery system protects data through full, incremental, and differential backups, integrity verification, and point-in-time recovery.

Features

Backup Types

1. Full Backup

Complete database snapshot using the RocksDB checkpoint API
Includes all data files and Write-Ahead Log (WAL) files
Foundation for incremental and differential backups
Duration: Medium (depends on database size)
Storage: Full database size

auto result = backup_mgr->createFullBackup("/backups");
if (result) {
    std::cout << "Backup created at: " << *result << std::endl;
} else {
    std::cerr << "Backup failed: " << result.error().message() << std::endl;
}

2. Incremental Backup

Captures only changes since the last backup (full or incremental)
Small storage footprint
Fast execution
Duration: Fast
Storage: Small (only changed data)

auto result = backup_mgr->createIncrementalBackup("/backups");

3. Differential Backup

Captures changes since the last full backup
Medium storage footprint
Faster restore than incremental chain
Duration: Fast
Storage: Medium (accumulated changes since full backup)

auto result = backup_mgr->createDifferentialBackup("/backups");
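The storage difference between the three backup types can be made concrete with a little arithmetic. The sketch below is a simplified model, not part of the ThemisDB API: it assumes a constant daily change volume and no overlap between the changes of different days, and the function names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical model: one full backup of `full_gb`, followed by `days`
// daily backups, each day producing `daily_change_gb` of new changes
// that do not overlap with the changes of other days.

// Incremental: each daily backup stores only that day's changes.
constexpr std::uint64_t incrementalChainGb(std::uint64_t full_gb,
                                           std::uint64_t daily_change_gb,
                                           std::uint64_t days) {
    return full_gb + daily_change_gb * days;
}

// Differential: the backup on day k stores all changes since the full
// backup (k * daily_change_gb), so the total is a triangular sum.
constexpr std::uint64_t differentialChainGb(std::uint64_t full_gb,
                                            std::uint64_t daily_change_gb,
                                            std::uint64_t days) {
    return full_gb + daily_change_gb * days * (days + 1) / 2;
}
```

Under these assumptions, a 100 GB full backup followed by six daily backups at 2 GB of changes per day needs 112 GB with incrementals and 142 GB with differentials.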

Backup Verification

All backups include integrity verification using:

SHA-256 checksums for data validation
Manifest file validation
Structure verification (checkpoint, WAL, metadata)
RAID5/6 shard completeness checks

auto result = backup_mgr->verifyBackup(backup_path);
if (result) {
    std::cout << "Backup integrity verified" << std::endl;
}

Compression

Backups can be compressed to save storage space:

// Compress a backup
auto compressed = backup_mgr->compressBackup(backup_path);
if (compressed) {
    std::cout << "Compressed to: " << *compressed << std::endl;
}

// Decompress for restore
auto decompressed = backup_mgr->decompressBackup(compressed_file, dest_dir);

WAL Archiving

Continuous archiving of Write-Ahead Log files for point-in-time recovery:

auto result = backup_mgr->archiveWAL("/wal_archive");

Backup Strategies

Strategy 1: Full + Incremental

Weekly full backup
Daily incremental backups
Pros: Minimal backup time, small storage
Cons: Longer restore time (need to apply all incrementals)

Strategy 2: Full + Differential

Weekly full backup
Daily differential backups
Pros: Faster restore (only need full + latest differential)
Cons: Larger backup sizes over time

Strategy 3: Full Only

Daily full backups
Pros: Simplest restore process
Cons: Highest storage and time requirements
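The restore-time difference between the strategies comes down to how many backups must be applied. The following sketch computes that chain for a backup history; the types and the function restoreChain are hypothetical, and the sketch assumes each history follows a single strategy (full plus incrementals, or full plus differentials).

```cpp
#include <algorithm>
#include <string>
#include <vector>

enum class BackupType { Full, Incremental, Differential };

struct BackupEntry {
    std::string name;
    BackupType type;
};

// Given backups ordered oldest to newest, returns the backups that must be
// applied, in order, to restore the state of the newest one.
// Returns an empty chain if no full backup is available as a base.
std::vector<BackupEntry> restoreChain(const std::vector<BackupEntry>& history) {
    std::vector<BackupEntry> chain;
    bool skip_to_full = false;
    bool found_full = false;
    // Walk backwards from the newest backup until a full backup is reached.
    for (auto it = history.rbegin(); it != history.rend(); ++it) {
        if (it->type == BackupType::Full) {
            chain.push_back(*it);
            found_full = true;
            break;
        }
        if (skip_to_full) {
            continue;  // already covered by a newer differential
        }
        chain.push_back(*it);
        if (it->type == BackupType::Differential) {
            skip_to_full = true;  // a differential only needs its full backup
        }
    }
    if (!found_full) {
        return {};
    }
    std::reverse(chain.begin(), chain.end());
    return chain;
}
```

With one full backup and three incrementals the chain has four entries; with one full backup and three differentials it has two (the full backup and the newest differential).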

RAID5/6 Support

For RAID5/6 configurations, the backup system ensures all shards (data + parity) are included:

// Automatic RAID detection
auto raid_config = BackupManager::detectRAIDConfiguration();

// Verify RAID backup completeness
auto result = backup_mgr->isBackupComplete(backup_path, raid_config);

Important: For RAID5/6, ALL shards must be backed up together to ensure complete data recovery.
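A completeness check of this kind can be pictured as comparing the shard directories that are present against the ones that are expected. The sketch below is illustrative and is not the isBackupComplete implementation: the function missingShards is hypothetical, and it assumes data shards named shard_0, shard_1, ... plus a single shard_parity directory, as in the directory layout shown in this document.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns the expected shard directories that are not
// in `present`. Assumes `data_shards` data shards plus one parity directory.
std::vector<std::string> missingShards(std::size_t data_shards,
                                       const std::set<std::string>& present) {
    std::vector<std::string> expected;
    for (std::size_t i = 0; i < data_shards; ++i) {
        expected.push_back("shard_" + std::to_string(i));
    }
    expected.push_back("shard_parity");

    std::vector<std::string> missing;
    for (const auto& name : expected) {
        if (present.count(name) == 0) {
            missing.push_back(name);
        }
    }
    return missing;
}
```

An empty result means every expected shard directory was found; any entry in the result names a shard that would have to be backed up again before the backup can be treated as complete.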

Recovery Procedures

Full Database Recovery

auto result = backup_mgr->restoreFromBackup(backup_path);
if (result) {
    std::cout << "Database restored successfully" << std::endl;
}

Point-in-Time Recovery (PITR)

PITR is handled by the PITRManager class (see include/storage/pitr_manager.h):

PITRManager pitr(db, changefeed, snapshot_mgr);

// Restore to a specific sequence number
auto seq_result = pitr.restoreToSequence(target_seq);

// Restore to a timestamp (milliseconds)
auto ts_result = pitr.restoreToTimestamp(timestamp_ms);

// Restore to a named snapshot
auto tag_result = pitr.restoreToTag("before_migration");
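A common way to implement point-in-time recovery is to start from the newest snapshot taken at or before the target time and then replay logged changes up to the target. Whether PITRManager works this way is not stated here, so the sketch below only illustrates the snapshot-selection step; the Snapshot struct and baseSnapshotFor are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <optional>
#include <vector>

struct Snapshot {
    std::uint64_t sequence;     // sequence number at snapshot time
    std::int64_t timestamp_ms;  // wall-clock time of the snapshot
};

// Returns the newest snapshot taken at or before target_ms, or nullopt if
// every snapshot is newer than the target.
// Precondition: `snapshots` is sorted by timestamp_ms in ascending order.
std::optional<Snapshot> baseSnapshotFor(const std::vector<Snapshot>& snapshots,
                                        std::int64_t target_ms) {
    // First snapshot strictly newer than the target.
    auto it = std::upper_bound(
        snapshots.begin(), snapshots.end(), target_ms,
        [](std::int64_t t, const Snapshot& s) { return t < s.timestamp_ms; });
    if (it == snapshots.begin()) {
        return std::nullopt;  // no snapshot old enough
    }
    return *std::prev(it);
}
```

Changes with sequence numbers above the selected snapshot's sequence and timestamps up to the target would then be replayed on top of it.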

Backup Directory Structure

backup_dir/
├── full_20260122_120000/
│   ├── checkpoint/          # RocksDB checkpoint data
│   ├── wal/                 # WAL files at checkpoint time
│   ├── raid_topology/       # RAID5/6: shard topology info
│   │   ├── shard_0/
│   │   ├── shard_1/
│   │   └── shard_parity/
│   └── MANIFEST.json        # Backup metadata
├── incr_20260123_120000/
│   ├── wal/                 # Incremental WAL files
│   └── MANIFEST.json
├── diff_20260124_120000/
│   ├── wal/                 # Differential WAL files
│   └── MANIFEST.json
└── latest -> full_20260122_120000/  # Symlink to latest backup
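The directory names in this layout encode the backup type and creation time. As an illustration, the sketch below parses names of the form <type>_<YYYYMMDD>_<HHMMSS>, inferred from the examples above; BackupName and parseBackupName are hypothetical and not part of the BackupManager API.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>

struct BackupName {
    std::string type;  // "full", "incr", or "diff"
    std::string date;  // YYYYMMDD
    std::string time;  // HHMMSS
};

// Parses "<type>_<YYYYMMDD>_<HHMMSS>"; returns nullopt for anything else
// (for example the "latest" symlink).
std::optional<BackupName> parseBackupName(const std::string& name) {
    const auto first = name.find('_');
    if (first == std::string::npos) return std::nullopt;
    const auto second = name.find('_', first + 1);
    if (second == std::string::npos) return std::nullopt;

    BackupName parsed{name.substr(0, first),
                      name.substr(first + 1, second - first - 1),
                      name.substr(second + 1)};

    const auto all_digits = [](const std::string& s) {
        return std::all_of(s.begin(), s.end(),
                           [](unsigned char c) { return std::isdigit(c) != 0; });
    };
    const bool known_type =
        parsed.type == "full" || parsed.type == "incr" || parsed.type == "diff";
    if (!known_type || parsed.date.size() != 8 || parsed.time.size() != 6 ||
        !all_digits(parsed.date) || !all_digits(parsed.time)) {
        return std::nullopt;
    }
    return parsed;
}
```

Because the date and time fields are fixed-width, names of the same type sort chronologically under plain string comparison.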

Error Handling

The backup system uses the Result<T> pattern for type-safe error handling:

auto result = backup_mgr->createFullBackup(dest_dir);
if (!result) {
    auto error = result.error();
    
    // Check error code
    if (error.code() == ErrorCode::ERR_STORAGE_DISK_FULL) {
        // Handle disk full
    }
    
    // Get error message with context
    std::cerr << "Error: " << error.message() << std::endl;
    
    // Get metadata for user guidance
    auto metadata = error.metadata();
    std::cerr << "Solution: " << metadata.solution << std::endl;
}
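For readers unfamiliar with the pattern, a Result type holds either a value or an error and forces the caller to check which one it has. The following is a minimal sketch built on std::variant, not ThemisDB's actual Result<T>; the Error class, its integer code, and fakeCreateFullBackup are hypothetical stand-ins.

```cpp
#include <string>
#include <utility>
#include <variant>

// Hypothetical error type with a numeric code and a message.
class Error {
public:
    Error(int code, std::string message)
        : code_(code), message_(std::move(message)) {}
    int code() const { return code_; }
    const std::string& message() const { return message_; }

private:
    int code_;
    std::string message_;
};

// Minimal Result: holds either a T (success) or an Error (failure).
template <typename T>
class Result {
public:
    Result(T value) : data_(std::move(value)) {}
    Result(Error error) : data_(std::move(error)) {}

    explicit operator bool() const { return std::holds_alternative<T>(data_); }
    const T& operator*() const { return std::get<T>(data_); }
    const Error& error() const { return std::get<Error>(data_); }

private:
    std::variant<T, Error> data_;
};

// Stand-in for a backup call: fails when the disk is full.
Result<std::string> fakeCreateFullBackup(bool disk_full) {
    if (disk_full) {
        return Error{1, "disk full"};
    }
    return std::string("/backups/full_20260122_120000");
}
```

Callers test the result in a condition, dereference it on success, and call error() on failure, which mirrors the usage shown above.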

Backup Error Codes

Error Code	Description	Severity
`ERR_BACKUP_CREATION_FAILED`	Backup creation failed	Error
`ERR_BACKUP_RESTORATION_FAILED`	Backup restoration failed	Error
`ERR_BACKUP_VERIFICATION_FAILED`	Backup integrity check failed	Warning
`ERR_BACKUP_NOT_FOUND`	Backup does not exist	Error
`ERR_BACKUP_INVALID_TYPE`	Unsupported backup type	Error
`ERR_BACKUP_INCOMPLETE`	Missing backup components	Critical
`ERR_BACKUP_COMPRESSION_FAILED`	Compression failed	Error
`ERR_BACKUP_DECOMPRESSION_FAILED`	Decompression failed	Error
`ERR_BACKUP_CHECKSUM_MISMATCH`	Checksum verification failed	Critical
`ERR_BACKUP_MANIFEST_CORRUPT`	Manifest file corrupted	Error
`ERR_BACKUP_WAL_ARCHIVE_FAILED`	WAL archiving failed	Error

Best Practices

1. Regular Testing

Test restore procedures regularly
Verify backup integrity after creation
Practice recovery scenarios

2. Backup Retention

Keep multiple generations of backups
Implement retention policies based on RPO/RTO requirements
Archive old backups offsite
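A simple count-based retention policy can be sketched as follows. The function backupsToPrune is hypothetical and assumes full-backup directory names of the form full_<YYYYMMDD>_<HHMMSS>, whose fixed-width timestamps sort chronologically as strings.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: given full-backup directory names, returns the names
// that should be deleted so that only the `keep` newest remain.
std::vector<std::string> backupsToPrune(std::vector<std::string> names,
                                        std::size_t keep) {
    std::sort(names.begin(), names.end());  // oldest first
    if (names.size() <= keep) {
        return {};
    }
    names.resize(names.size() - keep);  // drop the newest `keep` entries
    return names;
}
```

Before deleting a full backup, any incremental or differential backups that depend on it need to be removed or archived as well, since they cannot be restored without their base.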

3. Monitoring

Monitor backup success/failure
Track backup sizes and durations
Alert on backup verification failures

4. Security

Encrypt backups for sensitive data
Secure backup storage locations
Implement access controls

5. RAID Considerations

For RAID5/6, coordinate backup across all shards
Verify all shards are present in backups
Test cross-shard restoration

Performance Considerations

Backup Performance

Full backup: ~10-50 MB/s (depends on I/O)
Incremental/Differential: Typically much faster than a full backup (only WAL files are copied)
Compression: Adds CPU overhead but saves storage
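The throughput range above translates directly into a rough backup window. The sketch below is plain arithmetic under the assumption of sustained throughput and 1 GB = 1024 MB; estimateSeconds is a hypothetical helper.

```cpp
// Estimated seconds to copy `size_gb` gigabytes at a sustained
// `throughput_mb_s` megabytes per second (1 GB = 1024 MB).
constexpr double estimateSeconds(double size_gb, double throughput_mb_s) {
    return size_gb * 1024.0 / throughput_mb_s;
}
```

For a 100 GB database this gives about 2,048 seconds (roughly 34 minutes) at 50 MB/s and about 10,240 seconds (roughly 2.8 hours) at 10 MB/s.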

Restore Performance

Full restore: Similar to backup speed
Incremental chain: Slower (need to apply all changes)
Differential: Faster than incremental chain

Optimization Tips

Schedule backups during low-traffic periods
Use differential backups for faster restore
Parallel backup/restore is planned but not yet available (see Future Enhancements)
Compress backups if storage is limited
Use local SSD storage for the backup destination

Example: Complete Backup Workflow

#include <iostream>
#include <memory>

#include "storage/backup_manager.h"
#include "storage/rocksdb_wrapper.h"

// Initialize
auto db = std::make_shared<RocksDBWrapper>(config);
db->open();

auto backup_mgr = std::make_unique<BackupManager>(db);

// Weekly full backup
auto full_result = backup_mgr->createFullBackup("/backups");
if (!full_result) {
    std::cerr << "Full backup failed: " << full_result.error().message() << std::endl;
    return;
}

// Verify backup
auto verify_result = backup_mgr->verifyBackup(*full_result);
if (!verify_result) {
    std::cerr << "Backup verification failed: " << verify_result.error().message() << std::endl;
    return;
}

// Compress for storage
auto compress_result = backup_mgr->compressBackup(*full_result);
if (compress_result) {
    std::cout << "Backup compressed: " << *compress_result << std::endl;
}

// Daily incremental backup
auto incr_result = backup_mgr->createIncrementalBackup("/backups");

// List all backups
auto backups = backup_mgr->listBackups("/backups");
for (const auto& backup : backups) {
    std::cout << "Backup: " << backup << std::endl;
}

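// Restore if needed
// Assumes listBackups returns names ordered oldest to newest
if (disaster_occurred && !backups.empty()) {
    const auto& latest_backup = backups.back();
    auto restore_result = backup_mgr->restoreFromBackup(
        "/backups/" + latest_backup
    );

    if (restore_result) {
        std::cout << "Database restored successfully" << std::endl;
    } else {
        std::cerr << "Restore failed: " << restore_result.error().message() << std::endl;
    }
}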

Future Enhancements

Planned Features

Parallel backup/restore for faster performance
Backup deduplication to reduce storage
Cloud backup support (S3, Azure Blob, GCS)
Backup encryption
Automatic retention policy enforcement
Recovery time estimation
Backup catalog/metadata tracking
Cross-region replication
Snapshot-based backups

Configuration (Future)

backup:
  schedule:
    full: "0 2 * * 0"      # Weekly Sunday 2 AM
    incremental: "0 2 * * 1-6"  # Daily except Sunday
  
  retention:
    full: 4                 # Keep 4 full backups
    incremental: 30         # Keep 30 days of incrementals
  
  compression: true
  verify_after_backup: true
  
  destinations:
    - type: local
      path: /backup/local
    - type: s3
      bucket: themis-backups
      region: us-east-1

Troubleshooting

Backup Fails with Disk Full

# Check disk space
df -h /backups

# Clean old backups (list first and confirm the pattern matches only what you intend to delete)
ls -d /backups/old_*
rm -rf /backups/old_*

# Implement retention policy

Restore Fails with Checksum Mismatch

# Don't use this backup - it's corrupted
# Try previous backup
ls -lt /backups/

# Check for disk errors
dmesg | grep -i error

RAID Backup Incomplete

# Verify all shards are accessible
# Check RAID configuration
echo "$THEMIS_RAID_GROUP"
echo "$THEMIS_SHARDS"

# Ensure all shard nodes are running
# Create new coordinated backup

References

RocksDB Checkpoint Documentation
Write-Ahead Logging (WAL)
PITR Manager: include/storage/pitr_manager.h
Error Registry: include/utils/error_registry.h
Test Examples: tests/test_backup_manager_enhanced.cpp
