Database metadata and schema introspection for ThemisDB.
Manages the ThemisDB metadata catalog, providing schema management, collection metadata, type system definitions, and index metadata tracking.
In scope: Collection and schema metadata management, type system and field definitions, index metadata registry, metadata versioning.
Out of scope: Data storage (handled by storage module), index construction (handled by index module), configuration management (handled by config module).
| File | Description |
|---|---|
| `schema_manager.cpp` | Central schema discovery, table/property/index/relationship metadata |
| `statistics_collector.cpp` | Cardinality, selectivity, equi-height histograms |
| `information_schema.cpp` | SQL:2003-standard INFORMATION_SCHEMA views |
| `schema_version_manager.cpp` | Schema version tracking, diff, migration scripts |
| `schema_audit_log.cpp` | Durable per-table audit trail (RocksDB-persisted) |
| `schema_consistency_checker.cpp` | Background health scan for metadata consistency |
| `schema_constraints.cpp` | User-defined schema constraint validation |
| `column_lineage.cpp` | Column-level derivation DAG and provenance tracking |
| `er_diagram_exporter.cpp` | Cross-collection ER diagram export (Mermaid, DOT, JSON) |
| `catalog_exporter.cpp` | Apache Atlas and DataHub integration |
| `distributed_catalog.cpp` | Distributed metadata catalog across shards |
| `index_recommender.cpp` | Query-pattern-driven index usage and recommendation engine |
Maturity: 🟢 Production-Ready — All Phase 1–3 features shipped; schema introspection, statistics, changefeeds, adaptive TTL, audit log, consistency checker, ER diagram export, external catalog integration (Apache Atlas, DataHub), column lineage, distributed catalog, and the Schema API REST endpoint are production-ready as of v1.5.x.
- Schema Manager: Database self-awareness and schema discovery
- System Catalog: Metadata storage and retrieval
- Statistics Collector: Table and index statistics
- Information Schema: SQL-standard metadata views
- Metadata Cache: Performance-optimized metadata caching
- Column Lineage Tracker: Column-level derivation and data provenance
- Catalog Exporter: Publish schema metadata to Apache Atlas and DataHub
- Automatic table discovery: Scan RocksDB keys to find all tables
- Property type detection: Analyze stored entities for schema
- Index metadata: Collect index information from IndexManager
- Relationship discovery: Detect graph edges and foreign keys
- Thread-safe caching: Configurable TTL with read-heavy optimization
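Table discovery by key scanning can be sketched as follows. This is a minimal illustration, not the SchemaManager implementation: it assumes a hypothetical key layout of `<table>:<pk>` and uses `std::map` as a stand-in for RocksDB's ordered key space. The names `OrderedStore` and `discoverTables` are invented for the example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for an ordered key-value store such as RocksDB (keys sorted bytewise).
using OrderedStore = std::map<std::string, std::string>;

// Collect distinct table names from keys of the assumed form "<table>:<pk>".
// After finding a table, jump past all of its keys instead of visiting each one.
std::vector<std::string> discoverTables(const OrderedStore& store) {
    std::vector<std::string> tables;
    auto it = store.begin();
    while (it != store.end()) {
        const auto sep = it->first.find(':');
        if (sep == std::string::npos) {  // not an entity key; skip it
            ++it;
            continue;
        }
        std::string table = it->first.substr(0, sep);
        // ';' is the byte after ':', so "<table>;" sorts after every "<table>:..." key.
        auto next = store.lower_bound(table + ';');
        assert(next != it);  // the scan always moves forward
        it = next;
        tables.push_back(std::move(table));
    }
    return tables;
}
```

Seeking past each prefix keeps the cost roughly proportional to the number of tables rather than the number of keys, which is the property a real iterator-based scan would also want.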
- Table metadata: Names, types, row counts, storage info
- Column metadata: Names, types, nullability, indexes
- Index metadata: Names, types, columns, uniqueness
- Statistics: Cardinality, selectivity, data distribution
- Version tracking: Schema version and change history
- INFORMATION_SCHEMA views: SQL-standard metadata access
- System tables: `tables`, `columns`, `indexes`, `statistics`
- Metadata queries: Query metadata like regular data
- Integration with AQL: Use AQL to query schema
- Column-level derivation tracking: Record how each column was produced from source columns
- Transitive upstream/downstream traversal: BFS through the derivation DAG
- Provenance export: Structured JSON for compliance and audit
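The transitive upstream traversal can be illustrated with a short breadth-first search. The sketch below is an assumption about how such a graph might be modelled; `LineageGraph`, `addDerivation`, and `upstreamOf` are hypothetical names and not the API of `column_lineage.cpp`.

```cpp
#include <cassert>
#include <queue>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical lineage graph: each derived column maps to the columns it was computed from.
class LineageGraph {
public:
    void addDerivation(const std::string& target, const std::vector<std::string>& sources) {
        assert(!sources.empty());  // a derivation needs at least one source column
        auto& deps = upstream_[target];
        deps.insert(deps.end(), sources.begin(), sources.end());
    }

    // All direct and transitive sources of `column`, found by breadth-first search.
    std::set<std::string> upstreamOf(const std::string& column) const {
        std::set<std::string> seen;
        std::queue<std::string> frontier;
        frontier.push(column);
        while (!frontier.empty()) {
            const std::string current = frontier.front();
            frontier.pop();
            auto it = upstream_.find(current);
            if (it == upstream_.end()) continue;  // a base column with no recorded sources
            for (const auto& src : it->second) {
                if (seen.insert(src).second) frontier.push(src);  // enqueue each column once
            }
        }
        return seen;
    }

private:
    std::unordered_map<std::string, std::vector<std::string>> upstream_;
};
```

A downstream traversal would work the same way over the reversed edges.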
- Apache Atlas: Publish `rdbms_db`, `rdbms_table`, `rdbms_column` entities via the v2 bulk API
- DataHub: Emit `datasetProperties` and `schemaMetadata` MetadataChangeProposals per table
- Configurable auth: Basic auth (Atlas) or Bearer token (DataHub)
- No-network tests: Injectable HTTP function for unit testing without a live catalog
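The injectable HTTP function is what makes no-network tests possible. A rough sketch of the pattern, under stated assumptions: `HttpPost`, `SketchExporter`, and the `/entities` path are hypothetical and do not reflect the real `CatalogExporter` interface or the Atlas/DataHub endpoints.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical transport signature: (url, body) -> HTTP status code.
using HttpPost = std::function<int(const std::string& url, const std::string& body)>;

// Simplified stand-in for an exporter; not the real CatalogExporter interface.
class SketchExporter {
public:
    SketchExporter(std::string endpoint, HttpPost post)
        : endpoint_(std::move(endpoint)), post_(std::move(post)) {
        assert(post_ && "an HTTP function must be supplied");
    }

    // Sends one request per table and returns how many were accepted (2xx).
    int publish(const std::vector<std::string>& tables) const {
        int accepted = 0;
        for (const auto& table : tables) {
            const std::string body = "{\"name\":\"" + table + "\"}";
            const int status = post_(endpoint_ + "/entities", body);
            if (status >= 200 && status < 300) ++accepted;
        }
        return accepted;
    }

private:
    std::string endpoint_;
    HttpPost post_;
};
```

In a unit test, the caller passes a lambda that records the requests and returns canned status codes; in production, the same parameter would wrap a real HTTP client.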
- Metadata caching: Cache with configurable TTL (default 60s)
- Incremental updates: Only scan changed tables
- Lazy loading: Load metadata on demand
- Index statistics: Use for query optimization
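To make the TTL behaviour concrete, here is a minimal single-threaded sketch of a cache with lazy expiry. `TtlCache` is a hypothetical class written for this example; the 60-second default mirrors the documented default, and the injectable clock is an assumption that keeps expiry testable without sleeping.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Minimal TTL cache; the clock is injectable so expiry can be tested without sleeping.
template <typename Value>
class TtlCache {
public:
    using Clock = std::chrono::steady_clock;
    using NowFn = std::function<Clock::time_point()>;

    explicit TtlCache(std::chrono::seconds ttl = std::chrono::seconds(60),
                      NowFn now = [] { return Clock::now(); })
        : ttl_(ttl), now_(std::move(now)) {
        assert(ttl_.count() > 0);
    }

    void put(const std::string& key, Value value) {
        entries_.insert_or_assign(key, Entry{std::move(value), now_() + ttl_});
    }

    // Returns the cached value, or std::nullopt if it is missing or expired.
    std::optional<Value> get(const std::string& key) {
        auto it = entries_.find(key);
        if (it == entries_.end()) return std::nullopt;
        if (now_() >= it->second.expires_at) {
            entries_.erase(it);  // drop stale entries lazily, on access
            return std::nullopt;
        }
        return it->second.value;
    }

private:
    struct Entry {
        Value value;
        Clock::time_point expires_at;
    };
    std::chrono::seconds ttl_;
    NowFn now_;
    std::unordered_map<std::string, Entry> entries_;
};
```

A shorter TTL gives fresher metadata at the cost of more rescans; a longer one does the opposite.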
```
MetadataModule
 ├─→ SchemaManager (Primary metadata interface)
 │    ├─→ RocksDB Key Scanning (table discovery)
 │    ├─→ BaseEntity Parsing (property types)
 │    └─→ SecondaryIndexManager (index metadata)
 ├─→ StatisticsCollector (Table/index statistics)
 ├─→ SystemCatalog (Metadata persistence)
 ├─→ InformationSchema (SQL-standard views)
 ├─→ ColumnLineageTracker (Column-level derivation DAG)
 └─→ CatalogExporter (Apache Atlas & DataHub integration)
```
- List all tables and collections
- Discover schema without prior knowledge
- Generate documentation automatically
- IDE autocomplete and IntelliSense
- Use statistics for query planning
- Choose optimal indexes
- Estimate result set sizes
- Adaptive query execution
- Track schema changes over time
- Validate schema compatibility
- Generate migration scripts
- Enforce schema constraints
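A schema diff and a simple compatibility check might look like the sketch below. The `Schema` alias, `SchemaDiff`, and `diffSchemas` are hypothetical simplifications (a schema is reduced to a name-to-type map), and the compatibility rule shown is one possible policy rather than the one ThemisDB applies.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified schema: property name -> type name.
using Schema = std::map<std::string, std::string>;

struct SchemaDiff {
    std::vector<std::string> added;    // present only in the new version
    std::vector<std::string> removed;  // present only in the old version
    std::vector<std::string> retyped;  // present in both, with a different type

    // One possible compatibility rule: additions are safe; removals and type changes are not.
    bool backwardCompatible() const { return removed.empty() && retyped.empty(); }
};

SchemaDiff diffSchemas(const Schema& before, const Schema& after) {
    SchemaDiff diff;
    for (const auto& [name, type] : before) {
        auto it = after.find(name);
        if (it == after.end()) {
            diff.removed.push_back(name);
        } else if (it->second != type) {
            diff.retyped.push_back(name);
        }
    }
    for (const auto& entry : after) {
        if (before.find(entry.first) == before.end()) diff.added.push_back(entry.first);
    }
    // Sanity check: both versions must agree on the number of shared properties.
    assert(before.size() - diff.removed.size() == after.size() - diff.added.size());
    return diff;
}
```

Each entry in `added`, `removed`, or `retyped` corresponds to one step a generated migration script would need to cover.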
- Monitor table growth
- Track index usage
- Identify unused indexes
- Capacity planning
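Identifying unused indexes comes down to counting how often each index is chosen. The following is a minimal sketch with hypothetical names (`IndexUsageTracker`, `recordUse`, `unusedIndexes`); it is not the interface of `index_recommender.cpp`.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical usage tracker: counts how often the planner picks each index.
class IndexUsageTracker {
public:
    void registerIndex(const std::string& name) {
        assert(!name.empty());
        hits_.emplace(name, 0);  // keeps the existing count if already registered
    }

    void recordUse(const std::string& name) {
        auto it = hits_.find(name);
        if (it != hits_.end()) ++it->second;  // ignore indexes that were never registered
    }

    // Indexes that were registered but never used: candidates for review, not automatic removal.
    std::vector<std::string> unusedIndexes() const {
        std::vector<std::string> unused;
        for (const auto& [name, count] : hits_) {
            if (count == 0) unused.push_back(name);
        }
        return unused;
    }

private:
    std::map<std::string, std::uint64_t> hits_;
};
```

An index with zero hits over a short window may still serve a rare but important query, so the observation period matters before acting on the result.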
- Discovery time: <100ms for typical schemas (up to 100 tables)
- Cache hit rate: >90% expected
- Memory overhead: <50 MB for 100 tables
- Throughput: 1K+ metadata queries/second (cached)
- Collection time: 1-10 seconds per table (depends on size)
- Update frequency: Configurable (default: hourly)
- Storage overhead: ~1-5% of table size
- Accuracy: Sample-based (configurable sample size)
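For readers unfamiliar with equi-height histograms, the sketch below builds one from a sample and uses it for a coarse selectivity estimate. It is an illustration under simplifying assumptions (numeric values, whole-bucket granularity); `EquiHeightHistogram` and `buildHistogram` are hypothetical names, not the StatisticsCollector API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Equi-height histogram: every bucket covers roughly the same number of sampled values.
struct EquiHeightHistogram {
    std::vector<double> upper_bounds;  // inclusive upper bound of each bucket, ascending

    // Estimated fraction of rows with value <= x, at whole-bucket granularity.
    // The partially covered bucket is ignored, so the estimate errs on the low side.
    double selectivityLessEqual(double x) const {
        const auto full = std::upper_bound(upper_bounds.begin(), upper_bounds.end(), x)
                          - upper_bounds.begin();
        return static_cast<double>(full) / static_cast<double>(upper_bounds.size());
    }
};

EquiHeightHistogram buildHistogram(std::vector<double> sample, std::size_t buckets) {
    assert(buckets > 0 && sample.size() >= buckets);
    std::sort(sample.begin(), sample.end());
    EquiHeightHistogram hist;
    for (std::size_t b = 1; b <= buckets; ++b) {
        // Index of the last value in bucket b when the sample is split into equal parts.
        const std::size_t last = b * sample.size() / buckets - 1;
        hist.upper_bounds.push_back(sample[last]);
    }
    return hist;
}
```

Multiplying the estimated selectivity by the table's row count gives an estimated result size; more buckets or a larger sample tighten the estimate at the cost of collection time and storage.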
```cpp
#include <chrono>
#include <iostream>

#include "metadata/schema_manager.h"

using namespace themis;

// Create schema manager (db_wrapper and index_manager are assumed to exist already)
SchemaManager schema_mgr(db_wrapper, index_manager);

// Optionally configure cache TTL
schema_mgr.setCacheTTL(std::chrono::seconds(60));

// Get all tables
auto tables = schema_mgr.getAllTables();
for (const auto& table : tables) {
    std::cout << "Table: " << table.name
              << " Type: " << table.type
              << " Rows: " << table.estimated_row_count << std::endl;

    for (const auto& prop : table.properties) {
        std::cout << "  - " << prop.name
                  << " (" << prop.type << ")"
                  << (prop.indexed ? " [indexed]" : "") << std::endl;
    }
}
```

```cpp
#include <fstream>

// Export all metadata to JSON
nlohmann::json schema_json = schema_mgr.toJSON();

// Save to file
std::ofstream out("schema.json");
out << schema_json.dump(2);

// Or serve via REST API
http_response->json(schema_json);
```

```cpp
#include "metadata/catalog_exporter.h"

// Apache Atlas (example credentials; load real ones from configuration)
CatalogExporter::Config atlas_cfg;
atlas_cfg.type     = CatalogExporter::CatalogType::APACHE_ATLAS;
atlas_cfg.endpoint = "http://atlas-host:21000";
atlas_cfg.username = "admin";
atlas_cfg.password = "admin";

CatalogExporter atlas_exporter(atlas_cfg);
auto result = atlas_exporter.publishSchema(schema_mgr.getAllTables());
if (!result.success) {
    spdlog::error("Atlas publish failed: {}", result.error);
}

// DataHub
CatalogExporter::Config dh_cfg;
dh_cfg.type     = CatalogExporter::CatalogType::DATAHUB;
dh_cfg.endpoint = "http://datahub-gms:8080";
dh_cfg.token    = "my-token";

CatalogExporter dh_exporter(dh_cfg);
dh_exporter.publishSchema(schema_mgr.getAllTables());
```

```cpp
// Get specific table schema
auto table_schema = schema_mgr.getTableSchema("users");
if (table_schema.has_value()) {
    std::cout << "Table: " << table_schema->name << std::endl;
    std::cout << "Properties: " << table_schema->properties.size() << std::endl;
    std::cout << "Indexes: " << table_schema->indexes.size() << std::endl;
}
```

- Storage Layer: RocksDB key scanning for table discovery
- Index Module: Index metadata and statistics
- Query Module: Query optimization using statistics
- API Module: REST API for schema introspection
- MCP Module: Model Context Protocol integration
- Thread-safe with `std::shared_mutex` for read-heavy workloads
- Multiple concurrent readers supported
- Single writer for cache updates
- Safe for high-concurrency access
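The reader/writer pattern described above can be sketched with `std::shared_mutex` from the standard library. `RowCountRegistry` is a hypothetical class used only to show the locking discipline; it is not part of the metadata module.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>

// Hypothetical read-mostly registry guarded by a reader/writer lock.
class RowCountRegistry {
public:
    // Readers take a shared lock, so many lookups can run concurrently.
    std::optional<long> find(const std::string& table) const {
        std::shared_lock lock(mutex_);
        auto it = counts_.find(table);
        if (it == counts_.end()) return std::nullopt;
        return it->second;
    }

    // Writers take an exclusive lock, which waits for in-flight readers to finish.
    void update(const std::string& table, long rows) {
        assert(rows >= 0);
        std::unique_lock lock(mutex_);
        counts_[table] = rows;
    }

private:
    mutable std::shared_mutex mutex_;  // mutable so const readers can lock it
    std::map<std::string, long> counts_;
};
```

Returning a copy from `find` rather than a reference means callers never hold data that a later write could invalidate.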
- RocksDB: Key scanning and storage
- Secondary Index Manager: Index metadata
- nlohmann/json: JSON serialization
- spdlog: Logging
For detailed implementation documentation, see:
- v1.0.x: Initial schema manager with automatic discovery
- v1.1.x: Statistics collector and INFORMATION_SCHEMA views
- v1.2.x: Schema versioning, diff, and migration script generation
- v1.3.x: Real-time schema change notifications via changefeeds; adaptive TTL
- v1.4.x: Column lineage, Apache Atlas/DataHub integration, ER diagram export
- v1.5.x: Schema audit log, consistency checker, distributed catalog, full production readiness
```cpp
// List all table names
SchemaManager schema_mgr(db, idx_mgr);
auto tables = schema_mgr.getAllTables();
for (const auto& table : tables) {
    std::cout << table.name << std::endl;
}
```

```cpp
#include <algorithm>  // std::find_if

// Check whether a column exists
auto schema = schema_mgr.getTableSchema("users");
if (schema.has_value()) {
    auto it = std::find_if(
        schema->properties.begin(),
        schema->properties.end(),
        [](const auto& prop) { return prop.name == "email"; }
    );
    if (it != schema->properties.end()) {
        std::cout << "Column 'email' exists, type: " << it->type << std::endl;
    }
}
```

```cpp
// List the indexes of a table
auto schema = schema_mgr.getTableSchema("users");
if (schema.has_value()) {
    for (const auto& idx : schema->indexes) {
        std::cout << "Index: " << idx.name
                  << " Type: " << idx.type
                  << " Unique: " << (idx.unique ? "yes" : "no") << std::endl;
    }
}
```

- Set appropriate TTL: Balance freshness vs performance
- Invalidate on schema changes: Ensure cache consistency
- Warm cache on startup: Preload frequently accessed metadata
- Use cached access: Don't bypass cache unless necessary
- Batch metadata queries: Reduce overhead
- Monitor cache hit rate: Tune TTL based on metrics
- Track schema versions: Document changes over time
- Use migration scripts: Automate schema updates
- Test compatibility: Verify backward compatibility
- Header Documentation - Public API definitions
- Storage Module - Underlying storage layer
- Index Module - Index metadata
- Query Module - Query optimization
- Architecture Guide - Metadata architecture