IceFrame is a high-level Python library designed to simplify interactions with Apache Iceberg tables by providing a DataFrame-centric API. It bridges the gap between the low-level pyiceberg client and the user-friendly experience of polars and pandas.
The main entry point (iceframe.core.IceFrame). It initializes the connection to the Iceberg catalog and exposes all functionality through a unified API.
- Responsibility: Configuration, catalog management, facade for all operations.
Handles CRUD operations (iceframe.operations).
- Create: Supports creating tables from PyArrow schemas, Polars DataFrames, or dicts.
- Read: Scans Iceberg tables, applies filters (pushdown), and converts to Polars DataFrames.
- Write: Appends or overwrites data using PyIceberg's write support.
A fluent API for constructing complex queries (iceframe.query).
- Expression System: Unified expression builder (
iceframe.expressions) that translates to:- PyIceberg Expressions: For predicate pushdown to the scan level.
- Polars Expressions: For local processing (aggregations, window functions).
- Execution Engine: Orchestrates the scan and post-processing.
Modular components for specific capabilities:
- Namespace Management (
iceframe.namespace): Manage schemas/databases. - Schema Evolution (
iceframe.schema): Add/drop/rename/update columns. - Partition Management (
iceframe.partition): Manage partition specs. - Data Quality (
iceframe.quality): Data validation and constraints - Table Maintenance (
iceframe.maintenance): High-level interface for snapshot expiration and file removal. - Garbage Collection (
iceframe.gc): Native implementation of snapshot expiration and orphan file (data & metadata) removal. - Compaction (
iceframe.compaction): Bin-packing, sorting strategies, and manifest rewriting. - Export (
iceframe.export): Export data to Parquet, CSV, JSON. - Incremental Processing (
iceframe.incremental): Read only new data, CDC. - Visualization (
iceframe.visualization): Altair-based plotting - Query Optimization (
iceframe.query): Partition-pruned updates. - Ingestion (
iceframe.ingest): Multi-format support. - Branching (
iceframe.branching): Create branches and tag snapshots. - Views (
iceframe.views): Cross-engine view management. - Evolution (
iceframe.evolution): Partition spec evolution. - Procedures (
iceframe.procedures): Stored procedure interface. - Rollback (
iceframe.rollback): Snapshot rollback and management. - Catalog Ops (
iceframe.catalog_ops): Catalog-level operations. - Async Operations (
iceframe/async_ops.py): Non-blocking operations. - AI Agent (
iceframe/agent): Natural language interface with LLM integration. - MCP Server (
iceframe/mcp_server.py): Model Context Protocol server. - Pydantic Integration (
iceframe/pydantic.py): Schema conversion and data validation. - Notebook Magics (
iceframe/magics.py): IPython magic commands (%iceframe,%%iceql). - Bulk Ingestion (
iceframe/ingestion.py): Add existing files to tables. - Format Ingestion (
iceframe/ingest.py): Read CSV, JSON, Parquet, Avro, ORC, Delta, Lance, Excel, Google Sheets, Hudi, SQL, XML, SAS, SPSS, Stata, API, HuggingFace, HTML, Clipboard.
- Query Caching (
iceframe.cache): In-memory and disk-based result caching - Parallel Operations (
iceframe.parallel): Concurrent table operations - Distributed Processing (
iceframe.distributed): Ray-based distributed execution - SQL Execution (
iceframe.datafusion_ops): Apache DataFusion integration - Connection Pooling (
iceframe.pool): Catalog connection pooling - Memory Management (
iceframe.memory): Lazy reading and memory limits - Query Optimization (
iceframe.optimizer): Automatic query optimization - Monitoring (
iceframe.monitoring): Query metrics and observability - Streaming (
iceframe.streaming): Micro-batch, Kafka streaming, and Auto-Compaction - Data Skipping (
iceframe.skipping): File-level filtering - Federation (
iceframe.federation): Multi-catalog support
A command-line interface (iceframe.cli) built with typer for quick table inspection and management.
An interactive AI assistant (iceframe-chat) for natural language interaction with Iceberg tables.
- User Interaction: User calls
ice.read_table()orice.query(). - Catalog Interaction:
IceFrameusespyicebergto load the table metadata. - Predicate Pushdown: Filters are translated to PyIceberg expressions and passed to
table.scan(). - Data Retrieval:
pyicebergreads data files (Parquet/Avro) matching the filter and returns a PyArrow Table. - Local Processing: PyArrow Table is converted to a Polars DataFrame. Additional operations (aggregations, complex filters, joins) are applied locally.
- Result: A Polars DataFrame is returned to the user.
- DataFrame-First: All data input/output is handled via Polars DataFrames (or PyArrow/Pandas where compatible).
- Pushdown Optimization: Maximize predicate pushdown to minimize data transfer.
- Modularity: Features are isolated in separate modules to maintain clean code and testability.
- Developer Experience: Fluent APIs, type hinting, and comprehensive documentation.
- Async Support: Non-blocking operations for high-concurrency scenarios.