IceFrame


IceFrame is a high-level Python library designed to simplify interactions with Apache Iceberg tables by providing a DataFrame-centric API. It bridges the gap between the low-level pyiceberg client and the user-friendly experience of polars and pandas.
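
The intended experience looks roughly like the sketch below. It is illustrative only: `read_table()` is referenced later in this document, but the constructor arguments are assumptions, not a documented signature.

```python
# Hypothetical quickstart. read_table() is mentioned in the Data Flow
# section below; the constructor arguments are illustrative assumptions.
import polars as pl
from iceframe.core import IceFrame

# Connect to an Iceberg catalog (here, a REST catalog endpoint).
ice = IceFrame(catalog_name="default", uri="http://localhost:8181")

# Scan an Iceberg table and receive a Polars DataFrame.
df = ice.read_table("analytics.events")
assert isinstance(df, pl.DataFrame)
```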

Core Components

1. IceFrame (Core Class)

The main entry point (iceframe.core.IceFrame). It initializes the connection to the Iceberg catalog and exposes all functionality through a unified API.

  • Responsibility: Configuration, catalog management, and a facade for all operations (sketched below).
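
Conceptually, the class wraps a pyiceberg catalog and delegates work to the feature modules. A minimal sketch of that facade, using pyiceberg's real `load_catalog()` but hypothetical internal names:

```python
# Facade sketch: pyiceberg's load_catalog() is a real API; the delegation
# target (iceframe.operations.read_table) is an assumed internal name.
from pyiceberg.catalog import load_catalog

class IceFrame:
    def __init__(self, catalog_name: str, **catalog_props: str):
        # Catalog management: one pyiceberg catalog per instance.
        self.catalog = load_catalog(catalog_name, **catalog_props)

    def read_table(self, identifier: str):
        # Unified API: delegate the actual work to an operations module.
        from iceframe.operations import read_table  # hypothetical
        return read_table(self.catalog, identifier)
```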

2. Table Operations

Handles CRUD operations (iceframe.operations); a usage sketch follows the list.

  • Create: Supports creating tables from PyArrow schemas, Polars DataFrames, or dicts.
  • Read: Scans Iceberg tables, applies filters (pushdown), and converts to Polars DataFrames.
  • Write: Appends or overwrites data using PyIceberg's write support.
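
Only `read_table()` is named elsewhere in this document, so the other method names and the `filters` parameter below are assumptions about the API surface:

```python
# Illustrative CRUD calls; method names other than read_table() and the
# filters= parameter are assumptions, not a documented API.
import polars as pl
from iceframe.core import IceFrame

ice = IceFrame(catalog_name="default")
df = pl.DataFrame({"id": [1, 2], "name": ["a", "b"]})

ice.create_table("demo.users", schema=df)             # Create from a Polars DataFrame
ice.append("demo.users", df)                          # Write via PyIceberg's support
out = ice.read_table("demo.users", filters="id > 1")  # Filter pushed down to the scan
```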

3. Query Builder

A fluent API for constructing complex queries (iceframe.query); a sketch follows the list.

  • Expression System: Unified expression builder (iceframe.expressions) that translates to:
    • PyIceberg Expressions: For predicate pushdown to the scan level.
    • Polars Expressions: For local processing (aggregations, window functions).
  • Execution Engine: Orchestrates the scan and post-processing.
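
`ice.query()` is referenced in the Data Flow section; the chained methods below are assumptions that illustrate how a fluent builder typically splits work between pushdown and local processing:

```python
# Fluent-query sketch; ice.query() appears in the Data Flow section,
# but the chained method names are illustrative assumptions.
from iceframe.core import IceFrame

ice = IceFrame(catalog_name="default")
result = (
    ice.query("analytics.events")
    .filter("event_date >= '2024-01-01'")  # -> PyIceberg expression (pushdown)
    .group_by("user_id")                   # -> Polars expression (local)
    .agg({"event_id": "count"})
    .collect()                             # execution engine: scan + post-process
)
```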

4. Feature Modules

Modular components for specific capabilities (a brief sketch follows the list):

  • Namespace Management (iceframe.namespace): Manage schemas/databases.
  • Schema Evolution (iceframe.schema): Add, drop, rename, and update columns.
  • Partition Management (iceframe.partition): Manage partition specs.
  • Data Quality (iceframe.quality): Data validation and constraints.
  • Table Maintenance (iceframe.maintenance): High-level interface for snapshot expiration and file removal.
  • Garbage Collection (iceframe.gc): Native implementation of snapshot expiration and orphan-file (data and metadata) removal.
  • Compaction (iceframe.compaction): Bin packing, sorting strategies, and manifest rewriting.
  • Export (iceframe.export): Export data to Parquet, CSV, or JSON.
  • Incremental Processing (iceframe.incremental): Read only newly added data; change data capture (CDC).
  • Visualization (iceframe.visualization): Altair-based plotting.
  • Query Optimization (iceframe.query): Partition-pruned updates.
  • Ingestion (iceframe.ingest): Multi-format support.
  • Branching (iceframe.branching): Create branches and tag snapshots.
  • Views (iceframe.views): Cross-engine view management.
  • Evolution (iceframe.evolution): Partition spec evolution.
  • Procedures (iceframe.procedures): Stored procedure interface.
  • Rollback (iceframe.rollback): Snapshot rollback and management.
  • Catalog Ops (iceframe.catalog_ops): Catalog-level operations.
  • Async Operations (iceframe.async_ops): Non-blocking operations.
  • AI Agent (iceframe.agent): Natural language interface with LLM integration.
  • MCP Server (iceframe.mcp_server): Model Context Protocol server.
  • Pydantic Integration (iceframe.pydantic): Schema conversion and data validation.
  • Notebook Magics (iceframe.magics): IPython magic commands (%iceframe, %%iceql).
  • Bulk Ingestion (iceframe.ingestion): Add existing files to tables.
  • Format Ingestion (iceframe.ingest): Read CSV, JSON, Parquet, Avro, ORC, Delta, Lance, Excel, Google Sheets, Hudi, SQL, XML, SAS, SPSS, Stata, API, HuggingFace, HTML, and clipboard sources.
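
As an illustration of how a few of these modules might surface through the facade (every method name below is an assumption derived from the module descriptions above):

```python
# Hypothetical feature-module calls; all method names are assumptions
# based on the capabilities listed, not a documented API.
from iceframe.core import IceFrame

ice = IceFrame(catalog_name="default")
ice.add_column("demo.users", "email", "string")              # Schema Evolution
ice.expire_snapshots("demo.users", older_than_days=7)        # Table Maintenance
ice.export("demo.users", "users.parquet", format="parquet")  # Export
```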

5. Scalability Features

  • Query Caching (iceframe.cache): In-memory and disk-based result caching.
  • Parallel Operations (iceframe.parallel): Concurrent table operations.
  • Distributed Processing (iceframe.distributed): Ray-based distributed execution.
  • SQL Execution (iceframe.datafusion_ops): Apache DataFusion integration.
  • Connection Pooling (iceframe.pool): Catalog connection pooling.
  • Memory Management (iceframe.memory): Lazy reading and memory limits.
  • Query Optimization (iceframe.optimizer): Automatic query optimization.
  • Monitoring (iceframe.monitoring): Query metrics and observability.
  • Streaming (iceframe.streaming): Micro-batch processing, Kafka streaming, and auto-compaction.
  • Data Skipping (iceframe.skipping): File-level filtering.
  • Federation (iceframe.federation): Multi-catalog support.
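
A hedged sketch of how such knobs might be exposed; none of these parameter names are documented here:

```python
# Illustrative configuration only; all parameter names below are
# assumptions mapped to the scalability modules listed above.
from iceframe.core import IceFrame

ice = IceFrame(
    catalog_name="default",
    cache="disk",           # iceframe.cache: disk-based result caching
    pool_size=8,            # iceframe.pool: catalog connection pooling
    memory_limit="4GB",     # iceframe.memory: lazy reading / memory limits
)
# ice.sql() is likewise an assumed entry point for the DataFusion integration.
df = ice.sql("SELECT user_id, count(*) FROM analytics.events GROUP BY 1")
```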

6. CLI

A command-line interface (iceframe.cli) built with typer for quick table inspection and management.
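
The concrete commands are not listed in this document, but a typer-based CLI is typically wired as follows (the `show` command here is hypothetical):

```python
# Minimal typer skeleton; typer.Typer(), @app.command(), and typer.echo()
# are real typer APIs, while the show command itself is hypothetical.
import typer

app = typer.Typer(help="Inspect and manage Iceberg tables.")

@app.command()
def show(table: str, limit: int = 10) -> None:
    """Print the first rows of an Iceberg table."""
    from iceframe.core import IceFrame
    ice = IceFrame(catalog_name="default")
    typer.echo(ice.read_table(table).head(limit))

if __name__ == "__main__":
    app()
```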

7. AI Chat

An interactive AI assistant (iceframe-chat) for natural language interaction with Iceberg tables.

Data Flow

  1. User Interaction: User calls ice.read_table() or ice.query().
  2. Catalog Interaction: IceFrame uses pyiceberg to load the table metadata.
  3. Predicate Pushdown: Filters are translated to PyIceberg expressions and passed to table.scan().
  4. Data Retrieval: pyiceberg reads data files (Parquet/Avro) matching the filter and returns a PyArrow Table.
  5. Local Processing: The PyArrow Table is converted to a Polars DataFrame, and additional operations (aggregations, complex filters, joins) are applied locally.
  6. Result: A Polars DataFrame is returned to the user; the sketch below traces the same flow with the underlying libraries.
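
This is roughly the work IceFrame automates, expressed directly with pyiceberg and polars (both real APIs); the catalog name, table name, and filter column are placeholders:

```python
# The same six steps with the underlying libraries; all pyiceberg and
# polars calls are real, the identifiers are placeholders.
import polars as pl
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

catalog = load_catalog("default")                # 2. catalog interaction
table = catalog.load_table("analytics.events")   # load table metadata
scan = table.scan(                               # 3. predicate pushdown
    row_filter=GreaterThanOrEqual("event_ts", "2024-01-01T00:00:00"),
    selected_fields=("user_id", "event_ts"),
)
arrow_table = scan.to_arrow()                    # 4. data retrieval (PyArrow Table)
df = pl.from_arrow(arrow_table)                  # 5. local processing in Polars
result = df.group_by("user_id").len()            # 6. Polars DataFrame result
```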

Design Principles

  • DataFrame-First: All data input/output is handled via Polars DataFrames (or PyArrow/Pandas where compatible).
  • Pushdown Optimization: Maximize predicate pushdown to minimize data transfer.
  • Modularity: Features are isolated in separate modules to maintain clean code and testability.
  • Developer Experience: Fluent APIs, type hinting, and comprehensive documentation.
  • Async Support: Non-blocking operations for high-concurrency scenarios (sketched below).
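
For the async principle, a hedged sketch; `read_table_async()` is an assumed name, since the real surface of iceframe.async_ops is not documented here:

```python
# Concurrency sketch; read_table_async() is an assumed method name for
# the async operations module, not a documented API.
import asyncio

async def load_many(ice, tables: list[str]):
    # Fan out several non-blocking reads and await them together.
    return await asyncio.gather(*(ice.read_table_async(t) for t in tables))

# Usage: dfs = asyncio.run(load_many(ice, ["db.a", "db.b"]))
```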