[Feature] Add Native Scan Support for Apache Iceberg Copy-On-Write Tables #2015

@slfan1989

Overview

This PR introduces native scan support for Apache Iceberg Copy-On-Write (COW) tables in the Auron engine. Auron can then read Iceberg data files directly and run the scan in its native execution engine, which is intended to improve query performance.

Design

Architecture Overview

The implementation uses the SPI (Service Provider Interface) extension mechanism. Its three core components, described under Core Modules, cover the following pipeline:

Spark Scan → Detect → Validate → Convert → Native Execute
              (SPI)    (Support)  (Exec)      (JNI)
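
The detect, validate, and convert stages can be sketched in plain Java. This is only an illustration under stated assumptions: `ConvertProviderSketch`, `IcebergProviderSketch`, and `ConvertDispatchSketch` are hypothetical names, the method signatures are not taken from Auron's actual `AuronConvertProvider` interface, and the class-name strings are examples. In a real SPI setup the providers would typically be discovered with `java.util.ServiceLoader` from `META-INF/services`; the sketch passes an explicit list so it runs without a services file.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the AuronConvertProvider SPI; the real
// interface's methods are not shown in this description.
interface ConvertProviderSketch {
    boolean isSupported(String scanClassName); // Detect + Validate
    String convert(String scanClassName);      // Convert (a label stands in for the native plan)
}

// Hypothetical Iceberg provider: accepts scans whose class name is in an Iceberg package.
final class IcebergProviderSketch implements ConvertProviderSketch {
    @Override
    public boolean isSupported(String scanClassName) {
        return scanClassName.startsWith("org.apache.iceberg.");
    }

    @Override
    public String convert(String scanClassName) {
        return "NativeIcebergTableScanExec(" + scanClassName + ")";
    }
}

final class ConvertDispatchSketch {
    // Returns the converted plan from the first provider that accepts the scan,
    // or empty so that the caller keeps the original Spark plan (fallback).
    static Optional<String> tryConvert(List<ConvertProviderSketch> providers,
                                       String scanClassName) {
        for (ConvertProviderSketch provider : providers) {
            if (provider.isSupported(scanClassName)) {
                return Optional.of(provider.convert(scanClassName));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<ConvertProviderSketch> providers = List.of(new IcebergProviderSketch());
        System.out.println(tryConvert(providers,
                "org.apache.iceberg.spark.source.SparkBatchQueryScan"));
        System.out.println(tryConvert(providers,
                "org.apache.spark.sql.execution.FileSourceScanExec"));
    }
}
```

Returning an empty `Optional` rather than throwing keeps the fallback path cheap: a scan that no provider accepts simply stays on the regular Spark execution path.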

Core Modules

  • IcebergConvertProvider

    • Implements AuronConvertProvider SPI interface
    • Auto-registered via META-INF/services
    • Checks Spark version compatibility (supports Spark 3.4 through 4.0)
    • Provides configuration toggle: spark.auron.enable.iceberg.scan
  • IcebergScanSupport

    • Determines whether the scan comes from an Iceberg data source (class name check)
    • Uses reflection to access Iceberg's internal SparkInputPartition and FileScanTask
    • Performs multiple checks to determine native scan eligibility:
      • Only supports COW tables (no delete files)
      • Does not support metadata columns (_file, _pos, etc.)
      • Only supports Parquet and ORC formats
      • Does not support residual filters (row-level filtering)
      • Does not support mixed file formats
      • Only supports Auron-compatible data types
  • NativeIcebergTableScanExec

    • Extends LeafExecNode and NativeSupports
    • Converts Iceberg FileScanTask to Spark FilePartition
    • Generates Protobuf scan plans (ParquetScanExecNode or OrcScanExecNode)
    • Registers Hadoop FileSystem resources via JniBridge
    • Implements projection pushdown
    • Handles file splitting and coalescing for partitioned tables
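
The eligibility checks listed under IcebergScanSupport could look roughly like the following. This is a simplified sketch, not the actual implementation: `TaskInfoSketch` is a hypothetical flattened view of one file scan task (the real code reads Iceberg's `FileScanTask` via reflection), the reason strings are invented for illustration, and the data type check is omitted because the set of Auron-compatible types is not given here.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical, simplified view of one Iceberg file scan task.
final class TaskInfoSketch {
    final String format;            // e.g. "PARQUET", "ORC", "AVRO"
    final int deleteFileCount;      // > 0 means delete files are attached (not pure COW)
    final boolean hasResidualFilter;

    TaskInfoSketch(String format, int deleteFileCount, boolean hasResidualFilter) {
        this.format = format;
        this.deleteFileCount = deleteFileCount;
        this.hasResidualFilter = hasResidualFilter;
    }
}

final class IcebergEligibilitySketch {
    private static final Set<String> SUPPORTED_FORMATS = Set.of("PARQUET", "ORC");
    // Only the two metadata columns named in the list above; others exist ("etc.").
    private static final Set<String> METADATA_COLUMNS = Set.of("_file", "_pos");

    // Returns a fallback reason, or empty when the scan is eligible for native execution.
    static Optional<String> fallbackReason(List<TaskInfoSketch> tasks,
                                           List<String> projectedColumns) {
        for (String column : projectedColumns) {
            if (METADATA_COLUMNS.contains(column)) {
                return Optional.of("metadata column: " + column);
            }
        }
        String firstFormat = null;
        for (TaskInfoSketch task : tasks) {
            if (task.deleteFileCount > 0) {
                return Optional.of("delete files present");
            }
            if (task.hasResidualFilter) {
                return Optional.of("residual filter present");
            }
            if (!SUPPORTED_FORMATS.contains(task.format)) {
                return Optional.of("unsupported format: " + task.format);
            }
            if (firstFormat == null) {
                firstFormat = task.format;
            } else if (!firstFormat.equals(task.format)) {
                return Optional.of("mixed file formats");
            }
        }
        return Optional.empty(); // also covers an empty table (no tasks)
    }

    public static void main(String[] args) {
        List<TaskInfoSketch> tasks = List.of(
                new TaskInfoSketch("PARQUET", 0, false),
                new TaskInfoSketch("ORC", 0, false));
        System.out.println(fallbackReason(tasks, List.of("id", "name")));
    }
}
```

Returning the first failing reason, rather than a bare boolean, makes it easier to log why a particular scan fell back to the regular Spark path.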

Supported Features

  • Currently Supported:
    • Full table scan on Iceberg COW tables
    • Parquet and ORC file formats
    • Projection pushdown (column pruning)
    • Partitioned table queries (partition filtering handled at Iceberg layer)
    • Empty table handling
    • Configuration toggle: spark.auron.enable.iceberg.scan (default: enabled)
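
As a rough illustration of how the toggle and the Spark version gate might combine, the sketch below reads `spark.auron.enable.iceberg.scan` from a plain map with a default of `true` and accepts Spark 3.4 through 4.0. The class and method names are hypothetical, and how Auron actually parses versions or reads its configuration is not shown in this description.

```java
import java.util.Map;

final class IcebergScanToggleSketch {
    static final String ENABLE_KEY = "spark.auron.enable.iceberg.scan";

    // True when major.minor of the version string falls in the range 3.4 to 4.0.
    static boolean isSupportedSparkVersion(String version) {
        String[] parts = version.split("\\.");
        if (parts.length < 2) {
            return false;
        }
        try {
            int major = Integer.parseInt(parts[0]);
            int minor = Integer.parseInt(parts[1]);
            if (major == 3) {
                return minor >= 4;
            }
            return major == 4 && minor == 0;
        } catch (NumberFormatException e) {
            return false; // unparsable version: treat as unsupported
        }
    }

    // The toggle defaults to enabled when the key is absent.
    static boolean isEnabled(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(ENABLE_KEY, "true"));
    }

    static boolean shouldUseNativeIcebergScan(Map<String, String> conf, String sparkVersion) {
        return isEnabled(conf) && isSupportedSparkVersion(sparkVersion);
    }

    public static void main(String[] args) {
        System.out.println(shouldUseNativeIcebergScan(Map.of(), "3.5.1"));
        System.out.println(shouldUseNativeIcebergScan(Map.of(ENABLE_KEY, "false"), "3.5.1"));
        System.out.println(shouldUseNativeIcebergScan(Map.of(), "3.3.2"));
    }
}
```

With the default in place, users only need to set `spark.auron.enable.iceberg.scan=false` when they want to force the regular Spark scan path.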
