
[AURON #2030] Add Native Scan Support for Apache Hudi Copy-On-Write Tables. #2031

Open
slfan1989 wants to merge 2 commits into apache:master from slfan1989:auron-2030



@slfan1989
Contributor

Which issue does this PR close?

Closes #2030

Rationale for this change

This PR adds native scan support for Hudi Copy-On-Write (COW) tables, enabling Auron to accelerate Hudi table reads by converting FileSourceScanExec operations to native Parquet/ORC scan implementations.

What changes are included in this PR?

1. New Module: thirdparty/auron-hudi

  • HudiConvertProvider: Implements AuronConvertProvider SPI to intercept and convert Hudi FileSourceScanExec to native scans

    • Detects Hudi file formats (HoodieParquetFileFormat, HoodieOrcFileFormat)
    • Converts to NativeParquetScanExec or NativeOrcScanExec
    • Handles timestamp fallback logic automatically
  • HudiScanSupport: Core detection and validation logic

    • File format recognition with NewHoodie* format rejection
    • Table type resolution via multi-source metadata fallback:
      • Options → Catalog → .hoodie/hoodie.properties
    • MOR table detection and rejection
    • Time travel query detection (via as.of.instant, as.of.timestamp options)
    • FileIndex class hierarchy verification
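
As a rough illustration of the checks listed above, here is a minimal, dependency-free sketch. It is not the code in this PR: the object and method names, the use of `hoodie.table.type` as the lookup key in all three metadata sources, the exact-match handling of the time travel options, and the decision to fall back when the table type cannot be resolved are all assumptions made for the example.

```scala
import java.util.Locale

// Hypothetical sketch of the detection and validation steps; not the PR's implementation.
object HudiScanSketch {
  sealed trait NativeScan
  case object ParquetScan extends NativeScan // stands in for NativeParquetScanExec
  case object OrcScan extends NativeScan     // stands in for NativeOrcScanExec

  // Map a file format class name to a native scan kind.
  // NewHoodie* formats are rejected, as are unrecognized formats.
  def nativeScanFor(formatClassName: String): Option[NativeScan] = {
    val simpleName = formatClassName.substring(formatClassName.lastIndexOf('.') + 1)
    if (simpleName.startsWith("NewHoodie")) None
    else if (simpleName == "HoodieParquetFileFormat") Some(ParquetScan)
    else if (simpleName == "HoodieOrcFileFormat") Some(OrcScan)
    else None
  }

  // Resolve the table type with the fallback order Options -> Catalog -> hoodie.properties.
  // hoodieProps is by-name so the properties file is only read when the
  // first two sources do not carry the key (assumed here to be "hoodie.table.type").
  def resolveTableType(
      options: Map[String, String],
      catalogProps: Map[String, String],
      hoodieProps: => Map[String, String]): Option[String] = {
    val key = "hoodie.table.type"
    options.get(key)
      .orElse(catalogProps.get(key))
      .orElse(hoodieProps.get(key))
      .map(_.trim.toUpperCase(Locale.ROOT))
  }

  // A query counts as time travel when either option is present and non-blank.
  def isTimeTravel(options: Map[String, String]): Boolean =
    Seq("as.of.instant", "as.of.timestamp")
      .exists(k => options.get(k).exists(_.trim.nonEmpty))

  // Combine the checks: only COW tables without time travel get a native scan.
  // MOR tables and unresolved table types return None (assumed: fall back to Spark).
  def planNativeScan(
      formatClassName: String,
      options: Map[String, String],
      catalogProps: Map[String, String],
      hoodieProps: => Map[String, String]): Option[NativeScan] =
    if (isTimeTravel(options)) None
    else resolveTableType(options, catalogProps, hoodieProps) match {
      case Some("COPY_ON_WRITE") => nativeScanFor(formatClassName)
      case _                     => None
    }
}
```

Returning `None` for every unsupported case keeps the fallback to the regular Spark scan in one place, so a caller only needs to pattern match on the result.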

2. Configuration

  • Added spark.auron.enable.hudi.scan config option (default: true)
  • Respects existing Parquet/ORC timestamp scanning configurations
  • Runtime Spark version validation (3.0–3.5 only)
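
The gating described above could look roughly like the sketch below. Only the config key and the 3.0–3.5 range come from this PR; the object and method names, the plain `Map` standing in for the Spark configuration, and the version-parsing approach are assumptions for illustration.

```scala
// Hypothetical sketch of the enable flag plus runtime Spark version check.
object HudiConfigSketch {
  val EnableKey = "spark.auron.enable.hudi.scan"

  // Extract (major, minor) from strings such as "3.5.1" or "3.4.0-SNAPSHOT".
  def majorMinor(sparkVersion: String): Option[(Int, Int)] = {
    val Version = """^(\d{1,3})\.(\d{1,3}).*""".r
    sparkVersion.trim match {
      case Version(major, minor) => Some((major.toInt, minor.toInt))
      case _                     => None
    }
  }

  // Supported range is Spark 3.0 through 3.5 inclusive.
  def isSupportedSpark(sparkVersion: String): Boolean =
    majorMinor(sparkVersion).exists { case (major, minor) =>
      major == 3 && minor >= 0 && minor <= 5
    }

  // The flag defaults to true when unset; the version check always applies.
  def hudiScanEnabled(conf: Map[String, String], sparkVersion: String): Boolean =
    conf.get(EnableKey).forall(_.trim.equalsIgnoreCase("true")) &&
      isSupportedSpark(sparkVersion)
}
```

For example, `HudiConfigSketch.hudiScanEnabled(Map.empty, "3.3.2")` is true because the flag defaults to enabled, while any 2.x or 3.6+ version string disables the native scan regardless of the flag.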

3. Build & Integration

  • Maven: New profile hudi-0.15 with enforcer rules

    • Validates hudiEnabled=true property
    • Restricts Spark to 3.0–3.5
    • Pins Hudi version to 0.15.0
  • Build Script: Enhanced auron-build.sh

    • Added --hudi <VERSION> parameter
    • Version compatibility validation
    • Auto-enables hudiEnabled property
  • CI/CD: New workflow .github/workflows/hudi.yml

    • Matrix testing: Spark 3.0–3.5 × JDK 8/17/21 × Scala 2.12
    • Independent Hudi test pipeline

Are there any user-facing changes?

New Configuration Option

// Enable Hudi native scan (enabled by default)
spark.conf.set("spark.auron.enable.hudi.scan", "true")

How was this patch tested?

Added JUnit tests.

…rite Tables.

Signed-off-by: slfan1989 <slfan1989@apache.org>
…rite Tables.

Signed-off-by: slfan1989 <slfan1989@apache.org>
@slfan1989
Contributor Author

Spark 3.0 and 3.1 don’t support time travel yet, so I’ll revise the unit tests accordingly.
