Spark: backport stream-results option for remove orphan files by dushyantk1509 · Pull Request #234 · linkedin/iceberg

dushyantk1509 · 2026-03-17T12:56:07Z

Currently, about 30% of resources are consumed by orphan file deletion (OFD) maintenance jobs because of very high Spark memory configurations. To address this, this PR backports apache/iceberg#14278 to openhouse-1.5.2. It adds a stream-results option to DeleteOrphanFilesSparkAction to prevent driver OOM when removing large numbers of orphan files. Instead of collecting all orphan file paths into driver memory, the action streams them partition by partition using toLocalIterator() and deletes them in batches of 100K.

When enabled, the result contains a sample of up to 20,000 file paths. The total count of deleted files is logged.
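The batching and sampling behavior described above can be sketched in plain Java without Spark. This is a minimal illustration under stated assumptions, not the code from the PR: the class `StreamingDeleteSketch`, the method `deleteInBatches`, and the `Result` type are hypothetical names, and a plain `Iterator<String>` stands in for what `toLocalIterator()` would return. The batch size and sample cap are passed as parameters; the PR description gives 100K and 20,000 for these.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.IntStream;

// Hypothetical sketch; names and structure are assumptions, not the Iceberg implementation.
class StreamingDeleteSketch {

    /** Summary handed back to the caller: total count plus a bounded sample of paths. */
    static final class Result {
        final long deletedCount;
        final List<String> sample;

        Result(long deletedCount, List<String> sample) {
            this.deletedCount = deletedCount;
            this.sample = sample;
        }
    }

    /**
     * Consumes paths lazily, hands them to the deleter in fixed-size batches,
     * and keeps at most maxSample paths in memory for the result.
     */
    static Result deleteInBatches(
            Iterator<String> paths, int batchSize, int maxSample, Consumer<List<String>> deleter) {
        List<String> batch = new ArrayList<>();
        List<String> sample = new ArrayList<>();
        long deleted = 0;

        while (paths.hasNext()) {
            String path = paths.next();
            batch.add(path);
            if (sample.size() < maxSample) {
                sample.add(path); // only the first maxSample paths are retained
            }
            if (batch.size() == batchSize) {
                deleter.accept(new ArrayList<>(batch)); // pass a copy so clearing is safe
                deleted += batch.size();
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            deleter.accept(new ArrayList<>(batch)); // flush the final partial batch
            deleted += batch.size();
        }
        return new Result(deleted, sample);
    }

    public static void main(String[] args) {
        // Stand-in for the streamed orphan file paths.
        Iterator<String> paths = IntStream.range(0, 25).mapToObj(i -> "file-" + i).iterator();
        Result r = deleteInBatches(paths, 10, 5,
                b -> System.out.println("deleting " + b.size() + " files"));
        System.out.println(r.deletedCount + " deleted, sample size " + r.sample.size());
    }
}
```

With this shape, driver memory is bounded by one batch plus the sample, regardless of how many orphan files exist, which is the property the stream-results option is meant to provide.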

Tested using openhouse local docker setup - https://github.com/linkedin/openhouse/blob/main/SETUP.md#test-through-spark-shell

- Created 5 orphan files.
- The Spark procedure call ran successfully and deleted all of these files: `spark.sql("CALL openhouse.system.remove_orphan_files(table => 'test_db.stream_test', stream_results => true)").show(false)`
- The SparkAction call also ran fine: `SparkActions.get(spark).deleteOrphanFiles(icebergTable).olderThan(System.currentTimeMillis()).option("stream-results", "true").execute()`
- The default (non-streaming) calls also succeeded and deleted the orphan files: `spark.sql("CALL openhouse.system.remove_orphan_files(table => 'test_db.stream_test')").show(false)` and `SparkActions.get(spark).deleteOrphanFiles(icebergTable).olderThan(System.currentTimeMillis()).execute()`

Backport of apache/iceberg#14278 to openhouse-1.5.2. Adds stream-results option to DeleteOrphanFilesSparkAction to prevent driver OOM when removing large numbers of orphan files. Instead of collecting all orphan file paths into driver memory, files are streamed partition-by-partition using toLocalIterator() and deleted in batches of 100K. When enabled, the result contains a sample of up to 20,000 file paths. The total count of deleted files is logged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

teamurko

lgtm

Picks up stream-results option for remove orphan files (linkedin/iceberg#234) to prevent driver OOM when removing large numbers of orphan files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Picks up stream-results option for remove orphan files (linkedin/iceberg#234) to prevent driver OOM when removing large numbers of orphan files.

## Summary

[Issue](https://github.com/linkedin/openhouse/issues/#nnn)] Briefly discuss the summary of the changes made in this pull request in 2-3 lines.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [X] New Features -- introduces streaming mode for OFD.
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

## Testing Done

- [ ] Manually Tested on local docker setup. Please include commands ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [X] No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

Co-authored-by: Dushyant Kumar <dukumar@linkedin.biz>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

## Summary

- Adds `sparkVersion` field to `JobLaunchConf` to allow per-job Spark version configuration via `jobs.yaml`
- Enables gradual migration of maintenance jobs from Spark 3.1 to 3.5 without code changes. Starting with OFD - picks up stream-results option for remove orphan files (linkedin/iceberg#234) to prevent driver OOM when removing large numbers of orphan files.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [X] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

## Testing Done

- [ ] Manually Tested on local docker setup. Please include commands ran, and their output.
- [ ] Added new tests for the changes made.
- [X] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

Co-authored-by: Dushyant Kumar <dukumar@linkedin.biz>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions Bot added the SPARK label Mar 17, 2026

dushyantk1509 marked this pull request as ready for review March 17, 2026 13:01

teamurko approved these changes Apr 7, 2026

View reviewed changes

maluchari merged commit c9c41c4 into linkedin:openhouse-1.5.2 Apr 7, 2026
13 checks passed

dushyantk1509 mentioned this pull request Apr 8, 2026

Bump iceberg 1.5 version to 1.5.2.10 linkedin/openhouse#535

Merged

17 tasks

dushyantk1509 mentioned this pull request Apr 15, 2026

Add sparkVersion field to JobLaunchConf linkedin/openhouse#546

Merged

17 tasks

dushyantk1509 mentioned this pull request Apr 20, 2026

Add stream-results option for orphan files deletion linkedin/openhouse#549

Merged

17 tasks

maluchari merged 1 commit into linkedin:openhouse-1.5.2 from dushyantk1509:dushyantk1509/stream-results-orphan-files
