Spark: backport stream-results option for remove orphan files #234

Merged
maluchari merged 1 commit into linkedin:openhouse-1.5.2 from
dushyantk1509:dushyantk1509/stream-results-orphan-files
Apr 7, 2026

@dushyantk1509 dushyantk1509 commented Mar 17, 2026

Currently, roughly 30% of resources are consumed by orphan files deletion (OFD) maintenance jobs because of their very high Spark memory configurations. To address this, this PR backports apache/iceberg#14278 to openhouse-1.5.2. It adds a stream-results option to DeleteOrphanFilesSparkAction to prevent driver OOM when removing large numbers of orphan files. Instead of collecting all orphan file paths into driver memory, the action streams them partition by partition using toLocalIterator() and deletes them in batches of 100K.

When enabled, the result contains a sample of up to 20,000 file paths. The total count of deleted files is logged.
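
The streaming-and-batching behavior described above can be illustrated with a minimal, dependency-free sketch. This is not the code from apache/iceberg#14278: a plain `Iterator<String>` stands in for the result of `toLocalIterator()`, and the class name, method name, and `bulkDelete` callback are hypothetical. Only the batch size (100K) and sample cap (20,000) are taken from the description.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration only; names here are not from the Iceberg code base.
final class StreamingDeleteSketch {

    /** Bounded sample of deleted paths plus the full count. */
    static final class Result {
        final List<String> sample;
        final long totalDeleted;

        Result(List<String> sample, long totalDeleted) {
            this.sample = sample;
            this.totalDeleted = totalDeleted;
        }
    }

    /**
     * Consumes paths one at a time, so at most batchSize paths (plus the
     * sample) are held in memory, and hands each full batch to bulkDelete.
     */
    static Result deleteStreaming(
            Iterator<String> paths,
            int batchSize,
            int maxSample,
            Consumer<List<String>> bulkDelete) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> batch = new ArrayList<>();
        List<String> sample = new ArrayList<>();
        long total = 0;

        while (paths.hasNext()) {
            String path = paths.next();
            batch.add(path);
            if (sample.size() < maxSample) {
                sample.add(path); // keep only the first maxSample paths
            }
            if (batch.size() >= batchSize) {
                bulkDelete.accept(new ArrayList<>(batch)); // pass a copy
                total += batch.size();
                batch.clear();
            }
        }
        if (!batch.isEmpty()) { // flush the final partial batch
            bulkDelete.accept(new ArrayList<>(batch));
            total += batch.size();
        }
        return new Result(sample, total);
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 250_000; i++) {
            paths.add("s3://bucket/orphan-" + i + ".parquet"); // made-up paths
        }
        Result result = deleteStreaming(
                paths.iterator(), 100_000, 20_000,
                batch -> System.out.println("deleting batch of " + batch.size()));
        System.out.println("sampled " + result.sample.size()
                + " of " + result.totalDeleted + " deleted paths");
    }
}
```

The point of the pattern is that driver memory is bounded by the batch size and the sample cap rather than by the total number of orphan files, which is why the full list is replaced by a sample and a logged count.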

Tested using the OpenHouse local Docker setup: https://github.com/linkedin/openhouse/blob/main/SETUP.md#test-through-spark-shell

  • Created 5 orphan files.
  • The Spark SQL procedure call ran successfully and deleted all of these files: `spark.sql("CALL openhouse.system.remove_orphan_files(table => 'test_db.stream_test', stream_results => true)").show(false)`
  • The Spark action also ran fine: `SparkActions.get(spark).deleteOrphanFiles(icebergTable).olderThan(System.currentTimeMillis()).option("stream-results", "true").execute()`
  • The default (non-streaming) calls also succeeded and deleted the orphan files: `spark.sql("CALL openhouse.system.remove_orphan_files(table => 'test_db.stream_test')").show(false)` and `SparkActions.get(spark).deleteOrphanFiles(icebergTable).olderThan(System.currentTimeMillis()).execute()`

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added the SPARK label Mar 17, 2026
@dushyantk1509 dushyantk1509 marked this pull request as ready for review March 17, 2026 13:01
Collaborator

@teamurko teamurko left a comment

lgtm

@maluchari maluchari merged commit c9c41c4 into linkedin:openhouse-1.5.2 Apr 7, 2026
13 checks passed
dushyantk1509 pushed a commit to dushyantk1509/openhouse that referenced this pull request Apr 8, 2026
Picks up stream-results option for remove orphan files (linkedin/iceberg#234)
to prevent driver OOM when removing large numbers of orphan files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dushyantk1509 added a commit to linkedin/openhouse that referenced this pull request Apr 9, 2026
Picks up stream-results option for remove orphan files
(linkedin/iceberg#234) to prevent driver OOM when removing large numbers
of orphan files.

## Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->

[Issue](https://github.com/linkedin/openhouse/issues/#nnn)] Briefly
discuss the summary of the changes made in this
pull request in 2-3 lines.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [X] New Features -- introduces streaming mode for OFD.
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [X] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

Co-authored-by: Dushyant Kumar <dukumar@linkedin.biz>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
dushyantk1509 added a commit to linkedin/openhouse that referenced this pull request Apr 16, 2026
## Summary
- Adds a `sparkVersion` field to `JobLaunchConf` to allow per-job Spark
version configuration via `jobs.yaml`.
- Enables gradual migration of maintenance jobs from Spark 3.1 to 3.5
without code changes, starting with OFD, which picks up the stream-results
option for remove orphan files (linkedin/iceberg#234) to prevent driver OOM
when removing large numbers of orphan files.
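
A rough sketch of how a per-job version override might resolve is shown below. Only the `sparkVersion` field name and the 3.1 to 3.5 migration come from the summary above. The `JobConf` class, the resolver, the job type strings, and the assumption that jobs without the field fall back to 3.1 are all hypothetical and are not taken from the OpenHouse code.

```java
// Hypothetical illustration only; this is not the OpenHouse JobLaunchConf.
final class SparkVersionSketch {

    // Assumption: jobs that do not set sparkVersion stay on Spark 3.1.
    static final String DEFAULT_SPARK_VERSION = "3.1";

    /** Stand-in for one job entry loaded from a jobs.yaml-style file. */
    static final class JobConf {
        final String type;
        final String sparkVersion; // null when the field is absent

        JobConf(String type, String sparkVersion) {
            this.type = type;
            this.sparkVersion = sparkVersion;
        }
    }

    /** Returns the per-job version if set, otherwise the default. */
    static String resolveSparkVersion(JobConf conf) {
        if (conf.sparkVersion == null || conf.sparkVersion.trim().isEmpty()) {
            return DEFAULT_SPARK_VERSION;
        }
        return conf.sparkVersion.trim();
    }

    public static void main(String[] args) {
        // Job type strings are illustrative placeholders.
        JobConf ofd = new JobConf("ORPHAN_FILES_DELETION", "3.5");
        JobConf other = new JobConf("SOME_OTHER_JOB", null);
        System.out.println(ofd.type + " -> Spark " + resolveSparkVersion(ofd));
        System.out.println(other.type + " -> Spark " + resolveSparkVersion(other));
    }
}
```

Under these assumptions, moving one more job to Spark 3.5 is a one-line configuration change for that job, while every job without the field keeps its current behavior.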

## Changes
- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [X] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [X] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

Co-authored-by: Dushyant Kumar <dukumar@linkedin.biz>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>