
[SPARK-56093][CORE][HISTORY] Allow AppStatus to be cached and reused by the history server. #54878

Open
ForVic wants to merge 9 commits into apache:master from ForVic:dev/victors/history-snapshot-oss


ForVic commented Mar 18, 2026

STILL A WIP

What changes were proposed in this pull request?

When a Spark application completes, we write out the AppStatus (the materialized state built by the AppStatusListener), serialized as protobuf. When the Spark History Server later loads that application, it reads this saved state as an optimization instead of recomputing it with a ReplayListenerBus and AppStatusListener.
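
A minimal sketch of that load path, written in Python for brevity rather than the Scala used in Spark. Every name here (`write_snapshot`, `load_app_status`, `replay_event_log`) is hypothetical, the snapshot is JSON instead of protobuf, and the fallback to replay when no usable snapshot exists is an assumption about the intended behavior, not a description of the actual patch.

```python
import json
import os
import tempfile


def replay_event_log(events):
    """Slow path: rebuild the status by folding over every logged event."""
    status = {"jobs": 0, "tasks": 0}
    for event in events:
        if event == "JobEnd":
            status["jobs"] += 1
        elif event == "TaskEnd":
            status["tasks"] += 1
    return status


def write_snapshot(path, status):
    """At application completion: persist the already-materialized status."""
    with open(path, "w") as f:
        json.dump(status, f)


def load_app_status(snapshot_path, events):
    """In the history server: prefer the snapshot, fall back to replay."""
    try:
        with open(snapshot_path) as f:
            return json.load(f), "snapshot"
    except (OSError, ValueError):
        # Missing or unreadable snapshot: recompute from the event log.
        return replay_event_log(events), "replay"


events = ["TaskEnd", "TaskEnd", "JobEnd"]
with tempfile.TemporaryDirectory() as tmp:
    snapshot = os.path.join(tmp, "app-1.snapshot")  # hypothetical file name
    before = load_app_status(snapshot, events)      # no snapshot yet: replays
    write_snapshot(snapshot, replay_event_log(events))
    after = load_app_status(snapshot, events)       # snapshot found: no replay
```

Both calls return the same status; only the source differs. The snapshot path skips the per-event work, which is where the time goes for large event logs.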

Why are the changes needed?

The Spark History Server can be slow to load application status for jobs with large event logs because of deserialization overhead. In replaying the log, the history server repeats exactly the work the driver already did while the application was running, when the AppStatusListener wrote its state into the KVStore (in-memory or RocksDB-backed) to serve the live UI.

Does this PR introduce any user-facing change?

Yes. It introduces a couple of user-facing configs to enable this change, and it should make the History Server UI load faster.

How was this patch tested?

Unit tests. The patch has also been running internally, at scale, across multiple Spark History Server instances for multiple days.

Was this patch authored or co-authored using generative AI tooling?

Partially.
Generated-by: GPT-5.4

ForVic changed the title from "[CORE][HISTORY] Allow AppStatus to be cached and reused by the history server." to "[WIP][CORE][HISTORY] Allow AppStatus to be cached and reused by the history server." Mar 18, 2026
ForVic force-pushed the dev/victors/history-snapshot-oss branch 23 times, most recently from 09ee478 to 8508dec, March 20, 2026 04:12
ForVic marked this pull request as ready for review March 20, 2026 04:46
ForVic changed the title from "[WIP][CORE][HISTORY] Allow AppStatus to be cached and reused by the history server." to "[SPARK-56093][CORE][HISTORY] Allow AppStatus to be cached and reused by the history server." Mar 20, 2026

ForVic commented Mar 20, 2026

To share some results (also in the JIRA):
We saw p90 history server load times improve by 10x, and p99 improve by about 30x.
The improvement is generally larger for the bigger event logs we observe.

ForVic and others added 3 commits March 20, 2026 14:11
###### Summary
Remove the attempt id from history snapshot import failure logs.

###### Details
Align the FsHistoryProvider snapshot import log messages with the newer snapshot logging format by logging only .

###### Test Plan
- build/sbt "core/scalastyle"
###### Summary
Move the quiet-delete error guard to cover path existence checks too.

###### Details
Wrap the full deletePathQuietly conditional in Utils.tryLogNonFatalError so fs.exists(path) failures are suppressed alongside delete failures.

###### Test Plan
- build/sbt "core/scalastyle"
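
The guard described in the commit above can be sketched as follows, in Python rather than Scala. `try_log_non_fatal_error` is a hypothetical analogue of Spark's `Utils.tryLogNonFatalError`, and the injectable `exists`/`remove` parameters are assumptions added only so the failure case can be demonstrated. The point is that the existence check sits inside the guard.

```python
import logging
import os
import tempfile

log = logging.getLogger("history")


def try_log_non_fatal_error(block):
    """Run block; log and swallow ordinary exceptions (hypothetical helper)."""
    try:
        block()
    except Exception:
        log.exception("Uncaught exception during quiet operation")


def delete_path_quietly(path, exists=os.path.exists, remove=os.remove):
    # The whole conditional runs inside the guard, so a failing existence
    # check is logged and suppressed just like a failing delete.
    def body():
        if exists(path):
            remove(path)

    try_log_non_fatal_error(body)


def failing_exists(path):
    """Stand-in for a filesystem whose existence check throws."""
    raise OSError("filesystem unavailable")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "snapshot")
    open(target, "w").close()
    delete_path_quietly(target, exists=failing_exists)  # error suppressed
    survived = os.path.exists(target)                   # file still present
    delete_path_quietly(target)                         # normal delete
    removed = not os.path.exists(target)
```

If the guard wrapped only the delete, the first call would propagate the `OSError` to the caller instead of logging it.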
###### Summary
Move invalid snapshot wrapping for snapshot size reads into HistorySnapshotStore.snapshotSize.

###### Details
Promote InvalidHistorySnapshotException to the FsHistoryProvider companion so snapshot helpers can throw it directly, and simplify createDiskStoreFromSnapshot by removing the redundant size-read catch block.

###### Test Plan
- build/sbt "core/scalastyle"
###### Summary
Rename the private snapshot deletion helper to make its role clearer.

###### Details
Keep invalidateSnapshot as the public semantic API for bad snapshots, and rename the private manifest/data cleanup helper to deleteSnapshotArtifacts so stale-snapshot cleanup and invalidation no longer look like two competing deleteSnapshot flows.

###### Test Plan
- build/sbt "core/scalastyle"
###### Summary
Rename the public invalid snapshot cleanup API to better match its behavior.

###### Details
Change invalidateSnapshot to deleteInvalidSnapshot and update the FsHistoryProvider call sites so the distinction between invalid-snapshot handling and generic artifact cleanup reads directly in the code.

###### Test Plan
- build/sbt "core/scalastyle"
###### Summary
Collapse invalid-snapshot deletion onto the shared snapshot deletion helper.

###### Details
Remove the separate deleteInvalidSnapshot wrapper from HistorySnapshotStore, expose the underlying deleteSnapshot helper within the history package, and keep the invalid-snapshot logging at the FsHistoryProvider catch sites where that context already exists.

###### Test Plan
- build/sbt "core/scalastyle"
###### Summary
Update the snapshot-related history config version tags to 4.2.0.

###### Details
Mark spark.history.snapshot.enabled and spark.history.snapshot.path as introduced in 4.2.0.

###### Test Plan
- build/sbt "core/scalastyle"
ForVic force-pushed the dev/victors/history-snapshot-oss branch from b4feb15 to 380c8cb, March 20, 2026 22:34
