[FEATURE] Recover WriteStatus from previously successful writer attempt to prevent duplicate files during Spark task/stage retries #18049

@prashantwason

Description

Describe the problem

When Spark tasks or stages are retried (due to failures, speculation, or task resubmission), write operations may be re-executed. This can cause:

  1. Duplicate data files - Multiple copies of the same logical write are created with different file names
  2. Corrupted/partial Parquet files - When speculative tasks are killed mid-write, they leave behind incomplete files (often just headers)
  3. Downstream job failures - Subsequent read, ingestion, or compaction jobs fail when encountering corrupt or duplicate files

Related issues

Describe the solution

When DIRECT markers are configured:

  1. Store WriteStatus in completion markers - After a file is successfully written, serialize and store the WriteStatus in the completed marker file (with checksum for integrity)

  2. Recover WriteStatus on retry - When a task retry attempts to create an in-progress marker, check whether a completed marker already exists. If it does:

    • Deserialize and recover the WriteStatus from the existing marker
    • Skip the write operation entirely
    • Reuse the original data file

  3. Support file splitting - Handle scenarios where a single executor splits records into multiple files by recovering WriteStatus for all files with a matching fileId prefix
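The retry-recovery check described in step 2 can be sketched as follows. This is an illustrative simplification, not actual Hudi code: the class name, method name, and marker-file suffixes (`.marker.INPROGRESS`, `.marker.COMPLETED`) are assumptions chosen for the example.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the recovery check: before writing, a retry attempt
// probes for a completed marker and, if one exists, skips the write and reuses
// the prior attempt's output file. Names here are illustrative, not Hudi API.
public class MarkerRecoverySketch {

    // Returns true if the caller should proceed with the write,
    // false if a completed marker from a prior attempt was found.
    public static boolean createInProgressMarker(Path markerDir, String fileId) throws Exception {
        Files.createDirectories(markerDir);
        Path completed = markerDir.resolve(fileId + ".marker.COMPLETED");
        if (Files.exists(completed)) {
            // A prior attempt succeeded: recover the stored WriteStatus bytes
            // from the completed marker instead of re-writing the file.
            return false;
        }
        Files.createFile(markerDir.resolve(fileId + ".marker.INPROGRESS"));
        return true;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("markers");
        String fileId = "file-001";
        // First attempt: no completed marker yet, so the write proceeds.
        System.out.println(createInProgressMarker(dir, fileId)); // true
        // Simulate the first attempt finishing successfully.
        Files.createFile(dir.resolve(fileId + ".marker.COMPLETED"));
        // Retry attempt: completed marker found, so the write is skipped.
        System.out.println(createInProgressMarker(dir, fileId)); // false
    }
}
```

In the real feature the completed marker would also carry the serialized WriteStatus payload, so the retry can return the original statuses rather than an empty result.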

Key changes required

  • HoodieWriteHandle: Add hasRecoveredWriteStatus flag and recoveredWriteStatuses list; modify createInProgressMarkerFile to return boolean and recover write statuses from existing completed markers
  • DirectWriteMarkers: Add methods to write serialized data with checksum to completion markers and read it back
  • Various write handles (HoodieAppendHandle, HoodieCreateHandle, HoodieMergeHandle): Store serialized WriteStatus in completed markers
  • Table classes: Check for recovered write status and skip re-writing if already completed
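The DirectWriteMarkers change above amounts to writing serialized bytes with an integrity checksum and verifying them on read-back. A minimal sketch, assuming a simple length/CRC32/payload framing (the actual on-disk format used by Hudi may differ):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Illustrative sketch of checksum-framed marker content: the framing
// (int length, long CRC32, payload bytes) is an assumption for this example,
// not the actual DirectWriteMarkers format.
public class ChecksumMarkerIO {

    public static void writeWithChecksum(Path marker, byte[] payload) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload);
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(marker))) {
            out.writeInt(payload.length);     // payload length
            out.writeLong(crc.getValue());    // CRC32 of the payload
            out.write(payload);               // serialized WriteStatus bytes
        }
    }

    public static byte[] readWithChecksum(Path marker) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(marker))) {
            byte[] payload = new byte[in.readInt()];
            long expected = in.readLong();
            in.readFully(payload);
            CRC32 crc = new CRC32();
            crc.update(payload);
            if (crc.getValue() != expected) {
                // A mismatch indicates a truncated or corrupted marker; the
                // retry should then fall back to re-writing the file.
                throw new IOException("Checksum mismatch: marker content is corrupt");
            }
            return payload;
        }
    }

    public static void main(String[] args) throws Exception {
        Path marker = Files.createTempFile("file-001", ".marker.COMPLETED");
        writeWithChecksum(marker, "serialized WriteStatus".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(readWithChecksum(marker), StandardCharsets.UTF_8));
    }
}
```

The checksum ensures a retry never trusts a marker that was itself left half-written by a killed task.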

Benefits

  • Prevents duplicate data files during task/stage retries
  • Avoids corrupted partial files from killed speculative tasks
  • Improves job reliability when speculation is enabled
  • No data loss since the original successfully written file is preserved

Additional context

This enhancement applies to DIRECT markers only, since timeline-server-based markers do not support storing content in marker files.
