@gengdy1545 gengdy1545 commented Feb 11, 2026

Pixels Visibility Checkpoint Mechanism

The Checkpoint mechanism in Pixels is a critical design for handling long-running transactions (LRTs) and optimizing Garbage Collection (GC) in the Retina service. It ensures that LRTs can maintain a consistent view of the data without preventing the system from reclaiming memory occupied by old visibility bitmaps.

1. Core Objectives

  • GC Blocking Prevention: Prevent LRTs from pinning the Global Safe Timestamp, so the system can still prune old visibility versions from memory.
  • Scalability: Offload large visibility bitmaps to external storage (HDFS/S3) for long-running queries, reducing JVM heap pressure on Retina nodes.
  • Reliability: Support Retina node recovery by persisting the system state (GC Checkpoint).

2. Implementation Mechanism

The implementation is primarily located in RetinaResourceManager.java and interacts with Trino through PixelsOffloadDetector.java.

A. Lifecycle of a Long-Running Query Checkpoint

  1. Detection (PixelsOffloadDetector.java):
    A background thread in Trino monitors active transactions. If currentTime - startTime > threshold, it triggers an offload:

    // 1. RPC to Retina to create checkpoint
    this.retinaService.registerOffload(context.getTimestamp());
    // 2. Notify Daemon side TransService
    this.transService.markTransOffloaded(context.getTransId());
  2. Persistence (RetinaResourceManager.java):
    When registerOffload(timestamp) is called, Retina performs the following:

    • Parallel Capture: It iterates through all RGVisibility objects in memory.
    • Async Write: It uses a BlockingQueue and checkpointExecutor to write the bitmaps to a file in a producer-consumer pattern.
    • Filename Convention: RetinaUtils generates a unique name: offload_<hostname>_<timestamp>.bin.
    • Storage: The file is written to the path defined by pixels.retina.checkpoint.dir (typically on shared storage).
  3. Routing & Loading (RetinaServerImpl.java & VisibilityCheckpointCache.java):

    • Daemon Routing: When a Worker asks for visibility via queryVisibility, Retina checks offloadedCheckpoints. If a path exists, it returns the path instead of the bitmaps.
    • Worker Loading: The Worker's PixelsReader receives the path and delegates to VisibilityCheckpointCache, which downloads and parses the .bin file into a local Caffeine cache.
  4. Cleanup:
    When the query commits or rolls back, Trino calls unregisterOffload(timestamp). Retina decrements a refCount and deletes the physical file once the count reaches zero.
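The capture-and-write flow in step 2 can be sketched with a `BlockingQueue` drained by a single-threaded executor. This is a minimal illustration, not the Pixels implementation: `RgSnapshot`, `END`, and `offload` are hypothetical names, and the consumer collects snapshots into a list where the real code would append each bitmap to the offload file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

public class OffloadWriteSketch {
    // One captured row-group bitmap; a stand-in for data taken from an RGVisibility object.
    static final class RgSnapshot {
        final long fileId; final int rgId; final long[] bitmap;
        RgSnapshot(long fileId, int rgId, long[] bitmap) {
            this.fileId = fileId; this.rgId = rgId; this.bitmap = bitmap;
        }
    }

    // Poison pill signaling that the capture phase has finished.
    static final RgSnapshot END = new RgSnapshot(-1, -1, new long[0]);

    /** Producer-consumer checkpoint capture: the caller enqueues snapshots,
     *  a checkpointExecutor thread drains and "writes" them. */
    static List<RgSnapshot> offload(List<RgSnapshot> inMemory) throws Exception {
        BlockingQueue<RgSnapshot> queue = new LinkedBlockingQueue<>();
        ExecutorService checkpointExecutor = Executors.newSingleThreadExecutor();
        List<RgSnapshot> written = new ArrayList<>(); // stands in for the .bin file
        Future<?> consumer = checkpointExecutor.submit(() -> {
            try {
                for (RgSnapshot s = queue.take(); s != END; s = queue.take()) {
                    written.add(s); // real code would append s to the offload file here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        for (RgSnapshot s : inMemory) {
            queue.put(s);   // producer: iterate the in-memory bitmaps
        }
        queue.put(END);     // signal end of capture
        consumer.get();     // wait for the write to finish
        checkpointExecutor.shutdown();
        return written;
    }
}
```

Decoupling capture from writing this way keeps the bitmap iteration fast while the slower storage write proceeds on the executor thread.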

B. System State Checkpoint (GC Checkpoint)

In addition to LRT offloading, Retina periodically runs GC:

  • Trigger: runGC() runs every retina.gc.interval seconds.
  • Mechanism: Before clearing old bitmaps from memory, it calls createCheckpoint(timestamp, CheckpointType.GC).
  • Recovery: On startup, recoverCheckpoints() scans the directory, finds the latest gc_*.bin file, and populates rgVisibilityMap, effectively restoring the system state to the last GC point.
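The recovery scan can be sketched as a filename filter that picks the newest GC checkpoint. The `gc_<hostname>_<timestamp>.bin` naming here is an assumption mirroring the documented `offload_<hostname>_<timestamp>.bin` convention, and `latestGcCheckpoint` is an illustrative helper, not a Pixels API.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcRecoverySketch {
    // Assumed convention: gc_<hostname>_<timestamp>.bin, mirroring the offload_* naming.
    private static final Pattern GC_FILE = Pattern.compile("gc_.+_(\\d+)\\.bin");

    /** Returns the name of the newest gc_*.bin file among the given names,
     *  comparing by the timestamp embedded in the filename. */
    static Optional<String> latestGcCheckpoint(List<String> fileNames) {
        String best = null;
        long bestTs = Long.MIN_VALUE;
        for (String name : fileNames) {
            Matcher m = GC_FILE.matcher(name);
            if (m.matches()) {
                long ts = Long.parseLong(m.group(1));
                if (ts > bestTs) { bestTs = ts; best = name; }
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Offload files in the same directory are simply skipped by the pattern, so a mixed checkpoint directory still recovers from the right file.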

3. Data Format (.bin File)

The checkpoint file is a flat binary format optimized for sequential reading:

| Field | Type | Description |
| --- | --- | --- |
| totalRgs | int | Number of Row Groups in this checkpoint |
| *Repeated block, for each Row Group:* | | |
| fileId | long | Unique identifier for the data file |
| rgId | int | Row Group index within the file |
| recordNum | int | Total number of rows in the RG |
| bitmapLen | int | Length of the long array |
| bitmap | long[] | The actual visibility bits |
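A round-trip serializer for this layout might look like the following sketch. `RgEntry` is an illustrative name, and `DataOutputStream`'s big-endian encoding is an assumption; the real format may use a different byte order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CheckpointFormatSketch {
    // One row group's checkpoint record, matching the table above.
    static final class RgEntry {
        final long fileId; final int rgId; final int recordNum; final long[] bitmap;
        RgEntry(long fileId, int rgId, int recordNum, long[] bitmap) {
            this.fileId = fileId; this.rgId = rgId;
            this.recordNum = recordNum; this.bitmap = bitmap;
        }
    }

    /** Writes totalRgs, then the repeated per-RG block, in order. */
    static byte[] serialize(List<RgEntry> rgs) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(rgs.size());                   // totalRgs
        for (RgEntry rg : rgs) {
            out.writeLong(rg.fileId);
            out.writeInt(rg.rgId);
            out.writeInt(rg.recordNum);
            out.writeInt(rg.bitmap.length);         // bitmapLen
            for (long word : rg.bitmap) out.writeLong(word);
        }
        return bos.toByteArray();
    }

    /** Sequentially reads the flat layout back into RgEntry records. */
    static List<RgEntry> deserialize(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int totalRgs = in.readInt();
        List<RgEntry> rgs = new ArrayList<>(totalRgs);
        for (int i = 0; i < totalRgs; i++) {
            long fileId = in.readLong();
            int rgId = in.readInt();
            int recordNum = in.readInt();
            long[] bitmap = new long[in.readInt()];
            for (int j = 0; j < bitmap.length; j++) bitmap[j] = in.readLong();
            rgs.add(new RgEntry(fileId, rgId, recordNum, bitmap));
        }
        return rgs;
    }
}
```

Because the layout is flat and length-prefixed, a reader can consume it in a single sequential pass with no seeking.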

4. Key Components Summary

| Class | Function |
| --- | --- |
| PixelsOffloadDetector | Trino-side monitor that identifies and triggers offloads for LRTs. |
| RetinaResourceManager | The "brain": manages in-memory bitmaps, triggers async writes, and handles ref-counting. |
| RGVisibility | Stores the actual versioned bitmaps and performs the low-level GC/deletion. |
| VisibilityCheckpointCache | Worker-side cache that prevents redundant IO for the same checkpoint file. |
| RetinaUtils | Utility for path/filename generation, ensuring no conflicts in multi-node setups. |
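The ref-counting cleanup from step 4 of the lifecycle can be sketched as follows. This is a minimal illustration of the counting logic only; the class name and the `deletedFiles` counter (standing in for deleting the physical checkpoint file) are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class OffloadRefCountSketch {
    // timestamp -> number of live queries still using that checkpoint
    private final ConcurrentMap<Long, AtomicInteger> refCounts = new ConcurrentHashMap<>();
    int deletedFiles = 0; // stands in for deleting the physical .bin file

    /** Called when a query's checkpoint is created or reused. */
    void registerOffload(long timestamp) {
        refCounts.computeIfAbsent(timestamp, t -> new AtomicInteger()).incrementAndGet();
    }

    /** Called on commit/rollback; deletes the file once no query references it. */
    void unregisterOffload(long timestamp) {
        AtomicInteger count = refCounts.get(timestamp);
        if (count != null && count.decrementAndGet() == 0) {
            refCounts.remove(timestamp);
            deletedFiles++; // real code would delete the checkpoint file here
        }
    }
}
```

Ref-counting lets multiple concurrent LRTs share one checkpoint at the same timestamp without the file being deleted out from under any of them.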

5. Distributed Consistency

  • Multi-Retina: Each node writes its own managed RGs to a file tagged with its hostname. A Worker query for a specific file is routed to the Retina node that manages it, which returns its node-specific checkpoint path.
  • Multi-Worker: Workers are stateless. They obtain either in-memory data or a checkpoint path from the Retina nodes. Shared storage ensures all Workers can see the checkpoint files.

@gengdy1545 gengdy1545 requested a review from bianhq February 11, 2026 04:49
@gengdy1545 gengdy1545 self-assigned this Feb 11, 2026
@gengdy1545 gengdy1545 added the enhancement New feature or request label Feb 11, 2026
@gengdy1545 gengdy1545 added this to the Real-time CRUD milestone Feb 11, 2026

Development

Successfully merging this pull request may close these issues.

[pixels-daemon, common, core, retina] visibility checkpoint needs to have a read cache