Skip to content

ref(service): Consistent atomic tombstone writes#396

Merged
jan-auer merged 8 commits intomainfrom
ref/atomic-tombstone-cas
Mar 20, 2026
Merged

ref(service): Consistent atomic tombstone writes#396
jan-auer merged 8 commits intomainfrom
ref/atomic-tombstone-cas

Conversation

@jan-auer
Copy link
Member

@jan-auer jan-auer commented Mar 19, 2026

Before this PR, the tiered storage implementation used unconditional writes and stored all objects under the same key as their tombstone. This could lead to lost updates and orphans on concurrent writes and deletes. This PR adds a unique revision to the key path of each large-object and uses check-and-set that commits only if the stored tombstone's revision still matches the last known state. Together, these provide atomic commit points for all three operations:

  • Replaces create_tombstone with a single compare_and_write method on HighVolumeBackend. The method takes a TieredWrite (Tombstone, Object, or Delete) and an optional expected redirect target, and applies the mutation only if the current row state matches.

  • Large-object writes now store the payload at a unique revision key ({key}/{uuid_v7}) so each write gets a distinct storage path. A get_tiered_metadata read establishes the CAS precondition before writing to GCS, and a subsequent compare_and_write atomically commits the tombstone. CAS conflicts clean up the new GCS blob; CAS errors do the same then propagate.

  • Small-object writes that encounter an existing tombstone now CAS-swap it for inline data rather than routing to LT. This fixes the expiry-mismatch TODO and keeps small objects in HV.

  • Deletes Remove the tombstone first (commit point), then clean up GCS best-effort. This is the inverse of the previous ordering: if GCS cleanup fails an orphan blob remains (accepted), but the tombstone is gone and the object is unreachable.

Tests have been reorganized for readability. There are no tests covering races and extreme edge cases. These will be added in #363.

Ref FS-309

Replace `create_tombstone` with a single `cas_put` method on
`HighVolumeBackend`. The method takes a `CasMutation` (WriteTombstone,
WriteInline, or Delete) and an optional expected redirect target, and
applies the mutation only if the current row state matches.

Large-object writes now store the payload at a unique revision key
(`{key}/{uuid_v7}`) before CAS-committing the tombstone, so concurrent
writers don't silently overwrite each other. Small objects that land on
an existing tombstone are CAS-swapped inline (fixing the expiry-mismatch
TODO). Deletes CAS-remove the tombstone first, then clean up GCS
best-effort — reversing the previous ordering.

Co-Authored-By: Claude <noreply@anthropic.com>
@linear-code
Copy link

linear-code bot commented Mar 19, 2026

@jan-auer jan-auer changed the title ref(service): Add CAS commit points for atomic tombstone writes ref(service): Consistent atomic tombstone writes Mar 19, 2026
jan-auer and others added 4 commits March 19, 2026 21:13
Rename `CasMutation` to `TieredWrite` and `cas_put` to `compare_and_write`
for clarity — the names now match the trait method and the write variants
(`WriteInline` → `Object`, `WriteTombstone` → `Tombstone`) align with
the existing `TieredGet`/`TieredMetadata` vocabulary.

Extract `put_high_volume`, `put_long_term`, and `counting_stream` from the
`put_object` body in `tiered.rs`, and rename `revision_id` to
`new_long_term_revision` to make its purpose explicit. No behavior change.

Co-Authored-By: Claude <noreply@anthropic.com>
…ents

Factor the shared legacy-metadata branch out of both `tombstone_predicate`
and `redirect_target_filter` into a `legacy_tombstone_filter` helper.
Improve the `redirect_target_filter` doc comment to show the Chain/Interleave
structure as bullet points.

Co-Authored-By: Claude <noreply@anthropic.com>
Update doc comment to reference `compare_and_write` instead of the old
`tiered_write` method name.

Co-Authored-By: Claude <noreply@anthropic.com>
…odel

The tiered storage documentation still described the pre-atomic write
model. Update all three documentation layers to reflect the current
compare-and-swap algorithm:

- architecture.md: motivates why HighVolumeBackend needs atomic
  operations and points to the tiered module for details
- tiered module doc: describes revision keys and per-operation
  sequences (large write, small write, delete) with conflict handling
- TieredStorage struct doc: concise consistency summary with pointer
  to the module-level diagrams

Also fixes the delete ordering (was backwards) and corrects field name
references (high_volume_backend -> high_volume).
.compare_and_write(id, Some(&target), write)
.await?;

// TODO: Schedule cleanups into background to ensure eventual cleanup
Copy link
Member Author

@jan-auer jan-auer Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These TODOs are intentional and should be merged in. We'll address them when the oplog is merged and we can safely start async background jobs with exponential backoff that persist across restarts and crashes.

This is also why we swallow the errors below (let _ = ...). Internally, all these operations are retried 3 times, which makes them much less likely to fail.

//! motivation, and the [`TieredStorage`] struct docs for routing and tombstone
//! semantics.
//!
//! # Cross-Tier Consistency
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where the algorithm is described

Fold redundant tests, fill gaps in CAS conflict paths, and reorganize
into named groups matching the operation they exercise.
@jan-auer jan-auer force-pushed the ref/atomic-tombstone-cas branch from aaf81e7 to 670a283 Compare March 19, 2026 21:51
@jan-auer jan-auer marked this pull request as ready for review March 19, 2026 21:56
@jan-auer jan-auer requested a review from a team as a code owner March 19, 2026 21:56
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

.await
})
.await?
.predicate_matched;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retrying non-idempotent CAS risks false negatives and data loss

Medium Severity

The compare_and_write method wraps check_and_mutate_row in a retry block, but CheckAndMutateRow is not idempotent. If the server applies the mutation but the response is lost to a transient error, the retry re-evaluates against the already-modified row and returns an incorrect predicate_matched value. Unlike put_non_tombstone/delete_non_tombstone, which have outer re-read-and-retry loops to self-heal from false CAS results, compare_and_write returns the result directly. In put_long_term, a false negative causes the caller to delete the blob it just successfully committed, leaving a dangling tombstone.

Additional Locations (1)
Fix in Cursor Fix in Web

Copy link
Member Author

@jan-auer jan-auer Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true. The final plan includes we also compare against the target revision and treat it as match - this got dropped in the implementation since we're not doing crash recovery yet. Maybe it's better to add this right away.

The general problem is much harder to solve: what if the CAS succeeds but we get a non retryable failure back.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be adding this in a follow-up that sets up proper idempotency for the CAS predicates.

Copy link
Member

@lcian lcian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@jan-auer jan-auer merged commit 38876f9 into main Mar 20, 2026
21 checks passed
@jan-auer jan-auer deleted the ref/atomic-tombstone-cas branch March 20, 2026 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants