ref(service): Consistent atomic tombstone writes #396
Replace `create_tombstone` with a single `cas_put` method on
`HighVolumeBackend`. The method takes a `CasMutation` (WriteTombstone,
WriteInline, or Delete) and an optional expected redirect target, and
applies the mutation only if the current row state matches.
Large-object writes now store the payload at a unique revision key
(`{key}/{uuid_v7}`) before CAS-committing the tombstone, so concurrent
writers don't silently overwrite each other. Small objects that land on
an existing tombstone are CAS-swapped inline (fixing the expiry-mismatch
TODO). Deletes CAS-remove the tombstone first, then clean up GCS
best-effort — reversing the previous ordering.
Co-Authored-By: Claude <noreply@anthropic.com>
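The mutation-plus-predicate shape described in this commit can be sketched with an in-memory stand-in. Everything here is illustrative: the real `HighVolumeBackend` runs against Bigtable rows, not a `HashMap`, and the matching rules are simplified.

```rust
use std::collections::HashMap;

/// Illustrative stand-ins for the mutation variants and row states.
enum CasMutation {
    WriteTombstone { redirect_target: String },
    WriteInline { data: Vec<u8> },
    Delete,
}

enum RowState {
    Tombstone { redirect_target: String },
    Inline { data: Vec<u8> },
}

#[derive(Default)]
struct InMemoryBackend {
    rows: HashMap<String, RowState>,
}

impl InMemoryBackend {
    /// Applies `mutation` only if the current row matches the expectation:
    /// `Some(t)` requires a tombstone redirecting to `t`; `None` requires
    /// that no tombstone is present. Returns whether the predicate matched.
    fn cas_put(
        &mut self,
        key: &str,
        expected_redirect: Option<&str>,
        mutation: CasMutation,
    ) -> bool {
        let matched = match (self.rows.get(key), expected_redirect) {
            (Some(RowState::Tombstone { redirect_target }), Some(t)) => {
                redirect_target.as_str() == t
            }
            (Some(RowState::Tombstone { .. }), None) => false,
            (_, Some(_)) => false,
            (_, None) => true,
        };
        if matched {
            match mutation {
                CasMutation::WriteTombstone { redirect_target } => {
                    self.rows
                        .insert(key.into(), RowState::Tombstone { redirect_target });
                }
                CasMutation::WriteInline { data } => {
                    self.rows.insert(key.into(), RowState::Inline { data });
                }
                CasMutation::Delete => {
                    self.rows.remove(key);
                }
            }
        }
        matched
    }
}
```

The single entry point means every caller states its expectation explicitly, so a stale writer fails the predicate instead of silently clobbering a tombstone written by a concurrent writer.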
Rename `CasMutation` to `TieredWrite` and `cas_put` to `compare_and_write` for clarity. The names now match the trait method, and the write variants (`WriteInline` → `Object`, `WriteTombstone` → `Tombstone`) align with the existing `TieredGet`/`TieredMetadata` vocabulary. Extract `put_high_volume`, `put_long_term`, and `counting_stream` from the `put_object` body in `tiered.rs`, and rename `revision_id` to `new_long_term_revision` to make its purpose explicit. No behavior change.

Co-Authored-By: Claude <noreply@anthropic.com>
…ents

Factor the shared legacy-metadata branch out of both `tombstone_predicate` and `redirect_target_filter` into a `legacy_tombstone_filter` helper. Improve the `redirect_target_filter` doc comment to show the Chain/Interleave structure as bullet points.

Co-Authored-By: Claude <noreply@anthropic.com>
Update doc comment to reference `compare_and_write` instead of the old `tiered_write` method name. Co-Authored-By: Claude <noreply@anthropic.com>
…odel

The tiered storage documentation still described the pre-atomic write model. Update all three documentation layers to reflect the current compare-and-swap algorithm:

- architecture.md: motivates why HighVolumeBackend needs atomic operations and points to the tiered module for details
- tiered module doc: describes revision keys and per-operation sequences (large write, small write, delete) with conflict handling
- TieredStorage struct doc: concise consistency summary with a pointer to the module-level diagrams

Also fixes the delete ordering (it was backwards) and corrects field name references (`high_volume_backend` -> `high_volume`).
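As a toy illustration of the revision-key scheme documented here: the real code appends a UUIDv7 to the key path; this hypothetical sketch approximates its "distinct per write, roughly time-ordered" property with a millisecond timestamp plus a caller-supplied counter, using only the standard library.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Builds a unique long-term storage path for one write. Stand-in only:
/// the actual implementation uses `{key}/{uuid_v7}`.
fn revision_key(key: &str, counter: u64) -> String {
    let millis = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .as_millis();
    format!("{key}/{millis:012x}-{counter:04x}")
}
```

Because every write lands at a fresh path, two concurrent large-object writers can both upload to GCS without touching each other's blobs; only the subsequent CAS commit decides whose revision becomes visible.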
```rust
    .compare_and_write(id, Some(&target), write)
    .await?;
```

```rust
// TODO: Schedule cleanups into background to ensure eventual cleanup
```
These TODOs are intentional and should be merged in. We'll address them when the oplog is merged and we can safely start async background jobs with exponential backoff that persist across restarts and crashes.
This is also why we swallow the errors below (`let _ = ...`). Internally, all these operations are retried 3 times, which makes them much less likely to fail.
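The retry-then-swallow pattern described in this reply can be sketched as follows. The retry count of 3 comes from the comment above; the helper itself and its name are hypothetical.

```rust
/// Runs a cleanup closure up to `attempts` times, returning the last result.
/// Callers that treat cleanup as best-effort can discard it with
/// `let _ = ...`, since the commit point has already been reached.
fn with_retries<E>(mut op: impl FnMut() -> Result<(), E>, attempts: u32) -> Result<(), E> {
    let mut last = op();
    for _ in 1..attempts {
        if last.is_ok() {
            break;
        }
        last = op();
    }
    last
}
```

A caller would write something like `let _ = with_retries(|| delete_blob(&key), 3);`, where `delete_blob` stands in for the GCS cleanup call; the planned oplog-backed background jobs would replace this with durable retries.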
```rust
//! motivation, and the [`TieredStorage`] struct docs for routing and tombstone
//! semantics.
//!
//! # Cross-Tier Consistency
```
This is where the algorithm is described
Fold redundant tests, fill gaps in CAS conflict paths, and reorganize into named groups matching the operation they exercise.
Force-pushed from aaf81e7 to 670a283.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```rust
        .await
    })
    .await?
    .predicate_matched;
```
Retrying non-idempotent CAS risks false negatives and data loss
Medium Severity
The `compare_and_write` method wraps `check_and_mutate_row` in a retry block, but `CheckAndMutateRow` is not idempotent. If the server applies the mutation but the response is lost to a transient error, the retry re-evaluates against the already-modified row and returns an incorrect `predicate_matched` value. Unlike `put_non_tombstone`/`delete_non_tombstone`, which have outer re-read-and-retry loops to self-heal from false CAS results, `compare_and_write` returns the result directly. In `put_long_term`, a false negative causes the caller to delete the blob it just successfully committed, leaving a dangling tombstone.
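The failure mode Bugbot describes can be reproduced with a toy CAS store (all names here are hypothetical, not the real backend): the first CAS commits, its response is "lost" to a transient error, and a blind retry reports a predicate mismatch even though the write landed.

```rust
use std::collections::HashMap;

/// Minimal CAS row store: a write succeeds only if the key is absent.
#[derive(Default)]
struct Store {
    rows: HashMap<String, String>,
}

impl Store {
    fn check_and_mutate(&mut self, key: &str, value: &str) -> bool {
        if self.rows.contains_key(key) {
            return false;
        }
        self.rows.insert(key.into(), value.into());
        true
    }
}

/// Models a retry wrapper around a non-idempotent CAS: when the first
/// response is dropped, the retry re-evaluates the predicate against the
/// already-mutated row and returns `false`, a false negative that the
/// caller may then "repair" by deleting the blob it just committed.
fn cas_with_lossy_retry(
    store: &mut Store,
    key: &str,
    value: &str,
    lose_first_response: bool,
) -> bool {
    let first = store.check_and_mutate(key, value);
    if lose_first_response {
        // Transient error: response never reached us, so we retry blindly.
        return store.check_and_mutate(key, value);
    }
    first
}
```

The fix discussed below, treating "row already points at my target revision" as a match, makes the retry idempotent because the second evaluation can recognize its own earlier write.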
This is true. The final plan also includes comparing against the target revision and treating a match there as success; this got dropped in the implementation since we're not doing crash recovery yet. Maybe it's better to add it right away.
The general problem is much harder to solve: what if the CAS succeeds but we get a non-retryable failure back?
I'll be adding this in a follow-up that sets up proper idempotency for the CAS predicates.


Before this PR, the tiered storage implementation used unconditional writes and stored all objects under the same key as their tombstone. This could lead to lost updates and orphans on concurrent writes and deletes. This PR adds a unique revision to the key path of each large object and uses a check-and-set that commits only if the stored tombstone's revision still matches the last known state. Together, these provide atomic commit points for all three operations:
**Replaces** `create_tombstone` with a single `compare_and_write` method on `HighVolumeBackend`. The method takes a `TieredWrite` (`Tombstone`, `Object`, or `Delete`) and an optional expected redirect target, and applies the mutation only if the current row state matches.

**Large-object writes** now store the payload at a unique revision key (`{key}/{uuid_v7}`) so each write gets a distinct storage path. A `get_tiered_metadata` read establishes the CAS precondition before writing to GCS, and a subsequent `compare_and_write` atomically commits the tombstone. CAS conflicts clean up the new GCS blob; CAS errors do the same, then propagate.

**Small-object writes** that encounter an existing tombstone now CAS-swap it for inline data rather than routing to LT. This fixes the expiry-mismatch TODO and keeps small objects in HV.
**Deletes** remove the tombstone first (commit point), then clean up GCS best-effort. This is the inverse of the previous ordering: if GCS cleanup fails, an orphan blob remains (accepted), but the tombstone is gone and the object is unreachable.
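A condensed sketch of the large-object sequence under these rules, with in-memory stand-ins for HV and GCS. All names are illustrative; in particular the conflict branch only fires when a concurrent writer changes the row between the precondition read and the commit.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Fixture {
    gcs: HashMap<String, Vec<u8>>, // long-term blob store stand-in
    hv: HashMap<String, String>,   // key -> tombstone redirect target
}

impl Fixture {
    /// 1) read current state (CAS precondition), 2) upload under a unique
    /// revision key, 3) CAS-commit the tombstone, 4) on conflict delete the
    /// blob we just wrote instead of overwriting anyone else's.
    fn put_large(
        &mut self,
        key: &str,
        revision: &str,
        payload: Vec<u8>,
    ) -> Result<(), &'static str> {
        let expected = self.hv.get(key).cloned();
        let blob_key = format!("{key}/{revision}");
        self.gcs.insert(blob_key.clone(), payload);
        if self.hv.get(key).cloned() == expected {
            self.hv.insert(key.into(), blob_key);
            Ok(())
        } else {
            let _ = self.gcs.remove(&blob_key); // best-effort cleanup
            Err("cas conflict")
        }
    }
}
```

Note the asymmetry with deletes: writes commit by installing the tombstone last, deletes commit by removing it first, so in both cases the tombstone row is the single source of truth for visibility.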
Tests have been reorganized for readability. Races and extreme edge cases are not yet covered; those tests will be added in #363.
Ref FS-309