feat: support atomic multi-table transactions via namespace manifest#6173
feat: support atomic multi-table transactions via namespace manifest#6173XuQianJin-Stars wants to merge 18 commits intolance-format:mainfrom
Conversation
PR Review: feat: support atomic multi-table transactions via namespace manifestP0: Broken dependency path — will not build for anyone else-lance-namespace-reqwest-client = "0.5.2"
+lance-namespace-reqwest-client = { path = "../lance-namespace/rust/lance-namespace-reqwest-client" }This points to a local path outside the repository ( P0: "Atomic" batch create is not actually atomic
If Phase 2 fails (e.g., conflict on a duplicate version), Phase 1 has already written manifest files to their final paths. There is no rollback/cleanup of these orphaned files. Other readers scanning You need either:
P1:
|
92a9bae to
8b40e86
Compare
|
hi @jackye1995 Could a new version be released for https://github.com/lance-format/lance-namespace to support the interfaces related to lance-format/lance-namespace#310 and lance-format/lance-namespace#311? The newly added interface methods in the current 0.5.2 version are all missing. |
67f0f07 to
4994c3b
Compare
|
Thanks @XuQianJin-Stars for your great work!!! I'd like to help review this pr! I will start after the namespace issue is fixed, feel free to ping me if I miss the progress. You can also fix the fmt/clippy issue when you have time. |
| } | ||
|
|
||
| // If table_version_storage_enabled is enabled, also record in __manifest | ||
| if self.table_version_storage_enabled { |
There was a problem hiding this comment.
This should be a best effort, okay to fail
| /// - Storage options are invalid | ||
| pub async fn build(self) -> Result<DirectoryNamespace> { | ||
| // Validate: table_version_storage_enabled requires manifest_enabled | ||
| if self.table_version_storage_enabled && !self.manifest_enabled { |
There was a problem hiding this comment.
When enabled, we should ensure the __manifest table has a property persistedin it's metadata. This is because once we enable this, we should start to reject all writes that are not us the storage to ensure consistency. (Maybe time to develop the same feature flag system like the table format for forward compatibility)
There was a problem hiding this comment.
Let's add the boolean flag for now, and we can do a more formal feature flag system in a separated PR to make the experience better, and also introduce a similar feature flag for manifest_enabled
| ) -> Result<ListTableVersionsResponse> { | ||
| // When table_version_storage_enabled, query from __manifest | ||
| if self.table_version_storage_enabled { | ||
| if let Some(ref manifest_ns) = self.manifest_ns { |
There was a problem hiding this comment.
All these table version specific logic should go to manifest.rs since it requires manifest. We can call into it here.
|
|
||
| /// Convert a table ID (vec of strings) to an object_id string | ||
| fn str_object_id(table_id: &[String]) -> String { | ||
| pub fn str_object_id(table_id: &[String]) -> String { |
There was a problem hiding this comment.
We should just rename the table_id variable to object_id, and we can use that also for cases like
let object_id = format!("{}${}", table_id, version);
below
| // table_version object_ids are formatted as "{table_id}${version}" | ||
| let filter = format!( | ||
| "object_type = 'table_version' AND starts_with(object_id, '{}$')", | ||
| escaped_id |
There was a problem hiding this comment.
I think we should store the version number in lexigraphical way, like 00000000001 for version 1, so the versions are ordered and we can do range queries more easily
| )))) | ||
| })?; | ||
|
|
||
| merge_builder.when_matched(lance::dataset::WhenMatched::Fail); |
There was a problem hiding this comment.
nit: we should try to import things at top of module when possible ( I'm not doing a good job with that either 😆)
9512267 to
f33f35d
Compare
| /// contains the full metadata JSON stored in the manifest (manifest_path, manifest_size, | ||
| /// e_tag, naming_scheme). | ||
| /// | ||
| /// **Known limitation**: All matching rows are loaded into memory, sorted in Rust, |
There was a problem hiding this comment.
okay sorry I missed that we need descending. Looks like we probably need to do the reverse trick like what we do for the manifest file name, but using 9999999999999 instead of long max here, so that it's ordered first in the string form.
Because Lance scan is ordered, by doing this, we should always be able to just do LIMIT 1 to get the latest table version.
There was a problem hiding this comment.
oh actually, nvm it's ordered by the row address, not by the actual value. Then the current way should be fine for now, it should also work if there is a btree, but might require some additional optimization in the query planning that we can do later.
| { | ||
| return Ok(Some(metadata)); | ||
| } | ||
| // Fall back to legacy plain integer format |
| TableVersion, | ||
| /// A property entry used to persist feature flags and configuration | ||
| /// in the __manifest table (e.g., `table_version_storage_enabled`). | ||
| Property, |
There was a problem hiding this comment.
we don't need this, when I say property, I am saying you update the __manifest table's table metadata key-value map, which can be done with dataset.update_metadata()
| /// | ||
| /// Versions are stored as 20-digit zero-padded integers (e.g., `00000000000000000001` | ||
| /// for version 1) so that string-based range queries and sorting work correctly. | ||
| pub fn format_version(version: i64) -> String { |
There was a problem hiding this comment.
nit: should be named format_table_version
83a214f to
29473fe
Compare
jackye1995
left a comment
There was a problem hiding this comment.
Overall looks great! I would suggest we get all features except for the batch_create_table_versions/batch_commit_tables in first, and then we get a few subsequent items:
- add support for atomically insert and delete entries through merge_insert
- add
batch_commit_tablesofficially with updated lance-namespace crate version - add some more official feature flag setup like the table format
| /// | ||
| /// The object_id is formatted as `{table_id}${zero_padded_version}`. | ||
| pub fn parse_version_from_object_id(object_id: &str) -> Option<i64> { | ||
| object_id |
There was a problem hiding this comment.
this can use split_object_id and then parse the last part.
| } | ||
| } | ||
|
|
||
| // Legacy behavior: delete physical files directly (no __manifest) |
There was a problem hiding this comment.
nit: this is not really legacy, it's just when the feature is not enabled.
| // Even if some file deletions fail, the versions are already removed from | ||
| // __manifest, so they won't be visible to readers. Leftover files are | ||
| // orphaned but harmless and can be cleaned up later. | ||
| for te in &table_entries { |
There was a problem hiding this comment.
this should be extracted as a helper method since it's basically used twice here and below.
| }) | ||
| } | ||
|
|
||
| async fn batch_create_table_versions( |
There was a problem hiding this comment.
TODO: pending merging the namespace PR to get this updated. But I think we can also just not implement this and merge the remaining, have a separated PR to actually get that implemented, if that is cleaner.
| // Phase 1 (atomic commit point): Execute inserts and deletes atomically. | ||
| // Inserts use MergeInsert (WhenMatched::Fail), deletes use DeleteBuilder. | ||
| // We execute inserts first, then deletes. Both are individual atomic ops on __manifest. | ||
| if !insert_entries.is_empty() { |
There was a problem hiding this comment.
if we do them separately, they are not atomic. I think there are 2 ways out: (1) support merge_insert with WhenMatched::InsertOrDelete(condition), so that we can actually choose to insert or delete based on another condition. (2) do a batch commit of 2 staged transactions. I think we should probably do 1, it's a long-overdue feature in merge_insert.
|
|
||
| /// Insert an entry into the manifest table with metadata and base_objects | ||
| async fn insert_into_manifest_with_metadata( | ||
| pub async fn insert_into_manifest_with_metadata( |
There was a problem hiding this comment.
I think we can do some refactoring, instead of having another batch_insert_into_manifest, make this support inserting a list of objects (object_id + object_type)
| /// This provides atomic multi-table version creation semantics. | ||
| pub async fn batch_insert_into_manifest( | ||
| &self, | ||
| entries: Vec<(String, ObjectType, Option<String>, Option<String>)>, |
There was a problem hiding this comment.
can make this a struct to make the meaning of each entry more clear
|
|
||
| // Persist table_version_storage_enabled flag in __manifest so that once | ||
| // enabled, it becomes a permanent property of this namespace. | ||
| if self.table_version_storage_enabled { |
There was a problem hiding this comment.
this logic should be moved to manifest::ensure_manifest_table_up_to_date so we always have a central place to track all migration related functionalities.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Add a new `table_version_storage_enabled` flag to `DirectoryNamespaceBuilder` that stores table versions in the `__manifest` table (as `table_version` object type) instead of relying solely on Lance's native version management. This enables atomic multi-table version creation via `BatchCreateTableVersions`. Key changes: **manifest.rs** - Add `TableVersion` variant to `ObjectType` enum - Make `str_object_id` and `insert_into_manifest_with_metadata` public - Add `batch_insert_into_manifest()`: atomically inserts multiple entries using a single MergeInsert with `WhenMatched::Fail` semantics, so if any version already exists the entire batch is rejected - Add `query_table_versions()`: lists all versions for a given table from the manifest, returning version number and full metadata JSON - Add `query_table_version()`: queries a single version entry by table ID and version number, returning its metadata JSON - Add `delete_table_versions()`: removes version entries by table ID and version ranges from the manifest **dir.rs** - Add `table_version_storage_enabled` field with builder method and validation (requires `manifest_enabled=true`) - `CreateTableVersion`: when enabled, also records the version entry in `__manifest` via `insert_into_manifest_with_metadata` - `BatchCreateTableVersions`: two-phase commit — first copies staging manifests to final locations, then atomically inserts all version entries into `__manifest` in a single batch operation - `ListTableVersions`: when enabled, queries versions from `__manifest` and reconstructs `TableVersion` from stored metadata JSON - `DescribeTableVersion`: when enabled and version is specified, queries directly from `__manifest` instead of opening the dataset - `BatchDeleteTableVersions`: after deleting physical files, syncs deletion to `__manifest` via `delete_table_versions()` - Add comprehensive tests for batch create, conflict detection, manifest requirement validation, and single version recording **namespace.rs** - Add `batch_create_table_versions` trait method with default not-supported implementation **rest.rs / rest_adapter.rs** - Add REST endpoints for `batch_create_table_versions` (POST /v1/table/version/batch_create) and `batch_delete_table_versions` (POST /v1/table/:id/version/batch_delete)
Update dir.rs to use generated tuple variant types (CommitTableOperation,
CommitTableResult) instead of hand-written inline field variants:
- CommitTableOperation::DeclareTable(op) pattern with op.id, op.location
- CommitTableResult::DeclareTable(Box::new(CommitTableResultDeclareTable {...}))
- Same pattern for CreateTableVersion, DeleteTableVersions, DeregisterTable
All 122 tests pass.
…llapsible_if errors - Update lance-namespace-reqwest-client from local path to crates.io v0.6.0 - Fix 5 collapsible_if clippy errors in dir.rs and manifest.rs
- Escape <hash>_<object_id> in doc comment to fix rustdoc invalid-html-tags error - Update python/Cargo.lock for lance-namespace-reqwest-client 0.5.2 -> 0.6.0
010a7af to
7b6abd0
Compare
Summary
Add a new
table_version_storage_enabledflag toDirectoryNamespaceBuilderthat stores table versions in the__manifesttable (astable_versionobject type) instead of relying solely on Lance's native version management. This enables atomic multi-table version creation viaBatchCreateTableVersions, multi-table deletion viaBatchDeleteTableVersions, and a unified atomicBatchCommitTablesoperation that supports mixed operations (declare, create version, delete versions, deregister) across multiple tables in a single transaction.Key Changes
manifest.rs
TableVersionvariant toObjectTypeenumstr_object_idandinsert_into_manifest_with_metadatapublicbatch_insert_into_manifest(): atomically inserts multiple entries using a single MergeInsert withWhenMatched::Failsemantics, so if any version already exists the entire batch is rejectedquery_table_versions(): lists all versions for a given table from the manifest, returning version number and full metadata JSONquery_table_version(): queries a single version entry by table ID and version number, returning its metadata JSONdelete_table_versions(): removes version entries by table ID and version ranges from the manifestbatch_delete_table_versions_by_object_ids(): atomically deletes version entries across multiple tables in a singleDeleteBuilderoperationdir.rs
table_version_storage_enabledfield with builder method and validation (requiresmanifest_enabled=true)CreateTableVersion: when enabled, also records the version entry in__manifestviainsert_into_manifest_with_metadataBatchCreateTableVersions: two-phase commit — first copies staging manifests to final locations, then atomically inserts all version entries into__manifestin a single batch operationListTableVersions: when enabled, queries versions from__manifestand reconstructsTableVersionfrom stored metadata JSONDescribeTableVersion: when enabled and version is specified, queries directly from__manifestinstead of opening the datasetBatchDeleteTableVersions: supports two modes:request.id+request.rangesfor one tablerequest.entriesto atomically delete versions across multiple tables (metadata deletion is atomic; physical file cleanup is best-effort)BatchCommitTables: new unified operation that atomically commits mixed table operations (declare_table,create_table_version,delete_table_versions,deregister_table) across multiple tables using discriminated union types from the generatedlance-namespaceAPI specnamespace.rs
batch_create_table_versionstrait method with default not-supported implementationbatch_commit_tablestrait method with default not-supported implementationbatch_delete_table_versionsdoc to document both single-table (legacy) and multi-table (transactional) modesrest.rs / rest_adapter.rs
POST /v1/table/version/batch_create—batch_create_table_versionsPOST /v1/table/:id/version/batch_delete—batch_delete_table_versionsPOST /v1/table/batch_commit—batch_commit_tables(unified atomic multi-table commit)Depends On
BatchCommitTablesAPI spec and generated client code with discriminated union types (CommitTableOperation,CommitTableResult)Testing