Skip to content

feat: support atomic multi-table transactions via namespace manifest#6173

Open
XuQianJin-Stars wants to merge 18 commits intolance-format:mainfrom
XuQianJin-Stars:feat/multi-table-transactions
Open

feat: support atomic multi-table transactions via namespace manifest#6173
XuQianJin-Stars wants to merge 18 commits intolance-format:mainfrom
XuQianJin-Stars:feat/multi-table-transactions

Conversation

@XuQianJin-Stars
Copy link
Contributor

@XuQianJin-Stars XuQianJin-Stars commented Mar 12, 2026

Summary

Add a new table_version_storage_enabled flag to DirectoryNamespaceBuilder that stores table versions in the __manifest table (as table_version object type) instead of relying solely on Lance's native version management. This enables atomic multi-table version creation via BatchCreateTableVersions, multi-table deletion via BatchDeleteTableVersions, and a unified atomic BatchCommitTables operation that supports mixed operations (declare, create version, delete versions, deregister) across multiple tables in a single transaction.

Key Changes

manifest.rs

  • Add TableVersion variant to ObjectType enum
  • Make str_object_id and insert_into_manifest_with_metadata public
  • Add batch_insert_into_manifest(): atomically inserts multiple entries using a single MergeInsert with WhenMatched::Fail semantics, so if any version already exists the entire batch is rejected
  • Add query_table_versions(): lists all versions for a given table from the manifest, returning version number and full metadata JSON
  • Add query_table_version(): queries a single version entry by table ID and version number, returning its metadata JSON
  • Add delete_table_versions(): removes version entries by table ID and version ranges from the manifest
  • Add batch_delete_table_versions_by_object_ids(): atomically deletes version entries across multiple tables in a single DeleteBuilder operation

dir.rs

  • Add table_version_storage_enabled field with builder method and validation (requires manifest_enabled=true)
  • CreateTableVersion: when enabled, also records the version entry in __manifest via insert_into_manifest_with_metadata
  • BatchCreateTableVersions: two-phase commit — first copies staging manifests to final locations, then atomically inserts all version entries into __manifest in a single batch operation
  • ListTableVersions: when enabled, queries versions from __manifest and reconstructs TableVersion from stored metadata JSON
  • DescribeTableVersion: when enabled and version is specified, queries directly from __manifest instead of opening the dataset
  • BatchDeleteTableVersions: supports two modes:
    • Single-table (legacy): uses request.id + request.ranges for one table
    • Multi-table (transactional): uses request.entries to atomically delete versions across multiple tables (metadata deletion is atomic; physical file cleanup is best-effort)
  • BatchCommitTables: new unified operation that atomically commits mixed table operations (declare_table, create_table_version, delete_table_versions, deregister_table) across multiple tables using discriminated union types from the generated lance-namespace API spec
  • Add comprehensive tests: batch create atomic, conflict detection, manifest requirement validation, single version recording, and more

namespace.rs

  • Add batch_create_table_versions trait method with default not-supported implementation
  • Add batch_commit_tables trait method with default not-supported implementation
  • Update batch_delete_table_versions doc to document both single-table (legacy) and multi-table (transactional) modes

rest.rs / rest_adapter.rs

  • Add REST client/server endpoints:
    • POST /v1/table/version/batch_createbatch_create_table_versions
    • POST /v1/table/:id/version/batch_deletebatch_delete_table_versions
    • POST /v1/table/batch_commitbatch_commit_tables (unified atomic multi-table commit)

Depends On

  • lance-namespace PR: adds BatchCommitTables API spec and generated client code with discriminated union types (CommitTableOperation, CommitTableResult)

Testing

  • 122 unit tests pass (including new tests for batch create, conflict detection, multi-table delete, and manifest validation)
  • All doc-tests pass

@github-actions github-actions bot added the enhancement New feature or request label Mar 12, 2026
@github-actions
Copy link
Contributor

PR Review: feat: support atomic multi-table transactions via namespace manifest

P0: Broken dependency path — will not build for anyone else

-lance-namespace-reqwest-client = "0.5.2"
+lance-namespace-reqwest-client = { path = "../lance-namespace/rust/lance-namespace-reqwest-client" }

This points to a local path outside the repository (../lance-namespace/...). This will break CI and every other developer's build. The Cargo.lock changes (removing source and checksum) confirm this. This must be reverted before merge.

P0: "Atomic" batch create is not actually atomic

batch_create_table_versions claims atomicity but the two-phase approach has a critical gap:

  • Phase 1: Copies all staging manifests to final locations (visible to readers immediately)
  • Phase 2: Inserts entries into __manifest via MergeInsert with WhenMatched::Fail

If Phase 2 fails (e.g., conflict on a duplicate version), Phase 1 has already written manifest files to their final paths. There is no rollback/cleanup of these orphaned files. Other readers scanning _versions/ directories will see these phantom manifests. This defeats the purpose of atomic multi-table versioning.

You need either:

  1. Reverse the order: insert into __manifest first (as the commit point), then copy files. Readers using table_version_storage_enabled will only see versions in the manifest.
  2. Add a rollback in Phase 2's error path that deletes the files written in Phase 1.

P1: delete_table_versions is O(n) per version — will be very slow for large ranges

for (start, end) in ranges {
    for version in *start..=*end {
        // scan + check existence + delete, one at a time
    }
}

Deleting a range of 1000 versions means 1000 separate scans and 1000 separate delete operations. This should use a single filter expression like object_id IN (...) or starts_with(object_id, ...) AND version BETWEEN ... with a single delete call.

P1: query_table_versions loads everything into memory, then sorts client-side

The scanner fetches all matching rows, collects into a Vec, sorts in Rust, then truncates. For tables with many versions, this is unnecessarily expensive. Consider pushing sort/limit into the scan if Lance supports it, or at minimum document this as a known limitation.

Minor

  • use serde_json; (added in both dir.rs and manifest.rs) is unnecessary in Rust 2018+ — external crates are available by name without an explicit use.
  • list_builder.append_null() for every entry in batch_insert_into_manifest means base_objects is always null for version entries. This is fine semantically but worth a comment explaining the intent.
  • In ListTableVersions, entries with missing manifest_path in the JSON metadata are silently dropped via filter_map. This should at minimum log a warning, as it could mask data corruption.

Summary: The dependency path change is a build-breaker, and the non-atomic "atomic" batch create is a correctness issue that undermines the core goal of this PR. I'd recommend fixing these before proceeding with review of the rest.

@XuQianJin-Stars XuQianJin-Stars force-pushed the feat/multi-table-transactions branch from 92a9bae to 8b40e86 Compare March 12, 2026 07:58
@XuQianJin-Stars
Copy link
Contributor Author

hi @jackye1995 Could a new version be released for https://github.com/lance-format/lance-namespace to support the interfaces related to lance-format/lance-namespace#310 and lance-format/lance-namespace#311? The newly added interface methods in the current 0.5.2 version are all missing.

@XuQianJin-Stars XuQianJin-Stars force-pushed the feat/multi-table-transactions branch from 67f0f07 to 4994c3b Compare March 12, 2026 08:53
@wojiaodoubao
Copy link
Contributor

Thanks @XuQianJin-Stars for your great work!!! I'd like to help review this pr! I will start after the namespace issue is fixed, feel free to ping me if I miss the progress.

You can also fix the fmt/clippy issue when you have time.

}

// If table_version_storage_enabled is enabled, also record in __manifest
if self.table_version_storage_enabled {
Copy link
Contributor

@jackye1995 jackye1995 Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a best effort, okay to fail

/// - Storage options are invalid
pub async fn build(self) -> Result<DirectoryNamespace> {
// Validate: table_version_storage_enabled requires manifest_enabled
if self.table_version_storage_enabled && !self.manifest_enabled {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When enabled, we should ensure the __manifest table has a property persistedin it's metadata. This is because once we enable this, we should start to reject all writes that are not us the storage to ensure consistency. (Maybe time to develop the same feature flag system like the table format for forward compatibility)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add the boolean flag for now, and we can do a more formal feature flag system in a separated PR to make the experience better, and also introduce a similar feature flag for manifest_enabled

) -> Result<ListTableVersionsResponse> {
// When table_version_storage_enabled, query from __manifest
if self.table_version_storage_enabled {
if let Some(ref manifest_ns) = self.manifest_ns {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these table version specific logic should go to manifest.rs since it requires manifest. We can call into it here.


/// Convert a table ID (vec of strings) to an object_id string
fn str_object_id(table_id: &[String]) -> String {
pub fn str_object_id(table_id: &[String]) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just rename the table_id variable to object_id, and we can use that also for cases like

let object_id = format!("{}${}", table_id, version);

below

// table_version object_ids are formatted as "{table_id}${version}"
let filter = format!(
"object_type = 'table_version' AND starts_with(object_id, '{}$')",
escaped_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should store the version number in lexigraphical way, like 00000000001 for version 1, so the versions are ordered and we can do range queries more easily

))))
})?;

merge_builder.when_matched(lance::dataset::WhenMatched::Fail);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we should try to import things at top of module when possible ( I'm not doing a good job with that either 😆)

@XuQianJin-Stars XuQianJin-Stars force-pushed the feat/multi-table-transactions branch from 9512267 to f33f35d Compare March 13, 2026 05:47
/// contains the full metadata JSON stored in the manifest (manifest_path, manifest_size,
/// e_tag, naming_scheme).
///
/// **Known limitation**: All matching rows are loaded into memory, sorted in Rust,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay sorry I missed that we need descending. Looks like we probably need to do the reverse trick like what we do for the manifest file name, but using 9999999999999 instead of long max here, so that it's ordered first in the string form.

Because Lance scan is ordered, by doing this, we should always be able to just do LIMIT 1 to get the latest table version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh actually, nvm it's ordered by the row address, not by the actual value. Then the current way should be fine for now, it should also work if there is a btree, but might require some additional optimization in the query planning that we can do later.

{
return Ok(Some(metadata));
}
// Fall back to legacy plain integer format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to fallback

TableVersion,
/// A property entry used to persist feature flags and configuration
/// in the __manifest table (e.g., `table_version_storage_enabled`).
Property,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need this, when I say property, I am saying you update the __manifest table's table metadata key-value map, which can be done with dataset.update_metadata()

///
/// Versions are stored as 20-digit zero-padded integers (e.g., `00000000000000000001`
/// for version 1) so that string-based range queries and sorting work correctly.
pub fn format_version(version: i64) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should be named format_table_version

@XuQianJin-Stars XuQianJin-Stars force-pushed the feat/multi-table-transactions branch from 83a214f to 29473fe Compare March 14, 2026 03:24
Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great! I would suggest we get all features except for the batch_create_table_versions/batch_commit_tables in first, and then we get a few subsequent items:

  1. add support for atomically insert and delete entries through merge_insert
  2. add batch_commit_tables officially with updated lance-namespace crate version
  3. add some more official feature flag setup like the table format

///
/// The object_id is formatted as `{table_id}${zero_padded_version}`.
pub fn parse_version_from_object_id(object_id: &str) -> Option<i64> {
object_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can use split_object_id and then parse the last part.

}
}

// Legacy behavior: delete physical files directly (no __manifest)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is not really legacy, it's just when the feature is not enabled.

// Even if some file deletions fail, the versions are already removed from
// __manifest, so they won't be visible to readers. Leftover files are
// orphaned but harmless and can be cleaned up later.
for te in &table_entries {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be extracted as a helper method since it's basically used twice here and below.

})
}

async fn batch_create_table_versions(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: pending merging the namespace PR to get this updated. But I think we can also just not implement this and merge the remaining, have a separated PR to actually get that implemented, if that is cleaner.

// Phase 1 (atomic commit point): Execute inserts and deletes atomically.
// Inserts use MergeInsert (WhenMatched::Fail), deletes use DeleteBuilder.
// We execute inserts first, then deletes. Both are individual atomic ops on __manifest.
if !insert_entries.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we do them separately, they are not atomic. I think there are 2 ways out: (1) support merge_insert with WhenMatched::InsertOrDelete(condition), so that we can actually choose to insert or delete based on another condition. (2) do a batch commit of 2 staged transactions. I think we should probably do 1, it's a long-overdue feature in merge_insert.


/// Insert an entry into the manifest table with metadata and base_objects
async fn insert_into_manifest_with_metadata(
pub async fn insert_into_manifest_with_metadata(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do some refactoring, instead of having another batch_insert_into_manifest, make this support inserting a list of objects (object_id + object_type)

/// This provides atomic multi-table version creation semantics.
pub async fn batch_insert_into_manifest(
&self,
entries: Vec<(String, ObjectType, Option<String>, Option<String>)>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can make this a struct to make the meaning of each entry more clear


// Persist table_version_storage_enabled flag in __manifest so that once
// enabled, it becomes a permanent property of this namespace.
if self.table_version_storage_enabled {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic should be moved to manifest::ensure_manifest_table_up_to_date so we always have a central place to track all migration related functionalities.

@codecov
Copy link

codecov bot commented Mar 16, 2026

@wojiaodoubao wojiaodoubao mentioned this pull request Mar 17, 2026
7 tasks
Add a new `table_version_storage_enabled` flag to `DirectoryNamespaceBuilder`
that stores table versions in the `__manifest` table (as `table_version` object
type) instead of relying solely on Lance's native version management. This enables
atomic multi-table version creation via `BatchCreateTableVersions`.

Key changes:

**manifest.rs**
- Add `TableVersion` variant to `ObjectType` enum
- Make `str_object_id` and `insert_into_manifest_with_metadata` public
- Add `batch_insert_into_manifest()`: atomically inserts multiple entries
  using a single MergeInsert with `WhenMatched::Fail` semantics, so if any
  version already exists the entire batch is rejected
- Add `query_table_versions()`: lists all versions for a given table from
  the manifest, returning version number and full metadata JSON
- Add `query_table_version()`: queries a single version entry by table ID
  and version number, returning its metadata JSON
- Add `delete_table_versions()`: removes version entries by table ID and
  version ranges from the manifest

**dir.rs**
- Add `table_version_storage_enabled` field with builder method and
  validation (requires `manifest_enabled=true`)
- `CreateTableVersion`: when enabled, also records the version entry in
  `__manifest` via `insert_into_manifest_with_metadata`
- `BatchCreateTableVersions`: two-phase commit — first copies staging
  manifests to final locations, then atomically inserts all version entries
  into `__manifest` in a single batch operation
- `ListTableVersions`: when enabled, queries versions from `__manifest`
  and reconstructs `TableVersion` from stored metadata JSON
- `DescribeTableVersion`: when enabled and version is specified, queries
  directly from `__manifest` instead of opening the dataset
- `BatchDeleteTableVersions`: after deleting physical files, syncs
  deletion to `__manifest` via `delete_table_versions()`
- Add comprehensive tests for batch create, conflict detection, manifest
  requirement validation, and single version recording

**namespace.rs**
- Add `batch_create_table_versions` trait method with default
  not-supported implementation

**rest.rs / rest_adapter.rs**
- Add REST endpoints for `batch_create_table_versions`
  (POST /v1/table/version/batch_create) and `batch_delete_table_versions`
  (POST /v1/table/:id/version/batch_delete)
Update dir.rs to use generated tuple variant types (CommitTableOperation,
CommitTableResult) instead of hand-written inline field variants:

- CommitTableOperation::DeclareTable(op) pattern with op.id, op.location
- CommitTableResult::DeclareTable(Box::new(CommitTableResultDeclareTable {...}))
- Same pattern for CreateTableVersion, DeleteTableVersions, DeregisterTable

All 122 tests pass.
…llapsible_if errors

- Update lance-namespace-reqwest-client from local path to crates.io v0.6.0
- Fix 5 collapsible_if clippy errors in dir.rs and manifest.rs
- Escape <hash>_<object_id> in doc comment to fix rustdoc invalid-html-tags error
- Update python/Cargo.lock for lance-namespace-reqwest-client 0.5.2 -> 0.6.0
@XuQianJin-Stars XuQianJin-Stars force-pushed the feat/multi-table-transactions branch from 010a7af to 7b6abd0 Compare March 18, 2026 02:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants