Skip to content

refactor: improve RowCompactedSerializer by using string_view to avoid data copies#196

Open
lxy-9602 wants to merge 2 commits intoalibaba:mainfrom
lxy-9602:row-compact-string-view
Open

refactor: improve RowCompactedSerializer by using string_view to avoid data copies#196
lxy-9602 wants to merge 2 commits intoalibaba:mainfrom
lxy-9602:row-compact-string-view

Conversation

@lxy-9602
Copy link
Collaborator

Purpose

Linked issue: #93

  1. Replace BinaryString and std::string with std::string_view in RowCompactedSerializer serialization interfaces to enable zero-copy access and reduce memory allocations. No functional behavior is changed, but performance is improved in hot paths.
  2. CompactStrategy supports BucketedDvMaintainer.

Tests

CompactStrategyTest, TestPickFullCompaction

Generative AI tooling

Partially Generated-by: Claude-4.6-Opus

@lucasfang lucasfang requested a review from Copilot March 24, 2026 01:52
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors serialization and compaction selection to reduce copies by using std::string_view in hot paths, and extends full-compaction picking logic to account for deletion vectors via BucketedDvMaintainer.

Changes:

  • Enable use_view=true for InternalRow field getters and add string-view based write paths in RowCompactedSerializer / serializer utilities.
  • Extend CompactStrategy::PickFullCompaction to accept a BucketedDvMaintainer and trigger rewrites when deletion vectors exist.
  • Update/add unit tests to cover the new compaction selection behavior and adjust persist processor test setup.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
src/paimon/core/mergetree/lookup/persist_processor_test.cpp Reworks test fixture setup to build KeyValue from Arrow/ColumnarRow instead of BinaryRowGenerator.
src/paimon/core/mergetree/compact/compact_strategy_test.cpp Updates PickFullCompaction call sites for new signature and adds DV-maintainer coverage.
src/paimon/core/mergetree/compact/compact_strategy.h Adds DV maintainer parameter and logic to include files with deletion vectors in full compaction.
src/paimon/common/data/serializer/row_compacted_serializer.h Adds RowWriter::WriteStringView to write length-prefixed bytes from std::string_view.
src/paimon/common/data/serializer/row_compacted_serializer.cpp Enables view-based getters and removes intermediate allocations/copies when serializing string/binary.
src/paimon/common/data/serializer/binary_serializer_utils.cpp Switches STRING/BINARY serialization to pass views instead of copying.
src/paimon/common/data/generic_row.h Adds conversion path when a GenericRow field is stored as std::string_view but API requires BinaryString.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

bool force_rewrite_all_files) {
static std::optional<CompactUnit> PickFullCompaction(
int32_t num_levels, const std::vector<LevelSortedRun>& runs,
const std::shared_ptr<BucketedDvMaintainer>& dv_maintainer, bool force_rewrite_all_files) {
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the public PickFullCompaction signature by adding dv_maintainer, which can ripple through many callers. Consider adding an overload retaining the previous signature (calling the new one with nullptr) to keep the API backwards-compatible and reduce churn for call sites that don’t care about deletion vectors.

Suggested change
const std::shared_ptr<BucketedDvMaintainer>& dv_maintainer, bool force_rewrite_all_files) {
bool force_rewrite_all_files) {
return PickFullCompaction(num_levels, runs, nullptr, force_rewrite_all_files);
}
static std::optional<CompactUnit> PickFullCompaction(
int32_t num_levels, const std::vector<LevelSortedRun>& runs,
const std::shared_ptr<BucketedDvMaintainer>& dv_maintainer,
bool force_rewrite_all_files) {

Copilot uses AI. Check for mistakes.
#pragma once

#include "paimon/core/compact/compact_unit.h"
#include "paimon/core/deletionvectors/bucketed_dv_maintainer.h"
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Including bucketed_dv_maintainer.h in this header increases coupling and compile-time impact, especially since PickFullCompaction is defined inline. If feasible, move the PickFullCompaction implementation to a .cpp file so the header can forward-declare BucketedDvMaintainer and avoid pulling in the heavier dependency transitively.

Copilot uses AI. Check for mistakes.
}
case arrow::Type::type::STRING: {
writer->WriteString(pos, getter->GetString(pos));
writer->WriteStringView(pos, getter->GetStringView(pos));
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For arrow::Type::BINARY, this switches from getter->GetBinary(pos) to getter->GetStringView(pos) and from WriteBinary to WriteStringView. If GetStringView is implemented with STRING semantics (e.g., expecting UTF-8, or only supported for STRING arrays), this can mis-serialize BINARY data or fail at runtime. Prefer a binary-specific view API (e.g., GetBinaryView / WriteBinaryView) or keep using GetBinary + WriteBinary for the BINARY case while still avoiding allocations.

Copilot uses AI. Check for mistakes.
}
case arrow::Type::type::BINARY: {
writer->WriteBinary(pos, *(getter->GetBinary(pos)));
writer->WriteStringView(pos, getter->GetStringView(pos));
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For arrow::Type::BINARY, this switches from getter->GetBinary(pos) to getter->GetStringView(pos) and from WriteBinary to WriteStringView. If GetStringView is implemented with STRING semantics (e.g., expecting UTF-8, or only supported for STRING arrays), this can mis-serialize BINARY data or fail at runtime. Prefer a binary-specific view API (e.g., GetBinaryView / WriteBinaryView) or keep using GetBinary + WriteBinary for the BINARY case while still avoiding allocations.

Suggested change
writer->WriteStringView(pos, getter->GetStringView(pos));
writer->WriteBinary(pos, getter->GetBinary(pos));

Copilot uses AI. Check for mistakes.
Comment on lines +95 to +96
std::map<std::string, std::shared_ptr<DeletionVector>> deletion_vectors = {
{"fake.data", std::make_shared<BitmapDeletionVector>(RoaringBitmap32())}};
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test assumes the file(s) produced by CreateRunsWithLevelAndSize will have file_name == \"fake.data\". If the helper generates different names (common in such factories), DeletionVectorOf(file->file_name) won’t find a DV, and the compaction selection may return nullopt, making the test flaky/incorrect. Make the DV map key match the actual file_name produced by runs (e.g., derive the name from runs/files created in the test) so the test asserts the intended behavior reliably.

Copilot uses AI. Check for mistakes.
Comment on lines +95 to +102
std::map<std::string, std::shared_ptr<DeletionVector>> deletion_vectors = {
{"fake.data", std::make_shared<BitmapDeletionVector>(RoaringBitmap32())}};

auto dv_maintainer = std::make_shared<BucketedDvMaintainer>(
std::make_shared<DeletionVectorsIndexFile>(nullptr, nullptr, /*bitmap64=*/false,
GetDefaultPool()),
deletion_vectors);
auto runs = CreateRunsWithLevelAndSize(/*levels=*/{3}, /*sizes*/ {10});
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test assumes the file(s) produced by CreateRunsWithLevelAndSize will have file_name == \"fake.data\". If the helper generates different names (common in such factories), DeletionVectorOf(file->file_name) won’t find a DV, and the compaction selection may return nullopt, making the test flaky/incorrect. Make the DV map key match the actual file_name produced by runs (e.g., derive the name from runs/files created in the test) so the test asserts the intended behavior reliably.

Suggested change
std::map<std::string, std::shared_ptr<DeletionVector>> deletion_vectors = {
{"fake.data", std::make_shared<BitmapDeletionVector>(RoaringBitmap32())}};
auto dv_maintainer = std::make_shared<BucketedDvMaintainer>(
std::make_shared<DeletionVectorsIndexFile>(nullptr, nullptr, /*bitmap64=*/false,
GetDefaultPool()),
deletion_vectors);
auto runs = CreateRunsWithLevelAndSize(/*levels=*/{3}, /*sizes*/ {10});
auto runs = CreateRunsWithLevelAndSize(/*levels=*/{3}, /*sizes*/ {10});
// Derive the file name from the created runs to ensure the DV map key matches.
ASSERT_FALSE(runs.empty());
ASSERT_TRUE(runs[0].run);
ASSERT_FALSE(runs[0].run->files.empty());
const std::string& file_name = runs[0].run->files[0]->file_name;
std::map<std::string, std::shared_ptr<DeletionVector>> deletion_vectors = {
{file_name, std::make_shared<BitmapDeletionVector>(RoaringBitmap32())}};
auto dv_maintainer = std::make_shared<BucketedDvMaintainer>(
std::make_shared<DeletionVectorsIndexFile>(nullptr, nullptr, /*bitmap64=*/false,
GetDefaultPool()),
deletion_vectors);

Copilot uses AI. Check for mistakes.

#include "paimon/core/mergetree/lookup/persist_processor.h"

#include "arrow/ipc/json_simple.h"
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test now depends on Arrow IPC ‘internal’ JSON utilities (arrow::ipc::internal::json and json_simple.h). These internal APIs are more likely to change across Arrow upgrades and ValueOrDie() will hard-abort rather than producing a clean gtest failure. Prefer a public Arrow test utility (if available in this repo’s Arrow version) and/or propagate failures via ASSERT-style checks so failures are reported as test assertions rather than process aborts.

Copilot uses AI. Check for mistakes.
Comment on lines +33 to +34
auto key_array = std::dynamic_pointer_cast<arrow::StructArray>(
arrow::ipc::internal::json::ArrayFromJSON(key_type, R"([[10]])").ValueOrDie());
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test now depends on Arrow IPC ‘internal’ JSON utilities (arrow::ipc::internal::json and json_simple.h). These internal APIs are more likely to change across Arrow upgrades and ValueOrDie() will hard-abort rather than producing a clean gtest failure. Prefer a public Arrow test utility (if available in this repo’s Arrow version) and/or propagate failures via ASSERT-style checks so failures are reported as test assertions rather than process aborts.

Copilot uses AI. Check for mistakes.
Comment on lines +39 to +41
auto value_array = std::dynamic_pointer_cast<arrow::StructArray>(
arrow::ipc::internal::json::ArrayFromJSON(value_type, R"([["Alice", 10, null, 10.1]])")
.ValueOrDie());
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test now depends on Arrow IPC ‘internal’ JSON utilities (arrow::ipc::internal::json and json_simple.h). These internal APIs are more likely to change across Arrow upgrades and ValueOrDie() will hard-abort rather than producing a clean gtest failure. Prefer a public Arrow test utility (if available in this repo’s Arrow version) and/or propagate failures via ASSERT-style checks so failures are reported as test assertions rather than process aborts.

Copilot uses AI. Check for mistakes.
// add all files when force compacted
files_to_be_compacted.push_back(file);
} else if (dv_maintainer && dv_maintainer->DeletionVectorOf(file->file_name)) {
// check deletion vector for large files
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says ‘check deletion vector for large files’, but the condition does not check file size—only whether a deletion vector exists. To avoid misleading future readers, update the comment to reflect the actual logic (e.g., ‘rewrite files with deletion vectors’) or add an explicit size-based condition if that was intended.

Suggested change
// check deletion vector for large files
// rewrite files that have deletion vectors

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants