Skip to content

[feature](tso) Add global monotonically increasing Timestamp Oracle(TSO)#61199

Open
AntiTopQuark wants to merge 33 commits intoapache:masterfrom
AntiTopQuark:support_tso
Open

[feature](tso) Add global monotonically increasing Timestamp Oracle(TSO)#61199
AntiTopQuark wants to merge 33 commits intoapache:masterfrom
AntiTopQuark:support_tso

Conversation

@AntiTopQuark
Copy link
Copy Markdown

@AntiTopQuark AntiTopQuark commented Mar 11, 2026

What problem does this PR solve?

Issue Number: close #61198

Related #57921

Problem Summary:

Release note

  • Implement a global monotonically increasing Timestamp Oracle (TSO) service that generates unique, monotonically increasing timestamps for transactions.
    The service calibrates its initial timestamp at startup and periodically updates it to maintain a time window.
    A TSO timestamp encodes the physical time and a logical counter; it is assembled and extracted by the new TSOTimestamp class.
  • Introduce TSOService, a master-only daemon that manages global timestamps.
    The service exposes two main methods:
    • getTSO() – returns a new TSO timestamp for transaction commits.
    • getCurrentTSO() – returns the current TSO without bumping the logical counter.
  • Add multiple configuration properties to control the behavior of the TSO feature:
    • experimental_enable_feature_tso – enables/disables the TSO feature.
    • tso_service_update_interval_ms – interval in milliseconds for the TSO service to update its window.
    • max_update_tso_retry_count and max_get_tso_retry_count – retry limits for updating and obtaining TSOs.
    • tso_service_window_duration_ms – length of the time window allocated by the TSO service.
    • tso_time_offset_debug_mode – debug offset for the physical time.
    • enable_tso_persist_journal and enable_tso_checkpoint_module – persistence switches for edit log and checkpoint.
  • Table property: Introduce enable_tso which can be configured in CREATE TABLE or modified via ALTER TABLE. Only tables with enable_tso = true generate commit TSO for transactions; when disabled, commit_tso remains -1.
  • Transaction and commit integration:
    • During commit, TransactionState now fetches a commit TSO from TSOService when TSO is enabled and stores it in the transaction state and TableCommitInfo.
    • The commit TSO is recorded per partition (via TPartitionVersionInfo.commit_tso), and is persisted with each rowset (see next item).
  • Rowset and meta changes:
    • Rowset::make_visible now accepts a commit_tso parameter and writes it to RowsetMeta.
    • RowsetMetaPB adds a new field commit_tso to persist commit timestamps.
    • information_schema.rowsets introduces a new column COMMIT_TSO allowing users to query the commit timestamp for each rowset.
    • Pending publish tasks, asynchronous publish tasks and other internal structures have been extended to carry commit TSO.
  • External interface:
    A new REST endpoint /api/tso is added for retrieving current TSO information. It returns a JSON payload containing:
    • window_end_physical_time – end of the current TSO time window.
    • current_tso – the current composed 64‑bit TSO.
    • current_tso_physical_time and current_tso_logical_counter – the decomposed physical and logical parts of the current TSO. This API does not increment the logical counter.
  • Metrics & observability:
    New metrics counters (e.g., tso_clock_drift_detected, tso_clock_backward_detected, tso_clock_calculated, tso_clock_updated) expose state and health of the TSO service.
  • Regression & unit tests:
    New unit tests verify TSOTimestamp bit manipulation, TSOService behavior, commit TSO propagation, and the /api/tso endpoint. Regression tests verify that rowset commit timestamps are populated when TSO is enabled and that the API returns increasing TSOs.

Impact and Compatibility

  • Experimental: the TSO feature is currently guarded by experimental_enable_feature_tso. It is disabled by default and can be enabled in front-end configuration. When enabled, old FE versions without this feature cannot replay edit log entries containing TSO operations; therefore upgrade all FEs before enabling.
  • Table compatibility: tables created before enabling TSO remain unaffected unless explicitly modified to set enable_tso to true. Tables with TSO enabled will produce commit TSO for each rowset and may require downstream consumers to handle the new commit_tso field.
  • Client API: clients can call /api/tso to inspect current TSO values. No existing API is modified.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.45% (1798/2263)
Line Coverage 64.65% (32249/49881)
Region Coverage 65.60% (16145/24611)
Branch Coverage 56.06% (8611/15360)

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@doris-robot
Copy link
Copy Markdown

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.31% (1798/2267)
Line Coverage 64.65% (32290/49944)
Region Coverage 65.57% (16170/24659)
Branch Coverage 55.96% (8611/15388)

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.31% (1798/2267)
Line Coverage 64.63% (32279/49944)
Region Coverage 65.55% (16165/24659)
Branch Coverage 55.95% (8610/15388)

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 44.34% (188/424) 🎉
Increment coverage report
Complete coverage report

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

…TSO)

Signed-off-by: Jingzhe Jia <AntiTopQuark1350@outlook.com>
@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.31% (1798/2267)
Line Coverage 64.66% (32294/49944)
Region Coverage 65.59% (16173/24659)
Branch Coverage 55.97% (8612/15388)

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 44.34% (188/424) 🎉
Increment coverage report
Complete coverage report

Signed-off-by: Jingzhe Jia <AntiTopQuark1350@outlook.com>
@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.24% (1798/2269)
Line Coverage 64.53% (32281/50023)
Region Coverage 65.45% (16165/24699)
Branch Coverage 55.85% (8609/15414)

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Mar 30, 2026
@doris-robot
Copy link
Copy Markdown

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.67% (1796/2283)
Line Coverage 64.39% (32279/50128)
Region Coverage 65.25% (16157/24761)
Branch Coverage 55.71% (8609/15454)

@doris-robot
Copy link
Copy Markdown

TPC-H: Total hot run time: 26912 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 772d322ab7d2917fab06270025e16e602542148e, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17665	4492	4380	4380
q2	q3	10735	797	535	535
q4	4734	343	250	250
q5	8207	1215	1028	1028
q6	234	175	144	144
q7	806	865	680	680
q8	10669	1475	1368	1368
q9	6792	4783	4749	4749
q10	6332	1943	1659	1659
q11	460	260	244	244
q12	770	587	467	467
q13	18052	2701	1946	1946
q14	227	240	217	217
q15	q16	734	728	669	669
q17	744	832	466	466
q18	5924	5368	5385	5368
q19	1109	984	626	626
q20	554	495	375	375
q21	4525	2095	1467	1467
q22	387	336	274	274
Total cold run time: 99660 ms
Total hot run time: 26912 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4864	4587	4568	4568
q2	q3	3909	4375	3833	3833
q4	903	1259	772	772
q5	4072	4413	4348	4348
q6	180	172	144	144
q7	1767	1667	1573	1573
q8	2503	2735	2595	2595
q9	7754	7578	7427	7427
q10	3853	4024	3624	3624
q11	509	442	419	419
q12	491	704	465	465
q13	2954	2973	2057	2057
q14	298	304	300	300
q15	q16	766	773	714	714
q17	1214	1394	1399	1394
q18	7268	6748	6610	6610
q19	891	923	941	923
q20	2095	2194	2007	2007
q21	3959	3485	3401	3401
q22	485	441	367	367
Total cold run time: 50735 ms
Total hot run time: 47541 ms

@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@doris-robot
Copy link
Copy Markdown

TPC-H: Total hot run time: 36676 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 6755fa1061376778da12ffdc0a7d55f410eb3900, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17632	4532	4353	4353
q2	q3	10634	1028	633	633
q4	4716	465	354	354
q5	7776	802	625	625
q6	240	220	185	185
q7	1010	881	647	647
q8	9516	945	808	808
q9	9938	7467	7447	7447
q10	8982	4112	3755	3755
q11	576	354	342	342
q12	810	685	570	570
q13	17982	4880	4014	4014
q14	477	480	450	450
q15	q16	569	556	462	462
q17	929	1056	820	820
q18	6682	6330	5766	5766
q19	1179	1312	898	898
q20	1528	1446	1616	1446
q21	4911	3312	2730	2730
q22	505	400	371	371
Total cold run time: 106592 ms
Total hot run time: 36676 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5532	5227	5364	5227
q2	q3	1945	2397	2012	2012
q4	1005	1451	890	890
q5	4614	4933	4811	4811
q6	231	206	174	174
q7	7952	7900	7686	7686
q8	3272	3605	3427	3427
q9	27508	27430	27218	27218
q10	4591	5015	4563	4563
q11	1036	932	907	907
q12	577	676	497	497
q13	4458	4987	4040	4040
q14	465	481	470	470
q15	q16	551	573	556	556
q17	3771	3865	3792	3792
q18	7547	7196	6936	6936
q19	1173	1190	1215	1190
q20	2337	2444	2267	2267
q21	14015	13414	13352	13352
q22	596	546	487	487
Total cold run time: 93176 ms
Total hot run time: 90502 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.67% (1796/2283)
Line Coverage 64.40% (32280/50128)
Region Coverage 65.26% (16159/24761)
Branch Coverage 55.73% (8612/15454)

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 58.53% (271/463) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.92% (19966/37732)
Line Coverage 36.45% (187288/513766)
Region Coverage 32.70% (145252/444150)
Branch Coverage 33.88% (63662/187922)

@AntiTopQuark AntiTopQuark requested a review from morningman March 30, 2026 16:03
@AntiTopQuark
Copy link
Copy Markdown
Author

Hi, @morningman and @morrySnow .I have fixed all the comments. Please help review this code again.

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.54% (27176/36952)
Line Coverage 57.07% (292346/512219)
Region Coverage 54.34% (243592/448270)
Branch Coverage 56.08% (105713/188488)

@gavinchou
Copy link
Copy Markdown
Contributor

LGTM

gavinchou
gavinchou previously approved these changes Mar 31, 2026
Copy link
Copy Markdown
Contributor

@gavinchou gavinchou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 31, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@AntiTopQuark
Copy link
Copy Markdown
Author

Hi, @morningman and @morrySnow . Please help review this code again.

@dataroaring
Copy link
Copy Markdown
Contributor

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 2 issues worth addressing before this lands.

  1. High: fe/fe-core/src/main/java/org/apache/doris/tso/TSOService.java makes image persistence of windowEndTSO optional even though calibration already requires journal persistence. If enable_tso_feature=true, enable_tso_persist_journal=true, and enable_tso_checkpoint_module=false, a checkpoint can be written without the last reserved TSO window. After that image is taken and older TSO journals are compacted away, the next master restart will reload windowEndTSO=0 from image and can calibrate from wall clock instead of the last issued window, violating the feature's global monotonicity guarantee.
  2. Medium: fe/fe-core/src/main/java/org/apache/doris/httpv2/rest/TSOAction.java exposes cached TSO values even when the feature has been disabled or the service is uncalibrated, because the endpoint does not validate feature state and TSOService does not clear the cached timestamp on disable. That makes /api/tso report stale state as if it were authoritative.

Critical checkpoint conclusions:

  • Goal of the task: Partially achieved. The PR wires TSO through FE commit, publish RPC, and BE rowset metadata, but the checkpoint/restart path still has a correctness hole.
  • Minimality/focus: Broad but reasonable for the feature scope.
  • Concurrency: Locking in TSOService and BE publish paths looks consistent; I did not find a deadlock, but the disable path leaves stale in-memory state observable through the API.
  • Lifecycle/static initialization: Not fully correct. The TSO lifecycle across journal replay, checkpoint, and master restart is incomplete when the image module is disabled.
  • Configuration: New configs are present, but enable_tso_persist_journal and enable_tso_checkpoint_module can currently be combined into a broken recovery mode.
  • Compatibility/incompatible change: New FE op code / meta module handling exists, but the optional checkpoint persistence weakens recovery compatibility semantics.
  • Parallel code paths: commit_tso propagation appears covered in the main publish path, async publish path, and report-based republish path.
  • Special conditional checks: The persist-journal precondition in calibration is good; the HTTP API is missing equivalent feature-state validation.
  • Test coverage: Good unit/regression coverage for normal flows, but I did not see coverage for checkpoint-disabled restart recovery or API behavior while disabled/uninitialized.
  • Observability: Logging and metrics are generally sufficient.
  • Transaction/persistence: Not correct yet because the durable source of windowEndTSO can disappear after checkpoint compaction.
  • Data writes/modifications: FE-to-BE commit_tso plumbing and BE rowset persistence look consistent in the reviewed paths.
  • FE-BE variable passing: The new thrift/proto fields are propagated through publish and recovery paths I checked.
  • Performance: No obvious hot-path regression stood out during review.
  • Other issues: None beyond the two findings above.

}

public long saveTSO(CountingDataOutputStream dos, long checksum) throws IOException {
if (!Config.enable_tso_checkpoint_module) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

saveTSO() is gated by enable_tso_checkpoint_module, but calibrateTimestamp() already requires enable_tso_persist_journal=true, so TSO can be actively handing out timestamps while this image module stays disabled. That creates a monotonicity hole after checkpoint compaction: once a checkpoint is taken without windowEndTSO and the pre-checkpoint TSO journal entries are purged, the next master restart reloads windowEndTSO=0 from image and can calibrate from wall clock instead of the last reserved window. In that scenario the new master can generate a TSO lower than one already issued before the checkpoint. This state needs to be made impossible, either by always saving windowEndTSO into the image whenever journal persistence is enabled, or by rejecting the config combination that enables TSO without checkpoint persistence.

Copy link
Copy Markdown
Author

@AntiTopQuark AntiTopQuark Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We intentionally keep these two switches decoupled:

  • enable_tso_persist_journal serves as the strict correctness gate for enabling TSO; if this configuration is set to false, calibrateTimestamp() will fail fast.

  • enable_tso_checkpoint_module is a compatibility switch for image format and rollback scenarios, not the switch for enabling TSO.

Even when the checkpoint module is disabled, window-end events are still replayed from the journal (OP_TSO_TIMESTAMP_WINDOW_END).Therefore, this configuration combination is a compatibility mode chosen by the operator, with known trade‑offs, rather than an unintended state.

cc @dataroaring

return ResponseEntityBuilder.badRequest("Current FE is not master");
}
// Get current TSO information without increasing it
long windowEndPhysicalTime = env.getTSOService().getWindowEndTSO();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This endpoint never checks whether TSO is enabled or calibrated before returning getWindowEndTSO() / getCurrentTSO(). That is observable because TSOService.runAfterCatalogReady() only flips isInitialized to false when the feature is disabled; it does not clear the cached globalTimestamp or windowEndTSO. After ADMIN SET FRONTEND CONFIG ('enable_tso_feature'='false'), /api/tso will still return the previous timestamp as if it were current. Please either clear the cached state on disable, or make /api/tso reject requests unless the feature is enabled and calibrated.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an intentional design, not an oversight.
The /api/tso endpoint is intended as a state observation interface, not a timestamp issuance interface.

The actual path for externally allocating TSO is getTSO(), where strict validation is performed on both enable_tso_feature and isInitialized. Requests will be rejected directly if TSO is not enabled or not calibrated.

In contrast, /api/tso only reads in-memory snapshots of the current state (getWindowEndTSO() / getCurrentTSO()) for observation and troubleshooting purposes, and does not advance the timestamp.

cc @dataroaring

Copy link
Copy Markdown
Contributor

@dataroaring dataroaring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Thanks for this substantial feature addition! The TSO design follows established patterns (similar to TiDB/PD). Here are my findings:


Critical Issues

1. Config naming inconsistency — enable_tso_feature vs experimental_enable_feature_tso

The Config field is enable_tso_feature but the error message in PropertyAnalyzer.analyzeEnableTso() says "experimental_enable_tso_feature" (note the word order difference: enable_tso_feature vs enable_feature_tso). This will confuse users trying to set the right config.

2. getTSO() sleeps in the transaction commit path

getTSO() is called from DatabaseTransactionMgr.getCommitTSO() during commit. It contains retry loops that call sleep(200) up to tso_max_get_retry_count (default 10) times. This means a transaction commit can block for up to 2 seconds waiting for the environment. The Env.getCurrentEnv().isReady() and isMaster() checks should be done once before entering the retry loop rather than repeated inside it.

3. Logical counter overflow recovery gap

In getTSO(), when logical > MAX_LOGICAL_COUNTER, the code sleeps and retries. But generateTSO() returns logicalCounter + 1 without resetting the counter — the reset only happens when updateTimestamp() daemon advances the physical time via setTSOPhysical(). If the daemon tick hasn't run yet, all concurrent getTSO() calls will see overflow and fail until the daemon catches up.

4. unprotectedCommitTransaction signature change

unprotectedCommitTransaction and unprotectedCommitTransaction2PC now throw TransactionCommitFailedException. This is a behavioral change — all callers need to be verified. Subclasses (e.g. cloud variants) that override these methods must also update their signatures.

5. calibrateTimestamp() requires enable_tso_persist_journal=true but it defaults to false

Enabling enable_tso_feature=true without also setting enable_tso_persist_journal=true causes the service to perpetually fail calibration in runAfterCatalogReady(). These config dependencies should be enforced or at least clearly documented.


Moderate Issues

6. _async_publish_tasks still uses anonymous std::tuple

storage_engine.h changed the async publish map from pair to tuple<int64_t, int64_t, int64_t> with positional access (std::get<0>, etc.). Since DiscontinuousVersionTablet was correctly introduced as a named struct for the other case, consider doing the same here for consistency.

7. Rowset::make_visible has no default for commit_tso

The signature changed from make_visible(Version) to make_visible(Version, int64_t commit_tso) without a default parameter. The BE txn_manager.h overloads correctly use = -1, but rowset.h does not. Any code (including out-of-tree or test code) calling the old signature will break.

8. /api/tso authentication check

TSOAction.getTSO() calls executeCheckPassword(request, response) but doesn't check its return value. Other REST actions in the codebase typically verify the result or use checkGlobalAuth. This may allow unauthenticated access to TSO state.

9. TSOClockBackwardException propagates through MasterDaemon

If clock backward exceeds the threshold, a RuntimeException propagates from runAfterCatalogReady(). Depending on how MasterDaemon handles uncaught exceptions in the run loop, this could kill the daemon thread permanently.

10. Concurrency model mixes ReentrantLock and AtomicLong/AtomicBoolean

windowEndTSO is an AtomicLong read without the lock in updateTimestamp(), while isInitialized is an AtomicBoolean set under lock in some paths but checked without it in others. This is technically safe but makes the concurrency contract harder to reason about. Consider documenting which fields are protected by what.


Minor Issues

11. Missing spaces in config descriptions

Several @ConfField description strings are missing trailing spaces before concatenation:

  • "retry 3 times" + "to update""retry 3 timesto update"
  • "5000ms from BDBJE once." → should have space before "will apply"
  • Same pattern in tso_max_get_retry_count, tso_service_window_duration_ms, tso_time_offset_debug_mode

12. FeMetaVersion bump mentioned but not in diff

The PR description mentions bumping VERSION_CURRENT to VERSION_141, but this file is not in the changed files list. Is the version bump missing or in a separate PR?

13. Regression test may silently pass when TSO is disabled

test_tso_api.groovy and test_tso_rowset_commit_tso.groovy — if the test cluster doesn't have enable_tso_feature and enable_tso_persist_journal set, the tests may skip or pass without exercising the feature.


Overall the BE-side plumbing (threading commit_tso through publish, rowset meta, information_schema) is clean. The FE core needs attention on the concurrency/retry model in getTSO() and the config dependency chain.

…riptions to ensure clarity of the descriptive information.
@AntiTopQuark
Copy link
Copy Markdown
Author

run buildall

@AntiTopQuark
Copy link
Copy Markdown
Author

Code Review

Thanks for this substantial feature addition! The TSO design follows established patterns (similar to TiDB/PD). Here are my findings:

Critical Issues

1. Config naming inconsistency — enable_tso_feature vs experimental_enable_feature_tso

The Config field is enable_tso_feature but the error message in PropertyAnalyzer.analyzeEnableTso() says "experimental_enable_tso_feature" (note the word order difference: enable_tso_feature vs enable_feature_tso). This will confuse users trying to set the right config.

2. getTSO() sleeps in the transaction commit path

getTSO() is called from DatabaseTransactionMgr.getCommitTSO() during commit. It contains retry loops that call sleep(200) up to tso_max_get_retry_count (default 10) times. This means a transaction commit can block for up to 2 seconds waiting for the environment. The Env.getCurrentEnv().isReady() and isMaster() checks should be done once before entering the retry loop rather than repeated inside it.

3. Logical counter overflow recovery gap

In getTSO(), when logical > MAX_LOGICAL_COUNTER, the code sleeps and retries. But generateTSO() returns logicalCounter + 1 without resetting the counter — the reset only happens when updateTimestamp() daemon advances the physical time via setTSOPhysical(). If the daemon tick hasn't run yet, all concurrent getTSO() calls will see overflow and fail until the daemon catches up.

4. unprotectedCommitTransaction signature change

unprotectedCommitTransaction and unprotectedCommitTransaction2PC now throw TransactionCommitFailedException. This is a behavioral change — all callers need to be verified. Subclasses (e.g. cloud variants) that override these methods must also update their signatures.

5. calibrateTimestamp() requires enable_tso_persist_journal=true but it defaults to false

Enabling enable_tso_feature=true without also setting enable_tso_persist_journal=true causes the service to perpetually fail calibration in runAfterCatalogReady(). These config dependencies should be enforced or at least clearly documented.

Moderate Issues

6. _async_publish_tasks still uses anonymous std::tuple

storage_engine.h changed the async publish map from pair to tuple<int64_t, int64_t, int64_t> with positional access (std::get<0>, etc.). Since DiscontinuousVersionTablet was correctly introduced as a named struct for the other case, consider doing the same here for consistency.

7. Rowset::make_visible has no default for commit_tso

The signature changed from make_visible(Version) to make_visible(Version, int64_t commit_tso) without a default parameter. The BE txn_manager.h overloads correctly use = -1, but rowset.h does not. Any code (including out-of-tree or test code) calling the old signature will break.

8. /api/tso authentication check

TSOAction.getTSO() calls executeCheckPassword(request, response) but doesn't check its return value. Other REST actions in the codebase typically verify the result or use checkGlobalAuth. This may allow unauthenticated access to TSO state.

9. TSOClockBackwardException propagates through MasterDaemon

If clock backward exceeds the threshold, a RuntimeException propagates from runAfterCatalogReady(). Depending on how MasterDaemon handles uncaught exceptions in the run loop, this could kill the daemon thread permanently.

10. Concurrency model mixes ReentrantLock and AtomicLong/AtomicBoolean

windowEndTSO is an AtomicLong read without the lock in updateTimestamp(), while isInitialized is an AtomicBoolean set under lock in some paths but checked without it in others. This is technically safe but makes the concurrency contract harder to reason about. Consider documenting which fields are protected by what.

Minor Issues

11. Missing spaces in config descriptions

Several @ConfField description strings are missing trailing spaces before concatenation:

  • "retry 3 times" + "to update""retry 3 timesto update"
  • "5000ms from BDBJE once." → should have space before "will apply"
  • Same pattern in tso_max_get_retry_count, tso_service_window_duration_ms, tso_time_offset_debug_mode

12. FeMetaVersion bump mentioned but not in diff

The PR description mentions bumping VERSION_CURRENT to VERSION_141, but this file is not in the changed files list. Is the version bump missing or in a separate PR?

13. Regression test may silently pass when TSO is disabled

test_tso_api.groovy and test_tso_rowset_commit_tso.groovy — if the test cluster doesn't have enable_tso_feature and enable_tso_persist_journal set, the tests may skip or pass without exercising the feature.

Overall the BE-side plumbing (threading commit_tso through publish, rowset meta, information_schema) is clean. The FE core needs attention on the concurrency/retry model in getTSO() and the config dependency chain.

  1. Config naming inconsistency
    enable_tso_feature is the code field name, while experimental_enable_tso_feature is the external config name for the experimental feature. The experimental prefix follows the system's established convention.

  2. getTSO() contains sleep in the transaction commit path
    This is an intentional bounded retry, designed to handle transient states during master‑slave failover or readiness flutter. Moving the checks outside the loop would turn recoverable transient errors into immediate failures. The maximum blocking time is also configurable, not unbounded.

  3. Logical counter overflow "recovery gap"
    This is by design as backpressure, not a bug: on overflow, we wait for the daemon to advance physical time to guarantee monotonicity and persistence semantics. Reset and persistence are not performed on the hot getTSO() path.

  4. unprotectedCommitTransaction* signature change
    This is an explicit upgrade to failure semantics (abort commit if TSO acquisition fails). All in‑tree call chains have been aligned, and I have not found any overridden methods that are not adapted.

  5. calibrateTimestamp() requires enable_tso_persist_journal=true
    This is an intentional fail‑fast constraint to prevent unsafe execution such as “TSO enabled without persistence”. The default value being false is part of a gradual rollout strategy, not a defect.

  6. _async_publish_tasks uses tuple
    This minimizes code changes, and relevant comments have already been added in the code.

  7. Rowset::make_visible has no default parameter
    All in‑tree call sites now explicitly pass commit_tso, so there is no issue causing existing code to fail at compile or runtime.

  8. /api/tso missing authentication check
    The semantics of executeCheckPassword() are to throw an exception directly on failure, rather than relying on return value checking. Many other REST interfaces in the codebase follow the same pattern.

  9. TSOClockBackwardException causes daemon thread exit
    This will not happen. Daemon.run() catches Throwable at the top level, so the thread proceeds to the next round and is not permanently terminated by this exception.

  10. Mixed use of ReentrantLock and Atomic*
    This is intentional layering: high‑frequency read‑only states use atomic variables, while compound state such as globalTimestamp uses locks. The thread‑safety model is clear and correct.

  11. Spaces in config description strings
    This has been fixed.

  12. FeMetaVersion bump mentioned but not present in code
    During earlier code review, it was determined that the FeMetaVersion change was unnecessary. That part of the code has been removed, and the description updated accordingly.

  13. Regression tests “silently pass” when TSO is disabled
    Both regression tests explicitly run ADMIN SET FRONTEND CONFIG at the beginning to enable experimental_enable_tso_feature and enable_tso_persist_journal. They do not run and pass vacuously with TSO disabled.

@AntiTopQuark
Copy link
Copy Markdown
Author

Hi @morningman @gavinchou @morrySnow @dataroaring , Please review this PR again, thank you very much~

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Mar 31, 2026
@morningman
Copy link
Copy Markdown
Contributor

run buildall

@doris-robot
Copy link
Copy Markdown

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.48% (1798/2291)
Line Coverage 64.16% (32272/50296)
Region Coverage 65.01% (16173/24878)
Branch Coverage 55.51% (8624/15536)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Add a global monotonically increasing timestamp service (TSO) for incremental computation in Doris

9 participants