[feature](tso) Add global monotonically increasing Timestamp Oracle(TSO)#61199
[feature](tso) Add global monotonically increasing Timestamp Oracle(TSO)#61199AntiTopQuark wants to merge 33 commits intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
996adea to
66a8f8e
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
66a8f8e to
fa48c2a
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
fa48c2a to
dc6a3a6
Compare
|
run buildall |
…TSO) Signed-off-by: Jingzhe Jia <AntiTopQuark1350@outlook.com>
dc6a3a6 to
9c9dec0
Compare
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
Signed-off-by: Jingzhe Jia <AntiTopQuark1350@outlook.com>
|
run buildall |
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 26912 ms |
|
run buildall |
TPC-H: Total hot run time: 36676 ms |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
Hi, @morningman and @morrySnow .I have fixed all the comments. Please help review this code again. |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
LGTM |
|
PR approved by at least one committer and no changes requested. |
|
Hi, @morningman and @morrySnow . Please help review this code again. |
|
/review |
There was a problem hiding this comment.
Found 2 issues worth addressing before this lands.
- High:
fe/fe-core/src/main/java/org/apache/doris/tso/TSOService.javamakes image persistence ofwindowEndTSOoptional even though calibration already requires journal persistence. Ifenable_tso_feature=true,enable_tso_persist_journal=true, andenable_tso_checkpoint_module=false, a checkpoint can be written without the last reserved TSO window. After that image is taken and older TSO journals are compacted away, the next master restart will reloadwindowEndTSO=0from image and can calibrate from wall clock instead of the last issued window, violating the feature's global monotonicity guarantee. - Medium:
fe/fe-core/src/main/java/org/apache/doris/httpv2/rest/TSOAction.javaexposes cached TSO values even when the feature has been disabled or the service is uncalibrated, because the endpoint does not validate feature state andTSOServicedoes not clear the cached timestamp on disable. That makes/api/tsoreport stale state as if it were authoritative.
Critical checkpoint conclusions:
- Goal of the task: Partially achieved. The PR wires TSO through FE commit, publish RPC, and BE rowset metadata, but the checkpoint/restart path still has a correctness hole.
- Minimality/focus: Broad but reasonable for the feature scope.
- Concurrency: Locking in
TSOServiceand BE publish paths looks consistent; I did not find a deadlock, but the disable path leaves stale in-memory state observable through the API. - Lifecycle/static initialization: Not fully correct. The TSO lifecycle across journal replay, checkpoint, and master restart is incomplete when the image module is disabled.
- Configuration: New configs are present, but
enable_tso_persist_journalandenable_tso_checkpoint_modulecan currently be combined into a broken recovery mode. - Compatibility/incompatible change: New FE op code / meta module handling exists, but the optional checkpoint persistence weakens recovery compatibility semantics.
- Parallel code paths:
commit_tsopropagation appears covered in the main publish path, async publish path, and report-based republish path. - Special conditional checks: The persist-journal precondition in calibration is good; the HTTP API is missing equivalent feature-state validation.
- Test coverage: Good unit/regression coverage for normal flows, but I did not see coverage for checkpoint-disabled restart recovery or API behavior while disabled/uninitialized.
- Observability: Logging and metrics are generally sufficient.
- Transaction/persistence: Not correct yet because the durable source of
windowEndTSOcan disappear after checkpoint compaction. - Data writes/modifications: FE-to-BE
commit_tsoplumbing and BE rowset persistence look consistent in the reviewed paths. - FE-BE variable passing: The new thrift/proto fields are propagated through publish and recovery paths I checked.
- Performance: No obvious hot-path regression stood out during review.
- Other issues: None beyond the two findings above.
| } | ||
|
|
||
| public long saveTSO(CountingDataOutputStream dos, long checksum) throws IOException { | ||
| if (!Config.enable_tso_checkpoint_module) { |
There was a problem hiding this comment.
saveTSO() is gated by enable_tso_checkpoint_module, but calibrateTimestamp() already requires enable_tso_persist_journal=true, so TSO can be actively handing out timestamps while this image module stays disabled. That creates a monotonicity hole after checkpoint compaction: once a checkpoint is taken without windowEndTSO and the pre-checkpoint TSO journal entries are purged, the next master restart reloads windowEndTSO=0 from image and can calibrate from wall clock instead of the last reserved window. In that scenario the new master can generate a TSO lower than one already issued before the checkpoint. This state needs to be made impossible, either by always saving windowEndTSO into the image whenever journal persistence is enabled, or by rejecting the config combination that enables TSO without checkpoint persistence.
There was a problem hiding this comment.
We intentionally keep these two switches decoupled:
-
enable_tso_persist_journalserves as the strict correctness gate for enabling TSO; if this configuration is set to false,calibrateTimestamp()will fail fast. -
enable_tso_checkpoint_moduleis a compatibility switch for image format and rollback scenarios, not the switch for enabling TSO.
Even when the checkpoint module is disabled, window-end events are still replayed from the journal (OP_TSO_TIMESTAMP_WINDOW_END).Therefore, this configuration combination is a compatibility mode chosen by the operator, with known trade‑offs, rather than an unintended state.
cc @dataroaring
| return ResponseEntityBuilder.badRequest("Current FE is not master"); | ||
| } | ||
| // Get current TSO information without increasing it | ||
| long windowEndPhysicalTime = env.getTSOService().getWindowEndTSO(); |
There was a problem hiding this comment.
This endpoint never checks whether TSO is enabled or calibrated before returning getWindowEndTSO() / getCurrentTSO(). That is observable because TSOService.runAfterCatalogReady() only flips isInitialized to false when the feature is disabled; it does not clear the cached globalTimestamp or windowEndTSO. After ADMIN SET FRONTEND CONFIG ('enable_tso_feature'='false'), /api/tso will still return the previous timestamp as if it were current. Please either clear the cached state on disable, or make /api/tso reject requests unless the feature is enabled and calibrated.
There was a problem hiding this comment.
This is an intentional design, not an oversight.
The /api/tso endpoint is intended as a state observation interface, not a timestamp issuance interface.
The actual path for externally allocating TSO is getTSO(), where strict validation is performed on both enable_tso_feature and isInitialized. Requests will be rejected directly if TSO is not enabled or not calibrated.
In contrast, /api/tso only reads in-memory snapshots of the current state (getWindowEndTSO() / getCurrentTSO()) for observation and troubleshooting purposes, and does not advance the timestamp.
cc @dataroaring
dataroaring
left a comment
There was a problem hiding this comment.
Code Review
Thanks for this substantial feature addition! The TSO design follows established patterns (similar to TiDB/PD). Here are my findings:
Critical Issues
1. Config naming inconsistency — enable_tso_feature vs experimental_enable_feature_tso
The Config field is enable_tso_feature but the error message in PropertyAnalyzer.analyzeEnableTso() says "experimental_enable_tso_feature" (note the word order difference: enable_tso_feature vs enable_feature_tso). This will confuse users trying to set the right config.
2. getTSO() sleeps in the transaction commit path
getTSO() is called from DatabaseTransactionMgr.getCommitTSO() during commit. It contains retry loops that call sleep(200) up to tso_max_get_retry_count (default 10) times. This means a transaction commit can block for up to 2 seconds waiting for the environment. The Env.getCurrentEnv().isReady() and isMaster() checks should be done once before entering the retry loop rather than repeated inside it.
3. Logical counter overflow recovery gap
In getTSO(), when logical > MAX_LOGICAL_COUNTER, the code sleeps and retries. But generateTSO() returns logicalCounter + 1 without resetting the counter — the reset only happens when updateTimestamp() daemon advances the physical time via setTSOPhysical(). If the daemon tick hasn't run yet, all concurrent getTSO() calls will see overflow and fail until the daemon catches up.
4. unprotectedCommitTransaction signature change
unprotectedCommitTransaction and unprotectedCommitTransaction2PC now throw TransactionCommitFailedException. This is a behavioral change — all callers need to be verified. Subclasses (e.g. cloud variants) that override these methods must also update their signatures.
5. calibrateTimestamp() requires enable_tso_persist_journal=true but it defaults to false
Enabling enable_tso_feature=true without also setting enable_tso_persist_journal=true causes the service to perpetually fail calibration in runAfterCatalogReady(). These config dependencies should be enforced or at least clearly documented.
Moderate Issues
6. _async_publish_tasks still uses anonymous std::tuple
storage_engine.h changed the async publish map from pair to tuple<int64_t, int64_t, int64_t> with positional access (std::get<0>, etc.). Since DiscontinuousVersionTablet was correctly introduced as a named struct for the other case, consider doing the same here for consistency.
7. Rowset::make_visible has no default for commit_tso
The signature changed from make_visible(Version) to make_visible(Version, int64_t commit_tso) without a default parameter. The BE txn_manager.h overloads correctly use = -1, but rowset.h does not. Any code (including out-of-tree or test code) calling the old signature will break.
8. /api/tso authentication check
TSOAction.getTSO() calls executeCheckPassword(request, response) but doesn't check its return value. Other REST actions in the codebase typically verify the result or use checkGlobalAuth. This may allow unauthenticated access to TSO state.
9. TSOClockBackwardException propagates through MasterDaemon
If clock backward exceeds the threshold, a RuntimeException propagates from runAfterCatalogReady(). Depending on how MasterDaemon handles uncaught exceptions in the run loop, this could kill the daemon thread permanently.
10. Concurrency model mixes ReentrantLock and AtomicLong/AtomicBoolean
windowEndTSO is an AtomicLong read without the lock in updateTimestamp(), while isInitialized is an AtomicBoolean set under lock in some paths but checked without it in others. This is technically safe but makes the concurrency contract harder to reason about. Consider documenting which fields are protected by what.
Minor Issues
11. Missing spaces in config descriptions
Several @ConfField description strings are missing trailing spaces before concatenation:
"retry 3 times" + "to update"→"retry 3 timesto update""5000ms from BDBJE once."→ should have space before"will apply"- Same pattern in
tso_max_get_retry_count,tso_service_window_duration_ms,tso_time_offset_debug_mode
12. FeMetaVersion bump mentioned but not in diff
The PR description mentions bumping VERSION_CURRENT to VERSION_141, but this file is not in the changed files list. Is the version bump missing or in a separate PR?
13. Regression test may silently pass when TSO is disabled
test_tso_api.groovy and test_tso_rowset_commit_tso.groovy — if the test cluster doesn't have enable_tso_feature and enable_tso_persist_journal set, the tests may skip or pass without exercising the feature.
Overall the BE-side plumbing (threading commit_tso through publish, rowset meta, information_schema) is clean. The FE core needs attention on the concurrency/retry model in getTSO() and the config dependency chain.
…riptions to ensure clarity of the descriptive information.
|
run buildall |
|
|
Hi @morningman @gavinchou @morrySnow @dataroaring , Please review this PR again, thank you very much~ |
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Issue Number: close #61198
Related #57921
Problem Summary:
Release note
The service calibrates its initial timestamp at startup and periodically updates it to maintain a time window.
A TSO timestamp encodes the physical time and a logical counter; it is assembled and extracted by the new TSOTimestamp class.
The service exposes two main methods:
getTSO()– returns a new TSO timestamp for transaction commits.getCurrentTSO()– returns the current TSO without bumping the logical counter.experimental_enable_feature_tso– enables/disables the TSO feature.tso_service_update_interval_ms– interval in milliseconds for the TSO service to update its window.max_update_tso_retry_countandmax_get_tso_retry_count– retry limits for updating and obtaining TSOs.tso_service_window_duration_ms– length of the time window allocated by the TSO service.tso_time_offset_debug_mode– debug offset for the physical time.enable_tso_persist_journalandenable_tso_checkpoint_module– persistence switches for edit log and checkpoint.enable_tsowhich can be configured inCREATE TABLEor modified viaALTER TABLE. Only tables withenable_tso = truegenerate commit TSO for transactions; when disabled, commit_tso remains-1.TransactionStatenow fetches a commit TSO fromTSOServicewhen TSO is enabled and stores it in the transaction state andTableCommitInfo.TPartitionVersionInfo.commit_tso), and is persisted with each rowset (see next item).Rowset::make_visiblenow accepts acommit_tsoparameter and writes it toRowsetMeta.RowsetMetaPBadds a new fieldcommit_tsoto persist commit timestamps.information_schema.rowsetsintroduces a new columnCOMMIT_TSOallowing users to query the commit timestamp for each rowset.A new REST endpoint
/api/tsois added for retrieving current TSO information. It returns a JSON payload containing:window_end_physical_time– end of the current TSO time window.current_tso– the current composed 64‑bit TSO.current_tso_physical_timeandcurrent_tso_logical_counter– the decomposed physical and logical parts of the current TSO. This API does not increment the logical counter.New metrics counters (e.g.,
tso_clock_drift_detected,tso_clock_backward_detected,tso_clock_calculated,tso_clock_updated) expose state and health of the TSO service.New unit tests verify
TSOTimestampbit manipulation,TSOServicebehavior, commit TSO propagation, and the/api/tsoendpoint. Regression tests verify that rowset commit timestamps are populated when TSO is enabled and that the API returns increasing TSOs.Impact and Compatibility
experimental_enable_feature_tso. It is disabled by default and can be enabled in front-end configuration. When enabled, old FE versions without this feature cannot replay edit log entries containing TSO operations; therefore upgrade all FEs before enabling.enable_tsototrue. Tables with TSO enabled will produce commit TSO for each rowset and may require downstream consumers to handle the newcommit_tsofield./api/tsoto inspect current TSO values. No existing API is modified.Check List (For Author)
Test
Behavior changed:
Does this need documentation?
docs: add TSO-related functions description and documentation. doris-website#3454
Check List (For Reviewer who merge this PR)