
[enhance](cloud) proactively sync tablet meta after alter#61585

Open
luwei16 wants to merge 1 commit into apache:master from luwei16:luwei/cloud-sync-tablet-meta-master-20260320

luwei16 (Contributor) commented Mar 20, 2026

FE now sends a sync_tablet_meta RPC to all alive cloud backends after an alter has updated tablet meta in the meta service. The request carries the affected tablet ids and is dispatched as a best-effort notification, so the success of the alter still depends only on the meta service update, not on backend acknowledgements.
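A minimal sketch of the FE-side contract described above, assuming a synchronous send for simplicity (the real FE is Java and likely dispatches asynchronously). Every name here (`SyncTabletMetaRequest`, `Backend`, `SendFn`, `notify_alive_backends`, `alter_table`) is hypothetical. The point it illustrates: notification failures are counted but never change the alter result.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical request shape: carries only the affected tablet ids.
struct SyncTabletMetaRequest {
    std::vector<int64_t> tablet_ids;
};

// Hypothetical view of a cloud backend as known to FE.
struct Backend {
    std::string host;
    bool alive;
};

// Hypothetical RPC stub: returns true if the backend acknowledged.
using SendFn = std::function<bool(const Backend&, const SyncTabletMetaRequest&)>;

// Best-effort fan-out to every alive backend. Failures are counted (e.g. for
// a log line) but never propagated to the caller.
int notify_alive_backends(const std::vector<Backend>& backends,
                          const std::vector<int64_t>& tablet_ids,
                          const SendFn& send) {
    assert(!tablet_ids.empty());  // assumption: only called when tablets changed
    SyncTabletMetaRequest req{tablet_ids};
    int failures = 0;
    for (const Backend& be : backends) {
        if (!be.alive) continue;         // dead backends are not contacted
        if (!send(be, req)) ++failures;  // swallow the error: best effort
    }
    return failures;
}

// The alter result is decided by the meta service update alone.
bool alter_table(bool meta_service_update_ok,
                 const std::vector<Backend>& backends,
                 const std::vector<int64_t>& tablet_ids,
                 const SendFn& send) {
    if (!meta_service_update_ok) return false;  // the only failure path
    notify_alive_backends(backends, tablet_ids, send);
    return true;  // succeeds even if every notification failed
}
```

Keeping the notification off the success path means a slow or unreachable backend cannot fail or block an alter; a backend that misses the notification simply keeps relying on its existing meta refresh path.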

BE handles the RPC by refreshing meta only for tablets that are already cached locally. Uncached tablets are skipped, which avoids polluting the tablet cache while still fixing stale compaction policy and related tablet meta on active compute clusters. The RPC also returns synced/skipped/failed counts and exposes bvar counters for observability.
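A minimal sketch of the BE-side rule, assuming the tablet cache can be modeled as a plain `std::unordered_map` (no locking shown) and that a hypothetical `FetchFn` returns the latest meta from the meta service or `std::nullopt` on error. The `std::atomic` globals stand in for the bvar counters so the sketch compiles without brpc; all names are illustrative, not the PR's actual code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a tablet's cached meta.
struct TabletMeta {
    std::string compaction_policy;
};

struct SyncResult {
    int64_t synced = 0;
    int64_t skipped = 0;
    int64_t failed = 0;
};

// Stand-ins for the bvar counters (std::atomic so this builds without bvar).
std::atomic<int64_t> g_synced_total{0};
std::atomic<int64_t> g_skipped_total{0};
std::atomic<int64_t> g_failed_total{0};

// Hypothetical fetcher: latest meta from the meta service, or nullopt on error.
using FetchFn = std::function<std::optional<TabletMeta>(int64_t)>;

SyncResult sync_tablet_meta(std::unordered_map<int64_t, TabletMeta>& cache,
                            const std::vector<int64_t>& tablet_ids,
                            const FetchFn& fetch) {
    SyncResult r;
    for (int64_t id : tablet_ids) {
        auto it = cache.find(id);
        if (it == cache.end()) {  // not cached: skip instead of loading it
            ++r.skipped;
            continue;
        }
        std::optional<TabletMeta> latest = fetch(id);
        if (!latest) {            // fetch failed: keep the old meta
            ++r.failed;
            continue;
        }
        it->second = *latest;     // refresh the cached entry in place
        ++r.synced;
    }
    // Every requested id lands in exactly one bucket.
    assert(r.synced + r.skipped + r.failed ==
           static_cast<int64_t>(tablet_ids.size()));
    g_synced_total += r.synced;
    g_skipped_total += r.skipped;
    g_failed_total += r.failed;
    return r;
}
```

Skipping uncached ids is what keeps the notification cheap on clusters that never touched the table: nothing is loaded, and those tablets will read fresh meta whenever they are first opened.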

This change adds FE and BE unit tests and a cloud regression suite. The regression suite covers cached and uncached tablets across multiple clusters, the negative path with proactive notification disabled, and the version-limit scenario: a size_based table hits the too-many-versions limit, is altered to time_series, and can accept new writes immediately after the alter.
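A toy model of why the version-limit scenario needs the proactive sync, assuming write admission is checked against the BE's cached compaction policy and that time_series allows more versions than size_based. The limits in `max_versions_for` (2000 and 20000) and all names are hypothetical placeholders, not values taken from this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical per-policy version limits, for illustration only.
int64_t max_versions_for(const std::string& policy) {
    return policy == "time_series" ? 20000 : 2000;
}

// Hypothetical cached view of a tablet on a BE.
struct CachedTablet {
    std::string compaction_policy;
    int64_t version_count;
};

// Write admission evaluated against the BE's *cached* meta.
bool accept_write(const CachedTablet& t) {
    assert(t.version_count >= 0);
    return t.version_count < max_versions_for(t.compaction_policy);
}
```

Until the cached `compaction_policy` is refreshed, the BE keeps applying the old limit even though the meta service already records time_series; refreshing the cached entry is what lets writes through right after the alter.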

hello-stephen (Contributor)

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

luwei16 (Contributor, Author) commented Mar 20, 2026

run buildall

doris-robot
Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.63% (1796/2284)
Line Coverage 64.45% (32309/50130)
Region Coverage 65.34% (16178/24760)
Branch Coverage 55.76% (8618/15456)

doris-robot
TPC-H: Total hot run time: 26984 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 324fe418b174d3b18138abe83fbe7f13cef0cefb, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17663	4590	4341	4341
q2	q3	10706	822	526	526
q4	4729	359	248	248
q5	8067	1200	1021	1021
q6	237	174	149	149
q7	814	877	674	674
q8	10549	1500	1379	1379
q9	6529	4984	4695	4695
q10	6378	1925	1647	1647
q11	470	247	245	245
q12	743	587	460	460
q13	18055	2937	2164	2164
q14	235	241	214	214
q15	q16	741	745	671	671
q17	744	871	430	430
q18	6404	5447	5291	5291
q19	1132	990	621	621
q20	522	499	368	368
q21	4843	2067	1556	1556
q22	366	325	284	284
Total cold run time: 99927 ms
Total hot run time: 26984 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4609	4613	4655	4613
q2	q3	3914	4396	3860	3860
q4	891	1217	786	786
q5	4047	4412	4333	4333
q6	178	171	142	142
q7	1784	1665	1638	1638
q8	2533	2719	2581	2581
q9	7651	7630	7390	7390
q10	3728	3991	3617	3617
q11	596	522	432	432
q12	508	596	450	450
q13	2777	3201	2313	2313
q14	284	296	269	269
q15	q16	717	758	720	720
q17	1192	1398	1418	1398
q18	7284	6832	6653	6653
q19	1025	972	928	928
q20	2055	2331	2008	2008
q21	3924	3487	3349	3349
q22	473	427	375	375
Total cold run time: 50170 ms
Total hot run time: 47855 ms

doris-robot
TPC-DS: Total hot run time: 167520 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 324fe418b174d3b18138abe83fbe7f13cef0cefb, data reload: false

query5	4330	639	507	507
query6	325	223	205	205
query7	4210	467	263	263
query8	341	236	222	222
query9	8724	2735	2693	2693
query10	509	364	323	323
query11	6990	5083	4845	4845
query12	180	132	121	121
query13	1257	440	322	322
query14	5729	3658	3453	3453
query14_1	2794	2773	2783	2773
query15	205	190	170	170
query16	966	471	468	468
query17	898	741	635	635
query18	2450	455	368	368
query19	220	213	194	194
query20	137	127	130	127
query21	213	134	109	109
query22	13265	13794	14768	13794
query23	16232	15940	15710	15710
query23_1	15677	15538	15319	15319
query24	7236	1624	1241	1241
query24_1	1223	1217	1247	1217
query25	602	496	455	455
query26	1252	262	158	158
query27	2777	484	300	300
query28	4523	1834	1799	1799
query29	845	587	500	500
query30	302	227	191	191
query31	1014	949	880	880
query32	85	78	71	71
query33	533	350	288	288
query34	902	875	524	524
query35	655	699	591	591
query36	1129	1138	997	997
query37	132	93	128	93
query38	2911	2931	2810	2810
query39	866	838	813	813
query39_1	798	788	795	788
query40	232	152	137	137
query41	62	63	59	59
query42	261	255	257	255
query43	235	243	215	215
query44	
query45	198	188	183	183
query46	866	975	608	608
query47	2110	2807	2064	2064
query48	301	313	226	226
query49	632	466	376	376
query50	680	273	214	214
query51	4128	4072	3970	3970
query52	262	267	261	261
query53	288	339	282	282
query54	295	268	264	264
query55	89	84	84	84
query56	309	324	308	308
query57	1934	1884	1641	1641
query58	284	273	268	268
query59	2809	2928	2760	2760
query60	348	323	346	323
query61	153	152	189	152
query62	615	592	538	538
query63	308	276	276	276
query64	5024	1268	983	983
query65	
query66	1473	453	349	349
query67	24206	24207	24078	24078
query68	
query69	412	314	280	280
query70	996	951	1004	951
query71	339	308	294	294
query72	2897	2700	2474	2474
query73	539	538	316	316
query74	9593	9557	9379	9379
query75	2849	2752	2463	2463
query76	2309	1030	700	700
query77	367	397	308	308
query78	10837	11135	10419	10419
query79	1086	758	570	570
query80	1174	625	536	536
query81	525	259	222	222
query82	1354	150	116	116
query83	339	262	237	237
query84	295	122	107	107
query85	901	497	503	497
query86	410	303	291	291
query87	3108	3086	3012	3012
query88	3545	2642	2643	2642
query89	428	360	347	347
query90	1890	177	166	166
query91	177	164	137	137
query92	79	67	72	67
query93	912	832	490	490
query94	525	331	300	300
query95	579	405	313	313
query96	652	510	223	223
query97	2495	2531	2386	2386
query98	254	227	225	225
query99	1025	1000	927	927
Total cold run time: 248388 ms
Total hot run time: 167520 ms
