[Feature] Implement string function ord with unicode alias following DuckDB semantics #60409

Open
Copilot wants to merge 5 commits into master from copilot/implement-string-ord-function

Contributor

Copilot AI commented Feb 1, 2026

Implements ord(string) function that returns the Unicode code point of the first character, following DuckDB semantics. Adds unicode as an alias.

Backend

  • Added StringOrd in function_string.cpp with proper UTF-8 decoding (1-4 byte sequences)
  • Validates continuation bytes, returns 0 for empty/invalid UTF-8
  • Returns Int64 to accommodate full Unicode range (U+0000 to U+10FFFF)
  • Registered unicode alias via register_alias()
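
The decoding steps listed above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the PR's actual StringOrd code: the name `first_code_point` is hypothetical, and the function only checks the lead byte, the sequence length, and the continuation bytes, so overlong encodings and surrogate code points are not rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper (assumption, not taken from function_string.cpp).
// Returns the code point of the first UTF-8 character in `s`, or 0 when
// `s` is empty, truncated, or starts with an invalid byte sequence.
inline std::int64_t first_code_point(std::string_view s) {
    if (s.empty()) return 0;
    const auto b0 = static_cast<unsigned char>(s[0]);

    std::size_t len = 0;   // total bytes in the sequence
    std::uint32_t cp = 0;  // payload bits taken from the lead byte
    if (b0 < 0x80) {
        return b0;                       // 1 byte: plain ASCII
    } else if ((b0 & 0xE0) == 0xC0) {
        len = 2; cp = b0 & 0x1F;         // 110xxxxx
    } else if ((b0 & 0xF0) == 0xE0) {
        len = 3; cp = b0 & 0x0F;         // 1110xxxx
    } else if ((b0 & 0xF8) == 0xF0) {
        len = 4; cp = b0 & 0x07;         // 11110xxx
    } else {
        return 0;                        // stray continuation or invalid lead byte
    }

    if (s.size() < len) return 0;        // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        const auto b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0;  // not a 10xxxxxx continuation byte
        cp = (cp << 6) | (b & 0x3Fu);
    }
    return cp;
}
```

With this sketch, `"A"` yields 65, the three-byte sequence `E4 BD A0` yields 20320, and an empty or truncated input yields 0.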

Frontend

  • Added Ord.java scalar function returning BigIntType
  • Registered both ord and unicode in BuiltinScalarFunctions.java
  • Added visitor method in ScalarFunctionVisitor.java

FE Constant Folding

  • Added @ExecFunction implementations in StringArithmetic.java for compile-time evaluation

Tests

  • BE unit tests covering ASCII, 2/3/4-byte UTF-8, empty string, null
  • Regression tests in query_p0 and nereids_p0 suites
  • Fold constant tests in fold_constant_string_arithmatic.groovy

Example

SELECT ord('A');      -- 65
SELECT ord('你');     -- 20320
SELECT ord('😀');     -- 128512
SELECT unicode('A');  -- 65 (alias)
SELECT ord('');       -- 0

Key difference from ascii()

ascii() returns the first byte value; ord()/unicode() decodes UTF-8 and returns the actual Unicode code point.



@Thearas
Contributor

Thearas commented Feb 1, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copilot AI and others added 2 commits February 1, 2026 10:21
Co-authored-by: zclllyybb <61408379+zclllyybb@users.noreply.github.com>
Co-authored-by: zclllyybb <61408379+zclllyybb@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add String function ord implementation in DuckDB" to "[Feature] Implement string function ord following DuckDB semantics" Feb 1, 2026
Copilot AI requested a review from zclllyybb February 1, 2026 10:25
@zclllyybb
Contributor

run buildall

@doris-robot

TPC-H: Total hot run time: 31748 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a9a22399386f66d4733d503b4d7fce0debdd68ca, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17635	5403	5022	5022
q2	2067	352	188	188
q3	10150	1317	726	726
q4	10196	794	312	312
q5	7544	2197	1878	1878
q6	198	177	149	149
q7	901	744	604	604
q8	9257	1353	1081	1081
q9	5213	4758	4868	4758
q10	6843	1957	1552	1552
q11	533	288	291	288
q12	336	375	231	231
q13	17776	4033	3252	3252
q14	243	244	224	224
q15	909	825	811	811
q16	680	683	613	613
q17	652	766	520	520
q18	6644	6364	7480	6364
q19	1287	1055	624	624
q20	415	382	256	256
q21	2895	2137	2007	2007
q22	375	341	288	288
Total cold run time: 102749 ms
Total hot run time: 31748 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5653	5499	5461	5461
q2	260	342	254	254
q3	2325	2866	2462	2462
q4	1547	1842	1650	1650
q5	4764	4608	4602	4602
q6	227	182	135	135
q7	2021	1885	1842	1842
q8	2521	2407	2339	2339
q9	7618	7909	7548	7548
q10	2946	3009	2472	2472
q11	542	455	437	437
q12	627	693	569	569
q13	3521	4052	3233	3233
q14	269	286	257	257
q15	837	804	795	795
q16	636	700	640	640
q17	1073	1247	1277	1247
q18	7515	7247	7276	7247
q19	799	787	791	787
q20	1956	2040	1885	1885
q21	4439	4194	4103	4103
q22	573	534	521	521
Total cold run time: 52669 ms
Total hot run time: 50486 ms

@doris-robot

ClickBench: Total hot run time: 28.61 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit a9a22399386f66d4733d503b4d7fce0debdd68ca, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.06	0.05	0.05
query2	0.09	0.04	0.04
query3	0.25	0.09	0.09
query4	1.61	0.11	0.12
query5	0.27	0.25	0.25
query6	1.17	0.68	0.66
query7	0.04	0.03	0.03
query8	0.05	0.04	0.04
query9	0.54	0.50	0.49
query10	0.55	0.52	0.54
query11	0.15	0.10	0.09
query12	0.15	0.10	0.10
query13	0.63	0.61	0.61
query14	1.06	1.06	1.04
query15	0.88	0.86	0.87
query16	0.39	0.40	0.42
query17	1.17	1.14	1.16
query18	0.23	0.21	0.20
query19	2.11	1.96	2.05
query20	0.02	0.01	0.02
query21	15.41	0.26	0.14
query22	5.29	0.05	0.05
query23	16.05	0.27	0.11
query24	1.48	0.60	0.91
query25	0.08	0.11	0.05
query26	0.14	0.12	0.13
query27	0.06	0.06	0.06
query28	5.05	1.13	0.96
query29	12.54	3.91	3.19
query30	0.29	0.14	0.11
query31	2.82	0.63	0.41
query32	3.24	0.60	0.50
query33	3.22	3.26	3.25
query34	16.30	5.38	4.72
query35	4.84	4.78	4.81
query36	0.65	0.50	0.49
query37	0.12	0.08	0.07
query38	0.06	0.04	0.04
query39	0.05	0.03	0.04
query40	0.19	0.16	0.15
query41	0.08	0.04	0.03
query42	0.04	0.03	0.03
query43	0.05	0.04	0.03
Total cold run time: 99.47 s
Total hot run time: 28.61 s

@doris-robot

BE UT Coverage Report

Increment line coverage 94.74% (36/38) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.49% (19276/36725)
Line Coverage 35.96% (179098/497997)
Region Coverage 32.36% (138862/429072)
Branch Coverage 33.32% (60105/180377)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 94.74% (36/38) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.50% (25741/36000)
Line Coverage 54.15% (269033/496820)
Region Coverage 51.83% (224664/433482)
Branch Coverage 53.14% (96235/181109)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 92.31% (12/13) 🎉
Increment coverage report
Complete coverage report

void register_function_string(SimpleFunctionFactory& factory) {
factory.register_function<FunctionStringParseDataSize>();
factory.register_function<FunctionStringASCII>();
factory.register_function<FunctionStringOrd>();
Contributor

@linrrzqqq linrrzqqq Feb 3, 2026

In DuckDB, ord is just an alias for unicode; for details, see duckdb.string_function.

So I think we could add a unicode alias here.

Co-authored-by: zclllyybb <61408379+zclllyybb@users.noreply.github.com>
Copilot AI changed the title from "[Feature] Implement string function ord following DuckDB semantics" to "Implement string function ord with unicode alias" Feb 9, 2026
@linrrzqqq
Contributor

Add constant folding implementation to doris/fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/executable/StringArithmetic.java

Co-authored-by: zclllyybb <61408379+zclllyybb@users.noreply.github.com>
Copilot AI changed the title from "Implement string function ord with unicode alias" to "[Feature] Implement string function ord with unicode alias following DuckDB semantics" Feb 9, 2026
@zclllyybb
Contributor

run buildall

@zclllyybb zclllyybb marked this pull request as ready for review February 10, 2026 01:45
@doris-robot

TPC-H: Total hot run time: 30407 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 5b613d754ea7e82d2501687fc3a597dbb0f6865e, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17622	4436	4308	4308
q2	2026	357	231	231
q3	10172	1328	714	714
q4	10202	784	310	310
q5	7510	2212	1928	1928
q6	194	175	144	144
q7	869	724	615	615
q8	9273	1463	1172	1172
q9	4702	4664	4625	4625
q10	6775	1973	1559	1559
q11	514	306	290	290
q12	344	384	229	229
q13	17803	4033	3225	3225
q14	238	237	211	211
q15	913	798	804	798
q16	703	686	630	630
q17	692	821	538	538
q18	6392	5814	5881	5814
q19	1234	986	619	619
q20	518	495	385	385
q21	2541	1842	1775	1775
q22	359	322	287	287
Total cold run time: 101596 ms
Total hot run time: 30407 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4355	4337	4357	4337
q2	259	344	256	256
q3	2125	2700	2219	2219
q4	1364	1741	1303	1303
q5	4297	4176	4216	4176
q6	214	178	134	134
q7	1844	1789	1679	1679
q8	2454	2695	2459	2459
q9	7574	7599	7545	7545
q10	2903	3242	2687	2687
q11	602	474	469	469
q12	736	826	658	658
q13	3980	4334	3417	3417
q14	277	320	285	285
q15	871	791	798	791
q16	685	748	686	686
q17	1153	1330	1396	1330
q18	8244	8055	7829	7829
q19	939	871	845	845
q20	2213	2117	1979	1979
q21	4882	4439	4500	4439
q22	609	519	496	496
Total cold run time: 52580 ms
Total hot run time: 50019 ms

@doris-robot

ClickBench: Total hot run time: 28.33 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 5b613d754ea7e82d2501687fc3a597dbb0f6865e, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.05	0.05	0.05
query2	0.10	0.04	0.04
query3	0.25	0.09	0.08
query4	1.61	0.11	0.11
query5	0.28	0.27	0.25
query6	1.16	0.66	0.67
query7	0.03	0.03	0.03
query8	0.06	0.04	0.04
query9	0.57	0.50	0.49
query10	0.55	0.54	0.56
query11	0.15	0.10	0.10
query12	0.14	0.11	0.10
query13	0.63	0.62	0.62
query14	1.08	1.07	1.06
query15	0.88	0.86	0.88
query16	0.39	0.40	0.39
query17	1.10	1.11	1.11
query18	0.23	0.21	0.21
query19	2.01	2.02	2.07
query20	0.02	0.02	0.02
query21	15.42	0.24	0.15
query22	5.54	0.05	0.05
query23	15.85	0.28	0.11
query24	1.51	0.24	0.18
query25	0.09	0.06	0.06
query26	0.15	0.14	0.13
query27	0.06	0.06	0.05
query28	3.92	1.15	0.97
query29	12.55	3.84	3.16
query30	0.28	0.14	0.12
query31	2.82	0.65	0.40
query32	3.24	0.60	0.50
query33	3.25	3.22	3.33
query34	15.97	5.41	4.75
query35	4.77	4.86	4.82
query36	0.65	0.51	0.49
query37	0.11	0.08	0.07
query38	0.09	0.04	0.04
query39	0.05	0.03	0.03
query40	0.20	0.16	0.16
query41	0.09	0.04	0.03
query42	0.04	0.03	0.03
query43	0.05	0.03	0.03
Total cold run time: 97.99 s
Total hot run time: 28.33 s

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 5.26% (1/19) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 94.87% (37/39) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.71% (19443/36884)
Line Coverage 36.20% (180990/499941)
Region Coverage 32.58% (140492/431207)
Branch Coverage 33.62% (60875/181079)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 94.87% (37/39) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.42% (26534/36140)
Line Coverage 56.46% (281569/498703)
Region Coverage 53.87% (234664/435589)
Branch Coverage 55.69% (101229/181783)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 78.95% (15/19) 🎉
Increment coverage report
Complete coverage report

@github-actions
Contributor

PR approved by anyone and no changes requested.

@github-actions github-actions Bot added the approved label (Indicates a PR has been approved by one committer.) Feb 10, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@Mryange
Contributor

Mryange commented Feb 10, 2026

The implementation doesn't feel great to me.
You could just use UTF8_BYTE_LENGTH to get the length of this character and then convert it to int64.
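
One possible reading of this suggestion is a table-driven decode: look up the sequence length from the lead byte, then fold the remaining bytes into an integer. The sketch below is only an illustration under that assumption. `kLeadByteLength` and `ord_via_table` are hypothetical names, and the table is a local stand-in rather than Doris's UTF8_BYTE_LENGTH, whose exact contents are not shown in this thread.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical stand-in for a lead-byte length table like UTF8_BYTE_LENGTH.
// Entry b is the sequence length announced by lead byte b, or 0 when b
// cannot start a sequence (continuation bytes 0x80-0xBF, and 0xF8-0xFF).
constexpr std::array<std::uint8_t, 256> make_lead_byte_length() {
    std::array<std::uint8_t, 256> table{};
    for (std::size_t b = 0; b < 256; ++b) {
        if (b < 0x80)      table[b] = 1;
        else if (b < 0xC0) table[b] = 0;
        else if (b < 0xE0) table[b] = 2;
        else if (b < 0xF0) table[b] = 3;
        else if (b < 0xF8) table[b] = 4;
        else               table[b] = 0;
    }
    return table;
}

inline constexpr auto kLeadByteLength = make_lead_byte_length();

// Decodes the first character of `s` using the table; returns 0 for empty,
// truncated, or invalid input.
inline std::int64_t ord_via_table(std::string_view s) {
    if (s.empty()) return 0;
    const auto b0 = static_cast<unsigned char>(s[0]);
    const std::size_t len = kLeadByteLength[b0];
    if (len == 0 || s.size() < len) return 0;
    if (len == 1) return b0;

    // Lead-byte payload mask: 0x1F, 0x0F, 0x07 for lengths 2, 3, 4.
    std::uint32_t cp = b0 & (0xFFu >> (len + 1));
    for (std::size_t i = 1; i < len; ++i) {
        const auto b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3Fu);
    }
    return cp;
}
```

The table replaces the chain of lead-byte comparisons with a single lookup; whether that is what the reviewer had in mind is an assumption.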


public static final List<FunctionSignature> SIGNATURES = ImmutableList.of(
FunctionSignature.ret(BigIntType.INSTANCE).args(VarcharType.SYSTEM_DEFAULT),
FunctionSignature.ret(BigIntType.INSTANCE).args(StringType.INSTANCE)
Contributor

Why does it return int64? From the implementation, it looks like int32 would be enough.

Contributor

@Mryange Mryange Feb 10, 2026

ScalarFunction UnicodeFun::GetFunction() {
	return ScalarFunction({LogicalType::VARCHAR}, LogicalType::INTEGER,
	                      ScalarFunction::UnaryFunction<string_t, int32_t, UnicodeOperator>);
}

DuckDB returns int32.
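
The range argument behind this comment can be checked at compile time. The largest Unicode code point, U+10FFFF (1,114,111), needs 21 bits, which is well below the int32 maximum of 2,147,483,647. The snippet below is a minimal sketch of that check; `kMaxCodePoint` and `fits_in_int32` are hypothetical names introduced for illustration.

```cpp
#include <cstdint>
#include <limits>

// Highest code point defined by Unicode (U+10FFFF).
constexpr std::int64_t kMaxCodePoint = 0x10FFFF;

// Compile-time check that every code point is representable as int32_t.
static_assert(kMaxCodePoint <= std::numeric_limits<std::int32_t>::max(),
              "U+10FFFF must fit in a signed 32-bit integer");

// Hypothetical helper: true when `value` lies within the int32_t range.
constexpr bool fits_in_int32(std::int64_t value) {
    return value >= std::numeric_limits<std::int32_t>::min() &&
           value <= std::numeric_limits<std::int32_t>::max();
}
```

Under this check, any valid result of ord() or unicode() could be stored in an int32 without loss.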


Labels

approved (Indicates a PR has been approved by one committer.), reviewed

7 participants