[Version 11.0] Feature support for UTF-8 string literals #1610
RexJaeschke wants to merge 4 commits into draft-v11
<!-- markdownlint-disable MD028 -->
<!-- markdownlint-enable MD028 -->
> *Note*: As `ReadOnlySpan<byte>` is a ref struct type, a UTF-8 string literal cannot be converted to `object` or used as a type parameter ([§16.2.3]( structs.md#1623-ref-modifier)). *end note*
`allows ref struct` was added in C# 13, and this PR is for C# 11, so that part is OK…
But the wording is imprecise. A string literal is syntax rather than a type so it couldn't be used as a type parameter (or rather a type argument) anyway.
This overload of the binary `+` operator performs concatenation of UTF-8 string literals and the concatenated results thereof (which is much more restrictive than for UTF-16 string concatenation). The operands shall be UTF-8-encoded values.
The result of the operator is a ReadOnlySpan<byte> that consists of the bytes of the left operand followed by the bytes of the right operand. The result may be used directly as an operand to the UTF-8 string concatenation operator.
From this normative text, it is not clear whether redundant parentheses are allowed.

    using System;
    public class C {
        public void M() {
            ReadOnlySpan<byte> a
                = (( "A"u8 + "a"u8 ))
                + (( "B"u8 + "b"u8 ));
        }
    }

The type of a *String_Literal* is `string`.
A *String_Literal* that does not contain a *Utf8_Suffix* is a ***UTF-16 string literal***, whose type is `string`.

A *String_Literal* that contains a *Utf8_Suffix* is a ***UTF-8 string literal***, whose type is `System.ReadOnlySpan<byte>` (an indexable collection type), and whose value contains a UTF-8 byte representation of the string. A null terminator (a byte with value zero) is placed beyond the last byte in memory (and outside the length of the `ReadOnlySpan<byte>`) in order to support scenarios that expect null-terminated byte strings. A UTF-8 string literal is not a constant. A UTF-8 string literal without its *Utf8_Suffix* shall be valid UTF-16. (For example, `"\uDC00\uDD00"u8` is ill-formed as one low surrogate cannot be followed by another.)
Should add a new subclause within 16.5.15 Safe context constraint, saying that when a UTF-8 string literal has a ReadOnlySpan<byte> value, the safe-context of that value is caller-context. That would allow it to be returned out of the function, like so:

    using System;
    public class C {
        public ReadOnlySpan<byte> M() {
            return "xyz"u8;
        }
    }

Concatenation of UTF-8 string literals would then follow the safe-context rule in 16.5.15.5 Operators. As the safe-context of both operands would be caller-context, the safe-context of the result would be the same, without having to be specified separately for this case.
The following would then likewise be valid:

    public class C {
        public ref readonly byte M() {
            return ref "xyz"u8[0];
        }
    }

This is Rex's adaptation of the corresponding MS proposal.
Back in 2025, when I was researching this PR, I created PR #1103 to tidy up a bunch of Unicode-related things. That PR is complementary to this one and the two do not overlap.
Note 1 re overlap with raw string literal PR
Some spec changes required to support this feature touch the same section(s) as does PR #1599 for V11’s raw string literals. We need to make sure they coexist, depending on which of the two gets merged first.
What is missing from this PR, and needs to be added once the raw string PR has been merged, is the two Utf8_Suffixes, as shown below:
Note 2 re Lowering
Re the “Lowering” section of the MS proposal, that appears to be about implementation details. In any event, Rex did not add anything from there to the spec.
Note 3 re “Unresolved questions”
Re the “Unresolved questions” section of the MS proposal, no text was added w.r.t. implicit conversion from `string`. Should something be added?