[Version 11.0] Feature support for UTF-8 string literals #1610

Draft
RexJaeschke wants to merge 4 commits into draft-v11 from v11-UTF-8-string-literals
@RexJaeschke RexJaeschke commented Mar 18, 2026

This is Rex's adaptation of the corresponding MS proposal.

Back in 2025, when I was researching this PR, I created PR #1103 to tidy up a bunch of Unicode-related things. That PR is complementary to this one and the two do not overlap.

Note 1 re overlap with raw string literal PR

Some spec changes required to support this feature touch the same section(s) as does PR #1599 for V11’s raw string literals. We need to make sure they coexist, depending on which of the two gets merged first.

What is missing from this PR and needs to be added once the raw string PR has been added is the addition of the two Utf8_Suffixes, as shown below:

fragment Single_Line_Raw_String_Literal
    : Raw_String_Literal_Delimiter  Raw_String_Literal Content
      Raw_String_Literal_Delimiter >>>Utf8_Suffix?<<<
    ;

fragment Multi_Line_Raw_String_Literal
    : Raw_String_Literal_Delimiter Whitespace* New_Line
      (Raw_String_Literal Content | New_Line)* New_Line
      Whitespace* Raw_String_Literal_Delimiter >>>Utf8_Suffix?<<<
    ;
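
For readers less familiar with the grammar notation, here is a minimal sketch of what the suffixed raw forms look like in source, assuming a C# 11 compiler that supports both features. The class and method names are hypothetical and exist only for illustration.

```csharp
using System;
using System.Text;

public static class RawUtf8Demo
{
    // Single-line raw string literal followed by a Utf8_Suffix.
    // Embedded quotes need no escaping inside a raw literal.
    public static ReadOnlySpan<byte> SingleLine() => """He said "hi"."""u8;

    // Multi-line raw string literal followed by a Utf8_Suffix.
    // The indentation of the closing delimiter is stripped from each line.
    public static ReadOnlySpan<byte> MultiLine() => """
        line one
        line two
        """u8;

    public static void Main()
    {
        // Decode the bytes back to text to show the content round-trips.
        Console.WriteLine(Encoding.UTF8.GetString(SingleLine()));
        Console.WriteLine(Encoding.UTF8.GetString(MultiLine()));
    }
}
```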

Note 2 re Lowering

Re the “Lowering” section of the MS proposal, that appears to be about implementation details. In any event, Rex did not add anything from there to the spec.

Note 3 re “Unresolved questions”

Re the “Unresolved questions” section of the MS proposal, no text was added w.r.t. implicit conversion from `string`. Should something be added?

@RexJaeschke RexJaeschke added this to the C# 11 milestone Mar 18, 2026
@RexJaeschke RexJaeschke added type: feature This issue describes a new feature Review: pending Proposal is available for review labels Mar 18, 2026
@RexJaeschke RexJaeschke marked this pull request as draft March 18, 2026 14:40
<!-- markdownlint-disable MD028 -->

<!-- markdownlint-enable MD028 -->
> *Note*: As `ReadOnlySpan<byte>` is a ref struct type, a UTF-8 string literal cannot be converted to `object` or used as a type parameter ([§16.2.3]( structs.md#1623-ref-modifier)). *end note*

`allows ref struct` was added in C# 13, and this PR is for C# 11, so that part is OK…

But the wording is imprecise. A string literal is syntax rather than a type, so it couldn't be used as a type parameter (or rather a type argument) anyway.

This overload of the binary `+` operator performs concatenation of UTF-8 string literals and the concatenated results thereof (which is much more restrictive than for UTF-16 string concatenation). The operands shall be UTF-8-encoded values.
The result of the operator is a `ReadOnlySpan<byte>` that consists of the bytes of the left operand followed by the bytes of the right operand. The result may be used directly as an operand to the UTF-8 string concatenation operator.
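
As an illustration of the quoted text, the following is a minimal sketch of chained concatenation, assuming a C# 11 compiler. Because `+` is left-associative, the result of the first concatenation is itself the left operand of the second. The class name is hypothetical.

```csharp
using System;

public static class Utf8ConcatDemo
{
    public static void Main()
    {
        // ("Hello, "u8 + "world"u8) is evaluated first; its result is
        // then used directly as an operand of the second concatenation.
        ReadOnlySpan<byte> greeting = "Hello, "u8 + "world"u8 + "!"u8;

        // The bytes of the left operand are followed by those of the right.
        Console.WriteLine(greeting.Length);                           // 13
        Console.WriteLine(greeting.SequenceEqual("Hello, world!"u8)); // True
    }
}
```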

From this normative text, it is not clear whether redundant parentheses are allowed.

using System;
public class C {
    public void M() {
        ReadOnlySpan<byte> a
            = (( "A"u8 + "a"u8 ))
            + (( "B"u8 + "b"u8 ));
    }
}

The type of a *String_Literal* is `string`.
A *String_Literal* that does not contain a *Utf8_Suffix* is a ***UTF-16 string literal***, whose type is `string`.

A *String_Literal* that contains a *Utf8_Suffix* is a ***UTF-8 string literal***, whose type is `System.ReadOnlySpan<byte>` (an indexable collection type), and whose value contains a UTF-8 byte representation of the string. A null terminator (a byte with value zero) is placed beyond the last byte in memory (and outside the length of the `ReadOnlySpan<byte>`) in order to support scenarios that expect null-terminated byte strings. A UTF-8 string literal is not a constant. A UTF-8 string literal without its *Utf8_Suffix* shall be valid UTF-16. (For example, `"\uDC00\uDD00"u8` is ill-formed as one low surrogate cannot be followed by another.)
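
To make the "UTF-8 byte representation" concrete, here is a small sketch, assuming a C# 11 compiler and .NET 5 or later (for `Convert.ToHexString`). It shows that the span length counts encoded bytes rather than characters, and that the null terminator described above is not part of the span. The class name is hypothetical.

```csharp
using System;

public static class Utf8LiteralDemo
{
    public static void Main()
    {
        // Two ASCII characters encode as two bytes; the trailing null
        // terminator lies outside the span and is not counted.
        ReadOnlySpan<byte> ascii = "AB"u8;
        Console.WriteLine(ascii.Length);              // 2

        // U+20AC (EURO SIGN) is one UTF-16 code unit but three UTF-8 bytes.
        ReadOnlySpan<byte> euro = "\u20AC"u8;
        Console.WriteLine(euro.Length);               // 3
        Console.WriteLine(Convert.ToHexString(euro)); // E282AC
    }
}
```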

Should add a new subclause within 16.5.15 Safe context constraint, saying that when a UTF-8 string literal has a ReadOnlySpan<byte> value, the safe-context of that value is caller-context. That would allow it to be returned out of the function, like so:

using System;
public class C {
    public ReadOnlySpan<byte> M() {
        return "xyz"u8;
    }
}

Concatenation of UTF-8 string literals would then follow the safe-context rule in 16.5.15.5 Operators. As the safe-context of both operands would be caller-context, the safe-context of the result would be the same, without having to be specified separately for this case.


The following would then likewise be valid:

public class C {
    public ref readonly byte M() {
        return ref "xyz"u8[0];
    }
}
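
The concatenation case discussed in this thread can be sketched the same way. This assumes a compiler that gives both the literals and their concatenation a caller-context safe-context, which is the behaviour the proposed rule would specify; it is not something the current draft text states. The class and method names are hypothetical.

```csharp
using System;

public static class Utf8EscapeDemo
{
    // Both operands are UTF-8 string literals, so under the proposed rule
    // the concatenated result could be returned to the caller.
    public static ReadOnlySpan<byte> Header() => "Content-"u8 + "Type"u8;

    public static void Main()
    {
        ReadOnlySpan<byte> header = Header();
        Console.WriteLine(header.Length);                          // 12
        Console.WriteLine(header.SequenceEqual("Content-Type"u8)); // True
    }
}
```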
