Merge pull request #2104 from rust-lang/TC/add-cut-to-grammar

ehuss · web-flow · commit 575dffa58df9 · 2026-02-13T14:42:33.000Z
Add cut operator (`^`) to grammar
diff --git a/dev-guide/src/grammar.md b/dev-guide/src/grammar.md
@@ -35,7 +35,9 @@ Name -> <Alphanumeric or `_`>+
 
 Expression -> Sequence (` `* `|` ` `* Sequence)*
 
-Sequence -> (` `* AdornedExpr)+
+Sequence ->
+        (` `* AdornedExpr)* ` `* Cut
+      | (` `* AdornedExpr)+
 
 AdornedExpr -> ExprRepeat Suffix? Footnote?
 
@@ -92,6 +94,8 @@ Prose -> `<` ~[`>` LF]+ `>`
 Group -> `(` ` `* Expression ` `* `)`
 
 NegativeExpression -> `~` ( Charset | Terminal | NonTerminal )
+
+Cut -> `^` Sequence
 ```
 
 The general format is a series of productions separated by blank lines. The expressions are as follows:
@@ -110,6 +114,7 @@ The general format is a series of productions separated by blank lines. The expr
 | Prose | \<any ASCII character except CR\> | An English description of what should be matched, surrounded in angle brackets. |
 | Group | (\`,\` Parameter)+ | Groups an expression for the purpose of precedence, such as applying a repetition operator to a sequence of other expressions. |
 | NegativeExpression | ~\[\` \` LF\] | Matches anything except the given Charset, Terminal, or Nonterminal. |
+| Cut | Expr1 ^ Expr2 \| Expr3 | The hard cut operator. Once the expressions preceding `^` in the sequence match, the rest of the sequence must match or parsing fails unconditionally --- no enclosing expression can backtrack past the cut point. |
 | Sequence | \`fn\` Name Parameters | A sequence of expressions that must match in order. |
 | Alternation | Expr1 \| Expr2 | Matches only one of the given expressions, separated by the vertical pipe character. |
 | Suffix | \_except \[LazyBooleanExpression\]\_  | Adds a suffix to the previous expression to provide an additional English description, rendered in subscript. This can contain limited Markdown, but try to avoid anything except basics like links. |
diff --git a/src/notation.md b/src/notation.md
@@ -24,13 +24,23 @@ The following notations are used by the *Lexer* and *Syntax* grammar snippets:
 | ~\[ ]              | ~\[`b` `B`]                    | Any characters, except those listed       |
 | ~`string`         | ~`\n`, ~`*/`                  | Any characters, except this sequence      |
 | ( )               | (`,` _Parameter_)<sup>?</sup> | Groups items                              |
+| ^                 | `b'` ^ ASCII_FOR_CHAR         | The rest of the sequence must match or parsing fails unconditionally ([hard cut operator]) |
 | U+xxxx            | U+0060                        | A single unicode character                |
 | \<text\>          | \<any ASCII char except CR\>  | An English description of what should be matched |
 | Rule <sub>suffix</sub> | IDENTIFIER_OR_KEYWORD <sub>_except `crate`_</sub> | A modification to the previous rule |
 | // Comment. | // Single line comment. | A comment extending to the end of the line. |
 
 Sequences have a higher precedence than `|` alternation.
 
+r[notation.grammar.cut]
+### The hard cut operator
+
+The grammar uses ordered alternation: the parser tries alternatives left to right and takes the first that matches. If an alternative fails partway through a sequence, the parser normally backtracks and tries the next alternative. The cut operator (`^`) prevents this. Once every expression to the left of `^` in a sequence has matched, the rest of the sequence must match or parsing fails unconditionally.
+
+Mizushima et al. introduced [cut operators][cut operator paper] to parsing expression grammars. In the PEG literature, a *soft cut* prevents backtracking only within the immediately enclosing ordered choice --- outer choices can still recover. A *hard cut* prevents all backtracking past the cut point; failure is definitive. The `^` used in this grammar is a hard cut.
+
+The hard cut operator is necessary because some tokens in Rust begin with a prefix that is itself a valid token. For example, `c"` begins a C string literal, but `c` alone is a valid identifier. Without the cut, if `c"\0"` failed to lex as a C string literal (because null bytes are not allowed in C strings), the parser could backtrack and lex it as two tokens: the identifier `c` and the string literal `"\0"`. The [cut after `c"`] prevents this --- once the opening delimiter is recognized, the parser cannot go back. The same reasoning applies to [byte literals], [byte string literals], [raw string literals], and other literals with prefixes that are themselves valid tokens.
+
 r[notation.grammar.string-tables]
 ### String table productions
 
@@ -52,7 +62,13 @@ r[notation.grammar.visualizations]
 Below each grammar block is a button to toggle the display of a [syntax diagram]. A square element is a non-terminal rule, and a rounded rectangle is a terminal.
 
 [binary operators]: expressions/operator-expr.md#arithmetic-and-logical-binary-operators
+[byte literals]: tokens.md#r-lex.token.byte.syntax
+[byte string literals]: tokens.md#r-lex.token.str-byte.syntax
+[cut after `c"`]: tokens.md#r-lex.token.str-c.syntax
+[cut operator paper]: https://kmizu.github.io/papers/paste513-mizushima.pdf
+[hard cut operator]: notation.md#the-hard-cut-operator
 [keywords]: keywords.md
+[raw string literals]: tokens.md#r-lex.token.literal.str-raw.syntax
 [syntax diagram]: https://en.wikipedia.org/wiki/Syntax_diagram
 [tokens]: tokens.md
 [unary operators]: expressions/operator-expr.md#borrow-operators
diff --git a/src/tokens.md b/src/tokens.md
@@ -217,7 +217,7 @@ r[lex.token.literal.str-raw.syntax]
 RAW_STRING_LITERAL -> `r` RAW_STRING_CONTENT SUFFIX?
 
 RAW_STRING_CONTENT ->
-      `"` ( ~CR )*? `"`
+      `"` ^ ( ~CR )*? `"`
     | `#` RAW_STRING_CONTENT `#`
 ```
 
@@ -251,7 +251,7 @@ r[lex.token.byte]
 r[lex.token.byte.syntax]
 ```grammar,lexer
 BYTE_LITERAL ->
-    `b'` ( ASCII_FOR_CHAR | BYTE_ESCAPE )  `'` SUFFIX?
+    `b'` ^ ( ASCII_FOR_CHAR | BYTE_ESCAPE )  `'` SUFFIX?
 
 ASCII_FOR_CHAR ->
     <any ASCII (i.e. 0x00 to 0x7F) except `'`, `\`, LF, CR, or TAB>
@@ -270,7 +270,7 @@ r[lex.token.str-byte]
 r[lex.token.str-byte.syntax]
 ```grammar,lexer
 BYTE_STRING_LITERAL ->
-    `b"` ( ASCII_FOR_STRING | BYTE_ESCAPE | STRING_CONTINUE )* `"` SUFFIX?
+    `b"` ^ ( ASCII_FOR_STRING | BYTE_ESCAPE | STRING_CONTINUE )* `"` SUFFIX?
 
 ASCII_FOR_STRING ->
     <any ASCII (i.e 0x00 to 0x7F) except `"`, `\`, or CR>
@@ -306,7 +306,7 @@ RAW_BYTE_STRING_LITERAL ->
     `br` RAW_BYTE_STRING_CONTENT SUFFIX?
 
 RAW_BYTE_STRING_CONTENT ->
-      `"` ASCII_FOR_RAW*? `"`
+      `"` ^ ASCII_FOR_RAW*? `"`
     | `#` RAW_BYTE_STRING_CONTENT `#`
 
 ASCII_FOR_RAW ->
@@ -343,13 +343,12 @@ r[lex.token.str-c]
 r[lex.token.str-c.syntax]
 ```grammar,lexer
 C_STRING_LITERAL ->
-    `c"` (
+    `c"` ^ (
         ~[`"` `\` CR NUL]
       | BYTE_ESCAPE _except `\0` or `\x00`_
       | UNICODE_ESCAPE _except `\u{0}`, `\u{00}`, …, `\u{000000}`_
       | STRING_CONTINUE
     )* `"` SUFFIX?
-
 ```
 
 r[lex.token.str-c.intro]
@@ -402,7 +401,7 @@ RAW_C_STRING_LITERAL ->
     `cr` RAW_C_STRING_CONTENT SUFFIX?
 
 RAW_C_STRING_CONTENT ->
-      `"` ( ~[CR NUL] )*? `"`
+      `"` ^ ( ~[CR NUL] )*? `"`
     | `#` RAW_C_STRING_CONTENT `#`
 ```
 
diff --git a/tools/grammar/src/lib.rs b/tools/grammar/src/lib.rs
@@ -76,6 +76,8 @@ pub enum ExpressionKind {
     Charset(Vec<Characters>),
     /// ``~[` ` LF]``
     NegExpression(Box<Expression>),
+    /// `^ A B C`
+    Cut(Box<Expression>),
     /// `U+0060`
     Unicode(String),
 }
@@ -116,7 +118,8 @@ impl Expression {
             | ExpressionKind::RepeatPlus(e)
             | ExpressionKind::RepeatPlusNonGreedy(e)
             | ExpressionKind::RepeatRange(e, _, _)
-            | ExpressionKind::NegExpression(e) => {
+            | ExpressionKind::NegExpression(e)
+            | ExpressionKind::Cut(e) => {
                 e.visit_nt(callback);
             }
             ExpressionKind::Alt(es) | ExpressionKind::Sequence(es) => {
diff --git a/tools/grammar/src/parser.rs b/tools/grammar/src/parser.rs
@@ -173,18 +173,19 @@ impl Parser<'_> {
         match es.len() {
             0 => Ok(None),
             1 => Ok(Some(es.pop().unwrap())),
-            _ => Ok(Some(Expression {
-                kind: ExpressionKind::Alt(es),
-                suffix: None,
-                footnote: None,
-            })),
+            _ => Ok(Some(Expression::new_kind(ExpressionKind::Alt(es)))),
         }
     }
 
     fn parse_seq(&mut self) -> Result<Option<Expression>> {
         let mut es = Vec::new();
         loop {
             self.space0();
+            if self.peek() == Some(b'^') {
+                let cut = self.parse_cut()?;
+                es.push(cut);
+                break;
+            }
             let Some(e) = self.parse_expr1()? else {
                 break;
             };
@@ -201,6 +202,19 @@ impl Parser<'_> {
         }
     }
 
+    /// Parse cut (`^`) operator.
+    fn parse_cut(&mut self) -> Result<Expression> {
+        self.expect("^", "expected `^`")?;
+        let Some(rhs) = self.parse_seq()? else {
+            bail!(self, "expected expression after cut operator");
+        };
+        Ok(Expression {
+            kind: ExpressionKind::Cut(Box::new(rhs)),
+            suffix: None,
+            footnote: None,
+        })
+    }
+
     fn parse_expr1(&mut self) -> Result<Option<Expression>> {
         let Some(next) = self.peek() else {
             return Ok(None);
@@ -506,13 +520,71 @@ fn translate_position(input: &str, index: usize) -> (&str, usize, usize) {
     ("", line_number + 1, 0)
 }
 
-#[test]
-fn translate_tests() {
-    assert_eq!(translate_position("", 0), ("", 0, 0));
-    assert_eq!(translate_position("test", 0), ("test", 1, 1));
-    assert_eq!(translate_position("test", 3), ("test", 1, 4));
-    assert_eq!(translate_position("test", 4), ("test", 1, 5));
-    assert_eq!(translate_position("test\ntest2", 4), ("test", 1, 5));
-    assert_eq!(translate_position("test\ntest2", 5), ("test2", 2, 1));
-    assert_eq!(translate_position("test\ntest2\n", 11), ("", 3, 0));
+#[cfg(test)]
+mod tests {
+    use crate::parser::{parse_grammar, translate_position};
+    use crate::{ExpressionKind, Grammar};
+    use std::path::Path;
+
+    #[test]
+    fn test_translate() {
+        assert_eq!(translate_position("", 0), ("", 0, 0));
+        assert_eq!(translate_position("test", 0), ("test", 1, 1));
+        assert_eq!(translate_position("test", 3), ("test", 1, 4));
+        assert_eq!(translate_position("test", 4), ("test", 1, 5));
+        assert_eq!(translate_position("test\ntest2", 4), ("test", 1, 5));
+        assert_eq!(translate_position("test\ntest2", 5), ("test2", 2, 1));
+        assert_eq!(translate_position("test\ntest2\n", 11), ("", 3, 0));
+    }
+
+    fn parse(input: &str) -> Result<Grammar, String> {
+        let mut grammar = Grammar::default();
+        parse_grammar(input, &mut grammar, "test", Path::new("test.md"))
+            .map_err(|e| e.to_string())?;
+        Ok(grammar)
+    }
+
+    #[test]
+    fn test_cut() {
+        let input = "Rule -> A ^ B | C";
+        let grammar = parse(input).unwrap();
+        grammar.productions.get("Rule").unwrap();
+    }
+
+    #[test]
+    fn test_cut_captures() {
+        let input = "Rule -> A ^ B C | D";
+        let grammar = parse(input).unwrap();
+        let rule = grammar.productions.get("Rule").unwrap();
+        // The top-level expression is an alternation: (A ^ B C) | D.
+        let ExpressionKind::Alt(alts) = &rule.expression.kind else {
+            panic!("expected Alt, got {:?}", rule.expression.kind);
+        };
+        assert_eq!(alts.len(), 2);
+        // First alternative is a sequence: A, Cut(Sequence(B, C)).
+        let ExpressionKind::Sequence(seq) = &alts[0].kind else {
+            panic!("expected Sequence, got {:?}", alts[0].kind);
+        };
+        assert_eq!(seq.len(), 2);
+        assert!(matches!(&seq[0].kind, ExpressionKind::Nt(n) if n == "A"));
+        // The cut captures the rest of the sequence (B and C).
+        let ExpressionKind::Cut(cut_inner) = &seq[1].kind else {
+            panic!("expected Cut, got {:?}", seq[1].kind);
+        };
+        let ExpressionKind::Sequence(cut_seq) = &cut_inner.kind else {
+            panic!("expected Sequence inside Cut, got {:?}", cut_inner.kind);
+        };
+        assert_eq!(cut_seq.len(), 2);
+        assert!(matches!(&cut_seq[0].kind, ExpressionKind::Nt(n) if n == "B"));
+        assert!(matches!(&cut_seq[1].kind, ExpressionKind::Nt(n) if n == "C"));
+        // Second alternative is just D.
+        assert!(matches!(&alts[1].kind, ExpressionKind::Nt(n) if n == "D"));
+    }
+
+    #[test]
+    fn test_cut_fail_trailing() {
+        let input = "Rule -> A ^";
+        let err = parse(input).unwrap_err();
+        assert!(err.contains("expected expression after cut operator"));
+    }
 }
diff --git a/tools/mdbook-spec/src/grammar/render_markdown.rs b/tools/mdbook-spec/src/grammar/render_markdown.rs
@@ -79,6 +79,7 @@ fn last_expr(expr: &Expression) -> &ExpressionKind {
         | ExpressionKind::Comment(_)
         | ExpressionKind::Charset(_)
         | ExpressionKind::NegExpression(_)
+        | ExpressionKind::Cut(_)
         | ExpressionKind::Unicode(_) => &expr.kind,
     }
 }
@@ -171,6 +172,10 @@ fn render_expression(expr: &Expression, cx: &RenderCtx, output: &mut String) {
             output.push('~');
             render_expression(e, cx, output);
         }
+        ExpressionKind::Cut(e) => {
+            output.push_str("^ ");
+            render_expression(e, cx, output);
+        }
         ExpressionKind::Unicode(s) => {
             output.push_str("U+");
             output.push_str(s);
diff --git a/tools/mdbook-spec/src/grammar/render_railroad.rs b/tools/mdbook-spec/src/grammar/render_railroad.rs
@@ -214,6 +214,11 @@ fn render_expression(expr: &Expression, cx: &RenderCtx, stack: bool) -> Option<B
                     let ch = node_for_nt(cx, "CHAR");
                     Box::new(Except::new(Box::new(ch), n))
                 }
+                ExpressionKind::Cut(e) => {
+                    let rhs = render_expression(e, cx, stack)?;
+                    let lbox = LabeledBox::new(rhs, Comment::new("no backtracking".to_string()));
+                    Box::new(lbox)
+                }
                 ExpressionKind::Unicode(s) => Box::new(Terminal::new(format!("U+{}", s))),
             };
         }
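
The backtracking behavior described in the new `notation.grammar.cut` section can be illustrated with a small, self-contained sketch. This is not part of the PR and not how the reference tooling works (the tools here only parse and render the grammar notation). `Outcome`, `c_string`, `identifier`, `first_token`, and the `use_cut` flag are hypothetical names introduced for illustration, and the C string rule is heavily simplified (any escape other than `\0` is accepted, and suffixes are ignored).

```rust
/// Result of trying one alternative at the start of the input.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    /// Matched; holds the number of bytes consumed.
    Match(usize),
    /// Did not match; the next alternative may still be tried.
    Fail,
    /// Failed to the right of a cut; no other alternative may be tried.
    CutFail,
}

/// Simplified form of: C_STRING_LITERAL -> `c"` ^ ( ... )* `"`
fn c_string(input: &str, use_cut: bool) -> Outcome {
    let Some(rest) = input.strip_prefix("c\"") else {
        // The prefix is to the left of the cut, so this is an ordinary failure.
        return Outcome::Fail;
    };
    // Everything from here on is to the right of the cut.
    let failure = if use_cut { Outcome::CutFail } else { Outcome::Fail };
    let mut chars = rest.char_indices();
    while let Some((i, ch)) = chars.next() {
        match ch {
            // Closing quote: 2 bytes of prefix, i bytes of content, 1 byte of quote.
            '"' => return Outcome::Match(2 + i + 1),
            '\0' => return failure,
            '\\' => match chars.next() {
                // `\0` (or a dangling backslash) is rejected.
                Some((_, '0')) | None => return failure,
                Some(_) => {}
            },
            _ => {}
        }
    }
    failure // Unterminated literal.
}

/// Simplified identifier: a run of ASCII alphanumerics or `_`.
fn identifier(input: &str) -> Outcome {
    let len = input
        .chars()
        .take_while(|c| c.is_ascii_alphanumeric() || *c == '_')
        .count();
    if len == 0 { Outcome::Fail } else { Outcome::Match(len) }
}

/// Ordered alternation: C string literal first, then identifier.
fn first_token(input: &str, use_cut: bool) -> Result<(&'static str, &str), String> {
    match c_string(input, use_cut) {
        Outcome::Match(n) => return Ok(("C_STRING_LITERAL", &input[..n])),
        Outcome::CutFail => return Err(format!("invalid C string literal: {input}")),
        Outcome::Fail => {} // Backtrack and try the next alternative.
    }
    match identifier(input) {
        Outcome::Match(n) => Ok(("IDENTIFIER", &input[..n])),
        _ => Err(format!("no token matches: {input}")),
    }
}

fn main() {
    let bad = "c\"\\0\""; // The source text c"\0"
    println!("{:?}", first_token(bad, false)); // Backtracks to the identifier `c`.
    println!("{:?}", first_token(bad, true)); // Fails outright.
    println!("{:?}", first_token("c\"hi\"", true)); // Lexes as a C string literal.
}
```

Without the cut, the failed C string alternative falls through and `c` is accepted as an identifier, which is the mis-lex the new section warns about. With the cut, the failure is final once `c"` has matched.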
