The general format is a series of productions separated by blank lines. The expressions are as follows:
@@ -110,6 +114,7 @@ The general format is a series of productions separated by blank lines. The expr
| Prose |\<any ASCII character except CR\>| An English description of what should be matched, surrounded in angle brackets. |
| Group | (\`,\` Parameter)+ | Groups an expression for the purpose of precedence, such as applying a repetition operator to a sequence of other expressions. |
| NegativeExpression |~\[\`\` LF\]| Matches anything except the given Charset, Terminal, or Nonterminal. |
+| Cut | Expr1 ^ Expr2 \| Expr3 | The hard cut operator. Once the expressions preceding `^` in the sequence match, the rest of the sequence must match or parsing fails unconditionally --- no enclosing expression can backtrack past the cut point. |
| Sequence |\`fn\` Name Parameters | A sequence of expressions that must match in order. |
| Alternation | Expr1 \| Expr2 | Matches only one of the given expressions, separated by the vertical pipe character. |
| Suffix |\_except \[LazyBooleanExpression\]\_| Adds a suffix to the previous expression to provide an additional English description, rendered in subscript. This can contain limited Markdown, but try to avoid anything except basics like links. |
src/notation.md: 16 additions & 0 deletions
@@ -24,13 +24,23 @@ The following notations are used by the *Lexer* and *Syntax* grammar snippets:
|~\[]|~\[`b``B`]| Any characters, except those listed |
|~`string`|~`\n`, ~`*/`| Any characters, except this sequence |
| ( ) | (`,`_Parameter_)<sup>?</sup> | Groups items |
+| ^ |`b'` ^ ASCII_FOR_CHAR | The rest of the sequence must match or parsing fails unconditionally ([hard cut operator]) |
| U+xxxx | U+0060 | A single unicode character |
|\<text\>|\<any ASCII char except CR\>| An English description of what should be matched |
| Rule <sub>suffix</sub> | IDENTIFIER_OR_KEYWORD <sub>_except `crate`_</sub> | A modification to the previous rule |
| // Comment. | // Single line comment. | A comment extending to the end of the line. |

Sequences have a higher precedence than `|` alternation.

+r[notation.grammar.cut]
+### The hard cut operator
+
+The grammar uses ordered alternation: the parser tries alternatives left to right and takes the first that matches. If an alternative fails partway through a sequence, the parser normally backtracks and tries the next alternative. The cut operator (`^`) prevents this. Once every expression to the left of `^` in a sequence has matched, the rest of the sequence must match or parsing fails unconditionally.
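
As a rough illustration of the paragraph above (not the reference's own machinery), the following Rust sketch models ordered alternation with and without a cut. The names `Outcome`, `seq_plain`, `seq_with_cut`, and `choice` are hypothetical and exist only for this example; each "rule" is just a literal prefix followed by a literal remainder.

```rust
/// Result of trying one alternative at the start of the input.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The alternative matched and consumed this many bytes.
    Matched(usize),
    /// The alternative failed before any cut; the next one may be tried.
    Backtrack,
    /// The alternative failed after a cut; parsing fails unconditionally.
    Fatal,
}

/// Hypothetical rule `prefix rest` (no cut): any failure allows backtracking.
fn seq_plain(input: &str, prefix: &str, rest: &str) -> Outcome {
    match input.strip_prefix(prefix) {
        Some(tail) if tail.starts_with(rest) => Outcome::Matched(prefix.len() + rest.len()),
        _ => Outcome::Backtrack,
    }
}

/// Hypothetical rule `prefix ^ rest`: once `prefix` matches, `rest` must match.
fn seq_with_cut(input: &str, prefix: &str, rest: &str) -> Outcome {
    match input.strip_prefix(prefix) {
        // The cut has not been reached yet, so backtracking is still allowed.
        None => Outcome::Backtrack,
        Some(tail) if tail.starts_with(rest) => Outcome::Matched(prefix.len() + rest.len()),
        // The cut was passed and the remainder failed: no recovery.
        Some(_) => Outcome::Fatal,
    }
}

/// Ordered alternation: try alternatives left to right, take the first match,
/// and stop immediately if one of them fails past a cut.
fn choice(input: &str, alternatives: &[&dyn Fn(&str) -> Outcome]) -> Outcome {
    for alternative in alternatives {
        match alternative(input) {
            Outcome::Backtrack => continue,
            decided => return decided,
        }
    }
    Outcome::Backtrack
}
```

With alternatives `"ab" "c"` and `"a" "b"`, the input `abx` falls through to the second alternative when the first has no cut, but is rejected outright when the first is written as `"ab" ^ "c"`.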
+
+Mizushima et al. introduced [cut operators][cut operator paper] to parsing expression grammars. In the PEG literature, a *soft cut* prevents backtracking only within the immediately enclosing ordered choice --- outer choices can still recover. A *hard cut* prevents all backtracking past the cut point; failure is definitive. The `^` used in this grammar is a hard cut.
+
+The hard cut operator is necessary because some tokens in Rust begin with a prefix that is itself a valid token. For example, `c"` begins a C string literal, but `c` alone is a valid identifier. Without the cut, if `c"\0"` failed to lex as a C string literal (because null bytes are not allowed in C strings), the parser could backtrack and lex it as two tokens: the identifier `c` and the string literal `"\0"`. The [cut after `c"`] prevents this --- once the opening delimiter is recognized, the parser cannot go back. The same reasoning applies to [byte literals], [byte string literals], [raw string literals], and other literals with prefixes that are themselves valid tokens.
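
The `c"\0"` scenario can be sketched with a toy lexer. This is a heavily simplified illustration, not rustc's lexer or the reference grammar: `Token`, `LexError`, `quoted`, `c_string`, `ident`, `string`, and `lex` are all hypothetical names, escapes are not decoded, the only null check is for the literal `\0` escape, and the `use_cut` flag exists purely to compare the two behaviors.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    CString(String),
    Ident(String),
    Str(String),
}

#[derive(Debug, PartialEq)]
enum LexError {
    /// This alternative did not match; another alternative may be tried.
    NoMatch,
    /// Failure after a cut; no other alternative may be tried.
    Fatal(&'static str),
}

/// Reads a double-quoted body. Returns the raw text between the quotes and the
/// number of bytes consumed. A backslash simply skips the next character.
fn quoted(input: &str) -> Option<(&str, usize)> {
    let body = input.strip_prefix('"')?;
    let mut chars = body.char_indices();
    while let Some((i, ch)) = chars.next() {
        match ch {
            '\\' => { chars.next(); }
            '"' => return Some((&body[..i], i + 2)),
            _ => {}
        }
    }
    None
}

/// Toy C string rule: `c"` ^ body `"`, where the body must not contain `\0`.
fn c_string(input: &str, use_cut: bool) -> Result<(Token, usize), LexError> {
    // Before the cut: the opening delimiter `c"`.
    if !input.starts_with("c\"") {
        return Err(LexError::NoMatch);
    }
    // After the cut: with `use_cut`, every failure below is definitive.
    let fail = |msg: &'static str| if use_cut { LexError::Fatal(msg) } else { LexError::NoMatch };
    let (raw, len) = quoted(&input[1..]).ok_or_else(|| fail("unterminated C string"))?;
    if raw.contains("\\0") {
        return Err(fail("null byte in C string"));
    }
    Ok((Token::CString(raw.to_string()), 1 + len))
}

/// Toy identifier rule: ASCII letters, digits, and `_`, not starting with a digit.
fn ident(input: &str) -> Result<(Token, usize), LexError> {
    let len = input
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .unwrap_or(input.len());
    if len == 0 || input.as_bytes()[0].is_ascii_digit() {
        return Err(LexError::NoMatch);
    }
    Ok((Token::Ident(input[..len].to_string()), len))
}

/// Toy string literal rule.
fn string(input: &str) -> Result<(Token, usize), LexError> {
    let (raw, len) = quoted(input).ok_or(LexError::NoMatch)?;
    Ok((Token::Str(raw.to_string()), len))
}

/// Ordered alternation over the three rules: C string, identifier, string.
fn lex(mut input: &str, use_cut: bool) -> Result<Vec<Token>, LexError> {
    let mut tokens = Vec::new();
    while !input.is_empty() {
        let (token, len) = match c_string(input, use_cut) {
            Err(LexError::NoMatch) => ident(input).or_else(|_| string(input))?,
            decided => decided?,
        };
        tokens.push(token);
        input = &input[len..];
    }
    Ok(tokens)
}
```

Under these assumptions, `lex(r#"c"\0""#, false)` backtracks and yields the identifier `c` followed by a string literal, while `lex(r#"c"\0""#, true)` stops with a fatal error, which is the behavior the cut is meant to guarantee.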
Below each grammar block is a button to toggle the display of a [syntax diagram]. A square element is a non-terminal rule, and a rounded rectangle is a terminal.