Commit 1ebfc40

Clarify lexing is greedy with lookahead restrictions.
GraphQL syntactical grammars intend to be unambiguous. While lexical grammars should also be, there has historically been an assumption that lexical parsing is greedy. This is obvious for numbers and words, but less obvious for empty block strings.

This also removes regular expression representation from the lexical grammar notation, since it wasn't always clear.

Either way, the additional clarity removes ambiguity from the spec.

Partial fix for #564. Specifically addresses #564 (comment)
1 parent 6de9b65 commit 1ebfc40

3 files changed

+192
-67
lines changed

spec/Appendix A -- Notation Conventions.md

Lines changed: 30 additions & 16 deletions
Original file line number | Diff line number | Diff line change
@@ -22,8 +22,10 @@ of the sequences it is defined by, until all non-terminal symbols have been
2222
replaced by terminal characters.
2323

2424
Terminals are represented in this document in a monospace font in two forms: a
25-
specific Unicode character or sequence of Unicode characters (ex. {`=`} or {`terminal`}), and a pattern of Unicode characters defined by a regular expression
26-
(ex {/[0-9]+/}).
25+
specific Unicode character or sequence of Unicode characters (ie. {`=`} or
26+
{`terminal`}), and prose typically describing a specific Unicode code-point
27+
{"Space (U+0020)"}. Sequences of Unicode characters only appear in syntactic
28+
grammars and represent a {Name} token of that specific sequence.
2729

2830
Non-terminal production rules are represented in this document using the
2931
following notation for a non-terminal with a single definition:
@@ -48,23 +50,25 @@ ListOfLetterA :
4850

4951
The GraphQL language is defined in a syntactic grammar where terminal symbols
5052
are tokens. Tokens are defined in a lexical grammar which matches patterns of
51-
source characters. The result of parsing a sequence of source Unicode characters
52-
produces a GraphQL AST.
53+
source characters. The result of parsing a source text sequence of Unicode
54+
characters first produces a sequence of lexical tokens according to the lexical
55+
grammar which then produces abstract syntax tree (AST) according to the
56+
syntactical grammar.
5357

54-
A Lexical grammar production describes non-terminal "tokens" by
58+
A lexical grammar production describes non-terminal "tokens" by
5559
patterns of terminal Unicode characters. No "whitespace" or other ignored
5660
characters may appear between any terminal Unicode characters in the lexical
5761
grammar production. A lexical grammar production is distinguished by a two colon
5862
`::` definition.
5963

60-
Word :: /[A-Za-z]+/
64+
Word :: Letter+
6165

6266
A Syntactical grammar production describes non-terminal "rules" by patterns of
63-
terminal Tokens. Whitespace and other ignored characters may appear before or
64-
after any terminal Token. A syntactical grammar production is distinguished by a
65-
one colon `:` definition.
67+
terminal Tokens. {WhiteSpace} and other {Ignored} sequences may appear before or
68+
after any terminal {Token}. A syntactical grammar production is distinguished by
69+
a one colon `:` definition.
6670

67-
Sentence : Noun Verb
71+
Sentence : Word+ `.`
6872

6973

7074
## Grammar Notation
@@ -80,13 +84,11 @@ and their expanded definitions in the context-free grammar.
8084
A grammar production may specify that certain expansions are not permitted by
8185
using the phrase "but not" and then indicating the expansions to be excluded.
8286

83-
For example, the production:
87+
For example, the following production means that the nonterminal {SafeWord} may
88+
be replaced by any sequence of characters that could replace {Word} provided
89+
that the same sequence of characters could not replace {SevenCarlinWords}.
8490

85-
SafeName : Name but not SevenCarlinWords
86-
87-
means that the nonterminal {SafeName} may be replaced by any sequence of
88-
characters that could replace {Name} provided that the same sequence of
89-
characters could not replace {SevenCarlinWords}.
91+
SafeWord : Word but not SevenCarlinWords
9092

9193
A grammar may also list a number of restrictions after "but not" separated
9294
by "or".
@@ -96,6 +98,18 @@ For example:
9698
NonBooleanName : Name but not `true` or `false`
9799

98100

101+
**Lookahead Restrictions**
102+
103+
A grammar production may specify that certain characters or tokens are not
104+
permitted to follow it by using the pattern {[lookahead != NotAllowed]}.
105+
Lookahead restrictions are often used to remove ambiguity from the grammar.
106+
107+
The following example makes it clear that {Letter+} must be greedy, since {Word}
108+
cannot be followed by yet another {Letter}.
109+
110+
Word :: Letter+ [lookahead != Letter]
111+
112+
99113
**Optionality and Lists**
100114

101115
A subscript suffix "{Symbol?}" is shorthand for two possible sequences, one
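
The lookahead restriction on {Word} introduced in the Appendix A changes above can be illustrated with a small sketch. The Python below is an illustration under stated assumptions, not part of the spec or of any GraphQL implementation: `is_letter`, `word_matches`, and `lex_word` are hypothetical names, and {Letter} is assumed to mean ASCII `A`-`Z` and `a`-`z`, as the Letter production in this commit's Appendix B changes lists them. It tests every candidate end position against `Word :: Letter+ [lookahead != Letter]` and shows that only the longest run of letters satisfies the production, which is why a greedy scan is the only valid reading.

```python
import string

# Assumption: Letter is ASCII A-Z / a-z.
LETTERS = set(string.ascii_letters)


def is_letter(ch: str) -> bool:
    return ch in LETTERS


def word_matches(source: str, start: int, end: int) -> bool:
    """True if source[start:end] satisfies Word :: Letter+ [lookahead != Letter]."""
    body = source[start:end]
    if not body or not all(is_letter(ch) for ch in body):
        return False  # Letter+ needs at least one Letter and nothing else
    # Lookahead restriction: the next character, if any, must not be a Letter.
    return end >= len(source) or not is_letter(source[end])


def lex_word(source: str, start: int) -> int:
    """Greedy scan: return the end index of the Word starting at `start`."""
    end = start
    while end < len(source) and is_letter(source[end]):
        end += 1
    if end == start:
        raise ValueError(f"no Word at position {start}")
    return end


source = "hello world"
# Every end position for which a Word starting at index 0 is a valid match.
valid_ends = [end for end in range(1, len(source) + 1)
              if word_matches(source, 0, end)]
```

Without the lookahead restriction, `h`, `he`, `hel`, and `hell` would all be acceptable matches for {Word} at index 0. With it, `valid_ends` holds a single position, and that position is the one `lex_word` returns.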

spec/Appendix B -- Grammar Summary.md

Lines changed: 36 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,12 @@
11
# B. Appendix: Grammar Summary
22

3-
SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
3+
## Source Text
4+
5+
SourceCharacter ::
6+
- "U+0009"
7+
- "U+000A"
8+
- "U+000D"
9+
- "U+0020–U+FFFF"
410

511

612
## Ignored Tokens
@@ -20,10 +26,10 @@ WhiteSpace ::
2026

2127
LineTerminator ::
2228
- "New Line (U+000A)"
23-
- "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
29+
- "Carriage Return (U+000D)" [lookahead != "New Line (U+000A)"]
2430
- "Carriage Return (U+000D)" "New Line (U+000A)"
2531

26-
Comment :: `#` CommentChar*
32+
Comment :: `#` CommentChar* [lookahead != CommentChar]
2733

2834
CommentChar :: SourceCharacter but not LineTerminator
2935

@@ -41,24 +47,41 @@ Token ::
4147

4248
Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | }
4349

44-
Name :: /[_A-Za-z][_0-9A-Za-z]*/
50+
Name ::
51+
- NameStart NameContinue* [lookahead != NameContinue]
52+
53+
NameStart ::
54+
- Letter
55+
- `_`
56+
57+
NameContinue ::
58+
- Letter
59+
- Digit
60+
- `_`
4561

46-
IntValue :: IntegerPart
62+
Letter :: one of
63+
`A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M`
64+
`N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`
65+
`a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m`
66+
`n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`
67+
68+
Digit :: one of
69+
`0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
70+
71+
IntValue :: IntegerPart [lookahead != {Digit, `.`, ExponentPart}]
4772

4873
IntegerPart ::
4974
- NegativeSign? 0
5075
- NegativeSign? NonZeroDigit Digit*
5176

5277
NegativeSign :: -
5378

54-
Digit :: one of 0 1 2 3 4 5 6 7 8 9
55-
5679
NonZeroDigit :: Digit but not `0`
5780

5881
FloatValue ::
59-
- IntegerPart FractionalPart
60-
- IntegerPart ExponentPart
61-
- IntegerPart FractionalPart ExponentPart
82+
- IntegerPart FractionalPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}]
83+
- IntegerPart FractionalPart [lookahead != {Digit, `.`, ExponentIndicator}]
84+
- IntegerPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}]
6285

6386
FractionalPart :: . Digit+
6487

@@ -69,7 +92,8 @@ ExponentIndicator :: one of `e` `E`
6992
Sign :: one of + -
7093

7194
StringValue ::
72-
- `"` StringCharacter* `"`
95+
- `""` [lookahead != `"`]
96+
- `"` StringCharacter+ `"`
7397
- `"""` BlockStringCharacter* `"""`
7498

7599
StringCharacter ::
@@ -89,7 +113,7 @@ Note: Block string values are interpreted to exclude blank initial and trailing
89113
lines and uniform indentation with {BlockStringValue()}.
90114

91115

92-
## Document
116+
## Document Syntax
93117

94118
Document : Definition+
95119
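
The empty-string case that the commit message calls less obvious can be sketched in the same spirit. The Python below is a hedged illustration, not the reference lexer: `string_opening` is a hypothetical helper that only classifies how a {StringValue} begins, following the three alternatives in the changed production. It does not validate the string body, escape sequences, or the closing quotes.

```python
def string_opening(source: str, pos: int) -> str:
    """Classify the quote run at `pos` using the StringValue alternatives."""
    if not source.startswith('"', pos):
        raise ValueError(f"no StringValue at position {pos}")
    if source.startswith('"""', pos):
        # Three quotes open a block string. The empty-string alternative is
        # `""` [lookahead != `"`], so it cannot match when a third quote follows.
        return "block"
    if source.startswith('""', pos):
        # Two quotes that are not followed by a third: the empty string.
        return "empty"
    # One quote followed by StringCharacter+ (the body is not checked here).
    return "string"


samples = ['""', '""""""', '"hi"', '"" "']
openings = [string_opening(s, 0) for s in samples]
```

The second sample, six consecutive quotes, is the ambiguous input: without the lookahead restriction it could be read as three empty strings, whereas under the changed production it can only begin as a block string.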
