Commit 27c2602

Clarify lexing is greedy with lookahead restrictions.
GraphQL syntactical grammars intend to be unambiguous. While lexical grammars should also be - there has historically been an assumption that lexical parsing is greedy. This is obvious for numbers and words, but less obvious for empty block strings.

This also removes regular expression representation from the lexical grammar notation, since it wasn't always clear.

Either way, the additional clarity removes ambiguity from the spec.

Partial fix for #564

Specifically addresses #564 (comment)
1 parent 439cacf commit 27c2602

3 files changed: 149 additions & 55 deletions

spec/Appendix A -- Notation Conventions.md

Lines changed: 30 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -22,8 +22,10 @@ of the sequences it is defined by, until all non-terminal symbols have been
2222
replaced by terminal characters.
2323

2424
Terminals are represented in this document in a monospace font in two forms: a
25-
specific Unicode character or sequence of Unicode characters (ex. {`=`} or {`terminal`}), and a pattern of Unicode characters defined by a regular expression
26-
(ex {/[0-9]+/}).
25+
specific Unicode character or sequence of Unicode characters (ie. {`=`} or
26+
{`terminal`}), and prose typically describing a specific Unicode code-point
27+
{"Space (U+0020)"}. Sequences of Unicode characters only appear in syntactic
28+
grammars and represent a {Name} token of that specific sequence.
2729

2830
Non-terminal production rules are represented in this document using the
2931
following notation for a non-terminal with a single definition:
@@ -48,23 +50,25 @@ ListOfLetterA :
4850

4951
The GraphQL language is defined in a syntactic grammar where terminal symbols
5052
are tokens. Tokens are defined in a lexical grammar which matches patterns of
51-
source characters. The result of parsing a sequence of source Unicode characters
52-
produces a GraphQL AST.
53+
source characters. The result of parsing a source text sequence of Unicode
54+
characters first produces a sequence of lexical tokens according to the lexical
55+
grammar which then produces abstract syntax tree (AST) according to the
56+
syntactical grammar.
5357

54-
A Lexical grammar production describes non-terminal "tokens" by
58+
A lexical grammar production describes non-terminal "tokens" by
5559
patterns of terminal Unicode characters. No "whitespace" or other ignored
5660
characters may appear between any terminal Unicode characters in the lexical
5761
grammar production. A lexical grammar production is distinguished by a two colon
5862
`::` definition.
5963

60-
Word :: /[A-Za-z]+/
64+
Word :: Letter+
6165

6266
A Syntactical grammar production describes non-terminal "rules" by patterns of
63-
terminal Tokens. Whitespace and other ignored characters may appear before or
64-
after any terminal Token. A syntactical grammar production is distinguished by a
65-
one colon `:` definition.
67+
terminal Tokens. {WhiteSpace} and other {Ignored} sequences may appear before or
68+
after any terminal {Token}. A syntactical grammar production is distinguished by
69+
a one colon `:` definition.
6670

67-
Sentence : Noun Verb
71+
Sentence : Word+ `.`
6872

6973

7074
## Grammar Notation
@@ -80,13 +84,11 @@ and their expanded definitions in the context-free grammar.
8084
A grammar production may specify that certain expansions are not permitted by
8185
using the phrase "but not" and then indicating the expansions to be excluded.
8286

83-
For example, the production:
87+
For example, the following production means that the nonterminal {SafeWord} may
88+
be replaced by any sequence of characters that could replace {Word} provided
89+
that the same sequence of characters could not replace {SevenCarlinWords}.
8490

85-
SafeName : Name but not SevenCarlinWords
86-
87-
means that the nonterminal {SafeName} may be replaced by any sequence of
88-
characters that could replace {Name} provided that the same sequence of
89-
characters could not replace {SevenCarlinWords}.
91+
SafeWord : Word but not SevenCarlinWords
9092

9193
A grammar may also list a number of restrictions after "but not" separated
9294
by "or".
@@ -96,6 +98,18 @@ For example:
9698
NonBooleanName : Name but not `true` or `false`
9799

98100

101+
**Lookahead Restrictions**
102+
103+
A grammar production may specify that certain characters or tokens are not
104+
permitted to follow it by using the pattern {[lookahead != NotAllowed]}.
105+
Lookahead restrictions are often used to remove ambiguity from the grammar.
106+
107+
The following example makes it clear that {Letter+} must be greedy, since {Word}
108+
cannot be followed by yet another {Letter}.
109+
110+
Word :: Letter+ [lookahead != Letter]
111+
112+
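
As an illustration of the lookahead restriction added above, here is a minimal Python sketch. It assumes {Letter} means the ASCII letters `A`-`Z` and `a`-`z`, as defined in Appendix B; the function names `lex_word` and `is_valid_word_match` are hypothetical and not part of the spec. A greedy scan satisfies `[lookahead != Letter]` by construction, while a shorter match is rejected because a {Letter} follows it.

```python
import string

# Assumption: Letter is A-Z and a-z, as in the Appendix B grammar.
LETTERS = set(string.ascii_letters)


def lex_word(source: str, start: int = 0) -> tuple[str, int]:
    """Match `Word :: Letter+ [lookahead != Letter]` beginning at `start`.

    Returns the matched text and the index just past it.
    """
    end = start
    while end < len(source) and source[end] in LETTERS:
        end += 1
    if end == start:
        raise ValueError(f"expected a Letter at index {start}")
    # The loop stops only at end of input or at a non-Letter, so the
    # lookahead restriction holds without a separate check.
    return source[start:end], end


def is_valid_word_match(source: str, start: int, end: int) -> bool:
    """Check whether source[start:end] is a valid Word under the restriction."""
    text = source[start:end]
    if not text or any(ch not in LETTERS for ch in text):
        return False
    # The match is valid only if the next character is not a Letter.
    return end >= len(source) or source[end] not in LETTERS
```

With this sketch, `lex_word("abc def")` yields `abc`, and `is_valid_word_match("abc", 0, 2)` is false because `ab` is followed by the {Letter} `c`.
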
99113
**Optionality and Lists**
100114

101115
A subscript suffix "{Symbol?}" is shorthand for two possible sequences, one

spec/Appendix B -- Grammar Summary.md

Lines changed: 36 additions & 12 deletions
@@ -1,6 +1,12 @@
11
# B. Appendix: Grammar Summary
22

3-
SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
3+
## Source Text
4+
5+
SourceCharacter ::
6+
- "U+0009"
7+
- "U+000A"
8+
- "U+000D"
9+
- "U+0020–U+FFFF"
410

511

612
## Ignored Tokens
@@ -20,10 +26,10 @@ WhiteSpace ::
2026

2127
LineTerminator ::
2228
- "New Line (U+000A)"
23-
- "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
29+
- "Carriage Return (U+000D)" [lookahead != "New Line (U+000A)"]
2430
- "Carriage Return (U+000D)" "New Line (U+000A)"
2531

26-
Comment :: `#` CommentChar*
32+
Comment :: `#` CommentChar* [lookahead != CommentChar]
2733

2834
CommentChar :: SourceCharacter but not LineTerminator
2935

@@ -41,24 +47,41 @@ Token ::
4147

4248
Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | }
4349

44-
Name :: /[_A-Za-z][_0-9A-Za-z]*/
50+
Name ::
51+
- NameStart NameContinue* [lookahead != NameContinue]
52+
53+
NameStart ::
54+
- Letter
55+
- `_`
56+
57+
NameContinue ::
58+
- Letter
59+
- Digit
60+
- `_`
4561

46-
IntValue :: IntegerPart
62+
Letter :: one of
63+
`A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M`
64+
`N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`
65+
`a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m`
66+
`n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`
67+
68+
Digit :: one of
69+
`0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
70+
71+
IntValue :: IntegerPart [lookahead != {Digit, `.`}]
4772

4873
IntegerPart ::
4974
- NegativeSign? 0
5075
- NegativeSign? NonZeroDigit Digit*
5176

5277
NegativeSign :: -
5378

54-
Digit :: one of 0 1 2 3 4 5 6 7 8 9
55-
5679
NonZeroDigit :: Digit but not `0`
5780

5881
FloatValue ::
59-
- IntegerPart FractionalPart
60-
- IntegerPart ExponentPart
61-
- IntegerPart FractionalPart ExponentPart
82+
- IntegerPart FractionalPart ExponentPart [lookahead != Digit]
83+
- IntegerPart FractionalPart [lookahead != Digit]
84+
- IntegerPart ExponentPart [lookahead != Digit]
6285

6386
FractionalPart :: . Digit+
6487

@@ -69,7 +92,8 @@ ExponentIndicator :: one of `e` `E`
6992
Sign :: one of + -
7093

7194
StringValue ::
72-
- `"` StringCharacter* `"`
95+
- `""` [lookahead != `"`]
96+
- `"` StringCharacter+ `"`
7397
- `"""` BlockStringCharacter* `"""`
7498

7599
StringCharacter ::
@@ -89,7 +113,7 @@ Note: Block string values are interpreted to exclude blank initial and trailing
89113
lines and uniform indentation with {BlockStringValue()}.
90114

91115

92-
## Document
116+
## Document Syntax
93117

94118
Document : Definition+
95119

spec/Section 2 -- Language.md

Lines changed: 83 additions & 27 deletions
@@ -7,16 +7,50 @@ common unit of composition allowing for query reuse.
77

88
A GraphQL document is defined as a syntactic grammar where terminal symbols are
99
tokens (indivisible lexical units). These tokens are defined in a lexical
10-
grammar which matches patterns of source characters (defined by a
11-
double-colon `::`).
10+
grammar which matches patterns of source characters. In this document, syntactic
11+
grammar productions are distinguished with a colon `:` while lexical grammar
12+
productions are distinguished with a double-colon `::`.
1213

13-
Note: See [Appendix A](#sec-Appendix-Notation-Conventions) for more details about the definition of lexical and syntactic grammar and other notational conventions
14-
used in this document.
14+
The source text of a GraphQL document must be a sequence of {SourceCharacter}.
15+
The character sequence must be described by a sequence of {Token} and {Ignored}
16+
lexical grammars. The lexical token sequence, omitting {Ignored}, must be
17+
described by a single {Document} syntactic grammar.
18+
19+
Note: See [Appendix A](#sec-Appendix-Notation-Conventions) for more information
20+
about the lexical and syntactic grammar and other notational conventions used
21+
throughout this document.
22+
23+
**Lexical Analysis & Syntactic Parse**
24+
25+
The source text of a GraphQL document is first converted into a sequence of
26+
lexical tokens, {Token}, and ignored tokens, {Ignored}. The source text is
27+
scanned from left to right, repeatedly taking the next possible sequence of
28+
code-points allowed by the lexical grammar productions as the next token. This
29+
sequence of lexical tokens are then scanned from left to right to produce an
30+
abstract syntax tree (AST) according to the {Document} syntactical grammar.
31+
32+
Lexical grammar productions in this document use *lookahead restrictions* to
33+
remove ambiguity and ensure a single valid lexical analysis. A lexical token is
34+
only valid if not followed by a character in its lookahead restriction.
35+
36+
For example, an {IntValue} has the restriction {[lookahead != Digit]}, so cannot
37+
be followed by a {Digit}. Because of this, the sequence `123` cannot represent
38+
as the tokens (`12`, `3`) since `12` is followed by the {Digit} `3` and so must
39+
only represent a single token. Use {WhiteSpace} or other {Ignored} between
40+
characters to represent multiple tokens.
41+
42+
Note: This typically has the same behavior as a
43+
"[maximal munch](https://en.wikipedia.org/wiki/Maximal_munch)" longest possible
44+
match, however some lookahead restrictions include additional constraints.
1545

1646

1747
## Source Text
1848

19-
SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
49+
SourceCharacter ::
50+
- "U+0009"
51+
- "U+000A"
52+
- "U+000D"
53+
- "U+0020–U+FFFF"
2054

2155
GraphQL documents are expressed as a sequence of
2256
[Unicode](https://unicode.org/standard/standard.html) characters. However, with
@@ -60,7 +94,7 @@ control tools.
6094

6195
LineTerminator ::
6296
- "New Line (U+000A)"
63-
- "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
97+
- "Carriage Return (U+000D)" [lookahead != "New Line (U+000A)"]
6498
- "Carriage Return (U+000D)" "New Line (U+000A)"
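
A small Python sketch of how the three {LineTerminator} alternatives might be applied when counting lines. The helper names `lex_line_terminator` and `count_lines` are hypothetical. The lookahead restriction on the second alternative means a Carriage Return directly followed by a New Line is always consumed as one terminator, never as two.

```python
from typing import Optional


def lex_line_terminator(source: str, start: int) -> Optional[int]:
    """Return the index just past one LineTerminator at `start`, or None."""
    ch = source[start:start + 1]
    if ch == "\n":  # New Line (U+000A)
        return start + 1
    if ch == "\r":  # Carriage Return (U+000D)
        if source[start + 1:start + 2] == "\n":
            return start + 2  # CR LF is a single terminator
        return start + 1  # CR [lookahead != New Line]
    return None


def count_lines(source: str) -> int:
    """Count lines, treating CR, LF, and CR LF as one terminator each."""
    line = 1
    pos = 0
    while pos < len(source):
        end = lex_line_terminator(source, pos)
        if end is None:
            pos += 1
        else:
            line += 1
            pos = end
    return line
```
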
6599

66100
Like white space, line terminators are used to improve the legibility of source
@@ -75,7 +109,7 @@ the line number.
75109

76110
### Comments
77111

78-
Comment :: `#` CommentChar*
112+
Comment :: `#` CommentChar* [lookahead != CommentChar]
79113

80114
CommentChar :: SourceCharacter but not LineTerminator
81115

@@ -118,8 +152,7 @@ Token ::
118152
A GraphQL document is comprised of several kinds of indivisible lexical tokens
119153
defined here in a lexical grammar by patterns of source Unicode characters.
120154

121-
Tokens are later used as terminal symbols in a GraphQL Document
122-
syntactic grammars.
155+
Tokens are later used as terminal symbols in GraphQL syntactic grammar rules.
123156

124157

125158
### Ignored Tokens
@@ -131,15 +164,16 @@ Ignored ::
131164
- Comment
132165
- Comma
133166

134-
Before and after every lexical token may be any amount of ignored tokens
135-
including {WhiteSpace} and {Comment}. No ignored regions of a source
136-
document are significant, however ignored source characters may appear within
137-
a lexical token in a significant way, for example a {String} may contain white
138-
space characters.
167+
{Ignored} tokens are used to improve readability and provide separation between
168+
{Token}, but are otherwise insignificant and not referenced in syntactical
169+
grammar productions.
139170

140-
No characters are ignored while parsing a given token, as an example no
141-
white space characters are permitted between the characters defining a
142-
{FloatValue}.
171+
Any amount of {Ignored} may appear before and after every lexical token. No
172+
ignored regions of a source document are significant, however ignored source
173+
characters may appear within a lexical token in a significant way, for example a
174+
{String} may contain white space characters. No characters are ignored within a
175+
{Token}, as an example no white space characters are permitted between the
176+
characters defining a {FloatValue}.
143177

144178

145179
### Punctuators
@@ -153,7 +187,26 @@ lacks the punctuation often used to describe mathematical expressions.
153187

154188
### Names
155189

156-
Name :: /[_A-Za-z][_0-9A-Za-z]*/
190+
Name ::
191+
- NameStart NameContinue* [lookahead != NameContinue]
192+
193+
NameStart ::
194+
- Letter
195+
- `_`
196+
197+
NameContinue ::
198+
- Letter
199+
- Digit
200+
- `_`
201+
202+
Letter :: one of
203+
`A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M`
204+
`N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`
205+
`a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m`
206+
`n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`
207+
208+
Digit :: one of
209+
`0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
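
The {Name} productions above could be implemented roughly as in the following Python sketch. The set names and the `lex_name` function are hypothetical. The scan consumes every {NameContinue} it can, which is what the `[lookahead != NameContinue]` restriction requires.

```python
import string

LETTER = set(string.ascii_letters)  # Letter :: A-Z a-z
DIGIT = set(string.digits)  # Digit :: 0-9
NAME_START = LETTER | {"_"}
NAME_CONTINUE = LETTER | DIGIT | {"_"}


def lex_name(source: str, start: int = 0) -> tuple[str, int]:
    """Match `NameStart NameContinue* [lookahead != NameContinue]`."""
    if source[start:start + 1] not in NAME_START:
        raise ValueError(f"expected NameStart at index {start}")
    end = start + 1
    # Greedy scan: stop only at end of input or at a non-NameContinue.
    while end < len(source) and source[end] in NAME_CONTINUE:
        end += 1
    return source[start:end], end
```

For example, `lex_name("other_name: 1")` matches `other_name` and stops at the `:`, while `lex_name("9x")` raises because a {Digit} is not a {NameStart}.
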
157210

158211
GraphQL Documents are full of named things: operations, fields, arguments,
159212
types, directives, fragments, and variables. All names must follow the same
@@ -163,8 +216,9 @@ Names in GraphQL are case-sensitive. That is to say `name`, `Name`, and `NAME`
163216
all refer to different names. Underscores are significant, which means
164217
`other_name` and `othername` are two different names.
165218

166-
Names in GraphQL are limited to this <acronym>ASCII</acronym> subset of possible
167-
characters to support interoperation with as many other systems as possible.
219+
Note: Names in GraphQL are limited to the Latin <acronym>ASCII</acronym> subset
220+
of possible Source Characters in order to support interoperation with as many
221+
other systems as possible.
168222

169223

170224
## Document
@@ -666,27 +720,28 @@ specified as a variable. List and inputs objects may also contain variables (unl
666720

667721
### Int Value
668722

669-
IntValue :: IntegerPart
723+
IntValue :: IntegerPart [lookahead != {Digit, `.`}]
670724

671725
IntegerPart ::
672726
- NegativeSign? 0
673727
- NegativeSign? NonZeroDigit Digit*
674728

675729
NegativeSign :: -
676730

677-
Digit :: one of 0 1 2 3 4 5 6 7 8 9
678-
679731
NonZeroDigit :: Digit but not `0`
680732

681733
An Int number is specified without a decimal point or exponent (ex. `1`).
682734

735+
An {IntValue} must not be followed by a {`.`}. If a {`.`} follows the token must
736+
only be interpreted as a {FloatValue}.
737+
683738

684739
### Float Value
685740

686741
FloatValue ::
687-
- IntegerPart FractionalPart
688-
- IntegerPart ExponentPart
689-
- IntegerPart FractionalPart ExponentPart
742+
- IntegerPart FractionalPart ExponentPart [lookahead != Digit]
743+
- IntegerPart FractionalPart [lookahead != Digit]
744+
- IntegerPart ExponentPart [lookahead != Digit]
690745

691746
FractionalPart :: . Digit+
692747

@@ -710,7 +765,8 @@ The two keywords `true` and `false` represent the two boolean values.
710765
### String Value
711766

712767
StringValue ::
713-
- `"` StringCharacter* `"`
768+
- `""` [lookahead != `"`]
769+
- `"` StringCharacter+ `"`
714770
- `"""` BlockStringCharacter* `"""`
715771

716772
StringCharacter ::
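
To show why the empty string alternative needs `[lookahead != `"`]`, here is a minimal Python sketch that only classifies how a {StringValue} begins. The function name and the returned labels are hypothetical, and the sketch does not lex string contents or escapes. Checking for three quotes first means `""""""` is read as one empty block string, not as three empty strings.

```python
def classify_string_start(source: str, start: int = 0) -> str:
    """Classify the StringValue alternative that begins at `start`."""
    if source.startswith('"""', start):
        # Three quotes always open a block string.
        return "block"
    if source.startswith('""', start):
        # Two quotes not followed by a third: the empty string.
        return "empty"
    if source.startswith('"', start):
        # One quote followed by at least one StringCharacter.
        return "string"
    raise ValueError(f"expected a quote at index {start}")
```
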
