Clarify lexing is greedy with lookahead restrictions.
GraphQL's syntactic grammar is intended to be unambiguous. The lexical grammar should be as well, but it has historically relied on an assumption that lexical parsing is greedy. This is obvious for numbers and words, but less obvious for empty block strings.
This also removes the regular expression representation from the lexical grammar notation, since its meaning wasn't always clear.
Either way, the additional clarity removes ambiguity from the spec.
Partial fix for #564
Specifically addresses #564 (comment)
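To illustrate the empty block string case mentioned above, here is a minimal sketch (not part of this change). It assumes the ambiguity is that six consecutive quote characters can be read either as three empty strings or as one empty block string, and that the fix is a lookahead restriction forbidding a quote after the empty string token. The function name `segmentations` and its `restricted` flag are hypothetical.

```python
def segmentations(source, tokens, restricted=False):
    """Return every way to split `source` into the given literal tokens.

    With restricted=True, the empty string token `""` may not be followed
    by another quote character (an assumed lookahead restriction).
    """
    if source == "":
        return [[]]
    results = []
    for tok in tokens:
        if not source.startswith(tok):
            continue
        rest = source[len(tok):]
        # Lookahead restriction: reject `""` when a quote follows it.
        if restricted and tok == '""' and rest.startswith('"'):
            continue
        for tail in segmentations(rest, tokens, restricted):
            results.append([tok] + tail)
    return results


EMPTY_STRING = '""'
EMPTY_BLOCK_STRING = '""""""'
SOURCE = '""""""'

# Without the restriction there are two readings of the same source text.
unrestricted = segmentations(SOURCE, [EMPTY_STRING, EMPTY_BLOCK_STRING])
# With the restriction only the block string reading survives.
restricted = segmentations(SOURCE, [EMPTY_STRING, EMPTY_BLOCK_STRING], restricted=True)
```

Under these assumptions the unrestricted call yields both readings, while the restricted call leaves a single lexical analysis.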
spec/Appendix A -- Notation Conventions.md
Lines changed: 30 additions & 16 deletions
@@ -22,8 +22,10 @@ of the sequences it is defined by, until all non-terminal symbols have been
 replaced by terminal characters.

 Terminals are represented in this document in a monospace font in two forms: a
-specific Unicode character or sequence of Unicode characters (ex. {`=`} or {`terminal`}), and a pattern of Unicode characters defined by a regular expression
-(ex {/[0-9]+/}).
+specific Unicode character or sequence of Unicode characters (ie. {`=`} or
+{`terminal`}), and prose typically describing a specific Unicode code-point
+{"Space (U+0020)"}. Sequences of Unicode characters only appear in syntactic
+grammars and represent a {Name} token of that specific sequence.

 Non-terminal production rules are represented in this document using the
 following notation for a non-terminal with a single definition:
@@ -48,23 +50,25 @@ ListOfLetterA :

 The GraphQL language is defined in a syntactic grammar where terminal symbols
 are tokens. Tokens are defined in a lexical grammar which matches patterns of
-source characters. The result of parsing a sequence of source Unicode characters
-produces a GraphQL AST.
+source characters. The result of parsing a source text sequence of Unicode
+characters first produces a sequence of lexical tokens according to the lexical
+grammar which then produces abstract syntax tree (AST) according to the
+syntactical grammar.

-A Lexical grammar production describes non-terminal "tokens" by
+A lexical grammar production describes non-terminal "tokens" by
 patterns of terminal Unicode characters. No "whitespace" or other ignored
 characters may appear between any terminal Unicode characters in the lexical
 grammar production. A lexical grammar production is distinguished by a two colon
 `::` definition.

-Word :: /[A-Za-z]+/
+Word :: Letter+

 A Syntactical grammar production describes non-terminal "rules" by patterns of
-terminal Tokens. Whitespace and other ignored characters may appear before or
-after any terminal Token. A syntactical grammar production is distinguished by a
-one colon `:` definition.
+terminal Tokens. {WhiteSpace} and other {Ignored} sequences may appear before or
+after any terminal {Token}. A syntactical grammar production is distinguished by
+a one colon `:` definition.

-Sentence : Noun Verb
+Sentence : Word+ `.`


 ## Grammar Notation
@@ -80,13 +84,11 @@ and their expanded definitions in the context-free grammar.
 A grammar production may specify that certain expansions are not permitted by
 using the phrase "but not" and then indicating the expansions to be excluded.

-For example, the production:
+For example, the following production means that the nonterminal {SafeWord} may
+be replaced by any sequence of characters that could replace {Word} provided
+that the same sequence of characters could not replace {SevenCarlinWords}.

-SafeName : Name but not SevenCarlinWords
-
-means that the nonterminal {SafeName} may be replaced by any sequence of
-characters that could replace {Name} provided that the same sequence of
-characters could not replace {SevenCarlinWords}.
+SafeWord : Word but not SevenCarlinWords

 A grammar may also list a number of restrictions after "but not" separated
 by "or".
@@ -96,6 +98,18 @@ For example:
 NonBooleanName : Name but not `true` or `false`


+**Lookahead Restrictions**
+
+A grammar production may specify that certain characters or tokens are not
+permitted to follow it by using the pattern {[lookahead != NotAllowed]}.
+Lookahead restrictions are often used to remove ambiguity from the grammar.
+
+The following example makes it clear that {Letter+} must be greedy, since {Word}
+cannot be followed by yet another {Letter}.
+
+Word :: Letter+ [lookahead != Letter]
+
+
 **Optionality and Lists**

 A subscript suffix "{Symbol?}" is shorthand for two possible sequences, one
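As a rough illustration of the `Word :: Letter+ [lookahead != Letter]` production in the hunk above, the following sketch lists every prefix that could match {Word} at a given position, with and without the lookahead restriction. It assumes {Letter} means the ASCII letters A-Z and a-z (as the removed regular expression suggests). The function name `word_matches` is hypothetical.

```python
import string

LETTERS = set(string.ascii_letters)  # assumption: Letter is A-Z or a-z


def word_matches(source, start=0, lookahead_restriction=True):
    """List every substring beginning at `start` that matches Word.

    Without the restriction, any non-empty run of letters matches.
    With it, a match is rejected if the next character is a Letter.
    """
    matches = []
    end = start
    while end < len(source) and source[end] in LETTERS:
        end += 1
        followed_by_letter = end < len(source) and source[end] in LETTERS
        if lookahead_restriction and followed_by_letter:
            continue  # [lookahead != Letter] rules this candidate out
        matches.append(source[start:end])
    return matches


ambiguous = word_matches("abc def", lookahead_restriction=False)
unambiguous = word_matches("abc def", lookahead_restriction=True)
```

With the restriction off, `a`, `ab`, and `abc` are all candidate matches. With it on, only the longest run `abc` remains, which is the greedy behavior the production is meant to pin down.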
spec/Section 2 -- Language.md
Lines changed: 83 additions & 27 deletions
@@ -7,16 +7,50 @@ common unit of composition allowing for query reuse.

 A GraphQL document is defined as a syntactic grammar where terminal symbols are
 tokens (indivisible lexical units). These tokens are defined in a lexical
-grammar which matches patterns of source characters (defined by a
-double-colon `::`).
+grammar which matches patterns of source characters. In this document, syntactic
+grammar productions are distinguished with a colon `:` while lexical grammar
+productions are distinguished with a double-colon `::`.

-Note: See [Appendix A](#sec-Appendix-Notation-Conventions) for more details about the definition of lexical and syntactic grammar and other notational conventions
-used in this document.
+The source text of a GraphQL document must be a sequence of {SourceCharacter}.
+The character sequence must be described by a sequence of {Token} and {Ignored}
+lexical grammars. The lexical token sequence, omitting {Ignored}, must be
+described by a single {Document} syntactic grammar.
+
+Note: See [Appendix A](#sec-Appendix-Notation-Conventions) for more information
+about the lexical and syntactic grammar and other notational conventions used
+throughout this document.
+
+**Lexical Analysis & Syntactic Parse**
+
+The source text of a GraphQL document is first converted into a sequence of
+lexical tokens, {Token}, and ignored tokens, {Ignored}. The source text is
+scanned from left to right, repeatedly taking the next possible sequence of
+code-points allowed by the lexical grammar productions as the next token. This
+sequence of lexical tokens are then scanned from left to right to produce an
+abstract syntax tree (AST) according to the {Document} syntactical grammar.
+
+Lexical grammar productions in this document use *lookahead restrictions* to
+remove ambiguity and ensure a single valid lexical analysis. A lexical token is
+only valid if not followed by a character in its lookahead restriction.
+
+For example, an {IntValue} has the restriction {[lookahead != Digit]}, so cannot
+be followed by a {Digit}. Because of this, the sequence `123` cannot represent
+as the tokens (`12`, `3`) since `12` is followed by the {Digit} `3` and so must
+only represent a single token. Use {WhiteSpace} or other {Ignored} between
+characters to represent multiple tokens.
+
+Note: This typically has the same behavior as a
+"[maximal munch](https://en.wikipedia.org/wiki/Maximal_munch)" longest possible
+match, however some lookahead restrictions include additional constraints.
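The `123` example in the hunk above can be sketched as a small left-to-right scanner. This is only an illustration under simplifying assumptions: {IntValue} is reduced to a run of ASCII digits (sign and leading-zero rules are ignored), {Ignored} is reduced to spaces and tabs, and the names `valid_int_token` and `lex_ints` are hypothetical.

```python
DIGITS = "0123456789"
IGNORED = " \t"  # simplification: only spaces and tabs are treated as Ignored


def valid_int_token(source, start, end):
    """Check source[start:end] against a simplified IntValue.

    The text must be a non-empty run of digits, and per the
    [lookahead != Digit] restriction it must not be followed by a Digit.
    """
    text = source[start:end]
    if text == "" or any(c not in DIGITS for c in text):
        return False
    followed_by_digit = end < len(source) and source[end] in DIGITS
    return not followed_by_digit


def lex_ints(source):
    """Scan left to right, emitting the only valid token at each position."""
    tokens = []
    i = 0
    while i < len(source):
        if source[i] in IGNORED:
            i += 1
            continue
        if source[i] not in DIGITS:
            raise ValueError(f"unexpected character {source[i]!r} at {i}")
        end = i + 1
        # Extend until the lookahead restriction is satisfied.
        while not valid_int_token(source, i, end):
            end += 1
        tokens.append(source[i:end])
        i = end
    return tokens
```

Here `valid_int_token("123", 0, 2)` is false because `12` is followed by the digit `3`, so `lex_ints("123")` can only produce one token, while `lex_ints("12 3")` produces two. For this simplified grammar the result coincides with a longest-match scan.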