Clarify lexing is greedy with lookahead restrictions.

leebyron · leebyron · commit 27c2602cb761 · 2019-07-30T00:53:54.000-07:00
GraphQL syntactical grammars intend to be unambiguous. Lexical grammars should be as well, but historically there has been an assumption that lexical parsing is greedy. This is obvious for numbers and words, but less obvious for empty block strings.

This also removes the regular expression representation from the lexical grammar notation, since it wasn't always clear.

Either way, the additional clarity removes ambiguity from the spec.

Partial fix for #564

Specifically addresses #564 (comment)
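For reviewers, here is a minimal sketch of how a hand-written lexer might honor the lookahead restrictions added in this commit. It is not taken from any reference implementation. The names `lex_number` and `string_kind` are hypothetical, and the regular expressions are only an approximation of the `IntValue`, `FloatValue`, and `StringValue` productions as they appear in the diff below.

```python
import re

# Approximations of the productions in this commit (assumed, not normative).
INTEGER_PART = r"-?(?:0|[1-9][0-9]*)"
FLOAT_RE = re.compile(
    INTEGER_PART + r"(?:\.[0-9]+(?:[eE][+-]?[0-9]+)?|[eE][+-]?[0-9]+)"
)
INT_RE = re.compile(INTEGER_PART)
DIGITS = "0123456789"


def lex_number(src: str, pos: int = 0) -> tuple[str, str]:
    """Lex one numeric token at pos, enforcing its lookahead restriction."""
    match = FLOAT_RE.match(src, pos)
    kind, forbidden = "FloatValue", DIGITS  # [lookahead != Digit]
    if match is None:
        match = INT_RE.match(src, pos)
        kind, forbidden = "IntValue", DIGITS + "."  # [lookahead != {Digit, `.`}]
    if match is None:
        raise SyntaxError(f"no number at position {pos}")
    end = match.end()
    # The token is only valid if the next character is not forbidden.
    if end < len(src) and src[end] in forbidden:
        raise SyntaxError(f"{kind} cannot be followed by {src[end]!r}")
    return kind, match.group()


def string_kind(src: str, pos: int = 0) -> str:
    """Decide which StringValue alternative starts at pos."""
    if src.startswith('"""', pos):
        return "block"  # `"""` BlockStringCharacter* `"""`
    if src.startswith('""', pos):
        return "empty"  # `""` [lookahead != `"`], guaranteed by the check above
    if src.startswith('"', pos):
        return "nonempty"  # `"` StringCharacter+ `"`
    raise SyntaxError(f"no string at position {pos}")
```

Under these assumptions, `lex_number("123")` yields the single token `("IntValue", "123")`, while `0123` and `1.` are rejected rather than split into several tokens. The second case shows where the restriction goes beyond plain longest-match behavior.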
diff --git a/spec/Appendix A -- Notation Conventions.md b/spec/Appendix A -- Notation Conventions.md
@@ -22,8 +22,10 @@ of the sequences it is defined by, until all non-terminal symbols have been
 replaced by terminal characters.
 
 Terminals are represented in this document in a monospace font in two forms: a
-specific Unicode character or sequence of Unicode characters (ex. {`=`} or {`terminal`}), and a pattern of Unicode characters defined by a regular expression
-(ex {/[0-9]+/}).
+specific Unicode character or sequence of Unicode characters (ie. {`=`} or
+{`terminal`}), and prose typically describing a specific Unicode code-point
+{"Space (U+0020)"}. Sequences of Unicode characters only appear in syntactic
+grammars and represent a {Name} token of that specific sequence.
 
 Non-terminal production rules are represented in this document using the
 following notation for a non-terminal with a single definition:
@@ -48,23 +50,25 @@ ListOfLetterA :
 
 The GraphQL language is defined in a syntactic grammar where terminal symbols
 are tokens. Tokens are defined in a lexical grammar which matches patterns of
-source characters. The result of parsing a sequence of source Unicode characters
-produces a GraphQL AST.
+source characters. The result of parsing a source text sequence of Unicode
+characters first produces a sequence of lexical tokens according to the lexical
+grammar which then produces abstract syntax tree (AST) according to the
+syntactical grammar.
 
-A Lexical grammar production describes non-terminal "tokens" by
+A lexical grammar production describes non-terminal "tokens" by
 patterns of terminal Unicode characters. No "whitespace" or other ignored
 characters may appear between any terminal Unicode characters in the lexical
 grammar production. A lexical grammar production is distinguished by a two colon
 `::` definition.
 
-Word :: /[A-Za-z]+/
+Word :: Letter+
 
 A Syntactical grammar production describes non-terminal "rules" by patterns of
-terminal Tokens. Whitespace and other ignored characters may appear before or
-after any terminal Token. A syntactical grammar production is distinguished by a
-one colon `:` definition.
+terminal Tokens. {WhiteSpace} and other {Ignored} sequences may appear before or
+after any terminal {Token}. A syntactical grammar production is distinguished by
+a one colon `:` definition.
 
-Sentence : Noun Verb
+Sentence : Word+ `.`
 
 
 ## Grammar Notation
@@ -80,13 +84,11 @@ and their expanded definitions in the context-free grammar.
 A grammar production may specify that certain expansions are not permitted by
 using the phrase "but not" and then indicating the expansions to be excluded.
 
-For example, the production:
+For example, the following production means that the nonterminal {SafeWord} may
+be replaced by any sequence of characters that could replace {Word} provided
+that the same sequence of characters could not replace {SevenCarlinWords}.
 
-SafeName : Name but not SevenCarlinWords
-
-means that the nonterminal {SafeName} may be replaced by any sequence of
-characters that could replace {Name} provided that the same sequence of
-characters could not replace {SevenCarlinWords}.
+SafeWord : Word but not SevenCarlinWords
 
 A grammar may also list a number of restrictions after "but not" separated
 by "or".
@@ -96,6 +98,18 @@ For example:
 NonBooleanName : Name but not `true` or `false`
 
 
+**Lookahead Restrictions**
+
+A grammar production may specify that certain characters or tokens are not
+permitted to follow it by using the pattern {[lookahead != NotAllowed]}.
+Lookahead restrictions are often used to remove ambiguity from the grammar.
+
+The following example makes it clear that {Letter+} must be greedy, since {Word}
+cannot be followed by yet another {Letter}.
+
+Word :: Letter+ [lookahead != Letter]
+
+
 **Optionality and Lists**
 
 A subscript suffix "{Symbol?}" is shorthand for two possible sequences, one
diff --git a/spec/Appendix B -- Grammar Summary.md b/spec/Appendix B -- Grammar Summary.md
@@ -1,6 +1,12 @@
 # B. Appendix: Grammar Summary
 
-SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
+## Source Text
+
+SourceCharacter ::
+  - "U+0009"
+  - "U+000A"
+  - "U+000D"
+  - "U+0020–U+FFFF"
 
 
 ## Ignored Tokens
@@ -20,10 +26,10 @@ WhiteSpace ::
 
 LineTerminator ::
   - "New Line (U+000A)"
-  - "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
+  - "Carriage Return (U+000D)" [lookahead != "New Line (U+000A)"]
   - "Carriage Return (U+000D)" "New Line (U+000A)"
 
-Comment :: `#` CommentChar*
+Comment :: `#` CommentChar* [lookahead != CommentChar]
 
 CommentChar :: SourceCharacter but not LineTerminator
 
@@ -41,24 +47,41 @@ Token ::
 
 Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | }
 
-Name :: /[_A-Za-z][_0-9A-Za-z]*/
+Name ::
+  - NameStart NameContinue* [lookahead != NameContinue]
+
+NameStart ::
+  - Letter
+  - `_`
+
+NameContinue ::
+  - Letter
+  - Digit
+  - `_`
 
-IntValue :: IntegerPart
+Letter :: one of
+  `A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M`
+  `N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`
+  `a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m`
+  `n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`
+
+Digit :: one of
+  `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+
+IntValue :: IntegerPart [lookahead != {Digit, `.`}]
 
 IntegerPart ::
   - NegativeSign? 0
   - NegativeSign? NonZeroDigit Digit*
 
 NegativeSign :: -
 
-Digit :: one of 0 1 2 3 4 5 6 7 8 9
-
 NonZeroDigit :: Digit but not `0`
 
 FloatValue ::
-  - IntegerPart FractionalPart
-  - IntegerPart ExponentPart
-  - IntegerPart FractionalPart ExponentPart
+  - IntegerPart FractionalPart ExponentPart [lookahead != Digit]
+  - IntegerPart FractionalPart [lookahead != Digit]
+  - IntegerPart ExponentPart [lookahead != Digit]
 
 FractionalPart :: . Digit+
 
@@ -69,7 +92,8 @@ ExponentIndicator :: one of `e` `E`
 Sign :: one of + -
 
 StringValue ::
-  - `"` StringCharacter* `"`
+  - `""` [lookahead != `"`]
+  - `"` StringCharacter+ `"`
   - `"""` BlockStringCharacter* `"""`
 
 StringCharacter ::
@@ -89,7 +113,7 @@ Note: Block string values are interpreted to exclude blank initial and trailing
 lines and uniform indentation with {BlockStringValue()}.
 
 
-## Document
+## Document Syntax
 
 Document : Definition+
 
diff --git a/spec/Section 2 -- Language.md b/spec/Section 2 -- Language.md
@@ -7,16 +7,50 @@ common unit of composition allowing for query reuse.
 
 A GraphQL document is defined as a syntactic grammar where terminal symbols are
 tokens (indivisible lexical units). These tokens are defined in a lexical
-grammar which matches patterns of source characters (defined by a
-double-colon `::`).
+grammar which matches patterns of source characters. In this document, syntactic
+grammar productions are distinguished with a colon `:` while lexical grammar
+productions are distinguished with a double-colon `::`.
 
-Note: See [Appendix A](#sec-Appendix-Notation-Conventions) for more details about the definition of lexical and syntactic grammar and other notational conventions
-used in this document.
+The source text of a GraphQL document must be a sequence of {SourceCharacter}.
+The character sequence must be described by a sequence of {Token} and {Ignored}
+lexical grammars. The lexical token sequence, omitting {Ignored}, must be
+described by a single {Document} syntactic grammar.
+
+Note: See [Appendix A](#sec-Appendix-Notation-Conventions) for more information
+about the lexical and syntactic grammar and other notational conventions used
+throughout this document.
+
+**Lexical Analysis & Syntactic Parse**
+
+The source text of a GraphQL document is first converted into a sequence of
+lexical tokens, {Token}, and ignored tokens, {Ignored}. The source text is
+scanned from left to right, repeatedly taking the next possible sequence of
+code-points allowed by the lexical grammar productions as the next token. This
+sequence of lexical tokens are then scanned from left to right to produce an
+abstract syntax tree (AST) according to the {Document} syntactical grammar.
+
+Lexical grammar productions in this document use *lookahead restrictions* to
+remove ambiguity and ensure a single valid lexical analysis. A lexical token is
+only valid if not followed by a character in its lookahead restriction.
+
+For example, an {IntValue} has the restriction {[lookahead != Digit]}, so cannot
+be followed by a {Digit}. Because of this, the sequence `123` cannot represent
+as the tokens (`12`, `3`) since `12` is followed by the {Digit} `3` and so must
+only represent a single token. Use {WhiteSpace} or other {Ignored} between
+characters to represent multiple tokens.
+
+Note: This typically has the same behavior as a
+"[maximal munch](https://en.wikipedia.org/wiki/Maximal_munch)" longest possible
+match, however some lookahead restrictions include additional constraints.
 
 
 ## Source Text
 
-SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
+SourceCharacter ::
+  - "U+0009"
+  - "U+000A"
+  - "U+000D"
+  - "U+0020–U+FFFF"
 
 GraphQL documents are expressed as a sequence of
 [Unicode](https://unicode.org/standard/standard.html) characters. However, with
@@ -60,7 +94,7 @@ control tools.
 
 LineTerminator ::
   - "New Line (U+000A)"
-  - "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
+  - "Carriage Return (U+000D)" [lookahead != "New Line (U+000A)"]
   - "Carriage Return (U+000D)" "New Line (U+000A)"
 
 Like white space, line terminators are used to improve the legibility of source
@@ -75,7 +109,7 @@ the line number.
 
 ### Comments
 
-Comment :: `#` CommentChar*
+Comment :: `#` CommentChar* [lookahead != CommentChar]
 
 CommentChar :: SourceCharacter but not LineTerminator
 
@@ -118,8 +152,7 @@ Token ::
 A GraphQL document is comprised of several kinds of indivisible lexical tokens
 defined here in a lexical grammar by patterns of source Unicode characters.
 
-Tokens are later used as terminal symbols in a GraphQL Document
-syntactic grammars.
+Tokens are later used as terminal symbols in GraphQL syntactic grammar rules.
 
 
 ### Ignored Tokens
@@ -131,15 +164,16 @@ Ignored ::
   - Comment
   - Comma
 
-Before and after every lexical token may be any amount of ignored tokens
-including {WhiteSpace} and {Comment}. No ignored regions of a source
-document are significant, however ignored source characters may appear within
-a lexical token in a significant way, for example a {String} may contain white
-space characters.
+{Ignored} tokens are used to improve readability and provide separation between
+{Token}, but are otherwise insignificant and not referenced in syntactical
+grammar productions.
 
-No characters are ignored while parsing a given token, as an example no
-white space characters are permitted between the characters defining a
-{FloatValue}.
+Any amount of {Ignored} may appear before and after every lexical token. No
+ignored regions of a source document are significant, however ignored source
+characters may appear within a lexical token in a significant way, for example a
+{String} may contain white space characters. No characters are ignored within a
+{Token}, as an example no white space characters are permitted between the
+characters defining a {FloatValue}.
 
 
 ### Punctuators
@@ -153,7 +187,26 @@ lacks the punctuation often used to describe mathematical expressions.
 
 ### Names
 
-Name :: /[_A-Za-z][_0-9A-Za-z]*/
+Name ::
+  - NameStart NameContinue* [lookahead != NameContinue]
+
+NameStart ::
+  - Letter
+  - `_`
+
+NameContinue ::
+  - Letter
+  - Digit
+  - `_`
+
+Letter :: one of
+  `A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M`
+  `N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`
+  `a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m`
+  `n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`
+
+Digit :: one of
+  `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
 
 GraphQL Documents are full of named things: operations, fields, arguments,
 types, directives, fragments, and variables. All names must follow the same
@@ -163,8 +216,9 @@ Names in GraphQL are case-sensitive. That is to say `name`, `Name`, and `NAME`
 all refer to different names. Underscores are significant, which means
 `other_name` and `othername` are two different names.
 
-Names in GraphQL are limited to this <acronym>ASCII</acronym> subset of possible
-characters to support interoperation with as many other systems as possible.
+Note: Names in GraphQL are limited to the Latin <acronym>ASCII</acronym> subset
+of possible Source Characters in order to support interoperation with as many
+other systems as possible.
 
 
 ## Document
@@ -666,27 +720,28 @@ specified as a variable. List and inputs objects may also contain variables (unl
 
 ### Int Value
 
-IntValue :: IntegerPart
+IntValue :: IntegerPart [lookahead != {Digit, `.`}]
 
 IntegerPart ::
   - NegativeSign? 0
   - NegativeSign? NonZeroDigit Digit*
 
 NegativeSign :: -
 
-Digit :: one of 0 1 2 3 4 5 6 7 8 9
-
 NonZeroDigit :: Digit but not `0`
 
 An Int number is specified without a decimal point or exponent (ex. `1`).
 
+An {IntValue} must not be followed by a {`.`}. If a {`.`} follows the token must
+only be interpreted as a {FloatValue}.
+
 
 ### Float Value
 
 FloatValue ::
-  - IntegerPart FractionalPart
-  - IntegerPart ExponentPart
-  - IntegerPart FractionalPart ExponentPart
+  - IntegerPart FractionalPart ExponentPart [lookahead != Digit]
+  - IntegerPart FractionalPart [lookahead != Digit]
+  - IntegerPart ExponentPart [lookahead != Digit]
 
 FractionalPart :: . Digit+
 
@@ -710,7 +765,8 @@ The two keywords `true` and `false` represent the two boolean values.
 ### String Value
 
 StringValue ::
-  - `"` StringCharacter* `"`
+  - `""` [lookahead != `"`]
+  - `"` StringCharacter+ `"`
   - `"""` BlockStringCharacter* `"""`
 
 StringCharacter ::