blob: 14e6aa8f4e7404a92ccdbeefe939956d28fd7290 [file] [log] [blame]
 N4JS Design Specification

5. Parser

Some of the concepts described here were presented at EclipseCon 2013 and XtextCon 2014. Note that the material presented at the linked videos may be outdated.

5.1. Overview

The parser is created from an Xtext grammar. Actually, there are several grammars used as shown in Figure CD Grammars. These grammars and the parsers generated from them are described more closely in the following sections.

Figure 2. N4 Grammars

5.2. N4JS Parser

One of the most tricky parts of JavaScript is the parsing because there is a conceptual mismatch between the ANTLR runtime and the specified grammar. Another challenge is the disambiguation of regular expressions and binary operations. Both features require significant customizing of the generated parser (see figure below).

Figure 3. Overview custom parser implementation (runtime only)

5.3. Parser Generation Post-Processing

The ANTLR grammar that is generated by Xtext is post-processed to inject custom code into the grammar file before it is passed to the ANTLR tool. This is required in particular due to ASI (Automated Semicolon Insertion), but for some other reasons as well.

Actually, there are several injections:

1. Due to Xtext restrictions, the generated ANTLR grammar file (*.g) is modified. This means that some some additional actions are added and some rules are rewritten.

2. Due to ANTLR restrictions, the generated ANTLR Java parser (*.java) os modified. This means that some generated rules are slightly modified to match certain requirements.

3. Due to Java restrictions, the generated Java parser needs to be preprocessed in order to reduce the size of certain methods since they must not exceed 64k characters. This is implemented by means of an MWE fragment, activated after the other post processing steps are done.

The first two steps are handled by AntlrGeneratorWithCustomKeywordLogic, which is configured with additional helpers in GenerateN4JS.mwe2. shows the customized classes which modify the code generation. These classes are all part of the releng.utils bundle.

Figure 4. Class Diagram Parser Generation

5.3.1. Automatic Semicolon Insertion

The EcmaScript specification mandates that valid implementations automatically insert a semicolon as a statement delimiter if it is missing and the input file would become invalid due to the missing semicolon. This is known as ASI. It implies that not only valid implementations have to perform this, but a valid parser has to mimic this behavior in order to parse executable code. The ASI is implemented by two different means.

The parser’s error recovery strategy is customized so it attempts to insert a semicolon if it was expected. Both strategies have to work hand in hand in order to consume all sorts of legal JavaScript code.

5.3.1.1. Injected code in the Antlr grammar file

Under certain circumstances, the parser has to actively promote a token to become a semicolon even though it may be a syntactically a closing brace or line break. This has to happen before that token is consumed thus the rules for return statements, continue statements and break statements are enhanced to actively promote these tokens to semicolons.

The same rule is applied to promote line breaks between an expression and a possible postfix operator ++ or . At this location the line break is always treated as a semicolon even though the operator may be validly consumed and produce a postfix expression.

In both cases, the method promoteEOL() is used to move a token that may serve as an automatically injected semicolon from the so called hidden token channel to the semantic channel. The hidden tokens are usually not handled by the parser explicitly thus they are semantically invisible (therefore the term hidden token). Nevertheless, they can be put on the semantic channel explicitly to make them recognizable. That’s implemented in the EOL promotion. The offending tokens include the hidden line terminators and multi-line comments that include line breaks. Furthermore, closing braces (right curly brackets) are included in the set of offending tokens as well as explicit semicolons.

5.3.1.2. Customized error recovery

Since the EOL promotion does not work well with Antlr prediction mode, another customization complements that feature. As soon as an invalid token sequence is attempted to be parsed and missing semicolon would make that sequence valid, an offending token is sought and moved to the semantic channel. This is implemented in the custom recovery strategy.

5.3.2. Async and No line terminator allowed here Handling

There is no way of directly defining No line terminator allowed here. This is required not only for ASI, but also for async. This requires not only a special rule (using some rules from ASI), but also a special error recovery since the token ’async’ may be rejected (by the manually enriched rule) which is of course unexpected behavior from the generated source code.

5.3.3. Regular Expression

The ANTLR parsing process can basically be divided into three steps. First of all, the file contents has to be read from disk. This includes the proper encoding of bytes to characters. The second step is the lexing or tokenizing of the character stream. A token is a basically a typed region in the stream, that is a triplet of token-id, offset and length. The last step is the parsing of these tokens. The result is a semantic model that is associated with a node tree. All necessary information to validate the model can be deduced from these two interlinked representations.

Figure 5. Simplified visualization of the parsing

Since the default semantics and control flow of Antlr generated parsers do not really fit the requirements of a fully working JavaScript parser, some customizations are necessary. Regular expression literals in JavaScript cannot be syntactically disambiguated from div operations without contextual information. Nevertheless, the spec clearly describes, where a regular expression may appear and where it is prohibited. Unfortunately, it is not possible to implement these rules in the lexer alone, since it does not have enough contextual information. Therefore, the parser has been enhanced to establish a communication channel with the lexer. It announces when it expects a regular expression rather than a binary operation.

This required a reworking of the Antlr internals. Instead of a completely pre-populated TokenStream, the parser works on a lazy implementation that only reads as many characters as possible without a disambiguation between regular expression literals and divide operators.

Only after the parser has read this buffered tokens and potentially announced that it expects a regular expression, another batch of characters is processed by the lexer until the next ambiguous situation occurs. This is fundamentally different from the default behavior of Antlr.

Figure 6. Abstract control and object flow during parsing

shows the involved classes which allow for this lexer-parser communication.

Figure 7. Class Diagram Parser-Lexer Communication

5.3.4. Unicode

Unicode support in JavaScript includes the possibility to use unicode escape sequences in identifiers, string literals and regular expression literals. Another issue in this field is the specification of valid identifiers in JavaScript. They are described by means of unicode character classes. These have to be enumerated in the terminal rules in order to fully accept or reject valid or invalid JS identifiers.

For that purpose, a small code generator is used to define the terminal fragments for certain unicode categories. The UnicodeGrammarGenerator basically iterates all characters from Character.MIN_VALUE to Character.MAX_VALUE and adds them as alternatives to the respective terminal fragments, e.g. UNICODE_DIGIT_FRAGMENT.

The real terminal rules are defined as a composition of these generated fragments. Besides that, each character in an identifier, in a string literal or in a regular expression literal may be represented by its unicode escape value, e.g.  u0060. These escape sequences are handled and validated by the IValueConverter for the corresponding terminal rules.

The second piece of the puzzle are the unicode escaped sequences that may be used in keywords. This issue is covered by the UnicodeKeywordHelper which replaces the default terminal representation in the generated Antlr grammar by more elaborated alternatives. The keyword if is not only lexed as ’if’ but as seen in snippet Terminal if listing.

Terminal if
If :
( 'i' | '\\' 'u' '0   0   6   9' )
( 'f' | '\\' 'u' '0   0   6   6' );

5.3.5. Literals

Template literals are also to be handled specially, see TemplateLiteralDisambiguationInjector for details.

5.4. Modifiers

On the AST side, all modifiers are included in a single enumeration N4Modifier. In the types model however, the individual modifiers are mapped to two different enumerations of access modifiers (namely TypeAccessModifier and MemberAccessModifier) and a number of boolean properties (in case of non-access modifiers such as abstract or static). This mapping is done by the types builder, mostly by calling methods in class ModifierUtils.

The grammar allows the use of certain modifiers in many places that are actually invalid. Rules where a certain modifier may appear in the AST are implemented in method isValid(EClass,N4Modifier) in class ModifierUtils and checked via several validations in N4JSSyntaxValidator. Those validations also check for a particular order of modifiers that is not enforced by the grammar.

See API documentation of enumeration N4Modifier in file N4JS.xcore and the utility class ModifierUtils for more details.

5.5. Conflict Resolutions

5.5.1. Reserved Keywords vs. Identifier Names

Keywords and identifiers have to be distinguished by the lexer. Therefore, there is no means to decide upfront whether a certain keyword is actually used as a keyword or whether it is used as an identifier in a given context. This limitation is idiomatically overcome by a data type rule for valid identifiers. This data type rule enumerates all keywords which may be used as identifiers and the pure IDENTIFIER terminal rule as seen in Keywords as Identifier listing.

Keywords as Identifier
N4JSIdentifier: IDENTIFIER
| 'get'
| 'set'
...
;

5.5.2. Operators and Generics

The ambiguity between shift operators and nested generics arises also from the fact, that Antlr lexer upfront without any contextual information. When implemented naively, the grammar will be broken, since a token sequence a>>b can either be part of List<List<a>> b or it can be part of a binary operation int c = a >> b. Therefore the shift operator may not be defined with a single token but has to be composed from individual characters (see Shift Operator listing).

Shift Operator listing
ShiftOperator:
'>' '>' '>'?
| '<' '<'
;
This section may be outdated!

The CA parser also needs adjustments for supporting automatic semicolon insertion and regular expressions. Instead of modifying the CA parser generator similar to the normal parser, the former reuses parts of the latter as far as possible. That is, the token sequence that is produced during production parsing is used as is for the content assist parser. Semicolons have already been inserted where appropriate and regular expression are successfully distinguished from divide operators.

Since the n4js grammar uses syntactic predicates, the content assist parser is compiled with backtracking enabled. This is always the case for Xtext’s CA parsers that rely on backtracking or predicates (local backtracking) in the production parser. This approach is both good (CA works in general) and bad (unpredictable decisions in case of error at locations prior to the cursor). Since parsing with backtracking enabled makes for a fundamental difference in how the prediction and parsing works and how the parser decides which decision paths to take, the customization patterns from the production parser are not applied 1:1 to the CA parser, but adapted instead. The content assist parser doesn’t use a freshly lexed token stream with unicode support, ASI or regular expression literals, but instead uses a synthesized token sequence which is rebuilt from the existing node model.

The token stream that is consumed by the content assist parser is therefore not created by a lexer but by the org.eclipse.n4js.ui.contentassist.NodeModelTokenSource. It traverses the existing node model that is contained in the resource and was produced by the production parser. This approach has the significant advantage that any decision that was made by that parser is also immediately applicable to the content assist infrastructure. For that purpose, the leaf nodes of the node model are mapped to ANTLR token types. This is achieved by the org.eclipse.n4js.ui.contentassist.ContentAssistTokenTypeMapper which is capable to provide the untyped ANTLR token type (primitive int) for a given grammar element.

Special considerations have been made for the last token in the produced source. If it overlaps with an existing leaf node but does not fully cover it, the plain Antlr lexer is used to consume the prefix that is overlapping. Since the terminals will never overlap with each other the longest match always wins without backtracking in the lexer, it is save to assume that only one token is produced from the prefix. The very last token in the org.eclipse.n4js.ui.contentassist.NodeModelTokenSource is always the EOF token (org.antlr.runtime.Token.EOF_TOKEN).

Given that the token source is equal to the prefix in the production token source, some more thought has to be put into the synthesized end of file. The production parser used the complete file to decide where to automatically insert a semicolon and where not to. This would potentially change if there was another token next to the artificial EOF. Therefore, two cases have to considered. The first one describes CA request next to an automatically inserted semicolon and the second one describes CA requests at a position where a semicolon could have been inserted if the token to the right was another one. The org.eclipse.n4js.ui.contentassist.CustomN4JSParser reflects these cases. Heuristics are applied to the end of the token sequence to decide whether a second pass has to be performed to collect yet more following elements. Based on the concrete sequence, the last automatically inserted semicolon is removed from the sequence prior to the second pass or such is a token is explicitly synthesized and appended. Besides the second pass, another special treatment is made for postfix expressions. Those may not be interrupted by a hidden semicolon so those are filtered from the resulting follow set if appropriate.

The parser is used by the org.eclipse.n4js.ui.contentassist.ContentAssistContextFactory where all relevant entry points from the super class are specialized to pass the node model in the the parser facade (org.eclipse.n4js.ui.contentassist.CustomN4JSParser). In that sense, the ContentAssistContextFactory serves as a drop-in replacement binding the default ParserBasedContentAssistContextFactory.StatefulFactory.