| Metalua Parser |
| ============== |
| |
| `metalua-parser` is a subset of the Metalua compiler, which turns |
| valid Lua source files and strings into abstract syntax trees |
| (AST). This README includes a description of this AST format. People |
| interested by Lua code analysis and generation are encouraged to |
| produce and/or consume this format to represent ASTs. |
| |
| It has been designed for Lua 5.1. It hasn't been tested against |
| Lua 5.2, but should be easily ported. |
| |
| ## Usage |
| |
| Module `metalua.compiler` has a `new()` function, which returns a |
| compiler instance. This instance has a set of methods of the form |
| `:xxx_to_yyy(input)`, where `xxx` and `yyy` must be one of the |
| following: |
| |
| * `srcfile` the name of a Lua source file; |
| * `src` a string containing the Lua sources of a list of statements; |
| * `lexstream` a lexical tokens stream; |
| * `ast` an abstract syntax tree; |
| * `bytecode` a chunk of Lua bytecode that can be loaded in a Lua 5.1 |
| VM (not available if you only installed the parser); |
| * `function` an executable Lua function. |
| |
| Compiling into bytecode or executable functions requires the whole |
| Metalua compiler, not only the parser. The most frequently used |
| functions are `:src_to_ast(source_string)` and |
| `:srcfile_to_ast("path/to/source/file.lua")`. |
| |
| mlc = require 'metalua.compiler'.new() |
| ast = mlc :src_to_ast[[ return 123 ]] |
| |
| A compiler instance can be reused as much as you want; it's only |
| interesting to work with more than one compiler instance when you |
| start extending their grammars. |
| |
| ## Abstract Syntax Trees definition |
| |
| ### Notation |
| |
| Trees are written below with some Metalua syntax sugar, which |
| increases their readability. the backquote symbol introduces a `tag`, |
| i.e. a string stored in the `"tag"` field of a table: |
| |
| * `` `Foo{ 1, 2, 3 }`` is a shortcut for `{tag="Foo", 1, 2, 3}`; |
| * `` `Foo`` is a shortcut for `{tag="Foo"}`; |
| * `` `Foo 123`` is a shortcut for `` `Foo{ 123 }``, and therefore |
| `{tag="Foo", 123 }`; the expression after the tag must be a literal |
| number or string. |
| |
| When using a Metalua interpreter or compiler, the backtick syntax is |
| supported and can be used directly. Metalua's pretty-printing helpers |
| also try to use backtick syntax whenever applicable. |
| |
| ### Tree elements |
| |
| Tree elements are mainly categorized into statements `stat`, |
| expressions `expr` and lists of statements `block`. Auxiliary |
| definitions include function applications/method invocation `apply`, |
| are both valid statements and expressions, expressions admissible on |
| the left-hand-side of an assignment statement `lhs`. |
| |
| block: { stat* } |
| |
| stat: |
| `Do{ stat* } |
| | `Set{ {lhs+} {expr+} } -- lhs1, lhs2... = e1, e2... |
| | `While{ expr block } -- while e do b end |
| | `Repeat{ block expr } -- repeat b until e |
| | `If{ (expr block)+ block? } -- if e1 then b1 [elseif e2 then b2] ... [else bn] end |
| | `Fornum{ ident expr expr expr? block } -- for ident = e, e[, e] do b end |
| | `Forin{ {ident+} {expr+} block } -- for i1, i2... in e1, e2... do b end |
| | `Local{ {ident+} {expr+}? } -- local i1, i2... = e1, e2... |
| | `Localrec{ ident expr } -- only used for 'local function' |
| | `Goto{ <string> } -- goto str |
| | `Label{ <string> } -- ::str:: |
| | `Return{ <expr*> } -- return e1, e2... |
| | `Break -- break |
| | apply |
| |
| expr: |
| `Nil | `Dots | `True | `False |
| | `Number{ <number> } |
| | `String{ <string> } |
| | `Function{ { ident* `Dots? } block } |
| | `Table{ ( `Pair{ expr expr } | expr )* } |
| | `Op{ opid expr expr? } |
| | `Paren{ expr } -- significant to cut multiple values returns |
| | apply |
| | lhs |
| |
| apply: |
| `Call{ expr expr* } |
| | `Invoke{ expr `String{ <string> } expr* } |
| |
| ident: `Id{ <string> } |
| |
| lhs: ident | `Index{ expr expr } |
| |
| opid: 'add' | 'sub' | 'mul' | 'div' |
| | 'mod' | 'pow' | 'concat'| 'eq' |
| | 'lt' | 'le' | 'and' | 'or' |
| | 'not' | 'len' |
| |
| ### Meta-data (lineinfo) |
| |
| |
| ASTs also embed some metadata, allowing to map them to their source |
| representation. Those informations are stored in a `"lineinfo"` field |
| in each tree node, which points to the range of characters in the |
| source string which represents it, and to the content of any comment |
| that would appear immediately before or after that node. |
| |
| Lineinfo objects have two fields, `"first"` and `"last"`, describing |
| respectively the beginning and the end of the subtree in the |
| sources. For instance, the sub-node ``Number{123}` produced by parsing |
| `[[return 123]]` will have `lineinfo.first` describing offset 8, and |
| `lineinfo.last` describing offset 10: |
| |
| |
| > mlc = require 'metalua.compiler'.new() |
| > ast = mlc :src_to_ast "return 123 -- comment" |
| > print(ast[1][1].lineinfo) |
| <?|L1|C8-10|K8-10|C> |
| > |
| |
| A lineinfo keeps track of character offsets relative to the beginning |
| of the source string/file ("K8-10" above), line numbers (L1 above; a |
| lineinfo spanning on several lines would read something like "L1-10"), |
| columns i.e. offset within the line ("C8-10" above), and a filename if |
| available (the "?" mark above indicating that we have no file name, as |
| the AST comes from a string). The final "|C>" indicates that there's a |
| comment immediately after the node; an initial "<C|" would have meant |
| that there was a comment immediately before the node. |
| |
| Positions represent either the end of a token and the beginning of an |
| inter-token space (`"last"` fields) or the beginning of a token, and |
| the end of an inter-token space (`"first"` fields). Inter-token spaces |
| might be empty. They can also contain comments, which might be useful |
| to link with surrounding tokens and AST subtrees. |
| |
| Positions are chained with their "dual" one: a position at the |
| beginning of and inter-token space keeps a refernce to the position at |
| the end of that inter-token space in its `"facing"` field, and |
| conversly, end-of-inter-token positions keep track of the inter-token |
| space beginning, also in `"facing"`. An inter-token space can be |
| empty, e.g. in `"2+2"`, in which case `lineinfo==lineinfo.facing`. |
| |
| Comments are also kept in the `"comments"` field. If present, this |
| field contains a list of comments, with a `"lineinfo"` field |
| describing the span between the first and last comment. Each comment |
| is represented by a list of one string, with a `"lineinfo"` describing |
| the span of this comment only. Consecutive lines of `--` comments are |
| considered as one comment: `"-- foo\n-- bar\n"` parses as one comment |
| whose text is `"foo\nbar"`, whereas `"-- foo\n\n-- bar\n"` parses as |
| two comments `"foo"` and `"bar"`. |
| |
| So for instance, if `f` is the AST of a function and I want to |
| retrieve the comment before the function, I'd do: |
| |
| f_comment = f.lineinfo.first.comments[1][1] |
| |
| The informations in lineinfo positions, i.e. in each `"first"` and |
| `"last"` field, are held in the following fields: |
| |
| * `"source"` the filename (optional); |
| * `"offset"` the 1-based offset relative to the beginning of the string/file; |
| * `"line"` the 1-based line number; |
| * `"column"` the 1-based offset within the line; |
| * `"facing"` the position at the opposite end of the inter-token space. |
| * `"comments"` the comments in the associated inter-token space (optional). |
| * `"id"` an arbitrary number, which uniquely identifies an inter-token |
| space within a given tokens stream. |
| |