Metalua Parser

metalua-parser is a subset of the Metalua compiler, which turns valid Lua source files and strings into abstract syntax trees (AST). This README includes a description of this AST format. People interested by Lua code analysis and generation are encouraged to produce and/or consume this format to represent ASTs.

It has been designed for Lua 5.1. It hasn't been tested against Lua 5.2, but should be easily ported.

Usage

Module metalua.compiler has a new() function, which returns a compiler instance. This instance has a set of methods of the form :xxx_to_yyy(input), where xxx and yyy must be one of the following:

  • srcfile the name of a Lua source file;
  • src a string containing the Lua sources of a list of statements;
  • lexstream a lexical tokens stream;
  • ast an abstract syntax tree;
  • bytecode a chunk of Lua bytecode that can be loaded in a Lua 5.1 VM (not available if you only installed the parser);
  • function an executable Lua function.

Compiling into bytecode or executable functions requires the whole Metalua compiler, not only the parser. The most frequently used functions are :src_to_ast(source_string) and :srcfile_to_ast("path/to/source/file.lua").

mlc = require 'metalua.compiler'.new()
ast = mlc :src_to_ast[[ return 123 ]]

A compiler instance can be reused as much as you want; it's only interesting to work with more than one compiler instance when you start extending their grammars.

Abstract Syntax Trees definition

Notation

Trees are written below with some Metalua syntax sugar, which increases their readability. the backquote symbol introduces a tag, i.e. a string stored in the "tag" field of a table:

  • `Foo{ 1, 2, 3 } is a shortcut for {tag="Foo", 1, 2, 3};
  • `Foo is a shortcut for {tag="Foo"};
  • `Foo 123 is a shortcut for `Foo{ 123 }, and therefore {tag="Foo", 123 }; the expression after the tag must be a literal number or string.

When using a Metalua interpreter or compiler, the backtick syntax is supported and can be used directly. Metalua's pretty-printing helpers also try to use backtick syntax whenever applicable.

Tree elements

Tree elements are mainly categorized into statements stat, expressions expr and lists of statements block. Auxiliary definitions include function applications/method invocation apply, are both valid statements and expressions, expressions admissible on the left-hand-side of an assignment statement lhs.

block: { stat* }

stat:
  `Do{ stat* }
| `Set{ {lhs+} {expr+} }                    -- lhs1, lhs2... = e1, e2...
| `While{ expr block }                      -- while e do b end
| `Repeat{ block expr }                     -- repeat b until e
| `If{ (expr block)+ block? }               -- if e1 then b1 [elseif e2 then b2] ... [else bn] end
| `Fornum{ ident expr expr expr? block }    -- for ident = e, e[, e] do b end
| `Forin{ {ident+} {expr+} block }          -- for i1, i2... in e1, e2... do b end
| `Local{ {ident+} {expr+}? }               -- local i1, i2... = e1, e2...
| `Localrec{ ident expr }                   -- only used for 'local function'
| `Goto{ <string> }                         -- goto str
| `Label{ <string> }                        -- ::str::
| `Return{ <expr*> }                        -- return e1, e2...
| `Break                                    -- break
| apply

expr:
  `Nil  |  `Dots  |  `True  |  `False
| `Number{ <number> }
| `String{ <string> }
| `Function{ { ident* `Dots? } block }
| `Table{ ( `Pair{ expr expr } | expr )* }
| `Op{ opid expr expr? }
| `Paren{ expr }       -- significant to cut multiple values returns
| apply
| lhs

apply:
  `Call{ expr expr* }
| `Invoke{ expr `String{ <string> } expr* }

ident: `Id{ <string> }

lhs: ident | `Index{ expr expr }

opid: 'add'   | 'sub'   | 'mul'   | 'div'
    | 'mod'   | 'pow'   | 'concat'| 'eq'
    | 'lt'    | 'le'    | 'and'   | 'or'
    | 'not'   | 'len'

Meta-data (lineinfo)

ASTs also embed some metadata, allowing to map them to their source representation. Those informations are stored in a "lineinfo" field in each tree node, which points to the range of characters in the source string which represents it, and to the content of any comment that would appear immediately before or after that node.

Lineinfo objects have two fields, "first" and "last", describing respectively the beginning and the end of the subtree in the sources. For instance, the sub-node ``Number{123}produced by parsing[[return 123]]will havelineinfo.firstdescribing offset 8, andlineinfo.last` describing offset 10:

> mlc = require 'metalua.compiler'.new()
> ast = mlc :src_to_ast "return 123 -- comment"
> print(ast[1][1].lineinfo)
<?|L1|C8-10|K8-10|C>
>

A lineinfo keeps track of character offsets relative to the beginning of the source string/file (“K8-10” above), line numbers (L1 above; a lineinfo spanning on several lines would read something like “L1-10”), columns i.e. offset within the line (“C8-10” above), and a filename if available (the “?” mark above indicating that we have no file name, as the AST comes from a string). The final “|C>” indicates that there's a comment immediately after the node; an initial “<C|” would have meant that there was a comment immediately before the node.

Positions represent either the end of a token and the beginning of an inter-token space ("last" fields) or the beginning of a token, and the end of an inter-token space ("first" fields). Inter-token spaces might be empty. They can also contain comments, which might be useful to link with surrounding tokens and AST subtrees.

Positions are chained with their “dual” one: a position at the beginning of and inter-token space keeps a refernce to the position at the end of that inter-token space in its "facing" field, and conversly, end-of-inter-token positions keep track of the inter-token space beginning, also in "facing". An inter-token space can be empty, e.g. in "2+2", in which case lineinfo==lineinfo.facing.

Comments are also kept in the "comments" field. If present, this field contains a list of comments, with a "lineinfo" field describing the span between the first and last comment. Each comment is represented by a list of one string, with a "lineinfo" describing the span of this comment only. Consecutive lines of -- comments are considered as one comment: "-- foo\n-- bar\n" parses as one comment whose text is "foo\nbar", whereas "-- foo\n\n-- bar\n" parses as two comments "foo" and "bar".

So for instance, if f is the AST of a function and I want to retrieve the comment before the function, I'd do:

f_comment = f.lineinfo.first.comments[1][1]

The informations in lineinfo positions, i.e. in each "first" and "last" field, are held in the following fields:

  • "source" the filename (optional);
  • "offset" the 1-based offset relative to the beginning of the string/file;
  • "line" the 1-based line number;
  • "column" the 1-based offset within the line;
  • "facing" the position at the opposite end of the inter-token space.
  • "comments" the comments in the associated inter-token space (optional).
  • "id" an arbitrary number, which uniquely identifies an inter-token space within a given tokens stream.