blob: 8a0cf42b911c6ae64244de694f6ab2b01f8dffa7 [file] [log] [blame]
% The Photran Lexers and Parser
The Fortran 95 lexers and parser are stored in the org.eclipse.photran.parser
project. Because this project takes several minutes and a lot of memory
(600-800 MB) to compile, it is exported as a (non-sealed) JAR archive,
\texttt{f95parser.jar}, which is stored in the Core plug-in. The Core,
Managed Builder, and UI use classes from this JAR file, not from the
org.eclipse.photran.parser project. So you do \textit{not} need the
org.eclipse.photran.parser project in your workspace unless you will be
regenerating this JAR file.
\section{Accessing the Lexers and Parser from Photran: The
\texttt{FortranProcessor}}
(The following does \textit{not} apply to JUnit tests.)
To scan or parse a file, or to build a symbol table from a parse tree,
create a \texttt{FortranProcessor} and use it. It knows how to distinguish
between fixed and free form file extensions based on the user's workspace
preferences and can determine whether or not parser debugging has been
enabled (also in the workspace preferences).
Examples:
\begin{itemize}
\item Parsing a file: See \texttt{FortranModelBuilder\#parse}
\item Lexing a file without parsing:
See \texttt{FortranPartitionScanner\#documentChanged}
\item Creating a symbol table after a file has been parsed:
(TODO-Jeff: No good examples yet)
\end{itemize}
\section{Structure of the Lexers and Parser}
Unfortunately, Fortran is not an easy language to process. It has two
\textit{source forms,} and keywords are not reserved (so it's fine to call a
variable ``if''). This greatly complicates lexical analysis. To make
things as simple as possible, we have separated the lexical analysis into
several phases, depending on whether the input is in free form or fixed
form.
When a free form source file is being parsed...
\begin{itemize}
\item \texttt{FreeFormLexerPhase1}, which is generated from
\texttt{FreeFormLexerPhase1.flex} by JFlex, splits the input into tokens.
Tokens such as ``if,'' ``do,'' ``len=,'' etc. are labeled as keywords,
even if they are being used as identifiers (it doesn't know any better).
\item \texttt{FreeFormLexerPhase2} reads the token stream from
\texttt{FreeFormLexerPhase1} and buffers an entire statement worth of
tokens. It then uses Sale's algorithm and an elaborate set of rules
(specified in the method \texttt{buildAdditionalRules}) to determine which
keyword tokens should be kept as keywords and which ones should actually
be identifiers. In the case of ``keywords'' with an equal sign on the end
(such as ``len=''), if that token should really be an identifier, it is
split into two tokens: the identifier ``len'' and the = token T\_EQUALS.
\item The \texttt{Parser}, which is generated from \texttt{Fortran95.bnf}
by my parser generator, reads the tokens from \texttt{FreeFormLexerPhase2}
and parses the token stream, returning a parse tree. The parse tree is
described in more detail later.
\end{itemize}
When a fixed form source file is being parsed...
\begin{itemize}
\item \texttt{FixedFormLexerPrepass} discards whitespace and comments
and concatenates continuation lines.
\item \texttt{FixedFormLexerPhase1}, which is generated from
\texttt{FreeFormLexerPhase1.flex} by JFlex, splits the input into tokens.
Essentially, it is identical to \texttt{FreeFormLexerPhase1}, except that
there are no rules for whitespace, comments, or line continuations.
\item \texttt{FreeFormLexerPhase2} reads the token stream from
\texttt{FixedFormLexerPhase1}, resolving identifiers as it does for free form
input.
\item \texttt{FixedFormLexerPhase2} reads the token stream from
\texttt{FreeFormLexerPhase2} and concatenates consecutive identifiers.
\item The \texttt{Parser} reads the tokens from \texttt{FixedFormLexerPhase2}.
\end{itemize}
\section{Parse Trees}
\subsection{Overview}
\textit{
\textbf{NOTE.}
We expect that you already know what a parse tree is and what the difference
is between abstract and concrete syntax. Terms like ``terminal,''
``nonterminal,'' and ``token'' should be familiar. If they are not,
you will need to spend some time with the Dragon book\footnote{Aho, Sethi,
and Ullman, \textit{Compilers: Principles, Techniques, and Tools}} before
tackling this section.}
When you call one of the \texttt{parse} methods on a \texttt{FortranProcessor},
either an exception will be thrown (if the lexer or parser encounters an
error), or the method will return a (non-null) parse tree.
A parse tree is just a tree of \texttt{ParseTreeNode}s with \texttt{Token}s
as leaves. The \texttt{ParseTreeNode} returned by the \texttt{parse} method
is the root of the parse tree.
The parse tree for a program \textit{exactly} follows the derivation of the
program from the grammar in \texttt{Fortran95.bnf}. It is literally a
parse tree, also known as a concrete syntax tree; it is \textit{not} an
abstract syntax tree (AST).\footnote{If you want to know why we didn't use an
abstract syntax tree, see the appendix on Miscellaneous Parser and Language
Processing Notes.}
It is important to remember that the grammar in
\texttt{Fortran95.bnf} is different than the one in the Fortran standard.
The grammar in the Fortran standard is \textit{not} LALR(1)---not even
close---and so it had to be modified (quite heavily) to successfully
generate a parser from it. So the parse trees you get from our parser are
\textit{not} always going to match what the Fortran standard grammar would
lead you to expect. For example, there is no SFExpr in the Fortran standard
grammar, but you will find them popping up in your parse trees.
\textbf{TIP.}
When running Photran, go into your workspace preferences; in the Fortran Parser
section, there is an (very useful) option to ``display parse tree rather
than normal Outline view.''
\subsection{An Example}
\begin{figure}[h]
\centering
\includegraphics[width=5in]{images/parsetree1.png}
\caption{Sample program and its parse tree}
\label{parsetree}
\end{figure}
Figure \ref{parsetree} shows a trivial program and its parse tree.
\begin{itemize}
\item The node labels in angle brackets, such as $<$xExecutableProgram$>$ and
$<$xProgramStmt$>$, are nonterminals in the grammar and correspond to
\texttt{ParseTreeNodes} in the parse tree.\footnote{Note that $<$Free Form
Source$>$ (at the top of the Outline view) is just a label and not part of
the parse tree.}
\item Terminals in the grammar have names starting with T\_: for example,
T\_PROGRAM is the ``program'' keyword and T\_IDENT is an identifier.
Notice that the carriage return is also a token: end-of-statement (T\_EOS).
In the figure, these identify \texttt{Token}s in the parse
tree.\footnote{The distinction between terminals and tokens is somewhat
subtle. Terminals are things like ``identifier,'' ``plus sign,'' and
``float'' that appear in a grammar. Tokens are the actual chunks of text
from your input file, such as ``hello,'' ``+,'' and ``3.14.'' Every token
\textit{corresponds} to a terminal, e.g., ``hello'' is an identifier,
``+'' is (the only representation of) a plus sign, and ``3.14'' is a
float.}
\end{itemize}
Recall that each parse tree node corresponds to a nonterminal (on the
left-hand side of a production in the grammar) and each token corresponds
to a terminal.
You can determine the nonterminal corresponding to a \texttt{ParseTreeNode}
by invoking \texttt{ParseTreeNode\#getRootNonterminal}. (This method name
will soon be changed to \texttt{getCorrespondingNonterminal} or something
more intuitive like that.) This will return a \texttt{Nonterminal}. The
only valid \texttt{Nonterminal}s are all constants in the \texttt{Nonterminal}
class, so you can do identity comparisons like this:
\begin{verbatim}
if (parseTreeNode.getRootNonterminal() == Nonterminal.XEXECUTABLEPROGRAM) ...
\end{verbatim}
Similarly, you can determine the terminal corresponding to a token by
calling \texttt{Token\#getTerminal} and doing an identity comparison, as in
\begin{verbatim}
if (token.getTerminal() == Terminal.T_IDENT) ...
\end{verbatim}
You can get the text of the token (``Hello'' for the only T\_IDENT in
Figure \ref{parsetree}) by calling \texttt{Token\#getText}.
\texttt{Terminal}s and \texttt{Nonterminal}s both have a
\texttt{getDescription} method which returns a (String) description of that
terminal or nonterminal, e.g., ``T\_IDENT'' or ``$<$xExecutableProgram$>$.''