blob: 10c255848b5f1ed93907b7d3af890cc915e4ec09 [file] [log] [blame]
% The Abstract Syntax Tree and Virtual Program Graph
\vspace{-0.5in}
{\scriptsize Revision: \footnotesize \$Id: cha-ast-vpg.ltx-inc,v 1.1 2010/05/21 20:12:20 joverbey Exp $ $ - based on 2008/08/08 nchen}
\section{How to Get Acquainted with the Program Representation}
\label{sec:how_to_get_acquainted_with_the_program_representation}
\emph{These features work only on files that are located inside a Fortran
project with analysis and refactoring enabled. See the Photran user
documentation \textit{(Photran Advanced Features Manual)} for instructions.}
\subsection{Visualizing ASTs}
When Photran parses a file, it produces an abstract syntax tree (AST). This
is the central data structure used when implementing refactorings, program
analyses, etc.
Photran can display its abstract syntax tree (AST) for a file in place of the
ordinary Outline view. This behavior can be enabled from the Fortran workspace
preferences:
\begin{itemize}
\item Click on Window $>$ Preferences in Windows/Linux, or Eclipse $>$ Preferences in Mac OS X.
\item Select ``Fortran'' on the left hand side of the preference dialog (do not expand it).
\item Select ``(Debugging) Show entire abstract syntax tree rather than Outline view''
\end{itemize}
Clicking on a \texttt{Token} in the Outline view will move the cursor
to that construct's position in the source file.
Figure~\ref{fig:simple_fortran_ast} is an example of this display.
(It is explained further in Section~\ref{sec:asts}.)
\subsection{Visually Resolving Bindings}
Each use of an identifier in a Fortran program corresponds to a declaration.
For example, the use of the variable \textit{count} in \texttt{print *, count}
might correspond to a declaration like \texttt{integer~::~count}. We say that
the use of the identifier is \textbf{bound} to its declaration, and determining
the declaration that corresponds to a particular use is called \textbf{resolving}
the name binding.
To visualize what declaration a particular variable use is bound to,
click on an identifier in a Fortran editor (position the cursor over
it), and press F3 (or click Navigate $>$ Open Declaration, or right-click and
choose Open Declaration.) The binding will be resolved and the declaration
highlighted. If there are multiple bindings, a pop-up window will open and one
can be selected. If the identifier is bound to a declaration in a module defined
in a different file, an editor will be opened on that file.
\subsection{Visualizing Enclosing Scopes}
Every declaration in a Fortran program exists in a particular \textit{lexical
scope.} For example, if a subroutine definition includes a variable declaration
like \texttt{integer~::~i}, then that variable is only visible within the subroutine,
and that subroutine is its \textit{enclosing scope.}
In the AST, the subroutine is represented by a \textit{ASTSubroutineSubprogramNode},
a class which inherits from \textit{ScopingNode}, and it would an the ancestor of the
\textit{ASTTypeDeclarationStmtNode} representing the variable declaration.
To view the enclosing scope for a particular token in the AST,
click on any token in the Fortran editor, and click Refactor $>$ (Debugging) $>$
Select Enclosing Scope. The entire range of source text corresponding to that token's
enclosing \texttt{ScopingNode} will be highlighted.
\subsection{Visualizing Definitions}
For every declaration in a Fortran program, Photran maintains a \textit{Definition}
object which summarizes the information available about that symbol. For example,
by invoking methods on the \textit{Definition} for a symbol, you could determine
that the symbol is a local variable named \textit{matrix} and that it is a
two-dimensional, allocatable array. Of course, you could figure this out by
traversing the AST, but that can be very tedious (and expensive).\footnote{Actually,
Photran \textit{does} have to traverse the AST to figure this out, but it only does
this once per file, and then it saves the \textit{Definition} to disk (in the
``VPG database''). When you ask Photran for a \textit{Definition}, it simply loads
the information from this database.}
To get a sense of what symbols have \textit{Definition} objects (these are things
like variable declarations, subprogram declarations, common block names, etc.),
open a file in the Fortran editor, and click Refactor $>$ (Debugging) $>$
Display Symbol Table for Current File. Indentation shows scope nesting, and each
line summarizes the information in a \texttt{Definition} object.
\section{Abstract Syntax Trees}
\label{sec:asts}
\subsection{Simple AST Example}
The Fortran grammar is very lengthy, containing
hundreds of rules. Even the simplest Fortran program has a fairly
non-trivial AST. For instance this simple Fortran program: \\
\begin{lstlisting}[language=Fortran, frame=lines]
program main
integer a
end
\end{lstlisting}
generates the AST shown in Figure~\ref{fig:simple_fortran_ast}.
%As an exercise,
%the reader is encouraged to derive the structure of the AST from the grammar
%specifications in the \emph{fortran2008.bnf} file beginning with \# R201,
%\verb!<ExecutableProgram>!.
\begin{image}
\centering
\includegraphics[height=4in]{images/simple_fortran_AST.png}
\caption{AST for simple Fortran program as viewed through the Outline View}
\label{fig:simple_fortran_ast}
\end{image}
Fortunately, it is not necessary to know every construct in the grammar. For
most refactoring and program analysis tasks, it is sufficient to rely on the
information that the VPG provides (e.g., \textit{Definition} objects) and
to construct a Visitor to visit only the nodes of interest and
``collect'' the information that is required.
To get a sense of what AST nodes are involved in the particular task you want to
accomplish, we recommend enabling AST visualization (described above), writing
some sample programs that exercise the relevant parts of the Fortran grammar, and
then observing the AST that is displayed.
\subsection{AST Structure for DO-Loops}
Due to a deficiency in the parser, DO-constructs are not recognized as a single
construct; DO and END DO statements are recognized as ordinary statements
alongside the statements comprising their body. There is a package in the
core.vpg plug-in called org.eclipse.photran.internal.core.analysis.loops which
provides machinery to fix this, giving DO-loops a ``proper'' AST structure.
If you call \texttt{LoopReplacer.replaceAllLoopsIn(ast)}, it will identify all
of the new-style DO-loops and change them to \texttt{ASTProperLoopConstructNode}s,
which a more natural structure, with a loop header, body, and END DO statement.
Once this is done, visitors must be implemented by subclassing one of the
Visitor classes \emph{in the org.eclipse.photran.internal.core.analysis.loops
package;} these have a callback method to handle
\texttt{ASTProperLoopConstructNode}s.
The AST visualization (described above) shows the AST \textit{after}
\texttt{LoopReplacer.replaceAllLoopsIn(ast)} has been invoked.
\section{Virtual Program Graph}
\label{sec:virtual_program_graph}
In Photran, it is \emph{almost} never necessary to call the lexer, parser, or
analysis components directly. Instead, Photran uses a
\textbf{virtual program graph}, or VPG, which provides the fa\c{c}ade of a
whole-program abstract syntax tree (AST) with embedded analysis information.
To the programmer building a refactoring, the VPG appears quite simple.
\begin{enumerate}
\item When the programmer requires an AST, the programmer asks the VPG to
provide it. He does not parse the file directly.
\item Methods on (certain) AST nodes provide name binding information, control flow
information, source/AST rewriting, etc. For example, the programmer can
ask an AST node for its control flow successors, or he can tell it to
remove itself from the AST, to reindent itself, etc.\footnote{Control flow
analysis is not (yet) implemented in Photran. But, in theory, this is
how it should work\dots}
\item The VPG maintains a database containing the analysis information (e.g.,
what names are references to what other names). Analyses are run when the
user saves a file, and the result is saved to the database. Then, when
requests are made for analysis information (e.g., asking for all of the
references to a particular name), it can simply be loaded from the database.
It is the programmer's responsibility to make sure that the database is
up-to-date before he attempts to access it, and that any task that
requires database information is scheduled so that it will not attempt to
read from the database while analysis information is being updated.
\end{enumerate}
Photran's VPG is implemented by the class \texttt{PhotranVPG}. This is a
\emph{singleton} object whose instance is available via
\texttt{PhotranVPG.getInstance()}. The remaining subsections describe how the
above tasks are implemented in Photran.
\subsection{Acquiring and Releasing ASTs}
ASTs are retrieved by invoking either of these
methods:
\begin{code}
\begin{lstlisting}[numbers=none]
public IFortranAST acquireTransientAST(IFile file)
public IFortranAST acquirePermanentAST(IFile file)
\end{lstlisting}
%\caption{Acquiring the Fortran AST}
\end{code}
\vspace{-0.15in}
The returned object is an \texttt{IFortranAST}, an object which has a method for
returning the root node of the AST as well as methods to quickly locate tokens
in the AST by offset or line information. A \emph{transient AST} can be garbage
collected as soon as references to any of its nodes disappear. A \emph{permanent
AST} will be explicitly kept in memory until a call is made to either of the
following methods:
\begin{code}
\begin{lstlisting}[numbers=none]
public void releaseAST(IFile file)
public void releaseAllASTs()
\end{lstlisting}
\caption{Releasing the Fortran AST}
\end{code}
Often, it is better to acquire a transient AST and rely on the garbage collector
to reclaim the memory once we are done using it. However, there are times when
acquiring a permanent AST would be more beneficial performance-wise. For
instance, if we will be using the same AST multiple times during a refactoring,
it would be better to just acquire a permanent AST. This prevents the garbage
collector from reclaiming the memory midway through the refactoring once all
references to the AST have been invalidated. While it is always possible to
reacquire the same AST, doing so can be an expensive operation since it requires
\emph{lexing}, \emph{parsing} \textbf{and} finally \emph{reconstructing} the AST
from scratch.
Only one AST for a particular file is in memory at any particular point in time,
so successive requests for the same \texttt{IFile} will return the same
(pointer-identical) AST until the AST is released (permanent) or garbage
collected (transient).
The \texttt{acquireTransientAST} and \texttt{acquirePermanentAST} methods return
an object implementing \texttt{IFortranAST}. This interface has several methods,
notably including the following:
\begin{itemize}
\item
The \texttt{getRoot} method returns the root of the AST, while the
\texttt{find...} methods provide efficient means to search for tokens based
on their lexical positioning in the source code.
\item
The \texttt{accept} method allows an external visitor to traverse the AST. This
method is usually used when it is necessary to ``collect'' information about
certain nodes.
\item
Because \texttt{IFortranAST} extends the \texttt{Iterable} interface, it is
possible to use the \emph{foreach} loop to conveniently iterate through all the
tokens in the AST e.g. \\ \lstinline!for (Token token : new IterableWrapper<Token>(ast))!
\end{itemize}
\subsection{Scope and Binding Analysis}
Currently, the only semantic analysis performed by Photran is binding analysis:
mapping \emph{identifiers} to their \emph{declarations}. Compilers usually do
this using symbol tables but Photran uses a more IDE/refactoring-based approach.
Certain nodes in a Fortran AST represent a lexical scope. All of these nodes are
declared as subclasses of \texttt{ScopingNode}:
\begin{itemize}
\item ASTBlockDataSubprogramNode
\item ASTDerivedTypeDefNode
\item ASTExecutableProgramNode
\item ASTFunctionSubprogramNode
\item ASTInterfaceBlockNode\footnote{An interface block defines a nested scope only if it is a named interface.Anonymous (unnamed) interfaces provide signatures for subprograms in their enclosing scope.}
\item ASTMainProgramNode
\item ASTModuleNode
\item ASTSubroutineSubprogramNode
\end{itemize}
Each of the subclasses of \texttt{ScopingNode} represents a scoping unit in
Fortran. The \texttt{ScopingNode} class has several public methods that provide
information about a scope. For example, one can retrieve a list of all of the
symbols declared in that scope; retrieve information about its
\texttt{IMPLICIT} specification; find its header statement (e.g. a
\texttt{FUNCTION} or \texttt{PROGRAM} statement); and so forth.
The enclosing scope of a \texttt{Token} can be retrieved by calling the
following method on the \texttt{Token} object:
\begin{lstlisting}[numbers=none]
public ScopingNode getEnclosingScope()
\end{lstlisting}
Identifier tokens (\texttt{Token}s for which \texttt{token.getTerminal() ==
Terminal.T\_IDENT}), which represent functions, variables, etc. in the Fortran
grammar, are \emph{bound} to a declaration\footnote{The introduction to VPGs
earlier in this chapter (URL above) provides an example visually.}. Although,
ideally, every identifier will be bound to exactly one declaration, this is not
always the case: the programmer may have written incorrect code, or Photran may
not have enough information to resolve the binding uniquely). So the
\texttt{resolveBinding} method returns a \emph{list} of \texttt{Definition}
objects:
\begin{lstlisting}[numbers=none]
public List<Definition> resolveBinding()
\end{lstlisting}
A \texttt{Definition} object contains many public methods which provide a wealth
of information. From a \texttt{Definition} object, it is possible to get a list
of all the references to a particular declaration (using
\texttt{findAllReferences}) and where that particular declaration is located in
the source code (using \texttt{getTokenRef}). Both of these methods return a
\texttt{PhotranTokenRef} object. See Section~\ref{sub:token_or_tokenref} for a
comparison between \texttt{Token} and \texttt{TokenRef}.
\subsubsection{Obtaining the \texttt{Definition} of a variable}
If you have a reference to the \texttt{Token} object of that variable (for
instance through iterating over all \texttt{Tokens} in the current Fortran AST)
then use:
\begin{lstlisting}[numbers=none, frame=lines]
// myToken is the reference to that variable
List<Definition> bindings = myToken.resolveBinding();
if(bindings.size() == 0)
throw new Exception(myToken.getText() + " is not declared");
else if (bindings.size() > 1)
throw new Exception(myToken.getText() + " is an ambiguous reference");
Definition definition = bindings.get(0);
\end{lstlisting}
If you do \textbf{not} have a reference to a \texttt{Token} but you know the
name of the identifier, you can first construct a \emph{hypothetical}
\texttt{Token} representing an identifier and search for that in a
\emph{particular} \texttt{ScopingNode} (possibly obtained by calling the static
method \texttt{ScopingNode.getEnclosingScope(IASTNode node)}).
\\
\begin{lstlisting}[numbers=none, frame=lines]
Token myToken = new Token(Terminal.T_IDENT, "myNameOfIdentifier");
List<PhotranTokenRef> definitions = myScopingNode.manuallyResolve(myToken);
\end{lstlisting}
If you want to search for the identifier in \textbf{all} \texttt{ScopingNodes}
for the current source file, then retrieve all the \texttt{ScopingNodes} and
manually iterate through each one. Remember that the root of the AST is a
\texttt{ScopingNode} and you may obtain the root of the AST through the
\texttt{getRoot} method declared in \texttt{IFortranAST}.
\\
\begin{lstlisting}[numbers=none, frame=lines]
List<ScopingNode> scopes = myRoot.getAllContainedScopes();
for (ScopingNode scopingNode : scopes)
{
// search through each ScopingNode
}
\end{lstlisting}
\subsubsection{Examples in \texttt{FortranEditorASTActionDelegate} subclasses}
The following subclasses of \texttt{FortranEditorASTActionDelegate} all contain short working examples of how to use the binding analysis API in Photran:
\begin{itemize}
\item DisplaySymbolTable
\item FindAllDeclarationsInScope
\item OpenDeclaration
\item SelectEnclosingScope
\end{itemize}
\subsection{Scheduling and (Avoiding) Concurrent Access}
It is important to note that, because \texttt{PhotranVPG} is a singleton object,
it may not be accessed concurrently by multiple threads.
Most actions that require an AST will be subclasses of
\texttt{FortranEditorActionDelegate}. (All refactoring actions in Photran
are descendants of \texttt{FortranEditorActionDelegate}, for example.)
These are always scheduled in a way that avoids this problem.
Otherwise, the thread must either be scheduled using a
\texttt{VPGSchedulingRule} or it must lock the entire workspace. See
\texttt{EclipseVPG\#queueJobToEnsureVPGIsUpToDate} as an example on how to use
the \texttt{VPGSchedulingRule} and \texttt{FortranEditorActionDelegate\#run} as
an example of how to lock the entire workspace.
As a guideline, contributors who are interested in accessing the VPG should
consider structuring their contributions as descendants of
\texttt{FortranEditorActionDelegate}. However, if that
approach is not feasible, then they should consider using
\texttt{VPGSchedulingRule} before resorting to locking the entire workspace.
\section{Common Tasks}
In the org.eclipse.photran.internal.core.refactoring package, there is a class
called \textit{\_AST\_VPG\_HOWTO} which contains several small ``snippets''
illustrating how to perform common tasks with the AST/VPG. This includes how
to traverse the AST using a visitor, how to find nodes of a particular type,
how to find all of the references to a particular name, etc.
\newif\ifxxxxxx
\ifxxxxxx
\section{Background: Program Graphs and the VPG}
\textit{Portions of this section are excerpted from
Jeff's dissertation proposal (2009).}
\subsection{Theory: Program Graphs}
According to Mens et al.,~
a \textit{program graph} ``may be viewed, in broad lines, as an abstract syntax
tree augmented by extra edges.''\footnote{T.~Mens, N.~Van~Eetvelde, S.~Demeyer,
and D.~Janssens. ``Formalizing refactorings with graph transformations.''
\textit{J.~Softw. Maint. Evol.}, 17(4): 247--276, 2005.} These
``extra edges'' represent semantic information, such as variable scopes and
bindings, control flow, inheritance relationships, and so forth.
Unlike a compiler, where the AST, symbol table, and control flow graph are
usually separate data structures, a program graph combines all such information
into a single data structure. In a program graph, there are no symbol tables;
rather, some nodes in the AST correspond to declarations, and references
contain an edge pointing to their declaration. Scopes can similarly be
represented by AST nodes; symbols can point to the scope in which they are
defined. The control flow successors of a node are other AST nodes. A
variable's du-chains would likely consist of AST nodes representing assignment
statements and variable-use expressions. And so forth. In other words, in a
program graph, all program analyses are stated in terms of AST nodes.
Alternatively, one might think of a program graph as an AST with the graph
structures of a control flow graph, program dependence graph, du-chains,
etc.~``superimposed.''\footnote{M.~Verbaere, A.~Payement, O.~de~Moor.
``Scripting refactorings with JunGL.'' In \textit{Proc.~OOPSLA~'06},
pp.~651--652.} The nodes of the AST also
serve as nodes of the various graph structures; the edges connecting them are
different.
\begin{image}
\centering
\includegraphics[height=6in]{images/sample_vpg.png}
\caption{Sample program graph (AST with semantic edges)
for a Java program}
\label{test2before}
\end{image}
An example of a Java program and a plausible program graph representation are
shown in Figure~\ref{test2before}. The underlying abstract syntax tree is
shown in outline form; the dotted lines are the extra edges that make the AST a
program graph. We have shown four types of edges. \textit{Scope} edges link a
declaration to the class or method in which it is defined. \textit{Binding}
edges link the use of an identifier to its corresponding declaration. Within
the method body, \textit{control flow} edges form the (intraprocedural) control
flow graph; the method declaration node is used as the entry block and null as
the exit block. Similarly, there are two du-chains, given by \textit{def-use}
edges.
The advantages of the program graph representation are twofold. First, it is
generic enough that it a program graph can be built to represent virtually any
conventional programming language. Every program has a syntax which can be
represented as a syntax tree, and we hypothesize that the semantics needed to
perform common refactorings can be represented as extra edges (and node
annotations) on this tree. The second advantage of the program graph is that
it summarizes the ``interesting'' aspects of both the syntax and semantics of a
program in a single representation, obviating the need to maintain a mapping
between several distinct representations.
\subsection{Implementation: The Virtual Program Graph}
Often, program graphs only conceptual: "Real" programs can be extremely large,
spanning hundreds of files, and edges may span across these files. For example,
a function defined in one file may be used in another, so a binding edge would
span across files. Of course, it is inefficient--and sometimes impossible--to
store ASTs for several hundred files in memory.
The Rephraser Engine provides a base implementation for a \textit{virtual program
graph,} which takes a different approach. For the most part, the VPG allows the
programmer to pretend that an AST for the entire (several-hundred-file) program
is available in memory, with edges spanning between files when necessary.
TODO
Therefore, most systems cache information about externally-visible declarations
in a program database. Often they expose this to the programmer, so
it becomes the programmer's responsibility to parse individual files
and to distinguish what is available the program database from what is available
in the AST.
The \textit{virtual program graph}, or
VPG, provides a cross-reference database and indexing infrastructure TODO
To the programmer building a refactoring, the VPG appears quite simple.
\begin{itemize}
\item When the programmer requires an AST, the programmer asks the VPG to
provide it. He does not parse the file directly.
\item Methods on the AST nodes provide name binding information, control flow
information, source/AST rewriting, etc. For example, the programmer can
ask an AST node for its control flow successors, or he can tell it to
remove itself from the AST, to reindent itself, etc.
\end{itemize}
One advantage of the VPG architecture is that programmers developing
refactorings are insulated from much of the underlying infrastructure. The VPG
acts as an AST repository, insulating the programmer from the lexer/parser and
preprocessor, while the AST nodes insulate the programmer from the semantic
analyses, the source rewriter, and the program database.
One particularly interesting example is when a subprogram node is asked for all
of its call sites. Note that these are not necessarily nodes in the same AST.
Generally, this information would be stored in the cross-reference database,
but the programmer need not be aware of this; the VPG can simply provide an
iterator which provides pointers to AST nodes. As the iterator progresses,
ASTs for the various files can be transparently loaded and unloaded from
memory. (Recall that the programmer always asks the VPG to provide ASTS; thus,
careful implementation can guarantee that there are never multiple ASTs for the
same file in memory.) In essence, the VPG is intended to provide the illusion
that there are always ASTs available for the whole program.\footnote{The
``virtual'' in ``virtual program graph'' is analogous to ``virtual'' in
``virtual memory:'' The system shields the programmer from knowing whether the
requested data is already in memory or if it must first be loaded from disk.}
\begin{image}
\centering
\includegraphics[height=2.5in]{images/vpg_db_er.png}
\caption{VPG database schema}
\label{fig:vpg_db_er}
\end{image}
\fi