org.eclipse.photran-dev-docs/dev-guide/cha-ast-vpg.ltx-inc - ptp/org.eclipse.photran - Git at Google

 % The Abstract Syntax Tree and Virtual Program Graph
 \vspace{-0.5in}

 {\scriptsize Revision: \footnotesize \$Id: cha-ast-vpg.ltx-inc,v 1.1 2010/05/21 20:12:20 joverbey Exp $ $ - based on 2008/08/08 nchen}

 \section{How to Get Acquainted with the Program Representation}
 \label{sec:how_to_get_acquainted_with_the_program_representation}

 \emph{These features work only on files that are located inside a Fortran
 project with analysis and refactoring enabled.  See the Photran user
 documentation \textit{(Photran Advanced Features Manual)} for instructions.}

 \subsection{Visualizing ASTs}

 When Photran parses a file, it produces an abstract syntax tree (AST).  This
 is the central data structure used when implementing refactorings, program
 analyses, etc.

 Photran can display its abstract syntax tree (AST) for a file in place of the
 ordinary Outline view. This behavior can be enabled from the Fortran workspace
 preferences:

 \begin{itemize}
 	\item Click on Window $>$ Preferences in Windows/Linux, or Eclipse $>$ Preferences in Mac OS X.
 	\item Select ``Fortran'' on the left hand side of the preference dialog (do not expand it).
 	\item Select ``(Debugging) Show entire abstract syntax tree rather than Outline view''
 \end{itemize}

 Clicking on a \texttt{Token} in the Outline view will move the cursor
 to that construct's position in the source file.

 Figure~\ref{fig:simple_fortran_ast} is an example of this display.
 (It is explained further in Section~\ref{sec:asts}.)

 \subsection{Visually Resolving Bindings}

 Each use of an identifier in a Fortran program corresponds to a declaration.
 For example, the use of the variable \textit{count} in \texttt{print *, count}
 might correspond to a declaration like \texttt{integer~::~count}.  We say that
 the use of the identifier is \textbf{bound} to its declaration, and determining
 the declaration that corresponds to a particular use is called \textbf{resolving}
 the name binding.

 To visualize what declaration a particular variable use is bound to,
 click on an identifier in a Fortran editor (position the cursor over
 it), and press F3 (or click Navigate $>$ Open Declaration, or right-click and
 choose Open Declaration.) The binding will be resolved and the declaration
 highlighted. If there are multiple bindings, a pop-up window will open and one
 can be selected. If the identifier is bound to a declaration in a module defined
 in a different file, an editor will be opened on that file.

 \subsection{Visualizing Enclosing Scopes}

 Every declaration in a Fortran program exists in a particular \textit{lexical
 scope.}  For example, if a subroutine definition includes a variable declaration
 like \texttt{integer~::~i}, then that variable is only visible within the subroutine,
 and that subroutine is its \textit{enclosing scope.}
 In the AST, the subroutine is represented by a \textit{ASTSubroutineSubprogramNode},
 a class which inherits from \textit{ScopingNode}, and it would an the ancestor of the
 \textit{ASTTypeDeclarationStmtNode} representing the variable declaration.

 To view the enclosing scope for a particular token in the AST,
 click on any token in the Fortran editor, and click Refactor $>$ (Debugging) $>$
 Select Enclosing Scope. The entire range of source text corresponding to that token's
 enclosing \texttt{ScopingNode} will be highlighted.

 \subsection{Visualizing Definitions}

 For every declaration in a Fortran program, Photran maintains a \textit{Definition}
 object which summarizes the information available about that symbol.  For example,
 by invoking methods on the \textit{Definition} for a symbol, you could determine
 that the symbol is a local variable named \textit{matrix} and that it is a
 two-dimensional, allocatable array.  Of course, you could figure this out by
 traversing the AST, but that can be very tedious (and expensive).\footnote{Actually,
 Photran \textit{does} have to traverse the AST to figure this out, but it only does
 this once per file, and then it saves the \textit{Definition} to disk (in the
 ``VPG database'').  When you ask Photran for a \textit{Definition}, it simply loads
 the information from this database.}

 To get a sense of what symbols have \textit{Definition} objects (these are things
 like variable declarations, subprogram declarations, common block names, etc.),
 open a file in the Fortran editor, and click Refactor $>$ (Debugging) $>$
 Display Symbol Table for Current File. Indentation shows scope nesting, and each
 line summarizes the information in a \texttt{Definition} object.

 \section{Abstract Syntax Trees}
 \label{sec:asts}

 \subsection{Simple AST Example}

 The Fortran grammar is very lengthy, containing
 hundreds of rules. Even the simplest Fortran program has a fairly
 non-trivial AST. For instance this simple Fortran program: \\

 \begin{lstlisting}[language=Fortran, frame=lines]
 program main
     integer a
 end
 \end{lstlisting}

 generates the AST shown in Figure~\ref{fig:simple_fortran_ast}.
 %As an exercise,
 %the reader is encouraged to derive the structure of the AST from the grammar
 %specifications in the \emph{fortran2008.bnf} file beginning with \# R201,
 %\verb!<ExecutableProgram>!.

 \begin{image}
 	\centering
 			\includegraphics[height=4in]{images/simple_fortran_AST.png}
 			\caption{AST for simple Fortran program as viewed through the Outline View}
 		\label{fig:simple_fortran_ast}
 \end{image}

 Fortunately, it is not necessary to know every construct in the grammar. For
 most refactoring and program analysis tasks, it is sufficient to rely on the
 information that the VPG provides (e.g., \textit{Definition} objects) and
 to construct a Visitor to visit only the nodes of interest and
 ``collect'' the information that is required.
 To get a sense of what AST nodes are involved in the particular task you want to
 accomplish, we recommend enabling AST visualization (described above), writing
 some sample programs that exercise the relevant parts of the Fortran grammar, and
 then observing the AST that is displayed.

 \subsection{AST Structure for DO-Loops}

 Due to a deficiency in the parser, DO-constructs are not recognized as a single
 construct; DO and END DO statements are recognized as ordinary statements
 alongside the statements comprising their body.  There is a package in the
 core.vpg plug-in called org.eclipse.photran.internal.core.analysis.loops which
 provides machinery to fix this, giving DO-loops a ``proper'' AST structure.

 If you call \texttt{LoopReplacer.replaceAllLoopsIn(ast)}, it will identify all
 of the new-style DO-loops and change them to \texttt{ASTProperLoopConstructNode}s,
 which a more natural structure, with a loop header, body, and END DO statement.
 Once this is done, visitors must be implemented by subclassing one of the
 Visitor classes \emph{in the org.eclipse.photran.internal.core.analysis.loops
 package;} these have a callback method to handle
 \texttt{ASTProperLoopConstructNode}s.

 The AST visualization (described above) shows the AST \textit{after}
 \texttt{LoopReplacer.replaceAllLoopsIn(ast)} has been invoked.

 \section{Virtual Program Graph}
 \label{sec:virtual_program_graph}

 In Photran, it is \emph{almost} never necessary to call the lexer, parser, or
 analysis components directly. Instead, Photran uses a
 \textbf{virtual program graph}, or VPG, which provides the fa\c{c}ade of a
 whole-program abstract syntax tree (AST) with embedded analysis information.

 To the programmer building a refactoring, the VPG appears quite simple.
 \begin{enumerate}
 \item When the programmer requires an AST, the programmer asks the VPG to
       provide it.  He does not parse the file directly.
 \item Methods on (certain) AST nodes provide name binding information, control flow
       information, source/AST rewriting, etc.  For example, the programmer can
       ask an AST node for its control flow successors, or he can tell it to
       remove itself from the AST, to reindent itself, etc.\footnote{Control flow
       analysis is not (yet) implemented in Photran.  But, in theory, this is
       how it should work\dots}
 \item The VPG maintains a database containing the analysis information (e.g.,
       what names are references to what other names).  Analyses are run when the
       user saves a file, and the result is saved to the database.  Then, when
       requests are made for analysis information (e.g., asking for all of the
       references to a particular name), it can simply be loaded from the database.
       It is the programmer's responsibility to make sure that the database is
       up-to-date before he attempts to access it, and that any task that
       requires database information is scheduled so that it will not attempt to
       read from the database while analysis information is being updated.
 \end{enumerate}

 Photran's VPG is implemented by the class \texttt{PhotranVPG}.  This is a
 \emph{singleton} object whose instance is available via
 \texttt{PhotranVPG.getInstance()}.  The remaining subsections describe how the
 above tasks are implemented in Photran.

 \subsection{Acquiring and Releasing ASTs}

 ASTs are retrieved by invoking either of these
 methods:
 \begin{code}
 \begin{lstlisting}[numbers=none]
 public IFortranAST acquireTransientAST(IFile file)
 public IFortranAST acquirePermanentAST(IFile file)
 \end{lstlisting}
 %\caption{Acquiring the Fortran AST}
 \end{code}

 \vspace{-0.15in}

 The returned object is an \texttt{IFortranAST}, an object which has a method for
 returning the root node of the AST as well as methods to quickly locate tokens
 in the AST by offset or line information. A \emph{transient AST} can be garbage
 collected as soon as references to any of its nodes disappear. A \emph{permanent
 AST} will be explicitly kept in memory until a call is made to either of the
 following methods:
 \begin{code}
 \begin{lstlisting}[numbers=none]
 public void releaseAST(IFile file)
 public void releaseAllASTs()
 \end{lstlisting}
 \caption{Releasing the Fortran AST}
 \end{code}

 Often, it is better to acquire a transient AST and rely on the garbage collector
 to reclaim the memory once we are done using it. However, there are times when
 acquiring a permanent AST would be more beneficial performance-wise. For
 instance, if we will be using the same AST multiple times during a refactoring,
 it would be better to just acquire a permanent AST. This prevents the garbage
 collector from reclaiming the memory midway through the refactoring once all
 references to the AST have been invalidated. While it is always possible to
 reacquire the same AST, doing so can be an expensive operation since it requires
 \emph{lexing}, \emph{parsing} \textbf{and} finally \emph{reconstructing} the AST
 from scratch.

 Only one AST for a particular file is in memory at any particular point in time,
 so successive requests for the same \texttt{IFile} will return the same
 (pointer-identical) AST until the AST is released (permanent) or garbage
 collected (transient).

 The \texttt{acquireTransientAST} and \texttt{acquirePermanentAST} methods return
 an object implementing \texttt{IFortranAST}.  This interface has several methods,
 notably including the following:

 \begin{itemize}

 \item
 The \texttt{getRoot} method returns the root of the AST, while the
 \texttt{find...} methods provide efficient means to search for tokens based
 on their lexical positioning in the source code.

 \item
 The \texttt{accept} method allows an external visitor to traverse the AST. This
 method is usually used when it is necessary to ``collect'' information about
 certain nodes.

 \item
 Because \texttt{IFortranAST} extends the \texttt{Iterable} interface, it is
 possible to use the \emph{foreach} loop to conveniently iterate through all the
 tokens in the AST e.g. \\ \lstinline!for (Token token : new IterableWrapper<Token>(ast))!

 \end{itemize}

 \subsection{Scope and Binding Analysis}

 Currently, the only semantic analysis performed by Photran is binding analysis:
 mapping \emph{identifiers} to their \emph{declarations}. Compilers usually do
 this using symbol tables but Photran uses a more IDE/refactoring-based approach.

 Certain nodes in a Fortran AST represent a lexical scope. All of these nodes are
 declared as subclasses of \texttt{ScopingNode}:

 \begin{itemize}
 \item ASTBlockDataSubprogramNode
 \item ASTDerivedTypeDefNode
 \item ASTExecutableProgramNode
 \item ASTFunctionSubprogramNode
 \item ASTInterfaceBlockNode\footnote{An interface block defines a nested scope only if it is a named interface.Anonymous (unnamed) interfaces provide signatures for subprograms in their enclosing scope.}
 \item ASTMainProgramNode
 \item ASTModuleNode
 \item ASTSubroutineSubprogramNode
 \end{itemize}

 Each of the subclasses of \texttt{ScopingNode} represents a scoping unit in
 Fortran. The \texttt{ScopingNode} class has several public methods that provide
 information about a scope. For example, one can retrieve a list of all of the
 symbols declared in that scope; retrieve information about its
 \texttt{IMPLICIT} specification; find its header statement (e.g. a
 \texttt{FUNCTION} or \texttt{PROGRAM} statement); and so forth.

 The enclosing scope of a \texttt{Token} can be retrieved by calling the
 following method on the \texttt{Token} object:

 \begin{lstlisting}[numbers=none]
 public ScopingNode getEnclosingScope()
 \end{lstlisting}

 Identifier tokens (\texttt{Token}s for which \texttt{token.getTerminal() ==
 Terminal.T\_IDENT}), which represent functions, variables, etc. in the Fortran
 grammar, are \emph{bound} to a declaration\footnote{The introduction to VPGs
 earlier in this chapter (URL above) provides an example visually.}. Although,
 ideally, every identifier will be bound to exactly one declaration, this is not
 always the case: the programmer may have written incorrect code, or Photran may
 not have enough information to resolve the binding uniquely). So the
 \texttt{resolveBinding} method returns a \emph{list} of \texttt{Definition}
 objects:

 \begin{lstlisting}[numbers=none]
 public List<Definition> resolveBinding()
 \end{lstlisting}

 A \texttt{Definition} object contains many public methods which provide a wealth
 of information. From a \texttt{Definition} object, it is possible to get a list
 of all the references to a particular declaration (using
 \texttt{findAllReferences}) and where that particular declaration is located in
 the source code (using \texttt{getTokenRef}). Both of these methods return a
 \texttt{PhotranTokenRef} object. See Section~\ref{sub:token_or_tokenref} for a
 comparison between \texttt{Token} and \texttt{TokenRef}.

 \subsubsection{Obtaining the \texttt{Definition} of a variable}

 If you have a reference to the \texttt{Token} object of that variable (for
 instance through iterating over all \texttt{Tokens} in the current Fortran AST)
 then use:

 \begin{lstlisting}[numbers=none, frame=lines]
 // myToken is the reference to that variable
 List<Definition> bindings = myToken.resolveBinding();

 if(bindings.size() == 0)
 	throw new Exception(myToken.getText() + " is not declared");
 else if (bindings.size() > 1)
 	throw new Exception(myToken.getText() + " is an ambiguous reference");

 Definition definition = bindings.get(0);
 \end{lstlisting}

 If you do \textbf{not} have a reference to a \texttt{Token} but you know the
 name of the identifier, you can first construct a \emph{hypothetical}
 \texttt{Token} representing an identifier and search for that in a
 \emph{particular} \texttt{ScopingNode} (possibly obtained by calling the static
 method \texttt{ScopingNode.getEnclosingScope(IASTNode node)}).
 \\

 \begin{lstlisting}[numbers=none, frame=lines]
 Token myToken = new Token(Terminal.T_IDENT, "myNameOfIdentifier");
 List<PhotranTokenRef> definitions = myScopingNode.manuallyResolve(myToken);
 \end{lstlisting}

 If you want to search for the identifier in \textbf{all} \texttt{ScopingNodes}
 for the current source file, then retrieve all the \texttt{ScopingNodes} and
 manually iterate through each one. Remember that the root of the AST is a
 \texttt{ScopingNode} and you may obtain the root of the AST through the
 \texttt{getRoot} method declared in \texttt{IFortranAST}.
 \\

 \begin{lstlisting}[numbers=none, frame=lines]
 List<ScopingNode> scopes = myRoot.getAllContainedScopes();

 for (ScopingNode scopingNode : scopes)
 {
 	// search through each ScopingNode
 }
 \end{lstlisting}

 \subsubsection{Examples in \texttt{FortranEditorASTActionDelegate} subclasses}

 The following subclasses of \texttt{FortranEditorASTActionDelegate} all contain short working examples of how to use the binding analysis API in Photran:
 \begin{itemize}
 	\item DisplaySymbolTable
 	\item FindAllDeclarationsInScope
 	\item OpenDeclaration
 	\item SelectEnclosingScope
 \end{itemize}


 \subsection{Scheduling and (Avoiding) Concurrent Access}

 It is important to note that, because \texttt{PhotranVPG} is a singleton object,
 it may not be accessed concurrently by multiple threads.

 Most actions that require an AST will be subclasses of
 \texttt{FortranEditorActionDelegate}. (All refactoring actions in Photran
 are descendants of \texttt{FortranEditorActionDelegate}, for example.)
 These are always scheduled in a way that avoids this problem.

 Otherwise, the thread must either be scheduled using a
 \texttt{VPGSchedulingRule} or it must lock the entire workspace. See
 \texttt{EclipseVPG\#queueJobToEnsureVPGIsUpToDate} as an example on how to use
 the \texttt{VPGSchedulingRule} and \texttt{FortranEditorActionDelegate\#run} as
 an example of how to lock the entire workspace.

 As a guideline, contributors who are interested in accessing the VPG should
 consider structuring their contributions as descendants of
 \texttt{FortranEditorActionDelegate}. However, if that
 approach is not feasible, then they should consider using
 \texttt{VPGSchedulingRule} before resorting to locking the entire workspace.

 \section{Common Tasks}

 In the org.eclipse.photran.internal.core.refactoring package, there is a class
 called \textit{\_AST\_VPG\_HOWTO} which contains several small ``snippets''
 illustrating how to perform common tasks with the AST/VPG.  This includes how
 to traverse the AST using a visitor, how to find nodes of a particular type,
 how to find all of the references to a particular name, etc.


 \newif\ifxxxxxx
 \ifxxxxxx

 \section{Background: Program Graphs and the VPG}

 \textit{Portions of this section are excerpted from
 Jeff's dissertation proposal (2009).}

 \subsection{Theory: Program Graphs}

 According to Mens et al.,~
 a \textit{program graph} ``may be viewed, in broad lines, as an abstract syntax
 tree augmented by extra edges.''\footnote{T.~Mens, N.~Van~Eetvelde, S.~Demeyer,
 and D.~Janssens.  ``Formalizing refactorings with graph transformations.''
 \textit{J.~Softw. Maint. Evol.}, 17(4): 247--276, 2005.}  These
 ``extra edges'' represent semantic information, such as variable scopes and
 bindings, control flow, inheritance relationships, and so forth.

 Unlike a compiler, where the AST, symbol table, and control flow graph are
 usually separate data structures, a program graph combines all such information
 into a single data structure.  In a program graph, there are no symbol tables;
 rather, some nodes in the AST correspond to declarations, and references
 contain an edge pointing to their declaration.  Scopes can similarly be
 represented by AST nodes; symbols can point to the scope in which they are
 defined.  The control flow successors of a node are other AST nodes.  A
 variable's du-chains would likely consist of AST nodes representing assignment
 statements and variable-use expressions.  And so forth.  In other words, in a
 program graph, all program analyses are stated in terms of AST nodes.

 Alternatively, one might think of a program graph as an AST with the graph
 structures of a control flow graph, program dependence graph, du-chains,
 etc.~``superimposed.''\footnote{M.~Verbaere, A.~Payement, O.~de~Moor.
 ``Scripting refactorings with JunGL.'' In \textit{Proc.~OOPSLA~'06},
 pp.~651--652.}  The nodes of the AST also
 serve as nodes of the various graph structures; the edges connecting them are
 different.

 \begin{image}
 	\centering
 			\includegraphics[height=6in]{images/sample_vpg.png}
 			\caption{Sample program graph (AST with semantic edges)
 			for a Java program}
 		\label{test2before}
 \end{image}

 An example of a Java program and a plausible program graph representation are
 shown in Figure~\ref{test2before}.  The underlying abstract syntax tree is
 shown in outline form; the dotted lines are the extra edges that make the AST a
 program graph.  We have shown four types of edges.  \textit{Scope} edges link a
 declaration to the class or method in which it is defined.  \textit{Binding}
 edges link the use of an identifier to its corresponding declaration.  Within
 the method body, \textit{control flow} edges form the (intraprocedural) control
 flow graph; the method declaration node is used as the entry block and null as
 the exit block.  Similarly, there are two du-chains, given by \textit{def-use}
 edges.

 The advantages of the program graph representation are twofold.  First, it is
 generic enough that it a program graph can be built to represent virtually any
 conventional programming language.  Every program has a syntax which can be
 represented as a syntax tree, and we hypothesize that the semantics needed to
 perform common refactorings can be represented as extra edges (and node
 annotations) on this tree.  The second advantage of the program graph is that
 it summarizes the ``interesting'' aspects of both the syntax and semantics of a
 program in a single representation, obviating the need to maintain a mapping
 between several distinct representations.

 \subsection{Implementation: The Virtual Program Graph}

 Often, program graphs only conceptual: "Real" programs can be extremely large,
 spanning hundreds of files, and edges may span across these files. For example,
 a function defined in one file may be used in another, so a binding edge would
 span across files. Of course, it is inefficient--and sometimes impossible--to
 store ASTs for several hundred files in memory.

 The Rephraser Engine provides a base implementation for a \textit{virtual program
 graph,} which takes a different approach.  For the most part, the VPG allows the
 programmer to pretend that an AST for the entire (several-hundred-file) program
 is available in memory, with edges spanning between files when necessary.

 TODO

 Therefore, most systems cache information about externally-visible declarations
 in a program database.  Often they expose this to the programmer, so
 it becomes the programmer's responsibility to parse individual files
 and to distinguish what is available the program database from what is available
 in the AST.

 The \textit{virtual program graph}, or
 VPG, provides a cross-reference database and indexing infrastructure TODO

 To the programmer building a refactoring, the VPG appears quite simple.
 \begin{itemize}
 \item When the programmer requires an AST, the programmer asks the VPG to
       provide it.  He does not parse the file directly.
 \item Methods on the AST nodes provide name binding information, control flow
       information, source/AST rewriting, etc.  For example, the programmer can
       ask an AST node for its control flow successors, or he can tell it to
       remove itself from the AST, to reindent itself, etc.
 \end{itemize}

 One advantage of the VPG architecture is that programmers developing
 refactorings are insulated from much of the underlying infrastructure.  The VPG
 acts as an AST repository, insulating the programmer from the lexer/parser and
 preprocessor, while the AST nodes insulate the programmer from the semantic
 analyses, the source rewriter, and the program database.

 One particularly interesting example is when a subprogram node is asked for all
 of its call sites.  Note that these are not necessarily nodes in the same AST.
 Generally, this information would be stored in the cross-reference database,
 but the programmer need not be aware of this; the VPG can simply provide an
 iterator which provides pointers to AST nodes.  As the iterator progresses,
 ASTs for the various files can be transparently loaded and unloaded from
 memory.  (Recall that the programmer always asks the VPG to provide ASTS; thus,
 careful implementation can guarantee that there are never multiple ASTs for the
 same file in memory.)  In essence, the VPG is intended to provide the illusion
 that there are always ASTs available for the whole program.\footnote{The
 ``virtual'' in ``virtual program graph'' is analogous to ``virtual'' in
 ``virtual memory:'' The system shields the programmer from knowing whether the
 requested data is already in memory or if it must first be loaded from disk.}

 \begin{image}
 	\centering
 			\includegraphics[height=2.5in]{images/vpg_db_er.png}
 			\caption{VPG database schema}
 		\label{fig:vpg_db_er}
 \end{image}

 \fi