org.eclipse.photran-dev-docs/misc-notes/old-dev-guide-chapters/inc-a4-misc-notes.tex - ptp/org.eclipse.photran - Git at Google

 % Miscellaneous Parser and Language Processing Notes

 \section{Notes on the (almost-)LALR(1) Grammar}

 The current grammar is based on an Eli grammar that has been worked on
 for several years.  All of the lexer work is totally original, and we fixed
 several bugs in the Eli grammar, but for the most part the grammar in
 \texttt{Fortran95.bnf} is the work of other people.  We have invested
 several months of work in it---but that does not compare to the years of
 work invested by the previous authors of that grammar.

 More information on the Eli grammar is available at
 http://members.aol.com/wclodius/eli\_f95\_parser.html

 \section{Why not an abstract syntax tree?}

 We started by building an AST for Fortran by hand.  For a compiler, this
 wouldn't be a big deal.  The purpose of an AST is provide a tree that only
 contains ``useful'' information from the parse, so superfluous tokens like
 parentheses never end up in an AST.  For a refactoring tool, though,
 it is important to remember every token in the parse, because you need to
 reproduce code that looks similar to the user's original code.  Fortran
 has a number of cases where there are several different ways to specify the
 same thing.  For example, you can end a program with ``end,'' ``end program,''
 or ``end program'' followed by the name of the program.  Other cases, with
 several optional tokens in the middle of statements, are trickier.  So,
 long story short, after a couple of months, we had about 600 AST classes and
 were nowhere near finished.

 So we needed a different alternative.

 One would be to have the parser generator generate the AST stuff for us.
 However, aside from the fact that it would require lots of tedious annotations
 in the grammar, we would still be in a place of having several hundred AST
 classes and a Visitor with several hundred methods.

 Instead, we\footnote{Actually, ``I'' is more correct... Spiros was leaving
 for Google, Brian was indifferent, and I think Dr. Johnson still would
 have preferred an AST... so now you know where to put the blame...} decided
 to use a good old-fashioned parse tree.

 Here's why.

 First, it made the ``superfluous token'' problem go away.
 Since each node just had a \texttt{List} of child nodes (either tokens or
 other parse tree nodes), we did not have to do anything extra to accommodate
 all of the variations of a statement.  All the tokens were there when we
 needed them (e.g., for prettyprinting), and they could be ignored when we
 didn't need them.

 Second, it gave us two possible Visitors, as described above.  More
 importantly, unlike visiting an AST, these Visitors could just do a ``dumb''
 recursive traversal of the tree, totally ignorant of the underlying grammar.
 This also made parse tree searches and grammar modifications easier.

 There are probably other reasons as well which I can try to remember if
 someone is still not convinced that this was a good idea.
	% Miscellaneous Parser and Language Processing Notes

	\section{Notes on the (almost-)LALR(1) Grammar}

	The current grammar is based on an Eli grammar that has been worked on
	for several years. All of the lexer work is totally original, and we fixed
	several bugs in the Eli grammar, but for the most part the grammar in
	\texttt{Fortran95.bnf} is the work of other people. We have invested
	several months of work in it---but that does not compare to the years of
	work invested by the previous authors of that grammar.

	More information on the Eli grammar is available at
	http://members.aol.com/wclodius/eli\_f95\_parser.html

	\section{Why not an abstract syntax tree?}

	We started by building an AST for Fortran by hand. For a compiler, this
	wouldn't be a big deal. The purpose of an AST is provide a tree that only
	contains ``useful'' information from the parse, so superfluous tokens like
	parentheses never end up in an AST. For a refactoring tool, though,
	it is important to remember every token in the parse, because you need to
	reproduce code that looks similar to the user's original code. Fortran
	has a number of cases where there are several different ways to specify the
	same thing. For example, you can end a program with ``end,'' ``end program,''
	or ``end program'' followed by the name of the program. Other cases, with
	several optional tokens in the middle of statements, are trickier. So,
	long story short, after a couple of months, we had about 600 AST classes and
	were nowhere near finished.

	So we needed a different alternative.

	One would be to have the parser generator generate the AST stuff for us.
	However, aside from the fact that it would require lots of tedious annotations
	in the grammar, we would still be in a place of having several hundred AST
	classes and a Visitor with several hundred methods.

	Instead, we\footnote{Actually, ``I'' is more correct... Spiros was leaving
	for Google, Brian was indifferent, and I think Dr. Johnson still would
	have preferred an AST... so now you know where to put the blame...} decided
	to use a good old-fashioned parse tree.

	Here's why.

	First, it made the ``superfluous token'' problem go away.
	Since each node just had a \texttt{List} of child nodes (either tokens or
	other parse tree nodes), we did not have to do anything extra to accommodate
	all of the variations of a statement. All the tokens were there when we
	needed them (e.g., for prettyprinting), and they could be ignored when we
	didn't need them.

	Second, it gave us two possible Visitors, as described above. More
	importantly, unlike visiting an AST, these Visitors could just do a ``dumb''
	recursive traversal of the tree, totally ignorant of the underlying grammar.
	This also made parse tree searches and grammar modifications easier.

	There are probably other reasons as well which I can try to remember if
	someone is still not convinced that this was a good idea.