JosAH's blog

Compilers

by JosAH, 07-16-2011 at 04:41 PM
Greetings,

Frequently a question pops up in this forum (or other forums for that matter) where the OP wants to implement a calculator or wants to compile an expression given in text (String) form. Often they get lost in tokenizers or regular expressions or whatever. This article describes how a compiler (for expressions) can be designed and implemented. The attachment contains a zip file with all the sources for a complete compiler.

Top level architecture

A compiler consists of the following parts:

1) a lexical analyzer; this component chops up the character input stream and forms 'tokens', e.g. the expression "x+4.2" contains three tokens: [x], [+] and [4.2]; the three tokens have the following types: NAME, OPERATOR and DOUBLE. A lexical analyzer needs a Reader to read characters from. The stuff between square brackets is a textual representation of the tokens, not of the text the tokens are generated from.

2) a parser: this component works in lock step with the lexical analyzer, i.e. it asks the analyzer to produce tokens and checks the sequence of tokens for syntactic validity, e.g. the text "*x-/y" produces the token sequence [*], [x], [-], [/] and [y]. Everything is fine according to the lexical analyzer but the parser protests because the original string doesn't make up a syntactically correct expression.

3) a backend or processor; this backend component is called by the parser to generate code or interpret the token stream generated by the lexical analyzer and checked by the parser. The zip file contains three different processor implementations: an interpreter, a tree generator and a code generator.

4) an evaluator: a processor delivers an evaluator that can produce the result of the expression. For an interpreter the result of the expression is already known and the evaluator just hands it back to the user; for a code generator, an evaluator runs the produced code and finally hands the result back to the user.

5) a compiler: this component is a 'facade' in pattern language; it generates the lexical analyzer, the parser and constructs the entire 'compiler' out of it. The user hands a processor to the compiler (used by the parser) so the entire engine can do what it has to do: produce a result to be handed back to the user.

As far as the user is concerned only 5) is of interest; it handles all the other components mentioned in 1) ... 4). The following sections describe each of these components. The parser is of interest syntactically, i.e. it implements a 'recursive descent' parser for ordinary infix notation expressions.

The compiler conducts the following flow of data and actions:

reader --> analyzer --> parser --> processor --> evaluator.

The analyzer is only used by the parser and the processor is activated by that same parser. The parser is started by the compiler and the evaluator is generated by the processor which is also called by the compiler.

This all looks like it is going to be a big, fat bunch of code; while writing a compiler isn't a trivial task by far, it isn't rocket science either. Even more important: the approach outlined in this article works, and all the struggling by those posters, juggling with regular expressions, tokenizers and semi-smart shortcuts, doesn't. The scenario above is implemented by five classes that take at most two pages of source code each (including a lot of empty lines). To be fair, a few small utility classes are needed but they don't take up a single page of source code each. Another benefit of this scenario is that, say, the parser can be altered while the other components stay as they are. The same is true for the processor component, e.g. the attachment contains three entirely different processor implementations demonstrating this independence.

The rest of this article briefly describes the (small) utility classes and the five components mentioned above that make up the compiler.

The utility classes

Token

An analyzer produces tokens given a character input stream. The Token class implements those tokens. A Token has three parts:

1) the type of the Token;
2) an associated value;
3) its textual representation.

The types of a Token are implemented as an enumeration. The types are:

Java Code:
- EOF      : the end of the character input stream has been reached before the analyzer could produce any other Token;
- ERROR    : an error occurred while reading the character input stream; 
- INT      : an int literal number was read;
- DOUBLE   : a floating point literal number was read;
- CHAR     : any non-specified character was read;
- OPERATOR : an operator was read;
- NAME     : a name was read;
- FUNC     : a function name was read.
For example, if the input stream contains the text "x=(3+4)" the following token types are produced: NAME, OPERATOR, CHAR, INT, OPERATOR, INT, CHAR, EOF. For the token types INT, DOUBLE and OPERATOR the associated value contains an Int, a Double or an Opcode (see below) object. For all other types of Tokens the associated value equals null.
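To make this concrete, here is a minimal sketch of what such a Token class can look like; the field names and the exact shape of the enum are my assumptions, not necessarily those of the sources in the zip file.

```java
// A minimal Token sketch: a type, an associated value and the text it
// was generated from. The names here are illustrative assumptions.
enum TokType { EOF, ERROR, INT, DOUBLE, CHAR, OPERATOR, NAME, FUNC }

class Token {
    final TokType type;  // the kind of token
    final Object value;  // an Integer, Double or Opcode object, or null
    final String text;   // textual representation, e.g. "4.2"

    Token(TokType type, Object value, String text) {
        this.type = type;
        this.value = value;
        this.text = text;
    }

    @Override public String toString() { return "[" + text + "]"; }
}

public class TokenDemo {
    public static void main(String[] args) {
        Token t = new Token(TokType.DOUBLE, 4.2, "4.2");
        System.out.println(t.type + " " + t); // DOUBLE [4.2]
    }
}
```

Making the fields final keeps Tokens immutable, so the parser and processor can pass them around freely.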

Opcode

This enumeration is used for all the operators in an expression. The values of the Opcode enumeration are:

Java Code:
	OUT("out", -1),
	
	SOBINARY("",0),
	SEQ(","	,	0),
	MOV("="	,	1),
	EQL("==",	2),
	NEQ("!=",	2),
	LES("<",	3),
	LEQ("<=",	3),
	GEQ(">=",	3),
	GRT(">",	3),
	SHL("<<",	4),
	SHR(">>",	4),
	OR("|",	5),
	XOR("^",	5),
	AND("&",	6),
	ADD("+",	7),
	SUB("-",	7),
	MUL("*",	8),
	DIV("/",	9),
	MOD("%",	9),
	EOBINARY("",10),
	
	SOUNARY("",10),
	PLS("+",   10),
	MIN("-",   10),
	NOT("!",   10),
	INV("~",   10),
	INC("++",  10),
	DEC("--",  10),
	EOUNARY("",11),

	SOFUNC("", 11),
	SIN("sin", 11),
	COS("cos", 11),
	TAN("tan", 11),
	MNF("min", 11),
	MXF("max", 11),
	EOFUNC("", 12);
The SOBINARY, EOBINARY, SOUNARY, EOUNARY, SOFUNC and EOFUNC values mark the start and end of the binary operators, the start and end of the unary operators and the start and end of the functions ('SO' is 'Start Of' and 'EO' is 'End Of'). The OUT value is a special 'operator' that moves the result of the expression to a well known place. Note that functions and operators are considered similar; and they are: operators can be interpreted as functions taking one or two arguments and producing a result. Lisp is a fine example of this similarity, i.e. (+ 3 4) is considered to be the function + applied to its operands 3 and 4; the function value is 7 here (surprise!). In this example implementation five functions are implemented: sin, cos, tan, min and max. Each value has an associated precedence: functions have the highest precedence, followed by the unary operators, and the binary operators have the lowest precedences. Because we want to use ordinary infix notation we have to distinguish between operators and functions.
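The skeleton of such an enum, where each value carries its textual symbol and its precedence as constructor arguments, can be sketched as follows; only four of the values above are reproduced and the field names are my choice:

```java
// Sketch of the Opcode idea: each enum value carries the operator's
// textual symbol and its precedence (a higher precedence binds tighter).
enum Opcode {
    ADD("+", 7), SUB("-", 7), MUL("*", 8), DIV("/", 9);

    final String symbol;
    final int precedence;

    Opcode(String symbol, int precedence) {
        this.symbol = symbol;
        this.precedence = precedence;
    }
}

public class OpcodeDemo {
    public static void main(String[] args) {
        // because '*' has a higher precedence than '+', the parser
        // treats 2+3*4 as 2+(3*4)
        System.out.println(Opcode.MUL.precedence > Opcode.ADD.precedence); // true
    }
}
```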

Symbols

This utility class helps the lexical analyzer, i.e. given a textual representation of an operator or function it can produce the corresponding Opcode (see the previous paragraph). This class can also produce the 'arity' of an operator or function, i.e. it knows how many arguments the operator or function needs. It also knows which characters are used for parenthesized expressions (the '(' and ')' characters) and which character is used as the separator in parameter lists (here the ',' is used, as in min(3, 4)).
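The arity lookup can be sketched with a simple map; the table below is a tiny assumed subset and the class name is mine, and the real Symbols class also produces the corresponding Opcode values:

```java
import java.util.HashMap;
import java.util.Map;

// A sketch of the Symbols idea: map the textual form of an operator or
// function to its arity (how many arguments it needs).
public class SymbolsDemo {
    private static final Map<String, Integer> ARITY = new HashMap<>();
    static {
        ARITY.put("*", 2);    // binary operator
        ARITY.put("!", 1);    // unary operator
        ARITY.put("sin", 1);  // unary function
        ARITY.put("min", 2);  // binary function
    }

    static int arity(String symbol) {
        Integer a = ARITY.get(symbol);
        if (a == null) throw new IllegalArgumentException("unknown symbol: " + symbol);
        return a;
    }

    public static void main(String[] args) {
        System.out.println(arity("min")); // 2
    }
}
```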

Analyzer

The Analyzer object has to produce Token objects given a character input stream. Basically, what it does is this:

Java Code:
for (int c = read(); c != -1; c = read()) {

    if (Character.isWhitespace(c)) continue;

    if (Character.isDigit(c)) return numberToken();

    if (Character.isJavaIdentifierStart(c)) return nameToken(c);

    return tokenToken(c);
}

return new Token(Token.TokType.EOF, null, "eof");
It keeps reading characters and decides what to do given the fresh character:

- white space is simply consumed and skipped;
- a digit must be the start of a number;
- a character that is a Character.isJavaIdentifierStart(int) must be a name or a function name;
- otherwise it must be the start of an operator or just an ordinary character;
- if nothing can be read anymore an EOF token is produced.

The numberToken method uses a regular expression to match a Double type number or an Integer type number and produces a corresponding Token object. The nameToken method checks the characters to be the start of a Java identifier or part of a Java identifier. When the entire name has been read it checks whether or not the name represents a function and a corresponding Token object is produced. The tokenToken method reads the character input stream as long as the characters represent an operator. This scenario produces a single Token object for the character sequence ">>" instead of two Token objects [>] and [>]. This scenario also has the restriction that a character sequence p c can only be an operator if p itself is a possible start of an operator; so @> can't be an operator because @ isn't the start of any operator. The Analyzer always consumes the longest possible operator, i.e. "+++++x" produces the Tokens [++] [++] [+] [x] instead of [+] [+] [+] [+] [+] [x].
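The 'longest possible operator' behaviour can be sketched in isolation like this; the operator set and the method name are mine, and the real tokenToken method works on the Reader instead of a String:

```java
import java.util.Set;

// A sketch of the 'maximal munch' operator scan: keep extending the
// candidate text as long as it still forms a known operator.
public class MaximalMunch {
    // a small assumed subset of the operator table
    static final Set<String> OPS = Set.of("+", "++", "-", "--", ">", ">>");

    // returns the longest operator starting at position pos, or null
    static String longestOp(String input, int pos) {
        String best = null;
        for (int end = pos + 1; end <= input.length(); end++) {
            String candidate = input.substring(pos, end);
            if (OPS.contains(candidate)) best = candidate;
            else if (best != null) break; // can't be extended any further
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestOp("+++x", 0)); // ++
        System.out.println(longestOp(">>y", 0));  // >>
    }
}
```

Applied repeatedly this reproduces the [++] [++] [+] tokenization of "+++++x" described above.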

Finally the Analyzer implements a nice utility method: if the Parser detects a syntax error, e.g. in the expression "x+3(a-4)" where the left parenthesis wasn't expected (a '*' character is missing, for instance), I want to produce this:

Java Code:
x+3(a-4)
   ^

the utility method produces the line with a single caret '^' character at the right position. For further details of the methods mentioned above see the source code in the zip file.
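The mechanism is small enough to sketch here; the method name is mine, not the one from the zip file:

```java
// A sketch of the error-marking utility: given the expression text and
// the offset where the Parser gave up, produce the text plus a second
// line with a caret under the offending position.
public class ErrorMarker {
    static String mark(String expression, int position) {
        return expression + "\n" + " ".repeat(position) + "^";
    }

    public static void main(String[] args) {
        System.out.println(mark("x+3(a-4)", 3));
    }
}
```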

Parser

Many consider this object the central part of a compiler and in a certain way it is: it sucks Tokens out of the Analyzer and calls the Processor whenever something needs to be produced, but all it does itself is check a Token stream for correct syntax. This Parser object implements 'normal' infix expressions (note the quotes, because it is far from normal if you're used to prefix notation or (god forbid) postfix notation). Some languages (FORTH comes to mind) don't even have a syntax, i.e. any sequence of Tokens makes up a valid sequence, which is nice imho, but our language does have syntax: infix expression syntax, made up by those math folks. On a top level an expression looks like this:

<expression> [operator] <expression> [operator] <expression> [operator] ...

where <expression> is any expression containing only operators (if any) with higher precedence than <operator>. On the top level none of the parts need to be present (an empty expression that doesn't do anything) or just the first part needs to be present (no operators with lowest precedence present in the expression). When we 'zoom in' one level we see the same thing again:

<expression> [operator] <expression> [operator] <expression> [operator] ...

where [operator] isn't an operator of the lowest precedence nor one with a precedence one level up. Here's an example: the expression:

Java Code:
2*(3+4)
can be seen at the top level as a single <expression>. One 'zoom in' step further, this <expression> can be seen as: <expression>*<expression>. The second <expression> can be seen as (<expression>) and this one can be interpreted again as <expression>+<expression>. Each 'zoom in' step only considers operators of higher precedence than the previous step did.
Finally we have zoomed in so far that no binary operators are present anymore. We have reached the unary expressions, which are:

<unary operator>* <atomic expression>

where <unary operator> is one of +, -, ~, !, ++ or -- and the <atomic expression> can be either a parenthesized expression, a numerical constant, a name or a function call. This is exactly what our parser does: zooming in recursively and calling the Processor object whenever it must be called (see the next paragraph). Whenever something goes wrong, i.e. the Parser doesn't 'see' whatever it wants to see during the recursive process, a syntax error is generated and the parsing process stops. See the attachment for the source code of the Parser class and the gory details described above.

In Backus Naur form this is the grammar accepted by the Parser:

Java Code:
expression          : sequenceexpression { , sequenceexpression } *
sequenceexpression  : movexpression { = movexpression } *
movexpression       : equalityexpression { [ == != ] equalityexpression } *
equalityexpression  : comparisonexpression { [ < > <= >= ] comparisonexpression } *
comparisonexpression: shiftexpression { [ << >> ] shiftexpression } *
shiftexpression     : orexpression { [ | ^ ] orexpression } *
orexpression        : andexpression { & andexpression } *
andexpression       : addexpression { [ + - ] addexpression } *
addexpression       : unaryexpression { [ * / % ] unaryexpression } *
unaryexpression     : [ + - ~ ! ++ -- ] unaryexpression | atomicexpression
atomicexpression    : [ { ( expression ) }  DOUBLE INT name function ( { sequenceexpression { , sequenceexpression } * }? ) ]
Anything between [ ... ] denotes a choice, anything between { ... } is just grouping; the * symbol denotes zero or more occurrences of the previous item and the ? symbol denotes zero or one occurrence. All except the last two rules represent the 'zooming' described above and they are implemented by just three highly recursive methods. Note that the parameter list of a function call doesn't accept expressions but only sequenceexpressions; expressions can contain commas (,) which are also used to separate the individual parameters. Parenthesizing circumvents this slight inconvenience, e.g. min(2, (1, 3)) is identical to min(2, 3) and results in the value 2.
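To see the 'zooming' in action, here is a self-contained toy parser that handles just two of the precedence levels above (additive and multiplicative operators) plus parentheses, and evaluates while parsing, the way the Interpreter processor does. The real Parser handles all the levels in the grammar the same way; this sketch is my own simplification, not the zip file's Parser:

```java
// A two-level recursive descent parser: addexpr calls mulexpr, mulexpr
// calls atom, and atom can recurse back into addexpr for '(...)'.
public class MiniParser {
    private final String s;
    private int pos;

    MiniParser(String s) { this.s = s.replaceAll("\\s", ""); }

    double parse() { return addExpr(); }

    // addexpr : mulexpr { [ + - ] mulexpr } *
    private double addExpr() {
        double v = mulExpr();
        while (pos < s.length() && (s.charAt(pos) == '+' || s.charAt(pos) == '-'))
            v = (s.charAt(pos++) == '+') ? v + mulExpr() : v - mulExpr();
        return v;
    }

    // mulexpr : atom { [ * / ] atom } *
    private double mulExpr() {
        double v = atom();
        while (pos < s.length() && (s.charAt(pos) == '*' || s.charAt(pos) == '/'))
            v = (s.charAt(pos++) == '*') ? v * atom() : v / atom();
        return v;
    }

    // atom : ( addexpr ) | number
    private double atom() {
        if (s.charAt(pos) == '(') {
            pos++;                 // consume '('
            double v = addExpr();  // 'zoom out' again inside the parentheses
            pos++;                 // consume ')'
            return v;
        }
        int start = pos;
        while (pos < s.length() && (Character.isDigit(s.charAt(pos)) || s.charAt(pos) == '.'))
            pos++;
        return Double.parseDouble(s.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(new MiniParser("2*(3+4)").parse()); // 14.0
    }
}
```

Note how the precedence of * over + falls out of the call structure alone: mulExpr binds its operands before addExpr ever sees them.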

Processor

The Processor is an interface; three implementations are present: Interpreter, TreeProcessor and CodeGenerator. The first implementation evaluates the expression while it is being parsed and scanned, the second builds a tree, and the third generates code for a (tiny) virtual machine. The TreeProcessor can generate a tree given an infix expression and produce three of its corresponding text forms: infix, postfix and prefix expressions. See the source code of the *Test.java files for details. Here is the Processor interface:

Java Code:
package compiler;

public interface Processor {

	public static class ProcessorException extends RuntimeException {

		private static final long serialVersionUID = 5131994858045876960L;

		public ProcessorException(String message) {
			super(message);
		}
	};
	
	public Name setName(String name, Number value);
	
	public Evaluator getEvaluator();
	
	public void processBinary(Token token);
	public void processUnary(Token token);
	public void processDouble(Token token);
	public void processInt(Token token);
	public void processName(Token token);	
	public void processFunction(Token token);
}
It defines a (nested) Exception to be used by implementations of the Processor interface in case anything goes wrong. It is thrown, for instance, whenever an expression wants to assign to or modify anything but a name, e.g. "54= 42" or "++42". While both these expressions are syntactically correct (the assignment is a binary operator, syntactically speaking) they are nonsense semantically speaking. A Parser doesn't know anything about semantics, it just checks the syntax of a stream of Tokens. The setName( ... ) method can be used by a user to preset a Name to a certain value and the getEvaluator() method retrieves an Evaluator that can produce the result of the compiled expression.

The rest of the methods are called by a Parser object when it has parsed a binary expression, a unary expression, a double type constant, an int type constant, a name or a function respectively. If the expression was "3*(4.5+x)" the following methods are called in that particular order:

Java Code:
1) processInt(3)
2) processDouble(4.5)
3) processName(x)
4) processBinary(+)
5) processBinary(*)
Interpreter

An Interpreter implements the Processor interface and is a simple stack machine. A stack can store Number type objects and an interpreter just does what it has to do when one of its methods are called. For the expression "x=2, 3*(4.5+x)" the sequence:

Java Code:
processName(x)
processInt(2)
processBinary(=)
processInt(3)
processDouble(4.5)
processName(x)
processBinary(+)
processBinary(*)
processBinary(,)
causes the following push and pop actions; the popped values are shown between parentheses:

Java Code:
push x (a new name with the value set to 0)
push 2
pop (2)
pop (x the variable)
push 2.0 (the new value of x)
push 3
push 4.5 
push x (which is 2)
pop (2.0, the value of the variable x)
pop (4.5)
push 6.5 (after addition)
pop (6.5)
pop (3)
push 19.5 (after multiplication)
pop (19.5)
pop (2.0)
push 19.5
pop (19.5 the result of the entire expression)

An Interpreter object can be constructed with a boolean parameter; if it is set to true the entire evaluation is printed on the standard output stream, which can be handy for testing or debugging purposes. If the parameter value is set to false nothing is printed.
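The stack discipline shown in the trace above can be sketched in a few lines; only + and * are implemented here and variables are left out, so this is an illustration of the mechanism rather than the real Interpreter:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A sketch of the Interpreter's stack machine: operands are pushed and
// a binary operator pops its two operands and pushes the result.
public class StackInterpreter {
    private final Deque<Double> stack = new ArrayDeque<>();

    void pushNumber(double d) { stack.push(d); }

    void processBinary(char op) {
        double right = stack.pop();  // operands come off in reverse order
        double left  = stack.pop();
        switch (op) {
            case '+': stack.push(left + right); break;
            case '*': stack.push(left * right); break;
            default: throw new IllegalArgumentException("op: " + op);
        }
    }

    double result() { return stack.peek(); }

    public static void main(String[] args) {
        // the calls the parser would trigger for "3*(4.5+2)"
        StackInterpreter ip = new StackInterpreter();
        ip.pushNumber(3);
        ip.pushNumber(4.5);
        ip.pushNumber(2);
        ip.processBinary('+');  // 4.5+2 -> 6.5
        ip.processBinary('*');  // 3*6.5 -> 19.5
        System.out.println(ip.result()); // 19.5
    }
}
```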

TreeProcessor

This class generates an AST (Abstract Syntax Tree) and can produce three of its equivalent text forms: a prefix, an infix and a postfix form. This Processor doesn't have an associated Evaluator, i.e. it does everything itself. Here is its output when the above expression is compiled:

Java Code:
infix  : ((x = 2) , (3 * (4.5 + x)))
postfix: x 2 = 3 4.5 x + * ,
prefix:  (, (= x 2) (* 3 (+ 4.5 x)))
The TreeProcessor creates the AST using just two types of Nodes, i.e. an AtomicNode (for numbers and names) and a FunctionNode for everything else. Just because the infix notation isn't very consistent, it distinguishes between the representation of operators and functions, e.g. a binary operator is represented as: left operator right, while a binary function is represented as: operator(left, right); but they behave identically when evaluated.
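A sketch of these two node types and the three traversals might look like this; the names and method signatures are my assumptions, not the zip file's exact API:

```java
// Two node kinds suffice: atoms (numbers and names) and function nodes
// (operators and functions alike). Each node can print itself in the
// three text forms the TreeProcessor produces.
interface Node {
    String infix();
    String postfix();
    String prefix();
}

class AtomicNode implements Node {
    final String text;
    AtomicNode(String text) { this.text = text; }
    public String infix()   { return text; }
    public String postfix() { return text; }
    public String prefix()  { return text; }
}

class FunctionNode implements Node {
    final String op;
    final Node left, right;
    FunctionNode(String op, Node left, Node right) {
        this.op = op; this.left = left; this.right = right;
    }
    public String infix()   { return "(" + left.infix() + " " + op + " " + right.infix() + ")"; }
    public String postfix() { return left.postfix() + " " + right.postfix() + " " + op; }
    public String prefix()  { return "(" + op + " " + left.prefix() + " " + right.prefix() + ")"; }
}

public class TreeDemo {
    public static void main(String[] args) {
        // the tree for the subexpression 3*(4.5+x)
        Node tree = new FunctionNode("*", new AtomicNode("3"),
                new FunctionNode("+", new AtomicNode("4.5"), new AtomicNode("x")));
        System.out.println(tree.infix());   // (3 * (4.5 + x))
        System.out.println(tree.postfix()); // 3 4.5 x + *
        System.out.println(tree.prefix());  // (* 3 (+ 4.5 x))
    }
}
```

The three output lines correspond directly to the three text forms shown above.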

CodeGenerator

This class is also an implementation of the Processor interface but it generates instructions for a very simple virtual machine. This virtual machine is implemented by the CodeEvaluator class; when it is asked to produce the result of the compiled expression it runs the code and retrieves the result. The programming model for this tiny virtual machine is extremely simple: there are two spaces: a linearly addressable data space that can contain Numbers (it's just an array of type Number) and a code space which is just a List<Instruction>. An Instruction is also very simple: it is a so-called 'three address instruction' where each address is an address in the data space (here it is an int index), plus an opcode that tells what to do with the three addresses (and the Numbers in the data space). The instructions are one of:

Java Code:
- data[rd]= data[rl] op data[rr]
- data[rd]= op data[rl]
- data[rd]= op
where rd, rl and rr are the three addresses and op represents what to do with the addresses (and the corresponding elements in the data space); note that the 'op' is implemented by the Opcode enum (see above). There is one instruction of the third form and it produces the desired output or result of the entire expression. For example, the expression "x=2, 3*(4.5+x)" produces the following virtual machine code after execution:

Java Code:
data:
[2.0, 2.0, 3.0, 4.5, 19.5]
code:
r0= r0 = r1
r4= r3 + r0
r4= r2 * r4
r4= out
The data space contains the items x, 2.0, 3.0, 4.5 and the value of the result. Note that the variable x has the value 2.0 and the instruction sequence produces the value 19.5 in data[4]. The CodeGenerator object can be instantiated with a boolean parameter. If true is passed the code generator optimizes the generated code a bit, i.e. for the expression "+-+-+x" only the instruction for x is generated: no code is generated for the unary plus operator and an even number of unary minus operators cancel out. The same holds for the expression "~~~~x", i.e. an even number of ~ operators cancels out as well. Also, the , operator (the sequence operator) doesn't yield an instruction. If the parameter is false all spurious instructions are generated.
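The three-address model can be sketched as follows; only +, * and the OUT pseudo-operator are implemented, and a plain double[] is used instead of Number, so treat this as an illustration of the machine model rather than the real CodeEvaluator:

```java
// A sketch of the tiny three-address machine: each instruction names a
// destination slot rd and up to two source slots rl and rr in the data
// array; the 'o' opcode plays the role of OUT.
public class TinyVm {
    static class Instruction {
        final char op;
        final int rd, rl, rr;
        Instruction(char op, int rd, int rl, int rr) {
            this.op = op; this.rd = rd; this.rl = rl; this.rr = rr;
        }
    }

    static double run(double[] data, Instruction[] code) {
        double out = 0;
        for (Instruction i : code) {
            switch (i.op) {
                case '+': data[i.rd] = data[i.rl] + data[i.rr]; break;
                case '*': data[i.rd] = data[i.rl] * data[i.rr]; break;
                case 'o': out = data[i.rd]; break; // the OUT pseudo operator
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // data layout for "3*(4.5+2)": [x=2.0, 3.0, 4.5, scratch]
        double[] data = { 2.0, 3.0, 4.5, 0.0 };
        Instruction[] code = {
            new Instruction('+', 3, 2, 0),  // r3 = r2 + r0  (4.5+2.0)
            new Instruction('*', 3, 1, 3),  // r3 = r1 * r3  (3*6.5)
            new Instruction('o', 3, 0, 0),  // r3 = out
        };
        System.out.println(run(data, code)); // 19.5
    }
}
```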

Concluding remarks

I realize that a lot of aspects and techniques aren't discussed in this little article; I leave that to the actual source code and the reader of it. The parsing part uses old, very well established techniques: recursive descent parsing; the analyzer part isn't particularly miraculous either. The code generator part is new and self-invented. It can be enhanced in numerous ways. The grammar being parsed by the parser can also be extended so the parser can parse much more than just infix mathematical expressions. Maybe other data types can be included (Strings?). The Interpreter uses techniques that can be found in many text books: it's a simple stack machine.

Both the Interpreter and the CodeGenerator (and its accompanying CodeEvaluator) evaluate all numbers as if they were of type Double; I left the distinction (generated by the Analyzer) in, just in case anybody feels tempted to implement two versions of all operators, one for Integer operands and one for Double type operands. I left the distinction out of the Processor implementations for reasons of simplicity.

Three small test programs are present in the zip file: InterpreterTest, TreeProcessorTest and GeneratorTest that test the Interpreter, the TreeProcessor and CodeGenerator Processor implementations a bit; play with them and do with the entire bunch whatever you want. Please mention my name in the source code (whether modified or not), especially if you want to make money with it ;-) If you ever see a question about expression evaluation pop up in one of the forums again, refer them to this blog article and don't think about it anymore.

kind regards,

Jos
Submit "Compilers" to Facebook Submit "Compilers" to Digg Submit "Compilers" to del.icio.us Submit "Compilers" to StumbleUpon Submit "Compilers" to Google

Updated 07-18-2011 at 05:18 PM by JosAH
