Make a compiler that supports a subset of the ANSI-C programming language
We saw in the previous part of our compiler writing journey that a parser doesn’t necessarily enforce the semantics of our language. It only enforces the syntax and structural rules of the grammar.
We ended up with code that calculates the wrong value of expressions
like 2 * 3 + 4 * 5
, because the code created an AST that looks like:
*
/ \
2 +
/ \
3 *
/ \
4 5
instead of:
+
/ \
/ \
/ \
* *
/ \ / \
2 3 4 5
To solve this, we have to add code to our parser to perform operator precedence. There are (at least) two ways of doing this:
Here is our grammar from the last part of the journey:
expression: number
| expression '*' expression
| expression '/' expression
| expression '+' expression
| expression '-' expression
;
number: T_INTLIT
;
Note that there is no differentiation between any of the four maths operators. Let’s tweak the grammar so that there is a difference:
expression: additive_expression
;
additive_expression:
multiplicative_expression
| additive_expression '+' multiplicative_expression
| additive_expression '-' multiplicative_expression
;
multiplicative_expression:
number
| number '*' multiplicative_expression
| number '/' multiplicative_expression
;
number: T_INTLIT
;
We now have two types of expressions: additive expressions and multiplicative expressions. Note that the grammar now forces the numbers to be part of multiplicative expressions only. This forces the ‘*’ and ‘/’ operators to bind more tightly to the numbers on either side, thus having higher precedence.
Any additive expression is actually either a multiplicative expression by itself, or an additive (i.e. multiplicative) expression followed by a ‘+’ or ‘-‘ operator then another multiplicative expression. The additive expression is now at a much lower predencence than the multiplicative expression.
How do we take the above version of our grammar and implement it into
our recursive descent parser? I’ve done this in the file expr2.c
and
I’ll cover the code below.
The answer is to have a multiplicative_expr()
function to deal with the
‘*’ and ‘/’ operators, and an additive_expr()
function to deal with the
lower precedence ‘+’ and ‘-‘ operators.
Both functions are going to read in something and an operator. Then, while there are following operators at the same precedence, each function will parse some more of the input and combine the left and right halves with the first operator.
However, additive_expr()
will have to defer to the higher-precedence
multiplicative_expr()
function. Here is how this is done.
additive_expr()
// Return an AST tree whose root is a '+' or '-' binary operator
struct ASTnode *additive_expr(void) {
struct ASTnode *left, *right;
int tokentype;
// Get the left sub-tree at a higher precedence than us
left = multiplicative_expr();
// If no tokens left, return just the left node
tokentype = Token.token;
if (tokentype == T_EOF)
return (left);
// Loop working on token at our level of precedence
while (1) {
// Fetch in the next integer literal
scan(&Token);
// Get the right sub-tree at a higher precedence than us
right = multiplicative_expr();
// Join the two sub-trees with our low-precedence operator
left = mkastnode(arithop(tokentype), left, right, 0);
// And get the next token at our precedence
tokentype = Token.token;
if (tokentype == T_EOF)
break;
}
// Return whatever tree we have created
return (left);
}
Right at the beginning, we immediately call multiplicative_expr()
in case the first operator is a high-precedence ‘*’ or ‘/’. That
function will only return when it encounters a low-precedence
‘+’ or ‘-‘ operator.
Thus, when we hit the while
loop, we know we have a ‘+’ or ‘-‘ operator.
We loop until there are no tokens left in the input, i.e. when we hit
the T_EOF token.
Inside the loop, we call multiplicative_expr()
again in case any
future operators are higher precedence than us. Again, this will
return when they are not.
Once we have a left and right sub-tree, we can combine them with
the operator we got the last time around the loop. This repeats, so that
if we had the expression 2 + 4 + 6
, we would end up with the AST tree:
+
/ \
+ 6
/ \
2 4
But if multiplicative_expr()
had its own higher precedence operators,
we would be combining sub-trees with multiple nodes in them.
// Return an AST tree whose root is a '*' or '/' binary operator
struct ASTnode *multiplicative_expr(void) {
struct ASTnode *left, *right;
int tokentype;
// Get the integer literal on the left.
// Fetch the next token at the same time.
left = primary();
// If no tokens left, return just the left node
tokentype = Token.token;
if (tokentype == T_EOF)
return (left);
// While the token is a '*' or '/'
while ((tokentype == T_STAR) || (tokentype == T_SLASH)) {
// Fetch in the next integer literal
scan(&Token);
right = primary();
// Join that with the left integer literal
left = mkastnode(arithop(tokentype), left, right, 0);
// Update the details of the current token.
// If no tokens left, return just the left node
tokentype = Token.token;
if (tokentype == T_EOF)
break;
}
// Return whatever tree we have created
return (left);
}
The code is similar to additive_expr()
except that we get to call
primary()
to get real integer literals! We also only loop when
we have operators at our high precedence level, i.e. ‘*’ and ‘/’
operators. As soon as we hit a low precedence operator, we simply
return the sub-tree that we’ve built to this point. This goes
back to additive_expr()
to deal with the low precedence operator.
The above way of constructing a recursive descent parser with explicit operator precedence can be inefficient because of all the function calls needed to reach the right level of precedence. There also has to be functions to deal with each level of operator precedence, so we end up with lots of lines of code.
One way to cut down on the amount of code is to use a Pratt parser which has a table of precedence values associated with each token instead of having functions that replicate the explicit precedence in the grammar.
At this point I highly recommend that you read Pratt Parsers: Expression Parsing Made Easy by Bob Nystrom. Pratt parsers still make my head hurt, so read as much as you can and get comfortable with the basic concept.
expr.c
: Pratt ParsingI’ve implemented Pratt parsing in expr.c
which is a drop-in replacement
for expr2.c
. Let’s start the tour.
Firstly, we need some code to determine the precedence levels for each token:
// Operator precedence for each token
static int OpPrec[] = { 0, 10, 10, 20, 20, 0 };
// EOF + - * / INTLIT
// Check that we have a binary operator and
// return its precedence.
static int op_precedence(int tokentype) {
int prec = OpPrec[tokentype];
if (prec == 0) {
fprintf(stderr, "syntax error on line %d, token %d\n", Line, tokentype);
exit(1);
}
return (prec);
}
Higher numbers (e.g. 20) mean a higher precedence than lower numbers (e.g. 10).
Now, you might ask: why have a function when you have a look-up table called
OpPrec[]
? The answer is: to spot syntax errors.
Consider an input that looks like 234 101 + 12
. We can scan in the
first two tokens. But if we simply used OpPrec[]
to get the precedence of the
second 101
token, we wouldn’t notice that it isn’t an operator. Thus,
the op_precedence()
function enforces the correct grammar syntax.
Now, instead of having a function for each precedence level, we have a single expression function that uses the table of operator precedences:
// Return an AST tree whose root is a binary operator.
// Parameter ptp is the previous token's precedence.
struct ASTnode *binexpr(int ptp) {
struct ASTnode *left, *right;
int tokentype;
// Get the integer literal on the left.
// Fetch the next token at the same time.
left = primary();
// If no tokens left, return just the left node
tokentype = Token.token;
if (tokentype == T_EOF)
return (left);
// While the precedence of this token is
// more than that of the previous token precedence
while (op_precedence(tokentype) > ptp) {
// Fetch in the next integer literal
scan(&Token);
// Recursively call binexpr() with the
// precedence of our token to build a sub-tree
right = binexpr(OpPrec[tokentype]);
// Join that sub-tree with ours. Convert the token
// into an AST operation at the same time.
left = mkastnode(arithop(tokentype), left, right, 0);
// Update the details of the current token.
// If no tokens left, return just the left node
tokentype = Token.token;
if (tokentype == T_EOF)
return (left);
}
// Return the tree we have when the precedence
// is the same or lower
return (left);
}
Firstly, note that this is still recursive like the previous parser
functions. This time, we receive the precedence level of the token
that was found before we got called. main()
will call us with the
lowest precedence, 0, but we will call ourselves with higher values.
You should also spot that the code is quite similar to the
multiplicative_expr()
function: read in an integer literal,
get the operator’s token type, then loop building a tree.
The difference is the loop condition and body:
multiplicative_expr():
while ((tokentype == T_STAR) || (tokentype == T_SLASH)) {
scan(&Token); right = primary();
left = mkastnode(arithop(tokentype), left, right, 0);
tokentype = Token.token;
if (tokentype == T_EOF) return (left);
}
binexpr():
while (op_precedence(tokentype) > ptp) {
scan(&Token); right = binexpr(OpPrec[tokentype]);
left = mkastnode(arithop(tokentype), left, right, 0);
tokentype = Token.token;
if (tokentype == T_EOF) return (left);
}
With the Pratt parser, when the next operator has a higher precedence
than our current token, instead of just getting the next integer
literal with primary()
, we call ourselves with binexpr(OpPrec[tokentype])
to raise the operator precedence.
Once we hit a token at our precedence level or lower, we will simply:
return (left);
This will either be a sub-tree with lots of nodes and operators at a higher precedence that the operator that called us, or it might be a single integer literal for an operator at the same predence as us.
Now we have a single function to do expression parsing. It uses a small helper function to enforce the operator precedence, and thus implements the semantics of our language.
You can make two programs, one with each parser:
$ make parser # Pratt Parser
cc -o parser -g expr.c interp.c main.c scan.c tree.c
$ make parser2 # Precedence Climbing
cc -o parser2 -g expr2.c interp.c main.c scan.c tree.c
You can also test both parsers with the same input files from the previous part of our journey:
$ make test
(./parser input01; \
./parser input02; \
./parser input03; \
./parser input04; \
./parser input05)
15 # input01 result
29 # input02 result
syntax error on line 1, token 5 # input03 result
Unrecognised character . on line 3 # input04 result
Unrecognised character a on line 1 # input05 result
$ make test2
(./parser2 input01; \
./parser2 input02; \
./parser2 input03; \
./parser2 input04; \
./parser2 input05)
15 # input01 result
29 # input02 result
syntax error on line 1, token 5 # input03 result
Unrecognised character . on line 3 # input04 result
Unrecognised character a on line 1 # input05 result
It’s probably time to step back a bit and see where we’ve got to. We now have:
What we don’t have yet is a compiler. But we are so close to making our first compiler!
In the next part of our compiler writing journey, we will replace the interpreter. In its place, we will write a translator that generates x86-64 assembly code for each AST node that has a maths operator. We will also generate some assembly preamble and postamble to support the assembly code that the generator outputs.