Make a compiler that supports a subset of the ANSI-C programming language
In this part of our compiler writing journey, I started with a pointer
issue that needed to fix, and I ended up restructuring about half of
expr.c
and changing the API to another quarter of the functions in the
compiler. So this is a big step in terms of number of lines touched, but
not a big step in terms of fixes or improvements.
We’ll start with the problem that caused all of this. When running the compiler’s source code through itself, I realised that I couldn’t parse a chain of pointers, e.g. something like the expression:
ptr->next->next->next
The reason for this is that primary()
is called and gets the value of
the identifer at the beginning of the expression. If it sees a following
postfix operator, it then calls postfix()
to deal with it. postfix()
deals with, for example, one ->
operator and returns. And that’s it.
There is no loop to follow a chain of ->
operators.
Even worse, primary()
looks for a single identifier. This means that
it won’t parse the following, either:
ptrarray[4]->next OR
unionvar.member->next
because neither of these are single identifers before the ->
operator.
This happened because of the rapid prototyping nature of our development. I only add functionality one small step at a time, and I don’t usually look too far ahead in terms of future needs. So, now and then, we have to undo what has been written to make it more general and flexible.
If we look at the BNF Grammar for C, we see this:
primary_expression
: IDENTIFIER
| CONSTANT
| STRING_LITERAL
| '(' expression ')'
;
postfix_expression
: primary_expression
| postfix_expression '[' expression ']'
| postfix_expression '(' ')'
| postfix_expression '(' argument_expression_list ')'
| postfix_expression '.' IDENTIFIER
| postfix_expression '->' IDENTIFIER
| postfix_expression '++'
| postfix_expression '--'
;
In other words, we have things backwards. postfix
should call primary()
to get an AST node that represents the identifier. Then, we can loop looking
for any postfix tokens, parse them and add new AST parent nodes on to the
identifier node that we received back from primary()
.
It all sounds nice and simple except for one thing. The current
primary()
doesn’t build an AST node; it only parses the identifier and
leaves it in Text
. It’s the job of postfix()
to build the AST node or
AST tree for the identifier plus any postfix operations.
At the same time, the AST node structure in defs.h
only knows about the
primitive type:
// Abstract Syntax Tree structure
struct ASTnode {
int op; // "Operation" to be performed on this tree
int type; // Type of any expression this tree generates
int rvalue; // NOTE: no ctype
...
};
The reason for this is that we only recently added structs and unions. This,
and the fact that postfix()
did most of the parsing work meant that we
haven’t needed to store a pointer to the struct or union symbol for
an identifier which is a struct or union.
So, to fix things, we need to:
ctype
pointer to struct ASTnode
so that the full type
is stored in each AST node.ctype
of a node is stored.primary()
up near the top of expr.c
and get it to build
an AST node.postfix()
to call primary()
to get the unadorned AST node
for an identifier (A_IDENT).postfix()
to loop while there are postfix operators to process.That’s a lot and, as the AST node calls are sprinkled everywhere, every single source file in the compiler will need to be touched. Sigh.
I’m not going to bore you with all the details, but we can start with
the change to the AST node structure in defs.h
, and the main function
in tree.c
that builds an AST node:
// Abstract Syntax Tree structure
struct ASTnode {
int op; // "Operation" to be performed on this tree
int type; // Type of any expression this tree generates
struct symtable *ctype; // If struct/union, ptr to that type
...
};
// Build and return a generic AST node
struct ASTnode *mkastnode(int op, int type,
struct symtable *ctype, ...) {
...
// Copy in the field values and return it
n->op = op;
n->type = type;
n->ctype = ctype;
...
}
There are also changes to mkastleaf()
and mkastunary()
: they now receive
a ctype
and call mkastnode()
with this argument.
In the compiler there are about 40 calls to these three functions, so I’m not
going to go through each and every one. For most of them, there is a primitive
type
and ` ctype pointer available. Some calls set the AST node
type to P_INT and thus the
ctype is NULL. Some calls set the AST node type
to P_NONE and, again, the
ctype` is NULL.
modify_type()
The modify_type()
is used to determine if an AST node’s type is compatible
with another type and, if necessary, to widen the node to match the other
type. It calls mkastunary()
and thus we also need to provide it with a
ctype
argument. I’ve done this and, as a consequence, the six calls to
modify_type()
have had to be modified to pass in the ctype
of the
type which we are comparing the AST node against.
expr.c
Now we get to the meat of the changes, the restucturing of primary()
and postfix()
. I’ve already outlined what we have to do above. As with
much of what we’ve done, there are a few wrinkles along the way to iron out.
postfix()
postfix()
actually looks much cleaner now:
// Parse a postfix expression and return
// an AST node representing it. The
// identifier is already in Text.
static struct ASTnode *postfix(void) {
struct ASTnode *n;
// Get the primary expression
n = primary();
// Loop until there are no more postfix operators
while (1) {
switch (Token.token) {
...
default:
return (n);
}
}
We now call primary()
to get an identifier or a constant. Then we
loop applying any postfix operators to the AST node we received from
primary()
. We call out to helper functions like array_access()
and
member_access()
for [..]
, .
and ->
operators.
We do post-increment and post-decrement here. Now that there is a loop,
we have to check that we don’t try to do these operations more than once.
We also check that the AST we received from primary()
is an lvalue and
not an rvalue, as we need an address in memory to increment or decrement.
paren_expression()
I realised that the new primary()
function was getting a bit too big, so
I split some of its code off into a new function, paren_expression()
. This
parses expressions that are enclosed in (..)
: casts and ordinary
parenthesised expressions. The code is nearly identical to the old code, so
I won’t go into it here. It returns an AST node with the tree that represents
either a cast expression or a parenthesised expression.
primary()
This is where the biggest change has occurred. Firstly, here are the tokens it looks for:
primary()
and, on reflection,
perhaps I should make this into its own function.(..)
which is where paren_expression()
gets called.Looking at the code, primary()
now builds AST nodes for each of the above to
return to postfix()
. This used to be done in postfix
but I now do it
in primary()
.
member_access()
With the previous member_access()
, the global Text
variable still
held the identifier, and member_access()
built the AST node to represent
the struct/union identifier.
In the current member_access()
, we receive the AST node for the
struct/union identifier, and this could be an array element or a member of
another struct/union.
So the code is different in that we don’t build the leaf AST node for the original identifier anymore. We still build AST nodes to add on the offset from the base and dereference the pointer to the member.
One other difference is this code:
// Check that the left AST tree is a struct or union.
// If so, change it from an A_IDENT to an A_ADDR so that
// we get the base address, not the value at this address.
if (!withpointer) {
if (left->type == P_STRUCT || left->type == P_UNION)
left->op = A_ADDR;
else
fatal("Expression is not a struct/union");
}
Consider the expression foo.bar
. foo
is the name of a struct, for example,
and bar
is a member of that struct`.
In primary()
we will have created an A_IDENT AST node for foo
, because
we can’t tell if this is a scalar variable (e.g. int foo
) or a structure
(e.g. struct fred foo
). Now that we know it’s a struct or a union, we
need the base address of the struct and not the value at the base address.
So, the code converts the A_IDENT AST node operation into an A_ADDR operation
on the identifier.
I think I spent about two hours running through our hundred plus regression tests, finding things I’d missed and fixing them up. It certainly felt good to get through all the tests again.
tests/input128.c
now checks that we can follow a chain of pointers, which
was the whole point of this exercise:
struct foo {
int val;
struct foo *next;
};
struct foo head, mid, tail;
int main() {
struct foo *ptr;
tail.val= 20; tail.next= NULL;
mid.val= 15; mid.next= &tail;
head.val= 10; head.next= ∣
ptr= &head;
printf("%d %d\n", head.val, ptr->val);
printf("%d %d\n", mid.val, ptr->next->val);
printf("%d %d\n", tail.val, ptr->next->next->val);
return(0);
}
And tests/input129.c
checks that we can’t post-increment twice in a row.
Linestart
There is one more change that I made to the compiler as part of our effort to get it to self-compile.
The scanner was looking for a ‘#’ token. When it saw this token, it assumed that we had hit a C pre-processor line and it parsed this line. Unfortunately, I hadn’t tied the scanner down to looking in the first column of each line. So, when our compiler hit this source code line:
while (c == '#') {
it got upset that the ‘)’ ‘{‘ were not a C pre-processor line.
We now have a Linestart
variable which flags if the scanner is at the
first column of a new line or not. The main function which is modified
is next()
in scan.c
. I think the changes are a bit ugly but they work;
I should come back sometime and see if I can clean this up a bit. Anyway,
we only expect C pre-processor lines when we see a ‘#’ in column 1.
In the next part of our compiler writing journey, I’ll go back to feeding the compiler source code to itself, see what errors pop and up choose one or more to fix.