Make a compiler that supports a subset of the ANSI-C programming language
I’ve started to pass some of the source code of our compiler as input to itself, as this is how we are going to get it to eventually compile itself. The first big hurdle is to get the compiler to parse and recognise its source code. The second big hurdle will be to get the compiler to generate correct, working, code from its source code.
This is also the first time that the compiler has been given some substantial input to chew on, and it’s going to reveal a bunch of bugs, misfeatures and missing features.
I started with cwj -S defs.h
and found several header files missing. For now
they exist but are empty. With these in place, the compiler crashes with a
segfault. I had a few pointers which should be initialised to NULL and places
where I wasn’t checking for a NULL pointer.
Next up, I hit enum { NOREG = -1 ...
in defs.h
and realised that the
scanner wasn’t dealing with integer literals which start with a minus sign.
So I’ve added this code to scan()
in scan.c
:
case '-':
if ((c = next()) == '-') {
t->token = T_DEC;
} else if (c == '>') {
t->token = T_ARROW;
} else if (isdigit(c)) { // Negative int literal
t->intvalue = -scanint(c);
t->token = T_INTLIT;
} else {
putback(c);
t->token = T_MINUS;
}
If a ‘-‘ is followed by a digit, scan in the integer literal and negate its
value. At first I was worried that the expression 1 - 1
would be treated
as the two tokens ‘1’, ‘int literal -1’, but I forgot that next()
doesn’t
skip the space. So, by having a space between the ‘-‘ and the ‘1’, the
expression 1 - 1
is correctly parsed as ‘1’, ‘-‘, ‘1’.
However, as Luke Gruber has pointed out,
this also means that the input 1-1
is treated as 1 -1
instead of
1 - 1
. In other words, the scanner is too greedy and forces -1
to
always be treated as a T_INTLIT when sometimes it shouldn’t be. I’m going
to leave this for now, as we can work around it when writing our source
code. Obviously, in a production compiler this would have to be fixed.
In the AST node and symbol table node structures, I’ve been using unions to try and keep the size of each node down. I guess I’m a bit old school and I worry about wasting memory. An example is the AST node structure:
struct ASTnode {
int op; // "Operation" to be performed on this tree
...
union { // the symbol in the symbol table
int intvalue; // For A_INTLIT, the integer value
int size; // For A_SCALE, the size to scale by
};
};
But the compiler isn’t able to parse and work with a union inside a struct, and especially an unnamed union inside a struct. I could add this functionality, but it will be easier to redo the two structs where I do this. So, I’ve made these changes:
// Symbol table structure
struct symtable {
char *name; // Name of a symbol
...
#define st_endlabel st_posn // For functions, the end label
int st_posn; // For locals, the negative offset
// from the stack base pointer
...
};
// Abstract Syntax Tree structure
struct ASTnode {
int op; // "Operation" to be performed on this tree
...
#define a_intvalue a_size // For A_INTLIT, the integer value
int a_size; // For A_SCALE, the size to scale by
};
This way, I still have two named fields sharing the same location in each
struct, but the compiler will see only the one field name in each struct.
I’ve given each #define
a different prefix to prevent pollution of the
global namespace.
A consequence of this is that I’ve had to rename the endlabel
, posn
,
intvalue
and size
fields across half a dozen source files. C’est la vie.
So now the compiler, when doing cwj -S misc.c
gets up to:
Expected:] on line 16 of data.h, where the line is
extern char Text[TEXTLEN + 1];
This fails as the compiler as it stands does not parse expressions in a global variable declaration. I’m going to have to rethink this.
My thoughts so far are to use binexpr()
to parse the expression,
and to add some optimisation code to perform
constant folding
on the resulting AST tree. This should result in a single A_INTLIT node
from which I can extract the literal value. I could even let binexpr()
parse any casts, e.g.
char x= (char)('a' + 1024);
Anyway, that’s something for the future. I was going to do constant folding at some point, but I thought it would be further down the track.
What I will do in this part of the journey is add some more operators: specifically, ‘+=’, ‘-=’, ‘*=’ and ‘/=’. We currently use the first two operators in the compiler’s source code.
Adding new keywords to our compiler is easy: a new token and a change to the scanner. Adding new operators is much harder as we have to:
We are adding four operators: ‘+=’, ‘-=’, ‘*=’ and ‘/=’. They have matching
tokens: T_ASPLUS, T_ASMINUS, T_ASSTAR and T_ASSLASH. These have
corresponding AST operations: A_ASPLUS, A_ASMINUS, A_ASSTAR, A_ASSLASH. The AST operations must have the same enum value as the tokens
because of this function in expr.c
:
// Convert a binary operator token into a binary AST operation.
// We rely on a 1:1 mapping from token to AST operation
static int binastop(int tokentype) {
if (tokentype > T_EOF && tokentype <= T_SLASH)
return (tokentype);
fatald("Syntax error, token", tokentype);
return (0); // Keep -Wall happy
}
We also need to configure the precedence of the new operators.
According to this list of C operators, these new operators have the same precedence as
our existing assignment operator, so we can modify the OpPrec[]
table in
expr.c
as follows:
// Operator precedence for each token. Must
// match up with the order of tokens in defs.h
static int OpPrec[] = {
0, 10, 10, // T_EOF, T_ASSIGN, T_ASPLUS,
10, 10, 10, // T_ASMINUS, T_ASSTAR, T_ASSLASH,
20, 30, // T_LOGOR, T_LOGAND
...
};
But that list of C operators also notes that the assignment operators are right_associative. This means, for example, that:
a += b + c; // needs to be parsed as
a += (b + c); // not
(a += b) + c;
So we also need to update this function in expr.c
to do this:
// Return true if a token is right-associative,
// false otherwise.
static int rightassoc(int tokentype) {
if (tokentype >= T_ASSIGN && tokentype <= T_ASSLASH)
return (1);
return (0);
}
Fortunately, these are the only changes we need to make to our scanner and expression parser: the Pratt parser for binary expressions is now primed to deal with the new operators.
Now that we can parse expressions with the four new operators, we need to
deal with the AST that is created for each expression. One thing we need to
do is dump the AST tree. So, in dumpAST()
in tree.c
, I added this code:
case A_ASPLUS:
fprintf(stdout, "A_ASPLUS\n"); return;
case A_ASMINUS:
fprintf(stdout, "A_ASMINUS\n"); return;
case A_ASSTAR:
fprintf(stdout, "A_ASSTAR\n"); return;
case A_ASSLASH:
fprintf(stdout, "A_ASSLASH\n"); return;
Now when I run cwj -T input.c
with the expression a += b + c
, I see:
A_IDENT rval a
A_IDENT rval b
A_IDENT rval c
A_ADD
A_ASPLUS
which we can redraw as:
A_ASPLUS
/ \
A_IDENT A_ADD
rval a / \
A_IDENT A_IDENT
rval b rval c
Well, in gen.c
we already walk the AST tree and deal with A_ADD and A_ASSIGN.
Is there a way to use the existing code to make implementing the new A_ASPLUS
operator a bit easier? Yes!
We can rewrite the above AST tree to look like this:
A_ASSIGN
/ \
A_ADD lval a
/ \
A_IDENT A_ADD
rval a / \
A_IDENT A_IDENT
rval b rval c
Now, we don’t have to rewrite the tree as long as we perform the tree walking as if the tree had been rewritten like this.
So in genAST()
, we have:
int genAST(...) {
...
// Get the left and right sub-tree values. This code already here.
if (n->left)
leftreg = genAST(n->left, NOLABEL, NOLABEL, NOLABEL, n->op);
if (n->right)
rightreg = genAST(n->right, NOLABEL, NOLABEL, NOLABEL, n->op);
}
From the perspective of doing the work for the A_ASPLUS node, we have
evaluated the left-hand child (e.g. a
’s value) and the right-hand child
(e.g. b+c
) and we have the values in two registers. If this was an A_ADD
operation, we would cgadd(leftreg, rightreg)
at this point. Well, it is
an A_ADD operation on these children, then followed by an assignment back into a
.
So, the genAST()
code now has this:
switch (n->op) {
...
case A_ASPLUS:
case A_ASMINUS:
case A_ASSTAR:
case A_ASSLASH:
case A_ASSIGN:
// For the '+=' and friends operators, generate suitable code
// and get the register with the result. Then take the left child,
// make it the right child so that we can fall into the assignment code.
switch (n->op) {
case A_ASPLUS:
leftreg= cgadd(leftreg, rightreg);
n->right= n->left;
break;
case A_ASMINUS:
leftreg= cgsub(leftreg, rightreg);
n->right= n->left;
break;
case A_ASSTAR:
leftreg= cgmul(leftreg, rightreg);
n->right= n->left;
break;
case A_ASSLASH:
leftreg= cgdiv(leftreg, rightreg);
n->right= n->left;
break;
}
// And the existing code to do A_ASSIGN is here
...
}
In other words, for each new operator, we perform the correct maths operation on the children. But before we can drop into the A_ASSIGN we have to move the left-child pointer over to be the right child. Why? Because the A_ASSIGN code expects the destination to be the right child:
return (cgstorlocal(leftreg, n->right->sym));
And that’s it. We were lucky to have code which we could adapt to add in these four new operators. There are more assignment operators which I haven’t implemented: ‘%=’, ‘<=’, ‘»=’, ‘&=’, ‘^=’ and ‘|=’. They should also be as easy to add as the four we just added.
The tests/input110.c
program is our testing program:
#include <stdio.h>
int x;
int y;
int main() {
x= 3; y= 15; y += x; printf("%d\n", y);
x= 3; y= 15; y -= x; printf("%d\n", y);
x= 3; y= 15; y *= x; printf("%d\n", y);
x= 3; y= 15; y /= x; printf("%d\n", y);
return(0);
}
and produces these results:
18
12
45
5
We’ve added some more operators, and the hardest part really was aligning all
the tokens, the AST operators and setting the precedence levels and
right-associativity. After that, we could reuse some of the code generation
code in genAST()
to make our lives a bit easier.
In the next part of our compiler writing journey, it looks like I’ll be adding constant folding to the compiler.