Make a compiler that supports a subset of the ANSI-C programming language
I was going to add IF statements next, but then I realised that I’d better add some comparison operators first. This turned out to be quite easy, because they are binary operators like the existing ones.
So let’s quickly see what the changes are to add the six comparison
operators: ==
, !=
, <
, >
, <=
and >=
.
We will have six new tokens, so let’s add them to defs.h
:
// Token types
enum {
T_EOF,
T_PLUS, T_MINUS,
T_STAR, T_SLASH,
T_EQ, T_NE,
T_LT, T_GT, T_LE, T_GE,
T_INTLIT, T_SEMI, T_ASSIGN, T_IDENT,
// Keywords
T_PRINT, T_INT
};
I’ve rearranged the tokens so that the ones with precedence come, in low-to-high precedence order, before the tokens that don’t have any precedence.
Now we have to scan them in. Note that we have to distinguish between
=
and ==
, <
and <=
, >
and >=
. So we will need to read in
an extra character from the input and put it back if we don’t need it.
Here’s the new code in scan()
from scan.c
:
case '=':
if ((c = next()) == '=') {
t->token = T_EQ;
} else {
putback(c);
t->token = T_ASSIGN;
}
break;
case '!':
if ((c = next()) == '=') {
t->token = T_NE;
} else {
fatalc("Unrecognised character", c);
}
break;
case '<':
if ((c = next()) == '=') {
t->token = T_LE;
} else {
putback(c);
t->token = T_LT;
}
break;
case '>':
if ((c = next()) == '=') {
t->token = T_GE;
} else {
putback(c);
t->token = T_GT;
}
break;
I also changed the name of the =
token to T_ASSIGN to ensure I didn’t get
confused between it and the new T_EQ token.
We can now scan in the six new tokens. So now we have to parse them when they appear in expressions, and also enforce their operator precedence.
By now you would have worked out that:
The implication is that I’m writing a compiler for enough of a subset of C (just as SubC) so that it will compile itself. Therefore, I should use the normal C operator precedence order. This means that the comparison operators have lower precedence than multiply and divide.
I also realised that the switch statement I was using to map tokens to AST
node types was only going to get bigger. So I decided to rearrange the
AST node types so that there is a 1:1 mapping between them for all the
binary operators (in defs.h
):
// AST node types. The first few line up
// with the related tokens
enum {
A_ADD=1, A_SUBTRACT, A_MULTIPLY, A_DIVIDE,
A_EQ, A_NE, A_LT, A_GT, A_LE, A_GE,
A_INTLIT,
A_IDENT, A_LVIDENT, A_ASSIGN
};
Now in expr.c
, I can simplify the token to AST node conversion and also
add in the new tokens’ precedence:
// Convert a binary operator token into an AST operation.
// We rely on a 1:1 mapping from token to AST operation
static int arithop(int tokentype) {
if (tokentype > T_EOF && tokentype < T_INTLIT)
return(tokentype);
fatald("Syntax error, token", tokentype);
}
// Operator precedence for each token. Must
// match up with the order of tokens in defs.h
static int OpPrec[] = {
0, 10, 10, // T_EOF, T_PLUS, T_MINUS
20, 20, // T_STAR, T_SLASH
30, 30, // T_EQ, T_NE
40, 40, 40, 40 // T_LT, T_GT, T_LE, T_GE
};
That’s it for the parsing and operator precedence!
As the six new operators are binary operators, it’s easy to modify the
generic code generator in gen.c
to deal with them:
case A_EQ:
return (cgequal(leftreg, rightreg));
case A_NE:
return (cgnotequal(leftreg, rightreg));
case A_LT:
return (cglessthan(leftreg, rightreg));
case A_GT:
return (cggreaterthan(leftreg, rightreg));
case A_LE:
return (cglessequal(leftreg, rightreg));
case A_GE:
return (cggreaterequal(leftreg, rightreg));
Now it gets a bit tricky. In C, the comparison operators return a value. If they evaluate true, their result is 1. If they evaluate false, their result is 0. We need to write x86-64 assembly code to reflect this.
Luckily there are some x86-64 instructions to do this. Unfortunately, there are some issues to deal with along the way. Consider this x86-64 instruction:
cmpq %r8,%r9
The above cmpq
instruction performs %r9 - %r8 and sets several status
flags including the negative and zero flags. Thus, we can look at the
flag combinations to see the result of the comparisons:
Comparison | Operation | Flags If True |
---|---|---|
%r8 == %r9 | %r9 - %r8 | Zero |
%r8 != %r9 | %r9 - %r8 | Not Zero |
%r8 > %r9 | %r9 - %r8 | Not Zero, Negative |
%r8 < %r9 | %r9 - %r8 | Not Zero, Not Negative |
%r8 >= %r9 | %r9 - %r8 | Zero or Negative |
%r8 <= %r9 | %r9 - %r8 | Zero or Not Negative |
There are six x86-64 instructions which set a register to 1 or 0 based on
the two flag values: sete
, setne
, setg
, setl
, setge
and setle
in the order of the above table rows.
The problem is, these instructions only set the lowest byte of a register. If the register already has bits set outside of the lowest byte, they will stay set. So we might set a variable to 1, but if it already has the value 1000 (decimal), then it will now be 1001 which is not what we want.
The solution is to andq
the register after the setX
instruction to
get rid of the unwanted bits. In cg.c
there is a general comparison
function to do this:
// Compare two registers.
static int cgcompare(int r1, int r2, char *how) {
fprintf(Outfile, "\tcmpq\t%s, %s\n", reglist[r2], reglist[r1]);
fprintf(Outfile, "\t%s\t%s\n", how, breglist[r2]);
fprintf(Outfile, "\tandq\t$255,%s\n", reglist[r2]);
free_register(r1);
return (r2);
}
where how
is one of the setX
instructions. Note that we perform
cmpq reglist[r2], reglist[r1]
because this is actually reglist[r1] - reglist[r2]
which is what we really
want.
We need to take a short diversion here to discuss the registers in the x86-64 architecture. x86-64 has several 64-bit general purpose registers, but we can also use different register names to access and work on subsections of these registers.
The above image from stack.imgur.com shows that, for the 64-bit r8 register, we can access the low 32 bits of this register by using the “r8d” register. Similarly, the “r8w” register is the low 16 bits and the “r8b” register is the low 8 bits of the r8 register.
In the cgcompare()
function, the code uses the reglist[]
array to
compare the two 64-bit registers, but then sets a flag in the 8-bit
version of the second register by using the names in the breglist[]
array.
The x86-64 architecture only allows the setX
instructions to operate on
the 8-bit register names, thus the need for the breglist[]
array.
Now that we have this general function, we can write the six actual comparison functions:
int cgequal(int r1, int r2) { return(cgcompare(r1, r2, "sete")); }
int cgnotequal(int r1, int r2) { return(cgcompare(r1, r2, "setne")); }
int cglessthan(int r1, int r2) { return(cgcompare(r1, r2, "setl")); }
int cggreaterthan(int r1, int r2) { return(cgcompare(r1, r2, "setg")); }
int cglessequal(int r1, int r2) { return(cgcompare(r1, r2, "setle")); }
int cggreaterequal(int r1, int r2) { return(cgcompare(r1, r2, "setge")); }
As with the other binary operator functions, one register is freed and the other register returns with the result.
Have a look at the input04
input file:
int x;
x= 7 < 9; print x;
x= 7 <= 9; print x;
x= 7 != 9; print x;
x= 7 == 7; print x;
x= 7 >= 7; print x;
x= 7 <= 7; print x;
x= 9 > 7; print x;
x= 9 >= 7; print x;
x= 9 != 7; print x;
All of these comparisons are true, so we should print nine 1s out. Do
a make test
to confirm this.
Let’s look at the assembly code output by the first comparison:
movq $7, %r8
movq $9, %r9
cmpq %r9, %r8 # Perform %r8 - %r9, i.e. 7 - 9
setl %r9b # Set %r9b to 1 if 7 is less than 9
andq $255,%r9 # Remove all other bits in %r9
movq %r9, x(%rip) # Save the result in x
movq x(%rip), %r8
movq %r8, %rdi
call printint # Print x out
Yes there is some inefficient assembly code above. We haven’t even started to worry about optimised code yet. To quote Donald Knuth:
Premature optimization is the root of all evil (or at least most of it) in programming.
That was a nice and easy addition to the compiler. The next part of the journey will be much more complicated.
In the next part of our compiler writing journey, we will add IF statements to the compiler and make use of the comparison operators that we just added.