Make a compiler that supports a subset of the ANSI-C programming language
Now that we can compare values, it’s time to add IF statements to our language. So, firstly, let’s look at the general syntax of IF statements and how they get converted into assembly language.
The IF statement syntax is:
if (condition is true)
perform this first block of code
else
perform this other block of code
Now, how is this normally converted into assembly language? It turns out that we do the opposite comparison and jump/branch if the opposite comparison is true:
perform the opposite comparison
jump to L1 if true
perform the first block of code
jump to L2
L1:
perform the other block of code
L2:
where L1
and L2
are assembly language labels.
Right now, we output code to set a register based on a comparison, e.g.
int x; x= 7 < 9; From input04
becomes
movq $7, %r8
movq $9, %r9
cmpq %r9, %r8
setl %r9b Set if less than
andq $255,%r9
But for an IF statement, we need to jump on the opposite comparison:
if (7 < 9)
should become:
movq $7, %r8
movq $9, %r9
cmpq %r9, %r8
jge L1 Jump if greater then or equal to
....
L1:
So, I’ve implemented IF statements in this part of our journey. As this is a working project, I did have to undo a few things and refactor them as part of the journey. I’ll try to cover the changes as well as the additions along the way.
We are going to need a bunch of new tokens in our language. I also (for now) want to
avoid the dangling else problem. To that
end, I’ve changed the grammar so that all groups of statements are wrapped around
‘{‘ … ‘}’ curly brackets; I called such a grouping a “compound statement”.
We also need ‘(‘ … ‘)’ parentheses to hold the IF expression, plus keywords ‘if’ and
‘else’. Thus, the new tokens are (in defs.h
):
T_LBRACE, T_RBRACE, T_LPAREN, T_RPAREN,
// Keywords
..., T_IF, T_ELSE
The single-character tokens should be obvious and I won’t give the code to scan them.
The keywords should also be pretty obvious, but I’ll give the scanning code from
keyword()
in scan.c
:
switch (*s) {
case 'e':
if (!strcmp(s, "else"))
return (T_ELSE);
break;
case 'i':
if (!strcmp(s, "if"))
return (T_IF);
if (!strcmp(s, "int"))
return (T_INT);
break;
case 'p':
if (!strcmp(s, "print"))
return (T_PRINT);
break;
}
Our grammar is starting to get big, so I’ve rewritten it somewhat:
compound_statement: '{' '}' // empty, i.e. no statement
| '{' statement '}'
| '{' statement statements '}'
;
statement: print_statement
| declaration
| assignment_statement
| if_statement
;
print_statement: 'print' expression ';' ;
declaration: 'int' identifier ';' ;
assignment_statement: identifier '=' expression ';' ;
if_statement: if_head
| if_head 'else' compound_statement
;
if_head: 'if' '(' true_false_expression ')' compound_statement ;
identifier: T_IDENT ;
I’ve left out the definition of true_false_expression
, but at some point when
we’ve added a few more operators I’ll add it in.
Note the grammar for the IF statement: it’s either an if_head
(with no ‘else’ clause),
or an if_head
followed by a ‘else’ and a compound_statement
.
I’ve separated out all the different statement types to have their own non-terminal
name. Also, the previous statements
non-terminal is now the compound_statement
non-terminal, and this requires ‘{‘ … ‘}’ around the statements.
This means that the compound_statement
in the head is surrounded by ‘{‘ … ‘}’
and so is any compound_statement
after the ‘else’ keyword. So if we have nested IF
statements, they have to look like:
if (condition1 is true) {
if (condition2 is true) {
statements;
} else {
statements;
}
} else {
statements;
}
and there is no ambiguity about which ‘if’ each ‘else’ belongs to. This solves the dangling else problem. Later on, I’ll make the ‘{‘ … ‘}’ optional.
The old void statements()
function is now compound_statement()
and looks like this:
// Parse a compound statement
// and return its AST
struct ASTnode *compound_statement(void) {
struct ASTnode *left = NULL;
struct ASTnode *tree;
// Require a left curly bracket
lbrace();
while (1) {
switch (Token.token) {
case T_PRINT:
tree = print_statement();
break;
case T_INT:
var_declaration();
tree = NULL; // No AST generated here
break;
case T_IDENT:
tree = assignment_statement();
break;
case T_IF:
tree = if_statement();
break;
case T_RBRACE:
// When we hit a right curly bracket,
// skip past it and return the AST
rbrace();
return (left);
default:
fatald("Syntax error, token", Token.token);
}
// For each new tree, either save it in left
// if left is empty, or glue the left and the
// new tree together
if (tree) {
if (left == NULL)
left = tree;
else
left = mkastnode(A_GLUE, left, NULL, tree, 0);
}
}
Firstly, note that the code forces the parser to match the ‘{‘ at the start of the
compound statement with lbrace()
, and we can only exit when we’ve matched the ending
‘}’ with rbrace()
.
Secondly, note that print_statement()
, assignment_statement()
and
if_statement()
all return an AST tree, as does compound_statement()
.
In our old code, print_statement()
itself called genAST()
to evaluate the
expression, followed by a call to genprintint()
. Similarly,
assignment_statement()
also called genAST()
to do the assignment.
Well, this means that we have AST trees over here, and others over there. It makes some
sense to generate just a single AST tree, and call genAST()
once to generate the
assembly code for it.
This isn’t mandatory. For example, SubC only generates ASTs for expressions. For the structural parts of the language, like statements, SubC makes specific calls to the code generator as I was doing in the previous versions of the compiler.
I’ve decided to, for now, generate a single AST tree for the whole input with the parser. Once the input has been parsed, the assembly output can be generated from the one AST tree.
Later on, I’ll probably generate an AST tree for each function. Later.
Because we are a recursive descent parser, parsing the IF statement is not too bad:
// Parse an IF statement including
// any optional ELSE clause
// and return its AST
struct ASTnode *if_statement(void) {
struct ASTnode *condAST, *trueAST, *falseAST = NULL;
// Ensure we have 'if' '('
match(T_IF, "if");
lparen();
// Parse the following expression
// and the ')' following. Ensure
// the tree's operation is a comparison.
condAST = binexpr(0);
if (condAST->op < A_EQ || condAST->op > A_GE)
fatal("Bad comparison operator");
rparen();
// Get the AST for the compound statement
trueAST = compound_statement();
// If we have an 'else', skip it
// and get the AST for the compound statement
if (Token.token == T_ELSE) {
scan(&Token);
falseAST = compound_statement();
}
// Build and return the AST for this statement
return (mkastnode(A_IF, condAST, trueAST, falseAST, 0));
}
Right now, I don’t want to deal with input like if (x-2)
, so I’ve limited
the binary expression from binexpr()
to have a root which is one of the
six comparison operators A_EQ, A_NE, A_LT, A_GT, A_LE or A_GE.
I nearly smuggled something past you without properly explaining it. In
the last line of if_statement()
I build an AST node with:
mkastnode(A_IF, condAST, trueAST, falseAST, 0);
That’s three AST sub-trees! What’s going on here? As you can see, the IF statement will have three children:
So we now need an AST node structure with three children (in defs.h
):
// AST node types.
enum {
...
A_GLUE, A_IF
};
// Abstract Syntax Tree structure
struct ASTnode {
int op; // "Operation" to be performed on this tree
struct ASTnode *left; // Left, middle and right child trees
struct ASTnode *mid;
struct ASTnode *right;
union {
int intvalue; // For A_INTLIT, the integer value
int id; // For A_IDENT, the symbol slot number
} v;
};
Thus, an A_IF tree looks like this:
IF
/ | \
/ | \
/ | \
/ | \
/ | \
/ | \
condition statements statements
There is also a new A_GLUE AST node type. What is this used for? We now build a single AST tree with lots of statements, so we need a way to glue them together.
Review the end of the compound_statement()
loop code:
if (left != NULL)
left = mkastnode(A_GLUE, left, NULL, tree, 0);
Each time we get a new sub-tree, we glue it on to the existing tree. So, for this sequence of statements:
stmt1;
stmt2;
stmt3;
stmt4;
we end up with:
A_GLUE
/ \
A_GLUE stmt4
/ \
A_GLUE stmt3
/ \
stmt1 stmt2
And, as we traverse the tree depth-first left to right, this still generates the assembly code in the correct order.
Now that our AST nodes have multiple children, our generic code generator is going to become a bit more complicated. Also, for the comparison operators, we need to know if we are doing the compare as part of an IF statement (jump on the opposite comparison) or a normal expression (set register to 1 or 0 on the normal comparison).
To this end, I’ve modified getAST()
so that we can pass in the
parent AST nodes operation:
// Given an AST, the register (if any) that holds
// the previous rvalue, and the AST op of the parent,
// generate assembly code recursively.
// Return the register id with the tree's final value
int genAST(struct ASTnode *n, int reg, int parentASTop) {
...
}
The code in genAST()
now has to deal with specific AST nodes:
// We now have specific AST node handling at the top
switch (n->op) {
case A_IF:
return (genIFAST(n));
case A_GLUE:
// Do each child statement, and free the
// registers after each child
genAST(n->left, NOREG, n->op);
genfreeregs();
genAST(n->right, NOREG, n->op);
genfreeregs();
return (NOREG);
}
If we don’t return, we carry on to do the normal binary operator AST nodes, with one exception: the comparison nodes:
case A_EQ:
case A_NE:
case A_LT:
case A_GT:
case A_LE:
case A_GE:
// If the parent AST node is an A_IF, generate a compare
// followed by a jump. Otherwise, compare registers and
// set one to 1 or 0 based on the comparison.
if (parentASTop == A_IF)
return (cgcompare_and_jump(n->op, leftreg, rightreg, reg));
else
return (cgcompare_and_set(n->op, leftreg, rightreg));
I’ll cover the new functions cgcompare_and_jump()
and
cgcompare_and_set()
below.
We deal with the A_IF AST node with a specific function, along with a function to generate new label numbers:
// Generate and return a new label number
static int label(void) {
static int id = 1;
return (id++);
}
// Generate the code for an IF statement
// and an optional ELSE clause
static int genIFAST(struct ASTnode *n) {
int Lfalse, Lend;
// Generate two labels: one for the
// false compound statement, and one
// for the end of the overall IF statement.
// When there is no ELSE clause, Lfalse _is_
// the ending label!
Lfalse = label();
if (n->right)
Lend = label();
// Generate the condition code followed
// by a zero jump to the false label.
// We cheat by sending the Lfalse label as a register.
genAST(n->left, Lfalse, n->op);
genfreeregs();
// Generate the true compound statement
genAST(n->mid, NOREG, n->op);
genfreeregs();
// If there is an optional ELSE clause,
// generate the jump to skip to the end
if (n->right)
cgjump(Lend);
// Now the false label
cglabel(Lfalse);
// Optional ELSE clause: generate the
// false compound statement and the
// end label
if (n->right) {
genAST(n->right, NOREG, n->op);
genfreeregs();
cglabel(Lend);
}
return (NOREG);
}
Effectively, the code is doing:
genAST(n->left, Lfalse, n->op); // Condition and jump to Lfalse
genAST(n->mid, NOREG, n->op); // Statements after 'if'
cgjump(Lend); // Jump to Lend
cglabel(Lfalse); // Lfalse: label
genAST(n->right, NOREG, n->op); // Statements after 'else'
cglabel(Lend); // Lend: label
So we now have a few new x86-64 code generation functions. Some of
these replace the six cgXXX()
comparison functions we created in the
last part of the journey.
For the normal comparison functions, we now pass in the AST operation
to choose the relevant x86-64 set
instruction:
// List of comparison instructions,
// in AST order: A_EQ, A_NE, A_LT, A_GT, A_LE, A_GE
static char *cmplist[] =
{ "sete", "setne", "setl", "setg", "setle", "setge" };
// Compare two registers and set if true.
int cgcompare_and_set(int ASTop, int r1, int r2) {
// Check the range of the AST operation
if (ASTop < A_EQ || ASTop > A_GE)
fatal("Bad ASTop in cgcompare_and_set()");
fprintf(Outfile, "\tcmpq\t%s, %s\n", reglist[r2], reglist[r1]);
fprintf(Outfile, "\t%s\t%s\n", cmplist[ASTop - A_EQ], breglist[r2]);
fprintf(Outfile, "\tmovzbq\t%s, %s\n", breglist[r2], reglist[r2]);
free_register(r1);
return (r2);
}
I’ve also found an x86-64 instruction movzbq
that moves the lowest byte from
one register and extends it to fit into a 64-bit register. I’m using that now
instead of the and $255
in the old code.
We need a functions to generate a label and to jump to it:
// Generate a label
void cglabel(int l) {
fprintf(Outfile, "L%d:\n", l);
}
// Generate a jump to a label
void cgjump(int l) {
fprintf(Outfile, "\tjmp\tL%d\n", l);
}
Finally, we need a function to do a comparison and to jump based on the opposite comparison. So, using the AST comparison node type, we do the opposite comparison:
// List of inverted jump instructions,
// in AST order: A_EQ, A_NE, A_LT, A_GT, A_LE, A_GE
static char *invcmplist[] = { "jne", "je", "jge", "jle", "jg", "jl" };
// Compare two registers and jump if false.
int cgcompare_and_jump(int ASTop, int r1, int r2, int label) {
// Check the range of the AST operation
if (ASTop < A_EQ || ASTop > A_GE)
fatal("Bad ASTop in cgcompare_and_set()");
fprintf(Outfile, "\tcmpq\t%s, %s\n", reglist[r2], reglist[r1]);
fprintf(Outfile, "\t%s\tL%d\n", invcmplist[ASTop - A_EQ], label);
freeall_registers();
return (NOREG);
}
Do a make test
which compiles the input05
file:
{
int i; int j;
i=6; j=12;
if (i < j) {
print i;
} else {
print j;
}
}
Here’s the resulting assembly output:
movq $6, %r8
movq %r8, i(%rip) # i=6;
movq $12, %r8
movq %r8, j(%rip) # j=12;
movq i(%rip), %r8
movq j(%rip), %r9
cmpq %r9, %r8 # Compare %r8-%r9, i.e. i-j
jge L1 # Jump to L1 if i >= j
movq i(%rip), %r8
movq %r8, %rdi # print i;
call printint
jmp L2 # Skip the else code
L1:
movq j(%rip), %r8
movq %r8, %rdi # print j;
call printint
L2:
And, of course, make test
shows:
cc -o comp1 -g cg.c decl.c expr.c gen.c main.c misc.c
scan.c stmt.c sym.c tree.c
./comp1 input05
cc -o out out.s
./out
6 # As 6 is less than 12
We’ve added our first control structure to our language with the IF statement. I had to rewrite a few existing things along the way and, given I don’t have a complete architectural plan in my head, I’ll likely have to rewrite more things in the future.
The wrinkle for this part of the journey was that we had to perform the opposite comparison for the IF decision than what we would do for the normal comparison operators. My solution was to inform each AST node of the node type of their parent; the comparison nodes can now see if the parent is an A_IF node or not.
I know that Nils Holm chose a different approach when he was implementing SubC, so you should look at his code just to see this different solution to the same problem.
In the next part of our compiler writing journey, we will add another control structure: the WHILE loop.