Make a compiler that supports a subset of the ANSI-C programming language
It’s time to add some “proper” statements to the grammar of our language. I want to be able to write lines of code like this:
print 2 + 3 * 5;
print 18 - 6/3 + 4*2;
Of course, as we are ignoring whitespace, there’s no necessity that
all the tokens for one statement are on the same line. Each statement
starts with the keyword print
and is terminated with a semicolon. So
these are going to become new tokens in our language.
We’ve already seen the BNF notation for expressions. Now let’s define the BNF syntax for the above types of statements:
statements: statement
| statement statements
;
statement: 'print' expression ';'
;
An input file consists of several statements. They are either one statement,
or a statement followed by more statements. Each statement starts with the
keyword print
, then one expression, then a semicolon.
Before we can get to the code that parses the above syntax, we need to add a few more bits and pieces to the existing code. Let’s start with the lexical scanner.
Adding a token for semicolons will be easy. Now, the print
keyword.
Later on, we’ll have many keywords in the language, plus identifiers
for our variables, so we’ll need to add some code which helps us to
deal with them.
In scan.c
, I’ve added this code which I’ve borrowed from the SubC
compiler. It reads in alphanumeric characters into a
buffer until it hits a non-alphanumeric character.
// Scan an identifier from the input file and
// store it in buf[]. Return the identifier's length
static int scanident(int c, char *buf, int lim) {
int i = 0;
// Allow digits, alpha and underscores
while (isalpha(c) || isdigit(c) || '_' == c) {
// Error if we hit the identifier length limit,
// else append to buf[] and get next character
if (lim - 1 == i) {
printf("identifier too long on line %d\n", Line);
exit(1);
} else if (i < lim - 1) {
buf[i++] = c;
}
c = next();
}
// We hit a non-valid character, put it back.
// NUL-terminate the buf[] and return the length
putback(c);
buf[i] = '\0';
return (i);
}
We also need a function to recognise keywords in the language. One way
would be to have a list of keywords, and to walk the list and strcmp()
each one against the buffer from scanident()
. The code from SubC has
an optimisation: match against the first letter before doing the strcmp()
.
This speeds up the comparison against dozens of keywords. Right now we
don’t need this optimisation but I’ve put it in for later:
// Given a word from the input, return the matching
// keyword token number or 0 if it's not a keyword.
// Switch on the first letter so that we don't have
// to waste time strcmp()ing against all the keywords.
static int keyword(char *s) {
switch (*s) {
case 'p':
if (!strcmp(s, "print"))
return (T_PRINT);
break;
}
return (0);
}
Now, at the bottom of the switch statement in scan()
, we add this code
to recognise semicolons and keywords:
case ';':
t->token = T_SEMI;
break;
default:
// If it's a digit, scan the
// literal integer value in
if (isdigit(c)) {
t->intvalue = scanint(c);
t->token = T_INTLIT;
break;
} else if (isalpha(c) || '_' == c) {
// Read in a keyword or identifier
scanident(c, Text, TEXTLEN);
// If it's a recognised keyword, return that token
if (tokentype = keyword(Text)) {
t->token = tokentype;
break;
}
// Not a recognised keyword, so an error for now
printf("Unrecognised symbol %s on line %d\n", Text, Line);
exit(1);
}
// The character isn't part of any recognised token, error
printf("Unrecognised character %c on line %d\n", c, Line);
exit(1);
I’ve also added a global Text
buffer to store the keywords and
identifiers:
#define TEXTLEN 512 // Length of symbols in input
extern_ char Text[TEXTLEN + 1]; // Last identifier scanned
Up to now our input files have contained just a single expression; therefore,
in our Pratt parser code in binexpr()
(in expr.c
), we had this code to
exit the parser:
// If no tokens left, return just the left node
tokentype = Token.token;
if (tokentype == T_EOF)
return (left);
With our new grammar, each expression is terminated by a semicolon. Thus,
we need to change the code in the expression parser to spot the T_SEMI
tokens and exit the expression parsing:
// Return an AST tree whose root is a binary operator.
// Parameter ptp is the previous token's precedence.
struct ASTnode *binexpr(int ptp) {
struct ASTnode *left, *right;
int tokentype;
// Get the integer literal on the left.
// Fetch the next token at the same time.
left = primary();
// If we hit a semicolon, return just the left node
tokentype = Token.token;
if (tokentype == T_SEMI)
return (left);
while (op_precedence(tokentype) > ptp) {
...
// Update the details of the current token.
// If we hit a semicolon, return just the left node
tokentype = Token.token;
if (tokentype == T_SEMI)
return (left);
}
}
I want to keep the generic code generator in gen.c
separate from the CPU-specific code in cg.c
. That also means
that the rest of the compiler should only ever call the functions in
gen.c
, and only gen.c
should call the code in cg.c
.
To this end, I’ve defined some new “front-end” functions in gen.c
:
void genpreamble() { cgpreamble(); }
void genpostamble() { cgpostamble(); }
void genfreeregs() { freeall_registers(); }
void genprintint(int reg) { cgprintint(reg); }
We have a new file stmt.c
. This will hold the parsing code for all
the main statements in our language. Right now, we need to parse the
BNF grammar for statements which I gave up above. This is done with
this single function. I’ve converted the recursive definition into
a loop:
// Parse one or more statements
void statements(void) {
struct ASTnode *tree;
int reg;
while (1) {
// Match a 'print' as the first token
match(T_PRINT, "print");
// Parse the following expression and
// generate the assembly code
tree = binexpr(0);
reg = genAST(tree);
genprintint(reg);
genfreeregs();
// Match the following semicolon
// and stop if we are at EOF
semi();
if (Token.token == T_EOF)
return;
}
}
In each loop, the code finds a T_PRINT token. It then calls binexpr()
to
parse the expression. Finally, it finds the T_SEMI token. If a T_EOF token
follows, we break out of the loop.
After each expression tree, the code in gen.c
is called to convert
the tree into assembly code and to call the assembly printint()
function
to print out the final value.
There are a couple of new helper functions in the above code, which I’ve put
into a new file, misc.c
:
// Ensure that the current token is t,
// and fetch the next token. Otherwise
// throw an error
void match(int t, char *what) {
if (Token.token == t) {
scan(&Token);
} else {
printf("%s expected on line %d\n", what, Line);
exit(1);
}
}
// Match a semicon and fetch the next token
void semi(void) {
match(T_SEMI, ";");
}
These form part of the syntax checking in the parser. Later on, I’ll add
more short functions to call match()
to make our syntax checking easier.
main()
main()
used to call binexpr()
directly to parse the single expression
in the old input files. Now it does this:
scan(&Token); // Get the first token from the input
genpreamble(); // Output the preamble
statements(); // Parse the statements in the input
genpostamble(); // Output the postamble
fclose(Outfile); // Close the output file and exit
exit(0);
That’s about it for the new and changed code. Let’s give the new code
a whirl. Here is the new input file, input01
:
print 12 * 3;
print
18 - 2
* 4; print
1 + 2 +
9 - 5/2 + 3*5;
Yes I’ve decided to check that we have have tokens spread out across multiple
lines. To compile and run the input file, do a make test
:
$ make test
cc -o comp1 -g cg.c expr.c gen.c main.c misc.c scan.c stmt.c tree.c
./comp1 input01
cc -o out out.s
./out
36
10
25
And it works!
We’ve added our first “real” statement grammar to our language. I’ve defined it in BNF notation, but it was easier to implement it with a loop and not recursively. Don’t worry, we’ll go back to doing recursive parsing soon.
Along the way we had to modify the scanner, add support for keywords and identifiers, and to more cleanly separate the generic code generator and the CPU-specific generator.
In the next part of our compiler writing journey, we will add variables to the language. This will require a significant amount of work.