Make a compiler that supports a subset of the ANSI-C programming language
This part of our compiler writing journey really doesn’t have anything to
do with scanning, parsing, semantic analysis or code generation. In this
part, I add the -c, -S and -o run-time flags to the compiler so
that it behaves more like a traditional Unix C compiler.
So, if that’s not interesting, feel free to skip to the next part of the journey.
Up to now, our compiler has only been outputting assembly files. But there are more steps needed to convert a source code file in a high-level language to an executable file: the source code is compiled to assembly code, the assembly code is assembled into an object file, and the object files are linked to produce the executable.
We’ve been doing the last two steps manually or with our Makefile, but I’m going to modify the compiler to call an external assembler and linker to perform them.
To do this, I’m going to rearrange some of the code in main.c and also
write more functions in main.c to do the assembling and linking. Most
of this code is typical string and file handling code done in C, so I’ll
go through the code but it may only be interesting if you’ve never seen
this sort of code.
I’ve renamed the compiler to be cwj to reflect the name of the project.
When you run it with no command-line arguments, it now gives this usage
message:
$ ./cwj
Usage: ./cwj [-vcST] [-o outfile] file [file ...]
-v give verbose output of the compilation stages
-c generate object files but don't link them
-S generate assembly files but don't link them
-T dump the AST trees for each input file
-o outfile, produce the outfile executable file
We now allow multiple source code files as inputs. We have four boolean
flags, -v, -c, -S and -T, and we can now name the output executable
file.
The argv[] parsing code in main() is now changed to deal with this, and
there are several more option variables to hold the results.
  // Initialise our variables
  O_dumpAST = 0;        // If true, dump the AST trees
  O_keepasm = 0;        // If true, keep any assembly files
  O_assemble = 0;       // If true, assemble the assembly files
  O_dolink = 1;         // If true, link the object files
  O_verbose = 0;        // If true, print info on compilation stages
  // Scan for command-line options
  for (i = 1; i < argc; i++) {
    // No leading '-', stop scanning for options
    if (*argv[i] != '-')
      break;
    // For each option in this argument
    for (int j = 1; (*argv[i] == '-') && argv[i][j]; j++) {
      switch (argv[i][j]) {
        case 'o':
          outfilename = argv[++i]; break;   // Save & skip to next argument
        case 'T':
          O_dumpAST = 1; break;
        case 'c':
          O_assemble = 1; O_keepasm = 0; O_dolink = 0; break;
        case 'S':
          O_keepasm = 1; O_assemble = 0; O_dolink = 0; break;
        case 'v':
          O_verbose = 1; break;
        default:
          usage(argv[0]);
      }
    }
  }
Note that some options are mutually exclusive, e.g. if we only want assembly
output with -S, then we don’t want to link or create object files.
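To make the "last flag wins" behaviour concrete, here is a minimal, self-contained sketch of the same option scan. It is not the compiler's actual code: the struct options type and the parse_options() function are hypothetical names used only for illustration, and the function reports errors by returning -1 instead of calling usage(). It also checks that -o is followed by a filename, which is an extra guard assumed here.

```c
#include <stddef.h>

// Hypothetical container for the option values; the compiler
// itself keeps these in separate O_* variables.
struct options {
  int dump_ast, keep_asm, assemble, do_link, verbose;
  char *outfilename;
};

// Scan the leading options in argv[]. Return the index of the first
// non-option argument, or -1 on an unknown flag or a missing -o name.
int parse_options(int argc, char **argv, struct options *o) {
  int i;

  o->dump_ast = 0; o->keep_asm = 0; o->assemble = 0;
  o->do_link = 1; o->verbose = 0; o->outfilename = NULL;

  for (i = 1; i < argc; i++) {
    if (argv[i][0] != '-') break;           // First input file reached

    for (int j = 1; argv[i][j] != '\0'; j++) {
      char opt = argv[i][j];

      if (opt == 'o') {
        if (i + 1 >= argc) return -1;       // -o with no filename
        o->outfilename = argv[++i];         // Consume the next argument
        break;                              // and stop scanning this one
      }
      switch (opt) {
      case 'T': o->dump_ast = 1; break;
      case 'c': o->assemble = 1; o->keep_asm = 0; o->do_link = 0; break;
      case 'S': o->keep_asm = 1; o->assemble = 0; o->do_link = 0; break;
      case 'v': o->verbose = 1; break;
      default:  return -1;                  // Unknown flag
      }
    }
  }
  return i;
}
```

With the arguments -S -c x.c, the -c flag is seen last, so the result is assemble set, keep_asm cleared and do_link cleared: each of -c and -S overwrites whatever the other one set.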
With the command-line flags parsed, we can now run the compilation stages.
We can compile and assemble each input file easily, but there may be a
number of object files that we need to link together at the end. So we have
some local variables in main() to store the object file names:
#define MAXOBJ 100
char *objlist[MAXOBJ]; // List of object file names
int objcnt = 0; // Position to insert next name
We first process all the input source files in turn:
  // Work on each input file in turn
  while (i < argc) {
    asmfile = do_compile(argv[i]);      // Compile the source file
    if (O_dolink || O_assemble) {
      objfile = do_assemble(asmfile);   // Assemble it to object format
      if (objcnt == (MAXOBJ - 2)) {
        fprintf(stderr, "Too many object files for the compiler to handle\n");
        exit(1);
      }
      objlist[objcnt++] = objfile;      // Add the object file's name
      objlist[objcnt] = NULL;           // to the list of object files
    }
    if (!O_keepasm)                     // Remove the assembly file if
      unlink(asmfile);                  // we don't need to keep it
    i++;
  }
do_compile() has the code that used to be in main() to
open the file, parse it and generate the assembly file.
But we can’t open up the hard-coded filename out.s like we used to;
we now need to convert filename.c to filename.s.
We have a helper function to alter filenames.
// Given a string with a '.' and at least a 1-character suffix
// after the '.', change the suffix to be the given character.
// Return the new string or NULL if the original string could
// not be modified
char *alter_suffix(char *str, char suffix) {
  char *posn;
  char *newstr;
  // Clone the string
  if ((newstr = strdup(str)) == NULL) return (NULL);
  // Find the '.'; free the clone if there isn't one
  if ((posn = strrchr(newstr, '.')) == NULL) { free(newstr); return (NULL); }
  // Ensure there is a suffix; free the clone if there isn't one
  posn++;
  if (*posn == '\0') { free(newstr); return (NULL); }
  // Change the suffix and NUL-terminate the string
  *posn++ = suffix; *posn = '\0';
  return (newstr);
}
Only the strdup(), strrchr() and the last two lines do any real work;
the rest is error checking.
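One edge case is worth knowing about: because strrchr() searches the whole string, a name such as build.d/prog (a dot in a directory name, no suffix on the file) is rewritten to build.s. The sketch below shows one possible way to guard against that by looking for the '.' only in the last path component. The name alter_suffix_checked() is hypothetical, and it assumes '/' is the path separator.

```c
#define _POSIX_C_SOURCE 200809L   // For strdup()
#include <stdlib.h>
#include <string.h>

// Hypothetical variant of alter_suffix(): only a '.' in the final
// path component counts as the start of a suffix.
// Returns a newly allocated string, or NULL on failure.
char *alter_suffix_checked(const char *str, char suffix) {
  char *newstr, *base, *posn;

  // Clone the string
  if ((newstr = strdup(str)) == NULL) return NULL;

  // Find the start of the last path component
  base = strrchr(newstr, '/');
  base = (base == NULL) ? newstr : base + 1;

  // Find the last '.' in that component and require a suffix after it
  posn = strrchr(base, '.');
  if (posn == NULL || posn[1] == '\0') {
    free(newstr);
    return NULL;
  }

  // Change the suffix and NUL-terminate the string
  posn[1] = suffix;
  posn[2] = '\0';
  return newstr;
}
```

Here tests/input54.c still becomes tests/input54.s, while build.d/prog and prog. both return NULL.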
Here is the code that we used to have, now repackaged into a new function.
// Given an input filename, compile that file
// down to assembly code. Return the new file's name
static char *do_compile(char *filename) {
  Outfilename = alter_suffix(filename, 's');
  if (Outfilename == NULL) {
    fprintf(stderr, "Error: %s has no suffix, try .c on the end\n", filename);
    exit(1);
  }
  // Open up the input file
  if ((Infile = fopen(filename, "r")) == NULL) {
    fprintf(stderr, "Unable to open %s: %s\n", filename, strerror(errno));
    exit(1);
  }
  // Create the output file
  if ((Outfile = fopen(Outfilename, "w")) == NULL) {
    fprintf(stderr, "Unable to create %s: %s\n", Outfilename,
            strerror(errno));
    exit(1);
  }
  Line = 1;                     // Reset the scanner
  Putback = '\n';
  clear_symtable();             // Clear the symbol table
  if (O_verbose)
    printf("compiling %s\n", filename);
  scan(&Token);                 // Get the first token from the input
  genpreamble();                // Output the preamble
  global_declarations();        // Parse the global declarations
  genpostamble();               // Output the postamble
  fclose(Outfile);              // Close the output file
  return (Outfilename);
}
There’s very little new code here, just the call to alter_suffix() to
get the correct output file’s name.
There is one important change: the name of the assembly output file is now
held in a global variable called Outfilename. This allows the fatal()
function and friends in misc.c to remove assembly files if we never fully
generated them, e.g.
// Print out fatal messages
void fatal(char *s) {
  fprintf(stderr, "%s on line %d\n", s, Line);
  fclose(Outfile);
  unlink(Outfilename);
  exit(1);
}
Now that we have assembly output files, we can call an external
assembler to turn them into object files. The assembler command is defined
as ASCMD in defs.h. Here’s the function that runs it:
#define ASCMD "as -o "
// Given an input filename, assemble that file
// down to object code. Return the object filename
char *do_assemble(char *filename) {
  char cmd[TEXTLEN];
  int err;
  char *outfilename = alter_suffix(filename, 'o');
  if (outfilename == NULL) {
    fprintf(stderr, "Error: %s has no suffix, try .s on the end\n", filename);
    exit(1);
  }
  // Build the assembly command and run it
  snprintf(cmd, TEXTLEN, "%s %s %s", ASCMD, outfilename, filename);
  if (O_verbose) printf("%s\n", cmd);
  err = system(cmd);
  if (err != 0) { fprintf(stderr, "Assembly of %s failed\n", filename); exit(1); }
  return (outfilename);
}
I’m using snprintf() to build the assembly command which we will run.
If the user used the -v command-line flag, this command will be shown to
them. Then we use system() to execute this Linux command. Example:
$ ./cwj -v -c tests/input54.c
compiling tests/input54.c
as -o tests/input54.o tests/input54.s
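As an illustration of the snprintf() plus system() pattern, the sketch below splits it into two small helpers. Both build_as_cmd() and run_cmd() are hypothetical names, as is the CMDLEN buffer size; they are not part of the compiler. The first helper also checks snprintf()'s return value so that an over-long filename is reported instead of producing a silently truncated command, and the second decodes the status from system() with the POSIX WIFEXITED and WEXITSTATUS macros.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>

#define CMDLEN 512              // Hypothetical command buffer size

// Build "as -o <objfile> <asmfile>" in buf.
// Return 0 on success, -1 if the command would not fit in the buffer.
int build_as_cmd(char *buf, size_t size,
                 const char *objfile, const char *asmfile) {
  // snprintf() returns the length the full string needs, so a value
  // of size or more means the output was truncated
  int n = snprintf(buf, size, "as -o %s %s", objfile, asmfile);
  return (n < 0 || (size_t) n >= size) ? -1 : 0;
}

// Run a command through the shell with system(). Return the command's
// exit status, or -1 if it could not be run or was killed by a signal.
int run_cmd(const char *cmd) {
  int status = system(cmd);
  if (status == -1 || !WIFEXITED(status)) return -1;
  return WEXITSTATUS(status);
}
```

One thing to keep in mind with system() is that the command goes through the shell, so a filename containing spaces or shell metacharacters would not be passed to the assembler as a single argument.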
Down in main() we build up a list of object files that do_assemble()
returns to us:
objlist[objcnt++] = objfile; // Add the object file's name
objlist[objcnt] = NULL; // to the list of object files
So, when we need to link them all together, we need to pass this list to
the do_link() function. The code is similar to do_assemble() in that
it uses snprintf() and system(). The difference is that we must
track where we are up to in our command buffer, and how much room is left
to do more snprintf()ing.
#define LDCMD "cc -o "
// Given a list of object files and an output filename,
// link all of the object filenames together.
void do_link(char *outfilename, char *objlist[]) {
  int cnt, size = TEXTLEN;
  char cmd[TEXTLEN], *cptr;
  int err;
  // Start with the linker command and the output file
  cptr = cmd;
  cnt = snprintf(cptr, size, "%s %s ", LDCMD, outfilename);
  cptr += cnt; size -= cnt;
  // Now append each object file
  while (*objlist != NULL) {
    cnt = snprintf(cptr, size, "%s ", *objlist);
    cptr += cnt; size -= cnt; objlist++;
  }
  if (O_verbose) printf("%s\n", cmd);
  err = system(cmd);
  if (err != 0) { fprintf(stderr, "Linking failed\n"); exit(1); }
}
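One detail about tracking the remaining room: snprintf() returns the number of characters the full string would have needed, not the number actually stored. If the list of object files is long enough to overflow the buffer, cnt can be larger than size, and both the pointer and the remaining size then go out of range. The following is a minimal sketch of the same append loop with that case checked. The helpers append_arg() and build_link_cmd() are hypothetical names, and the sketch only builds the command string; it does not run it.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper: append "<arg> " to the command in buf, where
// *used is the number of characters already written.
// Return 0 on success, -1 if the argument would not fit.
int append_arg(char *buf, size_t size, size_t *used, const char *arg) {
  int n;

  if (*used >= size) return -1;
  n = snprintf(buf + *used, size - *used, "%s ", arg);
  if (n < 0 || (size_t) n >= size - *used) return -1;   // Truncated
  *used += (size_t) n;
  return 0;
}

// Build "cc -o <outfilename> <obj1> <obj2> ... " from a NULL-terminated
// list of object files. Return 0 on success, -1 on overflow.
int build_link_cmd(char *buf, size_t size,
                   const char *outfilename, char *objlist[]) {
  size_t used = 0;

  // Start with the linker command and the output file
  if (append_arg(buf, size, &used, "cc -o") != 0) return -1;
  if (append_arg(buf, size, &used, outfilename) != 0) return -1;

  // Now append each object file
  for (; *objlist != NULL; objlist++)
    if (append_arg(buf, size, &used, *objlist) != 0) return -1;
  return 0;
}
```

The caller can then treat a -1 result as a fatal error, in the same way that the "Too many object files" check does in main().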
One annoyance is that I’m still calling the external C compiler cc to do
the linking. We really should be able to break this dependency on another
compiler.
A long time ago, it was possible to link a set of object files manually by doing, e.g.
$ ld -o out /lib/crt0.o file1.o file.o /usr/lib/libc.a
I assume that it should be possible to do a similar command on current Linux, but so far my Google-fu isn’t enough to work this out. If you read this and know the answer, let me know!
printint() and printchar()
Now that we can call printf() directly in the programs that we can compile,
we no longer need our hand-written printint() and printchar() functions.
I’ve removed lib/printint.c, and I’ve updated all of the tests in the
tests/ directory to use printf().
I’ve also updated the tests/mktests and tests/runtests scripts so that
they use the new compiler command-line arguments, and ditto the top-level
Makefile. So a make test still runs our regression tests OK.
That’s about it for this part of our journey. Our compiler now feels like the traditional Unix compilers that I’m used to.
I did promise to add in support for an external pre-processor in this step, but I decided against it. The main reason is that I would need to parse the filenames and line numbers that the pre-processor embeds in its output, e.g.
# 1 "tests/input54.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 31 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 32 "<command-line>" 2
# 1 "tests/input54.c"
int printf(char *fmt);
int main()
{
  int i;
  for (i=0; i < 20; i++) {
    printf("Hello world, %d\n", i);
  }
  return(0);
}
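To give a feel for what parsing these lines would involve, here is a minimal sketch that recognises one line marker and extracts the line number and filename with sscanf(). The function name parse_linemarker() and the FNAMELEN limit are hypothetical, and the sketch ignores the numeric flags that follow the filename. In a real scanner, the extracted values would presumably be used to update the current line number and the filename reported in error messages.

```c
#include <stdio.h>
#include <string.h>

#define FNAMELEN 256            // Hypothetical limit on filename length

// Parse a pre-processor line marker such as:  # 32 "<command-line>" 2
// On success, store the line number and filename and return 1.
// Return 0 if the line is not a line marker.
int parse_linemarker(const char *line, int *lineno, char fname[FNAMELEN]) {
  int n;
  char name[FNAMELEN];

  // %255[^"] reads up to 255 characters that are not a double quote
  if (sscanf(line, "# %d \"%255[^\"]\"", &n, name) != 2) return 0;

  *lineno = n;
  strcpy(fname, name);
  return 1;
}
```

An ordinary source line such as int printf(char *fmt); does not start with '#', so the function returns 0 and the line can be handed to the normal scanner.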
In the next part of our compiler writing journey, we will look at adding support for structs to our compiler. I think we might have to do another design step first before we get to implementing the changes.