When it comes to designing tests for a compiler, the usual unit testing principles apply. However, taking text and producing errors or warnings about it is such a large feature that it motivates a purpose-built test setup. Other parts of a compiler, such as middle-end IR generation and backend code generation, also deserve their own special test setups, but those will wait for another article. A compiler frontend serves merely as an example in this article, which is more about testing than about compilers.
The API under test
The high-level API of your compiler probably takes source code as input and produces messages such as errors when appropriate. Even simple languages have many inputs that such an interface needs to be tested against; that is, we need to verify that the compiler gives us the right messages at the right times. Compilers of course do a lot more than this, but this is the sort of interface I will discuss testing in this post. The ideas here apply equally well to any system under test (SUT) that takes textual input and produces some kind of annotations for it.
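For concreteness, the snippets below assume a small interface along these lines; the names (Compiler, AddSourceCode, Run, Message) are placeholders for whatever your compiler actually exposes, not a real API:

using System.Collections.Generic;

// Hypothetical message type: what the compiler said and where.
record Message(string Text, int Line);

// Hypothetical compiler facade used throughout this article.
class Compiler {
    public List<Message> Messages { get; } = new();

    public void AddSourceCode(string source) { /* queue the source for compilation */ }

    public void Run() { /* compile everything added so far, filling Messages */ }
}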
Accepting the input
Assuming we have some functions we can call to compile a given source and to get the generated messages, our first question might be: how do we provide the input source code? Writing the input source code as a string literal is attractive because it makes the code visible right there in the unit test. But this is not very pretty if the language you’re writing the unit test in lacks multiline string literals.
In C# with NUnit, you could have code like this:
[Test]
public void TestDeclareFunction() {
    var compiler = new Compiler();
    compiler.AddSourceCode(@"class C {
    void Function() { }
}");
    compiler.Run();
    Assert.AreEqual(0, compiler.Messages.Count);
}
[Test]
public void TestDeclareField() {
    var compiler = new Compiler();
    compiler.AddSourceCode(@"class C {
    int foo;
}");
    compiler.Run();
    Assert.AreEqual(0, compiler.Messages.Count);
}
After a little time spent following that pattern, we observe that we’re repeating the boilerplate before and after the source code, so the obvious improvement is to factor it out:
void AssertNoMessages(string sourceCode) {
    var compiler = new Compiler();
    compiler.AddSourceCode(sourceCode);
    compiler.Run();
    Assert.AreEqual(0, compiler.Messages.Count);
}
[Test]
public void TestDeclareFunction() {
    AssertNoMessages(@"class C {
    void Function() { }
}");
}

[Test]
public void TestDeclareField() {
    AssertNoMessages(@"class C {
    int foo;
}");
}
This is better. Now let’s add support for asserting the presence of certain messages rather than just the absence of all messages. We want to be more specific than merely counting the number of messages generated—we’d like to specify the type or content of the message, as well as its associated location in the code. For example, let’s say our test case is the string:
class C {
    int foo;
    int foo;
}
Let’s say that in our compiler’s language, this program is considered to have errors at both occurrences of int foo;. This could lead us to write some test code like this:
[Test]
public void TestDuplicateFieldDeclaration() {
    var compiler = new Compiler();
    compiler.AddSourceCode(@"class C {
    int foo;
    int foo;
}");
    compiler.Run();
    Assert.IsTrue(compiler.Messages.Any(m => m.Text.Contains("same name") && m.Line == 2));
    Assert.IsTrue(compiler.Messages.Any(m => m.Text.Contains("same name") && m.Line == 3));
    Assert.AreEqual(2, compiler.Messages.Count);
}
Again, we could create a helper function to condense the code for asserting the presence of a message matching certain criteria.
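Such a helper might look like the following sketch, which assumes the same hypothetical Compiler and Message types as before:

using System.Linq;

// Hypothetical helper: assert that exactly one message on the given line
// contains all of the given words.
void AssertHasMessage(Compiler compiler, int line, params string[] words) {
    var matches = compiler.Messages
        .Where(m => m.Line == line && words.All(w => m.Text.Contains(w)))
        .ToList();
    Assert.AreEqual(1, matches.Count,
        $"Expected exactly one message on line {line} containing: {string.Join(", ", words)}");
}

With it, the two Any assertions above collapse to AssertHasMessage(compiler, 2, "same name") and AssertHasMessage(compiler, 3, "same name"); the final count assertion is still useful for rejecting any messages we did not expect.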
Rethinking matching on message text
How should we decide when an actual message matches our expectations? Checking that the message is associated with a certain source line is simple enough, but what about the message’s content? There are several options. We could require the message text to be character-for-character equal to the expectation, or we could relax that in various ways: ignoring case and punctuation, or only requiring that the message contain all of the expected words, in any order.
One nice way to express message expectations is to denote them by a preassigned number. You can have your compiler tag each kind of error or warning it produces with a unique number. Then your message expectation can include this number, and checking whether it matches an actual message becomes as easy as comparing numbers. Using numbers instead of relying on strings also makes your tests more stable in the face of rewording of messages, and less prone to false matches: an incorrect but similarly worded message might still pass a textual check, but it will not carry the expected number.
You can also use a hybrid approach, such as an assigned number combined with particular words that must be present.
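As a sketch, assuming the compiler’s messages carry a numeric code (a hypothetical Code field, which the minimal Message sketched earlier does not have; this also uses System.Linq and StringComparison), a hybrid matcher could look like:

// Hypothetical hybrid matcher: the numeric code identifies the message kind,
// and any expected words must also appear in the text (case-insensitively).
bool Matches(Message actual, int expectedCode, int expectedLine, params string[] expectedWords) {
    return actual.Code == expectedCode
        && actual.Line == expectedLine
        && expectedWords.All(w => actual.Text.Contains(w, StringComparison.OrdinalIgnoreCase));
}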
Negative expectations are also possible, though I have found a blanket rejection of any message not specifically expected to suffice.
Reducing test code by embedding expectations into the test input
Solutions like the above work and could be considered sufficient. However, when you are writing vast numbers of unit tests, it’s worthwhile to invest further in making it as easy and unrepetitive as possible. Instead of writing code to check for certain expected messages after the compilation, what if we embedded those expectations into the source code itself? That would not only save us from having to repeatedly call the “check for this message” function, but it would also make our system less error-prone, since it would reduce the distance between each line of code expected to generate a message and the corresponding information about that expectation.
[Test]
public void TestDuplicateFieldDeclaration() {
    var code = @"class C {
    // error: same name
    int foo;
    // error: same name
    int foo;
}";
    var expectedMessages = GetExpectedMessages(code);
    var compiler = new Compiler();
    compiler.AddSourceCode(code);
    compiler.Run();
    var actualMessages = compiler.Messages;
    AssertExpectationsMet(expectedMessages, actualMessages);
}
The comments in the test source (// error: same name) are meaningless to the compiler, but our test driver can use them to extract expectations. Pseudocode for GetExpectedMessages and AssertExpectationsMet is given below.
enum MessageKind {
    Error,
    Warning,
    Info
}
record ExpectedMessage(MessageKind Kind, string[] Words, int Line);
static List<ExpectedMessage> GetExpectedMessages(string code) {
    // Create an empty list of ExpectedMessages
    // For each line in code:
    //     If a regex matches this line, extract the expected message kind and words,
    //     create a new ExpectedMessage for the line the comment refers to,
    //     and append it to the list
    // Return the list
}
static void AssertExpectationsMet(IList<ExpectedMessage> expectedMessages, IList<Message> actualMessages) {
    // Assert that there is a one-to-one correspondence
    // between expected messages and actual messages:
    //     Iterate through each expected message.
    //     If it successfully matches an actual message,
    //     remove the actual message from the list.
    // After processing all expected messages,
    // all expected messages should have found a match,
    // and there should be no leftover actual messages.
}
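To make the sketch concrete, here is one possible implementation of both functions. It assumes that an expectation comment refers to the source line immediately following it and reuses the hypothetical Message type from earlier; the regex and the word-containment matching rule are just one choice among the options discussed above.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static readonly Regex ExpectationPattern =
    new(@"^\s*//\s*(error|warning|info):\s*(.+)$", RegexOptions.IgnoreCase);

static List<ExpectedMessage> GetExpectedMessages(string code) {
    var expectations = new List<ExpectedMessage>();
    var lines = code.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
    for (int i = 0; i < lines.Length; i++) {
        var match = ExpectationPattern.Match(lines[i]);
        if (!match.Success)
            continue;
        var kind = Enum.Parse<MessageKind>(match.Groups[1].Value, ignoreCase: true);
        var words = match.Groups[2].Value.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        // The expectation applies to the line after the comment (1-based line numbers).
        expectations.Add(new ExpectedMessage(kind, words, i + 2));
    }
    return expectations;
}

static void AssertExpectationsMet(IList<ExpectedMessage> expectedMessages, IList<Message> actualMessages) {
    var unmatched = new List<Message>(actualMessages);
    foreach (var expected in expectedMessages) {
        // Match on line number and required words; matching on Kind would also
        // be possible if the compiler's Message type records it.
        var match = unmatched.FirstOrDefault(m =>
            m.Line == expected.Line && expected.Words.All(w => m.Text.Contains(w)));
        Assert.IsNotNull(match,
            $"No message containing '{string.Join(" ", expected.Words)}' on line {expected.Line}");
        unmatched.Remove(match);
    }
    Assert.IsEmpty(unmatched, "Unexpected messages were produced");
}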
We could again factor out everything into a test helper function, reducing each test function down to a single function call plus a string literal.
void TestCompile(string code) {
    var expectedMessages = GetExpectedMessages(code);
    var compiler = new Compiler();
    compiler.AddSourceCode(code);
    compiler.Run();
    var actualMessages = compiler.Messages;
    AssertExpectationsMet(expectedMessages, actualMessages);
}
[Test]
public void TestDuplicateFieldDeclaration() {
    TestCompile(@"class C {
    // error: same name
    int foo;
    // error: same name
    int foo;
}");
}
Scaling via data-driven tests
What could be more concise than a test function that just calls a function with a string argument?
How about no test function at all! Instead, we can take advantage of the feature variously known as data-driven tests or parameterized tests, where you instruct your test framework to programmatically create tests or test cases. Parameterized tests are a somewhat advanced feature, but they are well supported by current versions of NUnit, JUnit, GoogleTest, and other frameworks.
We can use parameterized testing here to dynamically discover files and folders of test data. In particular, I like to turn each folder into a test case: every source file in a folder is added to the same test. This gives us a straightforward way to have test cases consisting of zero, one, or multiple source files. Our test sources directory could look like the tree below, in which case we’d have four test cases, one for each subdirectory of test-sources.
test-sources/
├── declare function/
│   └── function.src
├── declare field/
│   └── field.src
├── no source/
└── multiple files/
    ├── file1.src
    └── file2.src
NUnit C# example
static IEnumerable<string> GetFrontendTestSourceFolders
    => Directory.EnumerateDirectories("test-sources");

[Test, TestCaseSource(nameof(GetFrontendTestSourceFolders))]
public void AutoTestFrontend(string folderPath) {
    // todo: Test compilation of source files in 'folderPath'
}
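One way to fill in that body, reusing the GetExpectedMessages and AssertExpectationsMet helpers from earlier, might be the following sketch; the .src extension and the one-compiler-per-folder setup are assumptions about how your test data is organized:

using System.IO;

[Test, TestCaseSource(nameof(GetFrontendTestSourceFolders))]
public void AutoTestFrontend(string folderPath) {
    var compiler = new Compiler();
    var expectedMessages = new List<ExpectedMessage>();
    foreach (var file in Directory.EnumerateFiles(folderPath, "*.src")) {
        var code = File.ReadAllText(file);
        expectedMessages.AddRange(GetExpectedMessages(code));
        compiler.AddSourceCode(code);
    }
    compiler.Run();
    // Note: with multiple files, expected and actual messages would also need
    // to agree on which file they refer to; that detail is omitted here.
    AssertExpectationsMet(expectedMessages, compiler.Messages);
}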
GoogleTest C++ example
class AutoTestFrontend
    : public testing::TestWithParam<std::filesystem::path> {};

TEST_P(AutoTestFrontend, Compile) {
    const auto folderPath = GetParam();  // GetParam is provided by GoogleTest
    // todo: Test compilation of source files in 'folderPath'
}

INSTANTIATE_TEST_SUITE_P(FrontendAutoTests,
    AutoTestFrontend,
    testing::ValuesIn(getAutoTestDirectories()),
    [](const testing::TestParamInfo<AutoTestFrontend::ParamType>& info) {
        auto str = info.param.filename().string();
        for (auto& c : str) {
            if (!isalnum(static_cast<unsigned char>(c)))
                c = '_';
        }
        return str;
    }
);
getAutoTestDirectories returns an iterable container of std::filesystem::path. I make it return a global variable that I populate in main based on a command-line argument, using std::filesystem::directory_iterator to build the list of subdirectories. It’s crucial that the container is populated before testing::InitGoogleTest is called.
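A sketch of that arrangement might look like the following; the argument handling and the names here are just assumptions about how you wire it up:

#include <filesystem>
#include <vector>
#include "gtest/gtest.h"

// Global container; populated in main before InitGoogleTest so that the
// ValuesIn expression in INSTANTIATE_TEST_SUITE_P sees the real data.
static std::vector<std::filesystem::path> autoTestDirectories;

const std::vector<std::filesystem::path>& getAutoTestDirectories() {
    return autoTestDirectories;
}

int main(int argc, char** argv) {
    // Assume the test-sources root is passed as the first command-line argument.
    const std::filesystem::path root = argc > 1 ? argv[1] : "test-sources";
    for (const auto& entry : std::filesystem::directory_iterator(root)) {
        if (entry.is_directory())
            autoTestDirectories.push_back(entry.path());
    }
    testing::InitGoogleTest(&argc, argv);
    return RUN_ALL_TESTS();
}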
The lambda here is necessary because, by default, GoogleTest tries to name each generated test after the value of GetParam(), which in our case is an arbitrary path, while test names must be valid identifiers. Therefore, we replace each non-alphanumeric character with an underscore.
Summary and impact
With data-driven testing in place, we can scale the number of test cases indefinitely without writing a single line of unit test code. In terms of time and lines of code, setting up a data-driven mechanism represents a large and ever-growing return on investment once we get beyond a dozen test cases. By minimizing the effort needed to add new test cases, it also has the indirect effect of encouraging developers to add more tests.
Don’t repeat yourself. Testing in general tends to be fertile ground for factoring out patterns, so don’t hesitate to consider creating functions, classes, domain-specific languages, or other tools to help.[1] Effort spent on test utilities and infrastructure pays off every time you add or maintain a test, so it’s worth trying to do a great job instead of one that’s just good enough for the first few tests you write.
[1] Of course, you shouldn’t overbuild anything, but people overwhelmingly err on the side of doing too little rather than too much.