Death tests refusing to run in parallel?

I ran into some unit test weirdness while developing a debugging library that raises SIGTRAP under some circumstances. The nature of that project isn’t important. Indeed, this issue turns out to have a very small reproducer:

#include <csignal>
#include <gtest/gtest.h>

TEST(A, A) {
  EXPECT_EXIT({
    raise(SIGTRAP);  /* or anything else that raises SIGTRAP */
  },
  ::testing::KilledBySignal(SIGTRAP), "");
}

I have 8 tests like this, all of which are fast and end by raising SIGTRAP. On my modest laptop, they all pass in about 1 second each. However, it seems they refuse to run in parallel!

At first, I assumed this was due to an interaction between the test runner and GoogleTest. I tried various ways to run them in parallel:

  • CLion’s CTest runner plugin with -j8
  • Command-line CTest with -j8
  • GNU parallel

They all behaved the same, with the whole suite taking about 8 seconds. I also noticed CLion and CTest misreporting the total time as some 36 seconds. Watching the output made it obvious the tests were running in series. From CTest’s point of view, however, the 1st test took 1 second, the 2nd test took 2 seconds (1 second waiting behind the 1st, then 1 second running), and so on, up to the 8th test taking 8 seconds. Since CTest thought all of this was happening in parallel, it reported the overall time used by all tests as 36 seconds (= 1 + 2 + … + 8).
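For reference, a minimal CMake setup along these lines (target and file names are illustrative, not the project’s actual ones) registers each `TEST()` as its own CTest case, which is what lets `ctest -j8` schedule them as separate concurrent processes in the first place:

```cmake
# Hypothetical CMakeLists.txt sketch; names are illustrative.
cmake_minimum_required(VERSION 3.20)
project(sigtrap_death_tests CXX)

find_package(GTest REQUIRED)
enable_testing()

add_executable(death_tests death_tests.cpp)
target_link_libraries(death_tests GTest::gtest_main)

include(GoogleTest)
# Registers every TEST() in the binary as a separate CTest test,
# so `ctest -j8` can run them in parallel as independent processes.
gtest_discover_tests(death_tests)
```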

I blamed GoogleTest, suspecting I’d somehow missed that serialization was just another limitation of death tests, one of GoogleTest’s more fraught features¹. I inspected a few runs with strace but saw no obvious attempt by GoogleTest at mutual exclusion, and I found no smoking gun in the source code either.

I then tried running the tests on a different machine and found they finished much faster, in about 0.1 s instead of 1 s each, and, more surprisingly, they ran in parallel! The speed difference per test wasn’t that surprising, since the second machine had a much better processor and faster memory.

Finding the root cause

The issue failed to reproduce on a different machine, which suggested GoogleTest wasn’t to blame. But why wouldn’t the tests run in parallel on the first machine? I spent the next two hours answering that question, which I now condense for the reader.

One useful experiment was lengthening the tests (e.g. sleeping for several seconds before raising SIGTRAP), which showed that the tests were in fact running in parallel, to some strange extent. In particular, rather than the expected parallel

$$t_\text{overall} = \max_{\text{test} \in T} t_\text{test}$$

or the serial

$$t_\text{overall} = \sum_{\text{test} \in T} t_\text{test}$$

it behaved more like

$$t_\text{overall} = |T| \cdot 1 \text{ second} + \max_{\text{test} \in T} t_\text{test}$$

Notice that when each $t_\text{test}$ is close to zero, this looks a lot like the serial regime with each $t_\text{test} = 1$ second.

Despite several attempts, I was unable to catch the tests in their unexplained delay with gdb or strace. Eventually, I wrote a tiny standalone program that did nothing but raise SIGTRAP and exit. Running this program on its own took about 1 second. With no GoogleTest in the picture, this slowness demanded an explanation. Changing the signal to something else, such as SIGTERM, made the program exit instantly. By this point, I was fully blaming the system.

I started searching the Internet for how Kubuntu (the slow machine’s OS) handles crashes. I eventually found apport was installed and active. Sure enough, checking /var/log/apport.log and /var/crash/ showed that apport was being informed of my programs’ exits due to SIGTRAP and was creating some sort of crash dump files. It was clear from the log’s timestamps that writing these files was taking about 1 second. This explains why the SIGTRAP programs were appearing to take so long to run.

What to do about it

I am not an apport expert, but I can offer three suggestions for avoiding apport’s delays.

  • Use a system that doesn’t have apport installed.
  • Stop the apport systemd service. I did this and it made my tests much faster.
  • Add your program to apport’s blacklist.
  • (Non-solution: Note that ulimit -c 0 does not suffice.)
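For the blacklist route, apport reads plain-text files under /etc/apport/blacklist.d/, where (to my understanding) each non-comment line is the absolute path of an executable whose crashes apport should ignore. A sketch, with an illustrative path:

```
# /etc/apport/blacklist.d/my-unit-tests
# One absolute executable path per line; crashes from these are ignored.
/home/me/projects/debugger/build/death_tests
```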

All of these are administrative decisions that random unit test programs cannot or should not be making for users. Therefore, I conclude that systems running apport are simply ill-suited for running unit tests.

Why did this happen

As a crash reporter, apport evidently takes termination via SIGTRAP as a sign of a crash and goes about making a crash report. This is not unreasonable. Even though it takes about 1 second, and even though apport seems unable to work in parallel, such delays are likely to be entirely unnoticed when the crashing program is a background service or a GUI app. But when your use case is to run a group of 8 extremely short-lived applications in parallel, apport’s behavior comes as a nasty surprise, as it costs you 8 seconds, a far cry from the expected nearly 0 seconds.

Because of this and other use cases, it would be nice if there were an API you could easily call that excluded the current process from any sort of ambient crash handling. I won’t hold my breath for Canonical to make such an enhancement though, as apport’s wiki page is covered in dead links, and its bugtracker has “high” priority bugs that are over a decade old. Meanwhile the kernel team is likely to ignore this as not a kernel problem.

  1. One Google employee described death tests as “really fucky”. ↩︎
