In part 1 of this series, I made the case that you should run away from the shell when writing integration tests for your software and that you should instead embrace the primary language of your project to write them.

Depending on the language you are using, doing this will mean significantly more work upfront to lay out the foundations for your tests, but this work will pay off. You may also feel that the tests end up more verbose than they would be in shell, though that’s not necessarily the case.

In this second part, I’ll assume I have already convinced you that using the shell is a horrible idea, and that using another interpreted language is equally bad unless your project already uses that language. Taking that as a given, I’ll guide you through a case study of the sandboxfs project, present some key design ideas that you should consider, propose a very rudimentary API, and finally cover what the roadmap for Bazel could look like.


A case for writing integration tests in Java: practical aspects
Fixing Bazel’s reliability by treating tests as production-grade code


Case study: sandboxfs

Before diving into a proof of concept, let’s look at a project that just went through this process: sandboxfs.

Initial integration tests were in shell

In sandboxfs, we started with integration tests written in shell because of the same fallacy: it was easier to get them running upfront, and the execution of sandboxfs and interaction with the file system would closely resemble what a user would run by hand.

Unfortunately, in a matter of months, the integration tests became hard to read and hard to maintain. The very simple cases that we first conceived were not sufficient to exercise the more difficult corner cases of sandboxfs, and writing those harder tests (e.g. tests for race conditions or for signal interaction) quickly became cumbersome, and outright impossible in some cases.

The tests could not be open-sourced

Another problem was that those tests relied on a Google-internal shell testing library. This prevented open-sourcing the tests verbatim.

A solution to this problem was to rewrite the tests using some other shell testing library. The obvious candidate was to repurpose what Bazel already has (the unittest.bash “infrastructure”), but the quality of the Bazel shell integration framework is “not great”. We could have jumped to other shell testing libraries, such as my own shtk, but given the unpopularity of shell in general for large-scale coding, such libraries are neither actively maintained nor widely used.

Initial steps in the rewrite in Go

In the end, I chose to rewrite the integration tests in Go, the exact same language the rest of the project is written in. So how did it go?

Well… at first, writing the initial tests was painful. I had to come up with a lot of infrastructure to start the sandboxfs binary under different test conditions, shut it down cleanly, provide helper methods to test for common conditions, etc.

The CLs that added the first few tests usually came with necessary changes to the testing infrastructure to tweak abstractions and to add missing features. A few CLs later, however, the infrastructure solidified. Further CLs added tests and did so exclusively. And today we have pretty sophisticated integration tests because a first-class language like Go has direct access to all operating system facilities.

You can take a look at the sandboxfs/integration directory if you are curious. I personally am not a fan of Go’s “best practices” regarding the structure of tests, but I believe that sticking to the same language as the project’s core and sticking to the community-acknowledged best practices is what’s best for the project.

Gained benefits

The resulting integration tests are very reliable. Because the tests look pretty much identical to production code, code reviewers treat them as such and catch the same kind of mistakes you’d catch when reviewing the actual project’s code. We do not experience test flakiness. We have sophisticated tests that spawn the binary under test after recreating a complex environment around it and that abuse the binary in ways that would be impossible from shell.

And… the tests are not more difficult to write than the shell ones. In fact, it is easier to write them because there is no context switch in changing languages. Furthermore, at this point we have a large-enough collection of integration tests that writing a new one is often a matter of finding an existing test that resembles what we have to do and copying it.

Finally, because we use the same language, things like the project’s build infrastructure and IDE interaction are the same for both the project’s core and the tests.

Design ideas

To make this proposal feasible, we must provide a bunch of process- and file-management primitives that keep the tests’ code concise. These primitives differ from the ones commonly used in unit tests, so we need to supply them separately.

Process execution and validation

The key primitive that we need is a process runner that takes a command in the form that most resembles what the user would type, executes it, and validates a bunch of properties from the execution.

In particular, we are interested in checking:

  • The exit code of the process. We may want to assert that the process exited with a specific code, that it did not exit with a specific code, that it exited successfully, or that it was terminated by a signal.
  • The expected behavior of the process’s stdout. Should the output be silent? Should it match a golden file or string? Should it match a pattern? Any combination of the previous?
  • The expected behavior of the process’s stderr. Same as for the stdout.

All these checks should be expressible in a single statement. On a failure, the framework has to dump detailed status to the test log: the actual and expected error codes, the actual outputs and how they didn’t match the expectations, the work directory contents, etc. Even more: if we want to check the output of a command against a golden file, the framework should print a diff of the output, not just the verbatim copy.

The oldest prior art in this area is the AT_CHECK macro shipped in GNU Autotest, which offers this functionality via m4 scripts (ugh). I later repurposed this idea in the ATF libraries in 2008 and again in the shtk libraries in 2014.

Nowadays, this primitive is widely used in the FreeBSD and NetBSD integration test suites, which verify most of the command line tools that ship with these operating systems. These account for hundreds of integration tests.
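
To make this more concrete, here is a minimal sketch of what such a checker could look like in Java. The names mirror the proof of concept shown later, but this implements only a subset of the operations, and all the defaults, error reporting, signal handling, and golden-file diffing are illustrative and would need proper design:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Minimal sketch of a fluent, single-statement process checker.  A real
 * implementation would also cover signals, golden files, diffs of the
 * outputs, and dumping the work directory contents on failure.
 */
final class CommandChecker {
    private final List<String> args;
    private Integer wantExitStatus = 0;  // null means "do not check".
    private boolean wantEmptyStdout = true;
    private boolean wantEmptyStderr = true;
    private final List<Pattern> stderrPatterns = new ArrayList<>();

    CommandChecker(String... args) {
        this.args = Arrays.asList(args);
    }

    CommandChecker exitStatusIs(int status) { wantExitStatus = status; return this; }
    CommandChecker ignoreExitStatus() { wantExitStatus = null; return this; }
    CommandChecker ignoreStdout() { wantEmptyStdout = false; return this; }
    CommandChecker ignoreStderr() { wantEmptyStderr = false; return this; }
    CommandChecker matchStderr(String regex) {
        wantEmptyStderr = false;
        stderrPatterns.add(Pattern.compile(regex));
        return this;
    }

    /** Runs the command and reports all captured details on any mismatch. */
    void check() throws IOException, InterruptedException {
        Path stdoutFile = Files.createTempFile("stdout", ".txt");
        Path stderrFile = Files.createTempFile("stderr", ".txt");
        Process process = new ProcessBuilder(args)
            .redirectOutput(stdoutFile.toFile())
            .redirectError(stderrFile.toFile())
            .start();
        int exitStatus = process.waitFor();
        String stdout = new String(Files.readAllBytes(stdoutFile), StandardCharsets.UTF_8);
        String stderr = new String(Files.readAllBytes(stderrFile), StandardCharsets.UTF_8);
        // On failure, dump everything we know, not just the check that failed.
        String details = "command: " + args + "\nexit status: " + exitStatus
            + "\nstdout:\n" + stdout + "stderr:\n" + stderr;
        if (wantExitStatus != null && exitStatus != wantExitStatus) {
            throw new AssertionError("want exit status " + wantExitStatus + "\n" + details);
        }
        if (wantEmptyStdout && !stdout.isEmpty()) {
            throw new AssertionError("want empty stdout\n" + details);
        }
        if (wantEmptyStderr && !stderr.isEmpty()) {
            throw new AssertionError("want empty stderr\n" + details);
        }
        for (Pattern pattern : stderrPatterns) {
            if (!pattern.matcher(stderr).find()) {
                throw new AssertionError("stderr does not match " + pattern + "\n" + details);
            }
        }
    }
}

A run(...) helper in the test’s base class would then simply return a new CommandChecker so that all checks can be chained in a single statement, as shown in the proof of concept below.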

POSIX-like file and text manipulation

The second class of primitives that we need in our integration test framework is a bunch of calls that make it easy to replicate command-line operations. It’s common for integration tests to have to move or copy files around, to change their permissions, to check for text matches, etc., so these operations should be trivial and concise.

We already have a lot of this functionality in our codebase, but such functionality is buried layers deep due to the common abstraction patterns used in Java. We need to unbury these helper tools by providing thin layers that mimic the standard command-line tools. You can think of this as having a parent IntegrationTest class that offers trivial convenience methods like cp, rmR, mkdirP that, under the hood, delegate to full-blown implementations. Or you can think of using static imports for these helper methods.
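
Purely as an illustration of how thin these layers can be (the exact set of helpers and their signatures would need separate design), a few of them could delegate to java.nio.file like this:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.regex.Pattern;

/** Fragment of a hypothetical IntegrationTest base class: string-based helpers. */
public abstract class IntegrationTest {
    /** Equivalent of "mkdir -p": creates a directory and any missing parents. */
    protected static void mkdirP(String dir) throws IOException {
        Files.createDirectories(Paths.get(dir));
    }

    /** Equivalent of "cp": copies one file, replacing the target if it exists. */
    protected static void cp(String from, String to) throws IOException {
        Files.copy(Paths.get(from), Paths.get(to), StandardCopyOption.REPLACE_EXISTING);
    }

    /** Creates or overwrites a file with the given contents. */
    protected static void writeFile(String file, String contents) throws IOException {
        Files.write(Paths.get(file), contents.getBytes(StandardCharsets.UTF_8));
    }

    /** Equivalent of "grep -q": tells whether any line of the file matches the regex. */
    protected static boolean grep(String regex, String file) throws IOException {
        Pattern pattern = Pattern.compile(regex);
        return Files.readAllLines(Paths.get(file), StandardCharsets.UTF_8).stream()
            .anyMatch(line -> pattern.matcher(line).find());
    }
}

Note that the helpers take plain Strings, which keeps call sites concise; whether that is the right trade-off is revisited in the proof of concept below.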

Work directory and process isolation

The last primitive or concept we need is test case isolation. This is two-fold:

First, at the file system level. Each test has to start in a clean empty directory and should be free to modify that directory at will (which is very common to do while preparing a workspace). This directory ought to be cleaned up automatically upon test termination. (Here you can imagine having a test flag that tells the framework to leave these work directories behind on exit so that the developer can interactively investigate a failed test after-the-fact.)

Achieving this in Java is difficult, though, because one cannot change the current directory of the JVM. It is certainly feasible via JNI and some other magic, but Java might have cached the value of the current working directory and may use that cached value for other path manipulation operations (think converting a relative path to an absolute one). We can work around this, however: we could encapsulate the concept of the current directory in IntegrationTest and make sure all helper methods respect it, or we could have a custom test runner that spawns each test case as a subprocess in its own directory (see below).
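
As a sketch of the encapsulation route (the environment variable and helper names are made up for illustration), the hypothetical IntegrationTest base class could own the work directory and resolve every test-relative path against it, never touching the JVM’s current directory:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;
import org.junit.After;
import org.junit.Before;

/** Fragment of the hypothetical IntegrationTest base class: work directory handling. */
public abstract class IntegrationTest {
    /** Per-test work directory; all String-based helpers resolve paths against it. */
    protected Path workDir;

    @Before
    public void createWorkDir() throws IOException {
        workDir = Files.createTempDirectory("integration-test");
    }

    @After
    public void cleanUpWorkDir() throws IOException {
        // A hypothetical flag could skip this step so that the developer can
        // inspect the work directory of a failed test after the fact.
        if (System.getenv("KEEP_WORK_DIR") != null) {
            return;
        }
        try (Stream<Path> paths = Files.walk(workDir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(path -> path.toFile().delete());
        }
    }

    /** Resolves a test-relative path without relying on the JVM's current directory. */
    protected Path resolve(String relative) {
        return workDir.resolve(relative);
    }
}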

And second, at the process level. Integration tests spawn Bazel instances. To avoid cross-test pollution, we should ensure that all Bazel instances are terminated at the test boundary.

There is the question of what to do with the test case itself. It’s possible to spawn a subprocess for each test case, which has the benefit of transparently giving each one its own work directory and a reliable, out-of-band way to clean it up. This is a different model than most tests follow and could confuse users, because each test will effectively have its own memory space and global variables stop working as one thinks they do… but sharding effectively already does that, so the concern is minor.

Proof of concept

To show that integration tests written in Java don’t have to be painful to write or maintain (which is a concern I’ve heard before), I’ve taken a couple of integration tests and converted them to Java using the key primitives outlined above. The goal of this exercise is to show that Java-based integration tests needn’t suffer from the typical verbosity of Java, not that these are the specific APIs we should use.

WARNING: The APIs shown here are very rudimentary and just an example of what we could achieve. Since the publication of this document, we have been discussing better mechanisms to implement these concepts but I decided to leave the document “as is” for publication. The design and implementation of the infrastructure will come separately from this motivational document.

You can find the proof of concept in my jmmv/bazel/java-inttests branch on GitHub. In there, you will find a new PreludeTest.java file that you can compare line by line against prelude_test.sh. You may notice that the Java version is actually 10 lines shorter than the shell one. And under src/test/java/com/google/devtools/build/lib/integration/util/ you can find a super-simplistic implementation of the APIs proposed above.

Some things to highlight below.

The process checking API permits expressing calls to Bazel like this:

// Check that Bazel exits with 0 and that stdout and stderr were empty:
run("bazel", "build", "//package:foo").check();

// Check that Bazel exits with 0 and ignore stdout and stderr:
run("bazel", "build", ":gr").ignoreStdout().ignoreStderr().check();

// Check that Bazel exits with an error, that stdout was empty, and that
// stderr contains a specific error:
run("bazel", "builad", "//pkg")
    .exitStatusIs(2)
    .matchStderr("Command 'builad' not found")
    .matchStderr("Try 'bazel help'")
    .check();

// Ignore Bazel's exit status but save the outputs for later processing:
run("bazel", "--exoblaze", "build", "//foo")
    .ignoreExitStatus()
    .saveStdout("stdout.txt")
    .saveStderr("stderr.txt")
    .check();
// ... now inspect the contents of stdout.txt and stderr.txt ...

This API should be generic and decoupled from Bazel, but note that our tests should only ever invoke Bazel: running other external tools should be discouraged unless we own them, because any external tool we invoke is a portability liability. (Think: the simple act of running, say, an external grep makes the test not easily runnable on Windows.)
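
One way to reconcile both goals, again with hypothetical names and assuming the run() helper and the CommandChecker sketched earlier, is to funnel every Bazel invocation through a single wrapper so that any direct call to another tool stands out during review:

    // Fragment of the hypothetical IntegrationTest base class: the only place
    // that knows how to invoke Bazel.  A direct run("grep", ...) call elsewhere
    // then becomes an obvious portability smell during code review.
    protected CommandChecker bazel(String... args) {
        String[] command = new String[args.length + 1];
        command[0] = "bazel";  // Or the path of the binary under test.
        System.arraycopy(args, 0, command, 1, args.length);
        return run(command);
    }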

Similarly, tests end up looking like the following:

import static com.google.common.truth.Truth.assertThat;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;

@RunWith(JUnit4.class)
public class ConceptTest extends IntegrationTest {
    @Test
    public void testSimple() throws Exception {
        writeFile("WORKSPACE", "");

        // Explicitly depend on the exact things this test needs so that data
        // dependencies become obvious at the BUILD file level.
        provisionCppRules();

        mkdirP("package");
        writeFile("package/BUILD", "genrule(..., outs = ['test.txt'], ...)");

        run("bazel", "build", "//package")
            .ignoreStdout().ignoreStderr().check();
        assertThat(
            grep("the test data", "bazel-genfiles/test.txt")).isTrue();
    }
}

Note how the IntegrationTest base class exposes a bunch of helper functions to simplify the interaction between our test and the file system. Also note how these calls are all “loosely typed”: while the internals use Paths and Files and whatnot, the thin wrappers in IntegrationTest just deal with Strings to keep the call sites succinct. (These bare interfaces are only an example and are probably a bad idea in the long run; they need more design work.)

How do we get there?

Rewriting our tests from scratch, in one go, is infeasible: we have too many of them, and rewriting code just for the sake of it is a waste of time. We’d better spend that time doing work on Bazel itself.

We can only do this incrementally and opportunistically, at least until we achieve critical mass and introducing new Java-based tests becomes as easy as, if not easier than, adding shell-based tests. At that point, we will naturally prefer the Java APIs when writing tests and the shell tests will quiesce. Only then can we consider rewriting whatever is left.

Here is a possible plan with no explicit timeline:

  1. Implement the foundations of the testing APIs and convert a few representative test cases to use them.

    • Make sure to cover both existing Bazel and Blaze test cases (and maybe even Exoblaze).

    • This is to ensure we get the right level of abstraction regarding mock workspace creation.

  2. Rewrite all existing Python-based tests en masse.

    • Kinda contradicts my statement above but… there are only 11 test programs in the Bazel case, so this is feasible.

    • Getting rid of these allows us to eliminate one of the testing infrastructure variants, thus preventing the current project structure from getting worse.

  3. Enable strict mode in our existing shell tests, discover anything that’s broken, and try to address those failures either by minor fixes or by rewriting them in Java.

  4. Discourage the practice of open-sourcing shell tests unless they are rewritten in Java.

    • Publishing shell tests is rarely on the critical path, so if this happens, the change should come with a justification as to why the test was not rewritten in Java.

    • The reason is likely to be “insufficient framework support” or “insufficient sample tests from which to borrow code”. Both of these issues should be addressed separately to not block CL authors. The hope is that the increased framework coverage will prevent this practice from recurring.

  5. Discourage the addition of new shell tests. Any new tests ought to be written in Java using the new foundations.

    • This is the best moment to ensure that any new tests are open-sourceable either immediately or in the long term.

    • Same considerations as in the previous point regarding justification.

  6. Disallow the modification of existing shell tests. Any need to modify those tests should be accompanied by a proper rewrite.

So, what do you say? Shall we make this happen?