The web-platform-tests project is a massive suite of tests (over one million in total) which verify that software (mostly web browsers) correctly implement web technologies. It’s as important as it is ambitious: the health of the web depends on a plurality of interoperable implementations.

Although Bocoup has been contributing to the web-platform-tests, or “WPT,” for many years, it wasn’t until late in 2017 that we began collecting test results from web browsers and publishing them to wpt.fyi. The sheer size of the task ruled out popular hosted testing solutions like Travis CI, so we started with our own primitive system, which took up to 24 hours to execute the test suite… if we were lucky.

Over the year that followed, we struggled through network problems, service outages, browser bugs, and good old-fashioned programming errors. With every build error, we made the system a little more robust. Now, we’d like to reflect on what we learned from the experience.

This won’t be a run-down of all the specific problems we encountered (you can review the issue tracker for that). Instead, we’ll be looking at the systemic flaws which explain trends in our efforts over the past twenty months. Whether you’re building your own test automation system or you’re just a connoisseur of schadenfreude, we hope you’ll enjoy this story of our blunders.

Intermittent errors

From the very beginning, our efforts to collect test results would fail at seemingly random intervals. It was never truly random, though; the trends just took some time to discover.

In the early days, the system would occasionally fail due to bugs in our own infrastructure. Since we maintained this code directly, fixing these problems was straightforward. Things got trickier as our code matured and the failures started to come from the libraries and services on which we relied. Nobody’s perfect, and we were reminded of this every time an otherwise reliable service exhibited some kind of hiccup. From Sauce Labs service outages to race conditions in Python package distribution, we saw how calling something a “dependency” does not make it “dependable.”

When it comes to problems like these, we kinda envy werewolf hunters. They can count on silver bullets to solve all their problems. No such luck with test automation: intermittent errors don’t have enough in common for any single fix to cover them all. The lesson we learned is to accept intermittent errors as a fact of life and to build resiliency into the system. Detect errors at the highest level, report them, and automatically retry. This won’t fix anything, but it will limit the work lost when hiccups inevitably crop up.
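Here’s a minimal sketch of that retry-and-report pattern. The collect and report callables are hypothetical stand-ins (ours wrapped an entire results-collection run), and the attempt limit and back-off are illustrative:

```python
import time

MAX_ATTEMPTS = 3  # illustrative; tune for how flaky your dependencies are


def run_with_retries(collect, report):
    """Run one collection attempt, reporting and retrying on any failure."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return collect()
        except Exception as exc:
            # Report every failure so intermittent errors stay visible
            # instead of silently disappearing into the retry loop.
            report(f"attempt {attempt} failed: {exc}")
            if attempt == MAX_ATTEMPTS:
                raise
            time.sleep(60 * attempt)  # back off a little before retrying
```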

Costly crashes

Like any good proof-of-concept, the system we started with was only just complex enough to produce the expected results. For each browser under test, it executed all of WPT in one go. If the attempt failed unexpectedly (see above), then all the partial results would be discarded, and we’d have to start a new collection from scratch.

Browser crashes were far more common when running WPT in late 2017, so the “restart from scratch” policy delayed results on a regular basis. We took this as sufficient motivation to reduce risk by subdividing tasks. We scripted the WPT runner to segment the test suite into smaller subsets, execute each in turn, and consolidate the results immediately prior to publishing them to wpt.fyi. This made the system more complex overall, but it also allowed us to recover from crashes far more efficiently.
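As a rough sketch of that loop: the chunking flags below mirror wptrunner’s --total-chunks/--this-chunk options, but the exact invocation, the merge_reports helper, and the publish step are all illustrative assumptions rather than our production code.

```python
import subprocess

CHUNK_COUNT = 100  # illustrative; smaller chunks make crash recovery cheaper


def collect_all(browser, merge_reports, publish):
    """Run WPT one chunk at a time, then consolidate and publish the reports."""
    report_paths = []
    for chunk in range(1, CHUNK_COUNT + 1):
        report_path = f"wptreport-{chunk}.json"
        subprocess.run(
            ["./wpt", "run", browser,
             "--total-chunks", str(CHUNK_COUNT),
             "--this-chunk", str(chunk),
             "--log-wptreport", report_path],
            check=True,
        )
        # A crash here only costs the current chunk, not the whole run.
        report_paths.append(report_path)
    publish(merge_reports(report_paths))  # e.g. upload the merged results to wpt.fyi
```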

Failing badly

The demands of the project grew over time; in general, we wanted to test more browsers more often. To meet this need, we provisioned dedicated “worker” machines on Amazon Web Services. When these machines received work orders from a central organizer, they would install the browser under test and collect the results defined by the work orders.

Certain kinds of failures had side-effects that we didn’t anticipate. Even though our fancy automatic recovery mechanisms kicked in, the workers were doomed to fail all subsequent attempts. That’s because the unexpected side-effects persisted across independent work orders.

The most common explanation will be familiar to desktop computer users: the machines ran out of disk space. From overflowing logs and temporary web browser profiles, to outdated operating system files and discarded test results, the machines had a way of accumulating useless cruft. It wasn’t just storage, though. Sometimes, the file system persisted faulty state (e.g. operating system locks and file permissions).

This entire class of problem can be addressed by avoiding state. This is a core tenet in many of today’s popular web application deployment strategies. The “immutable infrastructure” pattern achieves this by operating in terms of machine images and recovering from failure by replacing broken deployments with brand new ones. The “serverless” pattern does away with the concept of persistence altogether, which can make sense if the task is small enough.

We lacked the resources to re-architect our solution, so we stuck with more traditional remedies: configuration management tools, temporary filesystems, and good old-fashioned reboots. This didn’t eradicate state, but it was an easier policy to apply to a running system.
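For one flavor of that remedy, a scratch directory created and destroyed around every work order keeps leftover logs, browser profiles, and partial results from leaking into the next run. This is a minimal sketch, assuming a hypothetical execute callback that writes only under the scratch directory:

```python
import shutil
import tempfile
from pathlib import Path


def run_work_order(execute):
    """Give each work order a fresh scratch directory and always discard it."""
    scratch = Path(tempfile.mkdtemp(prefix="wpt-worker-"))
    try:
        # The callback is assumed to confine its temporary state to `scratch`.
        return execute(scratch)
    finally:
        # Clean up even when the work order fails, so the next one starts fresh.
        shutil.rmtree(scratch, ignore_errors=True)
```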

Brittle schedules

To segment the work into smaller pieces, we subdivided the test suite according to the directory name of each test. The number of tests in each directory varies greatly, but WPT is so large that this method still produced “chunks” of roughly even size. The thing is, the duration of each test also varies, and that variation correlates with the directory (e.g. the type of test or the failure mode of the feature under test).

All this is to say: some segments took longer to run than others. This worked against our naive scheduling strategy of assigning the same number of jobs to each worker. Some workers would receive shorter jobs than others. Those lucky workers would finish early and sit idle while the others still had more jobs in their queue.

It’s another case where the maxim is obvious in hindsight: when you can’t predict time requirements, defer task assignment. You don’t necessarily need anything as fancy as Firefox’s work stealing algorithm; just stop pre-assigning tasks to workers and let each one pull its next job as soon as it finishes the last. With this setup, an efficient schedule emerges naturally.
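Here is a minimal sketch of deferred assignment using a shared in-process queue. The real system dispatched work orders to remote machines rather than threads, so treat jobs and run_job as stand-ins:

```python
import queue
import threading


def run_all(jobs, run_job, worker_count=4):
    """Let idle workers claim the next job, so long chunks can't strand short ones."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    def worker():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to claim; this worker is done
            run_job(job)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Because no worker owns a queue of its own, a worker that draws several quick chunks simply comes back for more while its peers are still busy with slow ones.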

The Next Generation

Today, WPT is migrating from the Buildbot-powered system to some incredible hosted solutions: Taskcluster from Mozilla and Azure Pipelines from Microsoft. As we wind down our virtual machines, we’re asking ourselves: was running this system worth the trouble?

Well, back in 2017, we had no choice. Taskcluster, though public, was still developing features that WPT needed. Azure Pipelines didn’t exist. wpt.fyi required some extension to support integration with a service like Taskcluster. Building and deploying our own system allowed us to begin publishing results only a few weeks into the project, meaning we were able to help implementers prioritize interoperability work far sooner.

Reducing time-to-market was probably the least important benefit, though. By maintaining this system, we were exposed to all sorts of rare bugs in WPT’s infrastructure. We reported these as a matter of course, and we fixed them when we could. In this way, collecting results helped us to improve the experience of using WPT for everyone. It even helped us to catch a few regressions in the browsers themselves! Having test results from a system that was totally under our control gave us a baseline for evaluating results collected on third-party services. Finally, we were able to help the Taskcluster team diagnose and resolve the issues that blocked our adoption while we were collecting results on our system.

So despite all the flaky tests, disastrous crashes, sickly workers, and wasteful schedules, it was certainly worth it. We’d do things a little differently if we were back in 2017, but the experience of making a mistake can be just as rewarding as the lesson learned. For now, though, we’re glad to offload some of the operations work; we’ve got still bigger problems to solve.