The Eight-Month Omelette: adding a feature to one million conformance tests

The web-platform-tests project (WPT) houses over a million tests written to ensure our browsers provide a consistent experience of the web. WPT predates most of today’s popular JavaScript testing frameworks, so it implements one of its own: testharness.js. In December of 2017, I offered to extend testharness.js with a new feature. No one expected this to be simple. I’d be messing with some low-level semantics, so the change could effect any existing test. Still, the feature was small enough, and we’d been contributing since January. “This’ll take two weeks, tops,” I thought to myself. Little did I know that I was signing up for over half a year of trial and error: hundreds of regressions, each a broken egg in service of a delicious testing omelette.

The setup

WPT’s testharness.js has long offered a “cleanup” feature which allows test writers to revert shared state following a test, promoting isolation. Here’s what it looks like:

test((t) => {
  const video = document.createElement('video');
  video.src = 'zombocom.ogg';
  document.body.appendChild(video);

  video.play();

  // testharness.js will invoke this function (causing the
  // video to stop playing) whether the test passes or fails
  t.add_cleanup(() => video.remove());

  assert_false(video.paused);
});

For years, WPT contributors have been using that feature to remove elements from the DOM, close popup windows and clearlocalStorage. Unlike those APIs, many of the web features being designed and implemented today are asynchronous. Calling the API starts the asynchronous action, but program flow continues without waiting for the action to finish. For instance: invoking document.exitFullscreen() will cause the browser only to begin leaving “full screen” mode. The next statement in your code will execute before the window has shrunk down to size. The function returns a promise which will settle when the operation is complete.

Back in 2017, there was no way to instruct testharness.js to wait for asynchronous cleanup operations. You could call document.exitFullscreen(), but the next test might start to run before it was done. This could even be a problem between tests in different files. In automation, we run these tests as fast as our computers can handle. Without this feature, we were liable to interrupt cleanup by navigating to the next test.

We first experienced this need while importing Chromium’s tests for Service Workers into WPT where we had to do some pretty wonky things to make sure those tests didn’t leave a mess. After that, use cases kept popping up: tests for Async Cookies, IndexedDB, and even HTTP would all benefit from the functionality.

That’s why Bocoup teamed up with Google’s Ecosystem Infra team to implement “async cleanup.” In broad strokes, we set out to extend thatadd_cleanup function to recognize when test authors return a Promise and wait for it to settle before continuing:

promise_test(async function(t) {
  await fetch('./set-http-only-cookie.py');

  // We wanted testharness.js to wait for the Promise returned
  // by `fetch` to settle before continuing to the next test.
  // This hypothetical test uses a Python script to unset an
  // HTTP-only cookie.
  t.add_cleanup(() => fetch('./unset-http-only-cookie.py'));

  const result = await fetch('./query-http-only-cookie.py');

  assert_equals(await result.text(), 'set');
});

When we started, the test suite for testharness.js couldn’t be run from the command line, so it couldn’t be automated in continuous integration. Instead, a human operator had to manually open a handful of files in a web browser. We were a long way from running the harness tests with every commit to master. Indeed, most contributors probably weren’t aware that the tests even existed.

This isn’t quite as bad as it might sound, though. (The more rigorous test writers out there may want to skip this paragraph.) That’s because this code was used to author tests, so every time a contributor wrote a test for a web platform feature, they were tacitly validating the harness’s behavior. In a way, WPT was itself the test suite for testharness.js.

That’s how we decided to validate our work, anyway. To make sure the feature didn’t cause any regressions, we’d schedule a trial by running all of WPT against Chromium, the open source version of Chrome (one of many perks to having Chromium contributors on your team). We’d then compare the results of the trial with the results of running WPT without the new feature. Any discrepancies between the two datasets would indicate that the new feature had invalidated some tests.

And we found some discrepancies, let me tell you.

Broken egg #1: Too slow

testharness.js offers a few different ways to define tests. The global testfunction creates a fully-synchronous test, while the promise_test function creates an asynchronous test whose result is determined by the resolution of a promise.

Since a given file in WPT might include both kinds of tests, determining when all tests are complete is a little tricky. We initially implemented an “always asynchronous” behavior using setTimeout. Here’s the idea in pseudocode:

runTests(tests, () => {
  // `runTests` may invoke this function synchronously.
  // Use `setTimeout` to ensure `all_done` is always
  // executed asynchronously.
  setTimeout(all_done, 0);
});

This guaranteed that the harness would always complete asynchronously, even when there were only fully-synchronous tests to execute. That kind of consistency is a highly desirable trait in any JavaScript API, but it was also the first cause for regression.

We ran all of WPT using this new version of testharness.js, and Chromium failed a slew of conformance tests that it was supposed to be passing. The failure report ostensibly described specification violations (for instance, a failure in css/css-values/calc-unit-analysis.html looks like a problem with Chromium’s CSS engine), but we knew that it was really our fault. It just took some digging to sort things out.

In this case, we learned that many tests in WPT happen to produce an uncaught exception asynchronously. Prior to our patch, these exceptions were hidden because testharness.js would report the test results before they occurred. Our patch was too slow because the delay it introduced influenced the result of those tests.

Broken egg containing a turtle
This iteration was too slow.

Generally, uncaught exceptions are suspicious. If they occur between tests, then testharness.js reports an error. We were tempted to consider those the bug and to move forward by updating the tests themselves. In the end, we decided that ignoring such “late” uncaught exceptions had ossified into a full-fledged feature of testharness.js, making it our responsibility to preserve it.

runTests(tests, () => {
  // `runTests` may invoke this function synchronously.
  all_done();
});

As much as we don’t like an “occasionally asynchronous” interface, our priorities for compatibility and simplicity won out.

Broken egg #2: Too unlucky

Our initial diagnosis explained most of the regressions, but not all of them. We foolheartedly hoped that our fix for the bugs we understood would conveniently also resolve the bugs we didn’t. Those puzzling regressions stuck around, so we’d have to keep digging. (That was probably for the best; a fix that you don’t understand can really wear at your conscience, kinda like a sinister talking bird.)

The browser is a messy place. From event loops to (micro-)task queues, layout reflow to paint cycles, and browsing contexts to environment settings objects, a huge amount of work has to happen just so in order for your favorite websites to work the way they do. The same can be said for automated tests like those in WPT. Whether writing the code for the browser or the code that tests the browser, it’s deceptively easy to get things wrong in subtle ways. Mistakes like these can sneak into projects because they only cause problems occasionally and under very specific conditions.

What does all this mean for the web-platform-tests? We are forever fighting against “flaky” tests–those that seem to pass or fail randomly with each execution. It also means that even once we’ve identified an intermittent failure, we can never predict if the mistake is in the browser or in the test.

Many of the regressions that we hadn’t yet identified were a result of pre-existing flakiness. While we watched them go from “passing” to “failing” after we introduced the feature, that wasn’t enough to place the blame on the patch. We also had to research their history in Chromium to see if they were known to be bad apples.

Broken egg containing a black cat
This iteration was just unlucky.

We certainly wanted to fix tests like this in all cases. However, doing so was beyond the scope of our current work. For the rest of this undertaking, once we verified that a test was known to behave sporadically, we took it off our plate.

That might seem like a cop-out, but rest assured: we had plenty of other challenges to work through.

Broken egg #3: Too asynchronous

Every bit of code we added was a potential source of error. That might be a pessimistic way of looking at it, but it motivated us to prefer the simplest possible implementation. Initially, this involved interpreting the result of all cleanup functions asynchronously, even if they didn’t return a Promise. We did this by passing each function’s return value to Promise.resolve. That transformed synchronous values to promises without changing the meaning of promise values. From there, we’d have a consistent interface for scheduling the next test:

Promise.resolve(test.cleanup())
  .then(() => {
    const nextTest = tests.pop();
    nextTest.run();
  });

In testharness.js, a typical synchronous test is defined using the globaltest function. It turns out that in some rare cases, inserting any kind of delay between these tests invalidates them. Our unilateral application ofPromise.resolve did exactly that because it scheduled work on the browser’s microtask queue. We’d need to complicate things a bit.

Broken egg containing a stop watch
This iteration was too asynchronous.

Fortunately, tests defined via the global promise_test function were already being scheduled with a delay. We decided to limit the new functionality to those particular tests:

const result = test.cleanup();
const nextTest = tests.pop();

if (test.is_promise_test) {
  Promise.resolve(result)
    .then(() => nextTest.run());
} else {
  nextTest.run();
}

This distinction between test and promise_test is a little dangerous, though. Contributors could easily use test to write a test that required asynchronous cleanup, but our solution couldn’t support that. Cleanup for tests like that would still be susceptible to interruption. That’s why we also updated the harness to report an error if a synchronous cleanup function returned a Promise.

const result = test.cleanup();
const nextTest = tests.pop();

if (test.is_promise_test) {
  Promise.resolve(result)
    .then(() => nextTest.run());
} else {
  if (result && typeof result.then === 'function') {
    throw new Error('Only `promise_test` supports async cleanup');
  }
  nextTest.run();
}

In the process, we sowed the seeds of the next regression.

Broken egg #4: Too nosy

Although we restricted the return value as a well-meaning safety mechanism, the next trial in Chromium showed that the change was a little invasive. Here’s some code from a test that regressed:

test((t) => {
  t.add_cleanup(() => document.exitFullscreen());

  // etc.
});

This code gets straight to the point without any syntactic flourishes: “exit fullscreen mode when the test is complete.” That kind of brevity can be helpful in complicated conformance tests. However, there’s a lot more going on here than meets the eye:

  1. When the body of an ES6 arrow function is not wrapped in braces, the function returns the result of its only expression.
  2. document.exitFullscreen returns a Promise in some browsers (others are still catching up to a recent change to the specification).

The problem for our patch might be a little more clear if we refactor the example to make these details explicit:

test((t) => {
  t.add_cleanup(() => {
    var result = document.exitFullscreen();

    if (result !== undefined) {
      return result;
    }
  });

  // etc.
});

We tried to help by disallowing Promise return values within test, but in browsers that implement the new “return a Promise” feature, this test suddenly began to fail.

Broken egg containing Killroy
This iteration was too nosy.

Ideally, if the Promise is available, the test should wait for it to settle. Our patch was big enough as it was, though. We wanted to avoid changing the meaning of the tests whenever possible. We could revisit tests like these with smaller, dedicated patches after the feature was complete.

To avoid problems from implicit return values, we decided to modify the tests themselves. We updated all existing uses of arrow functions with add_cleanupto be a little less concise:

test((t) => {
  t.add_cleanup(() => { document.exitFullscreen(); });

  // etc.
});

That arrow function returns undefined no matter what the exitFullscreen method returns. A couple curly braces go a long way!

Broken egg #5: Too fragile

The number of regressions was decreasing with each trial. Still, it seemed like every time we fixed one issue, we uncovered some latent problem. The next trial we ran in Chromium once again caused the browser to output different information. Here’s an example of how the output changed for one test:

- CONSOLE ERROR: line 2684: Uncaught Error: assert_false: Should not call createdCallback in UA ShadowRoot. expected false got true
- CONSOLE ERROR: line 2684: Uncaught Error: assert_false: Should not call attachedCallback in UA ShadowRoot. expected false got true
- CONSOLE ERROR: line 2684: Uncaught Error: assert_false: Should not call createdCallback in UA ShadowRoot. expected false got true
- CONSOLE ERROR: line 2684: Uncaught Error: assert_false: Should not call attachedCallback in UA ShadowRoot. expected false got true
+ CONSOLE ERROR: line 2962: Uncaught Error: assert_false: Should not call createdCallback in UA ShadowRoot. expected false got true
+ CONSOLE ERROR: line 2962: Uncaught Error: assert_false: Should not call attachedCallback in UA ShadowRoot. expected false got true
+ CONSOLE ERROR: line 2962: Uncaught Error: assert_false: Should not call createdCallback in UA ShadowRoot. expected false got true
+ CONSOLE ERROR: line 2962: Uncaught Error: assert_false: Should not call attachedCallback in UA ShadowRoot. expected false got true
  This is a testharness.js-based test.
  Harness Error. harness_status.status = 1 , harness_status.message = Uncaught Error: assert_false: Should not call attachedCallback in UA ShadowRoot. expected false got true
  PASS SVG <use> shadow trees should not be exposed through custom elements.
  Harness: the test ran to completion.

This was all sorts of confusing. The test was originally failing, and it continued to fail with our patch applied. The lines starting CONSOLE ERRORare what changed: they are uncaught exceptions that testharness.js wasn’t aware of. We really thought we had the “uncaught exception” problem licked, so it was surprising to see this kind of failure crop up again.

Unlike last time, the number and content of the uncaught exceptions didn’t change. Instead, the change was in the line number which produced the exception. So what happened?

Our new feature shifted the location of the code in testharness.js. Since the line numbers are present in Chromium’s “test expectations” files, the shifted errors violated the expectations of the test runner. In other words, we broke this test because we added code.

The solution was obvious: we simply removed the newline characters from our patch so that it could be applied without altering the position of the surrounding statements.

Kidding! This was a case where Chromium’s expectations were too “brittle”–they broke in response to reasonable change. While the Chromium infrastructure probably shouldn’t encode line numbers, addressing that problem was beyond the scope of our work. We decided update the test to handle those exceptions. No uncaught exceptions meant no CONSOLE ERROR lines which meant no change in output.

Broken egg containing a cracked egg
This iteration was too fragile.

Meanwhile, the web platform pushed forward. One of the newer features of the<script> element is the integrityattribute. It allows developers to guarantee that their scripts aren’t modified by some third party. Just as we were wrapping up this latest “regression,” the first web-platform-tests to use the integrity attribute were merged tomaster. Those tests included markup like:

<script src="/resources/testharness.js" integrity="sha384-4Nybydhnr3tOpv1yrTkDxu3RFpnxWAxlU5kGn7c8ebKvh1iUdfVMjqP6jf0dacrV"></script>

Quite a mouthful, huh? More than being verbose, this code also resisted changes to testharness.js. The new feature invalidated these tests because it changed the cryptographic hash of the file. The integrity value is sensitive to any change (not just newlines), so the ol’ “cram everything on one line” trick wouldn’t work here.

We needed to parameterize that hash value so that the test would be resilient to all future changes. Fortunately, the WPT server already allowed authors to inject dynamic values into their test’s markup, allowing them to reference the server’s domain name, port number, etc. Here’s what it looks like to use that feature:

<img src="http://{{domains[]}}:{{ports[http][0]}}/images/smiley.png">

To resolve this particular brittleness, we had to extend the templating system to support a function invocation syntax and introduce a file_hashfunction. This is what those <script> tags look like today:

<script src="/resources/testharness.js" integrity="sha384-{{file_hash(sha384, resources/testharness.js)}}"></script>

Not much prettier, but certainly more resilient.

Broken egg #6: Too fidgety

Chromium’s test results were finally consistent, with or without the patch applied, so things were starting to look pretty good! We weren’t done, though, because Chromium could only gave us part of the picture.

When a test fails in WPT, it generally does so by throwing an exception. That interrupts the normal control flow, so the code that immediately follows is not executed. Any test that Chromium failed may have contained some unused code. Other browsers that passed such tests would execute the code, and more regressions could be lurking in those corners. For example,

test(() => {
  // This assertion causes an exception in Chromium. (Don't worry: we wouldn't
  // write a test like this in WPT.)
  assert_false(/Chrome/.test(navigator.userAgent));

  // Chromium can't reach this section. In all our trials to date, we had
  // never executed these statements!

  t.add_cleanup(() => {
    // There was no telling what kind of edge cases were hidden away here.

    // Spooky!
  });
});

That’s why we requested assistance from the maintainers of Firefox and Servo (an experimental browser from Mozilla). By scheduling a “dry run” of the kind we’d been running in Chromium all this time, they could give us another perspective on the correctness of the patch.

If you’ve picked up on the theme of this blog post, then you won’t be surprised to read that they found more problems. This time, we learned that we’d broken tests which had been marked as TIMEOUT. In this context, “timing out” means that the test failed to produce results after a significant delay, and so testharness.js forced it to exit. After applying our patch, Firefox’s results for many tests went from TIMEOUT to PASS.

Broken egg containing a cup of coffee
This iteration was too fidgety.

Even unresponsive tests may modify global state. That’s why we were careful to run cleanup callbacks even after designating the test as “timed out.” This particular invocation of the cleanup callbacks contained a mistake, though: it fidgeted with the TIMEOUT result by changing it to PASS. To fix it, we only modified the result if it hadn’t already been set.

Cleaning up

Finally, our patch passed muster in Chromium, Firefox, and Servo. It was accepted into WPT, and we began extending all sorts of conformance tests to make use of it.

Eight months might seem like a long time to spend on an ergonomic test-writing feature. It was certainly longer than we expected. WPT has plenty of open infrastructure issues, many of which seem more compelling than putting things back in their place. You could be forgiven for assuming that we’d been indoctrinated by a fastidious nanny. Actually, we were motivated by more rational factors (although with considerably less singing).

We’re convinced that keeping WPT relevant is critical to the health of the web platform, so any improvement that makes it more appealing for people to add more tests is strategic. Maybe four weeks into the project (just as that somewhat idealistic inspiration was starting to wane), we began to recognize a slew of more concrete benefits that “async cleanup” was driving:

Through this work, we improved the conformance tests themselves. This included extending coverage, resolving flakiness (3 months on that one alone), and fixing timeouts.

We enhanced the web-platform-tests infrastructure. The file hash computation made the server more powerful. We followed up with improvements to testharness.js that made it better able to detect test bugs. Today, the framework alerts authors when they misuse test and when they misusepromise_test. Both of those features caught test bugs in Chromium.

Maybe most importantly, we increased test coverage for testharness.js. Most of those “broken eggs” showed us how we’d slipped up, but they also demonstrated missing test coverage for the framework. We memorialized each lesson learned with a new set of unit tests. Today, the expected behavior of testharness.js is much more tightly circumscribed by fully-automated tests. The test suite will be an asset to future contributors, giving them confidence about their changes in a matter of seconds rather than the hours necessary to run all of WPT.

As much as we love cleaning up after ourselves, we’re most proud of these improvements. Speaking personally, I’m always impressed at the level dedication of the folks who work on the web platform. Each of the 20+ pull requests referenced here is full of careful feedback from people across the industry, none more so than the feature branch itself which sports over 60 pieces of feedback from Philip J├Ągenstedt alone. Inserting a new feature under a million tests may have been the most challenging task we took on this year, but thanks to the comradery of those experts, it was definitely the most rewarding.

Comments

Contact Us

We'd love to hear from you. Get in touch!

Phone

+1 617-379-2752

Mail

P.O. Box 961436
Boston, MA 02196