Have you ever noticed how easy it can be to overlook small problems in everyday life? Some things start off as imperceptible but slowly intensify over time, and just like the apocryphal frog in boiling water, we acclimate. In pathological cases, we don’t recognize what’s happening until the issue has grown way out of hand.

This might have you thinking of a squeaky door or some unkempt bushes, but today, I’m talking about software.

Test262 is the official conformance test suite for the JavaScript programming language, and Bocoup’s been maintaining it for six years now. We give it our all during day-to-day maintenance, but we can’t help but wonder if our routine has blinded us to more insidious problems.

That’s why a little over a year ago, I took a week to turn the project on its head, hoping to see it as if for the first time. It was a great learning experience, which means (of course) it was nothing like I expected.

Down on the upside

To understand how we changed our perspective, you’ll have to learn a bit about how Test262 tests are normally interpreted.

Most of the tests are valid within JavaScript’s strict mode and without it. Implementers are expected to run those tests twice, once with strict mode enabled and once with strict mode disabled. We informally refer to these related executions as “scenarios.”

There are plenty of tests that are only relevant with strict mode disabled, and still others that only make sense with strict mode enabled. We annotate those tests with special metadata (noStrict and onlyStrict, respectively) so implementers know to only run them once.

Presumably, doing the opposite (e.g. running a test labeled noStrict in strict mode) would result in a failure. We’ve never told anyone to do that, though, so I got to wondering what would actually happen. Maybe we’d find that tests were using the metadata incorrectly. Or maybe we’d find new bugs in the implementations. Or maybe we’d find that everything was perfect and nothing needed changing. But I doubted it.
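To make those rules concrete, here is a minimal sketch of how a runner might pick scenarios from a test’s flags, and how this experiment flipped them. The function names and scenario labels are invented for illustration, and other flags (such as raw and module) are ignored to keep things short.

```javascript
// Hypothetical helpers; the scenario labels are made up for this sketch.

// The scenarios an implementer is normally expected to run.
function normalScenarios(flags) {
  if (flags.includes('onlyStrict')) return ['strict'];
  if (flags.includes('noStrict')) return ['non-strict'];
  return ['non-strict', 'strict'];
}

// The scenarios nobody is told to run: the opposite of what the flag says.
function invertedScenarios(flags) {
  if (flags.includes('onlyStrict')) return ['non-strict'];
  if (flags.includes('noStrict')) return ['strict'];
  return []; // unflagged tests already run in both modes
}

// Strict mode is enabled by prepending a "use strict" directive.
function sourceFor(testSource, scenario) {
  return scenario === 'strict' ? '"use strict";\n' + testSource : testSource;
}
```

With this in place, invertedScenarios(['onlyStrict']) yields the one run that the metadata says to skip: the non-strict one.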

With over 72,000 tests in total, the only feasible way to do large-scale analysis was to actually execute the tests in a real JavaScript engine. We were hard-pressed to choose just one, though, since none of our options passed all of the tests. That’s why we studied the behavior of three different engines: SpiderMonkey (which powers Firefox), V8 (which powers Chrome and Node.js), and JavaScriptCore (which powers Safari).

Mostly, we found test bugs, but we also found opportunities to improve the testing infrastructure. Even more surprising: we discovered gaps in the test suite’s coverage of the specification.

Test bug: unnecessary flags

The most common test bug was superfluous use of the flags. In many cases, tests declared noStrict or onlyStrict when the behavior under test was actually expected in both modes. For example, check out this test for the global object:

// Copyright (c) 2012 Ecma International.  All rights reserved.
// This code is governed by the BSD license found in the LICENSE file.

/*---
es5id: 10.2.1.1.3-4-22-s
description: >
    Strict Mode - TypeError is not thrown when changing the value of
    the Constructor Properties of the Global Object under strict mode
    (Object)
flags: [onlyStrict]
---*/

var objBak = Object;

try {
  Object = 12;
} finally {
  Object = objBak;
}

This test explicitly concerns strict mode, but the semantics are the same even when strict mode isn’t enabled. We removed the onlyStrict flag so consumers would run the test in both scenarios.
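You can check the equivalence yourself with a sketch like the one below. It builds the test body with the Function constructor so that strictness is controlled by a directive rather than by the surrounding code. The helper name is made up for this post, and running the body as function code is only an approximation of how a real runner executes a script.

```javascript
// Hypothetical helper: runs the body of the test above with or without a
// "use strict" directive and reports whether anything was thrown.
function reassigningObjectThrows(strict) {
  var body =
    (strict ? '"use strict";\n' : '') +
    'var objBak = Object;\n' +
    'try {\n' +
    '  Object = 12;\n' +
    '} finally {\n' +
    '  Object = objBak;\n' + // always restore the real constructor
    '}';
  try {
    new Function(body)();
    return false;
  } catch (e) {
    return true;
  }
}

var throwsInStrict = reassigningObjectThrows(true);
var throwsInSloppy = reassigningObjectThrows(false);
```

In a conforming engine, both variables end up false, which is why the onlyStrict flag was doing nothing but hiding a valid scenario.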

Test bug: unintended syntax errors

We also found a handful of tests that had unintended syntax errors. The tricky part was, they were supposed to include invalid syntax. It was only by intentionally misinterpreting these tests that we discovered the gotcha: they were failing to parse for the wrong reason. Here’s an example:

/*---
description: >
  It is a Syntax Error if LeftHandSideExpression is neither an ObjectLiteral
  nor an ArrayLiteral and IsValidSimpleAssignmentTarget(LeftHandSideExpression)
  is false. (for-await-of statement in an async function declaration)
esid: sec-for-in-and-for-of-statements-runtime-semantics-labelledevaluation
features: [destructuring-binding, async-iteration]
flags: [generated, onlyStrict, async]
negative:
  phase: parse
  type: SyntaxError
---*/
$DONOTEVALUATE();

async function fn() {
  for await ([arguments] of [[]])
}

This test is intended to fail in strict mode because it assigns to arguments, and that’s a no-no. However, that’s not the only syntactic infraction (there’s a free band name for you). Can you spot the other?

We won’t blame you if you can’t; we missed it the first time around, after all. Following that nest of brackets and parentheses, there ought to be a statement of some kind, but there’s nothing. That’s also a no-no. Engines that correctly reported a syntax error were just as likely to be complaining about the for loop as the arguments assignment. We corrected the tests by inserting an empty block.

A syntax error is a syntax error, right? What difference does it make how it’s produced? As it happens, JavaScriptCore was only passing that particular test because of the unintentional syntax error. The engine parsed the corrected file without producing an error, so our fix uncovered a failure!
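Here is a rough sketch of why running the test in the “wrong” mode exposed the problem. The parses helper is hypothetical, and it parses the source as a function body, which is close to but not identical to how a runner handles a script.

```javascript
// Hypothetical helper: reports whether the source parses in the given mode.
function parses(source, strict) {
  try {
    new Function((strict ? '"use strict";\n' : '') + source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

var original =
  'async function fn() {\n' +
  '  for await ([arguments] of [[]])\n' +
  '}';

var corrected =
  'async function fn() {\n' +
  '  for await ([arguments] of [[]]) {}\n' +
  '}';

// In a conforming engine:
// - the original fails to parse in BOTH modes (the loop has no body), so
//   the failure cannot be about strict mode;
// - the corrected version parses outside of strict mode and is rejected
//   only within it, which is the behavior the test was written to check.
var results = {
  originalSloppy: parses(original, false),
  originalStrict: parses(original, true),
  correctedSloppy: parses(corrected, false),
  correctedStrict: parses(corrected, true)
};
```

An engine that accepts the corrected source in strict mode shows up here as correctedStrict being true, which is exactly the JavaScriptCore failure described above.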

We love discovering bugs here at Bocoup. It’s an important step in our mission to improve interoperability on the web. I pictured myself filing a report in the WebKit bug tracker and, following a hazy sequence of escalations, being paraded around Infinite Loop on Tim Cook’s shoulders.

…but I’ll have to dream on. Test262 already had a more generic test case for that behavior, so there was nothing new to report to the JavaScriptCore maintainers. It’s too bad that Test262 is so darned thorough.

Test bug: overly minimal

We generally prefer that each individual test verifies a single “behavior.” A test can only fail once, so in order to give implementers a clearer picture of their bugs, we avoid asserting too many details at the same time.

That said, it’s possible for a test to be too minimal. It’s not common, but it was a problem with a few of the tests we found. Here’s an example:

/*---
es5id: 10.6-14-1-s
description: Strict Mode - 'callee' exists under strict mode
flags: [onlyStrict]
---*/

var argObj = function () {
  return arguments;
}();

assert(argObj.hasOwnProperty("callee"), 'argObj.hasOwnProperty("callee") !== true');

This test verifies only the presence of the callee property. That can be satisfied in strict mode or outside of strict mode. We almost classified this as another case of unnecessary flags. After all, removing onlyStrict would produce a second valid scenario, and that would improve coverage.

But wait! There’s more that’s interesting about the callee property as it relates to strict mode. It can be deleted without strict mode, but it can’t be deleted within it. If this test were more specific (asserting the complete property descriptor), then it would actually warrant the onlyStrict flag. As another testament to Test262’s thoroughness, such tests already existed (e.g. for noStrict and for onlyStrict). So we just removed the overly minimal ones.
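The difference is easy to see once you look at the whole property descriptor instead of mere presence. The following is a small sketch rather than an actual Test262 test; it uses the Function constructor so each function’s strictness doesn’t depend on the surrounding code.

```javascript
// Function-constructor functions are non-strict unless their body says so.
var sloppyArgs = new Function('return arguments;');
var strictArgs = new Function('"use strict"; return arguments;');

var sloppyDesc = Object.getOwnPropertyDescriptor(sloppyArgs(), 'callee');
var strictDesc = Object.getOwnPropertyDescriptor(strictArgs(), 'callee');

// Both arguments objects have an own "callee" property, so a presence
// check alone cannot tell the two modes apart. The descriptors can:
var summary = {
  sloppy: {
    isDataProperty: 'value' in sloppyDesc,   // a plain data property...
    configurable: sloppyDesc.configurable    // ...that can be deleted
  },
  strict: {
    isDataProperty: 'value' in strictDesc,   // an accessor that throws...
    configurable: strictDesc.configurable    // ...and cannot be deleted
  }
};
```

In a conforming engine, summary.sloppy is a configurable data property and summary.strict is a non-configurable accessor property.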

Test bug: false positives

We found one other kind of test bug, and only a single test that exhibited it:

/*---
es5id: 15.2.3.6-4-243-2
description: >
    Object.defineProperty - 'O' is an Array, 'name' is an array index
    named property,  'name' is accessor property and  assignment to
    the accessor property, fails to convert accessor property from
    accessor property to data property (15.4.5.1 step 4.c)
includes: [propertyHelper.js]
flags: [onlyStrict]
---*/

var arrObj = [];
function getFunc() { return 3; }
Object.defineProperty(arrObj, "1", {
  get: getFunc,
  configurable: true
});

try {
  arrObj[1] = 4;
} catch (e) {
  verifyEqualTo(arrObj, "1", getFunc());
  verifyNotEnumerable(arrObj, "1");
  verifyConfigurable(arrObj, "1");

  if (!(e instanceof TypeError)) {
    $ERROR("Expected TypeError, got " + e);
  }
}

This test is intended to verify that the property assignment produces a TypeError and that the property is not modified. However, it doesn’t account for the possibility that no error is thrown in the first place. A JavaScript engine that incorrectly permitted the assignment would skirt by unnoticed and pass the test.

As most experienced unit testers will tell you, verifying exceptions can be tricky. That’s why so many testing frameworks offer utility functions; it’s just too easy to make mistakes like the one above. Test262 is no different, so we fixed this by making use of the project’s assert.throws function.
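The shape of the fix looks roughly like this. To keep the sketch self-contained, assertThrows is a simplified stand-in that I wrote for this post, not the project’s real assert.throws, and the plain descriptor checks replace the propertyHelper.js functions.

```javascript
// Simplified stand-in for an assert.throws-style helper (hypothetical).
function assertThrows(ErrorCtor, fn) {
  try {
    fn();
  } catch (e) {
    if (!(e instanceof ErrorCtor)) {
      throw new Error('Expected ' + ErrorCtor.name + ', got ' + e);
    }
    return;
  }
  // This is the branch the original test was missing.
  throw new Error('Expected ' + ErrorCtor.name + ' but nothing was thrown');
}

var arrObj = [];
function getFunc() { return 3; }
Object.defineProperty(arrObj, '1', {
  get: getFunc,
  configurable: true
});

assertThrows(TypeError, function () {
  'use strict'; // the assignment only throws in strict mode
  arrObj[1] = 4;
});

// The verifications now run whether or not an error was thrown.
var desc = Object.getOwnPropertyDescriptor(arrObj, '1');
```

Because the “nothing was thrown” case is now an explicit failure, an engine that silently permits the assignment can no longer pass.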

Infrastructure deficiencies

This experiment also exposed a few problems with how we were interpreting tests.

For instance, we found a subtle bug in the helper function used to verify object properties. Take a look at the flawed implementation:

function isConfigurable(obj, name) {
  try {
    delete obj[name];
  } catch (e) {
    if (!(e instanceof TypeError)) {
      $ERROR("Expected TypeError, got " + e);
    }
  }
  return !Object.prototype.hasOwnProperty.call(obj, name);
}

This function is designed to determine if a given property is configurable (that is: if it can be deleted) by attempting to delete it and inspecting the result. It fails for one particular input, though. Can you guess which?

Time’s up. As written, isConfigurable would report incorrect results if it was called with the Object prototype and the string “hasOwnProperty”. In conforming JavaScript engines, it would successfully delete the property and then be unable to verify the result of the deletion. This didn’t directly impact any tests, but it was a rough edge nonetheless, so we smoothed it out.
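One way to sidestep the problem is to capture a reference to hasOwnProperty before any deletion can happen. This is a sketch of that idea, not necessarily the exact change that landed in the harness; the hasOwn name is my own, and a plain Error stands in for $ERROR.

```javascript
// Captured once, up front. The bound function holds on to the original
// hasOwnProperty function object even if the property is later deleted.
var hasOwn = Function.prototype.call.bind(Object.prototype.hasOwnProperty);

function isConfigurable(obj, name) {
  try {
    delete obj[name];
  } catch (e) {
    if (!(e instanceof TypeError)) {
      throw new Error('Expected TypeError, got ' + e);
    }
  }
  // No lookup on Object.prototype at this point, so it no longer matters
  // whether the property we just deleted was hasOwnProperty itself.
  return !hasOwn(obj, name);
}

// The troublesome input, with the property restored afterwards.
var saved = Object.getOwnPropertyDescriptor(Object.prototype, 'hasOwnProperty');
var result = isConfigurable(Object.prototype, 'hasOwnProperty');
Object.defineProperty(Object.prototype, 'hasOwnProperty', saved);
```

Like the original, this helper is destructive: it answers the question by actually deleting the property, so callers need to restore anything they still need.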

We also learned that many tests included helper files without actually using them. This didn’t threaten the accuracy of test results, but it was still worth fixing. For one, it made tests longer than they had to be. With over 72,000 tests, a few superfluous bytes here and there can have a perceptible impact on the time it takes to load, parse, and execute the entire suite. Just as important, the unnecessary dependencies made the tests harder for us humans to understand.

We removed all the needless “includes” directives, and we extended the project’s self-tests to help folks avoid making the same mistake again.
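A check along these lines can be sketched in a few lines. To be clear, this is not the project’s actual self-test: the function name is invented, the table of helper names is a small illustrative subset, and only the inline includes: [a.js, b.js] form is handled.

```javascript
// Illustrative subset only; the real harness files define more names.
var HELPER_EXPORTS = {
  'propertyHelper.js': [
    'verifyProperty', 'verifyEqualTo',
    'verifyWritable', 'verifyNotWritable',
    'verifyEnumerable', 'verifyNotEnumerable',
    'verifyConfigurable', 'verifyNotConfigurable'
  ],
  'compareArray.js': ['compareArray']
};

// Returns the included helper files whose names never appear in the test.
function unusedIncludes(source) {
  var frontmatter = /\/\*---([\s\S]*?)---\*\//.exec(source);
  if (!frontmatter) return [];
  var includes = /includes:\s*\[([^\]]*)\]/.exec(frontmatter[1]);
  if (!includes) return [];

  // Ignore the metadata itself; descriptions may mention helper names.
  var body = source.replace(frontmatter[0], '');

  return includes[1]
    .split(',')
    .map(function (file) { return file.trim(); })
    .filter(function (file) {
      var names = HELPER_EXPORTS[file];
      if (!names) return false; // unknown helper: no opinion
      return !names.some(function (name) {
        return new RegExp('\\b' + name + '\\b').test(body);
      });
    });
}
```

A text search like this is crude (a helper name inside a comment counts as a use), but it is cheap enough to run against every test in the suite.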

Missing test coverage

Legacy RegExp Features is a proposed extension to the JavaScript programming language (and kind of a strange one, at that). It was thought to be well-tested in Test262, and tests are an important requirement for reaching stage 4 of the standardization process. Through our work on this side project, we discovered that most of the proposal did not have any tests.

The proposal stalled a bit in the months that followed, but someone has just recently stepped up to fill out the missing coverage. With that patch merged, the proposal is just a little closer to standardization.

Back to the known

Even though we didn’t know what to expect from this experiment, we were happy with the results. Sure, the one-off fixes were nice, and the structural enhancements were even better. Mostly, though, we were impressed by what we didn’t find.

Imagine discovering some mold beneath the corner of an area rug. That’d have you questioning the cleanliness of the room and maybe the safety of the building. In the case of Test262, the floorboards weren’t spotless, but what we found was more like a few old Milk Duds. Worth cleaning up, but nothing to worry about.

So while there are probably still more subtle bugs in Test262, this experience gave us even greater confidence in the thoroughness of the project. Of course, that’s no reason to rest on our laurels. New tests are being written every day, after all. As new language features are designed and standardized, we’ll be working just as hard to preserve the quality of the test suite.