Have you ever noticed how easy it can be to overlook small problems in everyday life? Some things start off as imperceptible but slowly intensify over time, and just like the apocryphal frog in boiling water, we acclimate. In pathological cases, we don’t recognize what’s happening until the issue has grown way out of hand.
This might have you thinking of a squeaky door or some unkempt bushes, but today, I’m talking about software.
Test262 is the official conformance test suite for the JavaScript programming language, and Bocoup’s been maintaining it for six years now. We give it our all during day-to-day maintenance, but we can’t help but wonder if our routine has blinded us to more insidious problems.
That’s why a little over a year ago, I took a week to turn the project on its head, hoping to see it as if for the first time. It was a great learning experience, which means (of course) it was nothing like I expected.
Down on the upside
To understand how we changed our perspective, you’ll have to learn a bit about how Test262 tests are normally interpreted.
Most of the tests are valid within JavaScript’s strict mode and without it. Implementers are expected to run those tests twice, once with strict mode enabled and once with strict mode disabled. We informally refer to these related executions as “scenarios.”
There are plenty of tests that are only relevant with strict mode disabled, and still others that only make sense with strict mode enabled. We annotate those tests with special metadata (noStrict and onlyStrict, respectively) so implementers know to only run them once.
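To make that concrete, here’s a rough sketch (not the official harness) of how a runner might turn those flags into scenarios. The scenariosFor helper is hypothetical, but the convention it encodes is real: the strict scenario is produced by prepending a "use strict"; directive to the test source.

// Hypothetical sketch: map a test's "flags" metadata to the scenarios an
// implementer should run. Tests with neither flag run in both modes.
function scenariosFor(flags, source) {
  var scenarios = [];
  if (flags.indexOf('onlyStrict') === -1) {
    scenarios.push({ name: 'non-strict', source: source });
  }
  if (flags.indexOf('noStrict') === -1) {
    scenarios.push({ name: 'strict', source: '"use strict";\n' + source });
  }
  return scenarios;
}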
Presumably, doing the opposite (e.g. running a test labeled noStrict in strict mode) would result in a failure. We’ve never told anyone to do that, though, so I got to wondering what would actually happen. Maybe we’d find that tests were using the metadata incorrectly. Or maybe we’d find new bugs in the implementations. Or maybe we’d find that everything was perfect and nothing needed changing. But I doubted it.
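The experiment boiled down to flipping that mapping. Roughly speaking (again, a hypothetical sketch rather than our actual tooling), each flagged test was run only in the scenario its metadata says to skip:

// Run noStrict tests in strict mode and onlyStrict tests without it;
// unflagged tests are already exercised both ways, so skip them here.
function invertedScenariosFor(flags, source) {
  if (flags.indexOf('noStrict') !== -1) {
    return [{ name: 'strict', source: '"use strict";\n' + source }];
  }
  if (flags.indexOf('onlyStrict') !== -1) {
    return [{ name: 'non-strict', source: source }];
  }
  return [];
}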
With over 72,000 tests in total, the only feasible way to do large-scale analysis was to actually execute the tests in a real JavaScript engine. We were hard-pressed to choose just one, though, since none of our options passed all of the tests. That’s why we studied the behavior of three different engines: SpiderMonkey (which powers Firefox), V8 (which powers Chrome and Node.js), and JavaScriptCore (which powers Safari).
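To give a flavor of what that looks like in practice, here’s a hedged sketch of driving several engine shells over a prepared test file from Node.js. The binary paths are placeholders, and this isn’t the exact tooling we used:

// Hypothetical sketch: run one prepared test file in three engine shells and
// record whether each exits cleanly. Paths are placeholders.
const { execFileSync } = require('child_process');

const engines = {
  spidermonkey: '/path/to/js',
  v8: '/path/to/d8',
  javascriptcore: '/path/to/jsc'
};

function runEverywhere(file) {
  const results = {};
  for (const [name, binary] of Object.entries(engines)) {
    try {
      execFileSync(binary, [file], { stdio: 'pipe' });
      results[name] = 'pass';
    } catch (e) {
      results[name] = 'fail';
    }
  }
  return results;
}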
Mostly, we found test bugs, but we also found opportunities to improve the testing infrastructure. Even more surprising: we discovered gaps in the test suite’s coverage of the specification.
Test bug: unnecessary flags
The most common test bug was superfluous use of the flags. In many cases, tests declared noStrict or onlyStrict when the behavior under test was actually expected in both modes. For example, check out this test for the global object:
// Copyright (c) 2012 Ecma International. All rights reserved.
// This code is governed by the BSD license found in the LICENSE file.
/*---
es5id: 10.2.1.1.3-4-22-s
description: >
    Strict Mode - TypeError is not thrown when changing the value of
    the Constructor Properties of the Global Object under strict mode
    (Object)
flags: [onlyStrict]
---*/

var objBak = Object;

try {
  Object = 12;
} finally {
  Object = objBak;
}
This test explicitly concerns strict mode, but the semantics are the same even when strict mode isn’t enabled. We removed the onlyStrict flag so consumers would run the test in both scenarios.
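For illustration (not the literal committed diff), the fix amounts to dropping the flags line from the metadata above; with no flags present, consumers run the test in both scenarios:

/*---
es5id: 10.2.1.1.3-4-22-s
description: >
    Strict Mode - TypeError is not thrown when changing the value of
    the Constructor Properties of the Global Object under strict mode
    (Object)
---*/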
Test bug: unintended syntax errors
We also found a handful of tests that had unintended syntax errors. The tricky part was that they were supposed to include invalid syntax. It was only by intentionally misinterpreting these tests that we discovered the gotcha: they were failing to parse for the wrong reason. Here’s an example:
/*---
description: >
    It is a Syntax Error if LeftHandSideExpression is neither an ObjectLiteral
    nor an ArrayLiteral and IsValidSimpleAssignmentTarget(LeftHandSideExpression)
    is false. (for-await-of statement in an async function declaration)
esid: sec-for-in-and-for-of-statements-runtime-semantics-labelledevaluation
features: [destructuring-binding, async-iteration]
flags: [generated, onlyStrict, async]
negative:
  phase: parse
  type: SyntaxError
---*/

$DONOTEVALUATE();

async function fn() {
  for await ([arguments] of [[]])
}
This test is intended to fail in strict mode because it assigns to arguments, and that’s a no-no. However, that’s not the only syntactic infraction (there’s a free band name for you). Can you spot the other?
We won’t blame you if you can’t; we missed it the first time around, after all. Following that nest of brackets and parentheses, there ought to be a statement of some kind, but there’s nothing. That’s also a no-no. Engines that correctly reported a syntax error were just as likely to be complaining about the for loop as the arguments assignment. We corrected the tests by inserting an empty block.
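For reference, the corrected loop carries an empty block as its body, so the only remaining reason for a parser to reject the file is the assignment to arguments:

async function fn() {
  for await ([arguments] of [[]]) {}
}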
A syntax error is a syntax error, right? What difference does it make how it’s produced? As it happens, JavaScriptCore was only passing that particular test because of the unintentional syntax error. The engine parsed the corrected file without producing an error, so our fix uncovered a failure!
We love discovering bugs here at Bocoup. It’s an important step in our mission to improve interoperability on the web. I pictured myself filing a report in the WebKit bug tracker and, following a hazy sequence of escalations, being paraded around Infinite Loop on Tim Cook’s shoulders.
…but I’ll have to dream on: Test262 already had a more generic test case for that behavior, so there was nothing new to report to the JavaScriptCore maintainers. It’s too bad that Test262 is so darned thorough.
Test bug: overly minimal
We generally prefer that each individual test verifies a single “behavior.” A test can only fail once, so in order to give implementers a clearer picture of their bugs, we avoid asserting too many details at the same time.
That said, it’s possible for a test to be too minimal. It’s not common, but it was a problem with a few of the tests we found. Here’s an example:
/*---
es5id: 10.6-14-1-s
description: Strict Mode - 'callee' exists under strict mode
flags: [onlyStrict]
---*/
var argObj = function () {
  return arguments;
}();
assert(argObj.hasOwnProperty("callee"), 'argObj.hasOwnProperty("callee") !== true');
This test verifies only the presence of the callee property. That can be satisfied in strict mode or outside of strict mode. We almost classified this as another case of unnecessary flags. After all, removing onlyStrict would produce a second valid scenario, and that would improve coverage.
But wait! There’s more that’s interesting about the callee property as it relates to strict mode. It can be deleted without strict mode, but it can’t be deleted within it. If this test were more specific (asserting the complete property descriptor), then it would actually warrant the onlyStrict flag. As another testament to Test262’s thoroughness, such tests already existed (e.g. for noStrict and for onlyStrict).
So we just removed these overly minimal tests.
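If you’re curious, here’s a quick sketch (not one of the existing tests) of the contrast the flag would have been guarding:

// Without strict mode, the arguments object's "callee" property is
// configurable, so deleting it succeeds.
(function () {
  var args = arguments;
  console.log(delete args.callee);            // true
  console.log(args.hasOwnProperty('callee')); // false
}());

// With strict mode, "callee" is a non-configurable accessor property, so the
// same delete throws a TypeError.
(function () {
  'use strict';
  var args = arguments;
  try {
    delete args.callee;
  } catch (e) {
    console.log(e instanceof TypeError);      // true
  }
}());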
Test bug: false positives
We found one other kind of test bug, and only a single test that exhibited it:
/*---
es5id: 15.2.3.6-4-243-2
description: >
    Object.defineProperty - 'O' is an Array, 'name' is an array index
    named property, 'name' is accessor property and assignment to
    the accessor property, fails to convert accessor property from
    accessor property to data property (15.4.5.1 step 4.c)
includes: [propertyHelper.js]
flags: [onlyStrict]
---*/

var arrObj = [];

function getFunc() { return 3; }

Object.defineProperty(arrObj, "1", {
  get: getFunc,
  configurable: true
});

try {
  arrObj[1] = 4;
} catch (e) {
  verifyEqualTo(arrObj, "1", getFunc());
  verifyNotEnumerable(arrObj, "1");
  verifyConfigurable(arrObj, "1");

  if (!(e instanceof TypeError)) {
    $ERROR("Expected TypeError, got " + e);
  }
}
This test is intended to verify that the property assignment produces a TypeError and that the property is not modified. However, it doesn’t account for the possibility that no error is thrown in the first place. A JavaScript engine that incorrectly permitted the assignment would skirt by unnoticed and pass the test.
As most experienced unit testers will tell you, verifying exceptions can be tricky. That’s why so many testing frameworks offer utility functions; it’s just too easy to make mistakes like the one above. Test262 is no different, so we fixed this by making use of the project’s assert.throws function.
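Applied to the test above, the corrected assertion reads something like this (simplified; the property checks from the original still follow):

// If the assignment does not throw a TypeError, assert.throws fails the test
// rather than letting the engine skirt by unnoticed.
assert.throws(TypeError, function () {
  arrObj[1] = 4;
});

// The property should still be the original accessor afterwards.
verifyEqualTo(arrObj, "1", getFunc());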
Infrastructure deficiencies
This experiment also exposed a few problems with how we were interpreting tests.
For instance, we found a subtle bug in the helper function used to verify object properties. Take a look at the flawed implementation:
function isConfigurable(obj, name) {
  try {
    delete obj[name];
  } catch (e) {
    if (!(e instanceof TypeError)) {
      $ERROR("Expected TypeError, got " + e);
    }
  }
  return !Object.prototype.hasOwnProperty.call(obj, name);
}
This function is designed to determine if a given property is configurable (that is: if it can be deleted) by attempting to delete it and inspecting the result. It fails for one particular input, though. Can you guess which?
Time’s up. As written, isConfigurable would report incorrect results if it was called with the Object prototype and the string “hasOwnProperty”. In conforming JavaScript engines, it would successfully delete the property and then be unable to verify the result of the deletion. This didn’t directly impact any tests, but it was a rough edge nonetheless, so we smoothed it out.
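One straightforward fix, sketched in the same style as the helper (not necessarily the exact patch we landed), is to grab a reference to hasOwnProperty before anything gets deleted:

// Cache the built-in up front so the final check still works even when the
// property being deleted is Object.prototype.hasOwnProperty itself.
var __hasOwnProperty = Object.prototype.hasOwnProperty;

function isConfigurable(obj, name) {
  try {
    delete obj[name];
  } catch (e) {
    if (!(e instanceof TypeError)) {
      $ERROR("Expected TypeError, got " + e);
    }
  }
  return !__hasOwnProperty.call(obj, name);
}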
We also learned that many tests included helper files without actually using them. This didn’t threaten the accuracy of test results, but it was still worth fixing. For one, it made tests longer than they had to be. With over 72,000 tests, a few superfluous bytes here and there can have a perceptible impact on the time it takes to load, parse, and execute the entire suite. Just as important, the unnecessary dependencies made the tests harder for us humans to understand.
We removed all the needless “includes” directives, and we extended the project’s self-tests to help folks avoid making the same mistake again.
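The idea behind that self-test is simple enough to sketch (hypothetically; the project’s actual linting machinery is more involved): flag any included helper whose defined names never show up in the test body.

// Hypothetical sketch of an "unused includes" check. helperNames maps each
// helper file to the identifiers it defines.
function findUnusedIncludes(testSource, includes, helperNames) {
  return includes.filter(function (helper) {
    var names = helperNames[helper] || [];
    return !names.some(function (name) {
      return testSource.indexOf(name) !== -1;
    });
  });
}

// Example: reports ['propertyHelper.js'] if the test never calls its helpers.
// findUnusedIncludes(source, ['propertyHelper.js'],
//   { 'propertyHelper.js': ['verifyProperty', 'verifyEqualTo'] });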
Missing test coverage
Legacy RegExp Features is a proposed extension to the JavaScript programming language (and kind of a strange one, at that). It was thought to be well-tested in Test262, and tests are an important requirement for reaching stage 4 of the standardization process. Through our work on this side project, we discovered that most of the proposal did not have any tests.
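For context, the proposal specifies the long-implemented but historically unspecified static RegExp properties; here’s a quick illustration of the sort of behavior it covers (as engines commonly implement it):

// After a successful match, engines update legacy static properties such as
// RegExp.$1 and RegExp.lastMatch.
/(\w+)@(\w+)\.test/.exec('reach us at lois@example.test');
console.log(RegExp.$1);        // "lois"
console.log(RegExp.lastMatch); // "lois@example.test"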
The proposal stalled a bit in the months that followed, but someone has just recently stepped up to fill out the missing coverage. With that patch merged, the proposal is just a little closer to standardization.
Back to the known
Even though we didn’t know what to expect from this experiment, we were happy with the results. Sure, the one-off fixes were nice, and the structural enhancements were even better. Mostly, though, we were impressed by what we didn’t find.
Imagine discovering some mold beneath the corner of an area rug. That’d have you questioning the cleanliness of the room and maybe the safety of the building. In the case of Test262, the floorboards weren’t spotless, but what we found was more like a few old Milk Duds. Worth cleaning up, but nothing to worry about.
So while there are probably still more subtle bugs in Test262, this experience gave us even greater confidence in the thoroughness of the project. Of course, that’s no reason to rest on our laurels. New tests are being written every day, after all. As new language features are designed and standardized, we’ll be working just as hard to preserve the quality of the test suite.