The Caper of the Flaky Test

Posted by Mike Pennisi

Aug 03 2017


The test showed up on my desk just like any other. There I was, working with Google’s Web Platform Predictability team to find so-called “flaky” tests in the venerable Web Platform Tests project. I’d run a hundred or so at a time, over and over again, looking for any that reported inconsistent results. It was a bit like panning for gold, only my prize was more valuable than any precious metal: interoperability.

See, the way I figure it, the Web Platform Tests project is a vital tool for allowing different browser makers to discover and reconcile differences in their respective products. The more reliable the tests are, the more the various browser teams will commit to the project, and the more cohesive the “web platform” becomes overall.

But there I go again, waxing philosophical.

After an hour of those trials, my terminal fingered a single file: eventsource/request-status-error.htm. “Swell,” I thought to myself, “I’ve never even heard of Event Source before.” (Little did I know that my partner Rick had implemented this spec seven years prior. He was off tending to his newborn baby girl, so I’d have to go it alone.) I braced myself for some homework: a few hours with dry specification language.

…but maybe not. Familiarity with the specification is critical to identifying many subtle test errors, but in my experience, it’s not always necessary. Sometimes the bug is so insidious, so rotten, so downright evil that it defies the very fiber of the web platform itself. I didn’t know it at the time, but this was to be one of those bugs.

The Test

Tests in the WPT project are typically defined with plain old HTML files. They express expectations in JavaScript using the testharness.js framework. When web browsers load files like these, testharness.js provides them with a report about the tests that ran and whether they were successful or not. When the behavior under test concerns network activity, the tests may also include additional files that describe special server behavior in Python. That’s how the culprit was structured.

Here’s the salient part of the HTML file:

<script>
  function statusTest(status) {
    var test = async_test(document.title + " (" + status +")")
    test.step(function() {
      var source = new EventSource("resources/status-error.py?status=" + status)
      source.onmessage = function() {
        test.step(function() {
          assert_unreached()
        })
        test.done()
      }
      source.onerror = function() {
        test.step(function() {
          assert_equals(this.readyState, this.CLOSED)
        }, this)
        test.done()
      }
    })
  }
  statusTest("204")
  statusTest("205")
  statusTest("210")
  statusTest("299")
  statusTest("404")
  statusTest("410")
  statusTest("503")
</script>

This test was composed of 7 distinct “sub-tests.” Each one created an EventSource instance, primed to make a slightly different request to a Python script named status-error.py. In every case, the expected behavior was the invocation of the onerror handler. Nothing too fancy here.

For its part, that Python script was even less complicated.

def main(request, response):
  status = (request.GET.first("status", "404"), "HAHAHAHA")
  headers = [("Content-Type", "text/event-stream")]
  return status, headers, "data: data\n\n"

This script was serving a fairly routine role: reply to any request with an HTTP response whose status code matched the request’s “status” parameter.

At first glance, this all seemed very hum-drum. Sure, the server’s “reason phrase” was oddly maniacal, but browsers generally ignore that text. And while in my experience, there’s a strong correlation between semicolon omission and delinquency, I’m too much of a professional to let that cloud my judgement while on the job.

So this wasn’t going to be an open-and-shut case, after all. I rolled up my sleeves, cracked my knuckles, and got to work.

Placing Blame

The test failed for roughly 0.5 percent of trials. In these cases, it was always the sub-test for the HTTP code 503 (i.e. “Service Unavailable”), and the error message read, “assert_equals: expected 2 but got 0”. This meant the readyState property was not being updated as expected. Also noteworthy: in all my testing (which included thousands of trials), Mozilla’s Firefox browser never reported a failure. Only Google Chrome demonstrated instability. This tended to implicate Chrome, but not conclusively so. Other possibilities included:

the problem was with underspecified behavior (so everybody’s right!)
there was a subtle bug in the test itself, and Firefox was actually in error for not failing when that bug was expressed

Whatever the case, I’d need to stick to debugging in Chrome for the time being. Here’s a screenshot of the network activity immediately after the failure:

Screenshot of network tab

Chrome’s developer tools show the responses with codes in the 400 and 500 range in red because those are considered errors in terms of HTTP.

My initial hunch? Some race condition in Chromium’s internals, possibly related specifically to 503 responses and text/event-stream content. But when I removed all but the 503 sub-test, the flakiness disappeared. I’d have to try harder than that.

It was starting to look like there was some interaction between separate request/response pairs. I wondered if concurrent requests were somehow getting munged. Fabled engineers from long ago taught us about the dangers of “crossing the streams,” and it seemed possible that coordination problems might explain this behavior. Maybe the test was in error. I scanned the source for telltale concurrency mistakes (e.g. binding redeclaration, functions closing over loop variables, etc.), but no dice. Maybe Chrome was to blame, after all. Was it failing to multiplex the requests properly? This seemed possible, but I was not looking forward to reviewing the Chrome source code to find out.

Around this time, I realized that the network log was being updated even after the test failed. Here’s what it looked like 10 seconds after the failure:

Screenshot of network tab

The request for the 503 response was being made a second time. To explain this, I needed to brush up on the EventSource specification a bit. From that document:

Network errors that prevents the connection from being established in the first place (e.g. DNS errors), should cause the user agent to reestablish the connection in parallel, unless the user agent knows that to be futile, in which case the user agent may fail the connection.

Scrutinizing the initial request for the 503 response, I found that (unlike the other error responses) it was a “failed” request. Drilling down into that log entry further demonstrated this: “net::ERR_INVALID_HTTP_RESPONSE”. I felt a chill–a break in the case! I could now ask more specific questions: is the response actually invalid (e.g. is there a bug in the WPT Server?), or is Chrome doing something fishy when parsing responses? Things were getting serious, so I shucked my trench coat, loosened my tie, and opened Wireshark.

Wireshark is an excellent open source tool for conducting low-level traffic analysis. Comparing the network traffic between passing and failing trials, I noticed a correlation between the HTTP 503 failure and a seemingly-unrelated request: the “HTTP 204” request. The way the data was split across TCP packets seemed to have some bearing on the results. Whenever the 503 response was transmitted with the body of the 204 response, the test failed. That’s when it hit me: in HTTP lingo, the status code 204 means something very specific: “No Content.” The response claimed to have no body, but it went ahead and wrote some data there anyway. In other words, the test was violating the HTTP protocol. Feeling dizzy, I quickly dialed in a patch for the Python script to omit the body for 204 responses. Sure enough, Chrome began passing consistently.

The end?

With the flakiness resolved and the guilt placed on the test, I thought I’d call it a night. I headed to the diner, ordered the blue plate special, and let my mind wander.

Almost immediately, I was back to thinking about that test. I couldn’t shake one important detail: in all my trials, Firefox never failed. My solution may have resolved the flakiness in Chrome, but it brought me no closer to understanding why the browsers behaved differently.

The web is a wild place, and it’s full of bad actors who violate all sorts of rules. If a server can send an HTTP 204 response with a body, then you can bet one will. Shouldn’t all browsers handle that rowdiness in the same way?

The next step, as they teach in the Academy, is to re-create the crime scene. It was a quiet night, so I set my laptop down right there on the counter and got to it.

I’d need a web page that would issue two requests and clearly describe the browser’s behavior when the server sent a 204 response with a body followed by a regular old 200 response. “On the wire,” that would look something like this:

HTTP/1.1 204 No Content
x

HTTP/1.1 200 OK
Content-Length: 0

I could then run any browser I wanted through the simulation and build up a picture of the going trend.

…but of course, things weren’t so simple. It looked like the number of bytes in the invalid body actually mattered, so that needed to be controlled. For instance, the response stream for the “three invalid bytes” case would look like this:

HTTP/1.1 204 No Content
xxx

HTTP/1.1 200 OK
Content-Length: 0
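Here is a rough sketch of how a response stream like that could be assembled for any number of invalid bytes. The function name, the choice of "x" as filler, and the CRLF framing are assumptions made for illustration; this is not the script used in the investigation, and it leaves out the socket handling a real server would need.

```python
def build_response_stream(invalid_bytes):
    """Return a 204 response, `invalid_bytes` bytes of junk, then a 200."""
    if invalid_bytes < 0:
        raise ValueError("invalid_bytes must be non-negative")
    # A 204 response is headers only; the blank line ends the message.
    no_content = b"HTTP/1.1 204 No Content\r\n\r\n"
    # Filler that a well-behaved server would never send after a 204.
    junk = b"x" * invalid_bytes
    # The ordinary response that the browser is expected to find next.
    ok = b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"
    return no_content + junk + ok


if __name__ == "__main__":
    for n in range(7):
        print(n, build_response_stream(n))
```

Keeping the junk length as the only parameter makes it easy to step through the cases one byte at a time.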

I ran these trials with Chrome, Firefox, Edge, and Safari. Regardless of the number of invalid bytes, they all reacted to the first response in the same way: they simply reported a valid 204 response. The differences came in how they interpreted the response to the second request:

Browser	0	1	2	3	4	5	6
Chrome 59	OK	OK	OK	OK	OK	X	X
Firefox 55	OK	OK	OK	OK	OK	OK	OK
Edge 40	OK	OK	OK	OK	OK	X	X
Safari 10.2	OK	–	–	–	–	–	–

The “0” column describes the “control” condition, where no invalid bytes were transmitted. Things started to get interesting as soon as just one byte of invalid data was transmitted between the responses. Chrome, Firefox, and Edge all discarded it and looked to the next byte. Safari discarded all bytes that followed. Interestingly, after 5 or more invalid bytes, Chrome and Edge gave up and reported an error. Firefox continued to look for a valid response.

Here’s how the conversation played out in more human terms:

Chrome: Can I have a 204 response and a 200 response?

Server: Sure! Here’s your 204 response…

Chrome: Thanks.

Server: …and here’s a sack of garbage with a 200 response inside it.

Chrome: That’s gross. I’m taking my business elsewhere.

Put in the same situation, Edge behaved similarly. Firefox took the “beggars can’t be choosers” position–it dug through the garbage, wiped the crud off the response, and went on from there. For its part, Safari just stared blankly at the server, refusing to acknowledge the offered parcel, and continued to wait for something that matched its expectations.

Pulling up the relevant source code from the Chromium project, I could tell that browser’s behavior was definitely intentional. It would tolerate up to four bytes of so-called “slop” to handle those bad actors I mentioned earlier. As a closed source project, Edge’s intentions weren’t so easily verified. I could only surmise that its behavior was intended to mimic Chromium’s. For Firefox, I decided to dig a little deeper–how much garbage would it politely accept? I wrote up a script to run the trial over and over again, increasing the number of invalid bytes with each run.
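To make the “slop” idea concrete, here is a small sketch of a parser that skips a few stray bytes while hunting for the start of a status line. It is written in Python for illustration and is not Chromium’s code. The function name and the case-insensitive match on "http" are assumptions; only the four-byte limit comes from the behavior described above.

```python
def locate_status_line(buf, slop=4):
    """Return the status line's offset if at most `slop` bytes precede it, else -1."""
    marker = b"http"
    # Try each offset from 0 through `slop`, never reading past the buffer.
    last_offset = min(len(buf) - len(marker), slop)
    for offset in range(last_offset + 1):
        if buf[offset:offset + len(marker)].lower() == marker:
            return offset
    return -1


if __name__ == "__main__":
    for junk in range(7):
        data = b"x" * junk + b"HTTP/1.1 200 OK\r\n"
        print(junk, locate_status_line(data))
```

With four filler bytes the function finds the status line at offset 4, and with five it gives up and returns -1, which lines up with the Chrome row in the table.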

What I found made my blood run colder than my long-neglected Salisbury steak.

Firefox would accept over a kilobyte of invalid data following a 204 response. Even more confounding: the exact amount that caused it to finally give up was variable. The Firefox source code proved this behavior was intentional:

// Normally we insist on seeing HTTP/1.x in the first few bytes,
// but if we are on a persistent connection and the previous transaction
// was not supposed to have any content then we need to be prepared
// to skip over a response body that the server may have sent even
// though it wasn't allowed.

But I wasn’t satisfied, so I got the maintainers on the horn. They had this to say:

yes, its because servers often send bodies on responses defined not to have them (204/304). so we do it on purpose.

So I’d found the conflict! My next steps depended on one simple question: who was right? For this, I had to go straight to the top brass. That’s right, the Internet Engineering Task Force. The IETF maintains a set of so-called “RFC”s–documents that define many fundamental Internet technologies, HTTP among them. Mark Nottingham pointed me to RFC 7230 section 3.3.3, which essentially says, “browsers can do what they want in this case.” This will probably come as no surprise at this point, but that ambiguity rubbed me the wrong way. I took a deep breath, steeled my nerves, and requested that the protocol be made more explicit.

The higher-ups were not inspired by my petition. They explained how the problem was resolved in version 2 of the HTTP protocol. While I could see that was true, it didn’t change the fact that version 1 was still in widespread use and therefore still capable of causing interoperability issues.

They weren’t done, though. They went on to politely point out that my case was built around a truly rarefied condition, one which was better addressed with the simple advice, “don’t do that.” For all my self-righteous talk about “platform compatibility,” the pragmatist in me had a hard time arguing with that.

They can’t all be winners

I closed my laptop, tipped my waitress, and walked out of the diner feeling pretty deflated. What do I have to show for all that legwork? A bizarre little test script, a tiny paragraph on the Mozilla Developer Network, and an enhanced reputation as an industry pedant. Meanwhile, back in the Web Platform Tests, who knows how many flaky tests are still at large?

But then again, maybe I’m letting my day job cloud my perspective here. I’ve been trained to sniff out any procedure that produces different results with repetition. This certainly helps find flaky tests, but it’s in direct opposition to the way things work in the real world. As my pal Darius Kazemi has pointed out, the most successful folks are the ones who refuse to give up after failure. For every fruitful inquiry into a browser bug, there are many hours spent reading specifications and writing uncontroversial tests. Thoroughness and persistence are the name of the game. If that involves a few dead ends here and there, I’ll take it. Interoperability on the web is well worth that price.
