Bocoup has been a long-time contributor to the Web Platform Tests (WPT) project, helping spec writers draft testable specs and helping browser implementers test features for correctness and interoperability based on those specs. In 2018, we’ve made great strides improving the coverage of WPT, the ergonomics of writing and running tests, and the infrastructure necessary to run WPT at scale. In this post, I’ll outline how we’ve collected daily conformance results, and then use those results to provide a historical analysis of web platform interoperability trends in 2018.
The Backstory
One of the most exciting highlights from the past year has been the deployment of the results-collection project, which runs over a million conformance tests against four browser engines every day and displays the results on the wpt.fyi dashboard. The dashboard has proven to be a useful way to scan a feature or directory of tests and determine at a glance the degree to which an API is well-tested and interoperable.
However, the wpt.fyi dashboard only displays a snapshot of a particular point in time (usually represented by a SHA-aligned run of tests across four browsers: Chrome, Edge, Firefox, and Safari). To find historical trends across the more than one million tests and a year and a half of recorded results, we decided to perform the computation offline instead. This allowed us to query all of the data collected since September 2017. Our goal was to understand how the test suite has evolved over time and how our investment in particular features has impacted the interoperability of the web platform as a whole. Through this process we developed a set of analyses that summarize the past year’s work and paint an optimistic, nuanced portrait of the future of the web platform. Here’s to further exploration of the history of the project and continued improvements in web interoperability!
Methodology
WPT test results are stored in records called reports. Each test run for a given browser generates a JSON summary of the test results for each test file (OK, PASS, FAIL, or ERROR) and each test (PASS, FAIL, or TIMEOUT). Different web platform features use the file/test hierarchy to mean different things. Some features contain just one test per file while other features group thousands of tests under a single file.
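For concreteness, here is a trimmed sketch of the shape of one entry in a report, written as a Python literal; the field names follow the WPT report format, but the test path and statuses are purely illustrative:

```python
# One entry from a report's "results" array: a test file with its own
# harness status plus the statuses of the individual tests it contains.
report_entry = {
    "test": "/dom/nodes/Node-cloneNode.html",  # illustrative test file
    "status": "OK",                            # status of the file itself
    "subtests": [                              # the individual tests
        {"name": "cloneNode with deep=true", "status": "PASS"},
        {"name": "cloneNode on a template element", "status": "FAIL"},
    ],
}
```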
I downloaded all of these reports from Google Cloud Storage by querying the wpt.fyi /runs API, which returns the full report URL for every run recorded on the dashboard. Once the data were downloaded, I deduplicated the runs, extracted each test into its own object with a simplified schema, and concatenated all of the tests into ndjson files in preparation for ingesting the data into BigQuery.
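A minimal sketch of that pipeline, assuming the requests library and using field names (raw_results_url, id) that mirror the wpt.fyi API but should be read as illustrative:

```python
import json

import requests  # assumed third-party dependency

RUNS_API = "https://wpt.fyi/api/runs"  # the /runs API mentioned above

# Fetch run metadata; the "max-count" parameter and result fields are
# assumptions based on the wpt.fyi API, kept small for illustration.
runs = requests.get(RUNS_API, params={"max-count": 500}).json()

seen_reports = set()
with open("tests.ndjson", "w") as out:
    for run in runs:
        url = run.get("raw_results_url")
        if not url or url in seen_reports:  # keep each run exactly once
            continue
        seen_reports.add(url)
        report = requests.get(url).json()
        for result in report.get("results", []):
            # One simplified object per test file, one JSON document
            # per line: the ndjson format BigQuery can ingest directly.
            out.write(json.dumps({
                "run_id": run.get("id"),
                "file": result["test"],
                "status": result["status"],
                "subtests": result.get("subtests", []),
            }) + "\n")
```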
In BigQuery, I organized the data into two append-only tables: a tests table, where each row represented the results of a test file for a particular test run (with individual tests embedded as nested columns), and a runs table, which contained metadata for each test run, such as the date of the run, the specific browser version, and the commit hash. By filtering on different columns, it was possible to narrow the test result data to a particular run, a particular browser, or a particular feature or test directory.
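As an illustration of that layout, a query along these lines (using the google-cloud-bigquery client; the dataset, table, and column names are hypothetical stand-ins for the schema described above) narrows the nested results to one browser and one directory:

```python
from google.cloud import bigquery  # assumed client library

client = bigquery.Client()

# `tests` holds one row per test file per run, with the individual tests
# in a nested `subtests` column; `runs` holds per-run metadata. All
# names here are illustrative, not the exact production schema.
query = """
    SELECT r.run_date, t.file, sub.status, COUNT(*) AS n
    FROM `wpt.tests` AS t
    JOIN `wpt.runs` AS r ON r.run_id = t.run_id
    CROSS JOIN UNNEST(t.subtests) AS sub
    WHERE r.browser = 'firefox'
      AND t.file LIKE '/service-workers/%'
    GROUP BY r.run_date, t.file, sub.status
"""
for row in client.query(query):
    print(row.run_date, row.file, row.status, row.n)
```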
All told, the tests table contained almost 200 million rows, each representing a test file (with its individual tests nested inside), across about 7,000 complete runs. Of those runs, about 600 were useful for comparison because the same version of the test suite had been run against a stable version of all four browsers: Chrome, Firefox, Edge, and Safari.
Using those groups of four-browser runs, I calculated for each group the number of tests that passed in all four browsers (normalized to a range between 0 and 1 for each test file, to account for files with an inordinately high number of tests) and compared that to the number of tests that passed in three, two, or just one browser. Displayed together on the same chart, these data represent the degree of interoperability among browsers for a given web feature or, in aggregate, for the whole platform.
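In code, the per-file normalization could look roughly like this sketch; the function and its input shape are hypothetical, since the real analysis ran as queries over the BigQuery tables:

```python
from collections import Counter

def interop_fractions(results_by_browser):
    """For one test file in one SHA-aligned group of four runs, map n
    (0..4) to the fraction of the file's tests passing in exactly n
    browsers. `results_by_browser` is {browser: {test_name: status}}.
    """
    names = set()
    for statuses in results_by_browser.values():
        names.update(statuses)
    counts = Counter(
        sum(1 for statuses in results_by_browser.values()
            if statuses.get(name) == "PASS")
        for name in names
    )
    total = len(names) or 1
    # Normalizing by the file's test count keeps a file with thousands
    # of tests from drowning out a file with a single test.
    return {n: c / total for n, c in counts.items()}
```

Summing these per-file fractions across every file in an aligned group then yields one data point per browser count for that group’s date.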
The Analysis
This graph shows the sum of the number of tests passing in n of 4 browsers (Chrome, Edge, Safari, and Firefox), normalized per file (y-axis), graphed by the date of each test run (x-axis). The data represent 153 aligned groups of test runs collected since September 2017.
Although different features have varying degrees of test coverage, the overall graph tells an encouraging story about the interoperability of the web platform at large.
Since March 17, 2018, more tests have passed in all four browsers than in any other combination of browsers. At present, 47% of all tests pass in all four browsers, and the total number of passing tests has increased by 30% since January 1, 2018.
The biggest jump in interoperable features comes in late March, when EdgeHTML 16 entered our results collection and thousands of tests that previously had no Edge results began to pass. Between mid-October and November, interoperability increased significantly again as a group of HTML DOM, parsing, and semantics tests passed for the first time in Edge. Because every new implementation of a web platform spec provides valuable feedback and adds to the diverse set of implementer perspectives, these Edge improvements benefited the entire ecosystem. For WPT tests, a rising tide lifts all boats.
It’s also interesting to see where the chart doesn’t report changes, particularly in relation to infrastructural modifications under the hood. On August 9, Mike Pennisi was finally able to merge an async test cleanup PR over eight months in the making. Fortunately, the chart shows little motion around that date: the PR successfully changed the way thousands of tests ran, without introducing a noticeable increase in regressions.
Both the HTML and CSS charts show a healthy level of interoperability for two of the most important WPT directories. It’s worth mentioning, for HTML and CSS in particular given the scope of their respective API surfaces, that the charts above do not take code coverage into account. Although we can use the number of tests as an indicator of coverage, we can’t make any absolute claims about interoperability without cross-referencing the tests against the specs themselves. We have some ideas at Bocoup for measuring browser engine code coverage from WPT runs, and our early work in the Gecko codebase with Mozilla has shown promising results. However, significant further investment is needed to produce a platform-wide analysis.
It’s also interesting to note that the HTML tests appear to be among the noisiest, with scores varying from run to run throughout much of the second half of 2018.
The graph of service worker interoperability is particularly exciting. Safari 11.1 enabled service workers by default; we began collecting test runs from that release in May 2018, and the chart shows a corresponding jump in the number of test cases that pass in three browsers. For what it’s worth, a technical constraint around SSL support in our Windows CI environment meant that we could not properly test Edge’s service worker support until December 2018 (even though Edge has supported service workers in some capacity since 2017). Nevertheless, the data from December accurately represent the large number of service worker tests that finally pass in all four browsers.
Many of these service worker tests are present in WPT due to Bocoup’s earlier work migrating Chromium’s service worker tests to the WPT infrastructure. In this case, we can be confident that the service worker test suite is comprehensive and the above results speak to concrete predictability for web developers.
The Shadow DOM and Custom Elements charts are similarly exciting in that they depict interoperability among three browsers, starting at the beginning of the year for Custom Elements and in September for Shadow DOM with the release of Firefox 63. Although it is exciting to see these interoperability charts as a marker of Web Component maturity, it is worth noting that the graphs show progress for a relatively small number of tests overall. In the future, we hope to increase the test coverage of Shadow DOM and Custom Elements to better ensure that the entire API surfaces are interoperable among browsers.
In August 2018, Bocoup began working with browser implementers to improve the state of fieldset interoperability. In this case, rather than the number of tests that pass, it is the existence of the fieldset tests at all that is exciting! The first data points appear in September 2018, when we added new tests covering rendering issues and legend display. That new coverage helped browsers, including Firefox, begin to fix these longstanding issues. Although there is a long way to go before all four browsers have interoperable fieldset implementations, adding test coverage is the first step, and the trend points clearly toward improvement!
Further Analysis
The data at hand are rich and we have only just begun to scratch the surface. Charting historical interoperability is not only a way to understand how contributions have affected browsers over the past year, but also a way to make informed predictions about the evolution of the web platform going forward. What’s more, both web developers and browser implementers can benefit from a careful analysis of the data.
For example, as a web developer, seeing the steady increase of service worker features supported in 3 or 4 browsers suggests that it might be time to start investing in learning the API if you have been holding off until widespread browser support. Or, if you are an early adopter with an interest in the newest CSS features, consulting the interoperability history for the relevant CSS directories gives a more nuanced understanding of how closely the spec has been implemented and allows you to extrapolate when you might be able to use the features in all four browsers.
Historical data help browser implementers and spec writers answer questions about prioritization and impact. With access to conformance results for other engines, a browser implementer can prioritize their work around the tests that fail in their engine but pass in the other three. Comparing an individual test directory over time to a broader directory, or to the entire test suite, helps show how work on one part of WPT impacts the overall trajectory of the web platform. Moreover, historical data shed light on organizational investments, including how and when different teams appear to be focusing on interoperability and correctness issues for different features (with the caveat that WPT bugs and CI issues can make some browser results misleading in the short term). From a broad enough vantage point, and with more sophisticated measures of test coverage, the historical interoperability data will help answer: how close is the web platform to achieving spec conformance across all browsers?
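As a minimal sketch of that prioritization rule, reusing the hypothetical per-file result shape from the methodology sketches above (the function name and default browser are illustrative):

```python
def implementer_todo(results_by_browser, mine="firefox"):
    """Tests failing in `mine` while passing in every other browser of
    an aligned run group: high-value targets for an implementer."""
    others = [b for b in results_by_browser if b != mine]
    return sorted(
        name
        for name, status in results_by_browser[mine].items()
        if status != "PASS"
        and all(results_by_browser[o].get(name) == "PASS" for o in others)
    )
```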
Conclusion
The elephant in the room is the recent announcement that Microsoft’s Edge browser will soon replace its EdgeHTML engine with Chromium’s. Although in the short term this move will reduce the number of interoperability targets to three (and, at least in theory, increase interoperability among browsers in the wild), it is a blow to the implementer diversity the web platform depends on to build plural consensus about how the web should work.
Fortunately, an exciting development for wpt.fyi and interoperability in the web platform ecosystem is on the horizon. Early efforts have succeeded in getting the Servo browser engine to run WPT to completion and to generate the reports needed for integration into the wpt.fyi dashboard. With work halting on EdgeHTML, Servo is now the newest in-progress browser engine. New implementations of the web platform breathe life into the standards that underpin browsers, and so we are excited to direct our interoperability analysis toward this new fourth implementation. It is our hope that Servo takes a permanent place in the WPT interoperability charts and becomes a cornerstone of a diverse, healthy web platform ecosystem.
The graphs above show the enormous scale of the web platform and the complexities that come with writing and maintaining a browser engine in the face of continually changing specs. As we continue to invest in Web Platform Tests as a way to quantify interoperability at scale, it’s important to remember that each browser engine plays an essential role in ensuring that web platform evolution is based on a plurality of voices, experiences, and visions for the future of the web.