How do you visualize the “Health of the Internet”? This was the challenge posed to the Data Vis team at Bocoup by our client Measurement Lab, a nonprofit that has collected millions of Internet speed tests every month from around the world since 2009. This data is invaluable to policy makers, researchers, and the general public for understanding how Internet speeds are changing over time as well as for highlighting and understanding the impact of service disruptions. However, with petabytes of individual speed test reports as a data source, it can be difficult to make a visualization tool that is engaging and useful for such a broad audience.
Our collaboration resulted in the creation of the Measurement Lab Visualization Tool, which provides comprehensive access to this dataset for everyone, through interactive exploratory visualizations and a RESTful data API. In this post, we’d like to highlight some of our design process and lessons learned while building this tool. You can also read about our work on the Measurement Lab blog.
Designing With the User in Mind
We knew that our partners at M-Lab wanted to build a tool that provided for powerful exploration, but with a dataset this large, we had to plan in advance for what those explorations would be. To do so, we turned to our user-centered design process to get a better understanding of our many types of audiences. For design in data visualization to be successful, it should be focused on how best to shape the tool to address the needs of the users of that tool. We contacted potential users from both policy and research domains and performed a series of interviews with them. These interviews started with a few opening questions about how the interviewee uses, or could use, the existing data and what types of aggregations and groupings would be most insightful. The point of the process was to get them talking about their day-to-day needs and to listen for places where we could facilitate understanding or improve their process. We also listened for patterns and questions that were repeated by multiple individuals.
The result was an ordered list of requirements for the tool — the features and functions that would be needed for it to be useful to our users. We refined and prioritized this list with our stakeholders and it served as a reference point for building and iterating on sketches and mockups of the tool. The interviews provided a path through the infinite design space toward solutions that packaged the data into useful, high impact visualizations.
A Tool for All the Users
We arrived at a design that focuses on two ways for users to interact with the data: a zoomed-in view of ISP (Internet Service Provider) speeds for a particular location and a comparison view between locations and ISPs. We wanted casual users to be able to quickly dive into the Internet connection quality in their city while also allowing power users to slice and dice the data in all pertinent ways, exploring more complex questions such as the impact of business relationships between providers, the impact of political events on connectivity in various locations, and potential throttling impacting consumers. In both views, we focused on aggregating individual speed tests by ISP to provide a viewpoint in the data that was granular enough to be relatable, but could still show meaningful patterns.
With a focus on familiarity and functionality, several common chart forms, such as line charts and histograms, populate the tool. This enables users to leverage their previous experience and immediately begin drawing insights from the data. After all, this tool is for spending time browsing the data instead of learning how to read novel visualizations. As a result, we were able to spend additional time and effort tuning these simple visualizations to function beautifully while conveying the data our users care about.
As Measurement Lab’s international presence is still growing, certain locations may not have enough data to accurately represent a given ISP over a particular stretch of time. Core Data Vis team member Peter Beshai noted that the default method of simply connecting the dots is misleading, because it draws a confident-looking line through stretches where there is little or no data behind it. For this project, he developed an amazing d3 plugin, d3-line-chunked, which allows different styles to be applied to line segments in a line chart to explicitly indicate areas of low data. Along with this, he implemented another very thoughtful tool, d3-interpolate-path, which can correctly animate line charts in the presence of missing data. Go check out both of these tools now!
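The idea of treating low-data stretches differently can be sketched in a few lines. This is a minimal illustration in Python of the general technique (splitting an ordered series into runs that do or do not have enough data), not the API or implementation of d3-line-chunked. The record layout, the `MIN_TESTS` threshold, and the function name are all assumptions made for the example.

```python
from itertools import groupby


def chunk_by_defined(points, is_defined):
    """Split an ordered series into consecutive runs of defined/undefined points."""
    chunks = []
    for defined, run in groupby(points, key=lambda p: bool(is_defined(p))):
        chunks.append({"defined": defined, "points": list(run)})
    return chunks


# Hypothetical daily records: (day, median_mbps, test_count)
series = [
    (1, 12.0, 140),
    (2, 12.4, 155),
    (3, 9.1, 3),
    (4, 8.7, 2),
    (5, 12.9, 160),
]
MIN_TESTS = 30  # assumed cutoff below which a day counts as "low data"

chunks = chunk_by_defined(series, lambda p: p[2] >= MIN_TESTS)
for chunk in chunks:
    style = "solid" if chunk["defined"] else "dashed"
    print(style, [day for day, _, _ in chunk["points"]])
# solid [1, 2]
# dashed [3, 4]
# solid [5]
```

A chart layer could then draw each run with its own stroke style, so the reader can see at a glance where the line is backed by plenty of tests and where it is not.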
Another spot where we spent some love and care is in the use of color. Color is used to distinguish different ISPs from one another and is kept consistent throughout the site. With so many different ISPs, there are many colors that are repeated, but it was important to us that the specific color associated with an ISP stays consistent even across users and sessions. This makes sharing and exporting the visualizations, another critical requirement of the tool, much less confusing. You can see some experimentation we did on this with our consistent colors tool.
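One common way to keep a color stable across users and sessions is to derive it from the name itself rather than from the order in which items happen to load. The sketch below illustrates that general approach in Python; it is not taken from our consistent colors tool, and the palette, the name normalization, and the `color_for` function are assumptions for the example.

```python
import hashlib

# Assumed six-color palette; a real tool would likely use a larger one.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b"]


def color_for(isp_name, palette=PALETTE):
    """Map an ISP name to a palette color that is the same in every session."""
    # Normalize so trivial differences in spacing or case do not change the color.
    normalized = isp_name.strip().lower().encode("utf-8")
    # Use a cryptographic digest instead of Python's built-in hash(), which is
    # salted per process for strings and so would differ between runs.
    digest = hashlib.sha256(normalized).digest()
    index = int.from_bytes(digest[:4], "big") % len(palette)
    return palette[index]


# The same name always lands on the same color, however it is typed.
print(color_for("Example ISP") == color_for("  example isp "))  # True
```

With more ISPs than palette entries, some colors necessarily repeat, but any given ISP keeps its color in every chart, export, and shared link.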
Piping in Big Data
While we were developing our elegant and functional front-end, we also needed to get the data to power it. Visualizing petabytes of information on the Open Web is no simple matter!
In an amazing display of openness and transparency, all of Measurement Lab’s data is open and accessible via a Google BigQuery table, which can be queried by anyone for free. This is already an amazing and indispensable resource for researchers, but while BigQuery allows for extensive analysis of the data, it isn’t built for real-time access to aggregated results. To power our visualization tool, we built our own data processing and aggregation pipeline, which populates a near real-time data store that is then accessed by the client via an API service.
Some details on each piece of this Big Data puzzle are in order. The data processing pipeline is written in Java and uses Google’s Cloud Dataflow service to transform the raw data into a series of aggregations. The Dataflow SDK is based on the Apache Beam open source project, which provides an advanced unified programming model, allowing users to implement batch and streaming data processing jobs that can run on any supported execution engine (such as Google’s Dataflow). Dataflow is a completely managed service that allows you to pull in data from any number of sources and apply transformations on that data. We took data from the raw BigQuery table and processed it, with the results ultimately populating a series of BigTable tables.
During the design process, an important decision we had to make was which aggregation method to use to combine raw individual speed tests into metrics representing speeds at the ISP and location level. While averaging speeds would be the easiest method to use, it would be heavily skewed by severe outliers. After some initial data exploration, we concluded that averaging would not be appropriate for our particular dataset. We instead use median speeds as the primary metric. This provides a much more accurate representation of the underlying raw data, but also poses a unique problem: unlike averages, which can be recombined exactly as long as the number of tests behind each one is known, the median of a combination of medians is generally not the median of the raw data. In other words, medians cannot be sliced and recombined in the same way that averages can.
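A small worked example makes the difference concrete. The numbers below are made up for illustration and are not M-Lab data. Per-day sums and counts are enough to rebuild the overall average, but per-day medians are not enough to rebuild the overall median.

```python
from statistics import mean, median

# Made-up speed tests (Mbps) for two days; day one contains a severe outlier.
day_one = [1, 2, 100]
day_two = [3, 4, 5, 6, 7]
all_tests = day_one + day_two

# Averages recombine exactly if we keep each day's sum and count.
recombined_mean = (sum(day_one) + sum(day_two)) / (len(day_one) + len(day_two))
print(recombined_mean == mean(all_tests))  # True (both are 16)

# Medians do not: the median of the daily medians differs from the true median.
median_of_medians = median([median(day_one), median(day_two)])
true_median = median(all_tests)
print(median_of_medians)  # 3.5
print(true_median)        # 4.5
```

The same example also shows why the average is a poor summary here: the single 100 Mbps outlier pulls the mean up to 16, while the median stays near the typical test.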
To deal with this, our pipeline generates median values at each level of time-based aggregation (day, month, and year) for all locations and ISPs, computing each one directly from the raw tests. These different medians are piped into separate BigTable tables that are queried by the API when needed. This solution requires more pre-computation upfront, but provides a more robust and accurate representation of the underlying data, which we think is worth the effort.
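The shape of that pre-computation can be sketched as follows. This is a simplified, in-memory illustration in Python, not the Java Dataflow pipeline itself. The record fields (`location`, `isp`, `date`, `mbps`), the key format, and the function names are assumptions for the example. The important point is that every granularity is computed from the raw tests, never from another granularity's medians.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def bucket_key(day, granularity):
    """Return the time bucket a date falls into for the given granularity."""
    if granularity == "day":
        return day.isoformat()
    if granularity == "month":
        return f"{day.year:04d}-{day.month:02d}"
    if granularity == "year":
        return f"{day.year:04d}"
    raise ValueError(f"unknown granularity: {granularity}")


def precompute_medians(tests, granularities=("day", "month", "year")):
    """Build one table of medians per granularity, each from the raw tests."""
    groups = {g: defaultdict(list) for g in granularities}
    for test in tests:
        for g in granularities:
            key = (test["location"], test["isp"], bucket_key(test["date"], g))
            groups[g][key].append(test["mbps"])
    return {
        g: {key: median(speeds) for key, speeds in buckets.items()}
        for g, buckets in groups.items()
    }


# Hypothetical raw tests for one location and ISP.
raw = [
    {"location": "seattle", "isp": "isp-a", "date": date(2016, 1, 1), "mbps": 10.0},
    {"location": "seattle", "isp": "isp-a", "date": date(2016, 1, 1), "mbps": 20.0},
    {"location": "seattle", "isp": "isp-a", "date": date(2016, 1, 2), "mbps": 30.0},
    {"location": "seattle", "isp": "isp-a", "date": date(2016, 2, 1), "mbps": 40.0},
    {"location": "seattle", "isp": "isp-a", "date": date(2016, 2, 1), "mbps": 50.0},
]

tables = precompute_medians(raw)
print(tables["month"][("seattle", "isp-a", "2016-01")])  # 20.0
print(tables["year"][("seattle", "isp-a", "2016")])      # 30.0
```

Each entry in `tables` plays the role of one pre-aggregated table, so a request for monthly data reads the monthly table directly instead of trying to combine daily medians.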
The last piece of the puzzle is our Python-based API, which serves as the glue between the aggregated data and the interactive front-end. We chose to use Python and Flask to power this portion of the tool, as Google has built stable libraries for communicating with BigTable via Python. The API provides a RESTful interface to the terabytes of aggregated data and allows access to all this data in a fast and scalable manner. While this API serves our front-end, it is also free and open to other users who may want access to this aggregated data.
With the data wrangling, processing, storing, and visualization, this project represents a great “full-stack” big-data visualization application. Thanks to Measurement Lab’s and Bocoup’s shared belief in moving the Open Web forward, we were able to develop this entire project in the open, so everyone can learn from, and keep learning from, this amazing data resource. You can see all of our code in these open GitHub repositories: front-end application, data processing pipeline, and REST API server.
And the end result really has something for everyone. Go check out the Measurement Lab Visualization Tool right now!