This post is the first in a three-part series describing our investigations into scalability for a second screen application we built with PBS. You can read more about the project in the series introduction here.

Some Background

We built the Map Center second-screen application in Node.js with the help of a number of open-source libraries. The most pertinent to this discussion is Socket.io, an interface for realtime communication with a WebSocket-like API implemented across a broad range of browsers.

At a high level, the application supported a basic “broadcaster” style communication flow: a small number (one or two) broadcasters would push data to a large number (10,000) of clients. The clients would not reply to the broadcaster nor communicate with each other.

                                             +->  client
                                             +->  client
                                             +->  client
[ broadcaster ] --> [ Application server ] --+->  (+9,994 clients)
      (me)                                   +->  client
                                             +->  client
                                             +->  client

This protocol makes for simple business logic, and it also minimizes overhead that would be incurred from maintaining many distinct “channels”. No matter how simple the protocol is, 10,000 is still a very large number.
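
To make that flow concrete, here is a minimal sketch of the broadcast pattern using the 0.9-era Socket.io API (the version current at the time). The event name and port are illustrative, not the application's actual ones.

// Broadcaster-style flow: the server pushes, every connected client listens.
var sio = require("socket.io");
var io = sio.listen(8000);

io.sockets.on("connection", function(socket) {
  // Clients only listen; they never reply or talk to each other.
});

// Called whenever the broadcaster has new data to push out.
function broadcast(update) {
  io.sockets.emit("update", update);
}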

I made some simple back-of-the-envelope calculations to estimate things like raw data throughput. For instance:

(0.5 message/second) * (40 bytes/message) * (10,000 clients) = 200 kB/sec

These calculations were of limited use, however, because they greatly oversimplified the inner workings of the application. The Socket.io protocol provides for non-data transmissions (such as handshakes and heartbeats), and information frames are padded with extra characters. Things get even more complicated when trying to model CPU utilization.

Much as it might pain my high school calc teacher (sorry, Mr. Lussier), math alone wasn’t going to cut it. I’d need to find a way to simulate clients.

Note: Much of my experience was based on the hardware of the server running my application. Your mileage will most certainly vary according to the specs of your setup. My hope is that this post gives you some useful ideas about where to look when performance is an issue. That said, the code used to conduct this study is also open source; you can find it in the project’s main repository.

Building a Swarm

At this point in the project, I had already run some primitive client simulation. Simply opening multiple browsers was all I needed to confirm that a given feature was working as expected. This approach would not scale past three or four instances, and certainly wouldn’t be feasible on computers I controlled remotely.

I decided to make a “client simulator”, or a program that would allow a single computer to masquerade as a large number of viewers. I would only need to control a small number of these simulators (10 to 20) to model the stress my server would face on election night:

                                                  +--------------------+
                                                  | Client simulator   |+
                                             +->  | - simulated client ||+
                                             +->  | - simulated client |||+
                                             +->  |   (+597 clients)   ||||
[ broadcaster ] --> [ Application server ] --+->  | - simulated client ||||
      (me)                                   +->  +--------------------+|||
                                             +->   | (+19 simulators)   |||
                                             +->   +--------------------+||
                                             +->    +--------------------+|
                                             +->     +--------------------+

I’ll get into the proverbial nitty-gritty of building a stress testing setup in an upcoming post; for now, here’s the Cliff’s Notes version.

The end result of that setup was a simple method for ruthlessly testing the realtime application from the comfort of the Bocoup Loft:

$ bees up
$ bees exec - "forever startall client.js -c 500"
$ bees down

(cue maniacal laughter)

Early Success

The goal of the tests was to prove the application could handle 10,000 concurrent connections (both idling and broadcasting). I had two additional parameters for my test design:

  1. Test method efficacy – the tests should actually stress the system in a meaningful way
  2. Simulation validity – the tests should prove that there are no artificial results due to the low-powered client simulators themselves

To measure the performance experienced by a single simulator, I programmed the simulation to start a timer when its first client received a given message. As each remaining simulated client received that same message, the simulator would note its delay relative to that first arrival. It would then average those delays and report the result as a measure of performance. For example, this is how a single simulator would report the average delay when four “clients” received the same message in the broadcast (the first arrival sets the baseline and is not counted in the average):

                       +----------------------------+
                       | Client simulator           |
[ message ] ---------> | - simulated client (0ms)   |
  [ message ] -------> | - simulated client (249ms) |
   [ message ] ------> | - simulated client (348ms) |
       [ message ] --> | - simulated client (906ms) |
                       |                            | -> 501ms -> [ Test runner ]
                       +----------------------------+                  (me)

I would then average the value reported by all the client simulators. This would give a sense of the average experienced delay for each message.
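
A rough sketch of how a simulator might record and report these delays follows; the names are hypothetical, and the actual client code lives in the project’s repository.

// Track each message's first arrival, then report the average delay of the
// clients that received it afterwards.
var firstArrival = {};  // messageId -> timestamp of the first client's receipt
var delays = {};        // messageId -> delays (ms) of every subsequent client

function onMessage(messageId) {
  var now = Date.now();
  if (!(messageId in firstArrival)) {
    firstArrival[messageId] = now;  // the first client starts the timer
    delays[messageId] = [];
    return;
  }
  delays[messageId].push(now - firstArrival[messageId]);
}

function averageDelay(messageId) {
  var total = delays[messageId].reduce(function(sum, ms) {
    return sum + ms;
  }, 0);
  return total / delays[messageId].length;  // e.g. 501ms in the diagram above
}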

I ran this test with three client distributions:

  • 10,000 concurrent connections (10 instances simulating 1,000 clients each)
  • 10,000 concurrent connections (20 instances simulating 500 clients each)
  • 20,000 concurrent connections (20 instances simulating 1,000 clients each)

I also scripted three “rounds” of broadcasts. In each round, the same messages would be sent over a shorter interval in order to demonstrate how message frequency would impact performance.

[Chart: Distributed performance]

As you can see, the Node.js application handled 10,000 concurrent connections without any special optimizations. This felt like magic to me, but I suppose there’s a reason Node.js has become so well-known for scalability.

Predictably, performance tended to decrease with message frequency.

The test involving 20,000 concurrent connections (of which only 17,000 managed to connect) clearly demonstrates a failure state, which satisfied the first parameter of the test design. There is also no discernible difference in results between the two 10,000-client simulations. This suggests that there are no artifacts from bottlenecks in the client simulator (i.e. network, memory, CPU), satisfying the second parameter of the test design.

I wish I could tell you that this is the end of the story. (It isn’t.)

Overly Optimistic

While gathering the above data, I realized that all the simulated clients were using WebSockets as their transport mechanism. This amounted to an oversimplification:

A big chunk of NewsHour’s viewership would not be visiting the site with a browser capable of using WebSockets (most notably, any version of Internet Explorer prior to 10). Thanks to Socket.io, these viewers would seamlessly fall back to a simpler transport mechanism. From the server’s perspective, those fallback transports all look much the same: HTTP connections that are held open until the server has data to send, then closed and immediately re-opened (an approach commonly referred to as “long-polling”). This style of realtime communication is considerably more demanding than WebSockets, not least because it requires the server to constantly close and re-open TCP connections.
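
For reference, the fallback order is something the server advertises. Here is roughly what that configuration looks like with the 0.9-era API; the list shown is illustrative rather than our exact setup.

// Transports the server will accept, in order of preference; browsers that
// cannot use WebSockets fall back down this list.
var io = require("socket.io").listen(8000);

io.set("transports", [
  "websocket",
  "xhr-polling",    // the long-polling fallback discussed above
  "jsonp-polling"
]);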

My previous tests completely failed to account for this detail.

The first step to getting more realistic data was to update the simulation. I added a command line flag to control which transport the clients would use (WebSockets or long polling), and made a simple contribution to the Socket.io-client library to enable the transport in Node.
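
A minimal sketch of what that flag might look like in the simulator, assuming the 0.9-era socket.io-client options; the flag name and URL are hypothetical.

// Choose the simulated clients' transport from a command-line flag.
var io = require("socket.io-client");

var useWebSockets = process.argv.indexOf("--websocket") !== -1;

var socket = io.connect("http://localhost:8000", {
  "transports": [useWebSockets ? "websocket" : "xhr-polling"],
  "force new connection": true  // give each simulated client its own socket
});

socket.on("connect", function() {
  console.log("connected");
});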

The results from re-running the tests were the opposite of heartening… Disheartening is the word I’m looking for:

[Chart: Server resource utilization]

The abysmal performance across all tests is obvious. One odd aspect of the data is the staggered nature of the CPU utilization. I decided that I should account for that before digging into performance tweaks.

10,000 Hearts Beat as One

The drop in CPU utilization seemed to be occurring once every 20 seconds or so. That number seemed a little familiar from the documentation on configuring Socket.io: the default client heartbeat interval is 25 seconds. I re-inspected the logic I had written to initiate the connections, and sure enough:

// Open one connection per simulated client, back to back.
var idx, connection;
for (idx = 0; idx < clientCount; ++idx) {
  connection = new this.liveMap.Connection();
  connection.connect();
}

…every single client was trying to initiate a connection at more or less the same time. Besides being unrealistically demanding at the simulation start time, this behavior would lead to all clients having synchronized heartbeats.

This additional dimension of strain initially seemed desirable. The application we were building was expected to receive a high amount of exposure, and this motivated me to be as conservative as possible in my estimates.

I ultimately decided that this specific condition was unrealistic (would everyone really tune in to NewsHour at exactly the same time?) and that its somewhat unpredictable effect would make the results less meaningful, not more. So I dispersed the connections across the heartbeat interval:

var self = this; // keep a reference to the outer context for the timer callback
var idx = 0;
var intervalID;
var makeConnection = function() {
  (new self.liveMap.Connection()).connect();

  idx++;
  if (idx === clientCount) {
    clearInterval(intervalID);
  }
};
// Space the connections evenly across a single heartbeat interval
intervalID = setInterval(makeConnection, heartbeatInterval / clientCount);

As I should have expected, making this fix uncovered a new obstacle: long-polling clients now only connected in “bursts”. A group of clients would connect immediately, then the rest would hang for a brief period until another small group managed to “get through.” Stranger still, this problem was not exhibited by WebSocket connections.

Because I didn’t bother to graph this data (shame on me), it took some time before I recognized that the delay between connection “bursts” was 25 seconds. More digging revealed that it was specifically the “handshake” phase of the Socket.io protocol that was being interrupted. It wasn’t until the connected clients sent a heartbeat message (thus closing their long-held connections) that other clients could finish establishing their own.

It took a lot of trial-and-error, but I eventually found that 5 seemed to be a new “magic number”: it was how many long-held connections could be made before additional requests would be blocked. Searching my own source and Socket.io’s source (even skimming the source of Express and Connect, just in case) yielded no answers. At this point, the Node.js developers reading this are no doubt rolling their eyes. “Check the docs on the http.Agent’s maxSockets,” they’re saying. And they’re right. From the Node.js HTTP module documentation:

agent.maxSockets

By default set to 5. Determines how many concurrent sockets the agent can have open per host.

Luckily, Node exposes http.globalAgent, the agent used by default for all outgoing HTTP requests, so the limit can be changed in one place. I could have set it to some value that reflected the application’s needs. But I was frustrated:

require("http").globalAgent.maxSockets = Infinity;

All of the long-polling clients now connected successfully over the course of the heartbeat interval!

With that mystery solved, I returned to the task of optimization.

Why u --nouse-idle-notification

I’ll admit it: I was getting frustrated. I had invested a fair amount of time proving how the application might fail, but I had yet to spend any time actually optimizing the code. I hoped that there might be some low-hanging fruit that would make for an easy solution.

I came across some easy tweaks for high-performance networking in Node, including disabling Nagle’s algorithm on sockets and running Node with the --nouse-idle-notification command-line flag. (Caustik covers these settings and more in a great series on scaling Node.js. I highly recommend checking it out.)
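
For the curious, here is a minimal sketch of what those tweaks look like, assuming a plain Node http server; a Socket.io server wraps the same kind of TCP sockets.

// Disable Nagle's algorithm so small writes are flushed immediately rather
// than being buffered by the TCP stack.
var http = require("http");

var server = http.createServer(function(req, res) {
  res.end("ok");
});

server.on("connection", function(socket) {
  socket.setNoDelay(true);
});

server.listen(8000);

The --nouse-idle-notification flag is simply passed on the command line; it tells V8 to ignore the idle notifications that can otherwise trigger full garbage-collection runs:

$ node --nouse-idle-notification server.js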

There Ain’t No Such Thing as a Free Lunch, though: these tweaks did not magically solve my problems.

One Step Forward, Two Steps Back

After another review of the code, I was becoming convinced that the performance degradation was out of my control. It seemed more and more likely that managing long-polling connections was innately more expensive than managing WebSocket connections, and that my server was buckling under the pressure.

I decided to start optimizing.

The Node.js cluster module allows applications to fork multiple processes (generally one for each available processor core) and share incoming network connections between them. This seemed like a great way to scale the application horizontally.
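
A rough sketch of that pattern, assuming a plain http server for brevity:

// Fork one worker per CPU core; the workers share the listening port.
var cluster = require("cluster");
var http = require("http");
var os = require("os");

if (cluster.isMaster) {
  os.cpus().forEach(function() {
    cluster.fork();
  });
} else {
  http.createServer(function(req, res) {
    res.end("handled by worker " + process.pid);
  }).listen(8000);
}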

There’s a catch, though: distributing across cores means spawning multiple processes, each with its own memory space. If your application uses in-memory storage for global state, spawning across processes will require some re-structuring.

Socket.io keeps track of all connected clients in memory by default. Other storage solutions can be used; Socket.io even ships with a Redis-powered solution. This was perfect because the application already used Redis for scheduling information.
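
For reference, wiring up the Redis-backed store looks roughly like this with the 0.9-era API (option names may differ in other versions):

// Replace the default in-memory store with Socket.io's RedisStore so that
// multiple processes can share client state through Redis.
var sio = require("socket.io");
var redis = require("redis");

var io = sio.listen(8000);

io.set("store", new sio.RedisStore({
  redisPub: redis.createClient(),
  redisSub: redis.createClient(),
  redisClient: redis.createClient()
}));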

Integration was a snap (check out the wiki page on configuring Socket.io for details). Once that was up and running, everything worked exactly as planned and the stress testing was finally done.

Not buying it, huh? Well, the truth is that performance was worse than ever. So bad, in fact, that most clients couldn’t even connect:

[Chart: Connections with Redis]

This was by far the most surprising setback to date. Redis is often touted for use in realtime applications. Why should using it for persistence with Socket.io be so taxing?

There’s a hidden cost to moving from in-memory storage: data serialization*. In order for data to be shared between processes, it must be serialized to some standard format. In order to send data to the Redis server, the Redis module has to convert all data to a string. (Don’t believe me? Check out the implementation of pack and unpack in the source.)
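
In other words, every value handed to Redis makes a round trip along these lines (the object shown is made up, purely for illustration):

var handshake = { id: "abc123", transport: "xhr-polling", rooms: ["/"] };

var packed = JSON.stringify(handshake);  // before publishing to Redis
var restored = JSON.parse(packed);       // after reading it back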

Now the question became, “Can we do better than JSON.stringify and JSON.parse for data serialization?” As it turns out, there are a ton of interesting ways you can do better. Instead of rolling my own optimizations, it was suggested that I look into MessagePack. From the project’s website:

If you ever wished to use JSON for convenience (storing an image with metadata) but could not for technical reasons (encoding, size, speed…), MessagePack is a perfect replacement.

Even though MessagePack boasts some impressive speed improvements over JSON, it seemed as though 10,000 clients simply required more serialization than my server was able to handle. We decided to continue using Redis for our application-level logic but revert to in-memory storage for socket management.

Election Night and Beyond

Clearly, this research became something of a rabbit hole for me. Since the US Elections Committee wasn’t particularly concerned with my stress testing results, election night came and went before I solved the performance problems. I’m happy to report that our Node.js application performed just fine: the real traffic was a mix of transports that fell somewhere between the 10,000 all-WebSocket connections I had proven we could handle and the 10,000 all-long-polling connections I was still worried about, and that client distribution turned out to be manageable.

Had there been more time, I would have investigated using a tool like stud to handle all incoming connections and load balance between multiple Node.js processes. As long as sticky sessions** were enabled, this would preclude the need for shared state between the Node.js processes (and thus the need for large-scale data serialization).

Hopefully, this harrowing tale has given you some ideas about optimizing for scale in your own Node.js/Socket.io applications. There are no easy answers, but with the right mindset, the process can be fascinating! If you want in on the fun, check out the next article in this series to learn how you can set up a stress testing control center of your own. After that, I cover how to provision a virtual server to run the application in production.

* One of the best parts of working in open source is that so many maintainers are so willing to help. This means that when you have a question about your tools, git blame will point you in the right direction. This is how I learned about some of the issues with serialization, and I have to thank Daniel Shaw for his generosity in lending a hand.

** “Sticky sessions” refers to a proxy consistently forwarding requests from the same client to the same node in a cluster. This is necessary to prevent long-polling connections from bouncing between processes with each transmission.