This post is the first in a three-part series describing our investigations
into scalability for a second screen application we built with PBS. You can
read more about the project in the series introduction.

Some Background

We built the Map Center second-screen application in Node.js with the help of
a number of open-source libraries. The most pertinent to this discussion is
Socket.IO, an interface for realtime communication with a WebSocket-like API
implemented across a broad range of browsers.

At a high level, the application supported a basic “broadcaster” style
communication flow: a small number (one or two) broadcasters would push data to
a large number (10,000) of clients. The clients would not reply to the
broadcaster nor communicate with each other.

                                             +->  client
                                             +->  client
                                             +->  client
[ broadcaster ] --> [ Application server ] --+->  (+9,994 clients)
      (me)                                   +->  client
                                             +->  client
                                             +->  client

This protocol makes for simple business logic, and it also minimizes overhead
that would be incurred from maintaining many distinct “channels”. No matter how
simple the protocol is, 10,000 is still a very large number.
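
To make that flow concrete, here is a minimal sketch of the broadcaster
pattern using a 0.9-era Socket.IO API (the event names, port, and render
function are placeholders, not the application’s actual code):

// server.js: fan every message from the broadcaster out to all clients
var io = require("socket.io").listen(8080);

io.sockets.on("connection", function(socket) {
  socket.on("broadcast", function(data) {
    // only the broadcaster emits this event; everyone else just listens
    io.sockets.emit("update", data);
  });
});

// in the browser: receive updates, never reply
var socket = io.connect("http://example.com:8080");
socket.on("update", function(data) {
  render(data); // hypothetical rendering function
});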

I made some simple back-of-the-envelope calculations to estimate things like
raw data throughput. For instance:

(0.5 message/second) * (40 bytes/message) * (10,000 clients) = 200 kB/sec

These calculations were of limited use, however, because they greatly
oversimplified the inner workings of the application. The protocol
provides for non-data transmissions (such as handshakes and heartbeats), and
information frames are padded with extra characters. Things get even more
complicated when trying to model CPU utilization.

Much as it might pain my high school calc teacher (sorry, Mr. Lussier), math
alone wasn’t going to cut it. I’d need to find a way to simulate clients.

Note: Much of my experience was based on the hardware of the server running my
application. Your mileage will most certainly vary according to the specs of
your setup. My hope is that this post gives you some useful ideas about where
to look when performance is an issue. That said, the code used to conduct this
study is also open source; you can find it in the project’s main repository.

Building a Swarm

At this point in the project, I had already run some primitive client
simulation. Simply opening multiple browsers was all I needed to confirm that a
given feature was working as expected. This approach would not scale past three
or four instances, and certainly wouldn’t be feasible on computers I
controlled remotely.

I decided to make a “client simulator”, or a program that would allow a single
computer to masquerade as a large number of viewers. I would only need to
control a small number of these simulators (10 to 20) to model the stress my
server would face on election night:

                                                  | Client simulator   |+
                                             +->  | - simulated client ||+
                                             +->  | - simulated client |||+
                                             +->  |   (+497 clients)   ||||
[ broadcaster ] --> [ Application server ] --+->  | - simulated client ||||
      (me)                                   +->  +--------------------+|||
                                             +->   | (+19 simulators)   |||
                                             +->   +--------------------+||
                                             +->    +--------------------+|
                                             +->     +--------------------+
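
Stripped of project-specific wrapper code, a simulator boils down to a loop
that opens many socket.io-client connections from a single process. A sketch
(the URL and client count are placeholders):

var io = require("socket.io-client");
var clientCount = 500;
var sockets = [];

for (var idx = 0; idx < clientCount; idx++) {
  // "force new connection" keeps socket.io-client from reusing one socket
  sockets.push(io.connect("http://example.com:8080", {
    "force new connection": true
  }));
}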

I’ll get into the proverbial nitty-gritty of building a stress testing setup in
an upcoming post. For now, here’s the CliffsNotes version.

The result of this process was a simple method for ruthlessly testing the
realtime application from the comfort of the Bocoup office:

$ bees up
$ bees exec - "forever startall client.js -c 500"
$ bees down

(cue maniacal laughter)

Early Success

The goal of the tests was to prove the application could handle 10,000
concurrent connections (both idling and broadcasting). I had two additional
parameters for my test design:

  1. Test method efficacy – the tests should actually stress the system in a
    meaningful way
  2. Simulation validity – the tests should prove that there are no artificial
    results due to the low-powered client simulators themselves

To measure the performance experienced by a single simulator, I programmed the
simulation to start a timer when its first client received a given message.
As each subsequent simulated client received that same message, the simulator
would note the relative delay. It would then average those delays and report
the result as a measure of performance. For example, this is how a single
simulator would report the average delay when four “clients” received the same
message:
                       | Client simulator           |
[ message ] ---------> | - simulated client (0ms)   |
  [ message ] -------> | - simulated client (249ms) |
   [ message ] ------> | - simulated client (348ms) |
       [ message ] --> | - simulated client (906ms) |
                       |                            | -> 501ms -> [ Test runner ]
                       +----------------------------+                  (me)

I would then average the value reported by all the client simulators. This
would give a sense of the average experienced delay for each message.
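
The bookkeeping inside each simulator looked roughly like this (the names here
are illustrative, not the project’s actual code):

var firstArrival = {}; // message id -> timestamp of the first receipt
var delays = {};       // message id -> delays observed by subsequent clients

function onMessage(message) {
  var now = Date.now();
  if (firstArrival[message.id] === undefined) {
    // the first simulated client to see a message starts the clock
    firstArrival[message.id] = now;
    delays[message.id] = [];
  } else {
    delays[message.id].push(now - firstArrival[message.id]);
  }
}

function averageDelay(id) {
  var total = delays[id].reduce(function(sum, d) { return sum + d; }, 0);
  return total / delays[id].length; // e.g. (249 + 348 + 906) / 3 = 501
}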

I ran this test with three client distributions:

  • 10,000 concurrent connections (10 instances simulating 1,000 clients each)
  • 10,000 concurrent connections (20 instances simulating 500 clients each)
  • 20,000 concurrent connections (20 instances simulating 1,000 clients each)

I also scripted three “rounds” of broadcasts. In each round, the same messages
would be sent over a shorter interval in order to demonstrate how message
frequency would impact performance.
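
Conceptually, the broadcast script looked something like the following sketch
(the intervals, event name, and URL are arbitrary placeholders):

var io = require("socket.io-client");
var broadcaster = io.connect("http://example.com:8080");
var messages = [ /* the scripted map updates */ ];
var intervals = [2000, 1000, 500]; // ms between messages, one entry per round

intervals.forEach(function(interval, round) {
  messages.forEach(function(message, index) {
    // offset each round by a minute so the rounds do not overlap
    setTimeout(function() {
      broadcaster.emit("broadcast", message);
    }, round * 60 * 1000 + index * interval);
  });
});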

Distributed performance

As you can see, the Node.js application handled 10,000 concurrent connections
without any special optimizations. This felt like magic to me, but I suppose
there’s a reason Node.js has become so well-known for scalability.

Predictably, performance tended to decrease with message frequency.

The test involving 20,000 concurrent connections (of which only 17,000 managed
to connect) clearly demonstrates a failure state, which satisfied the first
parameter of the test design. There is also no discernible difference in
results between the two 10,000-client simulations. This suggests that there
are no artifacts from bottlenecks in the client simulator (i.e. network,
memory, CPU), satisfying the second parameter of the test design.

I wish I could tell you that this is the end of the story. (It isn’t.)

Overly Optimistic

While gathering the above data, I realized that all the simulated clients were
using WebSockets as their transport mechanism. This amounted to an
unrealistically optimistic assumption.

A big chunk of NewsHour’s viewership would not be visiting the site with a
browser capable of using WebSockets (most notably, any version of Internet
Explorer prior to 10). These viewers would seamlessly fall back to a simpler
transport mechanism thanks to Socket.IO. From the server’s perspective, these
fallback transports look very much the same: long-lived TCP connections (an
approach often referred to as “long-polling”). This approach to realtime
communication is obviously more demanding than WebSockets (for one, it
requires the server to constantly close and re-open TCP connections).
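
For reference, in the 0.9-era Socket.IO API the server advertises which
transports it will accept, and clients fall back down the list in order.
Something like the following (the exact list here is only an example):

var io = require("socket.io").listen(8080);

// try WebSockets first, then fall back to long-polling style transports
io.set("transports", ["websocket", "xhr-polling", "jsonp-polling"]);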

My previous tests completely failed to account for this detail.

The first step to getting more realistic data was to update the simulation. I
added a command line flag to control which transport the clients would use
(WebSockets or long-polling), and made a simple contribution to the Socket.IO
client library to enable that transport in Node.js.
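
On the simulator side, the flag simply restricts which transport
socket.io-client is allowed to use. Roughly (the flag parsing and names are
illustrative):

// e.g. `node client.js --transport xhr-polling`
var flagIndex = process.argv.indexOf("--transport");
var transport = flagIndex > -1 ? process.argv[flagIndex + 1] : "websocket";

var io = require("socket.io-client");
var socket = io.connect("http://example.com:8080", {
  transports: [transport],
  "force new connection": true
});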

The results from re-running the tests were the opposite of heartening…
Disheartening is the word I’m looking for:

Server resource utilization

The abysmal performance across all tests is obvious. One odd aspect of the data
is the staggered nature of the CPU utilization. I decided that I should account
for that before digging in on performance tweaks.

10,000 Hearts Beat as One

The drop in CPU utilization seemed to be occurring once every 20 seconds or
so. That number seemed familiar from the Socket.IO documentation on configuring
heartbeats: the default client heartbeat interval is 25 seconds. I re-inspected
the logic I had written to initiate the connections, and sure enough:

for (idx = 0; idx < clientCount; ++idx) {
  // every client attempts to connect immediately, all at once
  connection = new this.liveMap.Connection();
  connection.connect();
}

…every single client was trying to initiate a connection at more or less the
same time. Besides being unrealistically demanding at the simulation start
time, this behavior would lead to all clients having synchronized heartbeats.

This additional dimension of strain initially seemed desirable. The application
we were building was expected to receive a high amount of exposure, and this
motivated me to be as conservative as possible in my estimates.

I ultimately decided that this specific condition was unrealistic (would
everyone really tune in to News Hour at exactly the same time?) and would not
make my results any more meaningful. The effect would actually be somewhat
unpredictable, making the results less meaningful overall. So I dispersed the
connections across the heartbeat interval:

var idx = 0;
var intervalID;
var makeConnection = function() {
  (new this.liveMap.Connection()).connect();
  idx++;
  // stop scheduling connections once every simulated client has connected
  if (idx === clientCount) {
    clearInterval(intervalID);
  }
};
// spread the connection attempts evenly across one heartbeat interval
intervalID = setInterval(makeConnection, heartbeatInterval / clientCount);

As I should have expected, making this fix uncovered a new obstacle:
long-polling clients now only connected in “bursts”. A group of clients
would connect immediately; the others would hang for a brief period until
another small group managed to “get through.” Stranger still, this problem was
not exhibited by WebSocket connections.

Because I didn’t bother to graph this data (shame on me), it took some time
before I recognized that the delay between connection “bursts” was 25 seconds.
More digging revealed that it was specifically the “handshake” phase of the
Socket.IO connection that was being interrupted. It wasn’t until the connected
clients sent a heartbeat message (thus closing their long-held connection) that
other clients could finish establishing a connection of their own.

It took a lot of trial and error, but I eventually found that 5 seemed to be a
new “magic number”: this was how many long-held connections could be made
before additional requests would be blocked. Searching my source and
Socket.IO’s source (even skimming the source of Express and Connect, just in
case) yielded no answers. At this point, the Node.js developers reading this
are no doubt rolling their eyes. “Check the docs on http.Agent’s maxSockets,”
they’re saying. And they’re right; from the Node.js HTTP module documentation:


By default set to 5. Determines how many concurrent sockets the agent can
have open per host.

Luckily, Node exposes an http.globalAgent that can be used to modify all
agents at once. I could have set maxSockets to some value suited to the
application’s needs, but I was in no mood for half measures:
require("http").globalAgent.maxSockets = Infinity;

All of the long-polling clients now connected successfully over the course of
the heartbeat interval!

With that mystery solved, I returned to the task of optimization.

Why u --nouse-idle-notification

I’ll admit it: I was getting frustrated. I had invested a fair amount of time
proving how the application might fail, but I had yet to spend any time
actually optimizing the code. I hoped that there might be some low-hanging
fruit that would make for an easy solution.

I came across some easy tweaks for high-performance networking in Node,
including disabling Nagle’s algorithm on sockets and running Node with the
--nouse-idle-notification command-line flag. (Caustik covers these settings and
more in a great series of posts on scaling Node.js applications. I highly
recommend checking it out.)
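
Both tweaks are one-liners. Here is a rough sketch of what they look like in
isolation (a plain TCP server standing in for the real application):

$ node --nouse-idle-notification server.js

// server.js: disable Nagle's algorithm on each incoming socket
var net = require("net");

var server = net.createServer(function(socket) {
  socket.setNoDelay(true);
  // ...application logic...
});

server.listen(8080);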

There Ain’t No Such Thing as a Free Lunch, though–these tweaks did not
magically solve my problems.

One Step Forward, Two Steps Back

After another review of the code, I was becoming convinced that the performance
degradation was out of my control. It seemed more and more likely that managing
long-polling connections was innately more demanding than managing WebSocket
connections, and that my server was simply buckling under the pressure.

I decided to start optimizing.

The Node.js cluster module
allows applications to fork multiple processes (generally one for each
available processor core) and route network requests intelligently. This seemed
like a great way to scale the application horizontally.
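
The basic shape of a cluster-ed application looks like this (a bare HTTP
server stands in for the real Socket.IO server):

var cluster = require("cluster");
var http = require("http");
var numCPUs = require("os").cpus().length;

if (cluster.isMaster) {
  // spawn one worker per core; the master only supervises
  for (var i = 0; i < numCPUs; i += 1) {
    cluster.fork();
  }
} else {
  // each worker runs its own copy of the server on a shared port
  http.createServer(function(req, res) {
    res.end("handled by worker " + process.pid);
  }).listen(8080);
}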

There’s a catch, though: distributing across cores means spawning multiple
processes, each with its own memory space. If your application uses in-memory
storage for global state, spawning across processes will require some
re-structuring. Socket.IO keeps track of all connected clients in this way (in
process memory) by default. Other storage solutions can be used; Socket.IO even
ships with a Redis-powered store. This was perfect, because the application
already used Redis for scheduling information.
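
Wiring that up looks roughly like the following, based on the configuration
documented for the 0.9-era RedisStore (connection details omitted; defaults
assumed):

var io = require("socket.io").listen(8080);
var RedisStore = require("socket.io/lib/stores/redis");
var redis = require("socket.io/node_modules/redis");

// three Redis connections: one for publishing, one for subscribing,
// and one for ordinary commands
io.set("store", new RedisStore({
  redisPub: redis.createClient(),
  redisSub: redis.createClient(),
  redisClient: redis.createClient()
}));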

Integration was a snap (check out the Socket.IO wiki page on configuration for
details). Once that was up and running, everything worked exactly as planned
and the stress testing was finally done.

Not buying it, huh? Well, the truth is that performance was worse than ever. So
bad, in fact, that most clients couldn’t even connect:

Connections with Redis

This was by far the most surprising setback to date. Redis is built for use in
realtime applications. Why should using it for persistence with Socket.IO be so
taxing?

There’s a hidden cost to moving away from in-memory storage: data
serialization*. In order for data to be shared between processes, it must be
serialized to some standard format. In order to send data to the Redis server,
the Redis-backed store has to convert all data to a string. (Don’t believe me?
Check out the implementation of pack and unpack in the RedisStore source.)

Now the question became, “Can we do better than JSON.stringify and
JSON.parse for data serialization?” As it turns out, there are a ton of
interesting ways you can do just that. Instead of rolling my own optimizations,
it was suggested that I look into MessagePack. From the project’s website:

If you ever wished to use JSON for convenience (storing an image with
metadata) but could not for technical reasons (encoding, size, speed…),
MessagePack is a perfect replacement.
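
To give a sense of what swapping the serializer looks like, here is a tiny
sketch using the msgpack-lite npm module (one of several MessagePack
implementations, and not necessarily the one we evaluated):

var msgpack = require("msgpack-lite");

var payload = { race: "Example Senate Race", precinctsReporting: 4051 };

// JSON round trip: object -> string -> object
var jsonString = JSON.stringify(payload);
var fromJson = JSON.parse(jsonString);

// MessagePack round trip: object -> compact Buffer -> object
var packed = msgpack.encode(payload);
var unpacked = msgpack.decode(packed);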

Even though MessagePack boasts some impressive speed improvements over JSON,
it seemed as though 10,000 clients simply required more serialization than my
server was able to handle. We decided to continue to use Redis for our
application-level logic but revert back to in-memory storage for Socket.IO.
Election Night and Beyond

Clearly, this research became something of a rabbit hole for me. Since the
US Elections Committee wasn’t particularly concerned with my stress testing
results, election night came and went before I solved the performance problems.
I’m happy to report that our Node.js application performed just fine; between
the two extremes I had tested (10,000 WebSocket connections and 10,000
long-polling connections), we hit a manageable client distribution that was
somewhere between “all WebSocket” and “all long-polling.”
Had there been more time, I would have investigated using a tool like
stud to handle all incoming connections and
load balance between multiple Node.js processes. As long as sticky sessions**
were enabled, this would preclude the need for shared state between the Node.js
processes (and thus the need for large-scale data serialization).

Hopefully, this harrowing tale has given you some ideas about optimizing for
scale in your own Node.js/Socket.IO applications. There are no easy answers,
but with the right mindset, the process can be fascinating! If you want in on
the fun, check out the next article in this series to learn how you can set up
a stress testing control center of your own. After that, I cover how to
configure and tune a virtual server to run the application in the final post of
the series.

* One of the best parts of working in open source is that so many
maintainers are so willing to help. This means that when you have a question
about your tools, git blame will point you in the right direction. This is
how I learned about some of the issues with serialization, and I have to thank
Daniel Shaw for his generosity in lending a hand.

** “Sticky sessions” refers to a proxy’s behavior of consistently forwarding
requests from the same client to the same node in a cluster. This is
necessary to prevent long-polling connections from bouncing between processes
with each transmission.