Earlier this year, we shared our work on the launch of Scratch 3.0, a major version of the visual programming environment for children of all ages. The new version of Scratch marked a complete rewrite of the runtime in JavaScript leveraging open web APIs. In our previous post, we enumerated the many performance optimizations that were necessary to achieve smooth frame rates in the new engine on devices such as the 2015 iPad Mini and the 2017 Samsung Chromebook.
In this post, I’d like to dive into one specific optimization: speeding up the built-in Pen extension. Previously, we noted that optimizing the Pen extension required moving work from the CPU to the GPU. Here I’ll go more in depth about the bottleneck in modifying WebGL state and the shader that made the conversion possible. Ultimately, we were able to improve performance by up to 10x—the difference between a sluggish program and a smooth interactive drawing!
For context, the Pen extension provides a set of Scratch blocks, or bits of code in Scratch’s drag and drop language, that draw lines or make multiple copies of a sprite. Lines are drawn by placing the Pen Down Block on a sprite, and moving the sprite. Copies are made by applying the Stamp Block to a sprite. Stamps can be used, for example, to create a trailing effect. A lot of popular Scratch projects draw lines and stamp sprites in bulk to great effect. This makes the performance of drawing a lot of lines and stamping sprites important.
This pen stamp example shows a rotating cat where each new frame is a copy of the cat, resulting in a trace of cat copies in the shape of a circle.
Scratch 3 uses WebGL to handle scene rendering and apply special effects to sprites on the main rendering area called the stage. I am going to reference some high level details of WebGL to help explain this optimization story. We can break down some concepts into what WebGL is drawing and how it is drawing. In Scratch 3, WebGL draws points grouped into triangles. The triangles are first transformed to the exact position where they will be drawn, then rasterized into pixels, and finally each pixel’s color is selected. We use a Vertex Shader to transform the location of individual points and a Fragment Shader to determine the pixel’s final color. We use a Framebuffer to store a rendered triangle in memory and a Blending Function to blend a pixel’s color with another existing color, such as a background or another sprite behind the current one. Together, these parts of drawing make up WebGL state. Understanding WebGL’s state is important to using it performantly.
WebGL and GPUs in general are really fast at drawing multiple different triangles. They do not perform well, however, when they need to change how they are drawing or the state of the WebGL program. WebGL can easily draw thousands of different triangles, but it might only be able to change its state tens of times while rendering one frame.
Optimizing Pen Stamp
General Pen line performance was already faster in Scratch 3 than in Scratch 2 so we started with Pen Stamp optimization. When we started optimizing Pen Stamp, we noticed that it used a hybrid Canvas API and WebGL approach to render. The Pen Stamp first drew the sprite with WebGL on the GPU, then copied the GPU output to a 2d canvas on the CPU, and then drew the 2d canvas to a composited canvas with other copies of the sprite. The most expensive step in this procedure was reading the pixels from the GPU onto the CPU, a drastic state change that switches the flow of data from writing to reading. Since the final scene rendering is done in WebGL and the 2d canvas compositing could be replicated in WebGL, we thought that we could speed things up by doing all of the Pen Stamp’s work in WebGL.
So that is what we did. We swapped the 2d canvas strategy for a WebGL strategy that composites stamps and lines in an extra Framebuffer, which lets us draw to a working area of memory without incurring the pixel-copying cost of leaving WebGL. With this strategy, we are able to keep the first draw of the sprite in WebGL and project the drawn sprite into a second compositing framebuffer. We can layer multiple stamps or lines onto one texture that represents all pen operations since the last “erase”. By drawing the main rendering framebuffer to a second one, we can skip three steps: 1) copying the GPU output to a 2d canvas, 2) drawing that canvas to a compositing canvas, and 3) transferring the canvas to the GPU as the source image for a texture.
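As a rough illustration of the extra framebuffer described above, here is a minimal sketch of creating an off-screen, texture-backed framebuffer with the standard WebGL 1 API. The helper name `createPenLayer` and the choice of filtering and wrap modes are assumptions made for this example, not a copy of Scratch's renderer.

```javascript
// Sketch: create an off-screen framebuffer backed by a texture.
// `gl` is assumed to be a WebGLRenderingContext. `createPenLayer` is a
// hypothetical helper name used only for this illustration.
function createPenLayer (gl, width, height) {
    const texture = gl.createTexture();
    gl.bindTexture(gl.TEXTURE_2D, texture);
    // Allocate empty RGBA storage; passing null uploads no pixel data.
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, width, height, 0,
        gl.RGBA, gl.UNSIGNED_BYTE, null);
    // Assumed sampling settings; clamp-to-edge also keeps
    // non-power-of-two sizes usable in WebGL 1.
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);

    const framebuffer = gl.createFramebuffer();
    gl.bindFramebuffer(gl.FRAMEBUFFER, framebuffer);
    // Anything drawn while this framebuffer is bound lands in `texture`.
    gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0,
        gl.TEXTURE_2D, texture, 0);
    if (gl.checkFramebufferStatus(gl.FRAMEBUFFER) !== gl.FRAMEBUFFER_COMPLETE) {
        throw new Error('Pen layer framebuffer is incomplete');
    }
    // Rebind the default framebuffer so later draws go to the canvas.
    gl.bindFramebuffer(gl.FRAMEBUFFER, null);
    return {framebuffer, texture};
}
```

With a layer like this, a stamp would be drawn while `framebuffer` is bound, and the accumulated `texture` would later be drawn to the stage like any other textured quad, so the pixels never leave the GPU.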
After replacing the 2d canvas strategy with a framebuffer projection, we measured the following performance gains on our test devices. The following results come from one stress test running between 5 and 11 times depending on the system.
Test | Platform | Scratch 2 | Scratch 3 Canvas | Scratch 3 Framebuffer | Canvas / Framebuffer |
---|---|---|---|---|---|
236345336 | MacBook Pro 15″ 2017 Chrome | 59.09 | 73.66 | 10.48 | 7.02x |
236345336 | MacBook Pro 15″ 2017 Safari | 59.09 | 81.52 | 10.86 | 7.50x |
236345336 | iPad Mini 2015 iOS 10 | n/a | 272.6 | 23.93 | 11.39x |
236345336 | SMS Chromebook 2017 | n/a | 114.46 | 22.56 | 5.07x |
Optimizing Pen Line
At this point, we were drawing the Pen Stamp tool with WebGL, but still drawing lines with the 2d canvas. We needed a way to composite Stamps drawn in WebGL with lines drawn in the 2d canvas, in order to be able to draw Stamps and Lines on top of each other. To do this, we first tried a hybrid approach for line drawing where lines were first drawn on the 2d canvas and then uploaded to WebGL. Depending on the mixture of lines drawn and sprites stamped in a given project, this approach yielded either about the same performance as existing line drawing, or significantly slower performance, as the lines had to be converted to GPU textures every time a stamp was drawn.
So we looked into other ways we could draw lines with WebGL, instead of the hybrid approach. The resulting approach was, well, to draw the line in WebGL in the first place. WebGL lets you draw native lines with vertices, but there are a few important constraints. The maximum line width varies depending on the GPU hardware or GPU driver. That system maximum might be less than the width of a line in a given Scratch project. In some environments, the maximum width is 1 pixel. Since our line width could be any value up to the size of a Scratch user’s screen, this wouldn’t work. Instead, we developed a mesh and model matrix method for drawing lines. We used a 6 triangle mesh: 2 triangles for the body of the line and 2 triangles for each end of the line.
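The line-width limit mentioned above can be checked at runtime. The sketch below uses the standard `gl.getParameter(gl.ALIASED_LINE_WIDTH_RANGE)` query, which returns a two-element array of the minimum and maximum supported widths. The helper name `supportsNativeLineWidth` is made up for this example.

```javascript
// Sketch: check whether native GL lines can render a given pen width.
// `supportsNativeLineWidth` is a hypothetical helper name.
function supportsNativeLineWidth (gl, width) {
    // ALIASED_LINE_WIDTH_RANGE is a two-element array: [min, max].
    const [min, max] = gl.getParameter(gl.ALIASED_LINE_WIDTH_RANGE);
    return width >= min && width <= max;
}
```

On a system that reports a range of `[1, 1]`, this returns `false` for any pen wider than one pixel, which is the situation that rules out native lines for Scratch.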
The graphic of the mesh shows it rotated on its side, along with the three transformations applied to it. The caps need to be transformed and cut off so that their size is relative to the width of the line. To do that, the mesh’s y values are first scaled. Then the values are clamped to the range -0.5 to 0.5. This cuts off most of the area of the original mesh, leaving a much smaller one. Last, the whole mesh is scaled, leaving end caps that are twice as wide as they are tall.
This technique was fast, but at this point we were drawing rectangular end caps, and we needed rounded caps to meet the design requirements for Scratch. Rounding the mesh for the caps would require a different mesh for every width of line, or there would be somewhat obvious artifacts along the edges of the lines, making them look like polygons. We decided to create the rounded caps in the fragment shader instead. During the rendering process, the triangles of the line mesh are broken up into pixels and run through the fragment shader to assign a color. We looked at sampling a large semicircle-like texture in the fragment shader, but we wanted to avoid using multiple textures for different line widths. In the end, we didn’t need any textures at all, since the shape of the line cap can be computed mathematically. The alpha channel of a given pixel in the end cap is set to the clamped difference between the width and the distance to the body’s center, multiplied by the alpha of the line to account for semi-transparent lines.
Here is the actual fragment shader used for sampling the end caps.
```glsl
gl_FragColor = u_lineColor;
gl_FragColor.a *= clamp(u_capScale - u_capScale * 2.0 * distance(v_texCoord, vec2(0.5, 0.5)), 0.0, 1.0);
```
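To get a feel for what that expression produces, here is a JavaScript mirror of the same arithmetic that can be run on the CPU. The function name `capAlpha` and the sample value of 64 for the cap scale are assumptions for illustration; the shader itself is the source of truth.

```javascript
// JavaScript mirror of the shader expression, for experimenting on the CPU.
const clamp = (x, lo, hi) => Math.min(Math.max(x, lo), hi);

// capScale stands in for u_capScale, lineAlpha for u_lineColor.a, and
// (u, v) for v_texCoord. `capAlpha` is a hypothetical name.
function capAlpha (capScale, lineAlpha, u, v) {
    // Distance from this fragment to the center of the texture space.
    const d = Math.hypot(u - 0.5, v - 0.5);
    return lineAlpha * clamp(capScale - (capScale * 2.0 * d), 0.0, 1.0);
}

console.log(capAlpha(64, 1.0, 0.5, 0.5)); // center of the cap
console.log(capAlpha(64, 1.0, 1.0, 0.5)); // on the circle's edge
console.log(capAlpha(64, 1.0, 0.0, 0.0)); // corner, outside the circle
```

Fragments at the center are fully opaque, fragments at a distance of 0.5 or more are fully transparent, and a thin band just inside the edge gets a partial alpha. That band is what makes the rounded cap look smooth rather than jagged.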
A 0 length line ends up being two caps or a point. A 0 length line with a width of 128 would look like:
A 768 length and 128 width line similar in relation to our mesh image ends up with two caps and a filled body. (These images are given large values to help illustrate the shader’s output.)
Drawing the line with a single mesh, transforming it, and calculating the final shapes and anti-aliasing on the GPU was about twice as fast as painting the line on canvas when stamps and lines were drawn on canvas. In one stress test, we saw improvements between 1.3x and 2.6x.
Test | Platform | Scratch 2 | Scratch 3 Last | Scratch 3 New | Last / New |
---|---|---|---|---|---|
236783324 | MacBook Pro 15″ 2017 Chrome | 101.76 | 30.36 | 11.72 | 2.59x |
236783324 | MacBook Pro 15″ 2017 Safari | 101.76 | 26.63 | 15.98 | 1.66x |
236783324 | iPad Mini 2015 iOS 10 | n/a | 62.99 | 48.37 | 1.30x |
236783324 | SMS Chromebook 2017 | n/a | 105.71 | 47.95 | 2.20x |
WebGL State management
As we saw before when reading from the GPU, WebGL does not perform well with lots of state changes. By introducing WebGL-based pen drawing to the framebuffer, the pen now needed to manage WebGL state, so we built an ad hoc state machine into our rendering flow. Up to this point Scratch 3’s WebGL state changes were done in one class. Not only did we need to add a new class, but we needed to accommodate lines being drawn hundreds of times a frame which could require much more frequent state changes than the previous implementation. Instead of changing the state to draw a line and then changing the state back for normal rendering after every line, we wanted to be able to track state changes in order to combine multiple draw operations that required the same state.
We called this Draw Regions. And here is what the code looks like:
```js
/**
 * Enter a draw region.
 *
 * A draw region is where multiple draw operations are performed with the
 * same GL state. WebGL performs poorly when it changes state like blend
 * mode. By marking a collection of state values as a "region", the renderer
 * can skip superfluous extra state calls when it is already in that
 * region. Since one region may be entered from within another, an exit
 * handle can also be registered that is called when a new region is about
 * to be entered to restore a common in-between state.
 *
 * @param {any} regionId - id of the region to enter
 * @param {function} enter - handle to call when first entering a region
 */
enterDrawRegion (regionId, enter) {
    if (this._regionId !== regionId) {
        this._doExitDrawRegion();
        this._regionId = regionId;
        enter();
    }
}

/**
 * Register an exit handle when a new region will be entered.
 * @param {function} exit - handle to call when about to enter a new region
 */
exitDrawRegion (exit) {
    this._exitRegion = exit;
}

/**
 * Forcefully exit the current region returning to a common in-between GL
 * state.
 */
_doExitDrawRegion () {
    if (this._exitRegion !== null) {
        this._exitRegion();
    }
    this._exitRegion = null;
}
```
Entering and exiting regions is handled with JavaScript closures. If we are entering a new region, the enter closure is called. If we are already in that region the enter closure is ignored. This way we save ourselves from having to change state every time we need to draw a line if the last operation performed was another line drawn. After we have done the needed work we save how to exit the region with another closure so that entering a new region will call our exit closure first.
By entering and exiting with closures like this, two classes are able to manage shared WebGL state without explicitly understanding what states the other is entering. They share an understanding of a common in-between state instead. When they enter their state, they know the common starting point. When they exit their state, they know the common point they must return to.
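To make the batching effect concrete, here is a small self-contained sketch that wraps the three methods shown earlier in a stand-in class and counts how often the enter and exit closures run. The class name `RegionTracker`, the region ids, and the counters are hypothetical; in the real renderer the closures would change GL state instead of incrementing numbers.

```javascript
// Minimal stand-in for the renderer, reusing the three methods from the post.
class RegionTracker {
    constructor () {
        this._regionId = null;
        this._exitRegion = null;
    }

    enterDrawRegion (regionId, enter) {
        if (this._regionId !== regionId) {
            this._doExitDrawRegion();
            this._regionId = regionId;
            enter();
        }
    }

    exitDrawRegion (exit) {
        this._exitRegion = exit;
    }

    _doExitDrawRegion () {
        if (this._exitRegion !== null) {
            this._exitRegion();
        }
        this._exitRegion = null;
    }
}

const renderer = new RegionTracker();
// Counters stand in for expensive GL state changes.
const stats = {penEnter: 0, penExit: 0, spriteEnter: 0};

function drawPenLine () {
    renderer.enterDrawRegion('penLine', () => {
        stats.penEnter++; // e.g. bind the pen framebuffer and line shader
    });
    // ... the draw call for one line would go here ...
    renderer.exitDrawRegion(() => {
        stats.penExit++; // e.g. restore the common in-between state
    });
}

function drawSprite () {
    renderer.enterDrawRegion('sprite', () => {
        stats.spriteEnter++; // e.g. bind the stage and sprite shader
    });
}

for (let i = 0; i < 1000; i++) drawPenLine();
drawSprite();
console.log(stats);
```

In this sketch the 1,000 consecutive line draws trigger the pen region's enter closure once and its exit closure once, when the sprite region is entered, rather than 1,000 times each.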
Adding draw regions allowed our development systems to go from drawing tens of lines per frame to drawing thousands of lines per frame. Neither the pen stamp nor the pen line performance gains would have been possible without the use of draw regions to batch state changes.
Conclusion
Ultimately, optimizing Pen blocks required a deep understanding of the Scratch rendering pipeline, including the JavaScript virtual machine, the web Canvas API, the WebGL API, and the system’s GPU driver and hardware. Who knew that drawing a line could be so complex!
One of the major goals of the Scratch 3 release was ensuring that Scratch would run smoothly on as wide a variety of hardware devices as possible so that users everywhere could learn how to code in an accessible and efficient computing environment. The optimizations we implemented for the Pen blocks were especially valuable given how many Scratch programs use these drawing tools. We can’t wait to see what the community creates–sprites, lines, and all!