Last year we formed a team focused on using the growing number of new web APIs to build multimedia applications that push the edge of the web platform. Our main work in this area over the last 2 years has been with the wonderful team behind Scratch at MIT’s Lifelong Kindergarten Group (LLK), and with LLK partners integrating with the newest version: Scratch 3.0.
At the beginning of this year, Scratch 3.0 launched, and I wanted to share a little bit about our work on the platform. Scratch is a visual, block based programming language, with a lot of support and community resources for education.
In 2017, we started contributing to this effort, with a focus on: monitoring and improving performance on target devices, interoperability between Scratch 2.0 and Scratch 3.0, and third party partner integrations. Through the course of our work with Scratch 3.0 dependants and through working directly with LLK on Scratch 3 performance and interoperability, Z, Marie, Katie, Val and myself have landed 193 patches on Scratch to date.
Scratch Vocabulary Primer
To orient us before discussing our technical work, I’d like to first share some Scratch jargon. Scratch programs exist on a “stage”. You can think of the stage as a canvas, or the active screen. You can add “sprites” to the stage. Sprites are a collection of “costumes” (one or more images), “sounds”, and “blocks” (program behaviors). Sprites can be controlled programmatically using blocks. Sprites have one or more “clones” on the stage which each have their own independent coordinates, current costume, rotation, variables, etc. A block can be anything from a start event, to a move x/y pixels, to camera/video motion detection.
Scratch Block Types
We have three general categories:
“Event Blocks” – These have a rounded top and kind of look like hats (which leads to their nickname of “Hat Blocks”). They cover events like keypress, broadcast event received, button press, green flag (user clicks start), mouse click, timer, etc.
“Command Blocks” – Generally square in shape with a connector above and below, these blocks tell the sprite to change something. I.E. move, rotate, play a sound, switch costume, broadcast an event, etc.
“Reporter Blocks” – Shaped with rounded corners (numbers and strings) or triangular (boolean true/false) edges, these blocks will generally read or calculate some value and “report” it. You generally drop reporters into the input bubbles of other command or hat blocks.
The Scratch 3.0 platform, which you can think of as the web application that ships to scratch.mit.edu, is developed across many repositories in the LLK org on github. Most of the topics covered in this post deal with changes we made to the Scratch Virtual Machine, Scratch GUI and Scratch Render. The Scratch Virtual Machine, or “VM” for short, is a node.js module which implements the scratch language. The Scratch VM is responsible for taking a configuration of sprites and blocks, parsing them into a more structured representation, compiling that structure into something that can run fast, and then evaluating the result of the scratch project. Scratch Render is attached to the VM and is responsible for rendering the evaluated result of a scratch program to the device it is running on. In the case of the Scratch website, this is usually WebGL on a computer screen.
The Scratch GUI (short for Graphical User Interface) provides a set of React Components that present the graphical interface of the Scratch editor. On the Scratch Website, the Scratch GUI is also responsible for starting the VM, attaching the renderer, and bootstraping everything else (E.G. scratch-blocks, audio-engine, svg-renderer and its bitmap-adapter, scratch-paint, scratch-render-fonts, etc). GUI is the repository that ties together all of these components into the browser, so, for example, this is where keyboard and mouse events are setup.
In order to ship Scratch 3.0 we needed to maintain a baseline performance parity and ship with zero performance regression for projects created in prior versions. If a project ran well in Scratch 2.0 – no matter what the structure – the goal was that it should run at least as well in Scratch 3.0 on target devices, including the 2015 iPad Mini, 2017 Samsung Chromebooks, and the Raspberry Pi 3B+.
To ensure performance did not regress, we set up a benchmarking suite and reference hardware to test on (including a 2015 iPad and a 2017 Chromebook). In the fall of 2018 we added the Raspberry Pi 3B+ to our target machines. We manually ran all through our benchmarks on these devices and reported on performance parity in each of the areas of the code base that we worked.
Starting in mid-2017, we developed the first set of benchmarking tools and measurement methodologies to compare Scratch projects across multiple versions of Scratch. This Scratch VM Benchmark shows a good example of our performance data collection in action.
Here we have collected Scratch projects that pushed the edge of browser and device performance. Each project is played twice, to collect data from a cold start and a warm start. A cold start measures how long it takes for the Scratch VM to initialize it’s caches, while the warm state measures blocks executed once the VM is initialized or and “warmed up”. We capture data on how many blocks are executed per second, along with the speed of different function calls and frames. We used these marginal timing differences to find hot spots and prioritize our performance optimization work within the Scratch VM.
In the beginning, per-block timing reports helped us identify slow scratch blocks to target. Once we made these all fast, we moved on to overall VM performance. We disabled per-block reporting at that point because collecting per-block timing information slowed down the benchmark and obscured other parts of the overall VM performance. The page still shows spots for these values in case we do want to enable it again, so expect to see a bunch of 0s if you look at the per block timing in the reports today.
I’ll start off the performance section of this post with a walk through of some of the block-specific optimizations we developed using this benchmarking tool, and then wrap up with walkthroughs of cross-system optimizations we made to the VM, and loading pipeline.
This CPU based approach may sound like a lot of work computationally, but for the “touching another sprite” detections, only the alpha channel of color is needed; so for each pixel, we have a boolean, and a limited number of targets to test against. For color based operations, however, no target is selected, and the math is much more complex, requiring blending and alpha compositing all of the possible clones on the stage at the tested location. This adds two more whole
for loops on the outside of the operation (increase by another O(n²)) than a “touching sprite” collision detection.
To deal with this additional exponential complexity in the color touching blocks, we implemented a branching condition so that if the value of the total number of images multiplied by the total number of tested pixels exceeds 40,000, we kickoff a simultaneous GPU based approach. We still run the first 40,000 checks on the CPU while the GPU is running in case any collisions are detected in that first set. If no collisions are detected in that first set, then we cut over to the GPU and ask for the remaining set of checks. This approach cuts out out the delay of waiting for the GPU in the cases where the collision happens early enough to faster on the CPU. We manually tuned this 40,000 threshold based on the performance of the 2017 Samsung Chromebook, our target device, which incidentally made my gaming computer run this slightly slower. We determined that this was the correct performance trade off for Scratch.
The Pen Extension
The Pen Extension provides a way for Scratchers to both draw lines using the “pen down / up / color” blocks, and make “trails” or “copies” of sprites in via a “stamp” block. According to data from the Scratch team, many popular Scratch projects use the Pen tool to make tracers, gradients and other effects so this was a priority performance area for us.
We noticed that projects based on “Pen Lines” were already doing well performance-wise compared to Scratch 2, but projects using “Pen Stamp” were much slower than in Scratch 2.0. We found that the implementation of Pen Stamp had a inefficient rendering pipeline involving multiple
<canvas> elements and movement of pixels back and forth between the CPU and GPU. We moved to using a framebuffer with WebGL to keep the rendering entirely on the GPU, and avoid transferring it to the CPU for processing. This was a similar strategy to how we handle performance bottlenecks in the touching blocks, but in the opposite direction. For Touching, we needed to stay on the CPU to avoid the synchronous work of moving pixels between CPU and GPU. In this case we are able to avoid moving to the CPU at all and composite Pen Stamps entirely on the GPU.
Once we identified this GPU compositing strategy for Pen Stamp we were able to apply it to Pen Line. We measured our overall performance for the Pen Extension with a fun new metric: “Time To Bob Ross Painting”, or “TTBRP” for short. When we started it took a while to render a Bob Ross painting using Pen. Through our work here, we were able to improve overall TTBRPs from seconds to milliseconds, and then flip our metric from TTBRP to “BRP/s” (“Bob Ross Paintings Per Second”). We saw a 500% improvement on 2017 Chromebooks, 800% on 2017 Macbook Pro, 1100% on a 2015 iPad, and ∞% improvement on coolness of performance metric names:
The example above has a before (in black) and after (in color) for time comparison. There was also an alpha/color compositing bug at the time of this optimization, hence the darkness in the before version. After our optimization we are able to render 16-17 of these in the same time that it used to take 1. You can run this Bob Ross painting example if you’d like, just remember to “Shift + Click” the green flag to enable “TURBO MODE” first or you’ll be waiting a while for your BRP.
We’re planning on following up with a more in depth walkthrough of the math involved in GPU compositing, so stay tuned for that.
Procedure, Hat, Motion & Looks blocks
In the above example “My Block” is a custom “procedure”. We were able to use various caching techniques to greatly improve the performance of these blocks. We cached information about what the block does, its arguments, location and stack point, so that the VM no longer needs to look that information up on every frame. Now custom procedures are only re-evaluated when a Scratcher edits blocks.
We employed similar caching techniques for the “motion” and “looks” blocks. Motion and looks are very commonly used blocks. Scratchers use them to move and change the appearance of sprites on the stage. In the initial version of Scratch 3.0, motion and looks blocks performed unnecessarily complex calculations on initialization and at every execution. We implemented a caching scheme here, and rewrote some of the math to make this more efficient. We were also able to defer processing of some of the calculations so that the optimization not only sped up the number of times these blocks can be run in a second, but boot up time as well.
We are also currently in the process of applying a caching-based optimization all of the “hat” blocks, like the “when green flag clicked” block in the procedure block example above. Currently the Scratch VM iterates over every block for every sprite on the stage looking for hat blocks. It does this on every frame. That’s three
for loops on each tick of a Scratch project. We are creating cache to store hat block information in a more efficient event model that will only need to be updated when blocks are edited. When we’re done, this will make scratch project startup time and playback a lot faster and more efficient for lower power devices.
Once we were able to reliably get “per block” performance for the above mentioned blocks in Scratch 3.0 faster than the same blocks with Scratch 2.0 on target devices, we started looking for cross-system optimizations.
We went through the sequencer and run loops which the VM uses under the hood to execute each block and figure out what to execute next. We created faster “mathy” ways of branching, and implemented a handful of other micro-optimizations which improved overall performance. We are still actively working in this area and finding some pretty impressive gains across every Scratch project.
We also cached runtime data from execute functions which the sequencer and run loop use to evaluate Scratch programs. As you go through the execution of a Scratch program, the current block used the block before it as an input, creating the “Scratch stack”. This cache let’s us reuse values from while loops in prior blocks in subsequent blocks. This optimization prevents the VM from needing to repeatedly dereference values from string keyed objects and saves a lot of time. We also changed the way that data is processed for input blocks (blocks wrapped by other blocks) from using a more computationally expensive object key lookup to using a flattened array of inputs, and storing their return values directly on the parent block.
Loading and First Paint Performance
Across our work for Scratch 3.0, we also focused on on decreasing our load time and improving our time to first interaction.
One of the first hotspots, and probably one of the most effective optimizations we made in the loading pipeline, was decoding of ADPCM audio files, a format used by Scratch 1 and 2 programs. We used many of the same techniques as above — smarter loops, caching math results, and reusing smaller memory chunks rather than allocating large buffers — and were able to reduce the memory footprint of this process from hundreds of megabytes to under 1 megabyte. We also gained speed improvements of over ten times faster on the 2107 chromebook, and a bonus 45x times faster for Firefox on the 2017 Macbook developer machine used.
Imagine if every time you wanted to make a cookie, you had to restart the entire production line. Using this scheme lets us wait for all the orders to be in before turning on the machine. Using this approach we were able to speed up our time to first interaction, while decreasing our over all computational load and shortening the total load time. This work shaved seconds off of load time for the average project on a modern macbook, but is much more noticeable for Scratch projects with many assets that no longer take minutes to load on older iPads and Chromebooks.
Interoperability and Feature Parity
In addition to performance parity between Scratch 2.0 and Scratch 3.0, we have also been working on language compatibility. Most of the issues identified start as reports from the Scratch Community forums for projects that don’t work in Scratch 3.0. Rewriting a VM is a big undertaking, so we expected to find undocumented interactions between Scratch blocks in Scratch 2.0 that the authors of Scratch 3.0 didn’t get quite right the first time. Catching these and fixing and documenting them has been the focus of our compatibility work. We used the existing Scratch testing process to find, dedupe, prioritize and reproduce our compatibility bugs based on Scratcher submitted bugs, cross referenced with the most popular Scratch projects for priority. Because the Scratch team committed to having projects work the same in Scratch 3 as in Scratch 2, our interoperability work spanned the gamut of Scratch functionality.
Many little details needed to interact correctly for a lot of Scratch 2 programs to work correctly in Scratch 3. This was a loving process of combing through broken Scratch 2 projects, sussing out the root causes, and patching them. Below are a few example fixes we made.
Math.round rounding which caused almost every project to break a little. In the Scratch 3 Beta which we started our interoperability work from, we had lots of wonky image positions. In addition, many projects depended on the Scratch 2.0 rounding mechanism for their behavior. We identified this bug and implemented the solution using the Scratch 2.0 thresholding approach to coordinate rounding.
setTimeout. Scratch 3.0 did not have a cached timer, instead each block would call
Date.now() to orient itself. To resolve this, we recreated the Scratch 2.0 timer-based functionality, and created an interface that allows blocks depending on the millisecond timestamp to share the cached value that earlier versions of Scratch had provided.
In Scratch 2.0 and 1.0, hat blocks were executed before this internal timestamp was updated, and before any other blocks. Scratch 3 currently executes hat blocks in threads, starting with each hat block, and continuing through it’s thread which might change the state, then onto the next hat block. This causes Scratch programs that depend on state in hat blocks to break. We are working to patch this behavior to replicate the previous execution order.
Audio Engine Compatibility
We also worked on the Scratch-audio setup to help try to solve some architectural bugs that showed up as Scratch 2.0 incompatibilities. Certain aspects of the way sounds are played back can be controlled on a per sprite basis, and independently for each of its clones respectively. There are interesting rules here. One particularly tricky condition is that a sound started by a clone plays to completion, even if the clone is deleted while it plays. This condition was not supported in the original Scratch 3 audio engine, so we re-implemented the original behavior. This required the sound players to be able to finish even if their clone was deleted. In order to do this, we created the idea of a “SoundBank” for each sprite that could share the sound selection with all of its clones. This way clones can apply effects independently, even though they share a single sound with the parent sprite. It is a little weird, but definitely compatible with 2.0. We created other variants of the SoundBank and integrated them into the Music extension and the Scratch GUI’s sound library as well.
As a final compatibility requirement, we needed a way to load all historic Scratch 1.x and 2.x file formats from the Scratch cloud into Scratch 3.0. Scratch 1.x stored Scratch projects in a binary data format, whereas Scratch 2.0 stored projects in a zip file containing all dependencies in standard formats (e.g. png, svg, wav, etc).
Scratch 2.0 had an existing way to load Scratch 1.x files. Scratch 3.0 was built to be able to load files in the Scratch 2.0 format, and had already implemented a conversion pipeline. We needed a way to convert Scratch 1.x files into Scratch 3.0. We decided to develop a pipeline that used the Scratch 1.x to 2.x conversion pipeline in order to take advantage of the existing compatibility bug fixes implemented in it. The resulting system loads Scratch 1.x files into a mock Scratch 2.0 project. This gave us the benefit of bringing along all of the historic project conversion bug-fixes for pre-3 projects. This ended up being quite a useful workflow, and now all scratch 1.x projects work in Scratch 3.0. Here is our favorite one, staring Eric Rosenbaum from the LLK team.
Official and Partner Extensions
As part of our work we also developed the video motion extension for Scratch 3.0, based on the Scratch 2.0 augmented reality extension by the same name. We reached feature parity for this extension, and made it run faster, too. You can demo of the video sensing extension running with our scratch-audio rework in the AR rhodes piano project.
Bocoup and Scratch Values
As a multimedia educational programming environment, Scratch 3.0 really pushes the edge of what the web platform can do. Scratch runs an entire Virtual Machine in the browser, handles complicated graphics, audio, video, hardware, and augmented reality programming capabilities, and runs on resource constrained devices like single board computers, tablets, and Chromebooks. At Bocoup, we are dedicated to improving the web platform as a publicly standardized and royalty free technology. Pushing the edge of the web’s native capabilities helps us make sure that the web is a viable choice for many applications that might otherwise choose a privately developed, closed source and proprietary implementation solution.
Scratch also represents a warm, friendly and inclusive environment for people to learn how to work with and program computers. At Bocoup, we believe in building a radically inclusive web platform, which prioritizes the safety and involvement of marginalized people over people with greater privilege. Scratch embodies these principles in the production team’s working environment, and through community management on the Scratch website. The Scratch team is relentlessly dedicated to prioritizing “low floors” for Scratchers, to make it easy and safe to get started. There are so many barriers to folks in gaining access to the benefits associated with programming computers, and so we see it as extra important to make sure that the learning environment is not one of them.
All the work talked about in this post is open source and you can find all the relevant issues, discussion and patches on github. We’re so proud of this work, and honored to have had the opportunity to contribute. We are eager to continue our work with folks in this space. If your organization has a Scratch-related project like a Scratch extension or embedding of the Scratch VM, please do reach out to us. If you’re interested in learning more about web application performance and interoperability, keep an eye on our blog for deeper dives into the topics we’ve touched on here.