Smuggling Words – The Challenge of Recording Screen Reader Output

Posted by Mike Pennisi

Mar 17 2025

We’ve spent years building systems that abscond with the messages that screen readers typically speak to their users. Spiriting away these parcels of text has been a thrill, but before we can get into all the intrigue, you’ll have to know a bit about how we got here.

During our years-long participation in ARIA-AT, we’ve always felt that recording the behavior of screen readers was a task best left to a computer. The W3C Community Group, working to promote interoperability amongst assistive tech (AT) like screen readers, certainly needs to know what screen readers say when their users navigate the web… But the Community Group also has mountains of work that only humans can do. That’s why we built a system that can collect the speech produced by NVDA and VoiceOver for any given web content and user interaction.

Of course, this was easier said than done! In this post, we’ll review our solutions (there are two) and consider the work we still have before us.

We laid the groundwork for this effort by authoring AT Driver, a protocol for automating assistive technologies. As a W3C Editor’s Draft, AT Driver formally defines the primitives necessary to build the automation system we had in mind for ARIA-AT. (Those same primitives will also be of interest to any application developer who cares about how ATs render their work, but that’s a topic for another blog post!) AT Driver served as a common design document for a bifurcated effort: automating NVDA (a popular Windows-based screen reader) and automating VoiceOver (a popular macOS-based screen reader). Our partners at Prime Access Consulting were able to build an implementation for NVDA while we simultaneously built an implementation for VoiceOver.
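To give a feel for the protocol: AT Driver is a bidirectional, WebSocket-based protocol in the spirit of WebDriver BiDi. Here’s a rough sketch of a client session in Python. The command and event names (session.new, interaction.pressKeys, interaction.capturedOutput) follow the Editor’s Draft, but the server address, the capabilities payload, and the key encoding are illustrative assumptions, not a definitive implementation.

import asyncio
import json

import websockets  # third-party: pip install websockets


async def main():
    # The port and URL path are placeholders; consult the AT Driver
    # server's documentation for its actual address.
    async with websockets.connect("ws://localhost:4382/session") as conn:
        # Establish a session with the AT Driver server.
        await conn.send(json.dumps(
            {"id": 1, "method": "session.new", "params": {"capabilities": {}}}
        ))

        # Ask the assistive technology to press a key.
        await conn.send(json.dumps(
            {"id": 2, "method": "interaction.pressKeys",
             "params": {"keys": ["ArrowDown"]}}
        ))

        # Print every utterance the screen reader produces in response.
        async for raw in conn:
            message = json.loads(raw)
            if message.get("method") == "interaction.capturedOutput":
                print("screen reader said:", message["params"]["data"])


asyncio.run(main())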

NVDA offers a robust plug-in interface that allows developers to extend it with additional functionality. Prime Access Consulting used Python to surface the speech data and Go to build a server that implements AT Driver. The result is the NVDA AT Driver Server, an open source NVDA plug-in.
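To make the Python half concrete: an NVDA add-on can observe speech by wrapping NVDA’s speak() function from a global plugin. The sketch below shows only that interception step and is a simplification of the real add-on (which pairs a hook like this with the Go-based server that speaks AT Driver); treat the module paths and names as illustrative.

# An NVDA global plugin that intercepts speech. A minimal sketch of
# the technique, not the NVDA AT Driver Server itself.
import globalPluginHandler
import speech


class GlobalPlugin(globalPluginHandler.GlobalPlugin):
    def __init__(self):
        super().__init__()
        self._original_speak = speech.speech.speak
        speech.speech.speak = self._speak

    def _speak(self, sequence, *args, **kwargs):
        # A speech "sequence" mixes text with control commands; keep
        # only the text and forward it before NVDA speaks as usual.
        text = " ".join(item for item in sequence if isinstance(item, str))
        self._forward(text)
        return self._original_speak(sequence, *args, **kwargs)

    def _forward(self, text):
        # In the real add-on, this would hand the text to the AT Driver
        # server so connected clients receive it as captured output.
        pass

    def terminate(self):
        # Restore the original function when the add-on is unloaded.
        speech.speech.speak = self._original_speak
        super().terminate()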

Automating VoiceOver took a little more effort. That’s partly because the macOS screen reader’s programmatic interface (specifically: the AppleScript facilities it exposes) isn’t quite as expressive. While those facilities do expose speech data, we ultimately decided that the messaging paradigm they require isn’t appropriate for ARIA-AT’s purposes.[1] To surface the speech data, we instead built a text-to-speech voice for macOS. Unlike a traditional text-to-speech voice, which outputs an audible representation of the text, ours surfaces the speech data on an AT Driver server. This allows us to observe VoiceOver’s behavior without integrating with VoiceOver directly.

Traditional text-to-speech flow:

VoiceOver --> |text| --> macOS --> Apple "Siri" Voice --> |audio buffer| --> speakers

Our text-to-“speech” flow:

VoiceOver --> |text| --> macOS --> Bocoup Automation Voice --> |text| --> AT Driver server
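The real voice is built against Apple’s native speech-synthesis APIs, but the second flow above boils down to a small idea, sketched here in Python with illustrative names and transport details: when asked to “speak,” forward the text instead of producing audio.

import asyncio
import json

import websockets  # third-party: pip install websockets


class AutomationVoice:
    """A stand-in for a macOS text-to-speech voice."""

    def __init__(self, server_url="ws://localhost:4382"):
        self._server_url = server_url

    async def speak(self, text):
        # A conventional voice would fill an audio buffer here. This
        # voice instead smuggles the text to the AT Driver server,
        # which re-emits it to connected clients as captured output.
        async with websockets.connect(self._server_url) as conn:
            await conn.send(json.dumps(
                {"method": "interaction.capturedOutput",
                 "params": {"data": text}}
            ))


# VoiceOver -> macOS -> our voice, e.g. when VoiceOver reads a link:
asyncio.run(AutomationVoice().speak("link, Home"))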

Retrieving text in this way feels like a le Carré-style exfiltration, with the cargo passing through obscure realms and crossing borders under arcane protocols. Predictably, the result suffers from reliability issues that we’re still investigating even as I write these words. ARIA-AT can fortunately tolerate some instability here because humans are never completely out of the loop, but it’s certainly something we’d like to improve.

That said, we’d strongly prefer for all assistive technology to be open to observation. The success of projects like GuidePup proves that developers want to understand how ATs render their work, and we’re convinced interest in accessibility testing would be even greater if it weren’t for the technical hurdles present today. Whether eliminating those hurdles means implementing AT Driver or exposing a robust proprietary interface, transparent software empowers all developers (not just the folks on ARIA-AT) to build better services.

[Image: a frame from “The Sandlot” featuring a group of kids hoisting their friend over a fence via an elaborate system of pulleys.]

In 1993’s The Sandlot, a rag-tag team of plucky adventurers devises increasingly imaginative schemes to retrieve precious cargo from an inhospitable environment.

Moving forward, we’ll continue working to stabilize the behavior of the macOS TTS voice while simultaneously advocating for a more expressive VoiceOver API. ARIA-AT is also interested in evaluating mobile screen readers. The restrictions of Android and iOS pose novel challenges for building similar solutions, so we’re already considering even more advanced extraction techniques ranging from speech-to-text to optical character recognition. Even though these machinations can feel like rappelling down skyscrapers and crawling through ductwork, we remain convinced that interoperability is a worthy prize.


  1. The GuidePup project, although faced with the same technical challenge, has opted to use VoiceOver’s AppleScript interface. Their practical experience affirmed the hazards we theorized, but their audience has a slightly higher tolerance for false negatives. Since ARIA-AT makes public statements about the behavior of other software projects, we set an exceedingly high bar for the correctness of our solution. 

