
SwiftUI: .accessibilityLabel powered by SSML

orange sheets of paper lie on a green school board and form a chat bubble with three crumpled papers.
Photo by Volodymyr Hryshchenko / Unsplash

It all began with the question, "How can SwiftUI make VoiceOver speak multiple languages?" That sent me down a rabbit hole about the relationship between attributed strings and voice synthesis. The approach worked, with some caveats, and it answered the question. Or so I thought. Afterwards, I couldn't stop thinking about the markup language I once had to use while working at a voice assistant startup, and I wondered whether it could now be applied to accessibility in SwiftUI.

SSML

SSML is a 20-year-old standard that allows control over aspects of speech such as pronunciation, volume, pitch, and pace across most synthesis-capable platforms. 

Speech Synthesis Markup Language Specification (SSML 1.0), introduced in September 2004, is one of the standards enabling access to the Web using spoken interaction.
Speech Synthesis Markup Language: An Introduction

The complete specification is available from the W3C, and Google's Cloud Text-to-Speech documentation has some useful examples:

Basic syntax

<speak>
  my SSML content
</speak>

Add emphasis to an announcement

<emphasis level="moderate">This is an important announcement</emphasis>

Read numbers as cardinals

<speak>
  <say-as interpret-as="cardinal">12345</say-as>
</speak>

Simplified pronunciation of a difficult-to-read word

<sub alias="にっぽんばし">日本橋</sub>
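
Because these snippets are plain XML, any text interpolated into them from Swift needs its reserved characters escaped first. Here is a minimal sketch of that idea. The helper names (escapedForSSML, speakDocument, substitution) are hypothetical and are not part of AVFoundation or the SSML specification:

```swift
// Hypothetical helpers for composing SSML strings; not an Apple or W3C API.

/// Escapes the characters XML reserves so plain text can sit inside an SSML element.
func escapedForSSML(_ text: String) -> String {
  var result = ""
  for character in text {
    switch character {
    case "&": result += "&amp;"
    case "<": result += "&lt;"
    case ">": result += "&gt;"
    case "\"": result += "&quot;"
    case "'": result += "&apos;"
    default: result.append(character)
    }
  }
  return result
}

/// Wraps markup that is already valid SSML in the root <speak> element.
func speakDocument(_ body: String) -> String {
  "<speak>\(body)</speak>"
}

/// Builds a <sub> element that reads `alias` aloud in place of `text`.
func substitution(_ text: String, alias: String) -> String {
  "<sub alias=\"\(escapedForSSML(alias))\">\(escapedForSSML(text))</sub>"
}

let markup = speakDocument(substitution("日本橋", alias: "にっぽんばし"))
print(markup)
// prints: <speak><sub alias="にっぽんばし">日本橋</sub></speak>
```

Only the text and attribute values are escaped; the tags themselves are written by the helpers, so the result stays well-formed even when the input contains an ampersand or angle bracket.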

AVSpeechUtterance

For our purposes, all we need to know is that SSML is XML at its core, so an SSML document can be passed as a string to one of the AVSpeechUtterance initializers.

init?(ssmlRepresentation string: String)

Creates a speech utterance with an Speech Synthesis Markup Language (SSML) string

The initializer has been public API since iOS 16, but I suspect SSML support has long been part of Apple's TTS implementation (Nuance inner pipes?)
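
Outside of SwiftUI, using the initializer might look like the sketch below. The makeUtterance(fromSSML:) helper is a hypothetical name, and the availability annotations reflect my reading of the documentation (iOS 16 and macOS 13), so adjust them for your targets:

```swift
import AVFoundation

/// Hypothetical helper: returns an utterance for the given SSML,
/// or nil when the system cannot build one from the string.
@available(iOS 16.0, macOS 13.0, *)
func makeUtterance(fromSSML ssml: String) -> AVSpeechUtterance? {
  // The initializer is failable, so its optional result is passed straight through.
  AVSpeechUtterance(ssmlRepresentation: ssml)
}

// Hold a strong reference to the synthesizer so it is not deallocated mid-speech.
let synthesizer = AVSpeechSynthesizer()

if #available(iOS 16.0, macOS 13.0, *),
  let utterance = makeUtterance(fromSSML: "<speak>Hello <break time=\"500ms\"/> world</speak>")
{
  synthesizer.speak(utterance)
}
```

Because the initializer can fail, the call site has to decide what happens when no utterance comes back; this sketch simply stays silent.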

ViewModifier

SwiftUI's accessibility modifiers are simply view modifiers, which I hadn't thought about before. Embracing that allows us to overload accessibility labels, detect particular states, and run conditional logic. In practice, a custom modifier can naively hold a voice synthesizer and play an utterance constructed from SSML whenever the element receives accessibility focus:

import AVFoundation
import SwiftUI

extension View {
  public func accessibilityLabel(_ ssml: SSML) -> some View {
    modifier(AccessibilitySSMLLabel(ssml: ssml))
  }
}

struct AccessibilitySSMLLabel: ViewModifier {
  @AccessibilityFocusState var isFocused: Bool
  let ssml: SSML
  let synthesizer = AVSpeechSynthesizer()

  func body(content: Content) -> some View {
    content
      .accessibilityElement()
      .accessibilityFocused($isFocused)
      // The two-parameter onChange closure requires iOS 17 / macOS 14 or later.
      .onChange(of: isFocused) { _, newValue in
        if newValue,
          let utterance = AVSpeechUtterance(ssmlRepresentation: ssml.rawValue)
        {
          synthesizer.speak(utterance)
        } else {
          synthesizer.stopSpeaking(at: .word)
        }
      }
  }
}

public struct SSML {
  public let rawValue: String

  // Public so the type can be constructed from outside the defining module.
  public init(_ representation: String) {
    self.rawValue = representation
  }
}

That is it. This serves as a proof of concept. The only questions left are what degree of abstraction is suitable and whether the modifier should come in several varieties. For example, should there be one for SSML, as shown above, and another for pitch, voice, and other properties? Or should we expose a single modifier that accepts an AVSpeechUtterance directly?
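
As a rough sketch of the second option, a single modifier could take a closure that produces the utterance. The names accessibilityUtterance and AccessibilityUtteranceLabel are hypothetical. Using a closure, so that each focus event gets a fresh utterance, is my own assumption rather than something the proof of concept does:

```swift
import AVFoundation
import SwiftUI

// Hypothetical alternative to the SSML-specific modifier: accepts any utterance.
struct AccessibilityUtteranceLabel: ViewModifier {
  @AccessibilityFocusState private var isFocused: Bool
  // @State keeps one synthesizer alive across view updates.
  @State private var synthesizer = AVSpeechSynthesizer()
  // Called on every focus event so a new utterance is built each time.
  let makeUtterance: () -> AVSpeechUtterance?

  func body(content: Content) -> some View {
    content
      .accessibilityElement()
      .accessibilityFocused($isFocused)
      // The two-parameter onChange closure requires iOS 17 / macOS 14 or later.
      .onChange(of: isFocused) { _, focused in
        if focused, let utterance = makeUtterance() {
          synthesizer.speak(utterance)
        } else {
          synthesizer.stopSpeaking(at: .word)
        }
      }
  }
}

extension View {
  /// Hypothetical modifier name; speaks the produced utterance when focused.
  func accessibilityUtterance(
    _ makeUtterance: @escaping () -> AVSpeechUtterance?
  ) -> some View {
    modifier(AccessibilityUtteranceLabel(makeUtterance: makeUtterance))
  }
}

// Usage: plain-string and SSML utterances go through the same modifier.
let greeting = Text("Hello, world!")
  .accessibilityUtterance {
    let utterance = AVSpeechUtterance(string: "Hello, world!")
    utterance.pitchMultiplier = 1.2
    return utterance
  }
```

With this shape, an SSML overload becomes a thin wrapper that passes AVSpeechUtterance(ssmlRepresentation:) inside the closure, at the cost of exposing AVFoundation types at the call site.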

No matter which path we take, this demonstrates how adaptable SwiftUI is and how we can use it to create solutions that hide complexity (like all of these SSML aspects) while making the call site as simple as:

Text("Hello, world!")
  .accessibilityLabel(
    SSML(
      """
      <speak>
        <prosody pitch="high">
          <lang xml:lang="fr-FR">Bonjour!</lang>
        </prosody>
        After one second, I'm going to speak more slowly.
        <break time="1s"/>
        <prosody rate="x-slow">
          Slow speech using <say-as interpret-as="verbatim">SSML</say-as>...
        </prosody>
      </speak>
      """
    )
  )

GitHub - elkraneo/Accessibility-SSML-Label: Overload of .accessibilityLabel ViewModifier for SSML.

accessibilityLabel(_:) | Apple Developer Documentation
Adds a label to the view that describes its contents.
init(ssmlRepresentation:) | Apple Developer Documentation
Creates a speech utterance with an Speech Synthesis Markup Language (SSML) string.
Speech Synthesis Markup Language - Wikipedia
Speech Synthesis Markup Language (SSML) | Cloud Text-to-Speech API | Google Cloud
Speech Synthesis Markup Language (SSML) overview - Speech service - Azure AI services
Learn how to use the Speech Synthesis Markup Language to control pronunciation and prosody in text to speech.
What Is SSML? Use Cases, Examples & Best Practices For Using SSML
Curious what is SSML? Learn what Speech Synthesis Markup Language is, explore its applications and get tips for using it in your projects.