23 May 2024 5 min read SwiftUI

SwiftUI: .accessibilityLabel powered by SSML

Photo by Volodymyr Hryshchenko / Unsplash

It all began with the question, "How can SwiftUI make VoiceOver speak multiple languages?" That sent me down a rabbit hole to learn more about the relationship between attributed strings and voice synthesis. The approach worked, with some caveats, and it answered the question. Or so I thought. Afterwards, I couldn't stop thinking about the markup language I once had to use while working at a voice assistance startup, and I wondered whether it could be applied to accessibility in SwiftUI.

SSML

SSML is a 20-year-old standard that allows control over aspects of speech such as pronunciation, volume, pitch, and pace across most synthesis-capable platforms.

Speech Synthesis Markup Language Specification (SSML 1.0), introduced in September 2004, is one of the standards enabling access to the Web using spoken interaction.

Speech Synthesis Markup Language: An Introduction

Peter Mikhalenko

The complete specification is available from the W3C, and Google's Cloud Text-to-Speech documentation has some useful examples:

Basic syntax

<speak>
  my SSML content
</speak>

Adds emphasis to an announcement

<emphasis level="moderate">This is an important announcement</emphasis>

Read numbers as cardinals

<speak>
  <say-as interpret-as="cardinal">12345</say-as>
</speak>

Simplified pronunciation of a difficult-to-read word

<sub alias="にっぽんばし">日本橋</sub>

AVSpeechUtterance

For our purposes, we only need to know that SSML is, at its core, XML, and that it can be passed as a string to one of the AVSpeechUtterance initializers.

init?(ssmlRepresentation string: String)

Creates a speech utterance with a Speech Synthesis Markup Language (SSML) string
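
As a minimal sketch of that initializer on its own, outside SwiftUI: the helper name `makeUtterance(fromSSML:)` is hypothetical, and I am assuming a platform where `init?(ssmlRepresentation:)` is available (as far as I can tell, iOS 16 / macOS 13 or later). Because the initializer is failable, the result has to be unwrapped before speaking.

```swift
import AVFoundation

// Hypothetical helper: wraps the failable SSML initializer.
// Returns nil when AVFoundation cannot build an utterance from the string.
func makeUtterance(fromSSML ssml: String) -> AVSpeechUtterance? {
  AVSpeechUtterance(ssmlRepresentation: ssml)
}

// Keep the synthesizer alive for as long as speech should continue.
let synthesizer = AVSpeechSynthesizer()

let ssml = """
  <speak>
    <say-as interpret-as="cardinal">12345</say-as>
  </speak>
  """

if let utterance = makeUtterance(fromSSML: ssml) {
  synthesizer.speak(utterance)
} else {
  print("The SSML string was rejected")
}
```

Speech is asynchronous, so this is meant to run inside an app or a playground rather than a script that exits immediately.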

❓

I'm not sure when this was implemented, but I believe it has always been part of Apple's TTS implementation (Nuance inner pipes?)

ViewModifier

SwiftUI's accessibility modifiers are simply ViewModifiers, which I hadn't thought about before. Embracing this lets us overload accessibility labels, detect particular states, and run conditional logic. In practice, that means a custom modifier can (naively) hold a voice synthesizer and speak an SSML-constructed utterance whenever the element receives accessibility focus:

import AVFoundation
import SwiftUI

extension View {
  public func accessibilityLabel(_ ssml: SSML) -> some View {
    modifier(AccessibilitySSMLLabel(ssml: ssml))
  }
}

struct AccessibilitySSMLLabel: ViewModifier {
  @AccessibilityFocusState var isFocused: Bool
  let ssml: SSML
  // Stored as state so the same synthesizer survives view updates.
  @State var synthesizer = AVSpeechSynthesizer()

  func body(content: Content) -> some View {
    content
      .accessibilityElement()
      .accessibilityFocused($isFocused)
      .onChange(of: isFocused) { _, newValue in
        if newValue,
          let utterance = AVSpeechUtterance(ssmlRepresentation: ssml.rawValue)
        {
          synthesizer.speak(utterance)
        } else {
          synthesizer.stopSpeaking(at: .word)
        }
      }
  }
}

public struct SSML {
  public let rawValue: String

  public init(_ representation: String) {
    self.rawValue = representation
  }
}

That is it: this serves as a proof of concept. The only questions left are what degree of abstraction is suitable and which modifier variants to offer. For example, should there be one for SSML, as shown here, and another for pitch, voice, and other properties? Or should we expose a single modifier that accepts an AVSpeechUtterance directly?
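
To make the second option concrete, here is a rough sketch of a modifier that takes an utterance rather than SSML. The names `AccessibilityUtteranceLabel`, `accessibilityUtterance(_:)`, and `GreetingView` are hypothetical, not SwiftUI API. It accepts a closure instead of a stored utterance, on the assumption that building a fresh utterance per focus event is safer than re-enqueueing the same instance.

```swift
import AVFoundation
import SwiftUI

// Hypothetical modifier: speaks whatever utterance the closure builds
// each time the element gains accessibility focus.
struct AccessibilityUtteranceLabel: ViewModifier {
  @AccessibilityFocusState var isFocused: Bool
  @State var synthesizer = AVSpeechSynthesizer()
  let makeUtterance: () -> AVSpeechUtterance?

  func body(content: Content) -> some View {
    content
      .accessibilityElement()
      .accessibilityFocused($isFocused)
      .onChange(of: isFocused) { _, focused in
        if focused, let utterance = makeUtterance() {
          synthesizer.speak(utterance)
        } else {
          // Focus moved away (or no utterance was produced): stop talking.
          synthesizer.stopSpeaking(at: .word)
        }
      }
  }
}

extension View {
  func accessibilityUtterance(
    _ makeUtterance: @escaping () -> AVSpeechUtterance?
  ) -> some View {
    modifier(AccessibilityUtteranceLabel(makeUtterance: makeUtterance))
  }
}

// Usage: voice and pitch are set in code instead of markup.
struct GreetingView: View {
  var body: some View {
    Text("Hello, world!")
      .accessibilityUtterance {
        let utterance = AVSpeechUtterance(string: "Bonjour!")
        utterance.voice = AVSpeechSynthesisVoice(language: "fr-FR")
        utterance.pitchMultiplier = 1.2
        return utterance
      }
  }
}
```

The trade-off is that the call site now needs to know AVFoundation, whereas the SSML variant keeps everything in a single string.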

No matter which path we take, this demonstrates how adaptable SwiftUI is and how we can use it to create solutions that hide complexity (like all of these SSML aspects) while making the call site as simple as:

Text("Hello, world!")
  .accessibilityLabel(
    SSML(
      """
      <speak>
        <prosody pitch="high">
          <lang xml:lang="fr-FR">Bonjour!</lang>
        </prosody>
        After one second, I'm going to speak more slowly.
        <break time="1s"/>
        <prosody rate="x-slow">
          Slow speech using <say-as interpret-as="verbatim">SSML</say-as>...
        </prosody>
      </speak>
      """
    )
  )
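
One caveat with hand-written markup: because SSML is XML, any dynamic text interpolated into it needs escaping, or a stray `&` or `<` will make the string invalid. Below is a small sketch of how that could be hidden as well. Both `xmlEscaped(_:)` and `speak(_:inLanguage:)` are hypothetical helpers, not part of AVFoundation or SwiftUI.

```swift
// Hypothetical helper: escapes the characters that have special meaning
// in XML so that arbitrary text can be placed inside SSML markup.
func xmlEscaped(_ text: String) -> String {
  var result = ""
  for character in text {
    switch character {
    case "&": result += "&amp;"
    case "<": result += "&lt;"
    case ">": result += "&gt;"
    case "\"": result += "&quot;"
    case "'": result += "&apos;"
    default: result.append(character)
    }
  }
  return result
}

// Hypothetical helper: wraps text in a <lang> element inside a <speak> root.
func speak(_ text: String, inLanguage language: String) -> String {
  "<speak><lang xml:lang=\"\(xmlEscaped(language))\">\(xmlEscaped(text))</lang></speak>"
}

let markup = speak("Fish & chips", inLanguage: "en-GB")
print(markup)
// prints: <speak><lang xml:lang="en-GB">Fish &amp; chips</lang></speak>
```

The resulting string could then be passed along as `SSML(markup)` to the modifier above.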