Audio Generation: OpenAI

Text-to-Speech (TTS) is a powerful interface for Apple apps. Because users often travel with their iPhones and with AirPods in their ears, offering to read your app's content aloud, particularly long-form content, might be the deciding factor for some users choosing your product over comparable ones on the market.

It’s very easy to perform TTS with Preternatural.

Text-to-Speech (TTS) Model

OpenAI offers two Text-to-Speech (TTS) models at this time: Text-to-speech 1 (tts-1) and Text-to-speech 1 HD (tts-1-hd). The tts-1 model is optimized for speed and is well suited to real-time text-to-speech use cases. The tts-1-hd model is optimized for quality.

import OpenAI
 
// optimized for speed 
// ideal to use for real-time text to speech use cases
let tts_1: OpenAI.Model.Speech = .tts_1
// optimized for quality
let tts_1_hd: OpenAI.Model.Speech = .tts_1_hd

Input Text

This is the text that the model should turn into speech. One remarkable thing about these models as they improve is that we are no longer bound to a monotone, robotic voice. The generated voice is much closer to a human one in its emotion, and it can tell a story with intonation in the right places. It's very impressive. Here is the beginning of a story that could be used in a new type of storytelling app:

let textToRead = "In a quiet, unassuming village nestled deep in a lush, verdant valley, young Elara leads a simple life, dreaming of adventure beyond the horizon. Her village is filled with ancient folklore and tales of mystical relics, but none capture her imagination like the legend of the Enchanted Amulet—a powerful artifact said to grant its bearer the ability to control time."
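Long-form content can exceed what a single speech request accepts. The sketch below assumes a per-request input limit (OpenAI's API reference lists 4096 characters at the time of writing, so treat the default as an assumption to verify) and splits text into chunks that each fit within it, breaking after sentence-ending punctuation where possible. `splitIntoChunks` is a hypothetical helper, not part of the OpenAI package.

```swift
// Hypothetical helper: splits long text into chunks of at most `limit`
// characters, breaking after sentence-ending punctuation where possible.
// The default limit is an assumption; check the current API reference.
func splitIntoChunks(_ text: String, limit: Int = 4096) -> [String] {
    precondition(limit > 0, "limit must be positive")
    var chunks: [String] = []
    var current = ""
    var sentence = ""

    // Adds one sentence to the running chunk, closing the chunk when it is full.
    func add(_ piece: String) {
        var remaining = Substring(piece)
        while !remaining.isEmpty {
            if current.count + remaining.count <= limit {
                current.append(contentsOf: remaining)
                return
            } else if !current.isEmpty {
                // The sentence does not fit: close the current chunk and retry.
                chunks.append(current)
                current = ""
            } else {
                // A single sentence longer than the limit is cut hard.
                chunks.append(String(remaining.prefix(limit)))
                remaining = remaining.dropFirst(limit)
            }
        }
    }

    for character in text {
        sentence.append(character)
        if ".!?".contains(character) {
            add(sentence)
            sentence = ""
        }
    }
    add(sentence) // trailing text without closing punctuation
    if !current.isEmpty { chunks.append(current) }
    return chunks
}

// Joining the chunks reproduces the original text exactly.
let storyChunks = splitIntoChunks("One. Two! Three? Four.", limit: 10)
```

Each chunk could then be sent as its own request and the resulting audio played back in order. Whitespace between sentences is kept at the start of the following chunk, which should be harmless for speech generation.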

Voice Selection

OpenAI offers six voices through its API. Note that these differ from the voices offered in OpenAI's own app. You can listen to each voice in the Text-to-Speech guide on OpenAI's website. Because people are very sensitive to voices, it is a good idea to let customers select the voice they prefer.

let alloy: OpenAI.Speech.Voice = .alloy
let echo: OpenAI.Speech.Voice = .echo
let fable: OpenAI.Speech.Voice = .fable
let onyx: OpenAI.Speech.Voice = .onyx
let nova: OpenAI.Speech.Voice = .nova
let shimmer: OpenAI.Speech.Voice = .shimmer
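One way to let customers choose a voice is to keep an app-side list of options and store the selection as a string. The sketch below is only an illustration: `VoiceOption`, `displayName`, and `resolve(storedValue:fallback:)` are hypothetical names, and the raw values simply mirror the six voice names listed above.

```swift
// Hypothetical app-side type; the raw values mirror the six voice names above.
enum VoiceOption: String, CaseIterable {
    case alloy, echo, fable, onyx, nova, shimmer

    // Label to show in a settings picker, e.g. "Nova".
    var displayName: String {
        rawValue.prefix(1).uppercased() + rawValue.dropFirst()
    }

    // Returns the stored voice, or the fallback when the stored value
    // is missing or does not match any known voice.
    static func resolve(storedValue: String?, fallback: VoiceOption = .alloy) -> VoiceOption {
        storedValue.flatMap(VoiceOption.init(rawValue:)) ?? fallback
    }
}

let pickerLabels = VoiceOption.allCases.map(\.displayName)
let savedVoice = VoiceOption.resolve(storedValue: "nova")
let defaultVoice = VoiceOption.resolve(storedValue: nil)
```

The resolved option would then be mapped to the matching `OpenAI.Speech.Voice` case when making the request.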

Speed

The OpenAI API offers the ability to adjust the speed of the audio. This feature is particularly useful for language-learning apps, as users may need to listen to the text at a slower pace while learning a foreign language. Alternatively, for an app designed for student content review, the voice speed could be increased to enable more learning in less time.

The speed can be set to any value between 0.25 and 4.0, with 1.0 as the default.

let speed = 0.8
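If the speed comes from a user-facing control such as a slider, it may help to clamp it to the supported range before making the request. This is a minimal sketch; `supportedSpeedRange` and `clampedSpeed` are hypothetical names rather than part of the OpenAI package.

```swift
// Hypothetical helper, not part of the OpenAI package.
// Keeps a requested speed inside the 0.25...4.0 range described above.
let supportedSpeedRange: ClosedRange<Double> = 0.25...4.0

func clampedSpeed(_ requested: Double) -> Double {
    // Values outside the range are pulled back to the nearest bound.
    min(max(requested, supportedSpeedRange.lowerBound), supportedSpeedRange.upperBound)
}

let slowForLearners = clampedSpeed(0.1) // too slow, becomes 0.25
let fastForReview = clampedSpeed(6.0)   // too fast, becomes 4.0
let unchanged = clampedSpeed(0.8)       // already valid, stays 0.8
```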

Generate Audio

The final Text-to-Speech call will be as follows:

let textToRead = "In a quiet, unassuming village nestled deep in a lush, verdant valley, young Elara leads a simple life, dreaming of adventure beyond the horizon. Her village is filled with ancient folklore and tales of mystical relics, but none capture her imagination like the legend of the Enchanted Amulet—a powerful artifact said to grant its bearer the ability to control time."
 
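// `openAIClient` is assumed to be an OpenAI client that has already been set up with your API key.
let speech: OpenAI.Speech = try await openAIClient.createSpeech(
    model: OpenAI.Model.Speech.tts_1_hd,
    text: textToRead,
    voice: .alloy,
    speed: 0.8)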

The OpenAI text-to-speech models offer an array of features for creating a more personalized and engaging user experience. The choice of voices, the adjustable speed, and the more human-like intonation compared to earlier TTS models make them a powerful tool for developers.
