Audio Generation: ElevenLabs

ElevenLabs is a voice AI research and deployment company whose API generates speech in hundreds of new and existing voices across 29 languages. It also supports voice cloning: provide as little as 1 minute of audio and you can generate a new voice.

However, because voices can be created by users, some of them are low quality, and a voice may be deleted without notice if its maker decides to remove it. Test which voices work best for your app, and keep some backup voices ready in case one no longer exists.
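The backup-voice advice can be sketched as a small fallback helper. This is only an illustration: `pickVoice`, the backup IDs, and the `available` set are hypothetical names, and how you obtain the list of currently available voice IDs is left to your app.

```swift
import Foundation

/// Returns the first preferred voice ID that is still available.
/// `available` stands in for whatever list of voice IDs your app has fetched.
func pickVoice(preferred: [String], available: Set<String>) -> String? {
    preferred.first { available.contains($0) }
}

// The first ID is the one used later in this guide; the backups are placeholders.
let preferredVoices = ["4v7HtLWqY9rpQ7Cg2GT4", "backup-voice-1", "backup-voice-2"]

// Pretend the primary voice has been removed by its maker.
let currentlyAvailable: Set<String> = ["backup-voice-1", "backup-voice-2"]

let chosenVoice = pickVoice(preferred: preferredVoices, available: currentlyAvailable)
print(chosenVoice ?? "no voice available")
```

Ordering the list by preference keeps the choice deterministic, and a `nil` result gives the app a clear signal that every candidate voice is gone.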

It is simple to create speech through the ElevenLabs API using Preternatural. The sections below cover each part of the request in turn.

The ElevenLabs Client

Specifying the ElevenLabs client is as simple as adding your API key:

import ElevenLabs
 
let client = ElevenLabs(apiKey: "YOUR_API_KEY")

The API key is a bit tricky to find on the ElevenLabs website because it is hidden in your profile menu in the bottom right corner. Click on your picture and select “Profile + API key”.
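Once you have the key, you may prefer not to hard-code it in source. The following is a minimal sketch that reads it from an environment variable instead. The variable name `ELEVENLABS_API_KEY`, the `loadAPIKey` function, and the `APIKeyError` type are assumptions made for this example and are not defined by ElevenLabs or Preternatural.

```swift
import Foundation

enum APIKeyError: Error {
    case missing(variable: String)
}

/// Looks up the API key in the given environment dictionary.
/// Defaults to the process environment; the variable name is an assumed convention.
func loadAPIKey(
    from environment: [String: String] = ProcessInfo.processInfo.environment,
    variable: String = "ELEVENLABS_API_KEY"
) throws -> String {
    guard let key = environment[variable], !key.isEmpty else {
        throw APIKeyError.missing(variable: variable)
    }
    return key
}

// An explicit dictionary is passed here so the example does not depend on your shell.
let apiKey = try loadAPIKey(from: ["ELEVENLABS_API_KEY": "YOUR_API_KEY"])
print("Loaded a key with \(apiKey.count) characters")
```

The resulting string can then be passed to `ElevenLabs(apiKey:)` in place of the literal.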

Model Selection

ElevenLabs provides four models, described in more detail on their website:

Turbo v2 (model_id = eleven_turbo_v2): This model is optimized for low latency, making it ideal for real-time applications. Despite these optimizations, it maintains the quality found in the other ElevenLabs models. Although it is tailored for real-time and conversational applications, its versatility and stability make it worth testing for other applications as well.

Multilingual v2 (model_id = eleven_multilingual_v2): This model is a powerhouse, excelling in stability, language diversity, and accuracy in replicating accents and voices. Its speed and agility are remarkable considering its size. Multilingual v2 supports 28 languages.

English v1 (model_id = eleven_monolingual_v1): This model was created specifically for English and is the smallest and fastest model ElevenLabs offers.

Multilingual v1 (model_id = eleven_multilingual_v1): Multilingual v1 has remained an experimental model since its release, although it was used to improve the next iteration. It currently supports a variety of languages.

let multilingualV2: ElevenLabs.Model = .MultilingualV2
let turboV2: ElevenLabs.Model = .TurboV2 // English
let multilingualV1: ElevenLabs.Model = .MultilingualV1
let englishV1: ElevenLabs.Model = .EnglishV1
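To make the trade-offs above concrete, here is a small sketch that maps the four `model_id` values from the list to a simple selection rule. `ModelChoice` and `suggestModel` are hypothetical names that exist only in this example, and the rule is an illustrative reading of the descriptions above, not an official recommendation.

```swift
/// Hypothetical mirror of the four models, keyed by the `model_id` values listed above.
enum ModelChoice: String, CaseIterable {
    case turboV2 = "eleven_turbo_v2"
    case multilingualV2 = "eleven_multilingual_v2"
    case englishV1 = "eleven_monolingual_v1"
    case multilingualV1 = "eleven_multilingual_v1"
}

/// Illustrative rule of thumb: non-English text goes to Multilingual v2,
/// and English text goes to Turbo v2 only when low latency matters.
func suggestModel(isEnglishOnly: Bool, needsLowLatency: Bool) -> ModelChoice {
    if !isEnglishOnly { return .multilingualV2 }
    return needsLowLatency ? .turboV2 : .multilingualV2
}

let suggestion = suggestModel(isEnglishOnly: true, needsLowLatency: true)
print(suggestion.rawValue)
```

The older v1 models are listed in the enum for completeness, but the rule never selects them; adjust it if your testing favors a different model.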

Voice ID

Each voice on the ElevenLabs website has a unique Voice ID, which can also be hard to find at first. Click the “ID” button for a voice in your VoiceLab to copy it (note that you first have to add the voice from the Voice Library to your VoiceLab).

let voiceID = "4v7HtLWqY9rpQ7Cg2GT4"

Text

This is the text that you would like the voice to read:

let textToRead = "In a quiet, unassuming village nestled deep in a lush, verdant valley, young Elara leads a simple life, dreaming of adventure beyond the horizon. Her village is filled with ancient folklore and tales of mystical relics, but none capture her imagination like the legend of the Enchanted Amulet—a powerful artifact said to grant its bearer the ability to control time."

Stability

Increasing stability makes the voice more consistent between re-generations, but it can also make it sound a bit monotone. For longer text fragments, a lower value is recommended.

import ElevenLabs
 
// stability is a Double between 0 (more variable) and 1 (more stable)
let voiceSettings: ElevenLabs.VoiceSettings = .init(
    stability: 0.5,
    similarityBoost: nil,
    styleExaggeration: nil,
    speakerBoost: nil)
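The advice to lower stability for longer text could be expressed as a small helper. This is a rough sketch only: `suggestedStability` is a hypothetical function, and the character thresholds and returned values are assumptions chosen for illustration, not figures published by ElevenLabs.

```swift
/// Picks a stability value that decreases as the text gets longer.
/// Thresholds (in characters) and values are illustrative assumptions.
func suggestedStability(forTextLength length: Int) -> Double {
    switch length {
    case ..<200:
        return 0.6   // short prompts: favor consistency
    case ..<1000:
        return 0.5   // medium passages: the value used in this guide
    default:
        return 0.4   // long passages: allow more variation
    }
}

let sampleText = "In a quiet, unassuming village nestled deep in a lush, verdant valley."
print(suggestedStability(forTextLength: sampleText.count))
```

Treat the result as a starting point and tune it by listening to the output for your chosen voice.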

Similarity Boost

Increasing the Similarity Boost setting enhances the overall voice clarity and targets speaker similarity. However, very high values can cause artifacts, so it is recommended to adjust this setting to find the optimal value.

import ElevenLabs
 
// similarityBoost is a Double between 0 (low) and 1 (high)
let voiceSettings: ElevenLabs.VoiceSettings = .init(
    stability: 0.5,
    similarityBoost: 0.75,
    styleExaggeration: nil,
    speakerBoost: nil)

Style Exaggeration

High values are recommended if the style of the speech should be exaggerated compared to the selected voice. Higher values can lead to more instability in the generated speech. Setting this to 0 will greatly increase generation speed and is the default setting.

import ElevenLabs
 
// styleExaggeration is a Double between 0 (low) and 1 (high)
let voiceSettings: ElevenLabs.VoiceSettings = .init(
    stability: 0.5,
    similarityBoost: 0.75,
    styleExaggeration: 0,
    speakerBoost: nil)

Speaker Boost

Speaker Boost increases the similarity between the synthesized speech and the original voice, at the cost of some generation speed.

import ElevenLabs
 
let voiceSettings: ElevenLabs.VoiceSettings = .init(
    stability: 0.5,
    similarityBoost: 0.75,
    styleExaggeration: 0,
    speakerBoost: true)
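Since the three numeric settings are documented as values between 0 and 1, it may help to clamp user-supplied values before building the real settings object. The sketch below uses a hypothetical `VoiceSettingsDraft` type whose field names simply follow the initializer shown above; it is not part of the ElevenLabs package.

```swift
/// Hypothetical draft type; field names follow the initializer used in this guide.
struct VoiceSettingsDraft {
    var stability: Double
    var similarityBoost: Double
    var styleExaggeration: Double
    var speakerBoost: Bool

    /// Returns a copy with every numeric field clamped to the range 0...1.
    func clamped() -> VoiceSettingsDraft {
        func clamp(_ value: Double) -> Double { min(max(value, 0), 1) }
        return VoiceSettingsDraft(
            stability: clamp(stability),
            similarityBoost: clamp(similarityBoost),
            styleExaggeration: clamp(styleExaggeration),
            speakerBoost: speakerBoost
        )
    }
}

// Out-of-range input, for example from a slider or a config file.
let draft = VoiceSettingsDraft(
    stability: 1.4,
    similarityBoost: 0.75,
    styleExaggeration: -0.2,
    speakerBoost: true
).clamped()

print(draft.stability, draft.similarityBoost, draft.styleExaggeration)
```

The clamped fields can then be passed to `ElevenLabs.VoiceSettings` as in the snippet above.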

Generate Audio

The final Text-to-Speech call will be as follows:

import ElevenLabs
 
let client = ElevenLabs(apiKey: "YOUR_API_KEY")
let model: ElevenLabs.Model = .MultilingualV2
let voiceID = "4v7HtLWqY9rpQ7Cg2GT4"
 
let textToRead = "In a quiet, unassuming village nestled deep in a lush, verdant valley, young Elara leads a simple life, dreaming of adventure beyond the horizon. Her village is filled with ancient folklore and tales of mystical relics, but none capture her imagination like the legend of the Enchanted Amulet—a powerful artifact said to grant its bearer the ability to control time."
 
let voiceSettings: ElevenLabs.VoiceSettings = .init(
    stability: 0.5,
    similarityBoost: 0.75,
    styleExaggeration: 0,
    speakerBoost: true)
 
do {
    let speech = try await client.speech(
        for: textToRead,
        voiceID: voiceID,
        voiceSettings: voiceSettings,
        model: model
    )
    
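    // `speech` now holds the generated audio; play it or save it from here.
    // (Inside a function you would return it instead.)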
} catch {
    print(error)
}
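A common next step is to write the generated audio to disk. The following is a minimal sketch that assumes the result of the request is available as `Data`, which you should confirm against the package; `saveAudio`, the file name, and the `.mp3` extension are assumptions for this example, and placeholder bytes are used so the snippet runs without a network call.

```swift
import Foundation

/// Writes audio bytes to a file in the temporary directory and returns its URL.
func saveAudio(_ audio: Data, named fileName: String) throws -> URL {
    let url = FileManager.default.temporaryDirectory.appendingPathComponent(fileName)
    try audio.write(to: url, options: .atomic)
    return url
}

// Placeholder bytes stand in for the real audio returned by the request.
let placeholderAudio = Data([0x49, 0x44, 0x33])
let savedURL = try saveAudio(placeholderAudio, named: "speech-sample.mp3")
print("Saved audio to \(savedURL.path)")
```

Writing with `.atomic` means a failed write does not leave a partial file behind, which matters if the app later plays the file directly.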

The ElevenLabs API provides flexible and powerful options for text-to-speech conversion. Whether you're creating a simple voiceover or developing a more complex application, this framework provides the tools you need to generate realistic and high-quality speech. Remember to experiment with different settings to achieve the optimal results for your specific use case.

Audio Generation: OpenAI Text Embeddings