Audio Transcription: Whisper

Whisper, created and open-sourced by OpenAI, is an Automatic Speech Recognition (ASR) system trained on 680,000 hours of audio collected from the web. This makes Whisper notably robust at transcribing audio with background noise and varied accents compared to its predecessors. Another notable feature is its ability to transcribe audio with correct sentence punctuation.

One limitation of Whisper is that it was trained primarily on English audio (~70% of the training data). It does, however, support transcription in 98 other languages. If your app requires non-English transcription, make sure to test Whisper's performance in those languages first, as the error rate may be too high for practical use.

Since Whisper is open source, there are several options available for integrating Whisper into your app without using the OpenAI Whisper API (and having to pay for it), including using smaller on-device models. However, using the OpenAI Whisper API has the following advantages:

  • Extremely fast: The OpenAI Whisper API returns transcriptions significantly faster than other ways of running Whisper, including on-device models.
  • Larger Model: The OpenAI Whisper API provides access to the largest Whisper model, which reaches close to 99% accuracy on English audio. A model of this size is too big to run on device; if you decide not to use the OpenAI Whisper API, you would need to host it on a server and access it through your own API. The smaller Whisper models that can run on-device are considerably less accurate and are essentially unusable for non-English languages. While a 5% drop in transcription accuracy might not sound like much, in real daily use it quickly becomes too frustrating - try it out and see!
  • Model Upgrades: While the current Whisper model is open source, it is not clear whether future (better, more accurate) Whisper models will be. By using the OpenAI Whisper API, you are future-proofing your app for upgrades.

Integrating Whisper in your app with Preternatural requires only a few inputs:

The Audio File

Provide an audio file to transcribe in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

let audioFile = URL(fileURLWithPath: "AUDIO_FILE_PATH")
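For example, if the audio file ships inside your app bundle, you could resolve its URL like this (recording.m4a is a hypothetical file name used purely for illustration):

// A hypothetical audio file named "recording.m4a" bundled with the app
guard let audioFile = Bundle.main.url(forResource: "recording", withExtension: "m4a") else {
    fatalError("recording.m4a not found in the app bundle")
}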

Prompt (Optional)

Optional text to guide the model's style or to continue a previous audio segment. The prompt should match the audio language. Note that Whisper only considers the final 224 tokens of the prompt.

One great use-case for the Prompt option in the context of Audio Transcription is providing the correct spelling for topic-specific words (e.g. company acronyms). For example:

// The correct spelling of company-specific words in an earnings call
let prompt = "ZyntriQix, Digique Plus, CynapseFive, VortiQore V8, EchoNix Array, OrbitalLink Seven, DigiFractal Matrix, PULSE, RAPT, B.R.I.C.K., Q.U.A.R.T.Z., F.L.I.N.T."

You can read more about prompting the Whisper model on the OpenAI website.

Language (Optional)

The language of the input audio. Supplying the input language in ISO-639-1 format will improve accuracy and latency.

// English language
let language: LargeLanguageModels.ISO639LanguageCode = .en

Temperature (Optional)

The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

let temperature: Double = 0

Timestamp Granularities (Optional)

Timestamp Granularities are used to get timestamps throughout your transcription, roughly for every sentence. Consider a scenario where you want to include timestamps with a video transcription: when a user clicks on a timestamp, the video jumps to the corresponding time. Apple’s WWDC videos implement this especially well - see the transcript for Explore Natural Language multilingual models as an example.

When you click on any sentence in the transcript, the video will fast forward to exactly that timestamp!
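For example, a tap handler in your own player UI could seek to a segment's start time. Below is a minimal sketch using AVFoundation; the video path and the 93.5-second start time are placeholder values, not part of the transcription API:

import AVFoundation

// Jump an AVPlayer to a given transcription segment start time (in seconds)
let player = AVPlayer(url: URL(fileURLWithPath: "PATH_TO_VIDEO"))

func jump(to segmentStart: Double) {
    // CMTime needs a timescale; 600 is a common choice for video content
    player.seek(to: CMTime(seconds: segmentStart, preferredTimescale: 600))
}

jump(to: 93.5) // e.g. jump to 01:33 in the video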

There are two types of Timestamp Granularities - segment and word. Segment granularities give roughly sentence-by-sentence timestamps (as in Apple’s WWDC transcript example). Word granularities give a timestamp for each individual word. Note that there is no additional latency for segment timestamps, but generating word timestamps does incur additional latency.

To specify that you would like the timestamp granularities included:

// note that timestampGranularities is an array of granularities, so you can include both .segment and .word granularities, or just one of them
let timestampGranularities: [OpenAI.AudioTranscription.TimestampGranularity] = [.segment]

Generate Transcript

The final Speech-to-Text code will be as follows:

import OpenAI
 
let openAIClient = OpenAI.Client(apiKey: "YOUR_API_KEY")
let audioFile = URL(fileURLWithPath: "PATH_TO_FILE")
// correct spelling of company-specific keywords in an earnings call
let prompt = "ZyntriQix, Digique Plus, CynapseFive, VortiQore V8, EchoNix Array, OrbitalLink Seven, DigiFractal Matrix, PULSE, RAPT, B.R.I.C.K., Q.U.A.R.T.Z., F.L.I.N.T."
 
do {
    let transcription = try await openAIClient.createTranscription(
        audioFile: audioFile,
        prompt: prompt,
        language: .en,
        temperature: 0.0,
        timestampGranularities: [.segment]
    )
    
    let fullTranscription = transcription.text
    let segments = transcription.segments
} catch {
    print(error)
}
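Once the transcription returns, the segments can be turned into a clickable transcript. Below is a minimal sketch that assumes each segment exposes start (in seconds) and text properties, mirroring OpenAI's verbose transcription response; verify the exact property names and optionality against the package's segment type:

// Format a start time in seconds as mm:ss for display
func formattedTimestamp(_ seconds: Double) -> String {
    let total = Int(seconds)
    return String(format: "%02d:%02d", total / 60, total % 60)
}

// `segments` is the value obtained above; unwrap it first if the package exposes it as an optional
for segment in segments {
    // `segment.start` and `segment.text` are assumed property names
    print("[\(formattedTimestamp(segment.start))] \(segment.text)")
}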
