Vision: Image-to-Text
Large language models (LLMs) are rapidly expanding into multimodal capabilities: the ability to process inputs in multiple modes, such as text, images, and audio. Multimodal LLMs are increasingly referred to as Large Multimodal Models, or LMMs. OpenAI's GPT-4o (still in beta at the time of writing) is among the first models designed to output both textual and non-textual content together.
As Apple developers, we are well positioned to take advantage of these multimodal models, because we build applications for devices with built-in cameras and microphones that people use every day. With the vision capabilities of LLMs, it is easier than ever to process the images a user captures in novel ways.
For instance, we can instruct the LLM to identify the items in a specific photo and compose a poem about each one.
// `imageInput` (the captured image) and `client` (the LLM client) are assumed to be defined earlier.
let systemPrompt: PromptLiteral = "You are a VisionExpertGPT. You will receive an image. Your job is to list all the items in the image and write a one-sentence poem about each item. Make sure your poems are creative, capturing the essence of each item in an evocative and imaginative way."
let userPrompt: PromptLiteral = "List the items in this image and write a short one-sentence poem about each item. Only reply with the items and poems. NOTHING MORE."

// Wrap the image so it can be embedded in a prompt.
let imageLiteral = try PromptLiteral(image: imageInput)
let model = OpenAI.Model.gpt_4o

// The user message combines the text instruction and the image.
let messages: [AbstractLLM.ChatMessage] = [
    .system(systemPrompt),
    .user {
        .concatenate(separator: nil) {
            userPrompt
            imageLiteral
        }
    }
]

// Send the messages and read the reply back as a plain string.
let result: String = try await client.complete(
    messages,
    model: model,
    as: .string
)

return result
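The call above returns the whole reply as a single string. To show each item and its poem separately in the UI, the reply has to be split first. The following is a minimal sketch of one way to do that, not part of the library: `ItemPoem` and `parseItemPoems(from:)` are hypothetical names, and the code assumes the model answers with one `Item: poem` pair per line. The prompt above does not guarantee that layout, so lines that do not match are skipped.

```swift
import Foundation

// Hypothetical container for one parsed line of the model's reply.
struct ItemPoem: Equatable {
    let item: String
    let poem: String
}

// Assumption: each meaningful line looks like "Item: poem", optionally
// preceded by a bullet ("-", "*", or "•"). Lines without a colon, or with
// an empty item or poem, are skipped rather than guessed at.
func parseItemPoems(from reply: String) -> [ItemPoem] {
    let bulletsAndSpaces = CharacterSet(charactersIn: "-*•").union(.whitespaces)
    var pairs: [ItemPoem] = []

    for rawLine in reply.components(separatedBy: .newlines) {
        // Trim bullet characters and whitespace from both ends of the line.
        let line = rawLine.trimmingCharacters(in: bulletsAndSpaces)

        // Split on the first colon only, so colons inside the poem survive.
        guard let colon = line.firstIndex(of: ":") else { continue }
        let item = line[..<colon].trimmingCharacters(in: .whitespaces)
        let poem = line[line.index(after: colon)...].trimmingCharacters(in: .whitespaces)

        guard !item.isEmpty, !poem.isEmpty else { continue }
        pairs.append(ItemPoem(item: item, poem: poem))
    }
    return pairs
}
```

Calling `parseItemPoems(from: result)` would then give an array that a list view can iterate over. If a predictable layout matters, it may be more reliable to state the desired format explicitly in the user prompt.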
As we continue to explore the capabilities of multimodal AI models, we are opening up a world of possibilities for enhancing user experiences and creating brand new ones.