Function Calling + Vision

As LLMs become more multimodal, able to understand photos, speech, and (in the near future) video, it is worth calling out the combination of Function Calling with the model's Vision functionality. In other words, we can now get very specific structured data about what is in any image. This creates never-before-possible opportunities for Apple developers, as we can get direct access to the user's stored photos and camera.

To demonstrate how powerful Function Calling + Vision can be, we will use the example of a screenshot-organizing app. The PhotoKit API already provides a way to identify only the photos that are screenshots, so simply getting the user's screenshots into an app is straightforward to accomplish.
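
For reference, a minimal sketch of fetching only the screenshot assets with PhotoKit might look like this (assuming the app has already been granted photo library authorization):

import Photos

// Fetch only the assets whose media subtype marks them as screenshots.
func fetchScreenshotAssets() -> PHFetchResult<PHAsset> {
    let options = PHFetchOptions()
    options.predicate = NSPredicate(
        format: "(mediaSubtypes & %d) != 0",
        PHAssetMediaSubtype.photoScreenshot.rawValue
    )
    options.sortDescriptors = [NSSortDescriptor(key: "creationDate", ascending: false)]
    return PHAsset.fetchAssets(with: .image, options: options)
}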

But now, with the power of LLMs, we can easily organize the screenshots by category, provide a summary for each one, and add search functionality across all screenshots. In the future, we could go further, for example by extracting any text or links included in a screenshot to make it easily actionable, or even pulling out specific elements of the screenshot.
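
Once each screenshot's LLM-generated metadata is stored locally, search can be as simple as string matching over those fields. A minimal sketch (using the AddScreenshotFunctionParameters type introduced below):

import Foundation

// Naive keyword search over the LLM-generated screenshot metadata.
// AddScreenshotFunctionParameters is the Codable type defined later in this guide.
func searchScreenshots(
    _ analyses: [AddScreenshotFunctionParameters],
    matching query: String
) -> [AddScreenshotFunctionParameters] {
    let query = query.lowercased()
    
    return analyses.filter { analysis in
        [analysis.title, analysis.summary, analysis.description, analysis.category]
            .contains { $0.lowercased().contains(query) }
    }
}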

To get started, imagine a function that we could have in our app:

func addScreenshotAnalysisToDB(
    with title: String,
    summary: String,
    description: String,
    category: String
) {
    // This function does not exist in our app, but we pretend that it does for
    // the purpose of using function calling to get a JSON response of the
    // function parameters.
}

The app will provide screenshots to the LLM for Vision analysis, and the LLM will return the addScreenshotAnalysisToDB parameters, structured as follows:

struct AddScreenshotFunctionParameters: Codable, Hashable, Sendable {
    let title: String
    let summary: String
    let description: String
    let category: String
}

We will now use the AddScreenshotFunctionParameters type to define our JSON schema. In this case, we expect the LLM to return all of the parameter values, so the required property is set to true.

import CorePersistence
 
do {
    let screenshotFunctionParameterSchema: JSONSchema = try JSONSchema(
        type: AddScreenshotFunctionParameters.self,
        description: "Detailed information about a mobile screenshot for organizational purposes.",
        propertyDescriptions: [
            "title": "A concise title (3-5 words) that accurately represents the content of the screenshot.",
            "summary": "A brief, one-sentence summary providing an overview of what is depicted in the screenshot.",
            "description": "A comprehensive description that elaborates on the contents of the screenshot. Include key details and keywords to facilitate easy searching.",
            "category": "A single-word tag or category that best describes the screenshot. Examples include: 'music', 'art', 'movie', 'fashion', etc. Avoid using 'app' as a category since all items are app-related."
        ],
        required: true
    )
} catch {
    print(error)
}

Next, we set up the function call's properties, using the screenshotFunctionParameterSchema object to describe the value we expect back from the LLM:

let screenshotAnalysisProperties: [String: JSONSchema] = ["screenshot_analysis_parameters": screenshotFunctionParameterSchema]

We also define the Decodable result type keyed by the same screenshot_analysis_parameters name (note that this is converted automatically to camel case for Swift):

struct ScreenshotAnalysisResult: Codable, Hashable, Sendable {
    let screenshotAnalysisParameters: AddScreenshotFunctionParameters
}
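
That key mapping is handled for you when the function call result is decoded; conceptually it is the same snake_case-to-camelCase conversion that Foundation's JSONDecoder offers. The following is purely an illustration of the mapping with placeholder values, not something you need to write yourself:

import Foundation

// Illustration only: the "screenshot_analysis_parameters" key maps onto the
// camel-cased screenshotAnalysisParameters property.
let exampleJSON = Data("""
{
    "screenshot_analysis_parameters": {
        "title": "Example Title",
        "summary": "Example summary.",
        "description": "Example description.",
        "category": "example"
    }
}
""".utf8)

let decoder = JSONDecoder()
decoder.keyDecodingStrategy = .convertFromSnakeCase

let decoded = try decoder.decode(ScreenshotAnalysisResult.self, from: exampleJSON)
print(decoded.screenshotAnalysisParameters.title) // "Example Title"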

Now we are ready to specify the function call:

let addScreenshotAnalysisFunction = AbstractLLM.ChatFunctionDefinition(
    name: "add_screenshot_analysis_to_db",
    context: "Adds analysis of a mobile screenshot to the database",
    parameters: JSONSchema(
        type: .object,
        description: "Screenshot Analysis",
        properties: screenshotAnalysisProperties
    )
)

Our messages for the LLM to complete will include the system prompt, user prompt, and the screenshot image:

let systemPrompt: PromptLiteral = """
You are an AI trained to analyze mobile screenshots and provide detailed information about them. Your task is to examine a given screenshot and generate the following details:
 
* Title: Create a concise title (3-5 words) that accurately represents the content of the screenshot.
 
* Summary: Write a brief, one-sentence summary providing an overview of what is depicted in the screenshot.
 
* Description: Compose a comprehensive description that elaborates on the contents of the screenshot. Include key details and keywords to facilitate easy searching.
 
* Category: Assign a single-word tag or category that best describes the screenshot. Examples include 'music', 'art', 'movie', 'fashion', etc. Avoid using 'app' as a category since all items are app-related.
 
Make sure your responses are clear, specific, and relevant to the content of the screenshot.
"""
 
let userPrompt: PromptLiteral = "Please analyze the attached screenshot and provide the following details: (1) a concise title (3-5 words) that describes the screenshot, (2) a brief one-sentence summary of the screenshot content, (3) a detailed description including key details and keywords for easy searching, and (4) a single-word category that best describes the screenshot (e.g., music, art, movie, fashion)."
 
 
// Load the screenshot image from the app's asset catalog.
guard let screenshotImage = AppKitOrUIKitImage(named: "screenshot") else { return }
let screenshotImageLiteral = try PromptLiteral(image: screenshotImage)
 
let messages: [AbstractLLM.ChatMessage] = [
    .system(systemPrompt),
    .user {
        .concatenate(separator: nil) {
            userPrompt
            screenshotImageLiteral
        }
    }]

Finally, the function can be called:

import OpenAI
 
let client = OpenAI.Client(apiKey: "YOUR_API_KEY")
 
let functionCall: AbstractLLM.ChatFunctionCall = try await client.complete(
    messages,
    functions: [addScreenshotAnalysisFunction],
    as: .functionCall
)
 
let result = try functionCall.decode(ScreenshotAnalysisResult.self)
print(result.screenshotAnalysisParameters)
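
With the parameters decoded, the final step is simply to hand the values to the app's own function (the addScreenshotAnalysisToDB placeholder from earlier):

// Feed the LLM-extracted values into the app's own storage function.
let parameters = result.screenshotAnalysisParameters

addScreenshotAnalysisToDB(
    with: parameters.title,
    summary: parameters.summary,
    description: parameters.description,
    category: parameters.category
)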

The Full Function Call

Putting it all together, the final function completion request will be as follows:

import OpenAI
import CorePersistence
 
let client = OpenAI.Client(apiKey: "YOUR_API_KEY")
 
let systemPrompt: PromptLiteral = """
You are an AI trained to analyze mobile screenshots and provide detailed information about them. Your task is to examine a given screenshot and generate the following details:
 
Title: Create a concise title (3-5 words) that accurately represents the content of the screenshot.
Summary: Write a brief, one-sentence summary providing an overview of what is depicted in the screenshot.
Description: Compose a comprehensive description that elaborates on the contents of the screenshot. Include key details and keywords to facilitate easy searching.
Category: Assign a single-word tag or category that best describes the screenshot. Examples include 'music', 'art', 'movie', 'fashion', etc. Avoid using 'app' as a category since all items are app-related.
 
Make sure your responses are clear, specific, and relevant to the content of the screenshot.
"""
 
let userPrompt: PromptLiteral = "Please analyze the attached screenshot and provide the following details: (1) a concise title (3-5 words) that describes the screenshot, (2) a brief one-sentence summary of the screenshot content, (3) a detailed description including key details and keywords for easy searching, and (4) a single-word category that best describes the screenshot (e.g., music, art, movie, fashion)."
 
// Load the screenshot image from the app's asset catalog.
guard let screenshotImage = AppKitOrUIKitImage(named: "screenshot") else { return }
let screenshotImageLiteral = try PromptLiteral(image: screenshotImage)
 
let messages: [AbstractLLM.ChatMessage] = [
    .system(systemPrompt),
    .user {
        .concatenate(separator: nil) {
            userPrompt
            screenshotImageLiteral
        }
    }]
 
struct AddScreenshotFunctionParameters: Codable, Hashable, Sendable {
    let title: String
    let summary: String
    let description: String
    let category: String
}
 
do {
    let screenshotFunctionParameterSchema: JSONSchema = try JSONSchema(
        type: AddScreenshotFunctionParameters.self,
        description: "Detailed information about a mobile screenshot for organizational purposes.",
        propertyDescriptions: [
            "title": "A concise title (3-5 words) that accurately represents the content of the screenshot.",
            "summary": "A brief, one-sentence summary providing an overview of what is depicted in the screenshot.",
            "description": "A comprehensive description that elaborates on the contents of the screenshot. Include key details and keywords to facilitate easy searching.",
            "category": "A single-word tag or category that best describes the screenshot. Examples include: 'music', 'art', 'movie', 'fashion', etc. Avoid using 'app' as a category since all items are app-related."
        ],
        required: true
    )
    
    let screenshotAnalysisProperties: [String: JSONSchema] = ["screenshot_analysis_parameters": screenshotFunctionParameterSchema]
    
    let addScreenshotAnalysisFunction = AbstractLLM.ChatFunctionDefinition(
        name: "add_screenshot_analysis_to_db",
        context: "Adds analysis of a mobile screenshot to the database",
        parameters: JSONSchema(
            type: .object,
            description: "Screenshot Analysis",
            properties: screenshotAnalysisProperties
        )
    )
 
    let functionCall: AbstractLLM.ChatFunctionCall = try await client.complete(
        messages,
        functions: [addScreenshotAnalysisFunction],
        as: .functionCall
    )
    
    struct ScreenshotAnalysisResult: Codable, Hashable, Sendable {
        let screenshotAnalysisParameters: AddScreenshotFunctionParameters
    }
 
    let result = try functionCall.decode(ScreenshotAnalysisResult.self)
    print(result.screenshotAnalysisParameters)
    
} catch {
    print(error)
}

So, for example, when the screenshot is the Preternatural mobile website:

[Screenshot: the Preternatural mobile website]

The result will look something like this:

AddScreenshotFunctionParameters(
	title: "AI Infrastructure Webpage", 
	summary: "The screenshot displays a webpage promoting AI infrastructure technology for Swift developers.", 
	description: "The screenshot features a webpage from \"preternatural.ai\" presenting a technology solution for Swift developers, described as \"Exhaustive client-side AI infrastructure\". The page mentions that the technology includes \"Server-grade pipelines, client-side inference\" and is backed by Y Combinator. There are options to \"Sign in\" or \"Sign up\" displayed, with an email input field provided for new users. The user interface elements are straightforward, with a dark theme and prominently displayed text in white.", 
	category: "Technology")

Combining function calling with vision opens up an array of possibilities for developers. It allows us to extract structured data from images, enhancing the functionality of applications. For instance, it can transform how we organize and interact with screenshots, providing detailed categorization and search capabilities. The integration of these technologies sets the stage for more interactive and personalized user experiences, pushing the boundaries of what's possible in application development.
