Introducing GPT-4o

GPT-4o ("o" for "omni") is designed to handle text, audio, and video inputs, and can generate text, audio, and image outputs.

Background

Before GPT-4o, users could interact with ChatGPT using Voice Mode, which relied on three separate models. GPT-4o consolidates these capabilities into a single model trained across text, vision, and audio. This unified approach ensures that every input, whether text, visual, or auditory, is processed consistently by the same neural network.

Current API Capabilities

Currently, the API supports {text, image} inputs and {text} outputs, the same modalities as gpt-4-turbo. Additional modalities, including audio, will be introduced soon. This guide will help you get started with using GPT-4o for text, image, and video understanding.

Getting Started

Install the OpenAI SDK for Python

%pip install --upgrade openai --quiet

Configure the OpenAI client and submit a test request

To set up the client for our use, we need to create an API key to use with our requests. Skip these steps if you already have an API key.

You can get an API key by following these steps:

  1. Create a new project
  2. Generate an API key in your project
  3. (Recommended, but not required) Set your API key as an environment variable for all projects
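If you do set the key as an environment variable, a typical approach is a single line in your shell profile (the key value below is a placeholder, not a real key):

```shell
# Add to ~/.zshrc or ~/.bashrc, then restart your shell
export OPENAI_API_KEY="<your OpenAI API key>"
```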

Once we have this set up, let's start with a simple {text} input to the model for our first request. We'll use system and user messages for our first request, and receive a response from the assistant role.

from openai import OpenAI 
import os

## Set the API key and model name
MODEL="gpt-4o"
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as an env var>"))

completion = client.chat.completions.create(
  model=MODEL,
  messages=[
    {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"},
    {"role": "user", "content": "Hello! Could you solve 2+2?"}
  ]
)

print("Assistant: " + completion.choices[0].message.content)

Output:

Assistant: Of course! 

\[ 2 + 2 = 4 \]

If you have any other questions, feel free to ask!

Image Processing

GPT-4o can directly process images and take intelligent actions based on the image. We can provide images in two formats:

  1. Base64 encoded
  2. URL

Let's first view the image we'll be using, then try sending this image to the API both as Base64 and as a URL link.

Base64 Image Processing

from IPython.display import Image, display, Audio, Markdown
import base64

IMAGE_PATH = "data/triangle.png"

# Preview image for context
display(Image(IMAGE_PATH))

# Open the image file and encode it as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image(IMAGE_PATH)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"
            }}
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)

Output:

To find the area of the triangle, we can use Heron's formula. First, we need to find the semi-perimeter of the triangle.

The sides of the triangle are 6, 5, and 9.

1. Calculate the semi-perimeter \( s \):
\[ s = \frac{a + b + c}{2} = \frac{6 + 5 + 9}{2} = 10 \]

2. Use Heron's formula to find the area \( A \):
\[ A = \sqrt{s(s-a)(s-b)(s-c)} \]

Substitute the values:
\[ A = \sqrt{10(10-6)(10-5)(10-9)} \]
\[ A = \sqrt{10 \cdot 4 \cdot 5 \cdot 1} \]
\[ A = \sqrt{200} \]
\[ A = 10\sqrt{2} \]

So, the area of the triangle is \( 10\sqrt{2} \) square units.
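As a quick sanity check on the model's arithmetic, Heron's formula can be evaluated directly in Python for the stated side lengths of 6, 5, and 9:

```python
import math

# Side lengths read from the triangle image
a, b, c = 6, 5, 9

# Semi-perimeter
s = (a + b + c) / 2  # 10.0

# Heron's formula
area = math.sqrt(s * (s - a) * (s - b) * (s - c))

print(area)               # 14.142135623730951
print(10 * math.sqrt(2))  # 14.142135623730951, i.e. 10*sqrt(2)
```

This confirms the model's result of \( 10\sqrt{2} \approx 14.14 \) square units.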

URL Image Processing

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": "https://upload.wikimedia.org/wikipedia/commons/e/e2/The_Algebra_of_Mohammed_Ben_Musa_-_page_82b.png"
            }}
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)

Output:

To find the area of the triangle, we can use Heron's formula. Heron's formula states that the area of a triangle with sides of length \(a\), \(b\), and \(c\) is:

\[ \text{Area} = \sqrt{s(s-a)(s-b)(s-c)} \]

where \(s\) is the semi-perimeter of the triangle:

\[ s = \frac{a + b + c}{2} \]

For the given triangle, the side lengths are \(a = 5\), \(b = 6\), and \(c = 9\). 

First, calculate the semi-perimeter \(s\):

\[ s = \frac{5 + 6 + 9}{2} = \frac{20}{2} = 10 \]

Now, apply Heron's formula:

\[ \text{Area} = \sqrt{10(10-5)(10-6)(10-9)} \]
\[ \text{Area} = \sqrt{10 \cdot 5 \cdot 4 \cdot 1} \]
\[ \text{Area} = \sqrt{200} \]
\[ \text{Area} = 10\sqrt{2} \]

So, the area of the triangle is \(10\sqrt{2}\) square units.

Video Processing

While it isn't possible to send a video directly to the API, GPT-4o can understand videos if you sample frames and then provide them as images. It performs better at this task than GPT-4 Turbo.

Since the GPT-4o API does not yet support audio input (as of May 2024), we'll use a combination of GPT-4o and Whisper to process both the audio and visual portions of a video, and showcase two use cases:

  1. Summarization
  2. Question and answering

Setup for Video Processing

We'll use two Python packages for video processing: opencv-python and moviepy.

These require ffmpeg, so make sure to install it first. Depending on your OS, you may need to run brew install ffmpeg or sudo apt install ffmpeg.

%pip install opencv-python --quiet
%pip install moviepy --quiet

Process the video into two components: frames and audio

import cv2
import os  # used below for os.path.splitext
from moviepy.editor import VideoFileClip
import time
import base64

# We'll be using the OpenAI DevDay Keynote Recap video. You can review the video here: https://www.youtube.com/watch?v=h02ti0Bl6zk
VIDEO_PATH = "data/keynote_recap.mp4"

def process_video(video_path, seconds_per_frame=2):
    base64Frames = []
    base_video_path, _ = os.path.splitext(video_path)

    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = video.get(cv2.CAP_PROP_FPS)
    frames_to_skip = int(fps * seconds_per_frame)
    curr_frame=0

    # Loop through the video and extract frames at specified sampling rate
    while curr_frame < total_frames - 1:
        video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
        curr_frame += frames_to_skip
    video.release()

    # Extract audio from video
    audio_path = f"{base_video_path}.mp3"
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path, bitrate="32k")
    clip.audio.close()
    clip.close()

    print(f"Extracted {len(base64Frames)} frames")
    print(f"Extracted audio to {audio_path}")
    return base64Frames, audio_path

# Extract 1 frame per second. You can adjust the `seconds_per_frame` parameter to change the sampling rate
base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=1)

Output:

MoviePy - Writing audio in data/keynote_recap.mp3
MoviePy - Done.
Extracted 218 frames
Extracted audio to data/keynote_recap.mp3

Display the frames and audio for context

## Display the frames and audio for context
display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8")), width=600))
    time.sleep(0.025)

Audio(audio_path)
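Once the frames are extracted, they can be passed to GPT-4o as a list of image entries in a single user message, in the same format as the image examples above. A minimal sketch of assembling that request (the helper name build_video_messages is ours, not part of the API):

```python
# Build a chat request from sampled video frames. The message format
# mirrors the Base64 image example earlier in this guide; only the
# helper function itself is our own construction.
def build_video_messages(base64_frames, question):
    return [
        {"role": "system", "content": "Use the provided video frames to answer the user's question."},
        {"role": "user", "content": [
            {"type": "text", "text": question},
            # One image_url entry per sampled frame, Base64-encoded as JPEG
            *[
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{frame}"}}
                for frame in base64_frames
            ],
        ]},
    ]

# Usage with the frames extracted above:
# messages = build_video_messages(base64Frames, "Summarize this video.")
# response = client.chat.completions.create(model=MODEL, messages=messages)
```

For long videos, you may want to send only a subset of the frames (e.g. every Nth entry of base64Frames) to stay within the model's context window.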