Introduction to GPT-4o
GPT-4o ("o" for "omni") is designed to handle text, audio, and video inputs, and can generate text, audio, and image outputs.
Background
Before GPT-4o, users could interact with ChatGPT using Voice Mode, which was powered by three separate models. GPT-4o consolidates those capabilities into a single model trained across text, vision, and audio. This unified approach ensures that all inputs, whether textual, visual, or auditory, are processed consistently by the same neural network.
Current API capabilities
Currently, the API supports {text, image} inputs and {text} outputs, the same modalities as gpt-4-turbo. Additional modalities, including audio, will be introduced soon. This guide will help you get started using GPT-4o for text, image, and video understanding.
Getting started
Install the OpenAI SDK for Python
%pip install --upgrade openai --quiet
Configure the OpenAI client and submit a test request
To set up the client, we first need an API key to use with our requests. Skip these steps if you already have an API key.
You can get an API key by following these steps:
- Create a new project
- Generate an API key in your project
- (Recommended, but not required) Set your API key as an environment variable for all projects (see the quick check below)
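If you go the environment-variable route, an optional check like the one below (a minimal sketch; it assumes only the standard OPENAI_API_KEY variable name, which the SDK reads by default) confirms the key is visible to your session before you create the client.
import os

# Optional sanity check: the OpenAI SDK reads OPENAI_API_KEY from the
# environment by default, so confirm it is visible before creating the client.
if not os.environ.get("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is not set; either export it or pass api_key explicitly below.")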
Once that is set up, let's start with a simple {text} input to the model for our first request. We'll use both system and user messages, and we'll receive a response from the assistant role.
from openai import OpenAI
import os
## Set the API key and model name
MODEL="gpt-4o"
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as an env var>"))
completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"},
        {"role": "user", "content": "Hello! Could you solve 2+2?"}
    ]
)
print("Assistant: " + completion.choices[0].message.content)
Output:
Assistant: Of course!
\[ 2 + 2 = 4 \]
If you have any other questions, feel free to ask!
Image processing
GPT-4o can process images directly and take intelligent actions based on what it sees. We can provide images in two formats:
- Base64 encoded
- URL
Let's first view the image we'll use, then try sending it to the API both as a Base64-encoded string and as a URL.
Base64 image processing
from IPython.display import Image, display, Audio, Markdown
import base64
IMAGE_PATH = "data/triangle.png"
# Preview image for context
display(Image(IMAGE_PATH))
# Open the image file and encode it as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
base64_image = encode_image(IMAGE_PATH)
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"
            }}
        ]}
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
Output:
To find the area of the triangle, we can use Heron's formula. First, we need to find the semi-perimeter of the triangle.
The sides of the triangle are 6, 5, and 9.
1. Calculate the semi-perimeter \( s \):
\[ s = \frac{a + b + c}{2} = \frac{6 + 5 + 9}{2} = 10 \]
2. Use Heron's formula to find the area \( A \):
\[ A = \sqrt{s(s-a)(s-b)(s-c)} \]
Substitute the values:
\[ A = \sqrt{10(10-6)(10-5)(10-9)} \]
\[ A = \sqrt{10 \cdot 4 \cdot 5 \cdot 1} \]
\[ A = \sqrt{200} \]
\[ A = 10\sqrt{2} \]
So, the area of the triangle is \( 10\sqrt{2} \) square units.
URL image processing
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": "https://upload.wikimedia.org/wikipedia/commons/e/e2/The_Algebra_of_Mohammed_Ben_Musa_-_page_82b.png"
            }}
        ]}
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
Output:
To find the area of the triangle, we can use Heron's formula. Heron's formula states that the area of a triangle with sides of length \(a\), \(b\), and \(c\) is:
\[ \text{Area} = \sqrt{s(s-a)(s-b)(s-c)} \]
where \(s\) is the semi-perimeter of the triangle:
\[ s = \frac{a + b + c}{2} \]
For the given triangle, the side lengths are \(a = 5\), \(b = 6\), and \(c = 9\).
First, calculate the semi-perimeter \(s\):
\[ s = \frac{5 + 6 + 9}{2} = \frac{20}{2} = 10 \]
Now, apply Heron's formula:
\[ \text{Area} = \sqrt{10(10-5)(10-6)(10-9)} \]
\[ \text{Area} = \sqrt{10 \cdot 5 \cdot 4 \cdot 1} \]
\[ \text{Area} = \sqrt{200} \]
\[ \text{Area} = 10\sqrt{2} \]
So, the area of the triangle is \(10\sqrt{2}\) square units.
Video processing
While it isn't possible to send a video to the API directly, GPT-4o can understand videos if you sample frames and provide them as images. It performs better at this task than GPT-4 Turbo.
Since audio input for GPT-4o isn't yet available in the API (as of May 2024), we'll use a combination of GPT-4o and Whisper to process the audio and visual portions of the video, and showcase two use cases:
- Summarization
- Question and answering
Setup for video processing
We'll use two Python packages for video processing: opencv-python and moviepy.
Both require ffmpeg, so make sure to install it first. Depending on your OS, you may need to run brew install ffmpeg or sudo apt install ffmpeg.
%pip install opencv-python --quiet
%pip install moviepy --quiet
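Before processing the video, an optional sanity check (a minimal sketch using only the standard library) can confirm that the ffmpeg binary is discoverable on your PATH:
import shutil

# ffmpeg must be on the PATH for the frame and audio extraction below to work.
if shutil.which("ffmpeg") is None:
    print("ffmpeg not found - install it first (e.g. brew install ffmpeg or sudo apt install ffmpeg)")
else:
    print("ffmpeg found at:", shutil.which("ffmpeg"))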
Process the video into two components: frames and audio
import cv2
from moviepy.editor import VideoFileClip
import time
import base64
# We'll be using the OpenAI DevDay Keynote Recap video. You can review the video here: https://www.youtube.com/watch?v=h02ti0Bl6zk
VIDEO_PATH = "data/keynote_recap.mp4"
def process_video(video_path, seconds_per_frame=2):
    base64Frames = []
    base_video_path, _ = os.path.splitext(video_path)

    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = video.get(cv2.CAP_PROP_FPS)
    frames_to_skip = int(fps * seconds_per_frame)
    curr_frame = 0

    # Loop through the video and extract frames at specified sampling rate
    while curr_frame < total_frames - 1:
        video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
        curr_frame += frames_to_skip
    video.release()

    # Extract audio from video
    audio_path = f"{base_video_path}.mp3"
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path, bitrate="32k")
    clip.audio.close()
    clip.close()

    print(f"Extracted {len(base64Frames)} frames")
    print(f"Extracted audio to {audio_path}")
    return base64Frames, audio_path
# Extract 1 frame per second. You can adjust the `seconds_per_frame` parameter to change the sampling rate
base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=1)
Output:
MoviePy - Writing audio in data/keynote_recap.mp3
MoviePy - Done.
Extracted 218 frames
Extracted audio to data/keynote_recap.mp3
Display the frames and audio for context
## Display the frames and audio for context
display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8")), width=600))
    time.sleep(0.025)
Audio(audio_path)
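Both use cases above need a transcript of the extracted audio. As a preview of that step, a minimal sketch of transcribing it with Whisper might look like the following (it assumes the client created earlier and the audio_path returned by process_video; the summarization and Q&A prompts then build on this transcript together with the sampled frames):
# Transcribe the extracted audio with Whisper so it can be combined with the
# sampled frames in later prompts.
with open(audio_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcription.text)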