This article is translated and adapted from: Introduction to GPT-4o
https://cookbook.openai.com/examples/gpt4o/introduction_to_gpt4o
Code:
https://github.com/openai/openai-cookbook/blob/main/examples/gpt4o/introduction_to_gpt4o.ipynb
GPT-4o ("o" for "omni") is designed to handle a combination of text, audio, and video inputs, and can generate outputs in text, audio, and image formats.
Before GPT-4o, users could interact with ChatGPT using Voice Mode, which operated with three separate models. GPT-4o integrates these capabilities into a single model trained across text, vision, and audio. This unified approach ensures that all inputs, whether text, visual, or auditory, are processed cohesively by the same neural network.
Currently, the API supports {text, image} inputs and {text} outputs, the same modalities as gpt-4-turbo. Additional modalities, including audio, will be introduced soon. This guide will help you get started with using GPT-4o for text, image, and video understanding.
pip install --upgrade openai --quiet
To set up the client, we need to create an API key to use with our requests. Skip these steps if you already have an API key you can use.
You can get an API key by creating a project on the OpenAI platform, generating a key within it, and (recommended) setting it as the OPENAI_API_KEY environment variable.
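As a quick sanity check before making any requests, you can verify that the key is visible to Python. This is a minimal sketch, assuming you exported OPENAI_API_KEY in your shell beforehand:

import os

# Confirm the API key is available as an environment variable
# (assumes OPENAI_API_KEY was exported in your shell beforehand)
assert os.environ.get("OPENAI_API_KEY"), "Set the OPENAI_API_KEY environment variable first"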
With this setup complete, let's start with a simple {text} input to the model for our first request. We'll use both system and user messages, and we'll receive a response from the assistant role.
from openai import OpenAI
import os

## Set the API key and model name
MODEL = "gpt-4o"
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as an env var>"))

completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"},  # <-- This is the system message that provides context to the model
        {"role": "user", "content": "Hello! Could you solve 2+2?"}  # <-- This is the user message for which the model will generate a response
    ]
)

print("Assistant: " + completion.choices[0].message.content)
Assistant: Of course!
\[ 2 + 2 = 4 \]
If you have any other questions, feel free to ask!
GPT-4o can directly process images and take intelligent actions based on them. We can provide images in two formats:
1. Base64 encoded images
2. URL links
Let's first view the image we'll use, then try sending this image to the API both as Base64 and as a URL link.
from IPython.display import Image, display, Audio, Markdown
import base64

IMAGE_PATH = "data/triangle.png"

# Preview image for context
display(Image(IMAGE_PATH))

# Open the image file and encode it as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image(IMAGE_PATH)
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)
To find the area of the triangle, we can use Heron's formula. First, we need to find the semi-perimeter of the triangle.
The side lengths of the triangle are 6, 5, and 9.
1. Calculate the semi-perimeter \( s \):
\[ s = \frac{a + b + c}{2} = \frac{6 + 5 + 9}{2} = 10 \]
2. Use Heron's formula to find the area \( A \):
\[ A = \sqrt{s(s-a)(s-b)(s-c)} \]
Substituting the values:
\[ A = \sqrt{10(10-6)(10-5)(10-9)} \]
\[ A = \sqrt{10 \cdot 4 \cdot 5 \cdot 1} \]
\[ A = \sqrt{200} \]
\[ A = 10\sqrt{2} \]
Therefore, the area of the triangle is \( 10\sqrt{2} \) square units.
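As a quick numeric check of the model's arithmetic, Heron's formula can be evaluated directly; a minimal sketch using the side lengths the model read off the image:

import math

# Verify the Heron's-formula computation for sides 6, 5, 9
a, b, c = 6, 5, 9
s = (a + b + c) / 2                                # semi-perimeter = 10
area = math.sqrt(s * (s - a) * (s - b) * (s - c))  # sqrt(200)
print(area, 10 * math.sqrt(2))                     # both print 14.1421356...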
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": "https://upload.wikimedia.org/wikipedia/commons/e/e2/The_Algebra_of_Mohammed_Ben_Musa_-_page_82b.png"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)
To find the area of the triangle, we can use Heron's formula. Heron's formula states that the area of a triangle with side lengths \(a\), \(b\), and \(c\) is:
\[ \text{Area} = \sqrt{s(s-a)(s-b)(s-c)} \]
where \(s\) is the semi-perimeter of the triangle:
\[ s = \frac{a + b + c}{2} \]
For the given triangle, the side lengths are \(a = 5\), \(b = 6\), and \(c = 9\).
First, calculate the semi-perimeter \(s\):
\[ s = \frac{5 + 6 + 9}{2} = \frac{20}{2} = 10 \]
Now, apply Heron's formula:
\[ \text{Area} = \sqrt{10(10-5)(10-6)(10-9)} \]
\[ \text{Area} = \sqrt{10 \cdot 5 \cdot 4 \cdot 1} \]
\[ \text{Area} = \sqrt{200} \]
\[ \text{Area} = 10\sqrt{2} \]
Therefore, the area of the triangle is \(10\sqrt{2}\) square units.
While it isn't possible to send a video directly to the API, GPT-4o can understand videos if you sample frames and provide them as images. It performs better at this task than GPT-4 Turbo.
Since GPT-4o in the API does not yet support audio input (as of May 2024), we'll use a combination of GPT-4o and Whisper to process both the audio and the visuals of a provided video, and showcase two use cases: summarization and question answering.
We'll use two Python packages for video processing: opencv-python and moviepy.
These require ffmpeg, so make sure to install it beforehand. Depending on your OS, you may need to run brew install ffmpeg or sudo apt install ffmpeg.
pip install opencv-python --quiet
pip install moviepy --quiet
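Before processing the video, it can help to confirm that ffmpeg is actually on your PATH; a minimal sketch, assuming it was installed via one of the commands above:

import shutil

# moviepy and opencv rely on ffmpeg being available on the PATH
# (assumes it was installed via brew/apt as described above)
assert shutil.which("ffmpeg"), "ffmpeg not found - please install it first"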
import cv2
from moviepy.editor import VideoFileClip
import time
import base64

# We'll be using the OpenAI DevDay Keynote Recap video. You can review the video here: https://www.youtube.com/watch?v=h02ti0Bl6zk
VIDEO_PATH = "data/keynote_recap.mp4"

def process_video(video_path, seconds_per_frame=2):
    base64Frames = []
    base_video_path, _ = os.path.splitext(video_path)

    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = video.get(cv2.CAP_PROP_FPS)
    frames_to_skip = int(fps * seconds_per_frame)
    curr_frame = 0

    # Loop through the video and extract frames at specified sampling rate
    while curr_frame < total_frames - 1:
        video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
        curr_frame += frames_to_skip
    video.release()

    # Extract audio from video
    audio_path = f"{base_video_path}.mp3"
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path, bitrate="32k")
    clip.audio.close()
    clip.close()

    print(f"Extracted {len(base64Frames)} frames")
    print(f"Extracted audio to {audio_path}")
    return base64Frames, audio_path

# Extract 1 frame per second. You can adjust the `seconds_per_frame` parameter to change the sampling rate
base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=1)
MoviePy - Writing audio in data/keynote_recap.mp3
MoviePy - Done.
Extracted 218 frames
Extracted audio to data/keynote_recap.mp3
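Since all 218 frames will go into a single request, it's worth sanity-checking the context budget first. A minimal sketch, assuming OpenAI's documented fixed cost of roughly 85 tokens per detail: "low" image (verify against the current vision pricing docs):

# Rough context-budget estimate for sending every sampled frame at low detail
LOW_DETAIL_TOKENS_PER_IMAGE = 85  # assumption based on OpenAI's vision pricing docs
estimated_tokens = len(base64Frames) * LOW_DETAIL_TOKENS_PER_IMAGE
print(estimated_tokens)  # ~18,500 tokens for 218 frames, well within a 128k context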
## Display the frames and audio for context
display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8")), width=600))
    time.sleep(0.025)

Audio(audio_path)
Now that we have both the video frames and the audio, let's run a few different tests to generate a video summary and compare the results across the model's different modalities. We should expect the summary generated from both the visual and the audio context to be the most accurate, since the model gets to use the full context of the video.
The visual summary is generated by sending the model only the frames from the video. With frames alone, the model is likely to capture the visual aspects but will miss any details discussed by the speaker.
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are generating a video summary. Please provide a summary of the video. Respond in Markdown."},
        {"role": "user", "content": [
            "These are the frames from the video.",
            *map(lambda x: {"type": "image_url",
                            "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}}, base64Frames)
        ],
        }
    ],
    temperature=0,
)

print(response.choices[0].message.content)
## Video Summary: OpenAI DevDay Keynote Recap
The video appears to be a keynote recap from OpenAI's DevDay event. Here are the key points covered in the video:
1. **Introduction and Event Overview**:
- The video starts with the title "OpenAI DevDay" and transitions to "Keynote Recap."
- The event venue is shown, with attendees gathering and the stage set up.
2. **Keynote Presentation**:
- A speaker, presumably from OpenAI, takes the stage to present.
- The presentation covers various topics related to OpenAI's latest developments and announcements.
3. **Announcements**:
- **GPT-4 Turbo**: Introduction of GPT-4 Turbo, highlighting its enhanced capabilities and performance.
- **JSON Mode**: A new feature that allows for structured data output in JSON format.
- **Function Calling**: Demonstration of improved function calling capabilities, making interactions more efficient.
- **Context Length and Control**: Enhancements in context length and user control over the model's responses.
- **Better Knowledge Integration**: Improvements in the model's knowledge base and retrieval capabilities.
4. **Product Demonstrations**:
- **DALL-E 3**: Introduction of DALL-E 3 for advanced image generation.
- **Custom Models**: Announcement of custom models, allowing users to tailor models to specific needs.
- **API Enhancements**: Updates to the API, including threading, retrieval, and code interpreter functionalities.
5. **Pricing and Token Efficiency**:
- Discussion on GPT-4 Turbo pricing, emphasizing cost efficiency with reduced input and output tokens.
6. **New Features and Tools**:
- Introduction of new tools and features for developers, including a variety of GPT-powered applications.
- Emphasis on building with natural language and the ease of creating custom applications.
7. **Closing Remarks**:
- The speaker concludes the presentation, thanking the audience and highlighting the future of OpenAI's developments.
The video ends with the OpenAI logo and the event title "OpenAI DevDay."
The results are as expected: the model captures the high-level visual aspects of the video, but misses the details covered in the speech.
The audio summary is generated by sending the model the transcript of the audio. With audio alone, the model is likely to bias toward the spoken content, and will miss the context provided by the presentation and visuals.
{audio} input for GPT-4o isn't currently available, but is coming soon! For now, we use the existing whisper-1 model to process the audio.
# Transcribe the audio
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=open(audio_path, "rb"),
)

## OPTIONAL: Uncomment the line below to print the transcription
#print("Transcript: ", transcription.text + "\n\n")
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": """You are generating a transcript summary. Create a summary of the provided transcription. Respond in Markdown."""},
        {"role": "user", "content": [
            {"type": "text", "text": f"The audio transcription is: {transcription.text}"}
        ],
        }
    ],
    temperature=0,
)

print(response.choices[0].message.content)
### Summary
Welcome to OpenAI's first-ever Dev Day. Key announcements include:
- **GPT-4 Turbo**: A new model supporting up to 128,000 tokens of context, featuring JSON mode for valid JSON responses, improved instruction following, and better knowledge retrieval from external documents or databases. It is also significantly cheaper than GPT-4.
- **New Features**:
- **Dolly 3**, **GPT-4 Turbo with Vision**, and a new **Text-to-Speech model** are now available in the API.
- **Custom Models**: A program where OpenAI researchers help companies create custom models tailored to their specific use cases.
- **Increased Rate Limits**: Doubling tokens per minute for established GPT-4 customers and allowing requests for further rate limit changes.
- **GPTs**: Tailored versions of ChatGPT for specific purposes, programmable through conversation, with options for private or public sharing, and a forthcoming GPT Store.
- **Assistance API**: Includes persistent threads, built-in retrieval, a code interpreter, and improved function calling.
OpenAI is excited about the future of AI integration and looks forward to seeing what users will create with these new tools. The event concludes with an invitation to return next year for more advancements.
The audio summary is biased toward the content discussed in the speech, but it comes out far less structured than the video summary.
The audio + visual summary is generated by sending the model both the visuals and the audio from the video at once. When sending both, the model is expected to summarize better, since it can perceive the entire video at once.
## Generate a summary with visual and audio
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": """You are generating a video summary. Create a summary of the provided video and its transcript. Respond in Markdown"""},
        {"role": "user", "content": [
            "These are the frames from the video.",
            *map(lambda x: {"type": "image_url",
                            "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}}, base64Frames),
            {"type": "text", "text": f"The audio transcription is: {transcription.text}"}
        ],
        }
    ],
    temperature=0,
)

print(response.choices[0].message.content)
## Video Summary: OpenAI Dev Day
### Introduction
- The video begins with the title "OpenAI Dev Day" and transitions to a keynote recap.
### Event Overview
- The event is held at a venue with a sign reading "OpenAI Dev Day."
- Attendees are seen entering and gathering in a large hall.
### Keynote Presentation
- The keynote speaker introduces the event and announces the launch of GPT-4 Turbo.
- **GPT-4 Turbo**:
- Supports up to 128,000 tokens of context.
- Introduces a new feature called JSON mode for valid JSON responses.
- Improved function calling capabilities.
- Enhanced instruction-following and knowledge retrieval from external documents or databases.
- Knowledge updated up to April 2023.
- Available in the API along with DALL-E 3, GPT-4 Turbo with Vision, and a new Text-to-Speech model.
### Custom Models
- Launch of a new program called Custom Models.
- Researchers will collaborate with companies to create custom models tailored to specific use cases.
- Higher rate limits and the ability to request changes to rate limits and quotas directly in API settings.
### Pricing and Performance
- **GPT-4 Turbo**:
- 3x cheaper for prompt tokens and 2x cheaper for completion tokens compared to GPT-4.
- Doubling the tokens per minute for established GPT-4 customers.
### Introduction of GPTs
- **GPTs**:
- Tailored versions of ChatGPT for specific purposes.
- Combine instructions, expanded knowledge, and actions for better performance and control.
- Can be created without coding, through conversation.
- Options to make GPTs private, share publicly, or create for company use in ChatGPT Enterprise.
- Announcement of the upcoming GPT Store.
### Assistance API
- **Assistance API**:
- Includes persistent threads for handling long conversation history.
- Built-in retrieval and code interpreter with a working Python interpreter in a sandbox environment.
- Improved function calling.
### Conclusion
- The speaker emphasizes the potential of integrating intelligence everywhere, providing "superpowers on demand."
- Encourages attendees to return next year, hinting at even more advanced developments.
- The event concludes with thanks to the attendees.
### Closing
- The video ends with the OpenAI logo and a final thank you message.
After combining both the video and the audio, we're able to get a much more detailed and comprehensive summary of the event, which draws on information from both the visual and audio elements of the video.
For Q&A, we'll use the same concepts as before to ask questions about our processed video, while running the same three tests to demonstrate the benefit of combining input modalities: visual Q&A, audio Q&A, and combined visual + audio Q&A.
QUESTION = "Question: Why did Sam Altman have an example about raising windows and turning the radio on?"
qa_visual_response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": """Use the video to answer the provided question. Respond in Markdown."""},
        {"role": "user", "content": [
            "These are the frames from the video.",
            *map(lambda x: {"type": "image_url",
                            "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}}, base64Frames),
            QUESTION
        ],
        }
    ],
    temperature=0,
)

print("Visual QA:\n" + qa_visual_response.choices[0].message.content)
Visual QA:
Sam Altman used the example of raising windows and turning the radio on to demonstrate GPT-4 Turbo's function calling capability. The example illustrated how the model can interpret and execute multiple commands in a more structured and efficient manner. The "before" and "after" comparison showed how the model can now directly call functions such as raise_windows() and radio_on() based on natural language instructions, showcasing improved control and functionality.
qa_audio_response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": """Use the transcription to answer the provided question. Respond in Markdown."""},
        {"role": "user", "content": f"The audio transcription is: {transcription.text}. \n\n {QUESTION}"},
    ],
    temperature=0,
)

print("Audio QA:\n" + qa_audio_response.choices[0].message.content)
Audio QA:
The provided transcription does not include any mention of Sam Altman or an example about raising windows and turning the radio on. Therefore, I cannot provide an answer based on the given transcription.
qa_both_response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": """Use the video and transcription to answer the provided question."""},
        {"role": "user", "content": [
            "These are the frames from the video.",
            *map(lambda x: {"type": "image_url",
                            "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}}, base64Frames),
            {"type": "text", "text": f"The audio transcription is: {transcription.text}"},
            QUESTION
        ],
        }
    ],
    temperature=0,
)

print("Both QA:\n" + qa_both_response.choices[0].message.content)
Both QA:
Sam Altman used the example of raising windows and turning the radio on to demonstrate GPT-4 Turbo's improved function calling capability. The example illustrated how the model can now handle multiple function calls more effectively and follow instructions better. In the "before" scenario, the model had to be prompted separately for each action, whereas in the "after" scenario, the model could handle both actions in a single prompt, showcasing its enhanced ability to manage and execute multiple tasks simultaneously.
Comparing the three answers, the most accurate answer comes from using both the audio and the visuals from the video. Sam Altman did not discuss raising windows or turning on the radio during the keynote, but he referenced the model's improved ability to execute multiple functions in a single request, while the example was displayed on the screen behind him.
Integrating multiple input modalities such as audio, visual, and text significantly enhances the model's performance on a diverse range of tasks. This multimodal approach allows for more comprehensive understanding and interaction, mirroring more closely how humans perceive and process information.
Currently, GPT-4o in the API supports text and image inputs, with audio capabilities coming soon.