
Large Model Series: OpenAI Usage Tips: Improving Whisper Transcription Quality with Preprocessing and Postprocessing Techniques


Improving Whisper Transcription Quality: Preprocessing and Postprocessing Techniques

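This notebook offers a guide to improving Whisper's transcription quality. Before transcription, we streamline the audio data by trimming and segmenting it. After transcription, we refine the output by adding punctuation, adjusting product terminology (for example, changing 'five two nine' to '529'), and mitigating Unicode issues. These strategies should make your transcripts clearer, but keep in mind that customizing them for your own use case may be more beneficial.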

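Setup

To get started, let's import a few libraries:

  • PyDub is a simple, easy-to-use Python library for audio processing tasks such as slicing, concatenating, and exporting audio files.

  • The Audio class from the IPython.display module creates an audio control that plays sound in Jupyter notebooks, giving you a straightforward way to play audio data directly in the notebook.

  • For our audio file, we use a fictional earnings call written by ChatGPT and read aloud by the author. The file is relatively short, but it should illustrate how these preprocessing and postprocessing steps can be applied to any audio file.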

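# Import the required libraries
from openai import OpenAI  # OpenAI API client
import os  # file and directory operations
import urllib.request  # file download; the request submodule must be imported explicitly
from IPython.display import Audio  # audio playback control for notebooks
from pathlib import Path  # object-oriented file paths
from pydub import AudioSegment  # audio loading, slicing, and export
import ssl  # SSL context handling for the download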


# Read the OpenAI API key from the environment
api_key = os.getenv("OPENAI_API_KEY")
# Create the OpenAI client with that key
client = OpenAI(api_key=api_key)

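# Remote location of the sample audio file
earnings_call_remote_filepath = "https://cdn.openai.com/API/examples/data/EarningsCall.wav"

# Local path to save it to
earnings_call_filepath = "data/EarningsCall.wav"

# Make sure the target directory exists before downloading
os.makedirs(os.path.dirname(earnings_call_filepath), exist_ok=True)

# Download the sample audio file and save it locally.
# Note: this replaces the default HTTPS context with an unverified one,
# which disables certificate checks for the whole process.
ssl._create_default_https_context = ssl._create_unverified_context
urllib.request.urlretrieve(earnings_call_remote_filepath, earnings_call_filepath)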
('data/EarningsCall.wav', <http.client.HTTPMessage at 0x11be41f50>)
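The download step above switches off certificate verification for every HTTPS request the process makes. If the default certificates on your machine work, a variant that keeps verification on may be preferable. The following is a minimal sketch under that assumption; `make_verified_context` and `download_file` are hypothetical helper names that do not appear in the original notebook.

```python
import os
import shutil
import ssl
import urllib.request


def make_verified_context():
    """Return an SSL context that checks certificates and host names."""
    return ssl.create_default_context()


def download_file(url, dest_path, timeout=30):
    """Download url to dest_path over HTTPS with certificate verification on."""
    directory = os.path.dirname(dest_path)
    if directory:
        # Create the target directory if it is missing
        os.makedirs(directory, exist_ok=True)
    context = make_verified_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        with open(dest_path, "wb") as out_file:
            # Stream the response body to disk
            shutil.copyfileobj(response, out_file)
    return dest_path
```

With these helpers, the call would be `download_file(earnings_call_remote_filepath, earnings_call_filepath)`.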

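Audio files that begin with a long silence can sometimes cause Whisper to transcribe the audio incorrectly. We'll use PyDub to detect and trim that silence.

Here we set the silence threshold to -20 dBFS. You can change this value if you like.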

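# Detect leading silence.
# Returns the number of milliseconds before the first sound
# (the first chunk whose average level exceeds the threshold, in dBFS).
def milliseconds_until_sound(sound, silence_threshold_in_decibels=-20.0, chunk_size=10):
    trim_ms = 0  # milliseconds of leading silence found so far

    assert chunk_size > 0  # guard against an infinite loop
    # Advance one chunk at a time while still inside the audio
    # and the current chunk is quieter than the threshold
    while trim_ms < len(sound) and sound[trim_ms:trim_ms+chunk_size].dBFS < silence_threshold_in_decibels:
        trim_ms += chunk_size

    return trim_ms  # length of the leading silence in milliseconds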
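To make the threshold more concrete: dBFS measures level relative to full scale, so values are at most 0 and quieter audio is more negative. The sketch below reimplements the same chunk-by-chunk scan on a plain list of 16-bit samples, without PyDub, so the arithmetic is visible. The function names and the 16-bit full-scale value of 32768 are assumptions made for illustration; this is not PyDub's implementation.

```python
import math


def dbfs(samples, max_amplitude=32768):
    """Level of a chunk in dB relative to full scale (16-bit assumed)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(rms / max_amplitude)


def leading_silence_ms(samples, sample_rate, threshold_db=-20.0, chunk_ms=10):
    """Milliseconds before the first chunk that reaches threshold_db."""
    chunk_len = sample_rate * chunk_ms // 1000
    assert chunk_len > 0  # avoid an infinite loop
    start = 0
    trim_ms = 0
    while start < len(samples) and dbfs(samples[start:start + chunk_len]) < threshold_db:
        start += chunk_len
        trim_ms += chunk_ms
    return trim_ms


# Half a second of silence followed by one second of a 440 Hz tone
rate = 8000
silence = [0] * (rate // 2)
tone = [int(16384 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate)]
print(leading_silence_ms(silence + tone, rate))  # 500
```

A tone at half of full scale sits around -9 dBFS, well above the -20 dBFS threshold, so the scan stops at the first chunk of the tone.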
# Trim leading silence from the audio file at filepath
def trim_start(filepath):
    path = Path(filepath)
    directory = path.parent  # directory containing the file
    filename = path.name  # file name without the directory
    # Load the audio data (WAV format)
    audio = AudioSegment.from_file(filepath, format="wav")
    # Find where the first sound starts, in milliseconds
    start_trim = milliseconds_until_sound(audio)
    # Drop everything before that point
    trimmed = audio[start_trim:]
    # Save the result next to the original, with the prefix "trimmed_"
    new_filename = directory / f"trimmed_{filename}"
    trimmed.export(new_filename, format="wav")
    # Return the trimmed audio and the path of the new file
    return trimmed, new_filename
# Transcribe one audio file to text
def transcribe_audio(file, output_dir):
    # Build the full path to the audio file
    audio_path = os.path.join(output_dir, file)
    # Open the audio file in binary mode
    with open(audio_path, 'rb') as audio_data:
        # Send the audio to the transcription API
        transcription = client.audio.transcriptions.create(
            model="whisper-1", file=audio_data)
        # Return the transcribed text
        return transcription.text
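When a recording is split into segments, a cut can land mid-sentence, and each segment is then transcribed without knowledge of the previous one. The transcription endpoint accepts an optional `prompt` parameter, and one way to carry context across segments is to pass the tail of the previous segment's text as that prompt. The sketch below assumes a client shaped like the `client` created earlier; `transcribe_segments` and `context_chars` are hypothetical names, and whether this helps on your audio is something to verify rather than assume.

```python
def transcribe_segments(client, paths, model="whisper-1", context_chars=200):
    """Transcribe files in order, priming each call with the previous text."""
    texts = []
    previous_tail = ""
    for path in paths:
        with open(path, "rb") as audio_file:
            kwargs = {"model": model, "file": audio_file}
            if previous_tail:
                # Give the model the end of the previous segment as context
                kwargs["prompt"] = previous_tail
            result = client.audio.transcriptions.create(**kwargs)
        texts.append(result.text)
        # Keep only the last few characters to stay within the prompt budget
        previous_tail = result.text[-context_chars:]
    return texts
```

The first segment is sent without a prompt; every later segment receives up to `context_chars` characters from the segment before it.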

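Sometimes Unicode characters are injected into transcripts; removing any non-ASCII characters should help mitigate this.

Keep in mind that you should not use this function when transcribing Greek, Cyrillic, Arabic, Chinese, or any other language that relies on non-ASCII characters.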

# Remove non-ASCII characters from a string
def remove_non_ascii(text):
    # Keep only the characters whose code point is below 128,
    # then join them back into a single string
    return ''.join(i for i in text if ord(i)<128)
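Dropping every non-ASCII character also drops accented letters outright, so 'café' becomes 'caf'. If that matters for your transcripts, one possible alternative is to decompose characters first and then discard only what still has no ASCII form. The sketch below compares the two; `fold_to_ascii` is a hypothetical helper, not something the original notebook uses.

```python
import unicodedata


def remove_non_ascii(text):
    """Same behavior as the function above: drop anything outside ASCII."""
    return ''.join(ch for ch in text if ord(ch) < 128)


def fold_to_ascii(text):
    """Split accented letters into base letter plus mark, then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


sample = "caf\u00e9 costs 5\u20ac"
print(remove_non_ascii(sample))  # caf costs 5
print(fold_to_ascii(sample))     # cafe costs 5
```

Both variants still remove characters such as the euro sign, so neither is suitable for text written in a non-Latin script.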

This function adds formatting and punctuation to our transcript. Whisper produces a transcript with punctuation but without formatting.

# Add punctuation and formatting to the transcript
def punctuation_assistant(ascii_transcript):

    # System prompt
    system_prompt = """You are a helpful assistant that adds punctuation to text.
      Preserve the original words and only insert necessary punctuation such as periods,
     commas, capitalization, symbols like dollar signs or percentage signs, and formatting.
     Use only the context provided. If there is no context provided say, 'No context provided'\n"""

    # Call the chat model with the system prompt and the transcript as the user message
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": ascii_transcript
            }
        ]
    )
    return response
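The system prompt asks the model to preserve the original words, but nothing in the pipeline checks that it did. A lightweight check is to compare the word sequences of the input and output after stripping punctuation and case. This is a minimal sketch under the assumption that only punctuation, symbols, and capitalization should differ; `normalized_words` and `words_preserved` are hypothetical helpers.

```python
import re


def normalized_words(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def words_preserved(original, punctuated):
    """True when both texts contain the same words in the same order."""
    return normalized_words(original) == normalized_words(punctuated)


before = "we made 125 million a 25% increase"
print(words_preserved(before, "We made $125 million, a 25% increase."))   # True
print(words_preserved(before, "We earned $125 million, a 25% increase."))  # False
```

A failed check does not say what changed, only that the output deserves a closer look before it moves to the next step.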

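Our audio file is a recording of a fictitious earnings call that mentions many financial products. This function helps ensure that if Whisper transcribes the names of these products incorrectly, they can be corrected.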

# Correct misspelled or misformatted financial product names
def product_assistant(ascii_transcript):
    # System prompt describing the assistant's task and requirements
    system_prompt = """You are an intelligent assistant specializing in financial products;
    your task is to process transcripts of earnings calls, ensuring that all references to
     financial products and common financial terms are in the correct format. For each
     financial product or common term that is typically abbreviated as an acronym, the full term 
    should be spelled out followed by the acronym in parentheses. For example, '401k' should be
     transformed to '401(k) retirement savings plan', 'HSA' should be transformed to 'Health Savings Account (HSA)'
    , 'ROA' should be transformed to 'Return on Assets (ROA)', 'VaR' should be transformed to 'Value at Risk (VaR)'
, and 'PB' should be transformed to 'Price to Book (PB) ratio'. Similarly, transform spoken numbers representing 
financial products into their numeric representations, followed by the full name of the product in parentheses. 
For instance, 'five two nine' to '529 (Education Savings Plan)' and 'four zero one k' to '401(k) (Retirement Savings Plan)'.
 However, be aware that some acronyms can have different meanings based on the context (e.g., 'LTV' can stand for 
'Loan to Value' or 'Lifetime Value'). You will need to discern from the context which term is being referred to 
and apply the appropriate transformation. In cases where numerical figures or metrics are spelled out but do not 
represent specific financial products (like 'twenty three percent'), these should be left as is. Your role is to
 analyze and adjust financial product terminology in the text. Once you've done that, produce the adjusted 
 transcript and a list of the words you've changed"""
    
    # Call the chat model with the system prompt and the transcript as the user message
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": ascii_transcript
            }
        ]
    )
    return response
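Because the prompt asks for both the adjusted transcript and a list of changed words, the model's reply mixes the two in one string. If you need them separately, you could split the reply on the heading that introduces the list. The sketch below assumes the reply uses a "Words Changed:" heading followed by numbered "old -> new" lines; the model is free to format the list differently, so treat the marker and the pattern as assumptions. `split_product_response` is a hypothetical helper.

```python
import re

# Matches lines such as "1. Q2 -> second quarter (Q2)"
CHANGE_LINE = re.compile(r"^\s*\d+\.\s*(.+?)\s*->\s*(.+?)\s*$")


def split_product_response(content, marker="Words Changed:"):
    """Split a reply into (transcript, [(old, new), ...])."""
    transcript, _, change_block = content.partition(marker)
    changes = []
    for line in change_block.splitlines():
        match = CHANGE_LINE.match(line)
        if match:
            changes.append((match.group(1), match.group(2)))
    return transcript.strip(), changes


reply = (
    "Our second quarter (Q2) was strong.\n\n"
    "Words Changed:\n"
    "1. Q2 -> second quarter (Q2)\n"
    "2. IPO -> Initial Public Offering (IPO)\n"
)
transcript, changes = split_product_response(reply)
print(transcript)  # Our second quarter (Q2) was strong.
print(changes)     # [('Q2', 'second quarter (Q2)'), ('IPO', 'Initial Public Offering (IPO)')]
```

If the marker is missing, the whole reply is returned as the transcript and the change list is empty.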

This call creates a new file whose name is the original file name prefixed with "trimmed_".

# Trim the leading silence from the original audio file;
# trim_start returns the trimmed audio and the path of the new file
trimmed_audio, trimmed_filename = trim_start(earnings_call_filepath)

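Our fictitious earnings call audio is fairly short, so we set the segment length accordingly. Keep in mind that you can adjust the segment length as needed.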

# Load the trimmed audio file
trimmed_audio = AudioSegment.from_wav(trimmed_filename)

# Duration of each segment, in milliseconds
one_minute = 1 * 60 * 1000

# Start time of the first segment
start_time = 0

# Index used to name the segment files
i = 0

# Output directory for the segment files
output_dir_trimmed = "trimmed_earnings_directory"

# Create the output directory if it does not exist
os.makedirs(output_dir_trimmed, exist_ok=True)

# Walk through the trimmed audio one segment at a time
while start_time < len(trimmed_audio):
    # Extract a segment
    segment = trimmed_audio[start_time:start_time + one_minute]

    # Save the segment
    segment.export(os.path.join(output_dir_trimmed, f"trimmed_{i:02d}.wav"), format="wav")

    # Move to the start of the next segment
    start_time += one_minute

    # Increment the index used to name the next file
    i += 1
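The segmentation loop reduces to simple interval arithmetic: fixed-length windows, with a shorter final window when the duration is not an exact multiple of the segment length. The sketch below computes those boundaries on plain integers so you can check them before cutting any audio; `segment_bounds` is a hypothetical helper and works in milliseconds, like PyDub slicing.

```python
def segment_bounds(total_ms, segment_ms):
    """Return (start, end) pairs in milliseconds covering [0, total_ms)."""
    if segment_ms <= 0:
        raise ValueError("segment_ms must be positive")
    return [
        (start, min(start + segment_ms, total_ms))
        for start in range(0, total_ms, segment_ms)
    ]


# A 2.5-minute recording cut into 1-minute segments
print(segment_bounds(150_000, 60_000))
# [(0, 60000), (60000, 120000), (120000, 150000)]
```

Each pair maps directly onto a slice such as `trimmed_audio[start:end]`.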

# Collect the segment files and sort them by their numeric index
audio_files = sorted(
    # All files in output_dir_trimmed that end with .wav
    (f for f in os.listdir(output_dir_trimmed) if f.endswith(".wav")),
    # Sort by the digits in the file name
    key=lambda f: int(''.join(filter(str.isdigit, f)))
)
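Sorting by number matters because a plain string sort puts "10" before "2" once file names are not zero-padded. The key used above joins every digit in the file name, which works for names like trimmed_00.wav but would give misleading values if the prefix itself contained digits. A slightly stricter key, sketched here with the hypothetical helper `segment_index`, reads only the digits immediately before the extension.

```python
import re


def segment_index(filename):
    """Return the integer that directly precedes the .wav extension."""
    match = re.search(r"(\d+)\.wav$", filename)
    if match is None:
        raise ValueError(f"no segment index in {filename!r}")
    return int(match.group(1))


names = ["trimmed_10.wav", "trimmed_2.wav", "trimmed_1.wav"]
print(sorted(names))
# ['trimmed_1.wav', 'trimmed_10.wav', 'trimmed_2.wav']
print(sorted(names, key=segment_index))
# ['trimmed_1.wav', 'trimmed_2.wav', 'trimmed_10.wav']
```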
# Apply transcribe_audio to every segment file
transcriptions = [transcribe_audio(file, output_dir_trimmed) for file in audio_files]

# Join the per-segment transcripts into one string, separated by spaces
full_transcript = ' '.join(transcriptions)

# Print the full transcript
print(full_transcript)
Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of 125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to 37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to 16 million, which is a noteworthy increase from 10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized. debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our debt-to-equity ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with customer acquisition cost dropping by 15% and lifetime value growing by 25%. Our LTVCAC ratio is at an impressive 3.5%. In terms of risk management, we have a value-at-risk model in place with a 99%... confidence level indicating that our maximum loss will not exceed 5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy tier one capital ratio of 12.5%. Our forecast for the coming quarter is positive. We expect revenue to be around 135 million and 8% quarter over quarter growth driven primarily by our cutting edge blockchain solutions and AI driven predictive analytics. 
We're also excited about the upcoming IPO of our FinTech subsidiary Pay Plus, which we expect to raise 200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful Q3. Thank you so much.


# Remove non-ASCII characters from the full transcript
ascii_transcript = remove_non_ascii(full_transcript)

# Print the ASCII-only transcript
print(ascii_transcript)
Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of 125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to 37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to 16 million, which is a noteworthy increase from 10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized. debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our debt-to-equity ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with customer acquisition cost dropping by 15% and lifetime value growing by 25%. Our LTVCAC ratio is at an impressive 3.5%. In terms of risk management, we have a value-at-risk model in place with a 99%... confidence level indicating that our maximum loss will not exceed 5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy tier one capital ratio of 12.5%. Our forecast for the coming quarter is positive. We expect revenue to be around 135 million and 8% quarter over quarter growth driven primarily by our cutting edge blockchain solutions and AI driven predictive analytics. 
We're also excited about the upcoming IPO of our FinTech subsidiary Pay Plus, which we expect to raise 200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful Q3. Thank you so much.
# Run the punctuation assistant on the ASCII-only transcript
response = punctuation_assistant(ascii_transcript)

# Extract the punctuated transcript from the model's response
punctuated_transcript = response.choices[0].message.content

# Print the punctuated transcript
print(punctuated_transcript)
Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of $125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to $37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to $16 million, which is a noteworthy increase from $10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk-adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our debt-to-equity ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with customer acquisition cost dropping by 15% and lifetime value growing by 25%. Our LTVCAC ratio is at an impressive 3.5%. In terms of risk management, we have a value-at-risk model in place with a 99% confidence level indicating that our maximum loss will not exceed $5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy tier one capital ratio of 12.5%. Our forecast for the coming quarter is positive. We expect revenue to be around $135 million and 8% quarter over quarter growth driven primarily by our cutting-edge blockchain solutions and AI-driven predictive analytics. 
We're also excited about the upcoming IPO of our FinTech subsidiary Pay Plus, which we expect to raise $200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful Q3. Thank you so much.
# Pass the punctuated transcript to the product assistant
response = product_assistant(punctuated_transcript)

# Extract the final transcript from the model's response
final_transcript = response.choices[0].message.content

# Print the final transcript
print(final_transcript)
Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar second quarter (Q2) with a revenue of $125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our Earnings Before Interest, Taxes, Depreciation, and Amortization (EBITDA) has surged to $37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to $16 million, which is a noteworthy increase from $10 million in second quarter (Q2) 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in Collateralized Debt Obligations (CDOs), and Residential Mortgage-Backed Securities (RMBS). We've also invested $25 million in AAA rated corporate bonds, enhancing our risk-adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our Debt-to-Equity (D/E) ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with Customer Acquisition Cost (CAC) dropping by 15% and Lifetime Value (LTV) growing by 25%. Our LTV to CAC (LTVCAC) ratio is at an impressive 3.5%. In terms of risk management, we have a Value at Risk (VaR) model in place with a 99% confidence level indicating that our maximum loss will not exceed $5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy Tier 1 Capital ratio of 12.5%. Our forecast for the coming quarter is positive. 
We expect revenue to be around $135 million and 8% quarter over quarter growth driven primarily by our cutting-edge blockchain solutions and AI-driven predictive analytics. We're also excited about the upcoming Initial Public Offering (IPO) of our FinTech subsidiary Pay Plus, which we expect to raise $200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful third quarter (Q3). Thank you so much.

Words Changed:
1. Q2 -> second quarter (Q2)
2. EBITDA -> Earnings Before Interest, Taxes, Depreciation, and Amortization (EBITDA)
3. Q2 2022 -> second quarter (Q2) 2022
4. CDOs -> Collateralized Debt Obligations (CDOs)
5. RMBS -> Residential Mortgage-Backed Securities (RMBS)
6. D/E -> Debt-to-Equity (D/E)
7. CAC -> Customer Acquisition Cost (CAC)
8. LTV -> Lifetime Value (LTV)
9. LTVCAC -> LTV to CAC (LTVCAC)
10. VaR -> Value at Risk (VaR)
11. IPO -> Initial Public Offering (IPO)
12. Q3 -> third quarter (Q3)