After ChatGPT was released at the end of 2022, large models entered a stage of rapid development, and all kinds of large models from major companies at home and abroad have sprung up like mushrooms after rain. Beyond human-machine conversation, large models can handle a wide variety of tasks: video understanding and generation, image understanding and generation, speech understanding and generation, speech-to-text and text-to-speech conversion, and so on. If these capabilities can be combined with digital humans, the problem of bringing digital humans into real application scenarios will be greatly eased, and we are sure to see digital humans in more and more ToB and ToC scenarios.
I built a conversational digital human project to try to combine large models with a digital human, hoping that once the basic technical problems are solved it can be applied to real business scenarios.
The goal of the project: the user can talk to the digital human directly by voice.
The interaction is like two people talking face to face, simple and direct, but the implementation is not nearly as direct. It has to be broken down into several steps: record the user's speech, convert the speech to text (ASR), generate a reply with a large language model, convert the reply to speech (TTS), and drive the digital human's talking animation.
Figure 1. Implementation pipeline of the digital human conversation
The implementation is covered in two articles: this one introduces the first four steps, and the next article covers step five.
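To make the flow concrete, here is a minimal sketch of how the conversation loop hangs together in Unity. All of the method names (RecordVoice, SpeechToText, ChatWithLLM, TextToSpeech, PlayAndAnimate) are hypothetical placeholders for the components implemented in the rest of this article, not real APIs.

// Hypothetical orchestration of the five steps; requires Cysharp.Threading.Tasks
// and System.Threading. Each call wraps one component described below.
public async UniTask ConversationLoop(CancellationToken cancel)
{
    while (!cancel.IsCancellationRequested)
    {
        byte[] wav   = await RecordVoice();       // step 1: record the user's speech
        string text  = await SpeechToText(wav);   // step 2: Whisper ASR
        string reply = await ChatWithLLM(text);   // step 3: Llama 2 generates the reply
        string file  = await TextToSpeech(reply); // step 4: Bark TTS, saved as a WAV file
        PlayAndAnimate(file);                     // step 5: talking animation (next article)
    }
}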
Demo environment: Windows 11 + Nvidia RTX 3050 + CUDA 12.1 + Unity
Figure 2. Digital human conversation UI
The user input section supports two methods: typed text and recorded voice.
Voice input is implemented with the Microphone class from UnityEngine, and the recording is saved as a WAV file. Straight to the code:
using UnityEngine;
using System.IO;

private AudioClip clip;                 // recorded audio clip
private int audioRecordMaxLength = 60;  // maximum recording length: 60 seconds
private byte[] bytes;                   // encoded recording data

private void StartRecording()
{
    clip = Microphone.Start(null, false, audioRecordMaxLength, 44100);
}

private void StopRecording()
{
    var position = Microphone.GetPosition(null);
    Microphone.End(null);
    var samples = new float[position * clip.channels];
    clip.GetData(samples, 0);
    bytes = EncodeAsWAV(samples, clip.frequency, clip.channels);
    SendRecording();

    // save the recorded audio to disk
    File.WriteAllBytes(Application.dataPath + "/test.wav", bytes);
}

private byte[] EncodeAsWAV(float[] samples, int frequency, int channels)
{
    // 44-byte canonical WAV header followed by 16-bit PCM samples
    using (var memoryStream = new MemoryStream(44 + samples.Length * 2))
    {
        using (var writer = new BinaryWriter(memoryStream))
        {
            writer.Write("RIFF".ToCharArray());
            writer.Write(36 + samples.Length * 2);   // file size minus 8 bytes
            writer.Write("WAVE".ToCharArray());
            writer.Write("fmt ".ToCharArray());
            writer.Write(16);                         // fmt chunk size
            writer.Write((ushort)1);                  // audio format: PCM
            writer.Write((ushort)channels);
            writer.Write(frequency);
            writer.Write(frequency * channels * 2);   // byte rate
            writer.Write((ushort)(channels * 2));     // block align
            writer.Write((ushort)16);                 // bits per sample
            writer.Write("data".ToCharArray());
            writer.Write(samples.Length * 2);         // data chunk size

            foreach (var sample in samples)
            {
                writer.Write((short)(sample * short.MaxValue));
            }
        }
        return memoryStream.ToArray();
    }
}
Speech-to-text uses Whisper, the automatic speech recognition (ASR) system developed by OpenAI. Whisper is an end-to-end model built as a Transformer encoder-decoder; it is a robust, flexible, multilingual speech-to-text system suited to scenarios such as transcription, video subtitle generation, and meeting minutes.
To get results quickly, I call the Whisper model directly through the API on huggingface. First register a huggingface account, install the huggingface package in Unity, and set your huggingface Access Token in Unity; the API can then be called from Unity (for the details, see: How to Install and Use the Hugging Face Unity API).
The Unity code for calling huggingface is below; _submittedText is the recognized text returned by the API.
private void SendRecording()
{
    HuggingFaceAPI.AutomaticSpeechRecognition(bytes, response => {
        _submittedText = GenSubmitText(response);
    }, error => {
        _errorMsg = error;
    });
}
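GenSubmitText is not shown above. In the Hugging Face Unity API the success callback already receives the transcribed text as a string, so a minimal, hypothetical implementation needs to do little more than tidy it up:

// Hypothetical helper: the callback's response is the transcript itself,
// so we only trim surrounding whitespace before submitting it to the chat loop.
private string GenSubmitText(string response)
{
    return response.Trim();
}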
The AI dialog uses the open-source Llama 2 model. Meta's Llama 2 is a very capable open-source large model family, available in 7-billion, 13-billion, and 70-billion parameter variants. Llama 2 supports several deployment options; this project deploys it locally (for deployment, see: Tutorial on Deploying Meta's Open-Source Large Language Model Llama 2 on a Local Machine).
Unity development is done in C#, and I was lucky to find a C# inference project for Llama 2 on GitHub, which made things much easier (project: LLamaSharp).
After importing the LLamaSharp package into Unity, the human-machine dialog can begin; outputMessage holds the LLama model's reply. On to the code:
using LLama;
using LLama.Common;
using static LLama.StatefulExecutorBase;
using Cysharp.Threading.Tasks;
using System.Collections.Generic;
using System.Threading;
using UnityEngine;

public string ModelPath = "models/llama-2-7b-chat.Q4_0.gguf";
[TextArea(3, 10)]
public string SystemPrompt = "Transcript of a dialog, where I interacts with an Assistant named Amy. Amy is helpful, kind, honest, good at writing, and never fails to answer my requests immediately and with precision.\r\n\r\nI: Hello, Amy.\r\nAmy: Hello. How may I help you today?\r\nI: Please tell me the best city in Europe.\r\nAmy: Sure. The best city in Europe is Kyiv, the capital of Ukraine.\r\nI:";

private ExecutorBaseState _emptyState;
private ChatSession _chatSession;

private string _submittedText = "";
private string _errorMsg = "";
private CancellationTokenSource _cts;

async UniTaskVoid Start()
{
    _cts = new CancellationTokenSource();

    // Load a model
    var parameters = new ModelParams(Application.streamingAssetsPath + "/" + ModelPath)
    {
        ContextSize = 4096,
        Seed = 1337,
        GpuLayerCount = 35
    };
    // Switch to the thread pool for long-running operations
    await UniTask.SwitchToThreadPool();
    using var model = LLamaWeights.LoadFromFile(parameters);
    await UniTask.SwitchToMainThread();
    // Initialize a chat session
    using var context = model.CreateContext(parameters);
    var ex = new InteractiveExecutor(context);
    // Save the empty state for cases when we need to switch to an empty session
    _emptyState = ex.GetStateData();
    _chatSession = new ChatSession(ex);
    _chatSession.AddSystemMessage(SystemPrompt);

    // Run the inference in a loop to chat with the LLM
    await ChatRoutine(_cts.Token);
}

public async UniTask ChatRoutine(CancellationToken cancel = default)
{
    var userMessage = "";
    var outputMessage = "";
    while (!cancel.IsCancellationRequested)
    {
        // Allow input and wait for the user to submit a message
        SetInteractable(true); // UI helper (defined elsewhere) that enables/disables the input field
        await UniTask.WaitUntil(() => _submittedText != "");
        userMessage = _submittedText;
        _submittedText = "";
        outputMessage = "";

        // Disable input while processing the message; stream the reply token by token
        await foreach (var token in ChatConcurrent(
            _chatSession.ChatAsync(
                new ChatHistory.Message(AuthorRole.User, userMessage),
                new InferenceParams()
                {
                    Temperature = 0.6f,
                    AntiPrompts = new List<string> { " " }
                }
            )
        ))
        {
            outputMessage += token;
            await UniTask.NextFrame();
        }
    }
}
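ChatRoutine calls a ChatConcurrent helper that is not shown above; its job is to run inference off the main thread so the Unity UI stays responsive while tokens stream in. A minimal sketch, assuming it only needs to hop to the thread pool and back:

// Minimal sketch of the ChatConcurrent helper assumed by ChatRoutine:
// iterate the token stream on the thread pool, then return to the main thread.
private async IAsyncEnumerable<string> ChatConcurrent(IAsyncEnumerable<string> tokens)
{
    await UniTask.SwitchToThreadPool();
    await foreach (var token in tokens)
    {
        yield return token;
    }
    await UniTask.SwitchToMainThread();
}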
Text-to-speech uses the Bark model. Bark, created by Suno AI, is a Transformer-based text-to-audio model. It is end-to-end and can generate highly realistic multilingual speech as well as other audio, including music, background noise, and simple sound effects. Bark can also produce non-verbal sounds such as laughing, sighing, and crying.
I did not find C# inference code for Bark, so it is deployed locally (for deployment, see: The Strongest Text-to-Speech Tool: Bark, a Detailed Tutorial on Local Installation, Cloud Deployment, and Online Demo) and exposed over HTTP via uvicorn and FastAPI; Unity then calls the HTTP interface to convert text to speech.
The HTTP interface code for Bark inference:
import uvicorn
from fastapi import FastAPI
from fastapi.responses import FileResponse
from starlette.background import BackgroundTask
from bark import SAMPLE_RATE, generate_audio
from bark.generation import preload_models
from scipy.io.wavfile import write as write_wav
import os
import time
import random
import string

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

app = FastAPI()

# Load the Bark models once at startup instead of on the first request
preload_models()

def produce_file_name():
    # timestamp + 6 random alphanumeric characters, e.g. "1700000000aB3xYz.wav"
    timestamp = int(time.time())
    characters = string.ascii_letters + string.digits
    file_name = str(timestamp) + ''.join(random.choice(characters) for _ in range(6))
    file_name = file_name + ".wav"
    print(file_name)
    return file_name

@app.get("/GenAudio", summary="download audio file")
async def GenAudio(text: str, speaker: str):
    audio_array = generate_audio(text, history_prompt=speaker)

    # save the audio next to this script, and delete it once the response is sent
    file_name = produce_file_name()
    file_path = os.path.join(os.path.dirname(__file__), file_name)
    write_wav(file_path, SAMPLE_RATE, audio_array)
    return FileResponse(file_path, filename=file_name, media_type="audio/wav",
                        background=BackgroundTask(lambda: os.remove(file_path)))

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
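Once the server is running, the endpoint can be tested from a browser or any HTTP client before wiring it into Unity, e.g. http://localhost:8080/GenAudio?text=Hello%20there&speaker=v2/en_speaker_6 (here v2/en_speaker_6 is one of Bark's built-in voice presets; any valid history_prompt value works).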
In Unity, UnityWebRequest is used to call the HTTP interface and save the response as a WAV file:
// Requires: using System; using System.IO; using UnityEngine;
// using UnityEngine.Networking; using Cysharp.Threading.Tasks;
string queryStringText = Uri.EscapeDataString(text);
string queryStringSpeaker = Uri.EscapeDataString(speaker);
string queryString = "?text=" + queryStringText + "&speaker=" + queryStringSpeaker;
string urlWithParams = text2AudioUrl + queryString;
string filename = string.Empty;

using (UnityWebRequest request = UnityWebRequest.Get(urlWithParams))
{
    request.downloadHandler = new DownloadHandlerBuffer();
    await request.SendWebRequest().ToUniTask();

    if (request.isDone)
    {
        if (request.result == UnityWebRequest.Result.ProtocolError || request.result == UnityWebRequest.Result.ConnectionError)
        {
            Debug.Log(request.error);
        }
        else
        {
            // The file name comes from the Content-Disposition header set by FileResponse
            filename = request.GetResponseHeader("Content-Disposition").Split(';')[1].Split('=')[1].Trim('"');
            fullPath = Path.Combine(Application.dataPath, "..", audioFilesDict, filename);
            string directory = Path.GetDirectoryName(fullPath);
            if (!Directory.Exists(directory))
            {
                Directory.CreateDirectory(directory);
            }
            File.WriteAllBytes(fullPath, request.downloadHandler.data);
        }
    }
}
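With the reply saved as a WAV file, it still has to be played back. A minimal sketch of loading the file into an AudioClip and playing it on an AudioSource, using Unity's UnityWebRequestMultimedia (the talking animation that accompanies playback is the subject of the next article):

// Minimal sketch: load the saved WAV from disk and play it on an AudioSource.
private async UniTask PlayReply(string fullPath, AudioSource source)
{
    using (var request = UnityWebRequestMultimedia.GetAudioClip("file://" + fullPath, AudioType.WAV))
    {
        await request.SendWebRequest().ToUniTask();
        if (request.result == UnityWebRequest.Result.Success)
        {
            source.clip = DownloadHandlerAudioClip.GetContent(request);
            source.Play();
        }
    }
}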
At this point the core steps of the voice conversation are complete. The next step is to generate the digital human's talking animation from the speech, so that the user really seems to be conversing with a digital human; that part is saved for the next article. Finally, here is the result.
Demo video