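I recently needed a text-to-speech network model for an engineering project and came across a 2020 paper that does exactly this. The authors have also open-sourced the code, so here is a look at it.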
paper
Flowtron: an Autoregressive Flow-based Network for Text-to-Mel-spectrogram Synthesis
github
https://github.com/NVIDIA/flowtron
Setup
Note: PyTorch must be installed first (anyone working in AI presumably has it already).
git clone https://github.com/NVIDIA/flowtron.git
cd flowtron
git submodule update --init; cd tacotron2; git submodule update --init
pip install -r requirements.txt
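Problems:
(1) If you download the archive over HTTPS and unzip it instead of cloning, the submodule step fails with an error that git is not configured (presumably because the unzipped folder is not a git repository)
(2) The installed scipy version is too new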
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
paddlepaddle-gpu 1.7.2.post107 requires scipy<=1.3.1; python_version >= "3.5", but you have scipy 1.5.2 which is incompatible.
Uninstall scipy first
D:\flowtron\flowtron>python -m pip uninstall scipy
Found existing installation: scipy 1.5.2
Uninstalling scipy-1.5.2:
Would remove:
g:\python\lib\site-packages\scipy-1.5.2.dist-info\*
g:\python\lib\site-packages\scipy\*
Proceed (Y/n)? y
Successfully uninstalled scipy-1.5.2
Then install scipy 1.3.1
D:\flowtron\flowtron>python -m pip install scipy==1.3.1
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting scipy==1.3.1
Downloading https://mirrors.aliyun.com/pypi/packages/50/eb/defa40367863304e1ef01c6572584c411446a5f29bdd9dc90f91509e9144/scipy-1.3.1-cp37-cp37m-win_amd64.whl (30.3 MB)
|████████████████████████████████| 30.3 MB 57 kB/s
Requirement already satisfied: numpy>=1.13.3 in g:\python\lib\site-packages (from scipy==1.3.1) (1.19.2)
Installing collected packages: scipy
Successfully installed scipy-1.3.1
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the 'G:\Python\python.exe -m pip install --upgrade pip' command.
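To make the pip message above concrete, here is a minimal sketch of the comparison behind "requires scipy<=1.3.1 ... but you have scipy 1.5.2". The helpers version_tuple and satisfies are hypothetical names of my own. Real pip uses full PEP 440 parsing; this sketch only handles plain dotted integers with the same number of parts.

```python
import operator

# Comparison operators for simple specifiers such as "<=1.3.1".
_OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
}


def version_tuple(version):
    """Turn '1.5.2' into (1, 5, 2); only plain dotted integers are handled."""
    return tuple(int(part) for part in version.split("."))


def satisfies(installed, op, required):
    """Return True if `installed` meets the constraint `op required`."""
    return _OPS[op](version_tuple(installed), version_tuple(required))


# The situation from the log: paddlepaddle-gpu wants scipy<=1.3.1.
print(satisfies("1.5.2", "<=", "1.3.1"))  # False: the conflict pip reported
print(satisfies("1.3.1", "<=", "1.3.1"))  # True: after the downgrade
```

Note that the constraint comes from paddlepaddle-gpu, not from flowtron, which is why "fixing" it here turns out to be the wrong move later on.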
The argument settings in inference.py
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--config', type=str,
                        help='JSON file for configuration')
    parser.add_argument('-p', '--params', nargs='+', default=[])
    parser.add_argument('-f', '--flowtron_path',
                        help='Path to flowtron state dict', type=str)
    parser.add_argument('-w', '--waveglow_path',
                        help='Path to waveglow state dict', type=str)
    parser.add_argument('-t', '--text', help='Text to synthesize', type=str)
    parser.add_argument('-i', '--id', help='Speaker id', type=int)
    parser.add_argument('-n', '--n_frames', help='Number of frames',
                        default=400, type=int)
    parser.add_argument('-o', "--output_dir", default="results/")
    parser.add_argument("-s", "--sigma", default=0.5, type=float)
    parser.add_argument("-g", "--gate", default=0.5, type=float)
    parser.add_argument("--seed", default=1234, type=int)
    args = parser.parse_args()

    # Parse configs. Globals nicer in this case
    with open(args.config) as f:
        data = f.read()

    global config
    config = json.loads(data)
    update_params(config, args.params)

    data_config = config["data_config"]
    global model_config
    model_config = config["model_config"]

    # Make directory if it doesn't exist
    if not os.path.isdir(args.output_dir):
        os.makedirs(args.output_dir)
        os.chmod(args.output_dir, 0o775)

    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.benchmark = False
    infer(args.flowtron_path, args.waveglow_path, args.output_dir, args.text,
          args.id, args.n_frames, args.sigma, args.gate, args.seed)
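To see which defaults apply when only the required paths, text and speaker id are passed, the parser can be rebuilt on its own and fed an argument list. This is a sketch that copies the add_argument calls from the listing above. The build_parser wrapper and the sample text are mine, and nothing here loads a model.

```python
import argparse


def build_parser():
    """Rebuild the argument parser shown in the inference.py listing."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--config', type=str,
                        help='JSON file for configuration')
    parser.add_argument('-p', '--params', nargs='+', default=[])
    parser.add_argument('-f', '--flowtron_path',
                        help='Path to flowtron state dict', type=str)
    parser.add_argument('-w', '--waveglow_path',
                        help='Path to waveglow state dict', type=str)
    parser.add_argument('-t', '--text', help='Text to synthesize', type=str)
    parser.add_argument('-i', '--id', help='Speaker id', type=int)
    parser.add_argument('-n', '--n_frames', help='Number of frames',
                        default=400, type=int)
    parser.add_argument('-o', "--output_dir", default="results/")
    parser.add_argument("-s", "--sigma", default=0.5, type=float)
    parser.add_argument("-g", "--gate", default=0.5, type=float)
    parser.add_argument("--seed", default=1234, type=int)
    return parser


# Parse an explicit list instead of sys.argv so this runs anywhere.
args = build_parser().parse_args([
    "-c", "config.json",
    "-f", "models/flowtron_ljs.pt",
    "-w", "models/waveglow_256channels_v4.pt",
    "-t", "Hello world",  # placeholder text, not from the original post
    "-i", "0",
])

# Everything not given on the command line falls back to its default.
print(args.n_frames, args.sigma, args.gate, args.seed, args.output_dir)
# 400 0.5 0.5 1234 results/
print(args.params)
# []
```

Note that -i is stored as args.id and -g as args.gate, which is why the main block passes args.id and args.gate into infer.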
Now the source of infer
def infer(flowtron_path, waveglow_path, output_dir, text, speaker_id, n_frames,
          sigma, gate_threshold, seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

    # load waveglow
    waveglow = torch.load(waveglow_path)['model'].cuda().eval()
    waveglow.cuda().half()
    for k in waveglow.convinv:
        k.float()
    waveglow.eval()

    # load flowtron
    model = Flowtron(**model_config).cuda()
    state_dict = torch.load(flowtron_path, map_location='cpu')['state_dict']
    model.load_state_dict(state_dict)
    model.eval()
    print("Loaded checkpoint '{}')" .format(flowtron_path))

    ignore_keys = ['training_files', 'validation_files']
    trainset = Data(
        data_config['training_files'],
        **dict((k, v) for k, v in data_config.items() if k not in ignore_keys))
    speaker_vecs = trainset.get_speaker_id(speaker_id).cuda()
    text = trainset.get_text(text).cuda()
    speaker_vecs = speaker_vecs[None]
    text = text[None]

    with torch.no_grad():
        residual = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma
        mels, attentions = model.infer(
            residual, speaker_vecs, text, gate_threshold=gate_threshold)

    for k in range(len(attentions)):
        attention = torch.cat(attentions[k]).cpu().numpy()
        fig, axes = plt.subplots(1, 2, figsize=(16, 4))
        axes[0].imshow(mels[0].cpu().numpy(), origin='bottom', aspect='auto')
        axes[1].imshow(attention[:, 0].transpose(), origin='bottom', aspect='auto')
        fig.savefig(os.path.join(output_dir, 'sid{}_sigma{}_attnlayer{}.png'.format(speaker_id, sigma, k)))
        plt.close("all")

    with torch.no_grad():
        audio = waveglow.infer(mels.half(), sigma=0.8).float()

    audio = audio.cpu().numpy()[0]
    # normalize audio for now
    audio = audio / np.abs(audio).max()
    print(audio.shape)

    write(os.path.join(output_dir, 'sid{}_sigma{}.wav'.format(speaker_id, sigma)),
          data_config['sampling_rate'], audio)
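Two small steps in infer are easy to try without a GPU: drawing the Gaussian residual scaled by sigma, and peak-normalizing the audio before it is written. The following is a pure-Python sketch of the idea, not the repo's code. The helper names are mine, the batch dimension is dropped, and the guard against silent audio is an addition the original does not have.

```python
import random


def sample_residual(n_mels, n_frames, sigma, seed):
    """Draw an n_mels x n_frames grid of N(0, 1) samples scaled by sigma."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) * sigma for _ in range(n_frames)]
            for _ in range(n_mels)]


def peak_normalize(samples):
    """Scale samples so the largest absolute value becomes 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Assumption: leave silent audio untouched instead of dividing by zero.
        return list(samples)
    return [s / peak for s in samples]


# Same sizes as the defaults above: 80 mel channels, 400 frames, sigma 0.5.
residual = sample_residual(n_mels=80, n_frames=400, sigma=0.5, seed=1234)
print(len(residual), len(residual[0]))  # 80 400

print(peak_normalize([0.5, -2.0, 1.0]))  # [0.25, -1.0, 0.5]

# The output file name is built from the speaker id and sigma.
print('sid{}_sigma{}.wav'.format(0, 0.5))  # sid0_sigma0.5.wav
```

A smaller sigma keeps the residual closer to zero, which is why it acts as a knob on how much variation ends up in the generated speech.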
example
First download the pre-trained models: the flowtron_ljs.pt file on Google Drive
https://drive.google.com/file/d/1Cjd6dK_eFz6DE0PKXKgKxrzTUqzzUDW-/view
and the waveglow_256channels_v4.pt model file
WaveGlow GitHub repository
https://github.com/NVIDIA/WaveGlow
The latest v5 model on Google Drive
https://drive.google.com/file/d/1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF/view
python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_v4.pt -t "It is well know that deep generative models have a rich latent space!" -i 0
I downloaded the v5 WaveGlow model, so change the -w argument to match
python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_universal_v5.pt -t "It is well know that deep generative models have a rich latent space!" -i 0
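Since the command depends on three files being in the right place, a quick existence check before launching can save a slow start that ends in a file-not-found error. This is a small sketch. The paths are taken from the command above and assume the checkpoints were saved into a models/ folder inside the repo.

```python
import os


def missing_files(paths):
    """Return the subset of paths that do not point at an existing file."""
    return [p for p in paths if not os.path.isfile(p)]


# Paths as used in the command above (relative to the flowtron folder).
required = [
    "config.json",
    "models/flowtron_ljs.pt",
    "models/waveglow_256channels_universal_v5.pt",
]

missing = missing_files(required)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All files found; ready to run inference.py")
```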
Error
ImportError: cannot import name 'betabinom' from 'scipy.stats'
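This is a scipy version problem: 1.3.1 is too old to provide betabinom, so the dependency error above was misleading (I left it in the post for comparison). Upgrade scipy again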
python -m pip install scipy --upgrade
Next error
ModuleNotFoundError: No module named 'numba.decorators'
This is a numba version problem. Uninstall numba and install version 0.48 (uninstalling first may not be necessary)
pip uninstall numba
pip install numba==0.48
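Rather than finding these import errors one run at a time, the two imports that failed above can be probed up front. This is a minimal sketch. The can_import helper is a name of my own, and the list only covers the two failures seen in this post, not everything flowtron needs.

```python
import importlib


def can_import(module_name, attribute=None):
    """Return True if the module imports and (optionally) has the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:  # also covers ModuleNotFoundError
        return False
    return attribute is None or hasattr(module, attribute)


# The two imports that failed in this post.
checks = [
    ("scipy.stats", "betabinom"),
    ("numba.decorators", None),
]

for module_name, attribute in checks:
    label = module_name if attribute is None else "{}.{}".format(module_name, attribute)
    status = "ok" if can_import(module_name, attribute) else "MISSING"
    print("{:<28} {}".format(label, status))
```

Anything reported as MISSING points at a package that is absent or at the wrong version.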
Next error
ValueError: 'bottom' is not a valid value for origin; supported values are 'upper', 'lower'
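As the message says, only 'upper' and 'lower' are accepted, so change origin='bottom' to origin='lower' in the plotting loop of infer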
for k in range(len(attentions)):
    attention = torch.cat(attentions[k]).cpu().numpy()
    fig, axes = plt.subplots(1, 2, figsize=(16, 4))
    axes[0].imshow(mels[0].cpu().numpy(), origin='lower', aspect='auto')  # was origin='bottom'
    axes[1].imshow(attention[:, 0].transpose(), origin='lower', aspect='auto')  # was origin='bottom'
    fig.savefig(os.path.join(output_dir, 'sid{}_sigma{}_attnlayer{}.png'.format(speaker_id, sigma, k)))
    plt.close("all")
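For anyone wondering what origin actually changes: it decides whether row 0 of the array is drawn at the top or at the bottom of the image. The toy function below only illustrates that ordering and is not Matplotlib's implementation. The name rows_top_to_bottom is made up, and treating row 0 as the lowest mel bin is my assumption about the spectrogram layout.

```python
def rows_top_to_bottom(matrix, origin="upper"):
    """Return the rows of `matrix` in on-screen order, top row first."""
    if origin == "upper":
        return [list(row) for row in matrix]
    if origin == "lower":
        return [list(row) for row in reversed(matrix)]
    raise ValueError(
        "origin must be 'upper' or 'lower', got {!r}".format(origin))


mel = [
    [1, 1, 1],  # row 0: assumed to be the lowest mel bin
    [2, 2, 2],
    [3, 3, 3],  # row 2: highest mel bin
]

# With origin='lower', row 0 ends up at the bottom of the picture.
for row in rows_top_to_bottom(mel, origin="lower"):
    print(row)
# [3, 3, 3]
# [2, 2, 2]
# [1, 1, 1]
```

So 'lower' keeps the look the original code intended, with low frequencies at the bottom of the spectrogram plot.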
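Result
The results folder now contains the generated audio file. To my ear the quality is good; it speaks the sentence "It is well know that deep generative models have a rich latent space!" (can CSDN host audio files? I can't demo it directly here)
To be added later…
Running the code as-is usually fails. The cause is generally a dependency version mismatch, and a quick Google / Stack Overflow search sorts it out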
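Afterword
A binary-security person gradually drifting into data science and AI security… I didn't see that coming either
Still, AI, like IoT, is where things are heading, so I can only go with the flow. Since I'm here, I'll just get on with the work and stop overthinking it (I'll get back to binary stuff when I have time)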