
Using the Flowtron Text-to-Speech Model: fixing "cannot import name 'betabinom' from 'scipy.stats'"


Preface

I recently needed a text-to-speech network model for a project. A 2020 paper does exactly this, and the authors open-sourced the code, so let's dig in.

Paper
Flowtron: an Autoregressive Flow-based Network for Text-to-Mel-spectrogram Synthesis

GitHub
https://github.com/NVIDIA/flowtron

Environment Setup

Setup
Note that you need PyTorch installed (we're all doing AI here, surely nobody is missing PyTorch)

git clone https://github.com/NVIDIA/flowtron.git
cd flowtron
git submodule update --init; cd tacotron2; git submodule update --init
pip install -r requirements.txt
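
Before going further, a quick sanity check that PyTorch and CUDA are visible (my own snippet, not part of the repo; inference.py calls .cuda() throughout, so a CUDA device is required):

import torch

print(torch.__version__)          # any recent 1.x release should do
print(torch.cuda.is_available())  # must be True for inference.py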

Issues:
(1) If you download the source over HTTPS as a zip and extract it, the git commands above fail because the directory is not a git repository.
(2) The installed scipy version is too high:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
paddlepaddle-gpu 1.7.2.post107 requires scipy<=1.3.1; python_version >= "3.5", but you have scipy 1.5.2 which is incompatible.

First, uninstall scipy:

D:\flowtron\flowtron>python -m pip uninstall scipy
Found existing installation: scipy 1.5.2
Uninstalling scipy-1.5.2:
  Would remove:
    g:\python\lib\site-packages\scipy-1.5.2.dist-info\*
    g:\python\lib\site-packages\scipy\*
Proceed (Y/n)? y
  Successfully uninstalled scipy-1.5.2

Then install scipy 1.3.1:

D:\flowtron\flowtron>python -m pip install scipy==1.3.1
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting scipy==1.3.1
  Downloading https://mirrors.aliyun.com/pypi/packages/50/eb/defa40367863304e1ef01c6572584c411446a5f29bdd9dc90f91509e9144/scipy-1.3.1-cp37-cp37m-win_amd64.whl (30.3 MB)
     |████████████████████████████████| 30.3 MB 57 kB/s
Requirement already satisfied: numpy>=1.13.3 in g:\python\lib\site-packages (from scipy==1.3.1) (1.19.2)
Installing collected packages: scipy
Successfully installed scipy-1.3.1
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the 'G:\Python\python.exe -m pip install --upgrade pip' command.

Code

Let's look at the argument parsing in inference.py:

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--config', type=str,
                        help='JSON file for configuration')
    parser.add_argument('-p', '--params', nargs='+', default=[])
    parser.add_argument('-f', '--flowtron_path',
                        help='Path to flowtron state dict', type=str)
    parser.add_argument('-w', '--waveglow_path',
                        help='Path to waveglow state dict', type=str)
    parser.add_argument('-t', '--text', help='Text to synthesize', type=str)
    parser.add_argument('-i', '--id', help='Speaker id', type=int)
    parser.add_argument('-n', '--n_frames', help='Number of frames',
                        default=400, type=int)
    parser.add_argument('-o', "--output_dir", default="results/")
    parser.add_argument("-s", "--sigma", default=0.5, type=float)
    parser.add_argument("-g", "--gate", default=0.5, type=float)
    parser.add_argument("--seed", default=1234, type=int)
    args = parser.parse_args()

    # Parse configs.  Globals nicer in this case
    with open(args.config) as f:
        data = f.read()

    global config
    config = json.loads(data)
    update_params(config, args.params)

    data_config = config["data_config"]
    global model_config
    model_config = config["model_config"]

    # Make directory if it doesn't exist
    if not os.path.isdir(args.output_dir):
        os.makedirs(args.output_dir)
        os.chmod(args.output_dir, 0o775)

    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.benchmark = False
    infer(args.flowtron_path, args.waveglow_path, args.output_dir, args.text,
          args.id, args.n_frames, args.sigma, args.gate, args.seed)
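
One flag worth highlighting: -p/--params is handed to update_params, which overrides nested config entries from the command line as dotted key=value pairs. A hedged example (the key path assumes the stock config.json layout):

python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_v4.pt -t "hello world" -i 0 -p data_config.sampling_rate=22050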

Now the infer source:

def infer(flowtron_path, waveglow_path, output_dir, text, speaker_id, n_frames,
          sigma, gate_threshold, seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

    # load waveglow
    waveglow = torch.load(waveglow_path)['model'].cuda().eval()
    waveglow.cuda().half()
    for k in waveglow.convinv:
        k.float()
    waveglow.eval()

    # load flowtron
    model = Flowtron(**model_config).cuda()
    state_dict = torch.load(flowtron_path, map_location='cpu')['state_dict']
    model.load_state_dict(state_dict)
    model.eval()
    print("Loaded checkpoint '{}')" .format(flowtron_path))

    ignore_keys = ['training_files', 'validation_files']
    trainset = Data(
        data_config['training_files'],
        **dict((k, v) for k, v in data_config.items() if k not in ignore_keys))
    speaker_vecs = trainset.get_speaker_id(speaker_id).cuda()
    text = trainset.get_text(text).cuda()
    speaker_vecs = speaker_vecs[None]
    text = text[None]

    with torch.no_grad():
        residual = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma
        mels, attentions = model.infer(
            residual, speaker_vecs, text, gate_threshold=gate_threshold)

    for k in range(len(attentions)):
        attention = torch.cat(attentions[k]).cpu().numpy()
        fig, axes = plt.subplots(1, 2, figsize=(16, 4))
        axes[0].imshow(mels[0].cpu().numpy(), origin='bottom', aspect='auto')
        axes[1].imshow(attention[:, 0].transpose(), origin='bottom', aspect='auto')
        fig.savefig(os.path.join(output_dir, 'sid{}_sigma{}_attnlayer{}.png'.format(speaker_id, sigma, k)))
        plt.close("all")

    with torch.no_grad():
        audio = waveglow.infer(mels.half(), sigma=0.8).float()

    audio = audio.cpu().numpy()[0]
    # normalize audio for now
    audio = audio / np.abs(audio).max()
    print(audio.shape)

    write(os.path.join(output_dir, 'sid{}_sigma{}.wav'.format(speaker_id, sigma)),
          data_config['sampling_rate'], audio)
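
The residual line deserves a pause: Flowtron is a normalizing flow, so inference starts from Gaussian noise of shape (batch, 80 mel channels, n_frames) and runs the flow in reverse to turn that noise into a mel-spectrogram; sigma scales the noise's standard deviation and thus how much variation ends up in the speech. A minimal CPU illustration of the tensor being sampled (my snippet, not repo code):

import torch

sigma, n_frames = 0.5, 400                    # the inference.py defaults
residual = torch.randn(1, 80, n_frames) * sigma
print(residual.shape, residual.std().item())  # torch.Size([1, 80, 400]), roughly 0.5

WaveGlow then vocodes the resulting mels into a waveform (note it runs in half precision, with its own hard-coded sigma=0.8).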

Usage

Example
First, download the pre-trained models.
The flowtron_ljs.pt file on Google Drive:
https://drive.google.com/file/d/1Cjd6dK_eFz6DE0PKXKgKxrzTUqzzUDW-/view

You also need the waveglow_256channels_v4.pt model file.
WaveGlow on GitHub:
https://github.com/NVIDIA/WaveGlow
The latest v5 model on Google Drive:
https://drive.google.com/file/d/1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF/view
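
Before running it, you can sanity-check the downloads by opening them with torch. Judging from the infer code above, the Flowtron checkpoint is a plain dict with a 'state_dict' key; the WaveGlow file pickles the whole model object, so it only unpickles if the WaveGlow source is importable:

import torch

ckpt = torch.load('models/flowtron_ljs.pt', map_location='cpu')
print(sorted(ckpt.keys()))  # infer() reads ckpt['state_dict']

Then run inference: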

python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_v4.pt -t "It is well known that deep generative models have a rich latent space!" -i 0

Now switch the command over to the v5 model:

python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_universal_v5.pt -t "It is well known that deep generative models have a rich latent space!" -i 0

Error:

ImportError: cannot import name 'betabinom' from 'scipy.stats'

This is a scipy version problem, so the pip warning earlier was misleading (I'm keeping it above for comparison). betabinom first appeared in scipy 1.4.0, so the 1.3.1 we just downgraded to is exactly what breaks here. Upgrade scipy:

python -m pip install scipy --upgrade
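
After the upgrade, a one-line verification (betabinom was added in scipy 1.4.0, so any 1.4+ version imports cleanly):

import scipy
from scipy.stats import betabinom  # raises ImportError on scipy < 1.4.0

print(scipy.__version__)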

The next error:

ModuleNotFoundError: No module named 'numba.decorators'

A numba version problem: newer numba releases dropped the numba.decorators module, so uninstall numba and pin it to 0.48 (uninstalling first may not actually be necessary):

pip uninstall numba
pip install numba==0.48
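
To confirm the pin took, import librosa, which is what actually pulls in numba.decorators (a quick check of my own):

import numba
import librosa  # fails with the numba.decorators error if the pin didn't work

print(numba.__version__)  # expect 0.48.0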

The next error:

ValueError: 'bottom' is not a valid value for origin; supported values are 'upper', 'lower'

This one comes from matplotlib: newer releases validate the origin argument and accept only 'upper' and 'lower' ('bottom' used to slip through). Patch the plotting loop in infer accordingly:

    for k in range(len(attentions)):
        attention = torch.cat(attentions[k]).cpu().numpy()
        fig, axes = plt.subplots(1, 2, figsize=(16, 4))
        axes[0].imshow(mels[0].cpu().numpy(), origin='lower', aspect='auto') # origin='bottom'
        axes[1].imshow(attention[:, 0].transpose(), origin='lower', aspect='auto') # origin='bottom'
        fig.savefig(os.path.join(output_dir, 'sid{}_sigma{}_attnlayer{}.png'.format(speaker_id, sigma, k)))
        plt.close("all")
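
For reference, origin='lower' draws row 0 of the array at the bottom of the axes, which is what you want for a mel-spectrogram (low frequencies at the bottom). A tiny standalone illustration:

import numpy as np
import matplotlib.pyplot as plt

a = np.arange(6).reshape(2, 3)
plt.imshow(a, origin='lower', aspect='auto')  # row 0 at the bottom, row 1 above it
plt.savefig('origin_demo.png')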

Results
The results folder now contains the generated audio. To my ear it sounds quite good: it speaks the sentence "It is well known that deep generative models have a rich latent space!" (does CSDN support embedding audio files? I can't demo it directly here).


A Brief Look at the Theory

To be added later…

Summary

Running open-source code as-is usually throws errors; most of the time it's a dependency-version mismatch, and a quick Google / Stack Overflow search sorts it out.
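
For convenience, the version fixes from this walkthrough in one place (these are what worked in my Python 3.7 environment, not universal pins):

python -m pip install --upgrade scipy   # >= 1.4.0 for scipy.stats.betabinom
python -m pip install numba==0.48       # keeps numba.decorators for librosa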

Postscript
A binary-exploitation player gradually drifting into data science and AI security… I didn't see it coming either.
But AI, like IoT, is where things are headed, so I'll go with the tide; now that I'm here, I'll put in honest work and stop overthinking it (I'll still poke at binaries when time allows).
