
A Roundup of the Best Corpora in China and Abroad

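Corpora are essential to both linguistics research and deep learning. This post summarizes commonly used corpus resources. Some of the information comes from other blogs, and the post will be kept up to date.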

Open Speech and Language Resources
http://www.openslr.org/resources.php

Update (2020-06-10):

Several open-source Chinese speech datasets: https://blog.ailemon.me/2018/11/21/free-open-source-chinese-speech-datasets/

Update 2020/10/23

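AISHELL-3, a high-fidelity Mandarin speech corpus: the AISHELL (希尔贝壳) Mandarin speech corpus AISHELL-3 contains 85 hours of speech in 88,035 utterances and can be used for multi-speaker synthesis systems. It was recorded in a quiet indoor environment with a high-fidelity microphone (44.1 kHz, 16-bit). 218 speakers from different accent regions of China took part in the recording. Professional annotators labeled pinyin and prosody, and after strict quality inspection the transcription accuracy of the corpus is above 98%. (Available for academic research; commercial use is prohibited without permission.)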
DiDiSpeech: A Large Scale Mandarin Speech Corpus. It consists of about 800 hours of speech data at a 48 kHz sampling rate from 6000 speakers, with the corresponding texts. All speech data in the corpus was recorded in a quiet environment and is suitable for various speech processing tasks, such as voice conversion, multi-speaker text-to-speech and automatic speech recognition.
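
The figures quoted for AISHELL-3 and DiDiSpeech can be turned into a rough sense of scale. The sketch below is only back-of-the-envelope arithmetic: it assumes mono, uncompressed 16-bit PCM (the 16-bit depth is stated for AISHELL-3 but is an assumption for DiDiSpeech), and the helper name `corpus_stats` is hypothetical.

```python
def corpus_stats(hours, n_units, sample_rate_hz, bytes_per_sample=2, channels=1):
    """Return (average seconds per unit, approximate raw PCM size in GB).

    n_units is whatever the corpus is counted in (utterances, speakers, ...).
    Mono 16-bit PCM is assumed by default; real downloads may be compressed.
    """
    total_seconds = hours * 3600
    avg_seconds = total_seconds / n_units
    pcm_bytes = total_seconds * sample_rate_hz * bytes_per_sample * channels
    return avg_seconds, pcm_bytes / 1e9


# AISHELL-3: 85 hours, 88,035 utterances, 44.1 kHz, 16-bit
aishell3_avg, aishell3_gb = corpus_stats(85, 88035, 44100)

# DiDiSpeech: about 800 hours, 6,000 speakers, 48 kHz (16-bit is an assumption)
didi_avg, didi_gb = corpus_stats(800, 6000, 48000)

print(f"AISHELL-3: {aishell3_avg:.2f} s per utterance, ~{aishell3_gb:.0f} GB raw PCM")
print(f"DiDiSpeech: {didi_avg / 60:.1f} min per speaker, ~{didi_gb:.0f} GB raw PCM")
```

Under these assumptions an AISHELL-3 utterance averages about 3.5 seconds, and a DiDiSpeech speaker contributes about 8 minutes of audio on average.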

NHSS: A Speech and Singing Parallel Database
We present a database of parallel recordings of speech and singing, collected and released by the Human Language Technology (HLT) laboratory at the National University of Singapore (NUS), called the NUS-HLT Speak-Sing (NHSS) database. This database consists of recordings of sung vocals of English pop songs, the spoken counterpart of the lyrics read by the singers in their natural reading manner, and manually prepared utterance-level and word-level annotations. The audio recordings in the NHSS database correspond to a total of 100 songs sung and spoken by 10 singers, resulting in a total of 7 hours of audio data. There are 5 male and 5 female singers, singing and reading the lyrics of 10 songs each. We release this database to the public for research activities.

Update 2020/12/25
http://www.openslr.org/82/
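CN-Celeb, a multi-scenario speaker recognition dataset, contains speech segments from 3,000 Chinese celebrities in scenarios such as interviews, song and dance, music, and film and television. CN-Celeb2 was collected in much the same way as CN-Celeb1: all speech segments were extracted from the data sources by an automated pipeline and then verified manually. The CN-Celeb series as a whole covers complexity in noise, channel, speaking style and other respects, which makes it particularly suitable for research on speaker recognition in complex conditions.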

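Update 2021/02/10
Dataset name: speechocean762

Download link: http://www.openslr.org/101/ . The corresponding Kaldi recipe is at egs/gop_speechocean762. About the data:
Xiaomi's speech team and SpeechOcean (海天瑞声) jointly open-sourced what is described as the industry's first reasonably complete public dataset for English pronunciation assessment.
Language: English spoken by Chinese speakers, with balanced samples and broad content. The dataset contains 5,000 English sentences covering many aspects of daily life, recorded by 250 non-native English speakers whose mother tongue is Mandarin. Speaker gender and age are balanced: the male-to-female ratio is 1:1 and the child-to-adult ratio is 1:1. Speakers' English proficiency was carefully designed and screened, with good, medium and poor levels in a 2:1:1 ratio, so that feedback can be tested for learners at different levels of English pronunciation.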

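Data Baker (标贝) open-source dataset:

https://www.data-baker.com/#/data/index/source
Effective duration: about 12 hours
Average length: 16 characters
Language: standard Mandarin
Speaker: female; aged 20-30; positive, intellectual voice
Recording environment: a professional recording studio: 1) the studio meets professional voice-bank recording standards; 2) the recording environment and equipment stayed the same throughout; 3) the signal-to-noise ratio of the recording environment is no lower than 35 dB.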
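
To make the 35 dB signal-to-noise figure concrete, here is a minimal sketch of the standard SNR formula, 10 * log10(P_signal / P_noise), where P is mean power. How the speech and noise segments are separated (for example, taking noise from a silent stretch of a recording) is an assumption left outside the sketch, and the function names are hypothetical, not part of any Data Baker tooling.

```python
import math


def snr_db(signal, noise):
    """SNR in dB from two sample sequences, using mean power (mean of squares)."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)


def meets_threshold(signal, noise, threshold_db=35.0):
    """True if the measured SNR is at least threshold_db."""
    return snr_db(signal, noise) >= threshold_db


# Toy data: a unit-amplitude 440 Hz tone and a 50 Hz hum at amplitude 0.01,
# one second each at 16 kHz. The amplitude ratio of 100 corresponds to 40 dB.
rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
noise = [0.01 * math.sin(2 * math.pi * 50 * n / rate) for n in range(rate)]

print(f"{snr_db(signal, noise):.1f} dB, passes 35 dB: {meets_threshold(signal, noise)}")
```

An amplitude ratio of 100 gives a power ratio of 10,000, which is 40 dB, so this toy recording would clear the 35 dB bar.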

cmudict

http://www.speech.cs.cmu.edu/cgi-bin/cmudict
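
For readers who want to use cmudict programmatically, the sketch below parses a few lines in the plain-text layout of the classic dictionary file: a word, whitespace, then ARPAbet phonemes with stress digits, `;;;` for comment lines, and `WORD(1)` for alternate pronunciations. That layout is an assumption about the downloadable file rather than something stated on the lookup page linked above, and `SAMPLE` is a hand-written illustrative excerpt.

```python
import re

# Illustrative excerpt in the assumed cmudict text layout (not the real file).
SAMPLE = """\
;;; comment lines start with three semicolons
HELLO  HH AH0 L OW1
HELLO(1)  HH EH0 L OW1
WORLD  W ER1 L D
"""

# Alternate pronunciations are marked with a numeric suffix such as "(1)".
VARIANT = re.compile(r"\(\d+\)$")


def parse_cmudict(text):
    """Map each word to a list of pronunciations (each a list of phoneme symbols)."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        word = VARIANT.sub("", word)
        entries.setdefault(word, []).append(phones)
    return entries


lexicon = parse_cmudict(SAMPLE)
print(lexicon["HELLO"])
```

Grouping variants under one key keeps lookups simple; a TTS front end would then pick one pronunciation per word, typically the first.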

Cantonese NLP:
https://github.com/CanCLID/awesome-cantonese-nlp

IPA:
https://en.wikipedia.org/wiki/Pinyin
https://github.com/untunt/PhonoCollection/blob/master/Standard%20Chinese.md

Update 2021/06/22
An open-source, Chinese-English bilingual, multi-speaker emotional voice conversion (VC) database: Emotional Voice Conversion: Theory, Databases and ESD (https://arxiv.org/abs/2105.14762)

Update 2021/09/01
RyanSpeech Corpus

RyanSpeech is a new speech corpus for research on automated text-to-speech (TTS) systems. Publicly available TTS corpora are often noisy, recorded with multiple speakers, or do not have quality male speech data. In order to meet the need for a high-quality

http://mohammadmahoor.com/ryanspeech/

EVC:
https://arxiv.org/pdf/2105.14762.pdf

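Update 2021/09/08
AISHELL-4
http://www.aishelltech.com/aishell_4
AISHELL-4 is an eight-channel Mandarin speech dataset recorded with a microphone array in real meeting scenarios. It contains 211 meeting sessions with 4 to 8 participants each, for about 120 hours in total. The dataset aims to advance research on multi-speaker processing in real application scenarios. AISHELL-4 captures many important characteristics of real meetings, such as pauses, overlapping speech, speaker turns and noise. It also provides accurate transcripts and timestamps, so researchers can work on individual tasks such as front-end processing, speech recognition and speaker diarization, or optimize them jointly.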

Update 2021/10/14
WenetSpeech

A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
https://wenet-e2e.github.io/WenetSpeech/

Update 2022/02/15
A few more English corpora
LibriTTS corpus
http://openslr.magicdatatech.com/60/
Large-scale corpus of English speech derived from the original materials of the LibriSpeech corpus

Common Voice
https://commonvoice.mozilla.org/zh-CN/datasets

Hi-Fi Multi-Speaker English TTS Dataset (Hi-Fi TTS)
http://www.openslr.org/109/
About this resource:
Hi-Fi Multi-Speaker English TTS Dataset (Hi-Fi TTS) is a multi-speaker English dataset for training text-to-speech models. The dataset is based on public audiobooks from LibriVox and texts from Project Gutenberg.
The Hi-Fi TTS dataset contains about 291.6 hours of speech from 10 speakers with at least 17 hours per speaker sampled at 44.1 kHz.

Free ST American English Corpus
http://www.openslr.org/45/

VCTK
CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)
https://datashare.ed.ac.uk/handle/10283/3443

M-AILABS

Most of the data is based on LibriVox and Project Gutenberg. The training data consists of nearly a thousand hours of audio, with the text files in a prepared format.

https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/

More dataset directories:
http://openslr.magicdatatech.com/resources.php

https://github.com/coqui-ai/open-speech-corpora
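
Several of the datasets in this post live on OpenSLR, whose resource pages follow the pattern `<host>/<id>/`. The sketch below collects the IDs that appear in the links in this post and builds the page URLs for either host. The names `OPENSLR_IDS`, `HOSTS` and `resource_page` are hypothetical, and whether the mirror carries every resource is an assumption that has not been checked here.

```python
# OpenSLR resource IDs taken from the links quoted in this post.
OPENSLR_IDS = {
    "CN-Celeb": 82,
    "speechocean762": 101,
    "LibriTTS": 60,
    "Hi-Fi TTS": 109,
    "Free ST American English": 45,
}

# The main site and the mirror host that both appear in this post.
HOSTS = {
    "main": "http://www.openslr.org",
    "mirror": "http://openslr.magicdatatech.com",
}


def resource_page(name, host="main"):
    """Build the resource page URL for a dataset name on the chosen host."""
    return f"{HOSTS[host]}/{OPENSLR_IDS[name]}/"


for name in OPENSLR_IDS:
    print(f"{name}: {resource_page(name)}")
```

Passing `host="mirror"` switches to the mirror, which may be faster from mainland China.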

Update 2023/08/06
StarRail Dataset: a multi-speaker game voice database from miHoYo (米哈游)

https://github.com/AI-Hobbyist/StarRail_Datasets/tree/main/Label%20%26%20Voice
librispeech clean:
http://www.openslr.org/141/

Update 2023-08-24
https://www.tedownload.com/ (downloads of The Economist)
https://github.com/hehonghui/awesome-english-ebooks
https://www.douban.com/group/topic/283251376/?_i=2814330H4g_IOf

Update 2024/04/29
StoryTTS dataset from Shanghai Jiao Tong University: StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations


Others:

Corpora outside China ❀❀❀

BNC: British National Corpus: http://www.natcorp.ox.ac.uk/

BOE: Collins English corpus (the Bank of English): http://www.collinslanguage.com/wordbanks/

United Nations document database (800,000 parallel documents in six languages): http://documents.un.org/simple.asp

ANC: American National Corpus: http://www.anc.org/

Lancaster Corpus of Mandarin Chinese (LCMC): http://ota.oucs.ox.ac.uk/s/download.php?otaid=2474

OLAC: Open Language Archives Community: http://search.language-archives.org/index.html

COCA: Corpus of Contemporary American English: http://www.americancorpus.org/

COHA: Corpus of Historical American English: http://corpus.byu.edu/coha/

Sketch Engine multilingual corpora: www.sketchengine.co.uk

BASE: British Academic Spoken English Corpus: http://www2.warwick.ac.uk/fac/soc/celte/research/base/

Leeds: http://corpus.leeds.ac.uk/internet.html

JustTheWord: http://193.133.140.102/JustTheWord/index.html

Lextutor: http://www.lextutor.ca/

Web Concordancer: www.edict.com.hk

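Corpora in China ❀❀❀
BCC Corpus: http://bcc.blcu.edu.cn/

Corpus: http://yulk.org/

Corpus Online: http://www.cncorpus.org/

Center for Chinese Linguistics, Peking University: http://ccl.pku.edu.cn/corpus.asp

State Language Commission Modern Chinese Corpus: http://www.cncorpus.org/

BFSU Corpus Linguistics: http://www.bfsu-corpus.org/

Ancient Chinese corpus: http://www.cncorpus.org/login.aspx

Corpus Linguistics Online: http://ccl.pku.edu.cn/corpus.asp

People's Daily annotated corpus: http://www.icl.pku.edu.cn/icl_res/

Research and Development Center for Chinese International Education Technology, HSK Dynamic Composition Corpus: http://202.112.195.192:8060/hsk/login.asp

Institute of Linguistics, Beijing Spoken Language Corpus query system (BJKY): http://www.blcu.edu.cn/yys/6_beijing/6_beijing_chaxun.asp

Balanced Corpus of Modern Chinese: http://www.sinica.edu.tw/SinicaCorpus/

Ancient Chinese corpus: http://www.sinica.edu.tw/ftms-bin/ftmsw

Tagged Corpus of Early Mandarin Chinese: http://www.sinica.edu.tw/Early_Mandarin/

Treebank database: http://treebank.sinica.edu.tw/

Chinese-English Bilingual Ontological WordNet: http://bow.sinica.edu.tw/

Sou Wen Jie Zi (搜文解字): http://words.sinica.edu.tw/

Wen Guo Xun Bao Ji (文国寻宝记): http://www.sinica.edu.tw/wen/

Three Hundred Tang Poems: http://cls.admin.yzu.edu.tw/300/

Electronic texts of Chinese classics: http://www.sinica.edu.tw/~tdbproj/handy1/

Dream of the Red Chamber online teaching and research data center: http://cls.hs.yzu.edu.tw/HLM/home.htm

Communication University of China text corpus retrieval system: http://ling.cuc.edu.cn/RawPub/

Shared corpus resources of the HIT Information Retrieval Lab: http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm

Language information sciences research centre of the Hong Kong Institute of Education and its corpus laboratory: http://www.livac.org/index.php?lang=sc

Chinese Linguistic Data Consortium: http://www.chineseldc.org/

Brigham Young University corpora: http://view.byu.edu/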
