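Natural language processing (NLP) is an important area of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language.
Preprocessing after word segmentation:
CYK algorithm (based on dynamic programming)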
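As a rough illustration of the dynamic-programming idea behind CYK, here is a minimal recognizer sketch. It assumes the grammar is already in Chomsky normal form; the toy grammar (`UNARY`, `BINARY`) and the function name `cyk_recognize` are made up for this example and do not come from any particular library.

```python
from itertools import product


def cyk_recognize(words, unary_rules, binary_rules, start="S"):
    """Return True if `words` can be derived from `start` under a CNF grammar.

    unary_rules maps a word to the set of nonterminals that produce it.
    binary_rules maps a pair (B, C) to the set of nonterminals A with A -> B C.
    """
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds every nonterminal that derives words[i..j] (inclusive).
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(unary_rules.get(word, ()))
    # Fill in longer spans from shorter ones: this is the dynamic programming step.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two children
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= binary_rules.get((left, right), set())
    return start in table[0][n - 1]


# Hypothetical toy grammar: S -> NP VP, VP -> V NP, plus three lexical rules.
UNARY = {"she": {"NP"}, "eats": {"V"}, "fish": {"NP"}}
BINARY = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}

print(cyk_recognize(["she", "eats", "fish"], UNARY, BINARY))
```

The table has O(n²) cells and each cell tries O(n) split points, which gives the usual O(n³) running time in the sentence length.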
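These projects are demo-level implementations, closer to paper reproductions than to traditional engineering projects. The methods they use are generally more advanced than what is common in industry, which makes them well suited for practice.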
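1. Word Segmentation
chqiwang/convseg, CNN-based Chinese word segmentation, with data and code.
Corresponding paper: Convolutional Neural Network with Word Embeddings for Chinese Word Segmentation, IJCNLP 2017.
2. Word Prediction
Kyubyong/word_prediction, CNN-based word prediction, with data and code.
3. Textual Entailment
Steven-Hewitt/Entailment-with-Tensorflow, textual entailment with TensorFlow, with data and code.
4. Automatic Speech Recognition
buriburisuri/speech-to-text-wavenet, sentence-level speech recognition based on DeepMind's WaveNet and TensorFlow.
5. Automatic Summarisation
PKULCWM/PKUSUMSUM, a collection of automatic summarisation methods from Prof. Wan Xiaojun's group at Peking University. It implements many of their papers and supports single-document, multi-document, and topic-focused multi-document summarisation.
6. Text Correction
atpaino/deep-text-corrector, text correction with deep learning, with data and code.
7. Grapheme to Phoneme
cmusphinx/g2p-seq2seq, based on the popular Transformer model, with data and code.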
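8. Paraphrase Detection and Question Answering
Paraphrase-Driven Learning for Open Question Answering, open-domain question answering based on paraphrase-driven learning.
9. Pinyin-To-Chinese
Kyubyong/neural_chinese_transliterator, CNN-based conversion between pinyin and Chinese characters.
10. Sentiment Analysis
Sentiment analysis covers too much ground, and I have not yet found a reasonably complete project. Two resources suitable for practice: Deeply Moving: Deep Learning for Sentiment Analysis, and http://sentic.net/about/.
11. Sign Language Recognition
SignAll, a project whose sign language recognition is very mature.
12. Part-of-speech (POS) tagging, named entity recognition (NER), syntactic parsing, semantic role labeling (SRL), and so on.
HIT-SCIR/ltp, includes code, models, and data, plus detailed documentation, and it performs well.
13. Word Stemming
snowballstem/snowball, its stemming works reasonably well.
14. Language Identification
https://github.com/saffsd/langid.py, a good open-source tool for language identification.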
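15. Machine Translation
OpenNMT/OpenNMT-py, neural machine translation based on PyTorch, well suited for practice.
16. Paraphrase Generation
vsuthichai/paraphraser, sentence-level paraphrase generation based on TensorFlow, suitable for practice.
17. Relationship Extraction
ankitp94/relationship-extraction, relationship extraction based on kernel methods.
18. Sentence Boundary Disambiguation
https://github.com/Orekhov/SentenceBreaking, an interesting project.
19. Event Extraction
liuhuanyong/ComplexEventExtraction, Chinese compound event extraction. It covers conditional, causal, sequential, and reversal events, and builds an event logic graph from them.
20. Word Sense Disambiguation
alvations/pywsd, a small codebase with simple methods, suitable for practice.
21. Named Entity Disambiguation
dice-group/AGDISTIS, entity disambiguation matters a great deal, especially for entity fusion (for example, merging multi-source data in a knowledge graph) and entity linking.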
22. Humor Detection
23. Sarcasm Detection
AniSkywalker/SarcasmDetection, sarcasm detection based on neural networks.
24. Entity Linking
hasibi/EntityLinkingRetrieval-ELR, entity linking has very broad uses and is well suited for practice.
25. Coreference Resolution
huggingface/neuralcoref, coreference resolution based on neural networks.
26. Keyphrase Extraction and Social Tag Suggestion
thunlp/THUTag, implements several methods for keyphrase extraction and social tag suggestion.
https://www.cnblogs.com/d0main/p/8176825.html
Example
INPUT:
Profits soared at Boeing Co., easily topping forecasts on Wall Street, as
their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V forecasts/N
on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N Mulally/N
announced/V first/ADJ quarter/N results/N ./.
KEY: N = Noun, V = Verb, P = Preposition, ADV = Adverb
Example
INPUT:
Profits soared at Boeing Co., easily topping forecasts on Wall Street, as
their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA
forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA
Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA
KEY: NA = No entity, SC = Start Company, CC = Continue Company, SL = Start Location, CL = Continue Location
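To show how the Start/Continue tags above turn into entity spans, here is a small decoding sketch. The function name `decode_entities` and the `TYPE_NAMES` table are invented for this example. The key above does not define SP/CP, so reading them as "Start Person" and "Continue Person" is an assumption.

```python
# Maps the second letter of a tag to an entity type.
# "P" -> "Person" is an assumption: the KEY above only defines C and L.
TYPE_NAMES = {"C": "Company", "L": "Location", "P": "Person"}


def decode_entities(tagged_text):
    """Turn 'word/TAG' tokens into a list of (entity text, entity type) pairs."""
    entities = []
    words, kind = [], None

    def flush():
        # Close the entity that is currently open, if any.
        if words:
            entities.append((" ".join(words), TYPE_NAMES.get(kind, kind)))

    for token in tagged_text.split():
        word, tag = token.rsplit("/", 1)
        if len(tag) == 2 and tag[0] == "S":
            # A Start tag closes any open entity and opens a new one.
            flush()
            words, kind = [word], tag[1]
        elif len(tag) == 2 and tag[0] == "C" and words and tag[1] == kind:
            # A Continue tag of the same type extends the open entity.
            words.append(word)
        else:
            # NA (or an inconsistent tag) ends the open entity.
            flush()
            words, kind = [], None
    flush()
    return entities


TAGGED = (
    "Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA "
    "forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA "
    "Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA"
)
print(decode_entities(TAGGED))
```

On the example sentence this yields three spans: "Boeing Co." as a company, "Wall Street" as a location, and "Alan Mulally" as a person (under the assumption above).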
Example:
Best roast chicken in San Francisco! – Positive
The waiter ignored us for 20 minutes. – Negative
Example: “Carter told Mubarak he shouldn’t run again.” The task is to decide whether “he” refers to “Carter” or “Mubarak”.
Example:
I need new batteries for my mouse. - “mouse” is ambiguous here.
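One classic way to approach this kind of ambiguity is a simplified Lesk-style heuristic: pick the sense whose dictionary gloss shares the most words with the surrounding context. The sketch below is only illustrative. The two glosses in `MOUSE_SENSES`, the stopword list, and the function names are hypothetical and written for this example, not taken from a real dictionary or library.

```python
# Small hand-written stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "for", "my", "i", "of", "to", "and", "or",
             "with", "that", "is", "in", "on", "by"}


def content_words(text):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    tokens = {w.strip(".,!?\"'").lower() for w in text.split()}
    return tokens - STOPWORDS - {""}


def simplified_lesk(context, sense_glosses):
    """Return the sense whose gloss overlaps most with the context words."""
    context_set = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context_set & content_words(gloss))
        if overlap > best_overlap:  # on ties, the earlier sense wins
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical glosses for the two senses of "mouse".
MOUSE_SENSES = {
    "animal": "a small rodent with a long tail that eats grain and cheese",
    "device": "a hand-held computer pointing device, often wireless and powered by batteries",
}

print(simplified_lesk("I need new batteries for my mouse.", MOUSE_SENSES))
```

Here the word "batteries" appears in the context and in the "device" gloss, so that sense is chosen. Real systems use much richer glosses and context than this toy overlap count.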
Parsing: the basic problem of working out the structure of a sentence.
Machine translation: translating sentences from one language to another; the best-known example is Google Translate.
Information extraction: taking text as input and representing it in a structured form, such as database entries.
Text summarization: taking one or more text documents as input and condensing them into a summary.
Example:
User - I need a flight from New York to London, arriving at 10 pm.
System - What day are you leaving?
User - Tomorrow.
The system detects the information that is missing from the user's request and asks for it.
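The exchange above can be mimicked with a very small slot-filling sketch: extract the slots the user has given, then ask about the first one still missing. Everything here (slot names, regular expressions, prompts) is a hypothetical toy written for this example; real dialogue systems use trained language understanding models rather than hand-written patterns.

```python
import re

# Slots a flight request needs, each with the question to ask if it is missing.
PROMPTS = {
    "origin": "Where are you flying from?",
    "destination": "Where are you flying to?",
    "arrival_time": "What time do you want to arrive?",
    "departure_day": "What day are you leaving?",
}

CITY = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*"  # one or more capitalized words
DAYS = r"today|tomorrow|monday|tuesday|wednesday|thursday|friday|saturday|sunday"


def extract_slots(utterance):
    """Pull whatever slots can be found out of a single user utterance."""
    slots = {}
    route = re.search(rf"from ({CITY}) to ({CITY})", utterance)
    if route:
        slots["origin"], slots["destination"] = route.group(1), route.group(2)
    arrival = re.search(r"arriving at (\d{1,2}(?::\d{2})? ?(?:am|pm))", utterance)
    if arrival:
        slots["arrival_time"] = arrival.group(1)
    day = re.search(rf"\b({DAYS})\b", utterance, re.IGNORECASE)
    if day:
        slots["departure_day"] = day.group(1).lower()
    return slots


def next_question(slots):
    """Return the prompt for the first missing slot, or None if all are filled."""
    for name, prompt in PROMPTS.items():
        if name not in slots:
            return prompt
    return None


state = extract_slots("I need a flight from New York to London, arriving at 10 pm?")
print(next_question(state))   # asks for the departure day
state.update(extract_slots("Tomorrow."))
print(next_question(state))   # nothing left to ask
```

Keeping the dialogue state as a plain dictionary makes it easy to merge information across turns, which is the behaviour the example conversation relies on.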
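When I have had time recently, I have been reading the nlper blog from the beginning, and I found the post "Most Influential NLP Papers" to be a useful reference. It was written in early 2006, so it is a little dated, but true gold does not fear fire, so I am posting it here for everyone's reference.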
“I conducted a mini survey recently, asking people I knew what they thought were the most influential papers in NLP from the past two decades. Here are the wholly unscientific results, sorted from most votes and subsorted by author. Note that I only got responses from 7 people. I’ve not listed papers that got only one vote and have not included my personal votes.”
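In the author's words, he ran a small survey, asking NLP researchers he knew which papers from the past two decades they considered the most influential. In fact, the author received replies from only seven people, six of whom were at the University of Southern California (where the author worked) or the University of Pennsylvania. The final results follow, sorted by number of votes and, for ties, by author name. Papers that received only one vote, and the author's own votes, are not included: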
(7 votes): Brown et al., 1993; The Mathematics of Statistical Machine Translation (statistical machine translation)
(5 votes): Collins, 1997; Three Generative, Lexicalised Models for Statistical Parsing (statistical parsing)
(4 votes): Marcus et al., 1993; Building a large annotated corpus of English: the Penn Treebank (corpora)
(3 votes): Berger et al., 1996; A maximum entropy approach to natural language processing (maximum entropy)
(2 votes): Bikel et al., 1997; An Algorithm that Learns What’s in a Name
(2 votes): Collins, 2002; Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms
(2 votes): Lafferty et al., 2001; Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data (conditional random fields)
(2 votes): Och, 2003; Minimum Error Rate Training for Statistical Machine Translation (statistical machine translation)
(2 votes): Papineni et al., 2001; Bleu: a method for automatic evaluation of machine translation (automatic evaluation of machine translation)
(2 votes): Ratnaparkhi, 1999; Learning to Parse Natural Language with Maximum Entropy Models
(2 votes): Yarowsky, 1995; Unsupervised Word Sense Disambiguation Rivaling Supervised Methods (word sense disambiguation)
The fields in parentheses are my own annotations. Machine translation probably takes three places because of the votes from the University of Southern California.
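I wonder whether we could run a similar survey here. One person's ability is limited, but a community's is not. If we nlpers act together, we may end up with a decent result that is of some use to everyone, including those who come later.
My initial idea is this: if you are familiar with some area of natural language processing or computational linguistics, list a few NLP papers that you consider influential. If I receive enough replies, I will compile the results and publish a summary similar to the nlper survey.
52nlp is still far from having the influence of nlper, and I do not know whether this survey will succeed, but I hope you, dear nlpers, will take part, even if only with one or two papers!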