
NLP Code Template Collection


1 Basic Word Operations

1.1 Downloading Stopwords with NLTK

Difficulty Level : L1

The point here is downloading, not using: the stopword corpus must be downloaded once before it can be used.

# Downloading packages and importing

import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize()
nltk.download('stopwords')  # the stopword corpus

#> [nltk_data] Downloading package punkt to /root/nltk_data...
#> [nltk_data]   Unzipping tokenizers/punkt.zip.
#> [nltk_data] Downloading package stopwords to /root/nltk_data...
#> [nltk_data]   Unzipping corpora/stopwords.zip.
#> True
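
As a quick sanity check (a minimal sketch, not part of the original exercise), you can list a few of the downloaded English stopwords:

# Verify the download by printing a few English stopwords
from nltk.corpus import stopwords
print(stopwords.words('english')[:10])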

1.2 Loading a Language Model with spaCy

Difficulty Level : L1

# Download the model first (run this in a shell, not inside Python)
# python -m spacy download en_core_web_sm

import spacy
nlp = spacy.load("en_core_web_sm")
nlp
# More models here: https://spacy.io/models
#> <spacy.lang.en.English at 0x7facaf6cd0f0>
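
Once loaded, you can inspect which processing components the model ships with (a small sketch using the standard pipe_names attribute; the exact component list varies by model version):

# List the pipeline components (e.g. tagger, parser, ner)
print(nlp.pipe_names)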

1.3 Removing Stopwords from a Sentence

Difficulty Level : L1

1.3.1 Input

text="""the outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives. Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others. Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be) the role of the different mass media channels in our lives on individual, social and societal levels? Mass media have long been recognized as powerful forces shaping how we experience the world and ourselves. This recognition is accompanied by a growing volume of research, that closely follows the footsteps of technological transformations (e.g. radio, movies, television, the internet, mobiles) and the zeitgeist (e.g. cold war, 9/11, climate change) in an attempt to map mass media major impacts on how we perceive ourselves, both as individuals and citizens. Are media (broadcast and digital) still able to convey a sense of unity reaching large audiences, or are messages lost in the noisy crowd of mass self-communication? """

1.3.2 Desired Output

'outbreak coronavirus disease 2019 ( COVID-19 ) created global health crisis deep impact way perceive world everyday lives . rate contagion patterns transmission threatens sense agency , safety measures place contain spread virus require social distancing refraining inherently human , find solace company . context physical threat , social physical distancing , public alarm , ( ) role different mass media channels lives individual , social societal levels ? Mass media long recognized powerful forces shaping experience world . recognition accompanied growing volume research , closely follows footsteps technological transformations ( e.g. radio , movies , television , internet , mobiles ) zeitgeist ( e.g. cold war , 9/11 , climate change ) attempt map mass media major impacts perceive , individuals citizens . media ( broadcast digital ) able convey sense unity reaching large audiences , messages lost noisy crowd mass self - communication ?'

1.3.3 Solution

1.3.3.1 Method 1: Removing stopwords in nltk
# Method 1
# Removing stopwords in nltk

import nltk
from nltk.corpus import stopwords

my_stopwords = set(stopwords.words('english'))
new_tokens = []

# Tokenization using word_tokenize()
all_tokens = nltk.word_tokenize(text)

# Keep only the tokens that are not stopwords
for token in all_tokens:
  if token not in my_stopwords:
    new_tokens.append(token)

" ".join(new_tokens)
1.3.3.2 Method 2: Removing stopwords in spaCy
# Method 2
# Removing stopwords in spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
new_tokens = []

# Use the is_stop attribute of each token to check if it's a stopword
for token in doc:
  if not token.is_stop:
    new_tokens.append(token.text)

" ".join(new_tokens)

1.4 Adding Custom Stopwords in spaCy

Difficulty Level : L1

Q. Add the custom stopwords "NIL" and "JUNK" in spaCy and remove all stopwords from the text below

1.4.1 Input

text=" Jonas was a JUNK great guy NIL Adam was evil NIL Martha JUNK was more of a fool "

1.4.2 Expected Output

  'Jonas great guy Adam evil Martha fool'

1.4.3 Solution

import spacy

nlp=spacy.load("en_core_web_sm")
# list of custom stop words
customize_stop_words = ['NIL','JUNK']

# Adding these stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True
doc = nlp(text)
tokens = [token.text for token in doc if not token.is_stop]

" ".join(tokens)

1.5 Removing Punctuation

Difficulty Level : L1

Q. Remove all the punctuation from the given text

1.5.1 Input

text="The match has concluded !!! India has won the match . Will we fin the finals too ? !"

1.5.2 Desired Output

'The match has concluded India has won the match Will we win the finals too'

1.5.3 Solution

1.5.3.1 Method 1: Removing punctuation in spaCy
# Removing punctuation in spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
new_tokens = []

# Check if a token is punctuation via the is_punct attribute
for token in doc:
  if not token.is_punct:
    new_tokens.append(token.text)

" ".join(new_tokens)
1.5.3.2 Method 2: Removing punctuation in nltk with RegexpTokenizer
# Method 2
# Removing punctuation in nltk with RegexpTokenizer

import nltk

# \w+ keeps only runs of word characters, dropping punctuation
tokenizer = nltk.RegexpTokenizer(r"\w+")

tokens = tokenizer.tokenize(text)
" ".join(tokens)

1.6 Merging Words into Phrases with Bigrams (Very Important)

Difficulty Level : L3

The usual workflow tokenizes text into individual words; splitting into phrases is rare unless you bring in heavier machinery such as dependency parsing. The goal of this exercise: merge frequently co-occurring word pairs into single phrase tokens.

The core tool is gensim's Phraser.

1.6.1 Input

documents = ["the mayor of new york was there", "new york mayor was present"]

1.6.2 Desired Output

['the', 'mayor', 'of', 'new york', 'was', 'there']
['new york', 'mayor', 'was', 'present']

1.6.3 Solution

# Import Phrases and Phraser from gensim
from gensim.models import Phrases
from gensim.models.phrases import Phraser

sentence_stream = [doc.split(" ") for doc in documents]

# Creating the bigram phraser
# (gensim 3.x syntax: in gensim >= 4.0 the delimiter must be a str, not bytes)
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)

for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]
    print(tokens_)
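
For gensim >= 4.0 the API changed: the delimiter must be a plain string, and a trained Phrases model can be frozen for faster application. A sketch against the newer API:

# gensim >= 4.0 variant
from gensim.models.phrases import Phrases

sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=' ')
frozen = bigram.freeze()  # immutable, faster equivalent of Phraser

for sent in sentence_stream:
    print(frozen[sent])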

1.7 Counting Bigrams and Trigrams (Very Important)

Difficulty Level : L3

1.7.1 Input

text="Machine learning is a neccessary field in today's world. Data science can do wonders . Natural Language Processing is how machines understand text "

1.7.2 Desired Output

Bigrams are [('machine', 'learning'), ('learning', 'is'), ('is', 'a'), ('a', 'necessary'), ('necessary', 'field'), ('field', 'in'), ('in', "today's"), ("today's", 'world.'), ('world.', 'data'), ('data', 'science'), ('science', 'can'), ('can', 'do'), ('do', 'wonders'), ('wonders', '.'), ('.', 'natural'), ('natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'how'), ('how', 'machines'), ('machines', 'understand'), ('understand', 'text')]
 Trigrams are [('machine', 'learning', 'is'), ('learning', 'is', 'a'), ('is', 'a', 'necessary'), ('a', 'necessary', 'field'), ('necessary', 'field', 'in'), ('field', 'in', "today's"), ('in', "today's", 'world.'), ("today's", 'world.', 'data'), ('world.', 'data', 'science'), ('data', 'science', 'can'), ('science', 'can', 'do'), ('can', 'do', 'wonders'), ('do', 'wonders', '.'), ('wonders', '.', 'natural'), ('.', 'natural', 'language'), ('natural', 'language', 'processing'), ('language', 'processing', 'is'), ('processing', 'is', 'how'), ('is', 'how', 'machines'), ('how', 'machines', 'understand'), ('machines', 'understand', 'text')]

1.7.3 Solution

# Method 1: nltk's ngrams helper
from nltk import ngrams

bigram = list(ngrams(text.lower().split(), 2))
trigram = list(ngrams(text.lower().split(), 3))

print(" Bigrams are", bigram)
print(" Trigrams are", trigram)


# Method 2: a hand-rolled ngram function
def ngram(text, n):
    # Split the input text on whitespace into a list of words
    words = text.split()
    # Build the n-gram list with a sliding window
    ngram_list = []
    for i in range(len(words) - n + 1):
        ngram_list.append(' '.join(words[i:i+n]))
    return ngram_list
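
A quick usage example for the hand-rolled helper (the sample sentence is illustrative, not from the exercise):

# Bigrams from a short sentence
print(ngram("machine learning is fun", 2))
#> ['machine learning', 'learning is', 'is fun']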

2 Tokenization

2.1 Tokenizing with NLTK or spaCy

Difficulty Level : L1

2.1.1 Input

text="Last week, the University of Cambridge shared its own research that shows if everyone wears a mask outside home,dreaded ‘second wave’ of the pandemic can be avoided."

2.1.2 Desired Output

Last
week
,
the
University
of
Cambridge
shared
...(truncated)...

2.1.3 Solution

# Method 1: Tokenization with nltk
import nltk

tokens = nltk.word_tokenize(text)
for token in tokens:
  print(token)


# Method 2: Tokenization with spaCy
import spacy

lm = spacy.load("en_core_web_sm")
tokens = lm(text)
for token in tokens:
  print(token.text)

2.2 Tokenizing with transformers (Very Important)

Difficulty Level : L1

2.2.1 Input

text="I love spring season. I go hiking with my friends"

2.2.2 Desired Output

[101, 1045, 2293, 3500, 2161, 1012, 1045, 2175, 13039, 2007, 2026, 2814, 102]

[CLS] i love spring season. i go hiking with my friends [SEP]

2.2.3 Solution

from transformers import AutoTokenizer

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Encoding with the tokenizer
inputs = tokenizer.encode(text)
print(inputs)

# Calling the tokenizer directly also works (returns ids plus attention masks)
print(tokenizer(text))

# Decoding back to text
print(tokenizer.decode(inputs))
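
To see the actual subword pieces rather than ids, a small sketch using the tokenizer's standard tokenize() method:

# Show the WordPiece subword tokens produced by the BERT tokenizer
print(tokenizer.tokenize(text))
#> ['i', 'love', 'spring', 'season', '.', 'i', 'go', 'hiking', 'with', 'my', 'friends']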

2.3 Tokenizing with Stopwords as Delimiters

Difficulty Level : L2

Q. Tokenize the given text with the stop words ("is", "the", "was") as delimiters. Tokenizing this way identifies meaningful phrases, which is sometimes useful for topic modeling.

2.3.1 Input

text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know.""

2.3.2 Expected Output

['Walter',
 'feeling anxious',
 'He',
 'diagnosed today',
 'He probably',
 'best person I know']

2.3.3 Solution

text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."

stop_words_and_delims = ['was', 'is', 'the', '.', ',', '-', '!', '?']
for r in stop_words_and_delims:
    text = text.replace(r, 'DELIM')

words = [t.strip() for t in text.split('DELIM')]
words_filtered = list(filter(lambda a: a not in [''], words))
print(words_filtered)
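
An alternative sketch using a single regex split: the \b word boundaries avoid the substring pitfall of str.replace, which would also match "is" or "was" inside longer words.

# Split on the stopwords "was", "is", "the" and on punctuation
import re
text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
parts = re.split(r"\b(?:was|is|the)\b|[.,!?-]", text)
print([p.strip() for p in parts if p.strip()])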

2.4 Tokenizing Tweets and Other Web Text

Difficulty Level : L2

2.4.1 Input

text=" Having lots of fun #goa #vaction #summervacation. Fancy dinner @Beachbay restro :) "

2.4.2 Desired Output

['Having',
 'lots',
 'of',
 'fun',
 'goa',
 'vaction',
 'summervacation',
 'Fancy',
 'dinner',
 'Beachbay',
 'restro']

2.4.3 Solution

import re
# Clean the tweet: replace every non-word character (#, @, emoticons) with a space
text = re.sub(r'[^\w]', ' ', text)

# Tokenize with nltk's TweetTokenizer
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(text))
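
Note that the regex cleanup above is what strips the hashtags and the handle; TweetTokenizer itself is designed to keep such web-text units intact. A sketch on the raw tweet:

# Without pre-cleaning, hashtags like '#goa' and the ':)' emoticon
# survive as single tokens; strip_handles=True drops @-mentions
from nltk.tokenize import TweetTokenizer
raw = " Having lots of fun #goa #vaction #summervacation. Fancy dinner @Beachbay restro :) "
print(TweetTokenizer(strip_handles=True).tokenize(raw))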

3 Basic Sentence Operations

3.1 Splitting a Document into Sentences

Difficulty Level : L1

Q. Print the sentences of the given text document

3.1.1 Input

text="""The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives. Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others. Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be) the role of the different mass media channels in our lives on individual, social and societal levels? Mass media have long been recognized as powerful forces shaping how we experience the world and ourselves. This recognition is accompanied by a growing volume of research, that closely follows the footsteps of technological transformations (e.g. radio, movies, television, the internet, mobiles) and the zeitgeist (e.g. cold war, 9/11, climate change) in an attempt to map mass media major impacts on how we perceive ourselves, both as individuals and citizens. Are media (broadcast and digital) still able to convey a sense of unity reaching large audiences, or are messages lost in the noisy crowd of mass self-communication? """

3.1.2 Desired Output

The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives.
Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others.
Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be)
...(truncated)...

3.1.3 Solution

# Method 1: spaCy
import spacy

lm = spacy.load('en_core_web_sm')
doc = lm(text)
for sentence in doc.sents:
  print(sentence)

# Method 2: NLTK
import nltk
print(nltk.sent_tokenize(text))

3.2 Getting the Dependency Parse of a Sentence

Difficulty Level : L3

3.2.1 Input

text1="Netflix has released a new series"
text2="It was shot in London"
text3="It is called Dark and the main character is Jonas"
text4="Adam is the evil character"

3.2.2 Desired Output

{'id': 0,
 'paragraphs': [{'cats': [],
   'raw': 'Netflix has released a new series',
   'sentences': [{'brackets': [],
     'tokens': [{'dep': 'nsubj',
       'head': 2,
       'id': 0,
       'ner': 'U-ORG',
       'orth': 'Netflix',
       'tag': 'NNP'},
      {'dep': 'aux',
       'head': 1,
       'id': 1,
       'ner': 'O',
       'orth': 'has',
       'tag': 'VBZ'},
      {'dep': 'ROOT',
       'head': 0,
       'id': 2,
       'ner': 'O',
       'orth': 'released',
       'tag': 'VBN'},
      {'dep': 'det', 'head': 2, 'id': 3, 'ner': 'O', 'orth': 'a', 'tag': 'DT'},
      {'dep': 'amod',
       'head': 1,
       'id': 4,
       'ner': 'O',
       'orth': 'new',
       'tag': 'JJ'},
      {'dep': 'dobj',
       'head': -3,
       'id': 5,
       'ner': 'O',
       'orth': 'series',
       'tag': 'NN'}]}]},
    ...(truncated)

3.2.3 Solution

# Convert into spaCy documents
import spacy
nlp = spacy.load("en_core_web_sm")

doc1 = nlp(text1)
doc2 = nlp(text2)
doc3 = nlp(text3)
doc4 = nlp(text4)

# Import docs_to_json (spaCy 2.x; in spaCy 3.x it moved to spacy.training)
from spacy.gold import docs_to_json

# Convert into JSON format
json_data = docs_to_json([doc1, doc2, doc3, doc4])
print(json_data)
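
If you only need to inspect the dependency tree, the standard token attributes give it directly, without the JSON round-trip (a minimal sketch):

# Print each token with its dependency label and syntactic head
for token in doc1:
    print(token.text, token.dep_, token.head.text)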

3.3 Stemming

Difficulty Level : L2

3.3.1 Input

text= "Dancing is an art. Students should be taught dance as a subject in schools . I danced in many of my school function. Some people are always hesitating to dance."

3.3.2 Desired Output

'danc is an art . student should be taught danc as a subject in school . I danc in mani of my school function . some peopl are alway hesit to danc .'

3.3.3 Solution

import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
" ".join(stemmer.stem(token) for token in nltk.word_tokenize(text))
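
A variant sketch with NLTK's Snowball ("Porter2") stemmer, which makes slightly different choices than the original Porter algorithm:

# Snowball ("Porter2") stemmer variant
from nltk.stem import SnowballStemmer
snow = SnowballStemmer("english")
" ".join(snow.stem(token) for token in nltk.word_tokenize(text))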