Difficulty Level : L1
Q. Download the nltk data packages needed for tokenization and stopword removal
Note: this step only downloads the data packages, it does not use them. They must be downloaded before they can be used.
# Downloading packages and importing
import nltk
nltk.download('punkt')
nltk.download('stopwords')
#> [nltk_data] Downloading package punkt to /root/nltk_data...
#> [nltk_data] Unzipping tokenizers/punkt.zip.
#> [nltk_data] Downloading package stopwords to /root/nltk_data...
#> [nltk_data] Unzipping corpora/stopwords.zip.
#> True
Difficulty Level : L1
Q. Download and load the spaCy English model
# Download the model first (run from the command line)
python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
nlp
# More models here: https://spacy.io/models
#> <spacy.lang.en.English at 0x7facaf6cd0f0>
Difficulty Level : L1
Q. Remove the stopwords from the given text
text="""the outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives. Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others. Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be) the role of the different mass media channels in our lives on individual, social and societal levels? Mass media have long been recognized as powerful forces shaping how we experience the world and ourselves. This recognition is accompanied by a growing volume of research, that closely follows the footsteps of technological transformations (e.g. radio, movies, television, the internet, mobiles) and the zeitgeist (e.g. cold war, 9/11, climate change) in an attempt to map mass media major impacts on how we perceive ourselves, both as individuals and citizens. Are media (broadcast and digital) still able to convey a sense of unity reaching large audiences, or are messages lost in the noisy crowd of mass self-communication? """
'outbreak coronavirus disease 2019 ( COVID-19 ) created global health crisis deep impact way perceive world everyday lives . rate contagion patterns transmission threatens sense agency , safety measures place contain spread virus require social distancing refraining inherently human , find solace company . context physical threat , social physical distancing , public alarm , ( ) role different mass media channels lives individual , social societal levels ? Mass media long recognized powerful forces shaping experience world . recognition accompanied growing volume research , closely follows footsteps technological transformations ( e.g. radio , movies , television , internet , mobiles ) zeitgeist ( e.g. cold war , 9/11 , climate change ) attempt map mass media major impacts perceive , individuals citizens . media ( broadcast digital ) able convey sense unity reaching large audiences , messages lost noisy crowd mass self - communication ?'
# Method 1
# Removing stopwords in nltk
from nltk.corpus import stopwords
my_stopwords = set(stopwords.words('english'))
new_tokens = []
# Tokenization using word_tokenize()
all_tokens = nltk.word_tokenize(text)
for token in all_tokens:
    if token not in my_stopwords:
        new_tokens.append(token)
" ".join(new_tokens)
# Method 2
# Removing stopwords in spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
new_tokens = []
# Using the is_stop attribute of each token to check if it's a stopword
for token in doc:
    if not token.is_stop:
        new_tokens.append(token.text)
" ".join(new_tokens)
Difficulty Level : L1
Q. Add the custom stop words "NIL" and "JUNK" in spaCy and remove the stopwords in the text below
text=" Jonas was a JUNK great guy NIL Adam was evil NIL Martha JUNK was more of a fool "
'Jonas great guy Adam evil Martha fool'
import spacy
nlp=spacy.load("en_core_web_sm")
# List of custom stop words
customize_stop_words = ['NIL', 'JUNK']
# Adding these stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True
doc = nlp(text)
tokens = [token.text for token in doc if not token.is_stop]
" ".join(tokens)
Difficulty Level : L1
Q. Remove all the punctuation in the given text
text="The match has concluded !!! India has won the match . Will we fin the finals too ? !"
'The match has concluded India has won the match Will we fin the finals too'
# Method 1
# Removing punctuation in spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
new_tokens = []
# Check if a token is punctuation through the is_punct attribute
for token in doc:
    if not token.is_punct:
        new_tokens.append(token.text)
" ".join(new_tokens)
# Method 2
# Removing punctuation in nltk with RegexpTokenizer
tokenizer=nltk.RegexpTokenizer(r"\w+")
tokens=tokenizer.tokenize(text)
" ".join(tokens)
Difficulty Level : L3
Typical pipelines tokenize into individual words; splitting into phrases is much rarer, and usually requires heavier machinery such as dependency parsing. The goal of this exercise is to merge commonly co-occurring word pairs into single phrase tokens.
The key tool is gensim's Phraser.
documents = ["the mayor of new york was there", "new york mayor was present"]
['the', 'mayor', 'of', 'new york', 'was', 'there']
['new york', 'mayor', 'was', 'present']
# Import Phraser from gensim
from gensim.models import Phrases
from gensim.models.phrases import Phraser
sentence_stream = [doc.split(" ") for doc in documents]
# Creating bigram phraser
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=' ')  # delimiter is a str in gensim 4+; use b' ' on gensim 3.x
bigram_phraser = Phraser(bigram)
for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]
    print(tokens_)
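Gensim's default scorer promotes a bigram when `(count_ab - min_count) * vocab_size / (count_a * count_b)` exceeds `threshold`. That scoring can be sketched in plain Python over the same two sentences (a simplified re-implementation for illustration, not gensim itself):

```python
from collections import Counter

documents = ["the mayor of new york was there", "new york mayor was present"]
sentences = [doc.split() for doc in documents]

# Count single words and adjacent word pairs
unigrams = Counter(w for sent in sentences for w in sent)
bigrams = Counter(p for sent in sentences for p in zip(sent, sent[1:]))
min_count, vocab_size = 1, len(unigrams)

def score(a, b):
    # Mirrors gensim's default scorer:
    # (count_ab - min_count) * vocab_size / (count_a * count_b)
    return (bigrams[(a, b)] - min_count) * vocab_size / (unigrams[a] * unigrams[b])

print(score("new", "york"))    # co-occurs in both sentences
print(score("was", "there"))   # co-occurs only once
```

With `threshold=2`, "new york" (score 2.0) is promoted to a phrase while "was there" (score 0.0) is not, which matches the tokenization shown above.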
Difficulty Level : L3
Q. Extract all the bigrams and trigrams from the given text
text="Machine learning is a neccessary field in today's world. Data science can do wonders . Natural Language Processing is how machines understand text "
Bigrams are [('machine', 'learning'), ('learning', 'is'), ('is', 'a'), ('a', 'neccessary'), ('neccessary', 'field'), ('field', 'in'), ('in', "today's"), ("today's", 'world.'), ('world.', 'data'), ('data', 'science'), ('science', 'can'), ('can', 'do'), ('do', 'wonders'), ('wonders', '.'), ('.', 'natural'), ('natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'how'), ('how', 'machines'), ('machines', 'understand'), ('understand', 'text')]
Trigrams are [('machine', 'learning', 'is'), ('learning', 'is', 'a'), ('is', 'a', 'neccessary'), ('a', 'neccessary', 'field'), ('neccessary', 'field', 'in'), ('field', 'in', "today's"), ('in', "today's", 'world.'), ("today's", 'world.', 'data'), ('world.', 'data', 'science'), ('data', 'science', 'can'), ('science', 'can', 'do'), ('can', 'do', 'wonders'), ('do', 'wonders', '.'), ('wonders', '.', 'natural'), ('.', 'natural', 'language'), ('natural', 'language', 'processing'), ('language', 'processing', 'is'), ('processing', 'is', 'how'), ('is', 'how', 'machines'), ('how', 'machines', 'understand'), ('machines', 'understand', 'text')]
# Method 1: nltk's ngrams
from nltk import ngrams
bigram = list(ngrams(text.lower().split(), 2))
trigram = list(ngrams(text.lower().split(), 3))
print(" Bigrams are", bigram)
print(" Trigrams are", trigram)

# Method 2: building the n-grams by hand
def ngram(text, n):
    # Split the input text into words on whitespace
    words = text.split()
    # Collect each run of n consecutive words
    ngram_list = []
    for i in range(len(words) - n + 1):
        ngram_list.append(' '.join(words[i:i+n]))
    return ngram_list
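The hand-rolled loop is equivalent to a common `zip`-over-shifted-slices idiom, which produces the same pairs as `nltk.ngrams` without any dependency:

```python
def make_ngrams(tokens, n):
    # Zip the token list against itself shifted by 1 .. n-1 positions;
    # zip stops at the shortest slice, so no bounds checks are needed
    return list(zip(*[tokens[i:] for i in range(n)]))

words = "Natural Language Processing is fun".lower().split()
print(make_ngrams(words, 2))
print(make_ngrams(words, 3))
```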
Difficulty Level : L1
Q. Tokenize the given text into words
text="Last week, the University of Cambridge shared its own research that shows if everyone wears a mask outside home,dreaded ‘second wave’ of the pandemic can be avoided."
Last
week
,
the
University
of
Cambridge
shared
...(truncated)...
# Method 1: Tokenization with nltk
tokens = nltk.word_tokenize(text)
for token in tokens:
    print(token)

# Method 2: Tokenization with spaCy
lm = spacy.load("en_core_web_sm")
tokens = lm(text)
for token in tokens:
    print(token.text)
Difficulty Level : L1
Q. Encode and decode the given text with a BERT tokenizer
text="I love spring season. I go hiking with my friends"
[101, 1045, 2293, 3500, 2161, 1012, 1045, 2175, 13039, 2007, 2026, 2814, 102]
[CLS] i love spring season. i go hiking with my friends [SEP]
from transformers import AutoTokenizer
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Encoding with the tokenizer
inputs = tokenizer.encode(text)
print(inputs)
# The tokenizer can also be called directly
print(tokenizer(text))
# Decoding the ids back into text
print(tokenizer.decode(inputs))
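What `encode` does to out-of-vocabulary words is WordPiece subword splitting: greedy longest-match-first lookup against the vocabulary, with continuation pieces prefixed `##`. A toy sketch of that lookup (the vocabulary below is a hand-made assumption for illustration, not BERT's real one):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword split, as in BERT's WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the window until the (possibly ##-prefixed) piece is in vocab
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched at this position
        start = end
    return pieces

toy_vocab = {"hik", "##ing", "hi", "##k", "spring"}
print(wordpiece("hiking", toy_vocab))   # longest match "hik" wins over "hi"
print(wordpiece("spring", toy_vocab))   # whole word is in the vocab
```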
Difficulty Level : L2
Q. Tokenize the given text with the stop words ("is", "the", "was") as delimiters. Tokenizing this way identifies meaningful phrases, which is sometimes useful for topic modeling.
text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'best person I know']
text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
stop_words_and_delims = ['was', 'is', 'the', '.', ',', '-', '!', '?']
for r in stop_words_and_delims:
    text = text.replace(r, 'DELIM')
words = [t.strip() for t in text.split('DELIM')]
words_filtered = list(filter(lambda a: a not in [''], words))
print(words_filtered)
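The chained `replace` calls match raw substrings, so removing "is" would also mangle words such as "This". A regex split with word boundaries avoids that and yields the same phrases here:

```python
import re

text = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
# Split on the stop words (as whole words only) and on punctuation
pattern = r"\b(?:was|is|the)\b|[.,!?-]"
chunks = [c.strip() for c in re.split(pattern, text)]
# Drop the empty strings left between adjacent delimiters
phrases = [c for c in chunks if c]
print(phrases)
```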
Difficulty Level : L2
Q. Clean the given tweet (remove the hashtag and handle markers and the emoticon) and tokenize it
text=" Having lots of fun #goa #vaction #summervacation. Fancy dinner @Beachbay restro :) "
['Having',
'lots',
'of',
'fun',
'goa',
'vaction',
'summervacation',
'Fancy',
'dinner',
'Beachbay',
'restro']
import re
# Cleaning the tweets
text=re.sub(r'[^\w]', ' ', text)
# Using nltk's TweetTokenizer
from nltk.tokenize import TweetTokenizer
tokenizer=TweetTokenizer()
print(tokenizer.tokenize(text))
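Before the markers are stripped, the hashtags and handles are often worth harvesting on their own; a plain-regex sketch over the same tweet:

```python
import re

text = " Having lots of fun #goa #vaction #summervacation. Fancy dinner @Beachbay restro :) "
# Capture the word following each marker, without the marker itself
hashtags = re.findall(r"#(\w+)", text)
mentions = re.findall(r"@(\w+)", text)
print(hashtags)
print(mentions)
```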
Difficulty Level : L1
Q. Print the sentences of the given text document
text="""The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives. Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others. Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be) the role of the different mass media channels in our lives on individual, social and societal levels? Mass media have long been recognized as powerful forces shaping how we experience the world and ourselves. This recognition is accompanied by a growing volume of research, that closely follows the footsteps of technological transformations (e.g. radio, movies, television, the internet, mobiles) and the zeitgeist (e.g. cold war, 9/11, climate change) in an attempt to map mass media major impacts on how we perceive ourselves, both as individuals and citizens. Are media (broadcast and digital) still able to convey a sense of unity reaching large audiences, or are messages lost in the noisy crowd of mass self-communication? """
The outbreak of coronavirus disease 2019 (COVID-19) has created a global health crisis that has had a deep impact on the way we perceive our world and our everyday lives.
Not only the rate of contagion and patterns of transmission threatens our sense of agency, but the safety measures put in place to contain the spread of the virus also require social distancing by refraining from doing what is inherently human, which is to find solace in the company of others.
Within this context of physical threat, social and physical distancing, as well as public alarm, what has been (and can be)
...(truncated)...
# Method 1: using spaCy
import spacy
lm = spacy.load('en_core_web_sm')
doc = lm(text)
for sentence in doc.sents:
    print(sentence)
# Method 2: using NLTK
print(nltk.sent_tokenize(text))
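A naive regex splitter shows why trained sentence tokenizers matter: splitting on sentence-final punctuation breaks on abbreviations such as "e.g." (the sample below is shortened from the passage above):

```python
import re

text = ("Mass media have long been recognized as powerful forces. "
        "This recognition is accompanied by research, e.g. radio and movies. "
        "Are media still able to convey a sense of unity?")
# Naive rule: split after . ! or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
```

The naive rule produces four "sentences" instead of three, wrongly cutting after "e.g."; spaCy and `sent_tokenize` handle such abbreviations.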
Difficulty Level : L3
Q. Convert the given spaCy documents into JSON format
text1="Netflix has released a new series"
text2="It was shot in London"
text3="It is called Dark and the main character is Jonas"
text4="Adam is the evil character"
{'id': 0, 'paragraphs': [{'cats': [], 'raw': 'Netflix has released a new series', 'sentences': [{'brackets': [], 'tokens': [{'dep': 'nsubj', 'head': 2, 'id': 0, 'ner': 'U-ORG', 'orth': 'Netflix', 'tag': 'NNP'}, {'dep': 'aux', 'head': 1, 'id': 1, 'ner': 'O', 'orth': 'has', 'tag': 'VBZ'}, {'dep': 'ROOT', 'head': 0, 'id': 2, 'ner': 'O', 'orth': 'released', 'tag': 'VBN'}, {'dep': 'det', 'head': 2, 'id': 3, 'ner': 'O', 'orth': 'a', 'tag': 'DT'}, {'dep': 'amod', 'head': 1, 'id': 4, 'ner': 'O', 'orth': 'new', 'tag': 'JJ'}, {'dep': 'dobj', 'head': -3, 'id': 5, 'ner': 'O', 'orth': 'series', 'tag': 'NN'}]}]}, ...(truncated)
# Convert into spaCy documents
doc1=nlp(text1)
doc2=nlp(text2)
doc3=nlp(text3)
doc4=nlp(text4)
# Import docs_to_json (lives in spacy.gold in spaCy v2; removed in v3)
from spacy.gold import docs_to_json
# Converting into json format
json_data = docs_to_json([doc1,doc2,doc3,doc4])
print(json_data)
Difficulty Level : L2
Q. Perform stemming on the given text
text= "Dancing is an art. Students should be taught dance as a subject in schools . I danced in many of my school function. Some people are always hesitating to dance."
text= 'danc is an art . student should be taught danc as a subject in school . I danc in mani of my school function . some peopl are alway hesit to danc .'
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Stem every token and join the results back into a string
print(" ".join(stemmer.stem(token) for token in nltk.word_tokenize(text)))
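Porter stemming is, at heart, an ordered cascade of suffix-rewrite rules. A toy stripper conveys the idea (the rule list is an illustrative assumption, far simpler than the real algorithm's measure-based conditions):

```python
# Ordered (suffix, replacement) rules; the first matching rule wins
rules = [("ing", ""), ("ed", ""), ("ies", "i"), ("s", "")]

def toy_stem(word):
    word = word.lower()
    for suffix, repl in rules:
        # Only strip when enough of a stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

print([toy_stem(w) for w in ["Dancing", "danced", "students", "art"]])
```

Real Porter stemming adds conditions on the "measure" of the remaining stem, which is why it turns "many" into "mani" and "hesitating" into "hesit" rather than just chopping suffixes.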