
Python NLP
  • . : stands for any single character
  • ^a : matches strings that start with the letter a
  • a$ : matches strings that end with the letter a
  • r"\\" : matches a backslash
  • [0-9] : matches any one digit
  • [0-9]{3} : [0-9] repeated three times
    import re

    # returns a match object if the pattern occurs in the string, otherwise None
    re.search(regex, string)
    # returns a list of all matches of the pattern in the string
    re.findall(regex, string)
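A small example applying the patterns above (the sample strings are made up for illustration):

    import re

    print(bool(re.search(r'^a', 'apple')))                 # True: starts with the letter a
    print(bool(re.search(r'a$', 'banana')))                # True: ends with the letter a
    print(re.findall(r'[0-9]{3}', 'room 101, ext 4452'))   # ['101', '445']: runs of three digits
    print(bool(re.search(r'\\', r'C:\temp')))              # True: the text contains a backslash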

Replacing empty values with '0':

    import numpy as np

    matrix = np.genfromtxt("....csv", dtype='U75', skip_header=1, delimiter=',')
    for i in range(matrix.shape[1]):      # loop over columns
        column = (matrix[:, i] == '')     # boolean mask of empty cells in column i
        matrix[column, i] = '0'

    # type conversion
    vector = vector.astype(float)
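A self-contained sketch of the same replacement on a small in-memory array (the values are made up for illustration):

    import numpy as np

    matrix = np.array([['1', '',  '3'],
                       ['4', '5', '']], dtype='U75')
    for i in range(matrix.shape[1]):        # loop over columns
        empty = (matrix[:, i] == '')        # boolean mask of empty cells in column i
        matrix[empty, i] = '0'
    print(matrix.astype(float))             # [[1. 0. 3.] [4. 5. 0.]]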

NLTK

    import nltk
    nltk.download('gutenberg')

Linux platform

Ubuntu ships with both python2 and python3. Setting the default Python version and switching between them:

Run these two commands (the higher priority, 150, makes python3 the default):

    sudo update-alternatives --install /usr/bin/python python /usr/bin/python2 100
    sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 150

To switch to Python 2, run:

    sudo update-alternatives --config python

and enter the number of the desired version at the prompt.
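After switching, the active interpreter can also be confirmed from inside Python:

    import sys

    print(sys.executable)     # path of the interpreter that `python` currently points to
    print(sys.version_info)   # major version shows whether python2 or python3 is the default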

Since the pip version bundled with Ubuntu is too old to install nltk, force-reinstall pip:

    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
    sudo python get-pip.py --force-reinstall

Then run:

    pip install nltk

Configuring the Java environment

gedit error: gedit is a graphical editor and cannot be opened on a headless server, so it is not available for editing there.

In that case, edit the configuration file directly:

    JAVA_HOME=/usr/lib/jvm/java1.8
    JRE_HOME=/usr/lib/jvm/java1.8/jre
    PATH=$JAVA_HOME/bin:$PATH
    CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
    export JAVA_HOME
    export JRE_HOME
    export PATH
    export CLASSPATH

Add the same lines there, then verify with java -version (a single dash); if the version prints, the setup succeeded.

Location of the nltk package on Linux: /usr/local/lib/python3.6/dist-packages
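If the path differs on your machine, the install location can also be printed from Python itself:

    import nltk

    print(nltk.__file__)    # e.g. /usr/local/lib/python3.6/dist-packages/nltk/__init__.py
    print(nltk.data.path)   # directories nltk searches for downloaded corpora and models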

NLTK Stanford NLP

Using Python NLTK with the Stanford NLP toolkit for word segmentation, part-of-speech tagging, and syntactic parsing

Using the Stanford NLP toolkit from within NLTK

Some points remain unclear: for example, where is the stanfordNLTK directory that the article mentions?

The author's resource link also appears to be mislabeled, and a reply seems unlikely any time soon, so I downloaded the resources one by one and organized them myself.
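Since the original setup steps are unclear, here is only a minimal sketch, assuming the Stanford CoreNLP package has been unpacked and its server started locally on port 9000 (the server command, port, and sample sentence are assumptions, not from the original article). NLTK's CoreNLPParser can then be used for word segmentation, POS tagging, and constituency parsing:

    # start the server first, inside the unpacked CoreNLP directory, e.g.:
    #   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 30000
    from nltk.parse.corenlp import CoreNLPParser

    parser = CoreNLPParser(url='http://localhost:9000')
    pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')

    sentence = 'The quick brown fox jumps over the lazy dog.'
    tokens = list(parser.tokenize(sentence))     # word segmentation
    print(tokens)
    print(list(pos_tagger.tag(tokens)))          # POS tagging
    tree = next(parser.raw_parse(sentence))      # constituency parse
    tree.pretty_print()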

NLP

1. Text Processing

1.1 Sentence Tokenization

    import nltk
    from nltk.corpus import gutenberg
    from pprint import pprint
    import numpy as np

    alice = gutenberg.raw(fileids='carroll-alice.txt')
    # sample_text used below can be any raw text string; the Alice corpus is reused here as a stand-in
    sample_text = alice
    default_st = nltk.sent_tokenize
    alice_sentences = default_st(text=alice)
    print('\nTotal sentences in alice:', len(alice_sentences))
    print('First 5 sentences in alice:-')
    print(np.array(alice_sentences[0:5]))

    punkt_st = nltk.tokenize.PunktSentenceTokenizer()
    sample_sentences = punkt_st.tokenize(sample_text)
    print(np.array(sample_sentences))

Regular expression:

    SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
    regex_st = nltk.tokenize.RegexpTokenizer(
        pattern=SENTENCE_TOKENS_PATTERN,
        gaps=True)
    sample_sentences = regex_st.tokenize(sample_text)
    print(np.array(sample_sentences))

1.2 Word Tokenization

Default:

    default_wt = nltk.word_tokenize
    words = default_wt(sample_text)
    np.array(words)

Treebank

    treebank_wt = nltk.TreebankWordTokenizer()
    words = treebank_wt.tokenize(sample_text)
    np.array(words)

Regular expression (the token pattern would be r'\w+'; here a whitespace gap pattern is used with gaps=True):

    GAP_PATTERN = r'\s+'
    regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN,
                                    gaps=True)
    words = regex_wt.tokenize(sample_text)
    np.array(words)
    word_indices = list(regex_wt.span_tokenize(sample_text))
    print(word_indices)  # (start, end) position of each token in the original text
    print(np.array([sample_text[start:end] for start, end in word_indices]))
    # prints each token recovered from those spans

 

    def tokenize_text(text):
        sentences = nltk.sent_tokenize(text)
        word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
        return word_tokens

    sents = tokenize_text(sample_text)
    np.array(sents)
    words = [word for sentence in sents for word in sentence]
    np.array(words)

Faster sentence splitting and word tokenization with spaCy:

    import spacy

    # requires an installed English model, e.g. via `python -m spacy download en_core_web_sm`
    nlp = spacy.load('en_core_web_sm')
    text_spacy = nlp(sample_text)
    sents = list(text_spacy.sents)
    sent_words = [[word.text for word in sent] for sent in sents]
    np.array(sent_words)
    words = [word.text for word in text_spacy]
    np.array(words)

Removing accented characters:

    import unicodedata

    def remove_accented_chars(text):
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        return text

    remove_accented_chars('Sómě Áccěntěd těxt')

Removing special characters:

    import re

    def remove_special_characters(text, remove_digits=False):
        # keep letters and whitespace (and digits unless remove_digits=True)
        pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
        text = re.sub(pattern, '', text)
        return text

    remove_special_characters("Well this was fun! What do you think? 123#@!",
                              remove_digits=True)

Expanding contractions:

    from contractions import CONTRACTION_MAP
    import re

    def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
        contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                          flags=re.IGNORECASE | re.DOTALL)

        def expand_match(contraction):
            match = contraction.group(0)
            first_char = match[0]
            expanded_contraction = contraction_mapping.get(match) \
                if contraction_mapping.get(match) \
                else contraction_mapping.get(match.lower())
            expanded_contraction = first_char + expanded_contraction[1:]
            return expanded_contraction

        expanded_text = contractions_pattern.sub(expand_match, text)
        expanded_text = re.sub("'", "", expanded_text)
        return expanded_text

Contraction list: https://pan.baidu.com/s/1qu44acyb6pwMuUtfBqimig   extraction code: 5rnf
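A minimal usage sketch for the function above, substituting a small hand-written map for the full CONTRACTION_MAP from the download (the map and sentence below are made up for illustration):

    SMALL_MAP = {"can't": "cannot", "i'd": "i would", "y'all": "you all"}

    print(expand_contractions("Y'all can't leave now, I'd say.",
                              contraction_mapping=SMALL_MAP))
    # You all cannot leave now, I would say.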

 

 

 
