1. Stop words
- from nltk.corpus import stopwords
-
- print(stopwords.words('english'))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
2. Common corpora
(1) Unannotated corpora
- from nltk.corpus import gutenberg
-
- print(gutenberg.raw('austen-emma.txt'))
(The original text is long; only part of it is shown here.)
(2) Manually annotated corpora: the results of human annotation for a specific task.
- from nltk.corpus import sentence_polarity
-
- # print(sentence_polarity.categories())
- # print(sentence_polarity.words())
- # print(sentence_polarity.sents())
-
- # pair every sentence with its polarity category (pos / neg)
- a = [(sentence, category) for category in sentence_polarity.categories() for sentence in sentence_polarity.sents(categories=category)]
- print(a)
3. Common lexicons
(1) WordNet
- from nltk.corpus import wordnet
-
- syns = wordnet.synsets('bank')  # all synsets (senses) of the word 'bank'
- print(syns[0].name())
- print(syns[0].definition())
- print(syns[1].definition())
- print(syns[0].examples())
- print(syns[0].hypernyms())
-
- dog = wordnet.synset('dog.n.01')
- cat = wordnet.synset('cat.n.01')
- print(dog.wup_similarity(cat))  # Wu-Palmer similarity between the two senses
bank.n.01
sloping land (especially the slope beside a body of water)
a financial institution that accepts deposits and channels the money into lending activities
['they pulled the canoe up on the bank', 'he sat on the bank of the river and watched the currents']
[Synset('slope.n.01')]
0.8571428571428571
(2) SentiWordNet
- from nltk.corpus import sentiwordnet
-
- print(sentiwordnet.senti_synset('good.a.01'))
<good.a.01: PosScore=0.75 NegScore=0.0>
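The three scores can also be read individually through the SentiSynset accessors (pos_score, neg_score, and obj_score are part of NLTK's SentiWordNet API; by definition the three always sum to 1):
- s = sentiwordnet.senti_synset('good.a.01')
- print(s.pos_score(), s.neg_score(), s.obj_score())  # 0.75 0.0 0.25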
1. Sentence splitting
A sentence usually expresses a complete unit of meaning, so before deeper natural language processing it is often necessary to split a long document into sentences; this process is called sentence splitting.
- from nltk.tokenize import sent_tokenize
- from nltk.corpus import gutenberg
-
- text = gutenberg.raw('austen-emma.txt')
- sentences = sent_tokenize(text)
- print(sentences[100])
Mr. Knightley loves to find fault with me, you know--
in a joke--it is all a joke.
2. Tokenization
A sentence is a sequence of tokens, where a token may be a word, a punctuation mark, and so on; tokens are the most basic input units of natural language processing. The process of splitting a sentence into tokens is called tokenization.
- from nltk.tokenize import word_tokenize
- from nltk.tokenize import sent_tokenize
- from nltk.corpus import gutenberg
-
- text = gutenberg.raw('austen-emma.txt')
- sentences = sent_tokenize(text)
- print(word_tokenize(sentences[100]))
['Mr.', 'Knightley', 'loves', 'to', 'find', 'fault', 'with', 'me', ',', 'you', 'know', '--', 'in', 'a', 'joke', '--', 'it', 'is', 'all', 'a', 'joke', '.']
3. Part-of-speech tagging
A part of speech is the grammatical category a word fulfills, such as noun, verb, or adjective; parts of speech are therefore also called word classes. POS tagging determines the specific part of speech of a word from the context it appears in.
- from nltk import pos_tag
- from nltk.tokenize import word_tokenize
-
- results = pos_tag(word_tokenize('They sat by the fire.'))
- print(results)
[('They', 'PRP'), ('sat', 'VBP'), ('by', 'IN'), ('the', 'DT'), ('fire', 'NN'), ('.', '.')]
- results = pos_tag(word_tokenize('They fire a gun.'))
- print(results)
[('They', 'PRP'), ('fire', 'VBP'), ('a', 'DT'), ('gun', 'NN'), ('.', '.')]
- import nltk.help
-
- nltk.help.upenn_tagset('NN')  # explain a single tag
- nltk.help.upenn_tagset('VBP')
- nltk.help.upenn_tagset()  # with no argument, prints the entire tag set
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
4. Other tools
NLTK also provides a rich set of other natural language processing tools, including named entity recognition, chunking, and syntactic parsing (a named-entity-recognition sketch follows the list below). Other general-purpose NLP toolkits include:
CoreNLP
spaCy
……
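As a quick illustration of named entity recognition with NLTK (a minimal sketch; it requires the maxent_ne_chunker and words resources, downloadable via nltk.download):
- import nltk
-
- tokens = nltk.word_tokenize('Mark works at Harbin Institute of Technology.')
- tree = nltk.ne_chunk(nltk.pos_tag(tokens))  # chunk named entities over the POS-tagged tokens
- print(tree)  # entities such as PERSON and ORGANIZATION appear as subtrees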
Sino-Tibetan languages such as Chinese differ from Indo-European languages such as English in one notable respect: there are no explicit delimiters between words, and a sentence is generally a continuous run of characters. Processing Chinese therefore requires more specialized analysis tools.
pip install ltp
- from ltp import LTP
-
- ltp = LTP()  # loads the Small model by default; on first use the model is downloaded automatically
- segment, hidden = ltp.seg(['南京市长江大桥。'])  # segment the sentence into words; segment holds the result, hidden holds per-word hidden vectors for later analysis steps
- print(segment)  # LTP produces the correct segmentation rather than the wrong [['南京', '市长', '江大桥', '。']]
[['南京市', '长江大桥', '。']]
Besides word segmentation, LTP also provides sentence splitting, POS tagging, named entity recognition, dependency parsing, semantic role labeling, and more.
- from ltp import LTP
-
- ltp = LTP()  # loads the Small model by default; on first use the model is downloaded automatically
-
- sentences = ltp.sent_split(['南京市长江大桥。', '汤姆生病了。他去了医院。'])  # sentence splitting
- print(sentences)
-
- segment, hidden = ltp.seg(sentences)
- print(segment)
-
- pos_tags = ltp.pos(hidden)  # POS tagging
- print(pos_tags)  # one tag per word; LTP's tag set differs from NLTK's in places but is broadly similar
['南京市长江大桥。', '汤姆生病了。', '他去了医院。']
[['南京市', '长江大桥', '。'], ['汤姆', '生病', '了', '。'], ['他', '去', '了', '医院', '。']]
[['ns', 'ns', 'wp'], ['nh', 'v', 'u', 'wp'], ['r', 'v', 'u', 'n', 'wp']]
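Under the same LTP API as above, the other analyses are exposed as parallel calls on hidden (a minimal sketch; return formats vary across LTP versions, so treat the shapes described in the comments as assumptions):
- ner = ltp.ner(hidden)  # named entity recognition: per-sentence (type, start, end) spans
- print(ner)
- dep = ltp.dep(hidden)  # dependency parsing: per-sentence (child, head, relation) arcs
- print(dep)
- srl = ltp.srl(hidden)  # semantic role labeling
- print(srl)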
PyTorch is a tensor-based numerical computing toolkit that provides two high-level features: (1) tensor computation with powerful GPU (graphics processing unit) acceleration; (2) automatic differentiation, which makes it possible to optimize model parameters with gradient-based methods.
A tensor is simply a multi-dimensional array: a 2-dimensional tensor is called a matrix, a 1-dimensional tensor a vector, and a 0-dimensional tensor a scalar.
- import torch
-
- print(torch.empty(2, 3))  # create an uninitialized tensor of shape (2, 3)
tensor([[0., 0., 0.],
        [0., 0., 0.]])
- print(torch.rand(2, 3))  # random tensor of shape (2, 3), each value sampled uniformly from [0, 1)
tensor([[0.5289, 0.8055, 0.9490],
        [0.7827, 0.0692, 0.5653]])
- print(torch.randn(2, 3))  # random tensor of shape (2, 3), each value sampled from the standard normal distribution (mean 0, variance 1)
tensor([[ 0.4637,  0.1505, -0.0608],
        [-0.6243,  0.2489, -1.2854]])
- print(torch.zeros(2, 3, dtype=torch.long))  # all-zero tensor of shape (2, 3); dtype sets the data type, here integer
tensor([[0, 0, 0],
        [0, 0, 0]])
- print(torch.zeros(2, 3, dtype=torch.float))  # all-zero tensor of shape (2, 3), single-precision floating point
tensor([[0., 0., 0.],
        [0., 0., 0.]])
- print(torch.tensor([[1.0, 3.8, 2.1], [8.6, 4.0, 2.4]]))  # create a tensor from a Python list
tensor([[1.0000, 3.8000, 2.1000],
        [8.6000, 4.0000, 2.4000]])
- print(torch.arange(10))  # tensor containing the 10 integers 0 through 9
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
All the tensors above live in main memory and are computed on the CPU. To create and compute tensors on the GPU, they must be explicitly placed in GPU memory.
- import torch
-
- print(torch.empty(2, 3).cuda())  # uninitialized tensor of shape (2, 3), moved to the GPU with .cuda()
-
- print(torch.rand(2, 3).to('cuda'))  # uniform random tensor of shape (2, 3), moved with .to('cuda')
-
- print(torch.randn(2, 3, device='cuda'))  # standard-normal random tensor of shape (2, 3), created directly on the GPU
tensor([[-1.6376e+35,         nan,  2.3727e-35],
        [        nan, -6.6785e-21,         nan]], device='cuda:0')
tensor([[0.9834, 0.7833, 0.0906],
        [0.8226, 0.6121, 0.7161]], device='cuda:0')
tensor([[-1.1911, -1.8766, -0.5631],
        [ 0.1184, -0.5241,  1.5375]], device='cuda:0')
- import torch
-
- x = torch.tensor([1, 2, 3], dtype=torch.double)
- y = torch.tensor([4, 5, 6], dtype=torch.double)
-
- print(x + y)
tensor([5., 7., 9.], dtype=torch.float64)
- print(x - y)
tensor([-3., -3., -3.], dtype=torch.float64)
- print(x * y)
tensor([ 4., 10., 18.], dtype=torch.float64)
- print(x / y)
tensor([0.2500, 0.4000, 0.5000], dtype=torch.float64)
- print(x.dot(y))  # dot product of vectors x and y
tensor(32., dtype=torch.float64)
- print(x.sin())  # elementwise sine of x
tensor([0.8415, 0.9093, 0.1411], dtype=torch.float64)
- print(x.exp())  # elementwise exponential e^x
tensor([ 2.7183,  7.3891, 20.0855], dtype=torch.float64)
- print(torch.cat((x, y), dim=0))  # concatenate x and y along dimension 0
tensor([1., 2., 3., 4., 5., 6.], dtype=torch.float64)
- import torch
-
- x = torch.tensor([[1, 2, 3], [1, 2, 3]], dtype=torch.double)
- print(x.mean())  # mean over all elements
- print(x.mean(axis=0))  # mean along dimension 0: averages the rows, one value per column
- print(x.mean(axis=1))  # mean along dimension 1: averages the columns, one value per row
tensor(2., dtype=torch.float64)
tensor([1., 2., 3.], dtype=torch.float64)
tensor([2., 2.], dtype=torch.float64)
- import torch
-
- M = torch.rand(1000, 1000)
- %timeit -n 500 M.mm(M).mm(M)
-
- N = torch.rand(1000, 1000).cuda()
- %timeit -n 500 N.mm(N).mm(N)
The timeit usage printed in the book is not ordinary Python but IPython's %timeit magic, so it only works inside an IPython/Jupyter session. In a plain script, use the standard-library timeit module instead, as shown in Exercise 6 below.
Computing gradients with PyTorch is very easy: just call tensor.backward(), and backpropagation completes the computation automatically.
To compute the derivative of a function with respect to a variable, PyTorch requires that the variable (tensor) be explicitly marked as requiring gradients; by default tensors are not differentiable. This is done by setting requires_grad=True when the tensor is created.
- import torch
-
- x = torch.tensor([2.], requires_grad=True)
- y = torch.tensor([3.], requires_grad=True)
- z = (x + y) * (y - 2)
- print(z)
-
- z.backward()  # run backpropagation from z
- print(x.grad, y.grad)  # dz/dx = y - 2 = 1; dz/dy = (y - 2) + (x + y) = 6
tensor([5.], grad_fn=<MulBackward0>)
tensor([1.]) tensor([6.])
- import torch
-
- x = torch.tensor([1, 2, 3, 4, 5, 6])
- print(x, x.shape)
-
- x = x.view(2, 3)  # reshape x to (2, 3)
- print(x)
-
- x = x.reshape(2, 3)
- print(x)
-
- x = x.view(3, 2)  # reshape x to (3, 2)
- print(x)
-
- x = x.reshape(3, 2)
- print(x)
-
- x = x.view(-1, 3)  # the size at the -1 position is inferred from the other dimensions, here 2
- print(x)
-
- x = x.reshape(-1, 3)
- print(x)
tensor([1, 2, 3, 4, 5, 6]) torch.Size([6])
tensor([[1, 2, 3],
        [4, 5, 6]])
tensor([[1, 2, 3],
        [4, 5, 6]])
tensor([[1, 2],
        [3, 4],
        [5, 6]])
tensor([[1, 2],
        [3, 4],
        [5, 6]])
tensor([[1, 2, 3],
        [4, 5, 6]])
tensor([[1, 2, 3],
        [4, 5, 6]])
- import torch
-
- x = torch.tensor([[1, 2, 3], [4, 5, 6]])
- print(x)
-
- x = x.transpose(0, 1)  # swap dimensions 0 and 1
- print(x)
tensor([[1, 2, 3],
        [4, 5, 6]])
tensor([[1, 4],
        [2, 5],
        [3, 6]])
- import torch
-
- x = torch.tensor([[[1, 2, 3], [4, 5, 6]]])
- print(x, x.shape)
-
- x = x.permute(2, 0, 1)  # reorder the dimensions: old dimension 2 first, then dimensions 0 and 1
- print(x, x.shape)
tensor([[[1, 2, 3],
         [4, 5, 6]]]) torch.Size([1, 2, 3])
tensor([[[1, 4]],

        [[2, 5]],

        [[3, 6]]]) torch.Size([3, 1, 2])
Broadcasting: when the shapes of two tensors differ, PyTorch automatically expands dimensions of size 1 so that elementwise operations can still be applied, as the following example shows.
- import torch
-
- x = torch.arange(1, 4).view(3, 1)
- y = torch.arange(4, 6).view(1, 2)
-
- print(x)
- print(y)
-
- print(x + y)  # x of shape (3, 1) and y of shape (1, 2) broadcast to a common shape (3, 2)
tensor([[1],
        [2],
        [3]])
tensor([[4, 5]])
tensor([[5, 6],
        [6, 7],
        [7, 8]])
PyTorch can index or slice a tensor along any of its dimensions.
- import torch
-
- x = torch.arange(12).view(3, 4)
- print(x)
-
- print(x[1, 3])  # element at row 2, column 4 (the value 7)
-
- print(x[1])  # all elements of row 2
-
- print(x[1:3])  # rows 2 and 3
-
- print(x[:, 2])  # all elements of column 3
-
- print(x[:, 2:4])  # columns 3 and 4
-
- x[:, 2:4] = 100  # assign 100 to every element of columns 3 and 4
- print(x)
tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
tensor(7)
tensor([4, 5, 6, 7])
tensor([[ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
tensor([ 2,  6, 10])
tensor([[ 2,  3],
        [ 6,  7],
        [10, 11]])
tensor([[  0,   1, 100, 100],
        [  4,   5, 100, 100],
        [  8,   9, 100, 100]])
Specifically, raising the number of dimensions is done with torch.unsqueeze(input, dim, out=None), which inserts a dimension of size 1 at position dim of the input tensor and returns a new tensor. As with indexing, dim may also be negative.
Lowering the number of dimensions is the opposite: torch.squeeze(input, dim=None, out=None) removes, when dim is not specified, every dimension of size 1 from the tensor.
- import torch
-
- a = torch.tensor([1, 2, 3, 4])
- print(a.shape)
-
- b = torch.unsqueeze(a, dim=0)  # insert a new dimension at position 0 of a
- print(b, b.shape)  # print b and its shape
-
- b = a.unsqueeze(dim=0)  # an equivalent way to call unsqueeze
- print(b, b.shape)
-
- c = b.squeeze()  # squeeze b, removing all dimensions of size 1
- print(c, c.shape)
torch.Size([4])
tensor([[1, 2, 3, 4]]) torch.Size([1, 4])
tensor([[1, 2, 3, 4]]) torch.Size([1, 4])
tensor([1, 2, 3, 4]) torch.Size([4])
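When dim is given, only that position is affected, and only if its size is 1; otherwise the tensor is returned unchanged. A quick check, reusing the b of shape (1, 4) from above:
- print(b.squeeze(dim=0).shape)  # torch.Size([4]): dimension 0 has size 1 and is removed
- print(b.squeeze(dim=1).shape)  # torch.Size([1, 4]): dimension 1 has size 4, so nothing changes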
Pre-trained language models learn semantic information from massive amounts of text; as the corpus grows, the resulting statistics become more accurate, which benefits the learning of text representations. High-quality, large-scale pre-training data is therefore indispensable for training better pre-trained models.
(Omitted)
(Omitted)
1. Extracting a plain-text corpus
pip install wikiextractor
python -m wikiextractor.WikiExtractor <Wikipedia dump file>
This step failed for me on Windows; after looking into it, I switched to Ubuntu and it ran to completion.
2. Converting Traditional Chinese to Simplified Chinese
pip install opencc
python convert_t2s.py input_file > output_file
- import sys
- import opencc
-
- converter = opencc.OpenCC('t2s.json')  # load the Traditional-to-Simplified conversion config
- f_in = open(sys.argv[1], 'r')  # input file (sys.argv[1], not sys.argv[0], which is the script name)
- for line in f_in.readlines():
-     line = line.strip()
-     line_t2s = converter.convert(line)
-     print(line_t2s)
3. Data cleaning
Data cleaning should preserve the statistical characteristics of natural text as much as possible; the "right" and "wrong" in the text are left for the model to learn, rather than being resolved through excessive manual intervention.
python wikidata_cleaning.py input_file > output_file
- import sys
- import re
-
-
- def remove_empty_paired_punc(in_str):
-     return in_str.replace('()', '').replace('《》', '').replace('【】', '').replace('[]', '')
-
-
- def remove_html_tags(in_str):
-     html_pattern = re.compile(r'<[^>]+>', re.S)
-     return html_pattern.sub('', in_str)
-
-
- def remove_control_chars(in_str):
-     # unichr and range concatenation are Python 2; under Python 3 use chr and list()
-     control_chars = ''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))
-     control_chars = re.compile('[%s]' % re.escape(control_chars))
-     return control_chars.sub('', in_str)
-
-
- f_in = open(sys.argv[1], 'r')  # input file
- for line in f_in.readlines():
-     line = line.strip()
-     if re.search(r'^(<doc id)|(</doc>)', line):  # pass wikiextractor's <doc> markers through unchanged
-         print(line)
-         continue
-
-     line = remove_empty_paired_punc(line)
-     line = remove_html_tags(line)
-     line = remove_control_chars(line)
-     print(line)
(Omitted)
The Hugging Face datasets library provides unified access to a large collection of datasets and evaluation metrics:
- from pprint import pprint
- from datasets import list_datasets, load_dataset
-
- datasets_list = list_datasets()  # names of all datasets available on the Hub
- print(len(datasets_list))
-
- dataset = load_dataset('sst', split='train')  # load the training split of the Stanford Sentiment Treebank
- print(len(dataset))
-
- pprint(dataset[0])  # pprint prints directly; wrapping it in print() would additionally print None
- from datasets import list_metrics, load_metric
-
- metrics_list = list_metrics()
- print(len(metrics_list))
- print(','.join(metrics_list))
-
- accuracy_metric = load_metric('accuracy')
- results = accuracy_metric.compute(references=[0, 1, 0], predictions=[1, 1, 0])  # the keyword is references (plural)
- print(results)
1. Use NLTK to download the text of Jane Austen's novel Emma and remove the stop words from it.
- import nltk
-
- emma = nltk.corpus.gutenberg.words('austen-emma.txt')
- stopwords = nltk.corpus.stopwords.words('english')
- print(stopwords)
-
- emma = [w.lower() for w in emma]  # lowercase so words match the lowercase stop word list
- emma_without_stopwords = [w for w in emma if w not in stopwords]
- print(emma_without_stopwords)
2. Use the WordNet provided by NLTK to compute the similarity of two words (not word senses), defined as the maximum similarity over all pairs of their senses.
- from nltk.corpus import wordnet
-
- word1 = 'dog'
- word2 = 'cat'
- word1_synsets = wordnet.synsets(word1)
- word2_synsets = wordnet.synsets(word2)
-
- result = max([w1.path_similarity(w2) for w1 in word1_synsets for w2 in word2_synsets])
- print(result)
0.2
3. Use NLTK's SentiWordNet tool to compute the sentiment polarity of a sentence, defined as the sum, over the words, of the sentiment of each word's senses under its part of speech.
- import nltk
-
- sentence = ['welcome', 'to', 'harbin', 'institute', 'of', 'technology']
-
- sentence_tag = nltk.pos_tag(sentence)
- tag_map = {'NN': 'n', 'NNP': 'n', 'NNS': 'n', 'UH': 'n', 'VB': 'v', 'VBD': 'v', 'VBG': 'v', 'VBN': 'v', 'VBZ': 'v', 'JJ': 'a', 'JJR': 'a', 'JJS': 'a', 'RB': 'r', 'RBR': 'r', 'RBS': 'r', 'RP': 'r', 'WRB': 'r'}
- sentence_tag = [(t[0], tag_map[t[1]]) if t[1] in tag_map else (t[0], '') for t in sentence_tag]
- sentiment_synsets = [list(nltk.corpus.sentiwordnet.senti_synsets(t[0], t[1])) for t in sentence_tag]
- score = sum(sum([x.pos_score() - x.neg_score() for x in s]) / len(s) for s in sentiment_synsets if len(s) != 0)  # average polarity of each word's senses, summed over words
- print(score)
0.3125
4. Use real text to compare LTP with forward maximum matching segmentation, and analyze by hand which results LTP gets right but forward maximum matching gets wrong, which LTP gets wrong but forward maximum matching gets right, and which both get wrong. (A forward-maximum-matching sketch for the comparison follows the LTP output below.)
- import ltp
-
- sentence = ['南京市长江大桥', '行行行', '结婚的和尚未结婚的确实在干扰分词啊']
- ltp_model = ltp.LTP()
- segment, _ = ltp_model.seg(sentence)
- print(segment)
[['南京市', '长江大桥'], ['行行行'], ['结婚', '的和尚未', '结婚', '的确实在', '干扰分词', '啊']]
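For the comparison, here is a minimal sketch of forward maximum matching; the toy dictionary below is an assumption chosen for these test sentences, and a real lexicon would give different results:
- # Forward maximum matching: scan left to right, greedily taking the longest dictionary word
- def fmm(sentence, dictionary, max_len=5):
-     words, i = [], 0
-     while i < len(sentence):
-         for j in range(min(len(sentence), i + max_len), i, -1):
-             if sentence[i:j] in dictionary or j == i + 1:  # fall back to a single character
-                 words.append(sentence[i:j])
-                 i = j
-                 break
-     return words
-
- dictionary = {'南京', '南京市', '市长', '长江', '长江大桥', '结婚', '的', '和尚', '尚未', '确实', '干扰', '分词'}
- for s in sentence:  # reuse the test sentences from the LTP example above
-     print(fmm(s, dictionary))  # compare by hand against the LTP output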
5. Analyze which kinds of problems each of the four shape-changing methods, view, reshape, transpose, and permute, is best suited to.
- import torch
-
- a = torch.randn(2, 3, 4)
- print(a)
- print(a.shape)
-
- a = a.permute(1, 0, 2)
- print(a.shape)
-
- # after permute, a is no longer contiguous in memory, so view() raises an error:
- print(a.view(2, 3, 4))
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
- print(a.reshape(2, 3, 4).shape)  # reshape copies the data when needed, so it succeeds
-
- a = torch.rand(3, 4, 5)
- b = a.transpose(0, 1).transpose(1, 2)  # transpose swaps exactly two dimensions, so reordering three requires chaining
- print(b.shape)
-
- c = a.permute(1, 2, 0)  # permute reorders all dimensions in one call
- print(c.shape)
torch.Size([2, 3, 4])
torch.Size([4, 5, 3])
torch.Size([4, 5, 3])
In short: view is cheapest but requires contiguous memory and never copies; reshape behaves like view when possible and silently copies otherwise; transpose swaps exactly two dimensions; permute reorders any number of dimensions at once.
6. Install PyTorch and compare the efficiency of multiplying three large tensors with and without a GPU.
- import timeit
-
- it1 = timeit.Timer('M.mm(M).mm(M)', 'import torch\nM = torch.rand(1000, 1000)')
- t1 = it1.timeit(500) / 500
- print('{:.4f}ms'.format(t1 * 1000))
-
- it2 = timeit.Timer('M.mm(M).mm(M)', 'import torch\nM = torch.rand(1000, 1000).cuda()')  # same three-matrix product as above, for a fair comparison
- t2 = it2.timeit(500) / 500
- print('{:.4f}ms'.format(t2 * 1000))  # note: CUDA kernels launch asynchronously, so without torch.cuda.synchronize() this slightly understates GPU time
29.2605ms
0.9476ms
7. Download the latest Common Crawl data and implement Chinese extraction, deduplication, Traditional-to-Simplified conversion, and data cleaning.
Chinese extraction:
- import re
-
-
- def translate(str):
-     # Chinese characters fall in the range \u4e00-\u9fa5; keep them plus digits and whitespace
-     pattern = re.compile('[^\u4e00-\u9fa50-9\s]')
-     str = re.sub(pattern, '', str)
-     return str
-
-
- print(translate('你好hello哈哈哈'))
Deduplication:
- import sys
-
- f_in = open(sys.argv[1], 'r')  # input file
- lines_dic = {}
- for line in f_in.readlines():
-     line = line.strip()
-     hashcode = hash(line)
-     if hashcode not in lines_dic.keys():
-         lines_dic[hashcode] = [line]
-         print(line)  # first time this line is seen: keep it
-     elif line not in lines_dic[hashcode]:
-         lines_dic[hashcode].append(line)
-         print(line)  # hash collision with a different line: still keep it
-
- f_in.close()
Traditional-to-Simplified conversion:
- import sys
- import opencc
-
- converter = opencc.OpenCC('t2s.json')  # load the Traditional-to-Simplified conversion config
- f_in = open(sys.argv[1], 'r')  # input file
- for line in f_in.readlines():
-     line = line.strip()
-     line_t2s = converter.convert(line)
-     print(line_t2s)
Data cleaning:
- import re
-
-
- def remove_empty_paired_punc(in_str):
-     return in_str.replace('()', '').replace('《》', '').replace('【】', '').replace('[]', '')
-
-
- def remove_html_tag(in_str):
-     html_pattern = re.compile(r'<[^>]+>', re.S)
-     return html_pattern.sub('', in_str)
-
-
- def remove_control_chars(in_str):
-     control_chars = ''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))
-     control_chars = re.compile('[%s]' % re.escape(control_chars))
-     return control_chars.sub('', in_str)
The final data preprocessing script is as follows:
- import re
- import sys
- import opencc
-
-
- def translate(str):
-     # Chinese characters fall in the range \u4e00-\u9fa5
-     pattern = re.compile('[^\u4e00-\u9fa50-9\s]')
-     str = re.sub(pattern, '', str)
-     return str
-
-
- def remove_empty_paired_punc(in_str):
-     return in_str.replace('()', '').replace('《》', '').replace('【】', '').replace('[]', '')
-
-
- def remove_html_tag(in_str):
-     html_pattern = re.compile(r'<[^>]+>', re.S)
-     return html_pattern.sub('', in_str)
-
-
- def remove_control_chars(in_str):
-     control_chars = ''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))
-     control_chars = re.compile('[%s]' % re.escape(control_chars))
-     return control_chars.sub('', in_str)
-
-
- converter = opencc.OpenCC('t2s.json')  # load the Traditional-to-Simplified conversion config
- f_in = open(sys.argv[1], 'r')  # input file
- lines_dic = {}
- for line in f_in.readlines():
-     line = line.strip()
-     # data cleaning
-     line = remove_empty_paired_punc(line)
-     line = remove_html_tag(line)
-     line = remove_control_chars(line)
-     # Chinese extraction
-     line = translate(line)
-     # Traditional-to-Simplified conversion
-     line = converter.convert(line)
-     # deduplication
-     hashcode = hash(line)
-     if hashcode not in lines_dic.keys():
-         lines_dic[hashcode] = [line]
-         print(line)
-     elif line not in lines_dic[hashcode]:
-         lines_dic[hashcode].append(line)
-         print(line)