Understand and put language models into practice.
1. Implement language models (uni-gram and bi-gram) in Python, adding a smoothing technique.
2. Compute the PPL of the sentences in test.txt and compare the performance of the uni-gram and bi-gram language models.
Problem 1: What operations are needed for data preprocessing?
Solution 1: After searching online, a simple preprocessing pipeline is to lowercase all the English text, remove all punctuation except the apostrophe ("'"), and tokenize with nltk.tokenize.word_tokenize as required.
Problem 2: Punctuation is part of the text; does it need to be removed?
Solution 2: Based on references and some thought, punctuation carries very little semantic information, and removing it also improves efficiency and reduces noise, so it should be removed.
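As a quick illustration of this preprocessing, here is a minimal sketch; the sample sentence is made up, but the regex and tokenizer are the same ones used in the code below:
import re
from nltk.tokenize import word_tokenize

sample = "It's 9 o'clock, let's GO!"                # hypothetical sample sentence, not from the dataset
cleaned = re.sub(r"[^\w\s']", "", sample).lower()   # remove punctuation except the apostrophe, lowercase
print(word_tokenize(cleaned))                       # roughly: ['it', "'s", '9', "o'clock", 'let', "'s", 'go']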
Problem 3: How exactly is add-one smoothing done: build the vocabulary directly from the training set and add one to every word's count, or first add the test set's unseen words to the vocabulary and then add one to every count?
Solution 3: Based on references and some thought, build the vocabulary directly from the training set, add one to each word's count, and compute each word's probability from that. For out-of-vocabulary (OOV) words that appear in the test set, substitute a small probability, usually P = \frac{1}{N+V}, where N is the total number of words in the training set and V is the vocabulary size.
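For example, with hypothetical numbers: if the training set contains N = 100000 tokens and the vocabulary has V = 8000 entries, every OOV word in the test set is assigned probability 1/(100000 + 8000) ≈ 9.26 × 10^{-6}.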
Problem 4: Multiplying the word probabilities together when computing a sentence's probability causes floating-point underflow.
Solution 4: Convert the product into a sum of logarithms. The perplexity formula used is PP = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_{2}P(w_{i})}, where P(w_{i}) is the probability of an individual word, w_{i} is the i-th word of the sentence, and N is the length of the sentence.
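To see the effect with made-up numbers: for a 3-word sentence whose word probabilities are 0.1, 0.01, and 0.2, the log sum is \log_{2}0.1 + \log_{2}0.01 + \log_{2}0.2 ≈ -3.32 - 6.64 - 2.32 = -12.28, so PP = 2^{12.28/3} ≈ 17.1, which equals 1/(0.1 · 0.01 · 0.2)^{1/3} but is computed without ever multiplying the tiny probabilities together.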
Problem 5: If a text contains several sentences, how should its perplexity be evaluated?
Solution 5: For a text made up of several sentences, the text-level perplexity is usually taken to be the average perplexity, i.e. the mean of the per-sentence perplexities.
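For example, with made-up values: if three sentences have perplexities 20, 35, and 50, the text-level perplexity is (20 + 35 + 50) / 3 = 35.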
Problem 6: How should add-one smoothing be applied in the bi-gram model?
Solution 6: According to online references, the smoothed bigram probability is P(w_{i}|w_{i-1}) = \frac{C(w_{i-1},w_{i})+1}{C(w_{i-1})+V}, where C(\cdot) is the number of times the word / word pair occurs in the training data and V is the number of words in the vocabulary. If the preceding word w_{i-1} is not in the training set, use 1/(number of distinct preceding words) instead.
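For instance, with made-up counts: if C(w_{i-1}, w_{i}) = 3, C(w_{i-1}) = 50, and V = 8000, the smoothed probability is (3 + 1) / (50 + 8000) = 4/8050 ≈ 4.97 × 10^{-4}.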
I. uni_gram:
1. Data preprocessing. Remove punctuation (except "'", the English apostrophe, since it is very likely part of a word), lowercase every sentence, and split sentences into words. Input: the training text; output: a list of words.
# Data preprocessing: turn the text into a list of words
def preprocess_text(text):
    sentences = text.split("__eou__")  # split into sentences
    sentences.pop()
    words = []
    for sentence in sentences:
        sentence = re.sub(r"[^\w\s']", "", sentence).lower()  # remove punctuation, lowercase
        words += word_tokenize(sentence)  # tokenize
    return words
2. Build the vocabulary: use the word list to count how many times each unique word occurs. Output: a dict whose entries have the form {word: count}.
# Build the vocabulary; vocab is a dict whose entries have the form {word: count}
def build_vocab(words):
    vocab = Counter(words)
    return vocab
3. Compute probabilities with add-one smoothing: for each word in the vocabulary, compute its probability of occurring in the corpus. With add-one smoothing, every word's count is incremented by one, so the probability is (count of the word + 1) / (total number of words in the corpus + number of words in the vocabulary).
# Compute unigram probabilities (add-one smoothing)
def calculate_unigram_probs(vocab, total_words):
    unigram_probs = {}
    for word, count in vocab.items():
        unigram_probs[word] = (count + 1) / (total_words + len(vocab))
    return unigram_probs
4. Process the test text. Remove punctuation (except "'"), lowercase every sentence, and split sentences into words. Input: the test text; output: a two-dimensional list, where each sentence's words are stored in a list and the sentence lists are stored in a text list.
# Data preprocessing: turn the text into a 2-D list; the text list stores sentence lists, and each sentence list stores that sentence's words
def preprocess_text2(text):
    sentences = text.split("__eou__")
    sentences.pop()
    text = []
    for sentence in sentences:
        sentence = re.sub(r"[^\w\s']", "", sentence).lower()
        text.append(word_tokenize(sentence))
    return text
5. Compute perplexity. The perplexity formula is given in Solution 4; the OOV formula is given in Solution 3.
# Compute per-sentence perplexity
def sentence_perplexity(text, unigram_probs, vocab, total_words):
    perplexity = []
    for sentence in text:
        prob = 0
        for word in sentence:
            if word in unigram_probs:
                prob += log2(unigram_probs[word])
            else:
                prob += log2(1 / (len(vocab) + total_words))  # probability for an unknown word
        perplexity.append(pow(2, -(prob / len(sentence))))
    return perplexity
6. Evaluate the perplexity of the whole text. The method is given in Solution 5.
# Evaluate text-level perplexity
def text_perplexity(perplexity):
    return sum(perplexity) / len(perplexity)
II. bi_gram:
1. Preprocess the data and build the vocabulary, prepending "<beg>" to every sentence and appending "</end>".
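The marker insertion, taken from the complete bi_gram code at the end of this post, looks like this:
# Preprocess the text into a 2-D list, adding sentence boundary markers
def preprocess_text2(text):
    sentences = text.split("__eou__")
    sentences.pop()
    words = []
    for sentence in sentences:
        sentence = re.sub(r"[^\w\s']", "", sentence).lower()
        # wrap every sentence with "<beg>" and "</end>" so bigrams can model sentence starts and ends
        words.append(["<beg>"] + word_tokenize(sentence) + ["</end>"])
    return words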
2. Turn the training set into a 2-D list and count bigram frequencies. Output: a two-level dict whose first-level key is the preceding word and whose second-level key is the following word.
# Compute bigram counts
def calculate_bigram(text):
    bigram_counts = defaultdict(dict)
    for sentence in text:
        for i in range(len(sentence) - 1):
            if sentence[i + 1] not in bigram_counts[sentence[i]]:
                bigram_counts[sentence[i]][sentence[i + 1]] = 1
            else:
                bigram_counts[sentence[i]][sentence[i + 1]] += 1
    return bigram_counts
3. Compute probabilities with add-one smoothing. Each word pair's probability is computed as described in Solution 6.
# Compute bigram probabilities (add-one smoothing)
def calculate_bigram_probs(bigram_counts, vocab):
    bigram_probs = defaultdict(dict)
    for prev_word, followers in bigram_counts.items():
        for back_word, count in followers.items():
            bigram_probs[prev_word][back_word] = (count + 1) / (
                vocab[prev_word] + len(vocab))
    return bigram_probs
4. Process the test text.
5. Compute per-sentence perplexity. OOV words are handled as described in Solution 6.
# Compute per-sentence perplexity
def sentence_perplexity(text, bigram_probs, vocab, bigram_counts):
    perplexity = []
    for sentence in text:
        prob = 0
        for i in range(len(sentence) - 1):
            if sentence[i] not in vocab:  # w1 is OOV: use 1 / (number of distinct preceding words)
                prob += log2(1 / len(bigram_counts))
            elif sentence[i + 1] not in bigram_probs[sentence[i]]:  # w1 is known but the pair (w1, w2) was never seen (w2 may be OOV)
                prob += log2(1 / (vocab[sentence[i]] + len(vocab)))
            else:  # the pair was seen in training
                prob += log2(bigram_probs[sentence[i]][sentence[i + 1]])
        perplexity.append(pow(2, -(prob / (len(sentence) - 1))))
    return perplexity
6. Evaluate the perplexity of the whole text.
1. Perplexity when using the unigram model.
2. Perplexity when using the bigram model.
3. The bigram model performs better.
Complete uni_gram code:
from nltk.tokenize import word_tokenize
from collections import Counter
from math import log2
import re

# Data preprocessing: turn the text into a list of words
def preprocess_text(text):
    sentences = text.split("__eou__")  # split into sentences
    sentences.pop()
    words = []
    for sentence in sentences:
        sentence = re.sub(r"[^\w\s']", "", sentence).lower()  # remove punctuation, lowercase
        words += word_tokenize(sentence)  # tokenize
    return words

# Data preprocessing: turn the text into a 2-D list; the text list stores sentence lists, and each sentence list stores that sentence's words
def preprocess_text2(text):
    sentences = text.split("__eou__")
    sentences.pop()
    text = []
    for sentence in sentences:
        sentence = re.sub(r"[^\w\s']", "", sentence).lower()
        text.append(word_tokenize(sentence))
    return text

# Build the vocabulary; vocab is a dict whose entries have the form {word: count}
def build_vocab(words):
    vocab = Counter(words)
    return vocab

# Compute unigram probabilities (add-one smoothing)
def calculate_unigram_probs(vocab, total_words):
    unigram_probs = {}
    for word, count in vocab.items():
        unigram_probs[word] = (count + 1) / (total_words + len(vocab))
    return unigram_probs

# Compute per-sentence perplexity
def sentence_perplexity(text, unigram_probs, vocab, total_words):
    perplexity = []
    for sentence in text:
        prob = 0
        for word in sentence:
            if word in unigram_probs:
                prob += log2(unigram_probs[word])
            else:
                prob += log2(1 / (len(vocab) + total_words))  # probability for an unknown word
        perplexity.append(pow(2, -(prob / len(sentence))))
    return perplexity

# Evaluate text-level perplexity
def text_perplexity(perplexity):
    return sum(perplexity) / len(perplexity)

# Load the data
with open("train_LM.txt", "r", encoding="utf-8") as file:
    train_text = file.read()
with open("test_LM.txt", "r", encoding="utf-8") as file:
    test_text = file.read()

words = preprocess_text(train_text)  # word list
vocab = build_vocab(words)  # vocabulary
unigram_probs = calculate_unigram_probs(vocab, len(words))  # unigram probabilities
test_text = preprocess_text2(test_text)  # 2-D text list
perplexity = sentence_perplexity(test_text, unigram_probs, vocab, len(words))  # list of per-sentence perplexities
test_perplexity = text_perplexity(perplexity)  # text perplexity
print(test_perplexity)
Complete bi_gram code:
from nltk.tokenize import word_tokenize
from collections import Counter, defaultdict
from math import log2
import re

# Preprocess the text into a flat word list (used to build the vocabulary)
def preprocess_text(text):
    sentences = text.split("__eou__")
    sentences.pop()
    words = []
    for sentence in sentences:
        sentence = re.sub(r"[^\w\s']", "", sentence).lower()
        words += word_tokenize(sentence)
        words.append("<beg>")
    return words

# Preprocess the text into a 2-D list, adding sentence boundary markers
def preprocess_text2(text):
    sentences = text.split("__eou__")
    sentences.pop()
    words = []
    for sentence in sentences:
        sentence = re.sub(r"[^\w\s']", "", sentence).lower()
        words.append(["<beg>"] + word_tokenize(sentence) + ["</end>"])
    return words

# Build the vocabulary
def build_vocab(words):
    vocab = Counter(words)
    return vocab

# Compute bigram counts
def calculate_bigram(text):
    bigram_counts = defaultdict(dict)
    for sentence in text:
        for i in range(len(sentence) - 1):
            if sentence[i + 1] not in bigram_counts[sentence[i]]:
                bigram_counts[sentence[i]][sentence[i + 1]] = 1
            else:
                bigram_counts[sentence[i]][sentence[i + 1]] += 1
    return bigram_counts

# Compute bigram probabilities (add-one smoothing)
def calculate_bigram_probs(bigram_counts, vocab):
    bigram_probs = defaultdict(dict)
    for prev_word, followers in bigram_counts.items():
        for back_word, count in followers.items():
            bigram_probs[prev_word][back_word] = (count + 1) / (
                vocab[prev_word] + len(vocab))
    return bigram_probs

# Compute per-sentence perplexity
def sentence_perplexity(text, bigram_probs, vocab, bigram_counts):
    perplexity = []
    for sentence in text:
        prob = 0
        for i in range(len(sentence) - 1):
            if sentence[i] not in vocab:  # w1 is OOV: use 1 / (number of distinct preceding words)
                prob += log2(1 / len(bigram_counts))
            elif sentence[i + 1] not in bigram_probs[sentence[i]]:  # w1 is known but the pair (w1, w2) was never seen (w2 may be OOV)
                prob += log2(1 / (vocab[sentence[i]] + len(vocab)))
            else:  # the pair was seen in training
                prob += log2(bigram_probs[sentence[i]][sentence[i + 1]])
        perplexity.append(pow(2, -(prob / (len(sentence) - 1))))
    return perplexity

# Evaluate text-level perplexity
def text_perplexity(perplexity):
    return sum(perplexity) / len(perplexity)

# Load the data
with open("train_LM.txt", "r", encoding="utf-8") as file:
    text = file.read()
with open("test_LM.txt", "r", encoding="utf-8") as file:
    test_text = file.read()

words = preprocess_text(text)
vocab = build_vocab(words)
train_text = preprocess_text2(text)
bigram_counts = calculate_bigram(train_text)
bigram_probs = calculate_bigram_probs(bigram_counts, vocab)
test_text = preprocess_text2(test_text)
perplexity = sentence_perplexity(test_text, bigram_probs, vocab, bigram_counts)
test_perplexity = text_perplexity(perplexity)
print(test_perplexity)