Now, using the word vectors you trained, we will perform a simple sentiment analysis. For each sentence in the Stanford Sentiment Treebank dataset, we use the average of all of its word vectors as the sentence feature and try to predict the sentiment level of that sentence. In the original dataset the sentiment of each phrase is given as a real value; here we use only five classes:
“very negative (−−)”, “negative (−)”, “neutral”, “positive (+)”, “very positive (++)”
which are encoded as 0 through 4 in the code. In this part you will train a softmax classifier and use the train and dev sets to improve its generalization.
(a) Implement sentence featurization. A simple way to represent a sentence is to take the average of the vectors of the words it contains:
```python
import numpy as np


def getSentenceFeatures(tokens, wordVectors, sentence):
    """
    Obtain the sentence feature for sentiment analysis by averaging its
    word vectors
    """

    # Implement computation for the sentence features given a sentence.

    # Inputs:
    # tokens -- a dictionary that maps words to their indices in
    #           the word vector list
    # wordVectors -- word vectors (each row) for all tokens
    # sentence -- a list of words in the sentence of interest

    # Output:
    # - sentVector: feature vector for the sentence

    sentVector = np.zeros((wordVectors.shape[1],))

    # YOUR CODE HERE
    # Sum the vectors of all words in the sentence, then divide by the
    # sentence length to get the mean vector.
    for word in sentence:
        sentVector += wordVectors[tokens[word]]
    sentVector /= len(sentence)
    # END YOUR CODE

    assert sentVector.shape == (wordVectors.shape[1],)
    return sentVector
```
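A minimal usage sketch of the averaging (the toy vocabulary and 2-D vectors below are invented purely for illustration; in the assignment, `tokens` and `wordVectors` come from `StanfordSentiment` and the trained or loaded embeddings):

```python
import numpy as np

# Hypothetical toy vocabulary and 2-D word vectors, just to show the averaging.
tokens = {"the": 0, "movie": 1, "rocks": 2}
wordVectors = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [1.0, 1.0]])

feat = getSentenceFeatures(tokens, wordVectors, ["the", "movie", "rocks"])
print(feat)  # ~[0.667 0.667] -- the element-wise mean of the three word vectors
```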
(b) Why do we want to introduce regularization when doing classification (as is done in most machine learning tasks)?

To avoid overfitting the training data and generalizing poorly to unseen examples.
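For reference, a standard L2-regularized cross-entropy objective (the general form, not something specific to this assignment's starter code) adds a penalty on the weight magnitudes:

$$
J(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\log p\left(y^{(i)} \mid x^{(i)};\theta\right) + \frac{\lambda}{2}\lVert\theta\rVert_2^2
$$

Larger λ shrinks the weights more aggressively. scikit-learn's LogisticRegression is parameterized by the inverse regularization strength C, which is why the training loop in part (d) passes C=1.0/(reg + 1e-12).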
(c) Fill in the hyperparameter-selection code in q4_sentiment.py to search for the "optimal" regularization parameter, and fill in the code for chooseBestModel. You should be able to reach at least 36.5% accuracy on both the dev and test sets using the pretrained vectors from part (d).
```python
def getRegularizationValues():
    """Try different regularizations
    Return a sorted list of values to try.
    """
    values = None  # Assign a list of floats in the block below
    # YOUR CODE HERE
    # Search 100 log-spaced values from 10^-4 to 10^2 for the best
    # regularization coefficient.
    values = np.logspace(-4, 2, num=100, base=10)
    # END YOUR CODE
    return sorted(values)


def chooseBestModel(results):
    """Choose the best model based on dev set performance.
    Arguments:
    results -- A list of python dictionaries of the following format:
        {
            "reg": regularization,
            "clf": classifier,
            "train": trainAccuracy,
            "dev": devAccuracy,
            "test": testAccuracy
        }
    Each dictionary represents the performance of one model.
    Returns:
    Your chosen result dictionary.
    """
    bestResult = None

    # YOUR CODE HERE
    # Pick the model with the highest dev-set accuracy.
    bestResult = max(results, key=lambda x: x['dev'])
    # END YOUR CODE

    return bestResult
```
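A quick sanity check of the selection logic (the accuracy numbers below are invented purely for illustration):

```python
# Hypothetical results from three regularization settings.
toy_results = [
    {"reg": 1e-3, "clf": None, "train": 0.40, "dev": 0.33, "test": 0.34},
    {"reg": 1e-1, "clf": None, "train": 0.39, "dev": 0.37, "test": 0.36},
    {"reg": 1e+1, "clf": None, "train": 0.30, "dev": 0.29, "test": 0.28},
]

best = chooseBestModel(toy_results)
print(best["reg"])  # 0.1 -- the entry with the highest dev accuracy is chosen
```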
(d) Run python q4_sentiment.py --yourvectors to train a model with the word vectors from q3. Then run python q4_sentiment.py --pretrained to train a model with the pretrained GloVe vectors (trained on Wikipedia data). Compare and report the best train, dev, and test accuracies. Why do you think the pretrained vectors did better?
```python
def main(args):
    """ Train a model to do sentiment analysis"""

    # Load the dataset
    dataset = StanfordSentiment()
    tokens = dataset.tokens()
    nWords = len(tokens)

    if args.yourvectors:
        _, wordVectors, _ = load_saved_params()
        wordVectors = np.concatenate(
            (wordVectors[:nWords,:], wordVectors[nWords:,:]),
            axis=1)
    elif args.pretrained:
        wordVectors = glove.loadWordVectors(tokens)
    dimVectors = wordVectors.shape[1]

    # Load the train set
    trainset = dataset.getTrainSentences()
    nTrain = len(trainset)
    trainFeatures = np.zeros((nTrain, dimVectors))
    trainLabels = np.zeros((nTrain,), dtype=np.int32)
    for i in range(nTrain):
        words, trainLabels[i] = trainset[i]
        trainFeatures[i, :] = getSentenceFeatures(tokens, wordVectors, words)

    # Prepare dev set features
    devset = dataset.getDevSentences()
    nDev = len(devset)
    devFeatures = np.zeros((nDev, dimVectors))
    devLabels = np.zeros((nDev,), dtype=np.int32)
    for i in range(nDev):
        words, devLabels[i] = devset[i]
        devFeatures[i, :] = getSentenceFeatures(tokens, wordVectors, words)

    # Prepare test set features
    testset = dataset.getTestSentences()
    nTest = len(testset)
    testFeatures = np.zeros((nTest, dimVectors))
    testLabels = np.zeros((nTest,), dtype=np.int32)
    for i in range(nTest):
        words, testLabels[i] = testset[i]
        testFeatures[i, :] = getSentenceFeatures(tokens, wordVectors, words)

    # We will save our results from each run
    results = []
    regValues = getRegularizationValues()
    for reg in regValues:
        print("Training for reg=%f" % reg)
        # Note: add a very small number to regularization to please the library
        clf = LogisticRegression(C=1.0/(reg + 1e-12), solver='liblinear',
                                 multi_class='ovr')
        clf.fit(trainFeatures, trainLabels)

        # Test on train set
        pred = clf.predict(trainFeatures)
        trainAccuracy = accuracy(trainLabels, pred)
        print("Train accuracy (%%): %f" % trainAccuracy)

        # Test on dev set
        pred = clf.predict(devFeatures)
        devAccuracy = accuracy(devLabels, pred)
        print("Dev accuracy (%%): %f" % devAccuracy)

        # Test on test set
        # Note: always running on test is poor style. Typically, you should
        # do this only after validation.
        pred = clf.predict(testFeatures)
        testAccuracy = accuracy(testLabels, pred)
        print("Test accuracy (%%): %f" % testAccuracy)

        results.append({
            "reg": reg,
            "clf": clf,
            "train": trainAccuracy,
            "dev": devAccuracy,
            "test": testAccuracy})

    # Print the accuracies
    print("")
    print("=== Recap ===")
    print("Reg\t\tTrain\tDev\tTest")
    for result in results:
        print("%.2E\t%.3f\t%.3f\t%.3f" % (
            result["reg"],
            result["train"],
            result["dev"],
            result["test"]))
    print("")

    bestResult = chooseBestModel(results)
    print("Best regularization value: %0.2E" % bestResult["reg"])
    print("Test accuracy (%%): %f" % bestResult["test"])

    # do some error analysis
    if args.pretrained:
        plotRegVsAccuracy(regValues, results, "q4_reg_v_acc.png")
        outputConfusionMatrix(devFeatures, devLabels, bestResult["clf"],
                              "q4_dev_conf.png")
        outputPredictions(devset, devFeatures, devLabels, bestResult["clf"],
                          "q4_dev_pred.txt")


if __name__ == "__main__":
    main(getArguments())
```
Best regularization value: 8.11E-02
Test accuracy (%): 36.742081