
CS224N Exercises — Assignment 1.4: Sentiment Analysis

Assignment #1

4. Sentiment Analysis

Now, using the word vectors you have trained, we will perform a simple sentiment analysis. For each sentence in the Stanford Sentiment Treebank dataset, we use the average of all word vectors in the sentence as its feature, and try to predict the sentiment level of that sentence. In the original dataset the sentiment of each phrase is a real-valued score; here we use only five classes:

"very negative (−−)", "negative (−)", "neutral", "positive (+)", "very positive (++)"

encoded as 0 through 4 in the code. In this part you will train a softmax classifier and work on improving its generalization on the train and dev sets.

(a) Implement sentence featurization. A simple way to represent a sentence is to take the average of the vectors of the words in it.

```python
def getSentenceFeatures(tokens, wordVectors, sentence):
    """
    Obtain the sentence feature for sentiment analysis by averaging its
    word vectors
    """
    # Implement computation for the sentence features given a sentence.
    # Inputs:
    # tokens -- a dictionary that maps words to their indices in
    #           the word vector list
    # wordVectors -- word vectors (each row) for all tokens
    # sentence -- a list of words in the sentence of interest
    # Output:
    # - sentVector: feature vector for the sentence
    sentVector = np.zeros((wordVectors.shape[1],))

    # YOUR CODE HERE
    for word in sentence:
        sentVector += wordVectors[tokens[word]]
    sentVector /= len(sentence)
    # END YOUR CODE

    assert sentVector.shape == (wordVectors.shape[1],)
    return sentVector
```
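A minimal standalone check of the averaging scheme, using a toy three-word vocabulary and 3-dimensional one-hot vectors invented for illustration (the real assignment uses trained vectors and the dataset's token dictionary):

```python
import numpy as np

def getSentenceFeatures(tokens, wordVectors, sentence):
    # Average the vectors of the words appearing in the sentence.
    sentVector = np.zeros((wordVectors.shape[1],))
    for word in sentence:
        sentVector += wordVectors[tokens[word]]
    return sentVector / len(sentence)

tokens = {"the": 0, "movie": 1, "rocks": 2}   # word -> row index
wordVectors = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])
feat = getSentenceFeatures(tokens, wordVectors, ["the", "movie", "rocks"])
print(feat)   # each word contributes equally: [1/3, 1/3, 1/3]
```

Because the vectors are one-hot, the result is simply the normalized word-count vector, which makes the averaging easy to verify by hand.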

(b) Why do we want to introduce regularization when doing classification (as most machine learning tasks do)?

To avoid overfitting the training set, which would hurt generalization to unseen examples.
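The effect of the penalty can be seen in a tiny ridge-regression sketch (numpy only, with made-up data; logistic regression below behaves analogously through its `C` parameter): as the regularization strength grows, the learned weights shrink toward zero, trading training fit for stability on unseen inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # 20 samples, 5 features (synthetic)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=20)

def ridge_weights(X, y, reg):
    """Closed-form L2-regularized least squares: (X^T X + reg*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)

for reg in [0.0, 1.0, 100.0]:
    w = ridge_weights(X, y, reg)
    print("reg=%6.1f  ||w|| = %.3f" % (reg, np.linalg.norm(w)))
```

The printed weight norms decrease as `reg` increases; a heavily penalized model is less able to memorize noise in the training data.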

(c) Fill in the hyperparameter-selection code in q4_sentiment.py to search for the "optimal" regularization parameter, and fill in the code for chooseBestModel. You should be able to reach at least 36.5% accuracy on the dev and test sets using the pretrained vectors from (d).

```python
def getRegularizationValues():
    """Try different regularizations

    Return a sorted list of values to try.
    """
    values = None   # Assign a list of floats in the block below
    # YOUR CODE HERE
    # Search 100 candidate regularization strengths from 10^-4 to 10^2.
    values = np.logspace(-4, 2, num=100, base=10)
    # END YOUR CODE
    return sorted(values)


def chooseBestModel(results):
    """Choose the best model based on dev set performance.

    Arguments:
    results -- A list of python dictionaries of the following format:
        {
            "reg": regularization,
            "clf": classifier,
            "train": trainAccuracy,
            "dev": devAccuracy,
            "test": testAccuracy
        }

    Each dictionary represents the performance of one model.

    Returns:
    Your chosen result dictionary.
    """
    bestResult = None
    # YOUR CODE HERE
    bestResult = max(results, key=lambda x: x['dev'])
    # END YOUR CODE
    return bestResult
```
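A quick sanity check of the selection rule, fed with hypothetical sweep results (all accuracies below are invented for illustration): the model with the highest dev accuracy should win, regardless of its train or test numbers.

```python
def chooseBestModel(results):
    # Pick the entry with the highest dev-set accuracy.
    return max(results, key=lambda x: x["dev"])

# Hypothetical sweep results (accuracies invented for illustration).
results = [
    {"reg": 1e-3, "clf": None, "train": 39.9, "dev": 32.1, "test": 31.5},
    {"reg": 8e-2, "clf": None, "train": 39.2, "dev": 36.6, "test": 36.7},
    {"reg": 1e+1, "clf": None, "train": 33.0, "dev": 30.8, "test": 30.2},
]
best = chooseBestModel(results)
print("best reg:", best["reg"])   # the middle entry wins on dev accuracy
```

Selecting on dev rather than test accuracy is the point of the exercise: the test set is reserved for reporting, not for model selection.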

(d) Using the word vectors from q3, run python q4_sentiment.py --yourvectors to train a model. Then run python q4_sentiment.py --pretrained to train a model using the pretrained GloVe vectors (trained on Wikipedia data). Compare and report the best train, dev, and test accuracies. Why do you think the pretrained vectors do better?

```python
def main(args):
    """ Train a model to do sentiment analyis"""

    # Load the dataset
    dataset = StanfordSentiment()
    tokens = dataset.tokens()
    nWords = len(tokens)

    if args.yourvectors:
        _, wordVectors, _ = load_saved_params()
        wordVectors = np.concatenate(
            (wordVectors[:nWords,:], wordVectors[nWords:,:]),
            axis=1)
    elif args.pretrained:
        wordVectors = glove.loadWordVectors(tokens)
    dimVectors = wordVectors.shape[1]

    # Load the train set
    trainset = dataset.getTrainSentences()
    nTrain = len(trainset)
    trainFeatures = np.zeros((nTrain, dimVectors))
    trainLabels = np.zeros((nTrain,), dtype=np.int32)
    for i in range(nTrain):
        words, trainLabels[i] = trainset[i]
        trainFeatures[i, :] = getSentenceFeatures(tokens, wordVectors, words)

    # Prepare dev set features
    devset = dataset.getDevSentences()
    nDev = len(devset)
    devFeatures = np.zeros((nDev, dimVectors))
    devLabels = np.zeros((nDev,), dtype=np.int32)
    for i in range(nDev):
        words, devLabels[i] = devset[i]
        devFeatures[i, :] = getSentenceFeatures(tokens, wordVectors, words)

    # Prepare test set features
    testset = dataset.getTestSentences()
    nTest = len(testset)
    testFeatures = np.zeros((nTest, dimVectors))
    testLabels = np.zeros((nTest,), dtype=np.int32)
    for i in range(nTest):
        words, testLabels[i] = testset[i]
        testFeatures[i, :] = getSentenceFeatures(tokens, wordVectors, words)

    # We will save our results from each run
    results = []
    regValues = getRegularizationValues()
    for reg in regValues:
        print("Training for reg=%f" % reg)
        # Note: add a very small number to regularization to please the library
        clf = LogisticRegression(C=1.0/(reg + 1e-12), solver='liblinear',
                                 multi_class='ovr')
        clf.fit(trainFeatures, trainLabels)

        # Test on train set
        pred = clf.predict(trainFeatures)
        trainAccuracy = accuracy(trainLabels, pred)
        print("Train accuracy (%%): %f" % trainAccuracy)

        # Test on dev set
        pred = clf.predict(devFeatures)
        devAccuracy = accuracy(devLabels, pred)
        print("Dev accuracy (%%): %f" % devAccuracy)

        # Test on test set
        # Note: always running on test is poor style. Typically, you should
        # do this only after validation.
        pred = clf.predict(testFeatures)
        testAccuracy = accuracy(testLabels, pred)
        print("Test accuracy (%%): %f" % testAccuracy)

        results.append({
            "reg": reg,
            "clf": clf,
            "train": trainAccuracy,
            "dev": devAccuracy,
            "test": testAccuracy})

    # Print the accuracies
    print("")
    print("=== Recap ===")
    print("Reg\t\tTrain\tDev\tTest")
    for result in results:
        print("%.2E\t%.3f\t%.3f\t%.3f" % (
            result["reg"],
            result["train"],
            result["dev"],
            result["test"]))
    print("")

    bestResult = chooseBestModel(results)
    print("Best regularization value: %0.2E" % bestResult["reg"])
    print("Test accuracy (%%): %f" % bestResult["test"])

    # do some error analysis
    if args.pretrained:
        plotRegVsAccuracy(regValues, results, "q4_reg_v_acc.png")
        outputConfusionMatrix(devFeatures, devLabels, bestResult["clf"],
                              "q4_dev_conf.png")
        outputPredictions(devset, devFeatures, devLabels, bestResult["clf"],
                          "q4_dev_pred.txt")


if __name__ == "__main__":
    main(getArguments())
```

Running the sweep yields:

Best regularization value: 8.11E-02
Test accuracy (%): 36.742081
