Gensim for NLP: Memory Requirements and Evaluation of word2vec Models

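This article is part of a series. The earlier installments are:
  • Gensim for NLP: basic features and installation
  • Gensim for NLP: a Word2Vec usage walkthrough
  • Gensim for NLP: training your own word2vec embeddings
  • Gensim for NLP: parameter settings for training word2vec embeddings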

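1. Memory requirements

word2vec stores its model parameters as NumPy arrays.

Each array has shape (vocabulary size, vector dimensionality):

  • The vocabulary size is controlled by min_count: words that occur fewer than min_count times are dropped.
  • The vector dimensionality is controlled by size (renamed vector_size in Gensim 4.x).

So each array holds len(vocab) * size parameters.

Each parameter is a single-precision float (32 bits), which takes 4 bytes in memory.

Three such matrices are held in RAM at the same time.

So with a vocabulary of 100,000 words and 200-dimensional vectors, the memory required is:

100,000 * 200 * 4 * 3 = 240,000,000 bytes, or about 229 MB (counting 1 MB as 2^20 bytes)

A little extra memory is needed to store the vocabulary itself, but that is basically negligible.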
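
The arithmetic above can be wrapped in a small helper for trying other vocabulary sizes and dimensionalities. This is only a sketch: the function name and its parameters are made up for this article and are not part of Gensim, and the result ignores the (small) overhead of the vocabulary itself.

```python
def estimate_word2vec_memory(vocab_size, vector_size, bytes_per_param=4, n_matrices=3):
    """Rough RAM estimate in bytes for the parameter matrices only.

    Assumes float32 parameters (4 bytes each) and three matrices of shape
    (vocab_size, vector_size) held in memory at once, as described above.
    """
    return vocab_size * vector_size * bytes_per_param * n_matrices


n_bytes = estimate_word2vec_memory(100_000, 200)
print(f"{n_bytes:,} bytes = {n_bytes / 2**20:.1f} MiB")
# 240,000,000 bytes = 228.9 MiB
```

Raising min_count shrinks the vocabulary and therefore the estimate linearly; the same goes for lowering the vector dimensionality.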

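2. Model evaluation

Training a Word2Vec model is an unsupervised task, so there is no objective measure of how accurate the result is.

Evaluation depends on the end application.

Google has released a test set of about 20,000 syntactic and semantic examples that follow the pattern "A is to B as C is to D".

For example, a syntactic analogy of the comparative type:

bad : worse ; good : ?

The dataset contains 9 types of syntactic analogy, including noun plurals and opposites.

The semantic questions cover 5 types of analogy, for example:

Capital cities (Paris : France ; Tokyo : ?)

Family members (brother : sister ; dad : ?)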
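
To make the analogy task concrete, here is a minimal sketch of how such a question is usually answered with word vectors: compute B - A + C and pick the nearest remaining word by cosine similarity. The 3-dimensional vectors below are invented for illustration, and the helper names are not Gensim functions. Gensim's own implementation differs in details (for example, it works on normalized vectors).

```python
import math

# Hypothetical toy embeddings, invented for this illustration.
# Real word2vec vectors are learned from a corpus and have many more dimensions.
toy_vectors = {
    "paris":   (1.0, 1.0, 0.0),
    "france":  (0.0, 1.0, 0.0),
    "tokyo":   (1.0, 0.0, 1.0),
    "japan":   (0.0, 0.0, 1.0),
    "berlin":  (1.0, 0.1, 0.1),
    "germany": (0.0, 0.7, 0.7),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a : b ; c : ?' with the word closest to b - a + c."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    # The three question words are excluded from the candidates.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(solve_analogy(toy_vectors, "paris", "france", "tokyo"))  # japan
```

A question counts as correct only when the top-ranked word equals the expected fourth word, which is why this benchmark is strict for models trained on small corpora.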

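Gensim supports the same evaluation set, in the same file format. The call below uses the accuracy method available in older Gensim releases; newer releases offer evaluate_word_analogies for the same purpose.

model.wv.accuracy('./datasets/questions-words.txt')

Result: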

[{'section': 'capital-common-countries',
  'correct': [],
  'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
   ('CANBERRA', 'AUSTRALIA', 'PARIS', 'FRANCE'),
   ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'),
   ('KABUL', 'AFGHANISTAN', 'CANBERRA', 'AUSTRALIA'),
   ('PARIS', 'FRANCE', 'CANBERRA', 'AUSTRALIA'),
   ('PARIS', 'FRANCE', 'KABUL', 'AFGHANISTAN')]},
 {'section': 'capital-world',
  'correct': [],
  'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
   ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE')]},
 {'section': 'currency', 'correct': [], 'incorrect': []},
 {'section': 'city-in-state', 'correct': [], 'incorrect': []},
 {'section': 'family',
  'correct': [],
  'incorrect': [('HE', 'SHE', 'HIS', 'HER'),
   ('HE', 'SHE', 'MAN', 'WOMAN'),
   ('HIS', 'HER', 'MAN', 'WOMAN'),
   ('HIS', 'HER', 'HE', 'SHE'),
   ('MAN', 'WOMAN', 'HE', 'SHE'),
   ('MAN', 'WOMAN', 'HIS', 'HER')]},
 {'section': 'gram1-adjective-to-adverb', 'correct': [], 'incorrect': []},
 {'section': 'gram2-opposite', 'correct': [], 'incorrect': []},
 {'section': 'gram3-comparative',
  'correct': [],
  'incorrect': [('GOOD', 'BETTER', 'GREAT', 'GREATER'),
   ('GOOD', 'BETTER', 'LONG', 'LONGER'),
   ('GOOD', 'BETTER', 'LOW', 'LOWER'),
   ('GOOD', 'BETTER', 'SMALL', 'SMALLER'),
   ('GREAT', 'GREATER', 'LONG', 'LONGER'),
   ('GREAT', 'GREATER', 'LOW', 'LOWER'),
   ('GREAT', 'GREATER', 'SMALL', 'SMALLER'),
   ('GREAT', 'GREATER', 'GOOD', 'BETTER'),
   ('LONG', 'LONGER', 'LOW', 'LOWER'),
   ('LONG', 'LONGER', 'SMALL', 'SMALLER'),
   ('LONG', 'LONGER', 'GOOD', 'BETTER'),
   ('LONG', 'LONGER', 'GREAT', 'GREATER'),
   ('LOW', 'LOWER', 'SMALL', 'SMALLER'),
   ('LOW', 'LOWER', 'GOOD', 'BETTER'),
   ('LOW', 'LOWER', 'GREAT', 'GREATER'),
   ('LOW', 'LOWER', 'LONG', 'LONGER'),
   ('SMALL', 'SMALLER', 'GOOD', 'BETTER'),
   ('SMALL', 'SMALLER', 'GREAT', 'GREATER'),
   ('SMALL', 'SMALLER', 'LONG', 'LONGER'),
   ('SMALL', 'SMALLER', 'LOW', 'LOWER')]},
 {'section': 'gram4-superlative',
  'correct': [],
  'incorrect': [('BIG', 'BIGGEST', 'GOOD', 'BEST'),
   ('BIG', 'BIGGEST', 'GREAT', 'GREATEST'),
   ('BIG', 'BIGGEST', 'LARGE', 'LARGEST'),
   ('GOOD', 'BEST', 'GREAT', 'GREATEST'),
   ('GOOD', 'BEST', 'LARGE', 'LARGEST'),
   ('GOOD', 'BEST', 'BIG', 'BIGGEST'),
   ('GREAT', 'GREATEST', 'LARGE', 'LARGEST'),
   ('GREAT', 'GREATEST', 'BIG', 'BIGGEST'),
   ('GREAT', 'GREATEST', 'GOOD', 'BEST'),
   ('LARGE', 'LARGEST', 'BIG', 'BIGGEST'),
   ('LARGE', 'LARGEST', 'GOOD', 'BEST'),
   ('LARGE', 'LARGEST', 'GREAT', 'GREATEST')]},
 {'section': 'gram5-present-participle',
  'correct': [],
  'incorrect': [('GO', 'GOING', 'LOOK', 'LOOKING'),
   ('GO', 'GOING', 'PLAY', 'PLAYING'),
   ('GO', 'GOING', 'RUN', 'RUNNING'),
   ('GO', 'GOING', 'SAY', 'SAYING'),
   ('LOOK', 'LOOKING', 'PLAY', 'PLAYING'),
   ('LOOK', 'LOOKING', 'RUN', 'RUNNING'),
   ('LOOK', 'LOOKING', 'SAY', 'SAYING'),
   ('LOOK', 'LOOKING', 'GO', 'GOING'),
   ('PLAY', 'PLAYING', 'RUN', 'RUNNING'),
   ('PLAY', 'PLAYING', 'SAY', 'SAYING'),
   ('PLAY', 'PLAYING', 'GO', 'GOING'),
   ('PLAY', 'PLAYING', 'LOOK', 'LOOKING'),
   ('RUN', 'RUNNING', 'SAY', 'SAYING'),
   ('RUN', 'RUNNING', 'GO', 'GOING'),
   ('RUN', 'RUNNING', 'LOOK', 'LOOKING'),
   ('RUN', 'RUNNING', 'PLAY', 'PLAYING'),
   ('SAY', 'SAYING', 'GO', 'GOING'),
   ('SAY', 'SAYING', 'LOOK', 'LOOKING'),
   ('SAY', 'SAYING', 'PLAY', 'PLAYING'),
   ('SAY', 'SAYING', 'RUN', 'RUNNING')]},
 {'section': 'gram6-nationality-adjective',
  'correct': [('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN')],
  'incorrect': [('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),
   ('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),
   ('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'),
   ('AUSTRALIA', 'AUSTRALIAN', 'JAPAN', 'JAPANESE'),
   ('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'),
   ('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'),
   ('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'),
   ('FRANCE', 'FRENCH', 'JAPAN', 'JAPANESE'),
   ('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'),
   ('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'),
   ('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'),
   ('INDIA', 'INDIAN', 'JAPAN', 'JAPANESE'),
   ('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'),
   ('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'),
   ('ISRAEL', 'ISRAELI', 'JAPAN', 'JAPANESE'),
   ('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'),
   ('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'),
   ('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'),
   ('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'),
   ('JAPAN', 'JAPANESE', 'SWITZERLAND', 'SWISS'),
   ('JAPAN', 'JAPANESE', 'AUSTRALIA', 'AUSTRALIAN'),
   ('JAPAN', 'JAPANESE', 'FRANCE', 'FRENCH'),
   ('JAPAN', 'JAPANESE', 'INDIA', 'INDIAN'),
   ('JAPAN', 'JAPANESE', 'ISRAEL', 'ISRAELI'),
   ('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'),
   ('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'),
   ('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'),
   ('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'),
   ('SWITZERLAND', 'SWISS', 'JAPAN', 'JAPANESE')]},
 {'section': 'gram7-past-tense',
  'correct': [],
  'incorrect': [('GOING', 'WENT', 'PAYING', 'PAID'),
   ('GOING', 'WENT', 'PLAYING', 'PLAYED'),
   ('GOING', 'WENT', 'SAYING', 'SAID'),
   ('GOING', 'WENT', 'TAKING', 'TOOK'),
   ('PAYING', 'PAID', 'PLAYING', 'PLAYED'),
   ('PAYING', 'PAID', 'SAYING', 'SAID'),
   ('PAYING', 'PAID', 'TAKING', 'TOOK'),
   ('PAYING', 'PAID', 'GOING', 'WENT'),
   ('PLAYING', 'PLAYED', 'SAYING', 'SAID'),
   ('PLAYING', 'PLAYED', 'TAKING', 'TOOK'),
   ('PLAYING', 'PLAYED', 'GOING', 'WENT'),
   ('PLAYING', 'PLAYED', 'PAYING', 'PAID'),
   ('SAYING', 'SAID', 'TAKING', 'TOOK'),
   ('SAYING', 'SAID', 'GOING', 'WENT'),
   ('SAYING', 'SAID', 'PAYING', 'PAID'),
   ('SAYING', 'SAID', 'PLAYING', 'PLAYED'),
   ('TAKING', 'TOOK', 'GOING', 'WENT'),
   ('TAKING', 'TOOK', 'PAYING', 'PAID'),
   ('TAKING', 'TOOK', 'PLAYING', 'PLAYED'),
   ('TAKING', 'TOOK', 'SAYING', 'SAID')]},
 {'section': 'gram8-plural',
  'correct': [],
  'incorrect': [('BUILDING', 'BUILDINGS', 'CAR', 'CARS'),
   ('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'),
   ('BUILDING', 'BUILDINGS', 'MAN', 'MEN'),
   ('BUILDING', 'BUILDINGS', 'ROAD', 'ROADS'),
   ('BUILDING', 'BUILDINGS', 'WOMAN', 'WOMEN'),
   ('CAR', 'CARS', 'CHILD', 'CHILDREN'),
   ('CAR', 'CARS', 'MAN', 'MEN'),
   ('CAR', 'CARS', 'ROAD', 'ROADS'),
   ('CAR', 'CARS', 'WOMAN', 'WOMEN'),
   ('CAR', 'CARS', 'BUILDING', 'BUILDINGS'),
   ('CHILD', 'CHILDREN', 'MAN', 'MEN'),
   ('CHILD', 'CHILDREN', 'ROAD', 'ROADS'),
   ('CHILD', 'CHILDREN', 'WOMAN', 'WOMEN'),
   ('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'),
   ('CHILD', 'CHILDREN', 'CAR', 'CARS'),
   ('MAN', 'MEN', 'ROAD', 'ROADS'),
   ('MAN', 'MEN', 'WOMAN', 'WOMEN'),
   ('MAN', 'MEN', 'BUILDING', 'BUILDINGS'),
   ('MAN', 'MEN', 'CAR', 'CARS'),
   ('MAN', 'MEN', 'CHILD', 'CHILDREN'),
   ('ROAD', 'ROADS', 'WOMAN', 'WOMEN'),
   ('ROAD', 'ROADS', 'BUILDING', 'BUILDINGS'),
   ('ROAD', 'ROADS', 'CAR', 'CARS'),
   ('ROAD', 'ROADS', 'CHILD', 'CHILDREN'),
   ('ROAD', 'ROADS', 'MAN', 'MEN'),
   ('WOMAN', 'WOMEN', 'BUILDING', 'BUILDINGS'),
   ('WOMAN', 'WOMEN', 'CAR', 'CARS'),
   ('WOMAN', 'WOMEN', 'CHILD', 'CHILDREN'),
   ('WOMAN', 'WOMEN', 'MAN', 'MEN'),
   ('WOMAN', 'WOMEN', 'ROAD', 'ROADS')]},
 {'section': 'gram9-plural-verbs', 'correct': [], 'incorrect': []},
 {'section': 'total',
  'correct': [('INDIA', 'INDIAN', 'AUSTRALIA', 'AUSTRALIAN')],
  'incorrect': [('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
   ('CANBERRA', 'AUSTRALIA', 'PARIS', 'FRANCE'),
   ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'),
   ('KABUL', 'AFGHANISTAN', 'CANBERRA', 'AUSTRALIA'),
   ('PARIS', 'FRANCE', 'CANBERRA', 'AUSTRALIA'),
   ('PARIS', 'FRANCE', 'KABUL', 'AFGHANISTAN'),
   ('CANBERRA', 'AUSTRALIA', 'KABUL', 'AFGHANISTAN'),
   ('KABUL', 'AFGHANISTAN', 'PARIS', 'FRANCE'),
   ('HE', 'SHE', 'HIS', 'HER'),
   ('HE', 'SHE', 'MAN', 'WOMAN'),
   ('HIS', 'HER', 'MAN', 'WOMAN'),
   ('HIS', 'HER', 'HE', 'SHE'),
   ('MAN', 'WOMAN', 'HE', 'SHE'),
   ('MAN', 'WOMAN', 'HIS', 'HER'),
   ('GOOD', 'BETTER', 'GREAT', 'GREATER'),
   ('GOOD', 'BETTER', 'LONG', 'LONGER'),
   ('GOOD', 'BETTER', 'LOW', 'LOWER'),
   ('GOOD', 'BETTER', 'SMALL', 'SMALLER'),
   ('GREAT', 'GREATER', 'LONG', 'LONGER'),
   ('GREAT', 'GREATER', 'LOW', 'LOWER'),
   ('GREAT', 'GREATER', 'SMALL', 'SMALLER'),
   ('GREAT', 'GREATER', 'GOOD', 'BETTER'),
   ('LONG', 'LONGER', 'LOW', 'LOWER'),
   ('LONG', 'LONGER', 'SMALL', 'SMALLER'),
   ('LONG', 'LONGER', 'GOOD', 'BETTER'),
   ('LONG', 'LONGER', 'GREAT', 'GREATER'),
   ('LOW', 'LOWER', 'SMALL', 'SMALLER'),
   ('LOW', 'LOWER', 'GOOD', 'BETTER'),
   ('LOW', 'LOWER', 'GREAT', 'GREATER'),
   ('LOW', 'LOWER', 'LONG', 'LONGER'),
   ('SMALL', 'SMALLER', 'GOOD', 'BETTER'),
   ('SMALL', 'SMALLER', 'GREAT', 'GREATER'),
   ('SMALL', 'SMALLER', 'LONG', 'LONGER'),
   ('SMALL', 'SMALLER', 'LOW', 'LOWER'),
   ('BIG', 'BIGGEST', 'GOOD', 'BEST'),
   ('BIG', 'BIGGEST', 'GREAT', 'GREATEST'),
   ('BIG', 'BIGGEST', 'LARGE', 'LARGEST'),
   ('GOOD', 'BEST', 'GREAT', 'GREATEST'),
   ('GOOD', 'BEST', 'LARGE', 'LARGEST'),
   ('GOOD', 'BEST', 'BIG', 'BIGGEST'),
   ('GREAT', 'GREATEST', 'LARGE', 'LARGEST'),
   ('GREAT', 'GREATEST', 'BIG', 'BIGGEST'),
   ('GREAT', 'GREATEST', 'GOOD', 'BEST'),
   ('LARGE', 'LARGEST', 'BIG', 'BIGGEST'),
   ('LARGE', 'LARGEST', 'GOOD', 'BEST'),
   ('LARGE', 'LARGEST', 'GREAT', 'GREATEST'),
   ('GO', 'GOING', 'LOOK', 'LOOKING'),
   ('GO', 'GOING', 'PLAY', 'PLAYING'),
   ('GO', 'GOING', 'RUN', 'RUNNING'),
   ('GO', 'GOING', 'SAY', 'SAYING'),
   ('LOOK', 'LOOKING', 'PLAY', 'PLAYING'),
   ('LOOK', 'LOOKING', 'RUN', 'RUNNING'),
   ('LOOK', 'LOOKING', 'SAY', 'SAYING'),
   ('LOOK', 'LOOKING', 'GO', 'GOING'),
   ('PLAY', 'PLAYING', 'RUN', 'RUNNING'),
   ('PLAY', 'PLAYING', 'SAY', 'SAYING'),
   ('PLAY', 'PLAYING', 'GO', 'GOING'),
   ('PLAY', 'PLAYING', 'LOOK', 'LOOKING'),
   ('RUN', 'RUNNING', 'SAY', 'SAYING'),
   ('RUN', 'RUNNING', 'GO', 'GOING'),
   ('RUN', 'RUNNING', 'LOOK', 'LOOKING'),
   ('RUN', 'RUNNING', 'PLAY', 'PLAYING'),
   ('SAY', 'SAYING', 'GO', 'GOING'),
   ('SAY', 'SAYING', 'LOOK', 'LOOKING'),
   ('SAY', 'SAYING', 'PLAY', 'PLAYING'),
   ('SAY', 'SAYING', 'RUN', 'RUNNING'),
   ('AUSTRALIA', 'AUSTRALIAN', 'FRANCE', 'FRENCH'),
   ('AUSTRALIA', 'AUSTRALIAN', 'INDIA', 'INDIAN'),
   ('AUSTRALIA', 'AUSTRALIAN', 'ISRAEL', 'ISRAELI'),
   ('AUSTRALIA', 'AUSTRALIAN', 'JAPAN', 'JAPANESE'),
   ('AUSTRALIA', 'AUSTRALIAN', 'SWITZERLAND', 'SWISS'),
   ('FRANCE', 'FRENCH', 'INDIA', 'INDIAN'),
   ('FRANCE', 'FRENCH', 'ISRAEL', 'ISRAELI'),
   ('FRANCE', 'FRENCH', 'JAPAN', 'JAPANESE'),
   ('FRANCE', 'FRENCH', 'SWITZERLAND', 'SWISS'),
   ('FRANCE', 'FRENCH', 'AUSTRALIA', 'AUSTRALIAN'),
   ('INDIA', 'INDIAN', 'ISRAEL', 'ISRAELI'),
   ('INDIA', 'INDIAN', 'JAPAN', 'JAPANESE'),
   ('INDIA', 'INDIAN', 'SWITZERLAND', 'SWISS'),
   ('INDIA', 'INDIAN', 'FRANCE', 'FRENCH'),
   ('ISRAEL', 'ISRAELI', 'JAPAN', 'JAPANESE'),
   ('ISRAEL', 'ISRAELI', 'SWITZERLAND', 'SWISS'),
   ('ISRAEL', 'ISRAELI', 'AUSTRALIA', 'AUSTRALIAN'),
   ('ISRAEL', 'ISRAELI', 'FRANCE', 'FRENCH'),
   ('ISRAEL', 'ISRAELI', 'INDIA', 'INDIAN'),
   ('JAPAN', 'JAPANESE', 'SWITZERLAND', 'SWISS'),
   ('JAPAN', 'JAPANESE', 'AUSTRALIA', 'AUSTRALIAN'),
   ('JAPAN', 'JAPANESE', 'FRANCE', 'FRENCH'),
   ('JAPAN', 'JAPANESE', 'INDIA', 'INDIAN'),
   ('JAPAN', 'JAPANESE', 'ISRAEL', 'ISRAELI'),
   ('SWITZERLAND', 'SWISS', 'AUSTRALIA', 'AUSTRALIAN'),
   ('SWITZERLAND', 'SWISS', 'FRANCE', 'FRENCH'),
   ('SWITZERLAND', 'SWISS', 'INDIA', 'INDIAN'),
   ('SWITZERLAND', 'SWISS', 'ISRAEL', 'ISRAELI'),
   ('SWITZERLAND', 'SWISS', 'JAPAN', 'JAPANESE'),
   ('GOING', 'WENT', 'PAYING', 'PAID'),
   ('GOING', 'WENT', 'PLAYING', 'PLAYED'),
   ('GOING', 'WENT', 'SAYING', 'SAID'),
   ('GOING', 'WENT', 'TAKING', 'TOOK'),
   ('PAYING', 'PAID', 'PLAYING', 'PLAYED'),
   ('PAYING', 'PAID', 'SAYING', 'SAID'),
   ('PAYING', 'PAID', 'TAKING', 'TOOK'),
   ('PAYING', 'PAID', 'GOING', 'WENT'),
   ('PLAYING', 'PLAYED', 'SAYING', 'SAID'),
   ('PLAYING', 'PLAYED', 'TAKING', 'TOOK'),
   ('PLAYING', 'PLAYED', 'GOING', 'WENT'),
   ('PLAYING', 'PLAYED', 'PAYING', 'PAID'),
   ('SAYING', 'SAID', 'TAKING', 'TOOK'),
   ('SAYING', 'SAID', 'GOING', 'WENT'),
   ('SAYING', 'SAID', 'PAYING', 'PAID'),
   ('SAYING', 'SAID', 'PLAYING', 'PLAYED'),
   ('TAKING', 'TOOK', 'GOING', 'WENT'),
   ('TAKING', 'TOOK', 'PAYING', 'PAID'),
   ('TAKING', 'TOOK', 'PLAYING', 'PLAYED'),
   ('TAKING', 'TOOK', 'SAYING', 'SAID'),
   ('BUILDING', 'BUILDINGS', 'CAR', 'CARS'),
   ('BUILDING', 'BUILDINGS', 'CHILD', 'CHILDREN'),
   ('BUILDING', 'BUILDINGS', 'MAN', 'MEN'),
   ('BUILDING', 'BUILDINGS', 'ROAD', 'ROADS'),
   ('BUILDING', 'BUILDINGS', 'WOMAN', 'WOMEN'),
   ('CAR', 'CARS', 'CHILD', 'CHILDREN'),
   ('CAR', 'CARS', 'MAN', 'MEN'),
   ('CAR', 'CARS', 'ROAD', 'ROADS'),
   ('CAR', 'CARS', 'WOMAN', 'WOMEN'),
   ('CAR', 'CARS', 'BUILDING', 'BUILDINGS'),
   ('CHILD', 'CHILDREN', 'MAN', 'MEN'),
   ('CHILD', 'CHILDREN', 'ROAD', 'ROADS'),
   ('CHILD', 'CHILDREN', 'WOMAN', 'WOMEN'),
   ('CHILD', 'CHILDREN', 'BUILDING', 'BUILDINGS'),
   ('CHILD', 'CHILDREN', 'CAR', 'CARS'),
   ('MAN', 'MEN', 'ROAD', 'ROADS'),
   ('MAN', 'MEN', 'WOMAN', 'WOMEN'),
   ('MAN', 'MEN', 'BUILDING', 'BUILDINGS'),
   ('MAN', 'MEN', 'CAR', 'CARS'),
   ('MAN', 'MEN', 'CHILD', 'CHILDREN'),
   ('ROAD', 'ROADS', 'WOMAN', 'WOMEN'),
   ('ROAD', 'ROADS', 'BUILDING', 'BUILDINGS'),
   ('ROAD', 'ROADS', 'CAR', 'CARS'),
   ('ROAD', 'ROADS', 'CHILD', 'CHILDREN'),
   ('ROAD', 'ROADS', 'MAN', 'MEN'),
   ('WOMAN', 'WOMEN', 'BUILDING', 'BUILDINGS'),
   ('WOMAN', 'WOMEN', 'CAR', 'CARS'),
   ('WOMAN', 'WOMEN', 'CHILD', 'CHILDREN'),
   ('WOMAN', 'WOMEN', 'MAN', 'MEN'),
   ('WOMAN', 'WOMEN', 'ROAD', 'ROADS')]}]

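The results are poor: across all sections, only 1 of the 146 analogies that were evaluated was answered correctly. The most likely cause is that the training corpus used earlier in this series is small.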
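
The raw output is hard to read at a glance. The sketch below tallies it per section, assuming the structure shown above (a list of dicts with 'section', 'correct' and 'incorrect' keys). The function name is made up for this article, and the sample input is cut down to two entries for brevity, so its counts differ from the real run.

```python
def summarize_sections(sections):
    """Map each section name to (n_correct, n_total, accuracy or None)."""
    summary = {}
    for sec in sections:
        n_correct = len(sec["correct"])
        n_total = n_correct + len(sec["incorrect"])
        # Sections with no evaluated questions have no meaningful accuracy.
        accuracy = n_correct / n_total if n_total else None
        summary[sec["section"]] = (n_correct, n_total, accuracy)
    return summary


# Truncated, illustrative input in the same shape as the output above.
sample = [
    {"section": "currency", "correct": [], "incorrect": []},
    {"section": "gram6-nationality-adjective",
     "correct": [("INDIA", "INDIAN", "AUSTRALIA", "AUSTRALIAN")],
     "incorrect": [("AUSTRALIA", "AUSTRALIAN", "FRANCE", "FRENCH")]},
]

for name, (n_correct, n_total, accuracy) in summarize_sections(sample).items():
    shown = "n/a" if accuracy is None else f"{accuracy:.1%}"
    print(f"{name}: {n_correct}/{n_total} ({shown})")
```

Empty sections such as currency mean that none of their questions could be evaluated at all, which is different from getting every question wrong.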

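This accuracy measure takes an optional parameter, restrict_vocab, which limits which test examples are considered: only the restrict_vocab most frequent words in the model count as known, and questions containing any other word are skipped.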
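
The effect of restrict_vocab can be sketched in a few lines. This is a simplified illustration of the filtering idea, not Gensim's actual code; the function name, the word list and the questions are all hypothetical.

```python
def restrict_questions(questions, words_by_frequency, restrict_vocab):
    """Keep only questions whose four words are all among the top-N frequent words."""
    allowed = set(words_by_frequency[:restrict_vocab])
    return [q for q in questions if all(word in allowed for word in q)]


# Hypothetical vocabulary, ordered from most to least frequent.
words_by_frequency = ["good", "better", "great", "greater", "low", "lower"]

questions = [
    ("good", "better", "great", "greater"),
    ("good", "better", "low", "lower"),
]

kept = restrict_questions(questions, words_by_frequency, restrict_vocab=4)
print(kept)  # [('good', 'better', 'great', 'greater')]
```

A smaller restrict_vocab keeps the test focused on frequent words, whose vectors tend to be better trained, at the cost of evaluating fewer questions.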

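In a 2016 release, Gensim added a better way to evaluate semantic similarity.

By default it uses an academic dataset, WS-353, but you can also build a dataset focused on a particular domain along the same lines.

The dataset contains word pairs together with human-assigned similarity scores, which measure how related the two words are, or how likely they are to occur together.

For example, coast and shore are very similar: the two words often appear in the same passage.

By contrast, clothes and closet are less similar: the two words are related, but they are not interchangeable.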

from gensim.test.utils import datapath  # resolves the path of test data bundled with Gensim
model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))

Result:

((0.1952515342533469, 0.13490728041580877),
 SpearmanrResult(correlation=0.19127414318530173, pvalue=0.14319638687965558),
 83.0028328611898)

Return values:

  • pearson (tuple of (float, float)) – Pearson correlation coefficient with 2-tailed p-value.
  • spearman (tuple of (float, float)) – Spearman rank-order correlation coefficient between the similarities from the dataset and the similarities produced by the model itself, with 2-tailed p-value.
  • oov_ratio (float) – The ratio of pairs with unknown words, reported as a percentage (83.0 in the output above means that about 83% of the pairs contain a word the model does not know).

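So the scores above are poor as well: both correlations are weak (about 0.19, with p-values around 0.14), and about 83% of the word pairs contain out-of-vocabulary words. Again, the small training corpus is the most likely cause.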
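
For readers who want to see where these three numbers come from, here is a pure-Python sketch of the same kind of evaluation: score each known pair by cosine similarity, then correlate the model's scores with the human scores. It is a simplified illustration, not Gensim's implementation; it does not compute p-values, and the vectors and human scores below are invented rather than taken from WS-353.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def evaluate_pairs(vectors, pairs):
    """Return (pearson, spearman, oov_ratio in percent) for (w1, w2, score) pairs."""
    gold, predicted, oov = [], [], 0
    for w1, w2, human_score in pairs:
        if w1 not in vectors or w2 not in vectors:
            oov += 1  # pairs with an unknown word are skipped
            continue
        gold.append(human_score)
        predicted.append(cosine(vectors[w1], vectors[w2]))
    return pearson(gold, predicted), spearman(gold, predicted), 100.0 * oov / len(pairs)


# Hypothetical 2-dimensional vectors and made-up human scores.
toy_vectors = {"coast": (1.0, 0.1), "shore": (0.9, 0.2),
               "clothes": (0.2, 1.0), "closet": (0.6, 0.8)}
toy_pairs = [("coast", "shore", 9.0), ("clothes", "closet", 8.0),
             ("coast", "closet", 2.0), ("coast", "asylum", 1.0)]

p, s, oov_ratio = evaluate_pairs(toy_vectors, toy_pairs)
print(round(s, 6), oov_ratio)  # 1.0 25.0
```

In this toy case the model ranks the three known pairs in the same order as the human scores, so the Spearman coefficient is 1, and one of the four pairs is out of vocabulary, giving an oov_ratio of 25.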

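!!! Note:

  • A good score on the Google test set or on WS-353 does not mean the model will perform well in your application.
  • The reverse is also true.
  • It is best to test directly on the task you care about. For example, if you are building a classifier, look at the classification results.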