https://radimrehurek.com/gensim/models/deprecated/word2vec.html
Training a Word2Vec model with the gensim library involves many configurable parameters. Below are the parameters of the Word2Vec constructor, annotated from the gensim documentation.
class gensim.models.deprecated.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False)
Bases: gensim.models.deprecated.old_saveload.SaveLoad
Class for training, using and evaluating neural networks described in https://code.google.com/p/word2vec/
If you’re finished training a model (= no more updates, only querying), then switch to the gensim.models.KeyedVectors instance in wv.
The model can be stored/loaded via its save() and load() methods, or stored/loaded in a format compatible with the original word2vec implementation via wv.save_word2vec_format() and KeyedVectors.load_word2vec_format().
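For example (a minimal sketch; the file names are hypothetical and model is assumed to be an already-trained Word2Vec instance):

from gensim.models import Word2Vec, KeyedVectors

model.save("w2v.model")             # native gensim format, keeps the training state
model = Word2Vec.load("w2v.model")  # the loaded model can continue training

# original C word2vec format: vectors only, no further training possible
model.wv.save_word2vec_format("vectors.txt", binary=False)
wv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)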
Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.
The sentences iterable can be simply a list, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in this module for such examples.
※ sentences: can be a plain list; for large corpora it is recommended to build it with BrownCorpus, Text8Corpus, or LineSentence (see the sketch below).
If you don’t supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.
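As a sketch (the toy corpus and the file name corpus.txt are hypothetical; LineSentence expects one whitespace-tokenized sentence per line):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# small in-memory corpus: a list of tokenized sentences
sentences = [["the", "cat", "says", "meow"], ["the", "dog", "says", "woof"]]
model = Word2Vec(sentences, min_count=1)

# large corpus: stream the sentences from disk instead of holding a list in RAM
model = Word2Vec(LineSentence("corpus.txt"))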
sg defines the training algorithm. By default (sg=0), CBOW is used. Otherwise (sg=1), skip-gram is employed.
※ sg: selects the training algorithm. The default sg=0 uses CBOW; sg=1 uses skip-gram.
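For instance (assuming sentences is the tokenized toy corpus from the sketch above):

cbow_model = Word2Vec(sentences, sg=0, min_count=1)      # CBOW (the default)
skipgram_model = Word2Vec(sentences, sg=1, min_count=1)  # skip-gram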
size is the dimensionality of the feature vectors.
※ size: the dimensionality of the feature vectors; default 100. Larger sizes need more training data but can give better results; values from tens to a few hundred are typical.
window is the maximum distance between the current and predicted word within a sentence.
※ window: the maximum distance between the current word and the predicted word within a sentence.
alpha is the initial learning rate (will linearly drop to min_alpha as training progresses).
※ alpha: the initial learning rate (it drops linearly to min_alpha as training progresses).
seed = for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization.)
※ seed: seeds the random number generator, which affects the initialization of the word vectors.
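A sketch of a fully deterministic setup following the note above (the seed value is arbitrary):

# hash randomization must be pinned before the interpreter starts, e.g.
#   PYTHONHASHSEED=0 python train_w2v.py
model = Word2Vec(sentences, seed=42, workers=1, min_count=1)  # one worker thread avoids ordering jitter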
min_count = ignore all words with total frequency lower than this.
※ min_count: truncates the vocabulary; words occurring fewer than min_count times are discarded. Default is 5.
max_vocab_size = limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit (default).
※ max_vocab_size: caps RAM usage while building the vocabulary; if there are more unique words than this, the least frequent ones are pruned. Every 10 million word types need about 1 GB of RAM. Set to None (the default) for no limit.
sample = threshold for configuring which higher-frequency words are randomly downsampled; default is 1e-3, useful range is (0, 1e-5).
※ sample: threshold for randomly downsampling high-frequency words; default 1e-3, useful range (0, 1e-5).
workers = use this many worker threads to train the model (=faster training with multicore machines).
※ workers: the number of worker threads used to parallelize training.
hs = if 1, hierarchical softmax will be used for model training. If set to 0 (default), and negative is non-zero, negative sampling will be used.
※ hs: if 1, hierarchical softmax is used for training; if 0 (the default) and negative is non-zero, negative sampling is used.
negative = if > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). Default is 5. If set to 0, no negative sampling is used.
※ negative: if > 0, negative sampling is used; the value sets how many “noise words” are drawn (usually 5-20).
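The two regimes side by side, as a sketch (sentences as above):

ns_model = Word2Vec(sentences, hs=0, negative=10, min_count=1)  # negative sampling with 10 noise words
hs_model = Word2Vec(sentences, hs=1, negative=0, min_count=1)   # hierarchical softmax instead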
cbow_mean = if 0, use the sum of the context word vectors. If 1 (default), use the mean. Only applies when cbow is used.
※ cbow_mean: if 0, use the sum of the context word vectors; if 1 (the default), use their mean. Only applies when CBOW is used.
hashfxn = hash function to use to randomly initialize weights, for increased training reproducibility. Default is Python’s rudimentary built in hash function.
※ hashfxn: the hash function used to randomly initialize the weights. Defaults to Python's built-in hash function.
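Since Python 3's built-in hash is randomized between interpreter runs, a deterministic digest can be swapped in. A sketch, assuming any callable mapping a string to an int is acceptable here:

import hashlib

def stable_hash(s):
    # deterministic replacement for the randomized built-in hash()
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

model = Word2Vec(sentences, hashfxn=stable_hash, seed=42, workers=1, min_count=1)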
iter = number of iterations (epochs) over the corpus. Default is 5.
※ iter: the number of iterations (epochs) over the corpus; default 5.
trim_rule = vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
※ trim_rule: sets the vocabulary trimming rule, i.e. which words are kept and which are discarded. Can be None (min_count is used) or a callable that accepts (word, count, min_count) and returns utils.RULE_DISCARD, utils.RULE_KEEP, or utils.RULE_DEFAULT. A sketch of such a callable follows.
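A sketch of such a callable (the word lists are hypothetical):

from gensim import utils

def my_trim_rule(word, count, min_count):
    if word in ("foo", "bar"):     # always drop these, whatever their count
        return utils.RULE_DISCARD
    if word == "rare_but_needed":  # always keep, even below min_count
        return utils.RULE_KEEP
    return utils.RULE_DEFAULT      # otherwise fall back to the min_count check

model = Word2Vec(sentences, trim_rule=my_trim_rule, min_count=1)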
sorted_vocab = if 1 (default), sort the vocabulary by descending frequency before assigning word indexes.
※ sorted_vocab: if 1 (the default), the vocabulary is sorted by descending frequency before word indexes are assigned.
batch_words = target size (in words) for batches of examples passed to worker threads (and thus cython routines). Default is 10000. (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
※ batch_words: the target number of words per batch passed to the worker threads; default 10000.
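Putting the main parameters together, a sketch of a typical call (corpus.txt and the query word are hypothetical; note that in gensim 4.x, size and iter were renamed to vector_size and epochs):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec(
    LineSentence("corpus.txt"),  # one tokenized sentence per line
    sg=1,          # skip-gram
    size=200,      # dimensionality of the word vectors
    window=5,      # context window
    min_count=5,   # drop words seen fewer than 5 times
    sample=1e-3,   # downsample very frequent words
    negative=5,    # negative sampling with 5 noise words
    iter=10,       # 10 epochs over the corpus
    workers=4,     # 4 training threads
)
print(model.wv.most_similar("king", topn=5))  # "king" is a hypothetical query word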