Vector space model (also called the "word vector model"): converts text documents into numeric vectors. The weight of each term can be its raw frequency in the document, its average frequency, or its TF-IDF weight.
Mainstream approach: Google's word2vec algorithm, a neural-network-based implementation that learns distributed vector representations of words with two architectures, CBOW (Continuous Bag of Words) and skip-gram. It is also available through the Gensim library.
# Create dictionary and corpus
import gensim
import gensim.corpora as corpora
from pprint import pprint

id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
# Term-document frequency: bag-of-words counts per document
corpus = [id2word.doc2bow(text) for text in texts]

# Build the topic model, again with gensim
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=2,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

# Inspect the topics in the LDA model: print the keywords of each topic
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

import gensim
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from pprint import pprint

dct = Dictionary(data_lemmatized)  # fit dictionary
corpus = [dct.doc2bow(line) for line in data_lemmatized]  # convert corpus to BoW format
model = TfidfModel(corpus)  # fit TF-IDF model
vector = model[corpus]  # apply the model to the whole corpus

# Build the topic model, again with gensim,
# this time on the TF-IDF-weighted corpus
lda_model = gensim.models.ldamodel.LdaModel(corpus=vector,
                                            id2word=dct,
                                            num_topics=2,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

# Inspect the topics in the LDA model: print the keywords of each topic
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

data=("I LOVE apples# & 3241","he likes PIG3s","she do not like anything,except apples.\.")
Number of topics = 2