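The data was collected in raw, uncleaned form through the Twitter API and filtered to keep only English content. It is intended for classifying users' mental health at the tweet level.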
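In other words, we want to build a model that classifies each tweet as "depression" or "no depression" based on its content. I can think of many reasons why this is so cool. One of them is that a user's mental state can be inferred from the mood of their posts, and relevant products can then be promoted to them. For example, medication could be promoted to people with depression. Medication is one of the most effective treatments for depression, so classifying tweets as "depression" or "no depression" is very useful for this purpose. The steps of the project are as follows:
1. Perform an initial cleaning of the text.
2. Visualize the data.
3. Use the "tm" package to convert the text into a document-term matrix.
4. Split the data into "training" and "test" sets.
5. Use one-hot encoding to define the "features" of the classification model.
6. Apply the naive Bayes algorithm from the "e1071" package to the "training" data.
7. Use the model to make predictions on the "test" data.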
## no depression    depression
##         10000         10000
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 20000
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
## [1] just year sinc diagnos anxieti depress today im take moment reflect far ive come sinc
## [2] sunday need break im plan spend littl time possibl
## [3] awak tire need sleep brain idea
## [4] rt sewhq retro bear make perfect gift great beginn get stitch octob sew sale now yay httptco
## [5] hard say whether pack list make life easier just reinforc much still need movinghous anxieti
## [1] 20000 28725
## [1] 20000  1109
## <<DocumentTermMatrix (documents: 10, terms: 15)>>
## Non-/sparse entries: 20/130
## Sparsity           : 87%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs anxieti come depress diagnos far ive just need sinc take
##   1        1    1       1       1   1   1    1    0    2    1
##   10       0    0       0       0   0   0    0    0    0    0
##   2        0    0       0       0   0   0    0    1    0    0
##   3        0    0       0       0   0   0    0    1    0    0
##   4        0    0       0       0   0   0    0    0    0    0
##   5        1    0       0       0   0   0    1    1    0    0
##   6        0    0       0       0   0   0    0    0    0    0
##   7        0    0       0       0   0   0    0    0    0    0
##   8        0    0       0       0   0   0    0    0    0    0
##   9        0    0       0       0   0   0    0    0    0    1
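The regular inspect() method is not very informative. However, if we want to know how often particular words or phrases occur in the corpus, one thing we can do is pass those terms as a dictionary to the DocumentTermMatrix() method.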
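As a minimal sketch of the dictionary approach, the example below builds a document-term matrix restricted to a few chosen terms. The toy corpus, the variable names, and the three dictionary terms are illustrative assumptions, not part of the original analysis. A VCorpus is used here only to keep the example small and self-contained.

```r
library(tm)

# Hypothetical toy corpus standing in for the cleaned tweets
toy.corpus <- VCorpus(VectorSource(c(
  "anxieti depress today",
  "need sleep",
  "depress anxieti anxieti"
)))

# Only the dictionary terms become columns of the matrix
terms.of.interest <- c("anxieti", "depress", "sleep")
dict.dtm <- DocumentTermMatrix(
  toy.corpus,
  control = list(dictionary = terms.of.interest)
)

# Total occurrences of each dictionary term across the corpus
term.freq <- colSums(as.matrix(dict.dtm))
term.freq
```

With the real data, the same control list could be passed together with clean.corpus to count only the terms of interest instead of inspecting the full matrix.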
## [1] "Its just over 2 years since I was diagnosed with anxiety and depression Today Im taking a moment to reflect on how far Ive come since"
## word freq
## like like 1035
## depress depress 953
## just just 943
## dont dont 832
## get get 794
## one one 746
## user system elapsed
## 0.72 0.01 0.74
Model Evaluation for Naive Bayes
## Confusion Matrix and Statistics
##
##                Reference
## Prediction      no depression depression
##   no depression          1006          0
##   depression                0        970
##
##                Accuracy : 1
##                  95% CI : (0.9981, 1)
##     No Information Rate : 0.5091
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 1
##
##  Mcnemar's Test P-Value : NA
##
##             Sensitivity : 1.0000
##             Specificity : 1.0000
##          Pos Pred Value : 1.0000
##          Neg Pred Value : 1.0000
##              Prevalence : 0.5091
##          Detection Rate : 0.5091
##    Detection Prevalence : 0.5091
##       Balanced Accuracy : 1.0000
##
##        'Positive' Class : no depression
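To make the reported statistics easier to read, the sketch below recomputes a few of them by hand from the four cell counts in the confusion matrix above. It uses base R only. The variable names are my own, and this is a plain-arithmetic illustration rather than how caret computes them internally.

```r
# Cell counts copied from the confusion matrix output
cm <- matrix(c(1006, 0,
               0,    970),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("no depression", "depression"),
                             Reference  = c("no depression", "depression")))

# Share of all test tweets that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# "no depression" is the positive class in the output
sensitivity <- cm["no depression", "no depression"] / sum(cm[, "no depression"])
specificity <- cm["depression", "depression"] / sum(cm[, "depression"])

# Share of test tweets whose true label is the positive class
prevalence <- sum(cm[, "no depression"]) / sum(cm)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, prevalence = prevalence), 4)
```

The prevalence of the positive class equals the No Information Rate here because "no depression" is also the majority class in the test set.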
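The Naive Bayes model reached 100% accuracy on the test set. The support vector machine and random forest models also scored 100%, so on this data Naive Bayes matches them rather than outperforming them. Naive Bayes works by assuming that the features of the dataset are independent of one another given the class, which is why it is called "naive".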
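Written out, the independence assumption lets the class posterior factor into one term per feature. The notation is my own: $c$ is the class ("depression" or "no depression") and $x_1, \dots, x_n$ are the word-presence features of a tweet.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

Each factor $P(x_i \mid c)$ can be estimated separately from the training counts, which is what keeps the model cheap to fit even with over a thousand terms.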
library(tidyverse)
library(ggthemes)
library(e1071)      # has the naiveBayes algorithm
library(caret)      # good ML package, I like the confusionMatrix() function
library(tm)         # for text mining
library(wordcloud)  # for word clouds

Sys.setenv(LANG = "en_US.UTF-8")

### https://www.kaggle.com/datasets/infamouscoder/mental-health-social-media
data <- read_csv("Mental-Health-Twitter.csv")
data$label <- factor(data$label, levels = c(0, 1),
                     labels = c("no depression", "depression"))
table(data$label)

# Strip special characters, keeping letters, digits, blanks and a few symbols such as ? & / -
data$post_text <- gsub("[^[:alnum:][:blank:]?&/\\-]", "", data$post_text)

corpus <- Corpus(VectorSource(data$post_text))
corpus

# Clean the corpus: lower-case, drop numbers, stop words and punctuation,
# collapse whitespace, then stem
clean.corpus <- corpus %>%
  tm_map(tolower) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords()) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>%
  tm_map(stemDocument)
inspect(clean.corpus[1:5])

# Word clouds for each class
no_depression <- subset(data, label == "no depression")
wordcloud(no_depression$post_text, max.words = 100, scale = c(3, 0.5))
depression <- subset(data, label == "depression")
wordcloud(depression$post_text, max.words = 100, scale = c(3, 0.5))

# Number of users per class
df <- data %>%
  group_by(user_id, label) %>%
  count()
ggplot(df, aes(x = label)) +
  geom_bar() +
  geom_text(aes(label = after_stat(count)), stat = "count",
            vjust = 2, colour = "white") +
  ylab("people")

# Create the Document Term Matrix
dtm <- DocumentTermMatrix(clean.corpus)
dim(dtm)
dtm <- removeSparseTerms(dtm, 0.999)
dim(dtm)

# Inspect the first 10 tweets and the first 15 words in the dataset
inspect(dtm[1:10, 1:15])
data$post_text[1]

# Term frequencies across the whole corpus
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
wf <- data.frame(word = names(freq), freq = freq)
head(wf)
ggplot(head(wf, 10), aes(x = fct_reorder(word, freq), y = freq)) +
  geom_col() +
  xlab("word") +
  ggtitle("Top 10 words in Tweets")

# Turn a count into a Yes/No factor: "Yes" if the word occurs at least once
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c("No", "Yes"))
  y
}

# Apply the convert_count function to every column of the DTM
datasetNB <- apply(dtm, 2, convert_count)
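The code above stops after the Yes/No encoding, so the split, training, and prediction steps from the project outline are not shown. The sketch below illustrates those three steps on a small simulated dataset. Everything in it is an assumption made for illustration: the toy features, the seed, the variable names, the Laplace setting, and the 90/10 split (the roughly 10% test share is only inferred from the 1,976 test rows in the evaluation output). It is not the author's original code.

```r
library(e1071)  # naiveBayes()

set.seed(42)  # arbitrary seed so the toy split is reproducible

# Hypothetical stand-in for the Yes/No feature table and the tweet labels
n <- 200
label <- factor(rep(c("no depression", "depression"), each = n / 2),
                levels = c("no depression", "depression"))
# The word 'depress' is simulated to occur mostly in depression tweets,
# while 'like' occurs at random in both classes
depress <- rbinom(n, 1, ifelse(label == "depression", 0.8, 0.1))
like    <- rbinom(n, 1, 0.5)
to_yes_no <- function(x) factor(x, levels = c(0, 1), labels = c("No", "Yes"))
features <- data.frame(depress = to_yes_no(depress), like = to_yes_no(like))

# Step 4: random 90/10 split into training and test sets
train.idx <- sample(seq_len(n), size = 0.9 * n)
train.x <- features[train.idx, ]
test.x  <- features[-train.idx, ]
train.y <- label[train.idx]
test.y  <- label[-train.idx]

# Step 6: fit naive Bayes on the training data; laplace = 1 smooths zero counts
nb.model <- naiveBayes(train.x, train.y, laplace = 1)

# Step 7: predict classes for the held-out rows and tabulate against the truth
nb.pred <- predict(nb.model, newdata = test.x)
table(Prediction = nb.pred, Reference = test.y)
```

On the real data, the resulting predictions and true labels could be passed to caret's confusionMatrix() to produce a report like the one in the evaluation section.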
