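To address the challenges above, this study applies a simple recurrent neural network (RNN), a long short-term memory network (LSTM), and a bidirectional LSTM (Bi-LSTM) to text detection, with the goal of reliably distinguishing human-written text from LLM-generated text.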
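Method | Naive Bayes | Support vector machine (SVM) | Decision tree and random forest |
Principle | Models the relationship between text and class under the assumption that features are independent. | Finds an optimal hyperplane that separates texts of different classes. | A decision tree classifies by splitting the data step by step on features; a random forest aggregates the results of many decision trees. |
Strengths | Simple and efficient; performs well on large-scale text classification at low computational cost. | Performs strongly in high-dimensional spaces and suits complex text features. | Flexible in modeling nonlinear relationships; generalizes well. |
Limitations | Weak at modeling context; struggles with long-range dependencies. | Computationally expensive; relatively hard to model nonlinear relationships; poorly suited to large datasets. | May overfit on high-dimensional sparse data; sensitive to noise; requires careful tuning. |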
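The dataset used in this study is a supplementary dataset for LLM - Detect AI Generated Text, a Kaggle competition that was ongoing at the time of writing. It contains 44,868 rows, each of which is either an essay written by a primary or secondary school student in response to a prompt, or a text generated by an LLM from an instruction. The source datasets recorded for the rows include persuade_corpus, chat_gpt_moth, and llama2_chat, among others. Each row holds the essay text, a generation label (0 or 1), the prompt name, and the source dataset. Of the 44,868 texts, 27,371 are human-written and 17,497 are LLM-generated.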
Text | Label | Prompt_name | Source | RDizzl3_seven |
essay text | 1 for AI generated, 0 for human | original persuade prompt | source dataset | True or False |
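Finally, we use a Tokenizer to convert the text into integer sequences. The Tokenizer builds a vocabulary from the training-set text and maps each word to a unique integer. In a neural network, an embedding layer is typically used to map integer sequences to dense word-embedding vectors, so the Tokenizer output (integer sequences) can be fed directly into the embedding layer.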
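The mapping from text to integer sequences to embedding vectors can be illustrated with a small pure-Python sketch. It only imitates the general idea of a frequency-ranked vocabulary with an out-of-vocabulary slot; it is not the Keras `Tokenizer` implementation. The helper names (`build_word_index`, `texts_to_sequences`, `pad_post`), the toy sentences, and the random embedding table are assumptions made for illustration.

```python
import random
from collections import Counter

OOV = "##oov##"  # same placeholder string the appendix passes as oov_token


def build_word_index(texts, num_words):
    """Rank words by descending frequency; index 1 is reserved for OOV (hypothetical helper)."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {OOV: 1}
    for word, _ in counts.most_common():  # ties keep first-seen order
        word_index[word] = len(word_index) + 1
    # Keep only indices below num_words, i.e. a capped vocabulary.
    return {w: i for w, i in word_index.items() if i < num_words}


def texts_to_sequences(texts, word_index):
    """Replace each word by its index, falling back to the OOV index."""
    return [[word_index.get(w, word_index[OOV]) for w in t.split()] for t in texts]


def pad_post(seq, maxlen, value=0):
    """Truncate or right-pad one sequence to exactly maxlen entries."""
    return (seq + [value] * maxlen)[:maxlen]


# Toy training texts (made up); the vocabulary is built from these only.
train = ["car pollute air", "car help people", "imagination help people"]
word_index = build_word_index(train, num_words=5)

# "robot" never appeared in training, so it maps to the OOV index.
sequences = texts_to_sequences(["car help robot"], word_index)
padded = [pad_post(s, maxlen=5) for s in sequences]

# An embedding layer is, in essence, a trainable lookup table:
# row i holds the dense vector for integer i (random values here).
random.seed(0)
embedding_table = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
embedded = [embedding_table[i] for i in padded[0]]
```

Building the vocabulary from the training texts alone, as in the sketch, keeps words that occur only in the test set from leaking into the model; they are all folded into the single OOV index.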
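An LSTM consists of the following components. Input gate: decides which information is written into the cell state, using a sigmoid activation to select what to update. Forget gate: decides which information is removed from the cell state, again using a sigmoid activation to set how much is discarded. Cell state: stores short-term and long-term information, and is updated and maintained under the control of the input and forget gates. Output gate: determines the final LSTM output based on the updated cell state.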
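The interaction of the gates can be made concrete with a single time step of a scalar LSTM cell. This is a minimal sketch of the standard LSTM update, written in pure Python for readability; the parameter dictionary `p` and its key names are hypothetical, and a real layer uses weight matrices rather than scalars.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step with scalar input and scalar state.

    p holds illustrative weights: for each gate g in {i, f, o, c},
    p["w" + g] multiplies x, p["u" + g] multiplies h_prev, p["b" + g] is the bias.
    """
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])  # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])  # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])  # output gate
    c_tilde = math.tanh(p["wc"] * x + p["uc"] * h_prev + p["bc"])  # candidate content
    c = f * c_prev + i * c_tilde  # forget part of the old state, write new content
    h = o * math.tanh(c)          # output is a gated view of the updated cell state
    return h, c


# With all weights and biases at zero, every gate is sigmoid(0) = 0.5.
zero_params = {k + g: 0.0 for k in ("w", "u", "b") for g in ("i", "f", "o", "c")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, p=zero_params)

# A strongly positive forget bias and strongly negative input bias
# make the cell carry its previous state almost unchanged.
keep_params = dict(zero_params, bf=20.0, bi=-20.0)
h_keep, c_keep = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, p=keep_params)
```

The second call shows why the cell state can preserve information over many steps: when the forget gate is close to 1 and the input gate close to 0, `c` is passed along nearly untouched.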
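4.3 Bidirectional LSTM Model
A bidirectional recurrent neural network consists of two recurrent networks that process the input sequence in the forward and backward directions, which lets the model capture context on both sides of each position. Bi-LSTM combines this bidirectional structure with LSTM: the bidirectional structure captures context, while the LSTM units handle long-range dependencies in the sequence. This combination helps improve performance, particularly in text classification and generation tasks.
When choosing hyperparameters, Bi-LSTM requires the same choices as a conventional RNN, such as learning rate and number of epochs, and additionally requires tuning parameters specific to the bidirectional and LSTM components, such as the hidden layer size and the cell state dimension.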
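The bidirectional idea described in this section can be sketched with a plain tanh recurrence standing in for the LSTM cell. The function names and the weights below are assumptions for illustration, and both directions share the same weights only to keep the sketch short; an actual Bi-LSTM learns separate parameters for each direction.

```python
import math


def run_rnn(inputs, w_x=0.5, w_h=0.5):
    """Simple tanh recurrence; returns the hidden state at every position."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional(inputs):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(inputs)
    # Process the reversed sequence, then flip the states back so that
    # index t again refers to position t of the original sequence.
    backward = run_rnn(inputs[::-1])[::-1]
    return list(zip(forward, backward))


# A signal at the first position only.
outputs = bidirectional([1.0, 0.0, 0.0])
first_fwd, first_bwd = outputs[0]
last_fwd, last_bwd = outputs[-1]
```

At the last position the forward state is still nonzero because it has seen the first input, while the backward state is zero because it has only seen positions from the end up to that point. Concatenating the two gives each position access to context from both sides.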
(Here TP denotes true positives, TN true negatives, FP false positives, and FN false negatives.)
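As a worked illustration of how these four counts turn into the evaluation metrics, the sketch below computes accuracy, precision, recall, and F1 from a confusion matrix. The function names and the example counts are hypothetical; only the precision and recall values passed to `f1_from` in the last line (0.991 and 0.979) come from the Bi-LSTM results reported in this paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = f1_from(precision, recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Made-up counts, for illustration only.
example = classification_metrics(tp=90, tn=80, fp=20, fn=10)

# Consistency check on the Bi-LSTM figures reported in the results.
bilstm_f1 = round(f1_from(0.991, 0.979), 3)
```

With label 1 meaning LLM-generated, a false positive is a student essay flagged as LLM-generated, so low precision and high recall together indicate a model that assigns the positive label too readily.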
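In summary, the SimpleRNN model achieves poor accuracy, indicating that it fails to separate student essays from LLM-generated text effectively. Its precision and recall are both relatively low, so the model has problems on both classes, and the correspondingly low F1 score shows that it does not balance precision and recall well.
In summary, the LSTM model has relatively low accuracy but very high recall: it identifies LLM-generated text well and misses almost none. Its lower precision indicates that some student essays are misclassified as LLM-generated. The F1 score is relatively high, reflecting a reasonable balance between precision and recall.
Over five training epochs, Bi-LSTM reaches a high accuracy of 98.57%, with a substantial drop in loss. Its precision, recall, and F1 score are 0.991, 0.979, and 0.985, respectively. Bi-LSTM therefore performs very well on the task of separating student essays from LLM-generated text, with high accuracy, precision, recall, and F1 score. The high precision shows that its positive predictions are reliable, with almost no false alarms. Recall is slightly lower than precision, but it remains high enough to be acceptable in practice.
SimpleRNN performs worst. It may be unable to capture complex relationships in the sequence, and vanishing gradients may have hurt its performance. LSTM achieves high recall but lower precision: it recognizes the positive class well but over-predicts it, so some human-written essays are flagged as LLM-generated. Bi-LSTM performs best, with very high accuracy and a good balance across precision, recall, and F1 score. Combining bidirectional processing with long short-term memory lets it capture sequence information more completely.
Overall, Bi-LSTM shows excellent performance on the task of distinguishing student essays from LLM-generated text. By comparison, SimpleRNN and LSTM perform worse on this task, possibly because of vanishing gradients and related problems. As noted above, LSTM handles long-term dependencies better than a plain RNN, but on very long sequences the performance of the two differs little.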
True Label: 1, Predicted Label: 0
Text: imagination powerful tool ability shape life countless way knowledge important not always enough lead innovation creative thinking fact imagination often important knowledge come thing one obvious example importance imagination field invention many successful invention history result someone using imagination envision better way something example brother invention airplane made possible ability imagine would like fly without imagination would still stuck ground another way imagination important world movie entertainment use imagination create new exciting story audience without imagination movie would nothing series image screen instead able transport different world let experience thing may never opportunity see real life addition example imagination also used bring joy life whether vacation new hobby imagination help find happiness fulfillment also source comfort difficult time feeling use imagination escape happier place even little course imagination not always easy come take practice effort develop imagination reward well worth using imagination come new innovative idea create exciting work art find joy even aspect life conclusion imagination powerful tool important knowledge many way ability lead innovation creative thinking bring joy life knowledge important not enough must use imagination truly unlock full potential live best life
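Possible cause of the misclassification: SimpleRNN may lack the capacity to understand text that emphasizes abstract concepts and creative thinking.
Overall, SimpleRNN performs poorly on longer texts with more complex semantic structure and misclassifies them. The main possible causes include:
Difficulty capturing long-term dependencies: on long sequences, SimpleRNN may suffer from vanishing or exploding gradients, which makes long-term dependencies in the text hard to capture. Performance therefore drops on texts with many sentences and complex logical structure.
Difficulty with complex semantic structure: SimpleRNN has limited ability to understand complex semantic structure and abstract concepts. On texts that require deeper understanding and reasoning, the model may fail to capture the key information.
Limited understanding of abstract concepts: SimpleRNN may perform poorly on texts that emphasize abstract concepts and creative thinking, which makes it prone to errors when classifying texts with abstract or creative elements.
In summary, SimpleRNN is limited when handling complex, abstract context and semantic structure. Longer and more demanding text tasks may require a more sophisticated recurrent architecture to improve performance.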
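The vanishing-gradient effect mentioned above can be illustrated numerically. The sketch below assumes a scalar recurrence h_t = tanh(w_h * h_{t-1} + w_x * x_t) with made-up weights and inputs, and multiplies the per-step derivatives to obtain the magnitude of d h_T / d h_0. It is a toy calculation, not a measurement taken from the models in this study.

```python
import math


def gradient_through_time(inputs, w_h=0.9, w_x=0.5):
    """Return |d h_T / d h_0| for h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h, grad = 0.0, 1.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
        # Chain rule: d h_t / d h_{t-1} = w_h * (1 - tanh^2), and 1 - tanh^2 <= 1.
        grad *= w_h * (1.0 - h * h)
    return abs(grad)


short_grad = gradient_through_time([1.0] * 5)   # 5 time steps
long_grad = gradient_through_time([1.0] * 50)   # 50 time steps
```

Because every factor in the product has magnitude below 1 when |w_h| < 1, the gradient shrinks roughly geometrically with sequence length, so early tokens in a long essay contribute almost nothing to the weight updates of a plain RNN.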
2. LSTM misclassifications
True Label: 1, Predicted Label: 0
Text: le car usage answer yes could stop alot pollution going atmosphere could help green house gas going ozone layer harming earth one reason cut back car use green house gas going air europe passenger car green house gas even responsible area cut back would good thing cut back pollution bet use bike instead car would cut back green house gas example take look program helped mobile world think another reason lot people really like fact not car great factor earth take look germany town vauban resident town went driving soccer mom people feel ease willingly gave car lot people not like family vauban not car sold car move mean gave automobile stay town cut back lot pollution europe passage woman say car always tense much happier way conclusion belive giving car cut back pollution like europe great idea quite think would whole better situation would mean lot green house gas ozone would not get added onto bike perfect way even walking good explained everything
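Overall, the LSTM model may struggle to understand context in complex texts and to handle diverse viewpoints and topic shifts. Possible improvements include adding more diverse data, tuning model parameters, and balancing the dataset, in order to improve generalization on complex texts.
3. Bi-LSTM misclassifications
Problems with SimpleRNN and LSTM: both may suffer from vanishing gradients on long text sequences, which makes long-range dependencies hard to learn. This may be one reason for their weaker performance on this task.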
#!/usr/bin/env python
# coding: utf-8
# In[1]:
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import seaborn as sns
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from unidecode import unidecode
from string import punctuation
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud
from tqdm import tqdm
tqdm.pandas()
import warnings
warnings.filterwarnings("ignore")
# In[2]:
# Import Keras consistently through tensorflow.keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input, SimpleRNN, LSTM, Bidirectional, Embedding
from tensorflow.keras.layers import Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
# In[3]:
# Select the GPU(s) to use and enable on-demand memory growth
import os
train_gpu = [0,]
os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(str(x) for x in train_gpu)
ngpus_per_node = len(train_gpu)
gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if ngpus_per_node > 1:
    strategy = tf.distribute.MirroredStrategy()
else:
    strategy = None
print('Number of devices: {}'.format(ngpus_per_node))
# In[4]:
# Load the data
train_essays = pd.read_csv('C:\\Users\\huaix\\PycharmProjects\\pythonProject4\\kaggle\\input\\ai\\train_v2_drcat_02.csv')
train_essays = train_essays.rename(columns={'label': 'generated', 'prompt_name': 'prompt_id'}, errors='ignore')
# In[5]:
train_essays
# In[6]:
# Keep only the text and label columns
df1 = train_essays[['text','generated']]
df1.generated.value_counts()
# In[7]:
# Text preprocessing functions
def remove_blank(data):
    # Replace literal "\n" sequences, real newlines, and tabs with spaces
    formated_text = data.replace("\\n", " ").replace("\n", " ").replace("\t", " ")
    return formated_text
def handle_accented(data):
    # Transliterate accented characters to plain ASCII
    fixed_text = unidecode(data)
    return fixed_text
def clean_text(data):
    # Tokenize, lowercase, and drop punctuation, stopwords, short tokens, and non-alphabetic tokens
    tokens = word_tokenize(data)
    text = [i.lower() for i in tokens if (i.lower() not in punctuation) and (i.lower() not in stopwords_list) and (len(i) > 2) and (i.isalpha())]
    return text
def lemmatization(data):
    # Lemmatize each token and join the tokens back into one string
    lemma = WordNetLemmatizer()
    final_text = []
    for i in data:
        lemma_word = lemma.lemmatize(i)
        final_text.append(lemma_word)
    return " ".join(final_text)
# Stopword list (the stray quote characters around the original string are removed)
stopwords_text = """i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't"""
stopwords_list = stopwords_text.split()
# In[8]:
# Check for missing data (none found)
num_missing = df1.isna().sum()
num_missing
# In[9]:
# Check for duplicate texts (none found) and drop any that exist
duplicates = df1.duplicated(subset=['text']).sum()
df1 = df1.drop_duplicates(subset=['text'])
print("Number of duplicate rows:", duplicates)
# In[10]:
# Balanced sampling: keep all LLM-generated rows and undersample the human-written rows to match
sampled_data1 = df1[df1['generated'] == 1]
sampled_data0 = df1[df1['generated'] == 0].sample(len(sampled_data1))  # as many rows as sampled_data1
sampled_data0.generated.value_counts()
# In[11]:
sampled_data1.generated.value_counts()
# In[12]:
# Build the balanced dataset
df2 = pd.concat([sampled_data1, sampled_data0], axis=0)
# In[13]:
df2
# In[14]:
# Split into training and test sets (70/30), then apply the same cleaning steps to both
x_train, x_test, y_train, y_test = train_test_split(df2.text, df2.generated, test_size=0.3, shuffle=True)
clean_train = x_train.apply(remove_blank)
clean_test = x_test.apply(remove_blank)
clean_train = clean_train.apply(handle_accented)
clean_test = clean_test.apply(handle_accented)
clean_train = clean_train.apply(clean_text)
clean_test = clean_test.apply(clean_text)
clean_train = clean_train.apply(lemmatization)
clean_test = clean_test.apply(lemmatization)
# In[16]:
# Word cloud of the training set
str_text = clean_train.str.cat(sep=" ")  # join all essays into a single string
wordcloud = WordCloud(width=500, height=300).generate(str_text)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
# In[15]:
# Word cloud of the test set
str_text = clean_test.str.cat(sep=" ")  # join all essays into a single string
wordcloud = WordCloud(width=500, height=300).generate(str_text)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
# In[39]:
# Tokenization: convert cleaned text to integer sequences (vocabulary built on the training set only)
x_train_list = clean_train.to_list()
x_test_list = clean_test.to_list()
max_words = 5000
tk = Tokenizer(num_words=max_words, oov_token="##oov##")
tk.fit_on_texts(x_train_list)
x_train_seq = tk.texts_to_sequences(x_train_list)
x_test_seq = tk.texts_to_sequences(x_test_list)
# In[40]:
# Length (in tokens) of each training sequence
lst = [len(seq) for seq in x_train_seq]
print("Max len of sent:", max(lst))
print("Min len of sent:", min(lst))
# In[41]:
sns.histplot(lst, color="red", label='All Essays', kde=True)
plt.title('Distribution of Essay Lengths')
plt.xlabel('Essay Length (Number of Tokens)')
plt.ylabel('Frequency')
plt.legend()
plt.show()
# In[42]:
# Pad or truncate each sequence to a fixed maximum length
max_len_per_sent = 700
x_train_seq1 = pad_sequences(x_train_seq, maxlen=max_len_per_sent, padding="post", truncating="post")
x_test_seq1 = pad_sequences(x_test_seq, maxlen=max_len_per_sent, padding="post", truncating="post")
# In[43]:
# Simple RNN model
max_len_per_sent = 700
model1 = Sequential()
# input_dim = max_words + 1 = 5001, as in the original run
model1.add(Embedding(input_dim=max_words + 1, output_dim=500, input_length=max_len_per_sent))
model1.add(SimpleRNN(units=100, return_sequences=False))
model1.add(Dense(units=100, activation='relu'))
model1.add(Dropout(0.2))
model1.add(Dense(units=1, activation='sigmoid'))
model1.compile(optimizer="adam",
               loss="binary_crossentropy",
               metrics=["accuracy"])
model1.summary()
# In[44]:
history = model1.fit(x_train_seq1, y_train, epochs=5,
                     batch_size=64, validation_data=(x_test_seq1, y_test))
# In[45]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
# In[46]:
plt.plot(epochs, acc, 'o', label='Training acc', color='red')
plt.plot(epochs, val_acc, '-', label='Validation acc', color='red')
plt.title('Training and validation accuracy in SimpleRNN')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'o', label='Training loss', color='red')
plt.plot(epochs, val_loss, '-', label='Validation loss', color='red')
plt.title('Training and validation loss in SimpleRNN')
plt.legend()
plt.show()
# In[99]:
# Create one figure with two subplots side by side (one row, two columns)
plt.figure(figsize=(14, 6))
# Left subplot: training and validation accuracy
plt.subplot(1, 2, 1)
plt.plot(epochs, acc, '-', label='Training acc', color='red')
plt.plot(epochs, val_acc, '--', label='Validation acc', color='blue')
plt.title('Training and Validation Accuracy in SimpleRNN')
plt.legend()
# Annotate each point with its value
for i, txt in enumerate(val_acc):
    plt.annotate(f'{txt:.2f}', (epochs[i], val_acc[i]), textcoords="offset points", xytext=(0, 5), ha='center')
# Right subplot: training and validation loss
plt.subplot(1, 2, 2)
plt.plot(epochs, loss, '-', label='Training loss', color='red')
plt.plot(epochs, val_loss, '--', label='Validation loss', color='blue')
plt.title('Training and Validation Loss in SimpleRNN')
plt.legend()
# Annotate each point with its value
for i, txt in enumerate(val_loss):
    plt.annotate(f'{txt:.3f}', (epochs[i], val_loss[i]), textcoords="offset points", xytext=(0, 5), ha='center')
# Adjust the horizontal spacing between subplots
plt.subplots_adjust(wspace=0.3)
# Show the figure
plt.show()
# In[47]:
# vocab_size = len(tk.word_index) + 1 # 17231
# Bidirectional LSTM model
model2 = Sequential()
model2.add(Embedding(input_dim=max_words + 1, output_dim=500, input_length=max_len_per_sent))
model2.add(Bidirectional(LSTM(units=64, return_sequences=False)))
model2.add(Dense(units=100, activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(units=100, activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(units=1, activation='sigmoid'))
model2.compile(optimizer="adam",
               loss="binary_crossentropy",
               metrics=["accuracy"])
model2.summary()
# In[48]:
history2 = model2.fit(x_train_seq1, y_train, epochs=5,
                      batch_size=64, validation_data=(x_test_seq1, y_test))
# In[49]:
acc2 = history2.history['accuracy']
val_acc2 = history2.history['val_accuracy']
loss2 = history2.history['loss']
val_loss2 = history2.history['val_loss']
epochs = range(1, len(acc2) + 1)
# In[50]:
plt.plot(epochs, acc2, 'o', label='Training acc', color="green")
plt.plot(epochs, val_acc2, '-', label='Validation acc', color="green")
plt.title('Training and validation accuracy in BiRNN-LSTM')
plt.legend()
plt.figure()
plt.plot(epochs, loss2, 'o', label='Training loss', color="green")
plt.plot(epochs, val_loss2, '-', label='Validation loss', color="green")
plt.title('Training and validation loss in BiRNN-LSTM')
plt.legend()
plt.show()
# In[105]:
# Create one figure with two subplots side by side (one row, two columns)
plt.figure(figsize=(14, 6))
# Left subplot: training and validation accuracy
plt.subplot(1, 2, 1)
plt.plot(epochs, acc2, '-', label='Training acc', color='green')
plt.plot(epochs, val_acc2, '--', label='Validation acc', color='blue')
plt.title('Training and Validation Accuracy in BiRNN-LSTM')
plt.legend()
# Annotate each point with its value
for i, txt in enumerate(val_acc2):
    plt.annotate(f'{txt:.2f}', (epochs[i], val_acc2[i]), textcoords="offset points", xytext=(0, 5), ha='center')
# Right subplot: training and validation loss
plt.subplot(1, 2, 2)
plt.plot(epochs, loss2, '-', label='Training loss', color='green')
plt.plot(epochs, val_loss2, '--', label='Validation loss', color='blue')
plt.title('Training and Validation Loss in BiRNN-LSTM')
plt.legend()
# Annotate each point with its value
for i, txt in enumerate(val_loss2):
    plt.annotate(f'{txt:.3f}', (epochs[i], val_loss2[i]), textcoords="offset points", xytext=(0, 5), ha='center')
# Adjust the horizontal spacing between subplots
plt.subplots_adjust(wspace=0.3)
# Show the figure
plt.show()
# In[51]:
# LSTM model
model3 = Sequential()
model3.add(Embedding(input_dim=max_words + 1, output_dim=500, input_length=max_len_per_sent))
model3.add(LSTM(units=64, return_sequences=False))
model3.add(Dense(units=100, activation='relu'))
model3.add(Dropout(0.2))
model3.add(Dense(units=100, activation='relu'))
model3.add(Dropout(0.2))
model3.add(Dense(units=1, activation='sigmoid'))
model3.compile(optimizer="adam",
               loss="binary_crossentropy",
               metrics=["accuracy"])
model3.summary()
# In[52]:
history3 = model3.fit(x_train_seq1, y_train, epochs=5,
                      batch_size=64, validation_data=(x_test_seq1, y_test))
# In[53]:
acc3 = history3.history['accuracy']
val_acc3 = history3.history['val_accuracy']
loss3 = history3.history['loss']
val_loss3 = history3.history['val_loss']
epochs = range(1, len(acc3) + 1)
# In[54]:
plt.plot(epochs, acc3, 'o', label='Training acc', color="skyblue")
plt.plot(epochs, val_acc3, '-', label='Validation acc', color="skyblue")
plt.title('Training and validation accuracy in LSTM')
plt.legend()
plt.figure()
plt.plot(epochs, loss3, 'o', label='Training loss', color="skyblue")
plt.plot(epochs, val_loss3, '-', label='Validation loss', color="skyblue")
plt.title('Training and validation loss in LSTM')
plt.legend()
plt.show()
# In[111]:
# Create one figure with two subplots side by side (one row, two columns)
plt.figure(figsize=(14, 6))
# Left subplot: training and validation accuracy
plt.subplot(1, 2, 1)
plt.plot(epochs, acc3, '-', label='Training acc', color='skyblue')
plt.plot(epochs, val_acc3, '--', label='Validation acc', color='blue')
plt.title('Training and Validation Accuracy in LSTM')
plt.legend()
# Annotate each point with its value
for i, txt in enumerate(val_acc3):
    plt.annotate(f'{txt:.2f}', (epochs[i], val_acc3[i]), textcoords="offset points", xytext=(0, 5), ha='center')
# Right subplot: training and validation loss
plt.subplot(1, 2, 2)
plt.plot(epochs, loss3, '-', label='Training loss', color='skyblue')
plt.plot(epochs, val_loss3, '--', label='Validation loss', color='blue')
plt.title('Training and Validation Loss in LSTM')
plt.legend()
# Annotate each point with its value
for i, txt in enumerate(val_loss3):
    plt.annotate(f'{txt:.3f}', (epochs[i], val_loss3[i]), textcoords="offset points", xytext=(0, 5), ha='center')
# Adjust the horizontal spacing between subplots
plt.subplots_adjust(wspace=0.3)
# Show the figure
plt.show()
# In[55]:
# Compare validation accuracy and loss across the three models
plt.plot(epochs, val_acc, '-', label='RNN', color="red")
plt.plot(epochs, val_acc3, '-', label='LSTM', color="skyblue")
plt.plot(epochs, val_acc2, '-', label='biRNN-LSTM', color="green")
plt.title('Validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, val_loss, '-', label='RNN', color="red")
plt.plot(epochs, val_loss3, '-', label='LSTM', color="skyblue")
plt.plot(epochs, val_loss2, '-', label='biRNN-LSTM', color="green")
plt.title('Validation loss')
plt.legend()
plt.show()
# In[110]:
# Create one figure with two subplots side by side (one row, two columns)
plt.figure(figsize=(14, 6))
# Left subplot: validation accuracy
plt.subplot(1, 2, 1)
plt.plot(epochs, val_acc, '-', label='RNN', color='red')
plt.plot(epochs, val_acc3, '--', label='LSTM', color='skyblue')
plt.plot(epochs, val_acc2, '-.', label='biRNN-LSTM', color='green')
plt.title('Validation Accuracy')
plt.legend()
# Annotate each point with its value
for i, txt in enumerate(val_acc):
    plt.annotate(f'{txt:.2f}', (epochs[i], val_acc[i]), textcoords="offset points", xytext=(0, 10), ha='center')
for i, txt in enumerate(val_acc3):
    plt.annotate(f'{txt:.2f}', (epochs[i], val_acc3[i]), textcoords="offset points", xytext=(0, 10), ha='center')
for i, txt in enumerate(val_acc2):
    plt.annotate(f'{txt:.2f}', (epochs[i], val_acc2[i]), textcoords="offset points", xytext=(0, 10), ha='center')
# Right subplot: validation loss
plt.subplot(1, 2, 2)
plt.plot(epochs, val_loss, '-', label='RNN', color='red')
plt.plot(epochs, val_loss3, '--', label='LSTM', color='skyblue')
plt.plot(epochs, val_loss2, '-.', label='biRNN-LSTM', color='green')
plt.title('Validation Loss')
plt.legend()
# Annotate each point with its value
for i, txt in enumerate(val_loss):
    plt.annotate(f'{txt:.3f}', (epochs[i], val_loss[i]), textcoords="offset points", xytext=(0, -10), ha='center')
for i, txt in enumerate(val_loss3):
    plt.annotate(f'{txt:.3f}', (epochs[i], val_loss3[i]), textcoords="offset points", xytext=(0, 10), ha='center')
for i, txt in enumerate(val_loss2):
    plt.annotate(f'{txt:.3f}', (epochs[i], val_loss2[i]), textcoords="offset points", xytext=(0, 10), ha='center')
# Adjust the horizontal spacing between subplots
plt.subplots_adjust(wspace=0.3)
# Show the figure
plt.show()
# In[57]:
# Predict with each model
y_pred_proba1 = model1.predict(x_test_seq1)
y_pred_proba2 = model2.predict(x_test_seq1)
y_pred_proba3 = model3.predict(x_test_seq1)
# Convert probabilities to binary labels with a 0.5 threshold
y_pred1 = (y_pred_proba1 > 0.5).astype(int)
y_pred2 = (y_pred_proba2 > 0.5).astype(int)
y_pred3 = (y_pred_proba3 > 0.5).astype(int)
# In[58]:
# Compute precision
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
precision_simple_rnn = precision_score(y_test, y_pred1)
precision_lstm = precision_score(y_test, y_pred3)
precision_birnn_lstm = precision_score(y_test, y_pred2)
print("Precision (SimpleRNN):", precision_simple_rnn)
# In[81]:
# LSTM model, retrained with different settings (32 units, one dense layer, rmsprop, 2 epochs).
# Note: this overwrites model3, history3, y_pred3, and precision_lstm from the earlier cells.
model3 = Sequential()
model3.add(Embedding(input_dim=5000, output_dim=500, input_length=max_len_per_sent))
model3.add(LSTM(units=32, return_sequences=False))
model3.add(Dense(units=100, activation='relu'))
model3.add(Dropout(0.2))
model3.add(Dense(units=1, activation='sigmoid'))
model3.compile(optimizer="rmsprop",
               loss="binary_crossentropy",
               metrics=["accuracy"])
model3.summary()
# In[82]:
history3 = model3.fit(x_train_seq1, y_train, epochs=2,
                      batch_size=64, validation_data=(x_test_seq1, y_test))
# In[83]:
y_pred_proba3 = model3.predict(x_test_seq1)
y_pred3 = (y_pred_proba3 > 0.5).astype(int)
precision_lstm = precision_score(y_test, y_pred3)
print("Precision (LSTM):", precision_lstm)
# In[84]:
print("Precision (BiLSTM):", precision_birnn_lstm)
# In[85]:
# Compute recall
recall_simple_rnn = recall_score(y_test, y_pred1)
recall_lstm = recall_score(y_test, y_pred3)
recall_birnn_lstm = recall_score(y_test, y_pred2)
print("Recall (SimpleRNN):", recall_simple_rnn)
# In[87]:
print("Recall (LSTM):", recall_lstm)
# In[88]:
print("Recall (BiLSTM):", recall_birnn_lstm)
# In[89]:
# Compute F1 score
f1_simple_rnn = f1_score(y_test, y_pred1)
f1_lstm = f1_score(y_test, y_pred3)
f1_birnn_lstm = f1_score(y_test, y_pred2)
print("F1 (SimpleRNN):", f1_simple_rnn)
# In[90]:
print("F1 (LSTM):", f1_lstm)
# In[91]:
print("F1 (BiLSTM):", f1_birnn_lstm)
# In[92]:
# Compare precision, recall, and F1 score across the three models
models = ['Simple RNN', 'LSTM', 'BiRNN-LSTM']
precision_scores = [precision_simple_rnn, precision_lstm, precision_birnn_lstm]
recall_scores = [recall_simple_rnn, recall_lstm, recall_birnn_lstm]
f1_scores = [f1_simple_rnn, f1_lstm, f1_birnn_lstm]
# One bar chart per metric, stacked vertically
fig, axs = plt.subplots(3, 1, figsize=(10, 12))
# Precision
axs[0].bar(models, precision_scores, color='blue', alpha=0.7)
axs[0].set_ylabel('Precision')
axs[0].set_title('Model Precision Comparison')
# Recall
axs[1].bar(models, recall_scores, color='green', alpha=0.7)
axs[1].set_ylabel('Recall')
axs[1].set_title('Model Recall Comparison')
# F1 score
axs[2].bar(models, f1_scores, color='orange', alpha=0.7)
axs[2].set_ylabel('F1 Score')
axs[2].set_title('Model F1 Score Comparison')
# Show the figure
plt.tight_layout()
plt.show()
# In[112]:
# Same comparison, with the value printed above each bar
models = ['Simple RNN', 'LSTM', 'BiRNN-LSTM']
precision_scores = [precision_simple_rnn, precision_lstm, precision_birnn_lstm]
recall_scores = [recall_simple_rnn, recall_lstm, recall_birnn_lstm]
f1_scores = [f1_simple_rnn, f1_lstm, f1_birnn_lstm]
metric_panels = [('Precision', precision_scores, 'blue'),
                 ('Recall', recall_scores, 'green'),
                 ('F1 Score', f1_scores, 'orange')]
fig, axs = plt.subplots(3, 1, figsize=(10, 12))
for ax, (name, scores, color) in zip(axs, metric_panels):
    bars = ax.bar(models, scores, color=color, alpha=0.7)
    ax.set_ylabel(name)
    ax.set_title(f'Model {name} Comparison')
    # Add the numeric value above each bar
    for bar, value in zip(bars, scores):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01, f'{value:.3f}', ha='center', va='bottom')
# Show the figure
plt.tight_layout()
plt.show()
# In[113]:
# Compute the average validation accuracy over all epochs
average_val_accuracy1 = sum(val_acc) / len(val_acc)
average_val_accuracy3 = sum(val_acc3) / len(val_acc3)
average_val_accuracy2 = sum(val_acc2) / len(val_acc2)
print(f'Average Validation Accuracy in RNN: {average_val_accuracy1}')
print(f'Average Validation Accuracy in LSTM: {average_val_accuracy3}')
print(f'Average Validation Accuracy in BiLSTM: {average_val_accuracy2}')
# In[121]:
# Error analysis
# RNN
# Check dimensions
print(y_pred1.shape)
print(y_test.shape)
# Flatten predictions and labels to 1-D arrays
y_pred1 = y_pred1.ravel()
y_test = np.asarray(y_test).ravel()
# Confirm the flattened shapes
print(y_pred1.shape)
print(y_test.shape)
# In[123]:
# Find the misclassified samples
misclassified_samples = np.where(y_pred1 != y_test)[0]
# Show the first ten misclassified cases
for sample_index in misclassified_samples[:10]:
    print(f"Index: {sample_index}")
    print(f"True Label: {y_test[sample_index]}, Predicted Label: {y_pred1[sample_index]}")
    # Decode only the sequence being printed
    print(f"Text: {tk.sequences_to_texts([x_test_seq[sample_index]])[0]}")
    print("\n")
# In[124]:
# LSTM
# Check dimensions
print(y_pred3.shape)
print(y_test.shape)
# Flatten predictions and labels to 1-D arrays
y_pred3 = y_pred3.ravel()
y_test = np.asarray(y_test).ravel()
# Confirm the flattened shapes
print(y_pred3.shape)
print(y_test.shape)
# In[127]:
# Find the misclassified samples
misclassified_samples3 = np.where(y_pred3 != y_test)[0]
# Show the first ten misclassified cases
for sample_index in misclassified_samples3[:10]:
    print(f"Index: {sample_index}")
    print(f"True Label: {y_test[sample_index]}, Predicted Label: {y_pred3[sample_index]}")
    # Decode only the sequence being printed
    print(f"Text: {tk.sequences_to_texts([x_test_seq[sample_index]])[0]}")
    print("\n")
# In[128]:
# Bi-LSTM
# Check dimensions
print(y_pred2.shape)
print(y_test.shape)
# Flatten predictions and labels to 1-D arrays
y_pred2 = y_pred2.ravel()
y_test = np.asarray(y_test).ravel()
# Confirm the flattened shapes
print(y_pred2.shape)
print(y_test.shape)
# In[130]:
# Find the misclassified samples
misclassified_samples2 = np.where(y_pred2 != y_test)[0]
# Show the first ten misclassified cases (indexing the Bi-LSTM errors, not the LSTM ones)
for sample_index in misclassified_samples2[:10]:
    print(f"Index: {sample_index}")
    print(f"True Label: {y_test[sample_index]}, Predicted Label: {y_pred2[sample_index]}")
    # Decode only the sequence being printed
    print(f"Text: {tk.sequences_to_texts([x_test_seq[sample_index]])[0]}")
    print("\n")

Data source: https://www.kaggle.com/code/harshithvarma007/llm-text-detection-99-47-accuracy/input
