# Imports assumed for an older Keras release that still provides np_utils
import pickle

import numpy as np
import pandas as pd
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils


def load_data(filepath, input_shape=20):
    df = pd.read_csv(filepath)

    # Labels and vocabulary
    labels, vocabulary = list(df['label'].unique()), list(df['evaluation'].unique())

    # Build character-level features
    string = ''
    for word in vocabulary:
        string += word
    vocabulary = set(string)

    # Lookup dictionaries (index 0 is reserved for padding)
    word_dictionary = {word: i + 1 for i, word in enumerate(vocabulary)}
    with open('word_dict.pk', 'wb') as f:
        pickle.dump(word_dictionary, f)
    inverse_word_dictionary = {i + 1: word for i, word in enumerate(vocabulary)}
    label_dictionary = {label: i for i, label in enumerate(labels)}
    with open('label_dict.pk', 'wb') as f:
        pickle.dump(label_dictionary, f)
    output_dictionary = {i: label for i, label in enumerate(labels)}

    vocab_size = len(word_dictionary.keys())   # vocabulary size
    label_size = len(label_dictionary.keys())  # number of label classes

    # Pad every sequence to input_shape; shorter sequences are filled with 0
    x = [[word_dictionary[word] for word in sent] for sent in df['evaluation']]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
    y = [[label_dictionary[sent]] for sent in df['label']]
    y = [np_utils.to_categorical(label, num_classes=label_size) for label in y]
    y = np.array([list(_[0]) for _ in y])

    return x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary
labels, vocabulary = list(df['label'].unique()), list(df['evaluation'].unique())
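Example output for labels:
['正面', '负面']
(that is, "positive" and "negative"). Purpose: collect the distinct labels and the distinct review texts from the dataset.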
string = ''
for word in vocabulary:
    string += word
vocabulary = set(string)
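Purpose: concatenate all review texts and reduce them to the set of distinct characters, which makes it easy to build the lookup dictionaries in the next step.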
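The step above can be sketched on a few made-up reviews. The `reviews` list is a hypothetical stand-in for `df['evaluation'].unique()`, and `"".join(...)` is used here as an equivalent of the `+=` loop:

```python
# Hypothetical sample reviews standing in for df['evaluation'].unique()
reviews = ["我爱你", "我喜欢你", "我不喜欢你"]

# Concatenate all reviews, then keep each distinct character once
vocabulary = set("".join(reviews))

print(len(vocabulary))  # 6 distinct characters: 我 爱 你 喜 欢 不
```

Because a set has no defined order, the integer assigned to each character in the next step can differ from run to run, which is one reason the dictionaries are pickled for later reuse.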
word_dictionary = {word: i + 1 for i, word in enumerate(vocabulary)}
with open('word_dict.pk', 'wb') as f:
    pickle.dump(word_dictionary, f)
inverse_word_dictionary = {i + 1: word for i, word in enumerate(vocabulary)}
label_dictionary = {label: i for i, label in enumerate(labels)}
with open('label_dict.pk', 'wb') as f:
    pickle.dump(label_dictionary, f)
output_dictionary = {i: label for i, label in enumerate(labels)}
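This builds the lookup dictionaries, which you can think of as hash tables: every character in the data gets an integer ID, so each sentence can be turned into a row of an integer matrix. Character IDs start at 1 because 0 is reserved for padding.
For example, "我爱你", "我喜欢你" and "我不喜欢你" become
1 2 3 0 0
1 4 5 3 0
1 6 4 5 3
which is the form the model is trained on later.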
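A minimal sketch that reproduces the matrix above. One assumption differs from the original code: characters are numbered in order of first appearance so the result is reproducible, whereas the original iterates over a set. The padding is done by hand rather than with `pad_sequences`.

```python
sentences = ["我爱你", "我喜欢你", "我不喜欢你"]

# Assumption: number characters by first appearance (the original uses a set,
# whose iteration order is arbitrary). ID 0 is reserved for padding.
word_dictionary = {}
for sent in sentences:
    for ch in sent:
        if ch not in word_dictionary:
            word_dictionary[ch] = len(word_dictionary) + 1

max_len = 5
matrix = []
for sent in sentences:
    ids = [word_dictionary[ch] for ch in sent]
    ids = ids[:max_len] + [0] * (max_len - len(ids))  # pad at the end with 0
    matrix.append(ids)

for row in matrix:
    print(row)
# [1, 2, 3, 0, 0]
# [1, 4, 5, 3, 0]
# [1, 6, 4, 5, 3]
```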
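pickle.dump(obj, file, protocol=None)
Note: serializes the object obj and writes it to the open file object file. The protocol argument selects the serialization format. In Python 2 the default was protocol 0, a text format; in Python 3 the default is a binary protocol, which is why the files above are opened with 'wb'.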
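As a quick illustration, here is a round trip through `pickle.dump` and `pickle.load`. The dictionary contents are made up, and the file is written to a temporary directory (an assumption for this sketch) rather than the working directory:

```python
import os
import pickle
import tempfile

word_dictionary = {"我": 1, "爱": 2, "你": 3}  # made-up sample data

# Write to a temporary directory; the file name mirrors the one used above
path = os.path.join(tempfile.mkdtemp(), "word_dict.pk")
with open(path, "wb") as f:  # binary mode is required in Python 3
    pickle.dump(word_dictionary, f)

# Later, for example at prediction time, load the same mapping back
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == word_dictionary)  # True
```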
The enumerate() function takes an iterable (such as a list, tuple or string) and yields (index, item) pairs, so you get each element together with its position.
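A short sketch of how `enumerate` drives the dictionary comprehensions used here. The sample labels and string are illustrative, and `start=1` is shown as an alternative way to write the `i + 1` offset:

```python
labels = ["正面", "负面"]

# label -> index, and index -> label
label_dictionary = {label: i for i, label in enumerate(labels)}
output_dictionary = {i: label for i, label in enumerate(labels)}

# start=1 shifts the indices, equivalent to writing i + 1 by hand
word_ids = {ch: i for i, ch in enumerate("我爱你", start=1)}

print(label_dictionary)   # {'正面': 0, '负面': 1}
print(output_dictionary)  # {0: '正面', 1: '负面'}
print(word_ids)           # {'我': 1, '爱': 2, '你': 3}
```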
x = [[word_dictionary[word] for word in sent] for sent in df['evaluation']]
x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
y = [[label_dictionary[sent]] for sent in df['label']]
y = [np_utils.to_categorical(label, num_classes=label_size) for label in y]
y = np.array([list(_[0]) for _ in y])
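After the second-to-last line converts the labels to one-hot form, y is a Python list holding one array of shape (1, 2) per sample:
[array([[1., 0.]], dtype=float32), array([[1., 0.]], dtype=float32)]
The last line therefore takes the single row out of each array and wraps the result in np.array, giving one 2-D one-hot matrix with a row per sample.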
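The flattening step can be checked in isolation with NumPy. The `per_sample` list below is hand-built to mimic the intermediate value shown above (it is not produced by Keras here), and `np.vstack` is shown as an equivalent alternative:

```python
import numpy as np

# Hand-built stand-in for the intermediate list: one (1, 2) array per sample
per_sample = [
    np.array([[1., 0.]], dtype="float32"),
    np.array([[0., 1.]], dtype="float32"),
]

# Same expression as the last line of the snippet above
y = np.array([list(row[0]) for row in per_sample])
print(y.shape)  # (2, 2)

# Equivalent alternative: stack the (1, 2) arrays row-wise
y_alt = np.vstack(per_sample)
print(y_alt.shape)  # (2, 2)
```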
keras.preprocessing.sequence.pad_sequences(sequences,
                                           maxlen=None,
                                           dtype='int32',
                                           padding='pre',
                                           truncating='pre',
                                           value=0.)
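sequences: a list of lists (two levels of nesting) of floats or integers.
maxlen: None or an integer, the maximum sequence length. Longer sequences are truncated; shorter ones are padded with value on the side chosen by padding. If None, the length of the longest sequence is used.
dtype: the data type of the returned numpy array.
padding: 'pre' or 'post'; whether padding is added at the start or at the end of a sequence.
truncating: 'pre' or 'post'; whether a sequence that is too long is cut at the start or at the end.
value: float; the value used for padding in place of the default 0.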
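To make the padding and truncating options concrete, here is a small pure-Python sketch of the behaviour described above. `pad_sketch` is a hypothetical helper written for this note; it is not the Keras implementation and it returns plain lists rather than a numpy array.

```python
def pad_sketch(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Illustrative re-implementation for flat sequences; not the Keras code."""
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the start, 'post' drops them from the end
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the fill before the data, 'post' puts it after
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


seqs = [[1, 2, 3], [1, 6, 4, 5, 3]]

print(pad_sketch(seqs, maxlen=4))
# [[0, 1, 2, 3], [6, 4, 5, 3]]

print(pad_sketch(seqs, maxlen=4, padding="post", truncating="post"))
# [[1, 2, 3, 0], [1, 6, 4, 5]]
```

Note that load_data passes padding='post' but leaves truncating at its default, so reviews longer than input_shape lose their first characters rather than their last.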
to_categorical(y, num_classes=None, dtype='float32')
Converts integer class labels to one-hot encoding.
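For intuition, the same kind of one-hot matrix can be built with plain NumPy by indexing an identity matrix. This is a sketch of the idea, not how Keras implements `to_categorical`; `label_ids` and `num_classes` are made-up sample values.

```python
import numpy as np

label_ids = [0, 1, 1, 0]  # made-up integer class labels
num_classes = 2

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype="float32")[label_ids]

print(one_hot.shape)  # (4, 2)
print(one_hot)
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]
```

Encoding a flat list of label IDs in one call like this yields the 2-D matrix directly, without the per-sample (1, 2) arrays that the list comprehension in load_data has to flatten afterwards.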