
Detailed walkthrough of the keras\preprocessing directory files, part 5.2 (sequence.py): Keras study notes 5


Purpose: utilities for preprocessing sequence data (for example, an article or a sentence).

keras-master\keras\preprocessing\sequence.py

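These utilities convert input text, encoded as sequences of word indices, into a data format that can be processed further (for example, a fixed-size matrix that can be fed to a word-embedding layer).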

Keras package file directory

Keras examples file directory

Annotated code

# -*- coding: utf-8 -*-
"""Utilities for preprocessing sequence data.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import random
from six.moves import range

def pad_sequences(sequences, maxlen=None, dtype='int32',
                  padding='pre', truncating='pre', value=0.):
    """Pads each sequence to the same length (length of the longest sequence).

    If maxlen is provided, any sequence longer
    than maxlen is truncated to maxlen.
    Truncation happens off either the beginning (default) or
    the end of the sequence.

    Supports post-padding and pre-padding (default).

    # Arguments
        sequences: list of lists where each element is a sequence
        maxlen: int, maximum length
        dtype: type to cast the resulting sequence.
        padding: 'pre' or 'post', pad either before or after each sequence.
        truncating: 'pre' or 'post', remove values from sequences larger than
            maxlen either in the beginning or in the end of the sequence
        value: float, the value used for padding.

    # Returns
        x: numpy array with dimensions (number_of_sequences, maxlen),
            where number_of_sequences is the number of sequences and
            maxlen is the length of every padded sequence.

    # Raises
        ValueError: in case of invalid values for `truncating` or `padding`,
            or in case of invalid shape for a `sequences` entry.
    """
    if not hasattr(sequences, '__len__'):
        raise ValueError('`sequences` must be iterable.')
    lengths = []
    for x in sequences:
        if not hasattr(x, '__len__'):
            raise ValueError('`sequences` must be a list of iterables. '
                             'Found non-iterable: ' + str(x))
        lengths.append(len(x))

    num_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)

    # take the sample shape from the first non empty sequence
    # checking for consistency in the main loop below.
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:
            sample_shape = np.asarray(s).shape[1:]
            break

    x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if not len(s):
            continue  # empty list/array was found
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError('Truncating type "%s" not understood' % truncating)

        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError('Shape of sample %s of sequence at position %s is different from expected shape %s' %
                             (trunc.shape[1:], idx, sample_shape))

        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError('Padding type "%s" not understood' % padding)
    return x

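
To make the padding and truncation rules of pad_sequences concrete, here is a minimal sketch for flat integer sequences only. The helper name `pad_1d` is hypothetical and is not part of Keras; it skips the argument validation and the multi-dimensional sample handling of the real function and only mirrors the 'pre'/'post' logic.

```python
import numpy as np


def pad_1d(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Simplified 1-D illustration of the pad/truncate rules (hypothetical helper)."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty sequence stays fully padded
        # 'pre' truncation keeps the LAST maxlen items, 'post' keeps the FIRST maxlen
        kept = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        if padding == "pre":
            out[row, -len(kept):] = kept  # padding values end up in front
        else:
            out[row, :len(kept)] = kept   # padding values end up at the end
    return out


seqs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
pre = pad_1d(seqs, maxlen=4)
post = pad_1d(seqs, maxlen=4, padding="post", truncating="post")
print(pre)   # rows: [0 1 2 3], [0 0 4 5], [7 8 9 10]
print(post)  # rows: [1 2 3 0], [4 5 0 0], [6 7 8 9]
```

With the defaults, both padding and truncation act on the front of the sequence, so the most recent tokens are the ones that survive.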
def make_sampling_table(size, sampling_factor=1e-5):
    """Generates a word rank-based probabilistic sampling table.

    This generates an array where the ith element
    is the probability that a word of rank i would be sampled,
    according to the sampling distribution used in word2vec.

    The word2vec formula is:
        p(word) = min(1, sqrt(word.frequency/sampling_factor) / (word.frequency/sampling_factor))

    We assume that the word frequencies follow Zipf's law (s=1) to derive
    a numerical approximation of frequency(rank):
        frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))
    where gamma is the Euler-Mascheroni constant.

    Zipf's law: https://en.wikipedia.org/wiki/Zipf%27s_law
    https://www.cnblogs.com/sddai/p/6081447.html

    # Arguments
        size: int, number of possible words to sample.
        sampling_factor: the sampling factor in the word2vec formula.

    # Returns
        A 1D Numpy array of length `size` where the ith entry
        is the probability that a word of rank i should be sampled.
    """
    gamma = 0.577
    rank = np.arange(size)
    rank[0] = 1  # avoid log(0) and division by zero for index 0
    inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1. / (12. * rank)
    f = sampling_factor * inv_fq

    return np.minimum(1., f / np.sqrt(f))

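
The sampling formula in make_sampling_table can be checked numerically. The sketch below is an illustrative re-implementation (the name `sampling_table_sketch` is hypothetical, not a Keras function) that uses the identity f / sqrt(f) = sqrt(f), so the probability for a rank is min(1, sqrt(sampling_factor / frequency(rank))).

```python
import numpy as np


def sampling_table_sketch(size, sampling_factor=1e-5):
    """Illustrative version of the rank-based table (hypothetical helper)."""
    gamma = 0.577  # rounded Euler-Mascheroni constant, as in the listing
    rank = np.arange(size, dtype="float64")
    rank[0] = 1.0  # index 0 is a non-word; reuse rank 1 to avoid log(0)
    # Zipf-based approximation of 1 / frequency(rank)
    inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1.0 / (12.0 * rank)
    f = sampling_factor * inv_fq
    # f / sqrt(f) simplifies to sqrt(f)
    return np.minimum(1.0, np.sqrt(f))


table = sampling_table_sketch(10000)
print(table[1], table[100], table[9999])
```

Under these assumptions the table increases with rank: the most frequent words (small rank) get a small probability of being kept, which is the subsampling of frequent words used in word2vec, while rare words are kept almost always.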
def skipgrams(sequence, vocabulary_size,
              window_size=4, negative_samples=1., shuffle=True,
              categorical=False, sampling_table=None, seed=None):
    """Generates skipgram word pairs.

    skipgram: https://blog.csdn.net/u010665216/article/details/78721354?locationNum=7&fps=1

    Takes a sequence (list of indexes of words),
    returns couples of [word_index, other_word index] and labels (1s or 0s),
    where label = 1 if 'other_word' belongs to the context of 'word',
    and label=0 if 'other_word' is randomly sampled

    # Arguments
        sequence: a word sequence (sentence), encoded as a list
            of word indices (integers). If using a `sampling_table`,
            word indices are expected to match the rank
            of the words in a reference dataset (e.g. 10 would encode
            the 10-th most frequently occurring token).
            Note that index 0 is expected to be a non-word and will be skipped.
        vocabulary_size: int. maximum possible word index + 1
            (word indices start at 0).
        window_size: int. actually half-window.
            The window of a word wi will be [i-window_size, i+window_size+1]
        negative_samples: float >= 0. 0 for no negative (=random) samples.
            1 for same number as positive samples. etc.
        shuffle: whether to shuffle the word couples before returning them.
        categorical: bool. if False, labels will be
            integers (eg. [0, 1, 1 .. ]),
            if True labels will be categorical eg. [[1,0],[0,1],[0,1] .. ]
        sampling_table: 1D array of size `vocabulary_size` where the entry i
            encodes the probability to sample a word of rank i.
        seed: random seed.

    # Returns
        couples, labels: where `couples` are int pairs and
            `labels` are either 0 or 1.

    # Note
        By convention, index 0 in the vocabulary is
        a non-word and will be skipped.
    """
    couples = []
    labels = []
    for i, wi in enumerate(sequence):
        if not wi:
            continue
        if sampling_table is not None:
            if sampling_table[wi] < random.random():
                continue

        window_start = max(0, i - window_size)
        window_end = min(len(sequence), i + window_size + 1)
        for j in range(window_start, window_end):
            if j != i:
                wj = sequence[j]
                if not wj:
                    continue
                couples.append([wi, wj])
                if categorical:
                    labels.append([0, 1])
                else:
                    labels.append(1)

    if negative_samples > 0:
        num_negative_samples = int(len(labels) * negative_samples)
        words = [c[0] for c in couples]
        random.shuffle(words)

        couples += [[words[i % len(words)],
                     random.randint(1, vocabulary_size - 1)] for i in range(num_negative_samples)]
        if categorical:
            labels += [[1, 0]] * num_negative_samples
        else:
            labels += [0] * num_negative_samples

    if shuffle:
        if seed is None:
            # randint needs integer bounds; 10e6 is a float and is
            # rejected by recent Python versions
            seed = random.randint(0, int(10e6))
        # re-seeding before each shuffle applies the same permutation
        # to both lists, so couples and labels stay aligned
        random.seed(seed)
        random.shuffle(couples)
        random.seed(seed)
        random.shuffle(labels)

    return couples, labels

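
Two details of skipgrams are easy to miss: index 0 is skipped both as a centre word and as a context word, and couples and labels are shuffled with the same seed so that they stay aligned. The sketch below isolates both steps in plain Python. The helper `positive_pairs` and the sample data are hypothetical and only illustrate the logic; they leave out negative sampling and the sampling table.

```python
import random


def positive_pairs(sequence, window_size):
    """Collect [word, context_word] pairs inside the half-window (hypothetical helper)."""
    pairs = []
    for i, wi in enumerate(sequence):
        if not wi:
            continue  # index 0 is a non-word and is skipped as a centre word
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i and sequence[j]:  # index 0 is also skipped as context
                pairs.append([wi, sequence[j]])
    return pairs


pairs = positive_pairs([1, 2, 0, 3], window_size=1)
print(pairs)  # [[1, 2], [2, 1]]

# Shuffling two equally long lists after re-seeding with the same value
# applies the same permutation to both, so each couple keeps its label.
couples = [[1, 2], [2, 1], [1, 7], [2, 5]]
labels = [1, 1, 0, 0]
label_of = {tuple(c): l for c, l in zip(couples, labels)}
random.seed(42)
random.shuffle(couples)
random.seed(42)
random.shuffle(labels)
print(all(label_of[tuple(c)] == l for c, l in zip(couples, labels)))  # True
```

Word 3 produces no pair here because its only neighbour inside the window is the non-word 0.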
def _remove_long_seq(maxlen, seq, label):
    """Removes sequences that exceed the maximum length.

    # Arguments
        maxlen: int, maximum length
        seq: list of lists where each sublist is a sequence
        label: list where each element is an integer

    # Returns
        new_seq, new_label: shortened lists for `seq` and `label`.
    """
    new_seq, new_label = [], []
    for x, y in zip(seq, label):
        # strict comparison: a sequence of length exactly maxlen is dropped too
        if len(x) < maxlen:
            new_seq.append(x)
            new_label.append(y)
    return new_seq, new_label
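
A small example may help with the boundary case of _remove_long_seq: because the test is `len(x) < maxlen`, a sequence whose length equals maxlen is removed as well. The function below is an illustrative copy of that logic under the hypothetical name `remove_long_seq_sketch`, with made-up input data.

```python
def remove_long_seq_sketch(maxlen, seq, label):
    """Keep only (sequence, label) pairs with len(sequence) < maxlen (hypothetical helper)."""
    kept = [(x, y) for x, y in zip(seq, label) if len(x) < maxlen]
    new_seq = [x for x, _ in kept]
    new_label = [y for _, y in kept]
    return new_seq, new_label


seqs = [[1, 2], [3, 4, 5], [6, 7, 8, 9]]
labels = [0, 1, 0]
new_seq, new_label = remove_long_seq_sketch(3, seqs, labels)
print(new_seq, new_label)  # [[1, 2]] [0]
```

The sequence [3, 4, 5] has length 3, which equals maxlen, so it is filtered out together with its label.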

Running the code

More about Keras

English: https://keras.io/

Chinese: http://keras-cn.readthedocs.io/en/latest/

Example downloads

https://github.com/keras-team/keras

https://github.com/keras-team/keras/tree/master/examples

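Full project download

For readers without download credits, join QQ group 452205574 to access the shared folder.

It includes the code, datasets (images), trained models, library installation files, and more.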
