Purpose: utilities for preprocessing sequence data (e.g., an article or individual sentences).
keras-master\keras\preprocessing\sequence.py
These utilities convert input text (lists of word indices) into data formats that can be processed further (e.g., padded matrices fed to a word-embedding layer, or skip-gram training pairs).
Code annotations
# -*- coding: utf-8 -*-
"""Utilities for preprocessing sequence data."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import random
from six.moves import range


def pad_sequences(sequences, maxlen=None, dtype='int32',
                  padding='pre', truncating='pre', value=0.):
    """Pads each sequence to the same length (the length of the longest sequence).

    If maxlen is provided, any sequence longer
    than maxlen is truncated to maxlen.
    Truncation happens off either the beginning (default) or
    the end of the sequence.
    Supports post-padding and pre-padding (default).

    # Arguments
        sequences: list of lists where each element is a sequence
        maxlen: int, maximum length
        dtype: type to cast the resulting sequences to.
        padding: 'pre' or 'post', pad either before or after each sequence.
        truncating: 'pre' or 'post', remove values from sequences larger than
            maxlen either at the beginning or at the end of the sequence.
        value: float, padding value.

    # Returns
        x: numpy array with dimensions (number_of_sequences, maxlen)

    # Raises
        ValueError: in case of an invalid value for `truncating` or `padding`,
            or an invalid shape for a `sequences` entry.
    """
    if not hasattr(sequences, '__len__'):
        raise ValueError('`sequences` must be iterable.')
    lengths = []
    for x in sequences:
        if not hasattr(x, '__len__'):
            raise ValueError('`sequences` must be a list of iterables. '
                             'Found non-iterable: ' + str(x))
        lengths.append(len(x))

    num_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)

    # take the sample shape from the first non empty sequence
    # checking for consistency in the main loop below.
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:
            sample_shape = np.asarray(s).shape[1:]
            break

    x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if not len(s):
            continue  # empty list/array was found
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError('Truncating type "%s" not understood' % truncating)

        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError('Shape of sample %s of sequence at position %s is different from expected shape %s' %
                             (trunc.shape[1:], idx, sample_shape))

        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError('Padding type "%s" not understood' % padding)
    return x
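
# Annotator's usage sketch (not part of the Keras source), assuming a few toy
# sequences; by default both padding and truncation happen at the front ('pre').
#
#     seqs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
#     pad_sequences(seqs, maxlen=4)
#     # -> array([[ 0,  1,  2,  3],
#     #           [ 0,  0,  4,  5],
#     #           [ 7,  8,  9, 10]], dtype=int32)
#     pad_sequences(seqs, maxlen=4, padding='post', truncating='post')
#     # -> array([[1, 2, 3, 0],
#     #           [4, 5, 0, 0],
#     #           [6, 7, 8, 9]], dtype=int32)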


def make_sampling_table(size, sampling_factor=1e-5):
    """Generates a word rank-based probabilistic sampling table.

    This generates an array where the ith element
    is the probability that a word of rank i would be sampled,
    according to the sampling distribution used in word2vec.

    The word2vec formula is:
        p(word) = min(1, sqrt(word.frequency/sampling_factor) / (word.frequency/sampling_factor))

    We assume that the word frequencies follow Zipf's law (s=1) to derive
    a numerical approximation of frequency(rank):
        frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))
    where gamma is the Euler-Mascheroni constant.

    Zipf's law: https://en.wikipedia.org/wiki/Zipf%27s_law
    (annotator's reference, in Chinese: https://www.cnblogs.com/sddai/p/6081447.html)

    # Arguments
        size: int, number of possible words to sample.
        sampling_factor: the sampling factor in the word2vec formula.

    # Returns
        A 1D Numpy array of length `size` where the ith entry
        is the probability that a word of rank i should be sampled.
    """
    gamma = 0.577
    rank = np.arange(size)
    rank[0] = 1
    inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1. / (12. * rank)
    f = sampling_factor * inv_fq

    return np.minimum(1., f / np.sqrt(f))
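
# Annotator's usage sketch (not part of the Keras source): with the default
# sampling_factor, the most frequent words (smallest ranks) get the smallest
# keep-probabilities, which down-samples very common words during training.
#
#     table = make_sampling_table(10000)
#     table[:4]  # -> roughly [0.0032, 0.0032, 0.0055, 0.0074]
#     # Entry i is meant to be passed to skipgrams() as `sampling_table`,
#     # where a word of rank i is kept only if sampling_table[i] >= random.random().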


def skipgrams(sequence, vocabulary_size,
              window_size=4, negative_samples=1., shuffle=True,
              categorical=False, sampling_table=None, seed=None):
    """Generates skipgram word pairs.

    (annotator's reference on skip-grams, in Chinese:
    https://blog.csdn.net/u010665216/article/details/78721354?locationNum=7&fps=1)

    Takes a sequence (list of indexes of words),
    returns couples of [word_index, other_word index] and labels (1s or 0s),
    where label = 1 if 'other_word' belongs to the context of 'word',
    and label = 0 if 'other_word' is randomly sampled.

    # Arguments
        sequence: a word sequence (sentence), encoded as a list
            of word indices (integers). If using a `sampling_table`,
            word indices are expected to match the rank
            of the words in a reference dataset (e.g. 10 would encode
            the 10-th most frequently occurring token).
            Note that index 0 is expected to be a non-word and will be skipped.
        vocabulary_size: int. maximum possible word index + 1
        window_size: int. actually half-window.
            The window of a word wi will be [i - window_size, i + window_size + 1].
        negative_samples: float >= 0. 0 for no negative (=random) samples.
            1 for the same number as positive samples, etc.
        shuffle: whether to shuffle the word couples before returning them.
        categorical: bool. if False, labels will be
            integers (eg. [0, 1, 1 .. ]),
            if True, labels will be categorical, eg. [[1,0],[0,1],[0,1] .. ]
        sampling_table: 1D array of size `vocabulary_size` where the entry i
            encodes the probability to sample a word of rank i.
        seed: random seed.

    # Returns
        couples, labels: where `couples` are int pairs and
            `labels` are either 0 or 1.

    # Note
        By convention, index 0 in the vocabulary is
        a non-word and will be skipped.
    """
    couples = []
    labels = []
    for i, wi in enumerate(sequence):
        if not wi:
            continue
        if sampling_table is not None:
            if sampling_table[wi] < random.random():
                continue

        window_start = max(0, i - window_size)
        window_end = min(len(sequence), i + window_size + 1)
        for j in range(window_start, window_end):
            if j != i:
                wj = sequence[j]
                if not wj:
                    continue
                couples.append([wi, wj])
                if categorical:
                    labels.append([0, 1])
                else:
                    labels.append(1)

    if negative_samples > 0:
        num_negative_samples = int(len(labels) * negative_samples)
        words = [c[0] for c in couples]
        random.shuffle(words)

        couples += [[words[i % len(words)],
                     random.randint(1, vocabulary_size - 1)]
                    for i in range(num_negative_samples)]
        if categorical:
            labels += [[1, 0]] * num_negative_samples
        else:
            labels += [0] * num_negative_samples

    if shuffle:
        if seed is None:
            seed = random.randint(0, 10e6)
        random.seed(seed)
        random.shuffle(couples)
        random.seed(seed)
        random.shuffle(labels)

    return couples, labels
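
# Annotator's usage sketch (not part of the Keras source): with shuffling and
# negative sampling turned off, the output is deterministic and easy to read.
#
#     couples, labels = skipgrams([1, 2, 3], vocabulary_size=4,
#                                 window_size=1, negative_samples=0., shuffle=False)
#     # couples -> [[1, 2], [2, 1], [2, 3], [3, 2]]
#     # labels  -> [1, 1, 1, 1]
#     # With the default negative_samples=1., an equal number of random
#     # [word, random_word] couples with label 0 would be appended.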


def _remove_long_seq(maxlen, seq, label):
    """Removes sequences that exceed the maximum length.

    # Arguments
        maxlen: int, maximum length
        seq: list of lists where each sublist is a sequence
        label: list where each element is an integer

    # Returns
        new_seq, new_label: shortened lists for `seq` and `label`.
    """
    new_seq, new_label = [], []
    for x, y in zip(seq, label):
        if len(x) < maxlen:
            new_seq.append(x)
            new_label.append(y)
    return new_seq, new_label
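
# Annotator's usage sketch (not part of the Keras source): note the strict
# comparison `len(x) < maxlen`, so sequences whose length is exactly maxlen
# are removed as well.
#
#     _remove_long_seq(maxlen=4, seq=[[1, 2], [1, 2, 3, 4, 5]], label=[0, 1])
#     # -> ([[1, 2]], [0])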
Code execution
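Below is a minimal, self-contained sketch of how the three public utilities fit together. It assumes Keras is installed, so the functions above can be imported from keras.preprocessing.sequence; the index-encoded sentences and the vocabulary size are invented purely for illustration.

# Demo of the sequence preprocessing utilities (assumes Keras is installed).
from keras.preprocessing.sequence import (pad_sequences, make_sampling_table,
                                          skipgrams)

# A tiny "corpus" already encoded as word indices; index 0 is reserved
# as the non-word / padding value by convention.
encoded_sentences = [[1, 2, 3], [2, 4], [5, 1, 2, 6, 3]]
vocabulary_size = 7

# 1) Pad every sentence to the same length so they form one matrix.
x = pad_sequences(encoded_sentences, maxlen=4, padding='post', truncating='post')
print(x)
# [[1 2 3 0]
#  [2 4 0 0]
#  [5 1 2 6]]

# 2) Build a word2vec-style sampling table for this vocabulary size.
#    (For a toy vocabulary the keep-probabilities are tiny; the table is
#    intended for large, rank-ordered real vocabularies.)
sampling_table = make_sampling_table(vocabulary_size)

# 3) Generate (word, context) pairs plus an equal number of negative samples
#    for the longest sentence; without a sampling_table nothing is dropped.
couples, labels = skipgrams(encoded_sentences[2], vocabulary_size,
                            window_size=2, negative_samples=1., seed=1)
print(len(couples), len(labels))  # -> 28 28 (14 positive pairs + 14 negatives)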
Detailed Keras documentation
Chinese docs: http://keras-cn.readthedocs.io/en/latest/
Example downloads
https://github.com/keras-team/keras
https://github.com/keras-team/keras/tree/master/examples