
keras: Data Load & Data Preprocessing


Data Load

Format Transformation

Original formats:

  • Images
  • Text files
  • CSV data

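You need to make your data available in one of 3 formats:

  • NumPy arrays
    Suitable for data small enough to fit in memory.
  • tf.data.Dataset objects:
    ① Optimized for GPU training, so they can keep the GPU better utilized than the other formats.
    ② Can read data from disk that is too large to fit in memory.
  • Python generators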
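
The third option, a Python generator, is not illustrated elsewhere in this article. Below is a minimal sketch, assuming the data already sits in NumPy arrays. The helper name `batch_generator` and the toy array shapes are made up for illustration; the generator simply yields `(inputs, targets)` batches, which is the tuple layout `fit()` accepts from a generator.

```python
import numpy as np

def batch_generator(features, labels, batch_size):
    """Yield (inputs, targets) batches forever, cycling over the arrays."""
    num_samples = len(features)
    while True:  # endless loop; the consumer decides how many steps form an epoch
        for start in range(0, num_samples, batch_size):
            end = start + batch_size
            yield features[start:end], labels[start:end]

# Hypothetical toy data: 10 samples with 4 features each, binary labels.
features = np.arange(40, dtype="float32").reshape(10, 4)
labels = np.array([0, 1] * 5, dtype="int32")

gen = batch_generator(features, labels, batch_size=4)
first_inputs, first_targets = next(gen)
print(first_inputs.shape, first_targets.shape)  # (4, 4) (4,)
```

Because this generator never ends, a call such as `model.fit(gen, steps_per_epoch=3)` would need `steps_per_epoch` to define where an epoch stops.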

Converting to tf.data.Dataset

Reading

  • Images: tf.keras.preprocessing.image_dataset_from_directory(...)
# image files sorted into class-specific folders
main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg
from tensorflow import keras

dataset = keras.preprocessing.image_dataset_from_directory(
  'path/to/main_directory', batch_size=64, image_size=(200, 200))

# For demonstration, iterate over the batches yielded by the dataset.
for data, labels in dataset:
   print(data.shape)  # (64, 200, 200, 3): 64 images per batch, 200x200 pixels, 3 RGB channels
   print(data.dtype)  # float32
   print(labels.shape)  # (64,): 64 labels per batch
   print(labels.dtype)  # int32
  • Text files: keras.preprocessing.text_dataset_from_directory(...)
    As with images, the documents are sorted into class-specific folders.
dataset = keras.preprocessing.text_dataset_from_directory(
  'path/to/main_directory', batch_size=64)

# For demonstration, iterate over the batches yielded by the dataset.
for data, labels in dataset:
   print(data.shape)  # (64,)
   print(data.dtype)  # string
   print(labels.shape)  # (64,)
   print(labels.dtype)  # int32

Operations

  • Inspecting, option 1: iterate with .as_numpy_iterator()
print(list(dataset.as_numpy_iterator()))
# [(array([1, 3], dtype=int32), array([b'A'], dtype=object)), 
#  (array([2, 1], dtype=int32), array([b'B'], dtype=object)), 
#  (array([3, 3], dtype=int32), array([b'A'], dtype=object))]
  • Inspecting, option 2: a for loop
for element in dataset.as_numpy_iterator():
	print(element)
# (array([1, 3], dtype=int32), array([b'A'], dtype=object))
# (array([2, 1], dtype=int32), array([b'B'], dtype=object))
# (array([3, 3], dtype=int32), array([b'A'], dtype=object))
  • .take(count): takes the first count elements of the dataset (count batches, if the dataset is already batched).
for inputs, targets in dataset.take(1):
	print(inputs)			# tf.Tensor([1 3], shape=(2,), dtype=int32)
	print(targets)			# tf.Tensor([b'A'], shape=(1,), dtype=string)
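  • .batch(): sets the batch size. A dataset must be batched before it is passed to fit(); otherwise fit() typically raises an error.
# 32 elements per batch
dataset = dataset.batch(32)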

Data Preprocessing

vectorized & standardized

In short:

  • vectorized: map non-numeric features to numbers, e.g. [dog, cat] → [0, 1]
  • standardized: rescale values into [0.0, 1.0], or normalize them to zero mean and unit variance

In detail:

  • Text files
    ①need to be read into string tensors,
    ②then split into words.
    ③Finally, the words need to be indexed & turned into integer tensors.
  • Images
    ①need to be read and decoded into integer tensors,
    ②then converted to floating point and normalized to small values (usually between 0 and 1).
  • CSV data
    ①needs to be parsed, with numerical features converted to floating point tensors and categorical features indexed and converted to integer tensors.
    ②Then each feature typically needs to be normalized to zero-mean and unit-variance.

Text

Basics

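tensorflow.keras.layers.experimental.preprocessing.TextVectorization: holds a mapping between string tokens and integer indices.

  • The vocabulary must consist of strings.
  • Index 0 is the padding value (the empty word "" used when a sentence has too few words), and index 1 stands for out-of-vocabulary words (the vocabulary is set by adapt()).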
from tensorflow.keras.layers.experimental import preprocessing

vocabulary = ["aa bb cc"]
data = ["aa bb cc cc dd ee"]
layer = preprocessing.TextVectorization()
layer.adapt(vocabulary)          # build the vocabulary from this input
normalized_data = layer(data)    # translate data using the vocabulary learned by adapt()
print(normalized_data)
# tf.Tensor([[4 3 2 2 1 1]], shape=(1, 6), dtype=int64)
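  • The repeated word cc maps to 2 both times.
  • The out-of-vocabulary words dd and ee both map to 1.
  • When adapt() builds the vocabulary, punctuation and whitespace are ignored and only words count. Duplicate words are kept once.
  • The vocabulary passed to adapt() can be a 1-D list such as ["aa bb cc"] (a sentence), ["aa bb", "bb cc"] (sentences), or ["aa", "bb", "cc"] (words). It cannot be a bare string "aa bb cc" or a multi-column 2-D list such as [["aa", "bb"], ["aa", "cc"]], but a single-column 2-D list works: [["aa bb"], ["aa cc"]] (sentences) or [["aa"], ["bb"], ["cc"]] (words).
  • data must follow the same format rules, and the result is always 2-D. Note that each row is treated as one "..."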
data = ["aa bb cc", "cc dd"]
'''
tf.Tensor(
[[2 4 3]
 [3 1 0]], shape=(2, 3), dtype=int64)
'''
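  • "Too few words" refers to the two sentences in data: the word count of the longest sentence becomes the number of columns in the result, and shorter sentences are filled with 0 for the words they lack.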
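
The index conventions described above (0 for padding, 1 for out-of-vocabulary words, real words from 2 upward) can be imitated in a few lines of plain Python. This is only a sketch of the idea, not how TextVectorization is implemented: the helpers `build_vocab` and `vectorize` are hypothetical, words are split on whitespace only, and words are numbered in order of first appearance, which may differ from the order the layer assigns.

```python
PAD_INDEX = 0   # filler for sentences shorter than the longest one
OOV_INDEX = 1   # any word not seen when the vocabulary was built

def build_vocab(sentences):
    """Map each distinct word to an index starting at 2 (first-seen order)."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 2
    return vocab

def vectorize(sentences, vocab):
    """Turn sentences into equal-length rows of integer indices."""
    rows = [[vocab.get(w, OOV_INDEX) for w in s.lower().split()] for s in sentences]
    width = max((len(r) for r in rows), default=0)
    return [r + [PAD_INDEX] * (width - len(r)) for r in rows]

vocab = build_vocab(["aa bb cc"])
print(vocab)                                    # {'aa': 2, 'bb': 3, 'cc': 4}
print(vectorize(["aa bb cc", "cc dd"], vocab))  # [[2, 3, 4], [4, 1, 0]]
```

The second row shows both special indices at once: dd is unknown and becomes 1, and the missing third word is padded with 0.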

One-hot encoding

# Example: one-hot encoded bigrams
from tensorflow.keras.layers.experimental import preprocessing

vocabulary = ["aa bb cc"]
data = ["aa", "bb", "cc", "dd", ""]

layer = preprocessing.TextVectorization(output_mode="binary", ngrams=2)
layer.adapt(vocabulary)

integer_data = layer(data)
print(integer_data)
'''
tf.Tensor(
[[0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]], shape=(5, 6), dtype=float32)
'''

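Each row for a single word has exactly one position set to 1 and all others set to 0; the empty string produces an all-zero row.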
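
To see where the six columns in the output above come from, here is a plain-Python sketch of the bigram idea. It rests on assumptions inferred from that output rather than from the library's documentation: column 0 is taken to be the out-of-vocabulary slot, and the vocabulary tokens are sorted in reverse alphabetical order purely so that the rows line up with the matrix shown. The helpers `ngram_tokens` and `multi_hot` are hypothetical names.

```python
def ngram_tokens(text, n=2):
    """Return all 1..n-grams of a whitespace-split sentence."""
    words = text.split()
    tokens = []
    for size in range(1, n + 1):
        for i in range(len(words) - size + 1):
            tokens.append(" ".join(words[i:i + size]))
    return tokens

# Unigrams and bigrams of the vocabulary sentence: aa, bb, cc, "aa bb", "bb cc".
# Column 0 is reserved for out-of-vocabulary tokens (assumption, see above).
vocab_tokens = sorted(set(ngram_tokens("aa bb cc")), reverse=True)
columns = {token: i + 1 for i, token in enumerate(vocab_tokens)}

def multi_hot(text):
    """Mark the column of every token present in the text."""
    row = [0.0] * (len(columns) + 1)
    for token in ngram_tokens(text):
        row[columns.get(token, 0)] = 1.0
    return row

for sample in ["aa", "bb", "cc", "dd", ""]:
    print(multi_hot(sample))
```

A two-word input such as "aa bb" would set three positions in this sketch (two unigrams and one bigram), so the rows are one-hot only because every sample here is a single word.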

Image & CSV: normalizing features

  • Zero mean and unit variance: tensorflow.keras.layers.experimental.preprocessing.Normalization
    adapt() accepts three kinds of input: a batched Dataset, a Tensor, or a NumPy array. A pd.DataFrame cannot be passed directly.
import numpy as np
from tensorflow.keras.layers.experimental import preprocessing

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)

normalizer = preprocessing.Normalization()
normalizer.adapt(data)

normalized_data = normalizer(data)
print(normalized_data)
'''
tf.Tensor(
[[-1.2247448 -1.2247448 -1.2247448]
 [ 0.         0.         0.       ]
 [ 1.2247448  1.2247448  1.2247448]], shape=(3, 3), dtype=float32)
'''
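import tensorflow as tf

# tf.keras.utils.normalize() takes a NumPy array. By default it L2-normalizes
# along the last axis, which is not the same as zero mean and unit variance.
normalized_data = tf.keras.utils.normalize(data)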
print(normalized_data)
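
The two calls above are easy to confuse, so here is a NumPy-only sketch of what each is understood to compute. It assumes the Normalization layer standardizes each feature column, and that tf.keras.utils.normalize is called with its default arguments (axis=-1, order=2); it reproduces the arithmetic by hand and does not call either API.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)

# Standardization: subtract each column's mean, divide by its standard deviation.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
print(standardized[0])  # first row, approximately [-1.2247 -1.2247 -1.2247]

# L2 normalization: divide each row by its Euclidean length.
row_norms = np.linalg.norm(data, axis=1, keepdims=True)
l2_normalized = data / row_norms
print(np.linalg.norm(l2_normalized, axis=1))  # every row now has length close to 1.0
```

Standardization works per feature across samples, while L2 normalization works per sample across features, so the two are not interchangeable.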
  • Rescaling the range: tensorflow.keras.layers.experimental.preprocessing.Rescaling
import numpy as np
from tensorflow.keras.layers.experimental import preprocessing

# Example image data, with values in the [0, 255] range
training_data = np.random.randint(0, 256, size=(64, 200, 200, 3)).astype("float32")

# Rescale from [0, 255] to [0.0, 1.0]
output_data = preprocessing.Rescaling(scale=1.0 / 255)(training_data)

For a NumPy array, you can simply divide directly:

a = np.array([25.5,255])
a = a/255

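Encoding class labels

  • num_classes (3 here) must be greater than or equal to the largest label value + 1.
  • The classes in y should be numbered [0, MAX] so that they line up exactly with num_classes. Starting from 1 also works, but the result then contains a column 0 that is never used.
import numpy as np
from tensorflow import keras

y = np.array([0, 2, 1, 2, 1])  # three classes: 0, 1, 2
y = keras.utils.to_categorical(y, 3)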
print(y)
'''
[[1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
'''
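
The two rules about num_classes can be checked without Keras. The sketch below uses a hypothetical helper, `one_hot`, that indexes into an identity matrix; it is meant to mirror the result shown above, not to describe how to_categorical is implemented.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Row i of the identity matrix is the one-hot vector for class i."""
    labels = np.asarray(labels, dtype=int)
    if labels.max() >= num_classes:
        raise ValueError("num_classes must be at least max(labels) + 1")
    return np.eye(num_classes, dtype="float32")[labels]

print(one_hot([0, 2, 1, 2, 1], 3))  # same matrix as the to_categorical output above
print(one_hot([1, 2, 3], 4))        # labels start at 1, so column 0 stays all zeros
```

With labels starting at 1, the extra column costs one unused output unit; shifting the labels down by one before encoding avoids it.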

More

For preprocessing

Categorical data preprocessing layers

  • CategoryEncoding layer
  • Hashing layer
  • Discretization layer
  • StringLookup layer
  • IntegerLookup layer
  • CategoryCrossing layer

Image preprocessing & augmentation layers

  • Resizing layer
  • Rescaling layer
  • CenterCrop layer
  • RandomCrop layer
  • RandomFlip layer
  • RandomTranslation layer
  • RandomRotation layer
  • RandomZoom layer
  • RandomHeight layer
  • RandomWidth layer

Core preprocessing layers

  • TextVectorization layer
  • Normalization layer

For building a Dataset

Dataset preprocessing

  • Image data preprocessing
    • image_dataset_from_directory function
    • load_img function
    • img_to_array function
    • ImageDataGenerator class
    • flow method
    • flow_from_dataframe method
    • flow_from_directory method
  • Timeseries data preprocessing
    • timeseries_dataset_from_array function
    • pad_sequences function
    • TimeseriesGenerator class
  • Text data preprocessing
    • text_dataset_from_directory function
    • Tokenizer class

Preprocessing layers can be built into the Model

normalizer = preprocessing.Normalization()
normalizer.adapt(x_train)

inputs = keras.Input(shape=input_shape)
x = normalizer(inputs)  # apply the adapted Normalization layer
outputs = rest_of_the_model(x)
model = keras.Model(inputs, outputs)