Here we are working with CSV data.
There are two main parts to this: loading the data off disk, and preprocessing it into a form suitable for training.
This tutorial focuses on the loading, and gives some quick examples of preprocessing. For a tutorial that focuses on the preprocessing aspect, see the preprocessing layers guide and tutorial.
For any small CSV dataset, the simplest way to train a TensorFlow model on it is to load it into memory as a pandas DataFrame or a NumPy array.
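For example, the abalone training data used below can be read straight from a URL with pandas (this is the loading step from the complete code later in this section):

import pandas as pd

abalone_train = pd.read_csv(
    "https://storage.googleapis.com/download.tensorflow.org/data/abalone_train.csv",
    names=["Length", "Diameter", "Height", "Whole weight", "Shucked weight",
           "Viscera weight", "Shell weight", "Age"])
print(abalone_train.head())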
The nominal task for this dataset is to predict the age from the other measurements, so separate the features and labels for training:
abalone_features = abalone_train.copy()
abalone_labels = abalone_features.pop('Age')
You have just seen the most basic way to train a model using CSV data. The complete code for this first example follows; after that, you will learn how to apply preprocessing to normalize numeric columns.
import pandas as pd
import numpy as np

# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

import tensorflow as tf
from tensorflow.keras import layers

abalone_train = pd.read_csv(
    "https://storage.googleapis.com/download.tensorflow.org/data/abalone_train.csv",
    names=["Length", "Diameter", "Height", "Whole weight", "Shucked weight",
           "Viscera weight", "Shell weight", "Age"])
print(abalone_train.head())

abalone_features = abalone_train.copy()
abalone_labels = abalone_features.pop('Age')

abalone_features = np.array(abalone_features)
print(abalone_features)

abalone_model = tf.keras.Sequential([
  layers.Dense(64),
  layers.Dense(1)
])

abalone_model.compile(loss=tf.losses.MeanSquaredError(),
                      optimizer=tf.optimizers.Adam())

abalone_model.fit(abalone_features, abalone_labels, epochs=10)
[Figure omitted: it showed two ways of making predictions, with identical results.]
Here we apply normalization to the numeric features.
It’s good practice to normalize the inputs to your model. The Keras preprocessing layers provide a convenient way to build this normalization into your model.
The layer will precompute the mean and variance of each column, and use these to normalize the data.
Note: Only use your training data to .adapt() preprocessing layers. Do not use your validation or test data.
normalize = layers.Normalization()
normalize.adapt(abalone_features)
norm_abalone_model = tf.keras.Sequential([
  normalize,
  layers.Dense(64),
  layers.Dense(1)
])

norm_abalone_model.compile(loss=tf.losses.MeanSquaredError(),
                           optimizer=tf.optimizers.Adam())
norm_abalone_model.fit(abalone_features, abalone_labels, epochs=10)
print(norm_abalone_model.predict(abalone_features[:2]))
print(abalone_labels[:2])
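To make concrete what adapt() precomputes, here is a minimal standalone sketch on toy data (not the abalone set): after adapt(), the layer shifts and scales each column by the statistics it measured, roughly (x - mean) / sqrt(variance).

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

data = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0]], dtype=np.float32)

norm = layers.Normalization()
norm.adapt(data)  # precomputes per-column mean and variance

# Manual equivalent of what the layer applies
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print(norm(data).numpy())  # matches the manual version up to numerical details
print(manual)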
Now for preprocessing the Titanic data. The raw data lives in a CSV file; some columns are floats, which need to be normalized, while others are strings, which need to be preprocessed into numeric values.
Because of the different data types and ranges, you can’t simply stack the features into a NumPy array and pass it to a keras.Sequential model. Each column needs to be handled individually.
As one option, you could preprocess your data offline (using any tool you like) to convert categorical columns to numeric columns, then pass the processed output to your TensorFlow model. The disadvantage of that approach is that if you save and export your model, the preprocessing is not saved with it. The Keras preprocessing layers avoid this problem because they’re part of the model.
The functional API operates on “symbolic” tensors. Normal “eager” tensors have a value. In contrast, these “symbolic” tensors do not. Instead, they keep track of which operations are run on them and build a representation of the calculation that you can run later.
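The following snippet, lifted from the complete code at the end of this section, makes this concrete: result has no value until the recorded calculation is wrapped in a tf.keras.Model and called.

import tensorflow as tf

# Create a symbolic input
input = tf.keras.Input(shape=(), dtype=tf.float32)

# Perform a calculation using the input; this records the op instead of running it
result = 2 * input + 1

# The result doesn't have a value yet
print(result)

# Wrap the recorded calculation in a Model to run it later
calc = tf.keras.Model(inputs=input, outputs=result)
print(calc(1).numpy())  # 3.0
print(calc(2).numpy())  # 5.0

Applied to the Titanic data, the first step (also taken from the complete code below, which defines titanic and titanic_features) is to build one symbolic Input per CSV column, then concatenate the float columns and normalize them together:

inputs = {}
for name, column in titanic_features.items():
  dtype = tf.string if column.dtype == object else tf.float32
  inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)

numeric_inputs = {name: input for name, input in inputs.items()
                  if input.dtype == tf.float32}

x = layers.Concatenate()(list(numeric_inputs.values()))
norm = layers.Normalization()
norm.adapt(np.array(titanic[numeric_inputs.keys()]))
all_numeric_inputs = norm(x)

# Collect the normalized numeric block; the string columns are appended next
preprocessed_inputs = [all_numeric_inputs]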
For the string inputs, use the tf.keras.layers.StringLookup function to map from strings to integer indices in a vocabulary. Next, use tf.keras.layers.CategoryEncoding to convert the indices into float32 data appropriate for the model.
The default settings for the tf.keras.layers.CategoryEncoding layer create a one-hot vector for each input. A layers.Embedding would also work. See the preprocessing layers guide and tutorial for more on this topic.
for name, input in inputs.items():
  if input.dtype == tf.float32:
    continue

  lookup = layers.StringLookup(vocabulary=np.unique(titanic_features[name]))
  one_hot = layers.CategoryEncoding(num_tokens=lookup.vocabulary_size())

  x = lookup(input)
  x = one_hot(x)
  preprocessed_inputs.append(x)
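To see the two layers in isolation, here is a minimal sketch with a toy two-word vocabulary (not a Titanic column); num_tokens and vocabulary_size() are the current Keras names for these arguments:

import numpy as np
from tensorflow.keras import layers

vocab = np.array(['female', 'male'])
lookup = layers.StringLookup(vocabulary=vocab)                          # strings -> integer indices
one_hot = layers.CategoryEncoding(num_tokens=lookup.vocabulary_size())  # indices -> float vectors

x = lookup(np.array([['male'], ['female'], ['unknown']]))
print(x.numpy())           # 'unknown' is out-of-vocabulary and maps to index 0
print(one_hot(x).numpy())  # one float32 one-hot row per input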
With the collected inputs and preprocessed_inputs, you can concatenate all the preprocessed inputs together and build a model that handles the preprocessing:
preprocessed_inputs_cat = layers.Concatenate()(preprocessed_inputs)
titanic_preprocessing = tf.keras.Model(inputs, preprocessed_inputs_cat)
tf.keras.utils.plot_model(model=titanic_preprocessing, rankdir="LR", dpi=72, show_shapes=True)
This model just contains the input preprocessing. You can run it to see what it does to your data. Keras models don’t automatically convert pandas DataFrames because it’s not clear whether they should be converted to one tensor or to a dictionary of tensors. So convert it to a dictionary of tensors:
titanic_features_dict = {name: np.array(value)
                         for name, value in titanic_features.items()}
Slice out the first training example and pass it to this preprocessing model; you will see the numeric features and the string one-hots all concatenated together:
features_dict = {name: values[:1] for name, values in titanic_features_dict.items()}
titanic_preprocessing(features_dict)
Here is the complete code:
import pandas as pd
import numpy as np

# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

import tensorflow as tf
from tensorflow.keras import layers

titanic = pd.read_csv("https://storage.googleapis.com/tf-datasets/titanic/train.csv")
print(titanic.head())

titanic_features = titanic.copy()
titanic_labels = titanic_features.pop('survived')

# Create a symbolic input
input = tf.keras.Input(shape=(), dtype=tf.float32)

# Perform a calculation using the input
result = 2 * input + 1

# The result doesn't have a value
print(result)

calc = tf.keras.Model(inputs=input, outputs=result)
print(calc(1).numpy())
print(calc(2).numpy())

# Build one symbolic input per CSV column
inputs = {}
for name, column in titanic_features.items():
  dtype = column.dtype
  if dtype == object:
    dtype = tf.string
  else:
    dtype = tf.float32
  inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)
print(inputs)

# Concatenate the numeric inputs and normalize them together
numeric_inputs = {name: input for name, input in inputs.items()
                  if input.dtype == tf.float32}

x = layers.Concatenate()(list(numeric_inputs.values()))
norm = layers.Normalization()
norm.adapt(np.array(titanic[numeric_inputs.keys()]))
all_numeric_inputs = norm(x)
print(all_numeric_inputs)

preprocessed_inputs = [all_numeric_inputs]

# Map string columns to one-hot encoded vectors
for name, input in inputs.items():
  if input.dtype == tf.float32:
    continue

  lookup = layers.StringLookup(vocabulary=np.unique(titanic_features[name]))
  one_hot = layers.CategoryEncoding(num_tokens=lookup.vocabulary_size())

  x = lookup(input)
  x = one_hot(x)
  preprocessed_inputs.append(x)

preprocessed_inputs_cat = layers.Concatenate()(preprocessed_inputs)
titanic_preprocessing = tf.keras.Model(inputs, preprocessed_inputs_cat)

titanic_features_dict = {name: np.array(value)
                         for name, value in titanic_features.items()}

features_dict = {name: values[:1] for name, values in titanic_features_dict.items()}
print(titanic_preprocessing(features_dict))

# Attach a small Dense model on top of the preprocessing head
def titanic_model(preprocessing_head, inputs):
  body = tf.keras.Sequential([
    layers.Dense(64),
    layers.Dense(1)
  ])

  preprocessed_inputs = preprocessing_head(inputs)
  result = body(preprocessed_inputs)
  model = tf.keras.Model(inputs, result)

  model.compile(loss=tf.losses.BinaryCrossentropy(from_logits=True),
                optimizer=tf.optimizers.Adam())
  return model

titanic_model = titanic_model(titanic_preprocessing, inputs)
titanic_model.fit(x=titanic_features_dict, y=titanic_labels, epochs=10)

# Because the preprocessing is part of the model, it is saved with it
titanic_model.save('test')
reloaded = tf.keras.models.load_model('test')

features_dict = {name: values[:1] for name, values in titanic_features_dict.items()}

before = titanic_model(features_dict)
after = reloaded(features_dict)
assert (before - after) < 1e-3
print(before)
print(after)
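As a usage sketch, the reloaded model can score a new passenger passed as a dictionary of arrays. The values below are made up for illustration; the keys match the Titanic CSV columns used above, and unseen strings simply fall into each lookup's out-of-vocabulary bucket:

new_passenger = {
    'sex': np.array(['male']),
    'age': np.array([25.0]),
    'n_siblings_spouses': np.array([0.0]),
    'parch': np.array([0.0]),
    'fare': np.array([8.05]),
    'class': np.array(['Third']),
    'deck': np.array(['unknown']),
    'embark_town': np.array(['Southampton']),
    'alone': np.array(['y']),
}

# The model outputs a logit; apply a sigmoid for a survival probability
logit = reloaded(new_passenger)
print(tf.nn.sigmoid(logit).numpy())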