
Wide & Deep: feature differences

Google paper link:

https://arxiv.org/pdf/1606.07792.pdf

Main idea of Wide & Deep

The core idea of the Wide & Deep model is to combine the memorization ability of a linear model with the generalization ability of a DNN. The parameters of both parts are optimized jointly during training, so that the predictive power of the overall model is maximized.
[Figure: Wide & Deep model architecture, combining a wide linear part and a deep neural-network part]

Wide features

The wide part mainly learns the co-occurrence of features in the training samples; what it produces are items the user has directly interacted with. A small number of cross-product feature transformations is enough to complement the weaknesses of the deep part.

The input to the wide side can include cross features, discrete (categorical) features, and some continuous features.

The wide part applies the cross-product transformation $\phi_k(x) = \prod_{i=1}^{d} x_i^{c_{ki}}$, where $c_{ki} \in \{0, 1\}$ is a boolean variable that is 1 if the $i$-th feature is part of the $k$-th transformation $\phi_k$ and 0 otherwise. This captures the interactions between binary features and adds nonlinearity to the generalized linear model.

For example, "AND(gender=female, language=en)" is 1 if and only if the constituent features ("gender=female" and "language=en") are all 1, and 0 otherwise.
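As a concrete illustration, here is a minimal Python sketch of the transformation (the feature dictionary and cross lists are made-up examples):

# Minimal sketch of the cross-product transformation phi_k(x)
# over binary features.
def cross_product(x, cross):
    """Return 1 iff every constituent binary feature in `cross` is 1."""
    return int(all(x[name] == 1 for name in cross))

x = {"gender=female": 1, "language=en": 1, "language=es": 0}
print(cross_product(x, ["gender=female", "language=en"]))  # 1
print(cross_product(x, ["gender=female", "language=es"]))  # 0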


Deep features

The deep part is a feed-forward neural network. It typically uses user behavior features and is meant to generalize to feature combinations that do not appear in the historical data.

The input to the deep side is mainly dense continuous features together with embeddings of the discrete (categorical) features.

For categorical features, the raw input is a feature string (e.g. "language=en"). These sparse, high-dimensional categorical features are first converted into low-dimensional, dense, real-valued vectors, usually called embedding vectors. The embedding dimensionality is typically on the order of 10 to 100.

The embedding vectors are usually initialized randomly, either uniformly or from a normal distribution. Random initialization only serves to bring the vectors to a reasonable scale; during training they are optimized end-to-end by minimizing the loss function.
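A minimal numpy sketch of this initialization (the vocabulary size, embedding dimension, and init range are made-up values):

import numpy as np

vocab_size, embed_dim = 1000, 32
rng = np.random.default_rng(0)
# Uniform init; rng.normal(0.0, 0.05, size=(vocab_size, embed_dim))
# would be the normal-distribution variant.
embeddings = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))
vec = embeddings[42]  # look up the embedding row for categorical value id 42
print(vec.shape)      # (32,)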

These low-dimensional dense embedding vectors are concatenated with the continuous-valued features and fed into the hidden layers of the network, each of which computes $a^{(l+1)} = f(W^{(l)} a^{(l)} + b^{(l)})$, where $l$ is the layer number, $f$ is the activation function (often ReLU), and $a^{(l)}$, $b^{(l)}$, and $W^{(l)}$ are the activations, bias, and weights at the $l$-th layer.
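A minimal numpy sketch of one such hidden-layer step (the shapes are made up: 40 concatenated inputs, 100 hidden units):

import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=40)              # concat(embeddings, dense features)
W = rng.normal(size=(100, 40))       # layer weights W^(l)
b = np.zeros(100)                    # layer bias b^(l)
a_next = np.maximum(0.0, W @ a + b)  # ReLU activation f
print(a_next.shape)                  # (100,)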

Joint training

For joint training, the wide part only needs a small number of cross-product feature transformations to complement the weaknesses of the deep part, rather than the full set of discrete features. Joint training is done by back-propagating the gradients from the output to both the wide and the deep parts of the model simultaneously.

For a logistic regression problem, the joint model predicts $P(Y = 1 \mid x) = \sigma(w_{wide}^{T}[x, \phi(x)] + w_{deep}^{T} a^{(l_f)} + b)$, where $Y$ is the 0/1 label, $\sigma(\cdot)$ is the sigmoid function, $\phi(x)$ are the cross-product transformations of the raw features $x$, and $b$ is the bias term. $w_{wide}$ is the weight vector of the wide part, and $w_{deep}$ are the weights applied to the final activations $a^{(l_f)}$.
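A minimal numpy sketch of this combination (the logit values are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

wide_logit = 0.8   # w_wide . [x, phi(x)]
deep_logit = -0.3  # w_deep . a^(l_f)
b = 0.1
print(sigmoid(wide_logit + deep_logit + b))  # P(Y=1|x) ≈ 0.646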

Training process

The structure of the final training model is shown below. During training, the input layer takes the training data (labels, categorical features, and continuous features). In the deep part of the model, a 32-dimensional embedding vector is learned for each categorical feature. All embeddings are concatenated with the dense features, yielding a dense vector of roughly 1200 dimensions, which is then fed through three ReLU layers and finally a logistic output layer.
[Figure: structure of the final Wide & Deep training model]

Backward pass

The wide model uses the FTRL optimizer, while the deep model uses the AdaGrad optimizer.
Binary-classification loss head: _binary_logistic_head_with_sigmoid_cross_entropy_loss
Multi-class loss head: _multi_class_head_with_softmax_cross_entropy_loss
The two towers are combined before the loss is computed: logits = dnn_logits + linear_logits
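For reference, the same per-tower optimizer pairing can be written against the newer tf.estimator API (the successor of the tf.contrib.learn classes used in the example below); this is only a sketch, assuming wide_columns and deep_columns are built as in that example:

import tensorflow as tf

# Sketch: assumes wide_columns / deep_columns are defined as in the example below.
m = tf.estimator.DNNLinearCombinedClassifier(
    model_dir="/tmp/wide_deep",
    linear_feature_columns=wide_columns,
    linear_optimizer=tf.train.FtrlOptimizer(learning_rate=0.1),   # wide tower: FTRL
    dnn_feature_columns=deep_columns,
    dnn_optimizer=tf.train.AdagradOptimizer(learning_rate=0.05),  # deep tower: AdaGrad
    dnn_hidden_units=[100, 50])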

An example (from https://zhuanlan.zhihu.com/p/43328492):
import tempfile
import tensorflow as tf

from six.moves import urllib

import pandas as pd

flags = tf.app.flags
FLAGS = flags.FLAGS

flags.DEFINE_string("model_dir", "", "Base directory for output models.")
flags.DEFINE_string("model_type", "wide_n_deep", "valid model types:{'wide','deep', 'wide_n_deep'")
flags.DEFINE_integer("train_steps", 200, "Number of training steps.")
flags.DEFINE_string("train_data", "", "Path to the training data.")
flags.DEFINE_string("test_data", "", "path to the test data")

COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country",
           "income_bracket"]

LABEL_COLUMN = "label"

CATEGORICAL_COLUMNS = ["workclass", "education", "marital_status", "occupation",
                       "relationship", "race", "gender", "native_country"]

CONTINUOUS_COLUMNS = ["age", "education_num", "capital_gain", "capital_loss",
                      "hours_per_week"]


# download test and train data
def maybe_download():
    if FLAGS.train_data:
        train_file_name = FLAGS.train_data
    else:
        train_file = tempfile.NamedTemporaryFile(delete=False)
        urllib.request.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data", train_file.name)
        train_file_name = train_file.name
        train_file.close()
        print("Training data is downloaded to %s" % train_file_name)

    if FLAGS.test_data:
        test_file_name = FLAGS.test_data
    else:
        test_file = tempfile.NamedTemporaryFile(delete=False)
        urllib.request.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.test",
                                   test_file.name)  # pylint: disable=line-too-long
        test_file_name = test_file.name
        test_file.close()
        print("Test data is downloaded to %s" % test_file_name)

    return train_file_name, test_file_name


# build the estimator
def build_estimator(model_dir):
    # Sparse (categorical) base columns.
    gender = tf.contrib.layers.sparse_column_with_keys(column_name="gender", keys=["female", "male"])
    education = tf.contrib.layers.sparse_column_with_hash_bucket("education", hash_bucket_size=1000)
    relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship", hash_bucket_size=100)
    workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass", hash_bucket_size=100)
    occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation", hash_bucket_size=1000)
    native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country", hash_bucket_size=1000)

    # Continuous base columns.
    age = tf.contrib.layers.real_valued_column("age")
    education_num = tf.contrib.layers.real_valued_column("education_num")
    capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
    capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
    hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")
    # Bucketize the continuous age column into categorical age ranges.
    age_buckets = tf.contrib.layers.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
    
    # The wide side takes the categorical features plus their crossed (combination) features.
    wide_columns = [gender, native_country, education, occupation, workclass, relationship, age_buckets,
                    tf.contrib.layers.crossed_column([education, occupation], hash_bucket_size=int(1e4)),
                    tf.contrib.layers.crossed_column([age_buckets, education, occupation], hash_bucket_size=int(1e6)),
                    tf.contrib.layers.crossed_column([native_country, occupation], hash_bucket_size=int(1e4))]

    # embedding_column represents a categorical column as a dense embedding vector.
    # The deep side takes the embedding features and the continuous features.
    deep_columns = [tf.contrib.layers.embedding_column(workclass, dimension=8),
                    tf.contrib.layers.embedding_column(education, dimension=8),
                    tf.contrib.layers.embedding_column(gender, dimension=8),
                    tf.contrib.layers.embedding_column(relationship, dimension=8),
                    tf.contrib.layers.embedding_column(native_country, dimension=8),
                    tf.contrib.layers.embedding_column(occupation, dimension=8),
                    age, education_num, capital_gain, capital_loss, hours_per_week, ]

    if FLAGS.model_type == "wide":
        m = tf.contrib.learn.LinearClassifier(model_dir=model_dir, feature_columns=wide_columns)
    elif FLAGS.model_type == "deep":
        m = tf.contrib.learn.DNNClassifier(model_dir=model_dir, feature_columns=deep_columns, hidden_units=[100, 50])
    else:
        m = tf.contrib.learn.DNNLinearCombinedClassifier(model_dir=model_dir, linear_feature_columns=wide_columns, dnn_feature_columns=deep_columns, dnn_hidden_units=[100, 50])

    return m


def input_fn(df):
    continuous_cols = {k: tf.constant(df[k].values) for k in CONTINUOUS_COLUMNS}
    categorical_cols = {
        k: tf.SparseTensor(indices=[[i, 0] for i in range(df[k].size)],
                           values=df[k].values,
                           dense_shape=[df[k].size, 1])  # `dense_shape` replaced the older `shape` kwarg in TF 1.x
        for k in CATEGORICAL_COLUMNS}
    feature_cols = dict(continuous_cols)
    feature_cols.update(categorical_cols)
    label = tf.constant(df[LABEL_COLUMN].values)

    return feature_cols, label


def train_and_eval():
    train_file_name, test_file_name = maybe_download()
    df_train = pd.read_csv(
        tf.gfile.Open(train_file_name),
        names=COLUMNS,
        skipinitialspace=True,
        engine="python"
    )
    df_test = pd.read_csv(
        tf.gfile.Open(test_file_name),
        names=COLUMNS,
        skipinitialspace=True,
        skiprows=1,
        engine="python"
    )

    # drop Not a number elements
    df_train = df_train.dropna(how='any', axis=0)
    df_test = df_test.dropna(how='any', axis=0)

    # Convert the income_bracket string (">50K" / "<=50K") into a 0/1 label.
    df_train[LABEL_COLUMN] = (
        df_train["income_bracket"].apply(lambda x: ">50K" in x).astype(int)
    )
    df_test[LABEL_COLUMN] = (
        df_test["income_bracket"].apply(lambda x: ">50K" in x).astype(int)
    )

    model_dir = tempfile.mkdtemp() if not FLAGS.model_dir else FLAGS.model_dir
    print("model dir = %s" % model_dir)

    m = build_estimator(model_dir)
    print(FLAGS.train_steps)
    m.fit(input_fn=lambda: input_fn(df_train),
          steps=FLAGS.train_steps)
    results = m.evaluate(input_fn=lambda: input_fn(df_test), steps=1)

    for key in sorted(results):
        print("%s: %s" % (key, results[key]))


def main(_):
    train_and_eval()


if __name__ == "__main__":
    tf.app.run()
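To try the script, save it (e.g. as wide_n_deep.py, a hypothetical file name) and run:

python wide_n_deep.py --model_type=wide_n_deep --train_steps=200

Note that the mlr.cs.umass.edu mirror used in maybe_download may no longer be reachable; the Adult dataset is also hosted at https://archive.ics.uci.edu/ml/machine-learning-databases/adult/, or local copies can be supplied via --train_data and --test_data.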
