
[PyTorch Basics Tutorial 26] Wide & Deep recommendation algorithm (TF 2.0 and torch versions)

Posted by 野猪佩奇996 on 2022/03/27 22:32:10


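Study summary

1. Installing TensorFlow 2.0

PyTorch is the mainstream choice in academia, but industry also uses TensorFlow 2 because it makes model deployment convenient. Compared with 1.0, TensorFlow 2.0 offers a simpler API, centers on Keras, and integrates eager execution.

For installation, follow the official guide: https://www.tensorflow.org/install/pip

Tensor operations:

To initialize a tensor from values, call tf.constant in TensorFlow or torch.tensor in PyTorch and pass in the values.
tf.Variable is another option: it creates a Variable object rather than a Tensor object. tf.GradientTape tracks Variables automatically for gradient computation, whereas a plain tensor is not tracked unless you ask for it (tape.watch in TensorFlow, requires_grad=True in PyTorch).
torch.tensor and torch.Tensor both create Tensor objects, but the former takes concrete values while the latter takes a shape (size) and leaves the values uninitialized, so it is not recommended.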

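2. Loading the dataset

TensorFlow loads the data with tf.keras.datasets.mnist.load_data(), which returns numpy.ndarray objects. PyTorch loads it with torchvision.datasets.MNIST, which yields images that cannot be used directly; set transform=transforms.ToTensor() to convert them to tensors. transforms.Compose() additionally accepts a list of transforms for image conversion, normalization and similar steps.
TensorFlow builds a dataset object with tf.data.Dataset.from_tensor_slices() and preprocesses it by passing a custom preprocess function to .map; PyTorch wraps a dataset in torch.utils.data.DataLoader for batching. After preprocessing, the TensorFlow shapes are image: [b, 28, 28] and label: [b].
In PyTorch the train/test split is chosen at loading time: passing train=False to torchvision.datasets.MNIST (the dataset, not the DataLoader) selects the test set, so test data never enters training. In TensorFlow, load_data() returns both splits at once, and the training/inference distinction is declared on the model side instead.
To load a local dataset in TensorFlow, use train_dataset = get_dataset(path), where get_dataset is the helper defined in section 6.4. To download a file from a URL, use tf.keras.utils.get_file, whose signature is shown below. By default the file at the URL origin is downloaded into the cache directory ~/.keras, placed in the cache subdirectory datasets, and named fname, so a file example.txt ends up at ~/.keras/datasets/example.txt.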

tf.keras.utils.get_file(fname, origin, untar=False, md5_hash=None, file_hash=None, cache_subdir='datasets', hash_algorithm='auto', extract=False, archive_format='auto', cache_dir=None)

  
 

3. Building the model

A TensorFlow model subclasses tf.keras.Model; a PyTorch model subclasses torch.nn.Module.
In a TensorFlow model the forward pass is defined in call(); in PyTorch it is defined in forward().

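import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

learning_rate = 1e-3  # assumed value; the original does not state it


class CNN_model(keras.Model):
    def __init__(self):
        super().__init__()

        # conv -> pool -> relu -> strided conv -> relu -> flatten -> 10 logits
        self.model = keras.Sequential(
            [layers.Conv2D(filters=3, kernel_size=(3, 3), strides=(1, 1), padding="same"),
             layers.MaxPool2D(pool_size=(2, 2)),
             layers.ReLU(),
             layers.Conv2D(6, (3, 3), (2, 2), "same"),
             layers.ReLU(),
             layers.Flatten(),
             layers.Dense(10)]
        )

    def call(self, x):
        # forward pass: return raw logits (no softmax)
        return self.model(x)


model = CNN_model()
model.build(input_shape=(None, 28, 28, 1))
model.summary()
optimizer = tf.optimizers.Adam(learning_rate)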

  
 

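PyTorch requires creating a device, e.g. device = torch.device('cuda:0'), and explicitly moving the network and its data onto that device for computation. TensorFlow, when the GPU build (tensorflow-gpu) is installed, runs on the GPU directly without extra code.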

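4. Training and testing the model

How accuracy is computed:

Feed all validation data through the trained model to obtain predictions.
Compare the predictions with the ground-truth labels.
Accumulate the number of correct predictions and the total number of samples.
Compute accuracy = number of correct predictions / total number of samples.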
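
The four steps above can be sketched in plain Python, independent of either framework. The logits and labels below are made-up values for illustration, and argmax and accuracy are hypothetical helper names, not library functions.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    total_correct = 0
    total_num = 0
    for pred, label in zip(predictions, labels):
        total_correct += int(pred == label)  # count correct predictions
        total_num += 1                       # count all samples
    return total_correct / total_num if total_num else 0.0


def argmax(scores):
    """Index of the largest score, i.e. the predicted class."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical logits for four samples over three classes.
logits = [[2.0, 0.1, -1.0], [0.2, 0.3, 0.1], [0.0, 0.5, 2.5], [1.5, 1.0, 0.2]]
labels = [0, 1, 2, 1]
preds = [argmax(row) for row in logits]
print(accuracy(preds, labels))  # 3 of 4 correct -> 0.75
```

The TensorFlow loop in this section does the same thing batch by batch, accumulating the two counters across the whole test set before dividing.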

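Automatic differentiation: tf.GradientTape() is a recorder for automatic differentiation; variables and computation steps inside its context are recorded automatically. For example, with a variable x and the step y = tf.square(x) recorded inside the tape, y_grad = tape.gradient(y, x) gives the derivative of the tensor y with respect to x.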

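# Assumes model and optimizer from section 3, and that ds_train / ds_test are
# batched tf.data datasets of (image, label) pairs built as in section 2.
epochs = 5  # assumed value; the original does not state it

for epoch in range(epochs):
    for step, (x, y) in enumerate(ds_train):
        x = tf.reshape(x, [-1, 28, 28, 1])
        with tf.GradientTape() as tape:
            logits = model(x)
            # integer labels + raw logits, hence the sparse loss with from_logits=True
            losses = tf.losses.sparse_categorical_crossentropy(y, logits, from_logits=True)
            loss = tf.reduce_mean(losses)

        # differentiate only with respect to the trainable weights
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        if step % 100 == 0:
            print("epoch:{}, step:{} loss:{}".format(epoch, step, loss.numpy()))

            # test accuracy
            total_correct = 0
            total_num = 0

            for x_test, y_test in ds_test:
                x_test = tf.reshape(x_test, [-1, 28, 28, 1])
                y_pred = tf.argmax(model(x_test), axis=1)
                y_pred = tf.cast(y_pred, tf.int32)
                # cast labels too so both sides of == share a dtype
                y_test = tf.cast(y_test, tf.int32)
                correct = tf.reduce_sum(tf.cast(y_pred == y_test, tf.int32))

                total_correct += int(correct)
                total_num += x_test.shape[0]

            accuracy = total_correct / total_num
            print('accuracy: ', accuracy)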

  
 

To match PyTorch's torch.nn.CrossEntropyLoss(), the labels in the TensorFlow version are not one-hot encoded, which is why tf.losses.sparse_categorical_crossentropy() is used to compute the cross-entropy. The output is:

Model: "cnn_model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
sequential_2 (Sequential)    multiple                  3148      
=================================================================
Total params: 3,148
Trainable params: 3,148
Non-trainable params: 0
_________________________________________________________________
epoch:0, step:0 loss:2.328885078430176
accuracy:  0.1409
epoch:0, step:100 loss:0.14413821697235107
accuracy:  0.8648
.......
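
The remark about skipping one-hot encoding can be checked by hand. The following is a minimal sketch in plain Python (not the TensorFlow or PyTorch implementation) showing that cross-entropy computed from an integer label equals cross-entropy computed from the one-hot version of the same label. The logits are made-up values and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of one row of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def sparse_cross_entropy(label, logits):
    """Cross-entropy for an integer class label (no one-hot needed)."""
    return -log_softmax(logits)[label]


def one_hot_cross_entropy(one_hot, logits):
    """Cross-entropy for a one-hot label vector."""
    return -sum(t * lp for t, lp in zip(one_hot, log_softmax(logits)))


logits = [2.0, 0.5, -1.0]   # hypothetical model output for one sample
label = 1                   # integer class index
one_hot = [0.0, 1.0, 0.0]   # the same label, one-hot encoded

sparse_loss = sparse_cross_entropy(label, logits)
dense_loss = one_hot_cross_entropy(one_hot, logits)
print(sparse_loss, dense_loss)  # the two values agree
```

The sparse form simply indexes the log-probability of the true class, which avoids building a mostly-zero label vector per sample.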

  
 

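5. Comparing the two ways of building a model

Summary: for ease of comparison and training, we used the simplest dataset, MNIST, to compare the training process in the two frameworks. TensorFlow 2.0 offers a simpler API than 1.0, centers on Keras, and integrates eager execution. Next, we use TensorFlow to reproduce the Wide & Deep model.

6. The classic Wide & Deep model (TF 2.0 version)

6.0 Background

The Wide & Deep model was proposed by Google's app store team, Google Play, for the scenario of recommending apps to users in Google Play. Its recommendation goal is to recommend apps that the user is likely to enjoy and willing to install.

Question: in the Wide & Deep model specifically, how did the Google Play team choose features for the wide part and the deep part?

Figure: details of the Google Play Wide & Deep model (from Wide & Deep Learning for Recommender Systems)

The figure above adds detail on the Google Play team's model. Starting from the wide features on the left, the wide part uses only the cross of two features ("installed apps" and "impression apps"). In other words, the wide part is meant to memorize rules of the form "because A, therefore B": if the user has installed app A, will they install app B?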

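6.1 Retrieval and Ranking

When a user visits the app store, a request is generated; once it reaches the recommender system, the system returns a list of recommended apps for that user.
In a real recommender system the process is usually split into two stages, the Retrieval and Ranking stages shown in the figure above. Retrieval fetches a set of apps relevant to the user from the database; Ranking scores the retrieved apps, and the list is returned to the user in descending order of score.

Ranking uses finer-grained features, such as:

User features (age, gender, language, ethnicity, etc.)
Contextual features (device, time, etc.)
Impression features (app age, historical statistics of the app, etc.)

Unlike a typical recommender system, Google Play implements recall through retrieval, shrinking the large pool of apps down to a small set of relevant ones (for example, 100). A recommendation model is then built from user features, contextual features, user behavior features and so on to estimate the probability that the user clicks each app; the apps are sorted by that score and the top K are recommended.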
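
The two-stage flow described above can be sketched as follows. This is a toy illustration, not Google Play's system: the catalog, the user record, the category-match retrieval rule and the toy_score function are all made up, with toy_score standing in for the ranking model.

```python
def retrieve(user, catalog, limit=100):
    """Retrieval: cheaply narrow the catalog to apps in the user's categories."""
    candidates = [app for app in catalog if app["category"] in user["interests"]]
    return candidates[:limit]


def rank(user, candidates, score_fn, top_k=3):
    """Ranking: score every candidate and return the top_k names by score."""
    scored = [(score_fn(user, app), app["name"]) for app in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


def toy_score(user, app):
    """Stand-in for the ranking model: popularity plus a small language bonus."""
    bonus = 0.1 if app["language"] == user["language"] else 0.0
    return app["popularity"] + bonus


catalog = [
    {"name": "chess", "category": "game", "language": "en", "popularity": 0.60},
    {"name": "notes", "category": "tool", "language": "en", "popularity": 0.90},
    {"name": "puzzle", "category": "game", "language": "zh", "popularity": 0.65},
    {"name": "radio", "category": "music", "language": "en", "popularity": 0.50},
]
user = {"interests": {"game", "music"}, "language": "en"}

candidates = retrieve(user, catalog)
top_apps = rank(user, candidates, toy_score, top_k=2)
print(top_apps)  # ['chess', 'puzzle']
```

The point of the split is cost: the expensive scoring function only runs on the small candidate set, never on the full catalog.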

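At the output, the Wide & Deep model takes a weighted sum of the outputs of the wide part and the deep part (a logistic regression) and passes it through a sigmoid to produce a probability:
$P(Y=1 \mid \mathbf{x})=\sigma\left(\mathbf{w}_{\text{wide}}^{T}[\mathbf{x}, \phi(\mathbf{x})]+\mathbf{w}_{\text{deep}}^{T} a^{\left(l_{f}\right)}+b\right)$

where $Y$ is the binary class label, $\sigma(\cdot)$ is the sigmoid function, $\phi(\mathbf{x})$ is the cross-product transformation of the original features $\mathbf{x}$, $a^{(l_f)}$ is the final activation of the deep part, $b$ is the bias term, and $\mathbf{w}_{\text{wide}}$, $\mathbf{w}_{\text{deep}}$ are the weights of the wide part and the deep part respectively.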
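The wide and deep parts are optimized differently: the wide part is trained with L1 regularization (which acts as feature selection by driving weights to zero), while the deep part uses an ordinary adaptive gradient method.
Wide & Deep is only an architecture and can be adapted to the business at hand. For example, if some features suit neither the wide nor the deep part and are better handled by FM, the FM output can be concatenated with the outputs of the wide and deep parts.
The wide and deep parts are trained jointly by back-propagating the gradients from the output to both parts simultaneously using mini-batch stochastic optimization. In the paper's experiments, FTRL with L1 regularization is the optimizer for the wide part and AdaGrad is the optimizer for the deep part.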
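
To make the output formula $P(Y=1 \mid \mathbf{x})=\sigma(\mathbf{w}_{\text{wide}}^{T}[\mathbf{x}, \phi(\mathbf{x})]+\mathbf{w}_{\text{deep}}^{T} a^{(l_f)}+b)$ concrete, here is a small numeric sketch in plain Python. All weights, inputs and the single cross pair are made-up values, and the deep activation is given directly rather than computed by an MLP.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_product(x, pairs):
    """phi(x): one binary feature per pair, 1 only if both inputs are 1."""
    return [x[i] * x[j] for i, j in pairs]


def wide_deep_probability(x, pairs, w_wide, deep_activation, w_deep, b):
    """sigmoid(w_wide^T [x, phi(x)] + w_deep^T a + b)."""
    wide_input = list(x) + cross_product(x, pairs)
    wide_logit = sum(w * v for w, v in zip(w_wide, wide_input))
    deep_logit = sum(w * a for w, a in zip(w_deep, deep_activation))
    return sigmoid(wide_logit + deep_logit + b)


# Hypothetical values: 3 binary wide features, one cross (feature 0 AND feature 2),
# and a 2-dimensional final deep activation a^(l_f).
x = [1, 0, 1]
pairs = [(0, 2)]
w_wide = [0.2, -0.4, 0.1, 0.7]   # 3 raw-feature weights + 1 cross-feature weight
deep_activation = [0.5, 1.0]
w_deep = [0.6, -0.3]
b = -1.0

p = wide_deep_probability(x, pairs, w_wide, deep_activation, w_deep, b)
print(round(p, 4))
```

With these numbers the wide logit is about 1.0, the deep logit about 0.0 and the bias -1.0, so the combined logit is close to zero and the predicted probability is close to 0.5.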

6.2 Training methods:

Wide part: FTRL (Follow-the-regularized-leader)
Deep part: AdaGrad
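
The difference between the two optimizers is easiest to see for a single weight. The sketch below writes out the commonly used per-coordinate FTRL-Proximal update and the AdaGrad update in plain Python. It is an illustration under assumed hyperparameters (alpha, beta, l1, l2, lr are arbitrary values), not the code used in the paper or in TensorFlow.

```python
import math


def ftrl_step(w, z, n, g, alpha=0.1, beta=1.0, l1=1.0, l2=0.0):
    """One per-coordinate FTRL-Proximal update; returns the new (w, z, n)."""
    sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / alpha
    z = z + g - sigma * w
    n = n + g * g
    if abs(z) <= l1:
        w = 0.0  # L1 keeps weakly supported weights at exactly zero
    else:
        sign = 1.0 if z > 0 else -1.0
        w = -(z - sign * l1) / ((beta + math.sqrt(n)) / alpha + l2)
    return w, z, n


def adagrad_step(w, acc, g, lr=0.1, eps=1e-8):
    """One per-coordinate AdaGrad update; returns the new (w, acc)."""
    acc = acc + g * g
    w = w - lr * g / (math.sqrt(acc) + eps)
    return w, acc


# A small gradient leaves the FTRL weight at zero; a large one moves it.
small = ftrl_step(0.0, 0.0, 0.0, g=0.5)
large = ftrl_step(0.0, 0.0, 0.0, g=3.0)
# AdaGrad always moves the weight, with a step scaled by past gradients.
dense = adagrad_step(0.0, 0.0, g=3.0)
print(small, large, dense)
```

The thresholding on z is what produces sparse weights, which suits the wide part with its very large number of cross features; AdaGrad's per-coordinate step size suits the dense embeddings and hidden layers of the deep part.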

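6.3 Joint training versus ensemble learning:

In ensemble learning, multiple models are trained independently and their results are combined only at the end.
Joint training combines the wide and deep models and optimizes all parameters at the same time: the two outputs are summed with weights, the gradient is computed from the final loss, and it is back-propagated into both the wide and the deep part, each of which updates its own parameters. In other words, the weight updates of a Wide & Deep model are driven by the combined effect of the wide side and the deep side on the training error. In the paper, the wide part is optimized with Follow-the-regularized-leader (FTRL) with L1 regularization, and the deep part with AdaGrad.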
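
The key property of joint training, one shared error signal updating both parts in the same step, can be shown with a deliberately simplified sketch. Here the deep part is reduced to a single linear layer and both parts use plain SGD instead of FTRL and AdaGrad; the inputs and learning rate are made-up values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def joint_step(w_wide, w_deep, x_wide, x_deep, y, lr=0.1):
    """One SGD step on the log loss of the combined wide + deep logit."""
    logit = (sum(w * v for w, v in zip(w_wide, x_wide))
             + sum(w * v for w, v in zip(w_deep, x_deep)))
    error = sigmoid(logit) - y  # d(log loss)/d(logit), shared by both parts
    new_wide = [w - lr * error * v for w, v in zip(w_wide, x_wide)]
    new_deep = [w - lr * error * v for w, v in zip(w_deep, x_deep)]
    return new_wide, new_deep


# Hypothetical single training example with label y = 1.
w_wide, w_deep = [0.0, 0.0], [0.0, 0.0]
x_wide, x_deep = [1.0, 0.0], [0.5, 2.0]
new_wide, new_deep = joint_step(w_wide, w_deep, x_wide, x_deep, y=1)
print(new_wide, new_deep)
```

In an ensemble, each model would compute its own error from its own prediction; here the error depends on the sum of both logits, so what the wide part has already learned changes how much the deep part is updated, and vice versa.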

6.4 Code

(1) Loading the dataset

Load the MovieLens dataset with get_dataset.

import tensorflow as tf

# load sample as tf dataset
def get_dataset(file_path):
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=12,
        label_name='label',
        na_value="0",
        num_epochs=1,
        ignore_errors=True)
    return dataset


# split as test dataset and training dataset
train_dataset = get_dataset('D:/wide&deep/trainingSamples.csv')
test_dataset = get_dataset('D:/wide&deep/testSamples.csv')

  
 

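(2) Feature processing

Broadly speaking, all features fall into two categories:

1) Categorical features: handled with one-hot encoding

The first category is categorical and ID features (categorical features for short).
In movie recommendation, for example, a movie's genre, ID, tags, director and cast, the IDs of movies a user has watched, the user's gender, location, the current season, time of day (morning, afternoon, evening), the weather and so on cannot be expressed as numeric quantities, so they are all treated as categorical or ID features and encoded one-hot.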

# genre features vocabulary
genre_vocab = ['Film-Noir', 'Action', 'Adventure', 'Horror', 'Romance', 'War', 'Comedy', 'Western', 'Documentary',
               'Sci-Fi', 'Drama', 'Thriller',
               'Crime', 'Fantasy', 'Animation', 'IMAX', 'Mystery', 'Children', 'Musical']

GENRE_FEATURES = {
    'userGenre1': genre_vocab,
    'userGenre2': genre_vocab,
    'userGenre3': genre_vocab,
    'userGenre4': genre_vocab,
    'userGenre5': genre_vocab,
    'movieGenre1': genre_vocab,
    'movieGenre2': genre_vocab,
    'movieGenre3': genre_vocab
}

# all categorical features

# convert the genre categorical features to one-hot features, using the vocabulary
categorical_columns = []
for feature, vocab in GENRE_FEATURES.items():
    cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
    emb_col = tf.feature_column.embedding_column(cat_col, 10)
    categorical_columns.append(emb_col)
    
# convert user id and movie id to one-hot features; no vocabulary is needed here
# movie id embedding feature
movie_col = tf.feature_column.categorical_column_with_identity(key='movieId', num_buckets=1001)
movie_emb_col = tf.feature_column.embedding_column(movie_col, 10)
categorical_columns.append(movie_emb_col)

# user id embedding feature
user_col = tf.feature_column.categorical_column_with_identity(key='userId', num_buckets=30001)
# add an embedding layer to turn the resulting one-hot vector into a dense vector
user_emb_col = tf.feature_column.embedding_column(user_col, 10)
categorical_columns.append(user_emb_col)
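
What the one-hot and embedding steps compute can be sketched without TensorFlow. The snippet below uses a shortened genre vocabulary, a randomly initialized embedding table, and an embedding size of 3 instead of 10; these are illustrative choices, and one_hot and embed are hypothetical helpers rather than the feature_column implementation.

```python
import random

# Shortened vocabulary for illustration; the tutorial's genre_vocab has 19 entries.
genre_vocab = ['Film-Noir', 'Action', 'Adventure', 'Horror', 'Romance']


def one_hot(value, vocab):
    """One-hot encode a category; in this sketch unknown values give all zeros."""
    return [1.0 if value == item else 0.0 for item in vocab]


def embed(one_hot_vec, table):
    """Multiply a one-hot row vector by the embedding table (vocab_size x dim)."""
    dim = len(table[0])
    return [sum(one_hot_vec[i] * table[i][d] for i in range(len(table)))
            for d in range(dim)]


rng = random.Random(0)
embedding_dim = 3  # assumed small size to keep the printout short
table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in genre_vocab]

vec = one_hot('Horror', genre_vocab)
dense = embed(vec, table)
print(vec)
print(dense == table[genre_vocab.index('Horror')])
```

Multiplying a one-hot vector by the table just selects one row, which is why an embedding layer is implemented as a lookup and why its rows are the trainable dense representation of each category.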

  
 

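2) Numerical features: normalization and bucketing

The second category is numerical features: any feature that can be expressed directly as a number, such as a user's age or income, or a movie's running time, click count, or click-through rate. Normalization addresses features whose scales differ too much; bucketing addresses features with skewed distributions.
The classic YouTube deep recommendation model has some interesting treatments. For the two features time since last watch and previous impressions, the YouTube model normalizes each and then turns it into three features (the part in the red box in Figure 6): the original value x, its square x^2, and its square root.

Bucketing: sort the samples by the value of a feature (from high to low), find the quantiles according to the number of buckets, assign each sample to its bucket, and use the bucket ID as the feature value.

Squaring and taking the square root both change the distribution of the feature value. Like bucketing, these operations aim to reshape the feature's distribution so that the model can better learn the useful information it contains. Since we cannot tell from experience which treatment works best, we simply feed all of them to the model and let the model choose.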
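
The three treatments discussed above (normalization, the x / x^2 / sqrt(x) expansion, and quantile bucketing) can be sketched in plain Python. The rating counts are made-up values, min-max scaling is one assumed choice of normalization, and the helper names are hypothetical.

```python
import bisect
import math


def min_max_normalize(values):
    """Scale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def power_features(x):
    """Return the three features x, x^2 and sqrt(x) for a normalized x."""
    return [x, x * x, math.sqrt(x)]


def quantile_boundaries(values, num_buckets):
    """Bucket boundaries placed at evenly spaced quantiles of the sample."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // num_buckets] for k in range(1, num_buckets)]


def bucket_id(value, boundaries):
    """Index of the bucket a value falls into."""
    return bisect.bisect_right(boundaries, value)


# Hypothetical rating counts with a long tail.
rating_counts = [1, 2, 2, 3, 5, 8, 13, 40, 200, 1000]
normalized = min_max_normalize(rating_counts)
boundaries = quantile_boundaries(rating_counts, num_buckets=5)
bucket_ids = [bucket_id(v, boundaries) for v in rating_counts]
print(boundaries)   # [2, 5, 13, 200]
print(bucket_ids)   # [0, 1, 1, 1, 2, 2, 3, 3, 4, 4]
print(power_features(normalized[-1]))
```

After min-max scaling almost every value sits near 0 because of the single value 1000, while the bucket IDs spread the samples far more evenly; this is the skew problem that bucketing is meant to address.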

# all numerical features
numerical_columns = [tf.feature_column.numeric_column('releaseYear'),
                     tf.feature_column.numeric_column('movieRatingCount'),
                     tf.feature_column.numeric_column('movieAvgRating'),
                     tf.feature_column.numeric_column('movieRatingStddev'),
                     tf.feature_column.numeric_column('userRatingCount'),
                     tf.feature_column.numeric_column('userAvgRating'),
                     tf.feature_column.numeric_column('userRatingStddev')]

# cross feature between current movie and user historical movie
rated_movie = tf.feature_column.categorical_column_with_identity(key='userRatedMovie1', num_buckets=1001)
crossed_feature = tf.feature_column.indicator_column(tf.feature_column.crossed_column([movie_col, rated_movie], 10000))

# define input for keras model
inputs = {
    'movieAvgRating': tf.keras.layers.Input(name='movieAvgRating', shape=(), dtype='float32'),
    'movieRatingStddev': tf.keras.layers.Input(name='movieRatingStddev', shape=(), dtype='float32'),
    'movieRatingCount': tf.keras.layers.Input(name='movieRatingCount', shape=(), dtype='int32'),
    'userAvgRating': tf.keras.layers.Input(name='userAvgRating', shape=(), dtype='float32'),
    'userRatingStddev': tf.keras.layers.Input(name='userRatingStddev', shape=(), dtype='float32'),
    'userRatingCount': tf.keras.layers.Input(name='userRatingCount', shape=(), dtype='int32'),
    'releaseYear': tf.keras.layers.Input(name='releaseYear', shape=(), dtype='int32'),

    'movieId': tf.keras.layers.Input(name='movieId', shape=(), dtype='int32'),
    'userId': tf.keras.layers.Input(name='userId', shape=(), dtype='int32'),
    'userRatedMovie1': tf.keras.layers.Input(name='userRatedMovie1', shape=(), dtype='int32'),

    'userGenre1': tf.keras.layers.Input(name='userGenre1', shape=(), dtype='string'),
    'userGenre2': tf.keras.layers.Input(name='userGenre2', shape=(), dtype='string'),
    'userGenre3': tf.keras.layers.Input(name='userGenre3', shape=(), dtype='string'),
    'userGenre4': tf.keras.layers.Input(name='userGenre4', shape=(), dtype='string'),
    'userGenre5': tf.keras.layers.Input(name='userGenre5', shape=(), dtype='string'),
    'movieGenre1': tf.keras.layers.Input(name='movieGenre1', shape=(), dtype='string'),
    'movieGenre2': tf.keras.layers.Input(name='movieGenre2', shape=(), dtype='string'),
    'movieGenre3': tf.keras.layers.Input(name='movieGenre3', shape=(), dtype='string'),
}

  
 

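(3) The model

Deep part: embedding + MLP, which gives the model strong generalization ability. The features are the same as in the previous task: an input layer followed by two 128-dimensional hidden layers, taking the categorical embedding vectors and the numerical features as input.
Wide part: the input layer is connected directly to the output layer with no processing in between, which gives the model strong memorization ability. Note the feature used by the wide part, crossed_feature. With the MovieLens movie dataset, we build a crossed feature from [a movie the user has rated positively] and [the movie currently being rated]. The crossing code is:

# convert movie id to a one-hot feature (movie id embedding feature)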
movie_col = tf.feature_column.categorical_column_with_identity(key='movieId', num_buckets=1001)
rated_movie = tf.feature_column.categorical_column_with_identity(key='userRatedMovie1', 
                                                                 num_buckets=1001)
# cross feature between current movie and user historical movie
crossed_feature = tf.feature_column.indicator_column(tf.feature_column.crossed_column([movie_col, rated_movie], 
                                                                                      10000))
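
Conceptually, the crossed column maps each (movieId, userRatedMovie1) pair into one of 10000 hash buckets, and the indicator column turns that bucket into a multi-hot vector for the wide part. The sketch below illustrates the idea in plain Python; it uses MD5 as a stand-in hash, so the bucket numbers will not match TensorFlow's, and the two IDs are made-up values.

```python
import hashlib


def crossed_bucket(movie_id, rated_movie_id, hash_bucket_size=10000):
    """Map a (movieId, userRatedMovie1) pair to one of hash_bucket_size buckets."""
    key = f"{movie_id}_X_{rated_movie_id}".encode("utf-8")
    digest = hashlib.md5(key).hexdigest()  # stable stand-in hash, not TensorFlow's
    return int(digest, 16) % hash_bucket_size


def multi_hot(active_buckets, size):
    """Indicator vector with ones at the active bucket positions."""
    vec = [0] * size
    for bucket in active_buckets:
        vec[bucket] = 1
    return vec


bucket = crossed_bucket(movie_id=17, rated_movie_id=296)
wide_input = multi_hot([bucket], 10000)
print(bucket, sum(wide_input))
```

With 1001 x 1001 possible ID pairs and only 10000 buckets, different pairs will share a bucket; hashing trades some collisions for a wide input of fixed, manageable size.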

  
 

# wide and deep model architecture
# deep part for all input features
deep = tf.keras.layers.DenseFeatures(numerical_columns + categorical_columns)(inputs)
deep = tf.keras.layers.Dense(128, activation='relu')(deep)
deep = tf.keras.layers.Dense(128, activation='relu')(deep)
# wide part for cross feature
wide = tf.keras.layers.DenseFeatures(crossed_feature)(inputs)
both = tf.keras.layers.concatenate([deep, wide])
output_layer = tf.keras.layers.Dense(1, activation='sigmoid')(both)
model = tf.keras.Model(inputs, output_layer)

# compile the model, set loss function, optimizer and evaluation metrics
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy', tf.keras.metrics.AUC(curve='ROC'), tf.keras.metrics.AUC(curve='PR')])

# train the model
model.fit(train_dataset, epochs=5)

# evaluate the model
test_loss, test_accuracy, test_roc_auc, test_pr_auc = model.evaluate(test_dataset)
print('\n\nTest Loss {}, Test Accuracy {}, Test ROC AUC {}, Test PR AUC {}'.format(test_loss, test_accuracy,
                                                                                   test_roc_auc, test_pr_auc))

# print some predict results
predictions = model.predict(test_dataset)
for prediction, goodRating in zip(predictions[:12], list(test_dataset)[0][1][:12]):
    print("Predicted good rating: {:.2%}".format(prediction[0]),
          " | Actual rating label: ",
          ("Good Rating" if bool(goodRating) else "Bad Rating"))

  
 

(4) Training results

2022-03-26 20:32:48.370872: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Epoch 1/5
D:\anaconda1\envs\tensorflow\lib\site-packages\tensorflow\python\keras\engine\functional.py:540: UserWarning: Input dict contained keys ['rating', 'timestamp', 'userRatedMovie2', 'userRatedMovie3', 'userRatedMovie4', 'userRatedMovie5', 'userAvgReleaseYear', 'userReleaseYearStddev'] which did not match any model input. They will be ignored by the model.
  warnings.warn(
7403/7403 [==============================] - 48s 6ms/step - loss: 0.7353 - accuracy: 0.6156 - auc: 0.6373 - auc_1: 0.6713
Epoch 2/5
7403/7403 [==============================] - 48s 6ms/step - loss: 0.5968 - accuracy: 0.6854 - auc: 0.7384 - auc_1: 0.7648
Epoch 3/5
7403/7403 [==============================] - 45s 6ms/step - loss: 0.5417 - accuracy: 0.7283 - auc: 0.7948 - auc_1: 0.8157
Epoch 4/5
7403/7403 [==============================] - 40s 5ms/step - loss: 0.5024 - accuracy: 0.7552 - auc: 0.8288 - auc_1: 0.8492
Epoch 5/5
7403/7403 [==============================] - 40s 5ms/step - loss: 0.4774 - accuracy: 0.7708 - auc: 0.8482 - auc_1: 0.8704
1870/1870 [==============================] - 4s 2ms/step - loss: 0.6218 - accuracy: 0.6893 - auc: 0.7536 - auc_1: 0.7819


Test Loss 0.6217975616455078, Test Accuracy 0.6893048286437988, Test ROC AUC 0.7535525560379028, Test PR AUC 0.7818983793258667
Predicted good rating: 82.82%  | Actual rating label:  Good Rating
Predicted good rating: 49.24%  | Actual rating label:  Good Rating
Predicted good rating: 77.08%  | Actual rating label:  Bad Rating
Predicted good rating: 74.62%  | Actual rating label:  Good Rating
Predicted good rating: 39.94%  | Actual rating label:  Bad Rating
Predicted good rating: 71.07%  | Actual rating label:  Good Rating
Predicted good rating: 65.24%  | Actual rating label:  Good Rating
Predicted good rating: 4.88%  | Actual rating label:  Good Rating
Predicted good rating: 94.61%  | Actual rating label:  Bad Rating
Predicted good rating: 39.97%  | Actual rating label:  Bad Rating
Predicted good rating: 75.85%  | Actual rating label:  Bad Rating
Predicted good rating: 30.80%  | Actual rating label:  Good Rating

  
 


Source: andyguo.blog.csdn.net. Author: 山顶夕景. Copyright belongs to the original author; please contact the author before reposting.

Original link: andyguo.blog.csdn.net/article/details/123744978
