【Competition Baseline】"深水云脑" Residential Community Secondary Water Supply Demand Forecasting: a DL Baseline

v1.1 Notes
  1. Running this notebook requires the competition data. Please register on the competition website, download the data, and place it under the corresponding directory ('./work/data/')!

  2. Leaderboard ranking at the time (updated):
    (figure: leaderboard ranking)

  3. To reach the score above, you can combine this with the method from the other baseline.

I previously wrote 【赛事基线】"深水云脑"水质净化厂工艺控制-曝气量预测Baseline之DL (a DL baseline for the aeration-volume prediction task). The "深水云脑" series also has another competition, 《居民小区二次供水需求预测》 (residential community secondary water supply demand forecasting), which is likewise a time-series problem, so let's strike while the iron is hot and do a baseline for it as well ~

This post is roughly divided into:

  • Problem analysis
  • Baseline code
  • Result analysis

Here we go ~~~

Problem Analysis

1. Task and Data

Quoting the task description:

This task mainly uses the historical readings of residential communities' smart master water meters and of the flow meters behind the secondary-supply pumps, combined with internet data such as weather and epidemic data, for regression and time-series modeling, in order to build a water-demand forecasting model for the residential communities in this area. Using the historical water-usage data and sensor data of multiple residential communities provided by the organizer, predict the hourly water demand of each community within specific periods, to guide actual water-supply operations.

Let the data do the talking ~

Note: the dataset is not provided here. To run this notebook, register for the competition first, then put the corresponding data files into the directory above!!!

!pip install --user -q -r requirements.txt
import os
import numpy as np
import pandas as pd
import time
import functools

from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_log_error as msle
from sklearn.model_selection import StratifiedKFold, KFold
import matplotlib.pyplot as plt
%matplotlib inline
DATA_PATH = './work/data/'
df_daily = pd.read_csv(DATA_PATH + 'daily_dataset.csv')
df_min = pd.read_csv(DATA_PATH + 'per5min_dataset.csv')
df_hour = pd.read_csv(DATA_PATH + 'hourly_dataset.csv')
df_test = pd.read_csv(DATA_PATH + 'test_public.csv')
df_sub = pd.read_csv(DATA_PATH + 'sample_submission.csv')
df_weather = pd.read_csv(DATA_PATH + 'weather.csv')
df_epidemic = pd.read_csv(DATA_PATH + 'epidemic.csv')
df_hour.head()
| | time | flow_1 | flow_2 | flow_3 | flow_4 | flow_5 | flow_6 | flow_7 | flow_8 | flow_9 | ... | flow_12 | flow_13 | flow_14 | flow_15 | flow_16 | flow_17 | flow_18 | flow_19 | flow_20 | train or test |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2022-01-01 01:00:00 | 29.7 | 14.6 | 54.7 | 40.1 | 3.0 | 49.7 | 10.9 | 1.1 | 5.0 | ... | 2.914 | 1.7 | 3.2 | 1.3 | 3.5 | 6.8 | NaN | 1.806 | 1.4 | train |
| 1 | 2022-01-01 02:00:00 | 21.9 | 9.0 | 38.0 | 27.7 | 2.4 | 30.2 | 6.4 | 0.4 | 2.6 | ... | 1.108 | 1.3 | 2.2 | 0.8 | 2.3 | 4.5 | NaN | 3.847 | 0.8 | train |
| 2 | 2022-01-01 03:00:00 | 16.9 | 4.5 | 28.9 | 22.9 | 1.3 | 19.7 | 3.8 | 0.5 | 1.4 | ... | 0.772 | 0.6 | 1.5 | 0.6 | 1.1 | 2.4 | NaN | NaN | 0.5 | train |
| 3 | 2022-01-01 04:00:00 | 14.3 | 3.2 | 25.5 | 20.0 | 1.5 | 15.4 | 2.7 | 0.4 | 1.2 | ... | 0.414 | 0.2 | 1.2 | 0.7 | 0.8 | 1.8 | NaN | NaN | 0.2 | train |
| 4 | 2022-01-01 05:00:00 | 14.9 | 3.5 | 26.4 | 20.6 | 1.2 | 17.5 | 2.2 | 0.5 | 1.2 | ... | 0.279 | 0.8 | 1.1 | 0.4 | 0.9 | 1.9 | NaN | NaN | 0.3 | train |

5 rows × 22 columns

df_hour.tail()
| | time | flow_1 | flow_2 | ... | flow_19 | flow_20 | train or test |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5731 | 2022-08-27 20:00:00 | NaN | NaN | ... | NaN | NaN | test4 |
| 5732 | 2022-08-27 21:00:00 | NaN | NaN | ... | NaN | NaN | test4 |
| 5733 | 2022-08-27 22:00:00 | NaN | NaN | ... | NaN | NaN | test4 |
| 5734 | 2022-08-27 23:00:00 | NaN | NaN | ... | NaN | NaN | test4 |
| 5735 | 2022-08-28 00:00:00 | NaN | NaN | ... | NaN | NaN | test4 |

5 rows × 22 columns

df_hour.describe()
| column | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| flow_1 | 4980 | 44.415944 | 57.289309 | 1.000000 | 28.400000 | 43.500000 | 55.600000 | 3797.400000 |
| flow_2 | 5056 | 20.26337 | 12.21846 | 1.80000 | 11.60000 | 19.60000 | 26.90000 | 160.60000 |
| flow_3 | 5039 | 74.641338 | 41.722874 | 0.000000 | 55.400000 | 74.100000 | 92.400000 | 2048.600000 |
| flow_4 | 5039 | 47.05892 | 29.22040 | -0.10000 | 25.80000 | 47.60000 | 63.20000 | 1308.20000 |
| flow_5 | 4979 | 5.578490 | 7.380787 | 0.000000 | 3.000000 | 5.200000 | 7.600000 | 475.300000 |
| flow_6 | 5039 | 85.778468 | 55.072447 | 0.100000 | 54.400000 | 86.300000 | 113.150000 | 2458.500000 |
| flow_7 | 4982 | 15.434424 | 12.820762 | 1.200000 | 7.300000 | 14.050000 | 21.200000 | 414.300000 |
| flow_8 | 4959 | 2.317302 | 2.345089 | -32.300000 | 1.100000 | 2.100000 | 3.200000 | 62.600000 |
| flow_9 | 5061 | 6.997629 | 4.476901 | 0.500000 | 3.500000 | 6.500000 | 9.700000 | 26.000000 |
| flow_10 | 4084 | 3.476107 | 5.086824 | 0.056000 | 1.346000 | 2.684000 | 4.169250 | 172.339000 |
| flow_11 | 4924 | 5.881519 | 6.947821 | -61.500000 | 2.800000 | 5.400000 | 7.600000 | 183.000000 |
| flow_12 | 4096 | 4.201093 | 8.208345 | -0.013000 | 1.776000 | 3.188000 | 5.008000 | 376.938000 |
| flow_13 | 4973 | 3.149065 | 3.051486 | -32.000000 | 1.800000 | 2.900000 | 4.100000 | 91.60000 |
| flow_14 | 4866 | 3.150164 | 3.394400 | -30.300000 | 1.400000 | 2.700000 | 4.100000 | 88.90000 |
| flow_15 | 4957 | 2.163244 | 2.292557 | -27.500000 | 1.100000 | 1.900000 | 2.900000 | 61.900000 |
| flow_16 | 4965 | 5.289406 | 5.729534 | -68.800000 | 2.600000 | 4.800000 | 7.100000 | 152.700000 |
| flow_17 | 4954 | 9.285204 | 9.825580 | -121.400000 | 4.600000 | 8.600000 | 12.600000 | 265.700000 |
| flow_18 | 4272 | 7.915218 | 9.928315 | -0.013000 | 3.648000 | 6.440000 | 9.512000 | 364.062000 |
| flow_19 | 4088 | 3.736578 | 5.317441 | -0.086000 | 1.597500 | 2.805000 | 4.480000 | 172.74800 |
| flow_20 | 5061 | 2.548390 | 1.552966 | 0.000000 | 1.500000 | 2.500000 | 3.400000 | 12.200000 |
figure=plt.figure(figsize=(16,3))

ax1=plt.subplot(141)
plt.plot(df_hour['flow_1'])
ax2=plt.subplot(142)
plt.plot(df_hour['flow_2'])
ax3=plt.subplot(143)
plt.plot(df_hour['flow_3'])
ax4=plt.subplot(144)
plt.plot(df_hour['flow_4'])

plt.show()


(figure: hourly flow curves for flow_1 – flow_4)

df_test
| | time | flow_1 | flow_2 | ... | flow_19 | flow_20 | train or test |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2022-05-01 01:00:00 | NaN | NaN | ... | NaN | NaN | test1 |
| 1 | 2022-05-01 02:00:00 | NaN | NaN | ... | NaN | NaN | test1 |
| 2 | 2022-05-01 03:00:00 | NaN | NaN | ... | NaN | NaN | test1 |
| 3 | 2022-05-01 04:00:00 | NaN | NaN | ... | NaN | NaN | test1 |
| 4 | 2022-05-01 05:00:00 | NaN | NaN | ... | NaN | NaN | test1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 667 | 2022-08-27 20:00:00 | NaN | NaN | ... | NaN | NaN | test4 |
| 668 | 2022-08-27 21:00:00 | NaN | NaN | ... | NaN | NaN | test4 |
| 669 | 2022-08-27 22:00:00 | NaN | NaN | ... | NaN | NaN | test4 |
| 670 | 2022-08-27 23:00:00 | NaN | NaN | ... | NaN | NaN | test4 |
| 671 | 2022-08-28 00:00:00 | NaN | NaN | ... | NaN | NaN | test4 |

672 rows × 22 columns

df_test.groupby('train or test')['time'].count()
train or test
test1    168
test2    168
test3    168
test4    168
Name: time, dtype: int64
SEQ_LEN = 168
# Based on the open-source project https://github.com/lhrgo/Competition-code/blob/main/baseline.ipynb
test_list1 = df_test.groupby('train or test')['time'].first().reset_index()
test_list1 = test_list1['time'].values.tolist()
test_list2 = df_test.groupby('train or test')['time'].last().reset_index()
test_list2 = test_list2['time'].values.tolist()
test_list1.extend(test_list2)
test_list1.sort()
test_list1
['2022-05-01 01:00:00',
 '2022-05-08 00:00:00',
 '2022-06-01 01:00:00',
 '2022-06-08 00:00:00',
 '2022-07-21 01:00:00',
 '2022-07-28 00:00:00',
 '2022-08-21 01:00:00',
 '2022-08-28 00:00:00']

From the test data (df_test) we can see that this prediction task works at an hourly granularity. For this baseline, to keep the analysis simple, training likewise uses only the hourly data (df_hour).

Comparing with 【赛事基线】"深水云脑"水质净化厂工艺控制-曝气量预测Baseline之DL, the biggest difference between the two tasks is that one predicts a companion control quantity of the time series (the aeration task predicts a column), while the other predicts the time series itself (this demand-forecasting task predicts the rows).

What makes this competition more unusual is that the whole time series is split into four large segments, and each prediction window may only be predicted using data from before that window.

Quoting the explanation from the competition site:

(figure: train/test segmentation rules from the competition site)

Example rules:

  • Training set 1 may be used to predict test set 1.
  • Training sets 1, 2, and 3 may be used to predict test set 3.
  • Semi-supervised learning is allowed, e.g. using training set 1, test set 1, and training set 2 to predict test set 2.
  • Using training set 4 to predict test sets 1, 2, or 3 is forbidden.

Looking at the data concretely:

  1. There are 5736 records in total (hourly).
  2. 672 of them belong to the test set, split into four segments, each covering 7 days (168 records):
    • 2022-05-01 01:00:00 ~ 2022-05-08 00:00:00
    • 2022-06-01 01:00:00 ~ 2022-06-08 00:00:00
    • 2022-07-21 01:00:00 ~ 2022-07-28 00:00:00
    • 2022-08-21 01:00:00 ~ 2022-08-28 00:00:00
  3. The data features are flow_1 ~ flow_20, 20 in total, which are also the fields to be predicted.
  4. The train or test column distinguishes training records from test records.
  5. This is a numeric regression problem on time series.
  6. The data contains anomalies: NaN values, negative values, and abnormally large values (a cleanup sketch follows this list).
  7. The competition additionally provides daily data, 5-minute data, weather data, and epidemic data, which can be used later for feature construction.
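
As a rough illustration of point 6, here is a minimal cleanup sketch (the clipping bound and the interpolation choice are assumptions for illustration, not part of the official baseline; the baseline below actually handles anomalies differently, via the Trans class):

def clean_flows(df, flow_cols):
    df = df.copy()
    for c in flow_cols:
        s = df[c].clip(lower=0)                        # flows cannot be negative
        s = s.clip(upper=s.quantile(0.99))             # cap abnormally large values (assumed: 99th percentile)
        df[c] = s.interpolate(limit_direction='both')  # fill NaN gaps by interpolation
    return df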

2. Modeling Analysis

The published baselines so far basically all use lightgbm; here we instead try LSTM and Transformer models on this time-series problem.

Before modeling we need to construct the data. The figure below shows how to build a data structure suitable for LSTM / Transformer:

(figure: sliding-window construction of training samples)

We take the test-set length T = 168 as the time span of each constructed sample: each window of length T is an input X, the immediately following window of length T is the target Y, and samples are generated by rolling forward with a stride of s = 1 hour.

Each time step t of X holds m feature values, e.g. flow_1, flow_2, ..., plus constructed features such as day, hour, and so on.

Each time step t of Y holds 20 target values, corresponding to flow_1 ~ flow_20.

After constructing the data this way, we just pick the corresponding training samples out of the sequence according to the time boundaries of the 4 test segments.

The final test data then consists of only 4 samples: for each test window, the single sample whose X covers the T hours immediately before that window.
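
A quick sanity check on this construction: with N hourly rows and window length T, stride-1 rolling yields N − 2T + 1 (X, Y) pairs, which matches the shapes printed further below:

N, T = 5736, 168      # total hourly records and the test-window length
print(N - 2 * T + 1)  # 5401 -> the number of (X, Y) pairs, cf. data_x.shape below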

With the data constructed, the corresponding model structure is shown below:

(figure: model structure)

From the data construction and model structure above: to solve this problem with lightgbm, the number of models to build is

4 × 20 × k = 80k

where k is the number of folds in k-fold cross-validation, i.e. at least 80 models are needed! (Using the currently published baselines as the example; if flow_n were also encoded as a feature, the number of models could be reduced considerably.)

And how many individual values need to be predicted?

4 × 168 × 20 × k = 13440k

With the LSTM / Transformer approach used here, the total number of models is just

4 × k

Each test window needs only one model; only 4 test-set predictions are made in total, i.e. only 4 X samples ever need to be predicted!
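
Plugging in k = 5 (the K_FOLD used later in this notebook) makes the gap concrete:

k = 5
print(4 * 20 * k)        # 400: models needed by the per-column lightgbm scheme
print(4 * 168 * 20 * k)  # 67200: individual values predicted by that scheme
print(4 * k)             # 20: models needed by the sequence models here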

OK, this is not to claim that one method is better than the other; in the end the score decides. This is just a cleaner and more interesting alternative ~~~

That covers the data construction and the model structure; see the code below for the concrete implementation ~

Baseline Code

COLUMNS_Y = ['flow_{}'.format(i) for i in range(1, 21)]
COLUMNS_X = COLUMNS_Y + ['day', 'hour', 'dayofweek']
COLUMNS_X, COLUMNS_Y
(['flow_1',
  'flow_2',
  'flow_3',
  'flow_4',
  'flow_5',
  'flow_6',
  'flow_7',
  'flow_8',
  'flow_9',
  'flow_10',
  'flow_11',
  'flow_12',
  'flow_13',
  'flow_14',
  'flow_15',
  'flow_16',
  'flow_17',
  'flow_18',
  'flow_19',
  'flow_20',
  'day',
  'hour',
  'dayofweek'],
 ['flow_1',
  'flow_2',
  'flow_3',
  'flow_4',
  'flow_5',
  'flow_6',
  'flow_7',
  'flow_8',
  'flow_9',
  'flow_10',
  'flow_11',
  'flow_12',
  'flow_13',
  'flow_14',
  'flow_15',
  'flow_16',
  'flow_17',
  'flow_18',
  'flow_19',
  'flow_20'])
def add_time_feat(data):
    data['time'] = pd.to_datetime(data['time'])
    data['day'] = data['time'].dt.day
    data['hour'] = data['time'].dt.hour
    data['minute'] = data['time'].dt.minute
    data['dayofweek'] = data['time'].dt.dayofweek
    return data.sort_values('time').reset_index(drop=True)

def add_other_feat(data, columns):
    # row-wise statistics across the flow columns; axis=1 is needed here,
    # otherwise pandas aggregates over rows and the assignment broadcasts NaN
    data['flow_sum'] = data[columns].sum(axis=1)
    data['flow_median'] = data[columns].median(axis=1)
    data['flow_mean'] = data[columns].mean(axis=1)
    return data

df_hour = add_time_feat(df_hour)
df_hour.head()
| | time | flow_1 | flow_2 | flow_3 | flow_4 | flow_5 | flow_6 | flow_7 | flow_8 | flow_9 | ... | flow_16 | flow_17 | flow_18 | flow_19 | flow_20 | train or test | day | hour | minute | dayofweek |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2022-01-01 01:00:00 | 29.7 | 14.6 | 54.7 | 40.1 | 3.0 | 49.7 | 10.9 | 1.1 | 5.0 | ... | 3.5 | 6.8 | NaN | 1.806 | 1.4 | train | 1 | 1 | 0 | 5 |
| 1 | 2022-01-01 02:00:00 | 21.9 | 9.0 | 38.0 | 27.7 | 2.4 | 30.2 | 6.4 | 0.4 | 2.6 | ... | 2.3 | 4.5 | NaN | 3.847 | 0.8 | train | 1 | 2 | 0 | 5 |
| 2 | 2022-01-01 03:00:00 | 16.9 | 4.5 | 28.9 | 22.9 | 1.3 | 19.7 | 3.8 | 0.5 | 1.4 | ... | 1.1 | 2.4 | NaN | NaN | 0.5 | train | 1 | 3 | 0 | 5 |
| 3 | 2022-01-01 04:00:00 | 14.3 | 3.2 | 25.5 | 20.0 | 1.5 | 15.4 | 2.7 | 0.4 | 1.2 | ... | 0.8 | 1.8 | NaN | NaN | 0.2 | train | 1 | 4 | 0 | 5 |
| 4 | 2022-01-01 05:00:00 | 14.9 | 3.5 | 26.4 | 20.6 | 1.2 | 17.5 | 2.2 | 0.5 | 1.2 | ... | 0.9 | 1.9 | NaN | NaN | 0.3 | train | 1 | 5 | 0 | 5 |

5 rows × 26 columns

class Trans:
    def __init__(self, data, name):
        self.min = max(0, np.percentile(data, 1))
        self.max = np.percentile(data, 99)
        self.base = self.max-self.min

    def transform(self, data, scale=True):
        _data = np.clip(data, self.min, self.max)
        if not scale:
            return _data
        return (_data-self.min)/self.base

class TransUtil:
    def __init__(self, data, exclude_cols=None):
        self.columns = data.columns
        self.exclude_cols = exclude_cols
        self.trans = {}
        for c in self.columns:
            if data[c].dtype not in [int, float]:
                print('column "{}" not init trans...'.format(c))
                continue

            if exclude_cols is None or (exclude_cols is not None and c not in exclude_cols):
                print('init trans column...', c)
                self.trans[c] = Trans(data[c].fillna(method='backfill').fillna(method='ffill'), c)

    def transform(self, data, col_name, scale=True):
        if self.exclude_cols is not None and col_name in self.exclude_cols:
            return data

        for t in self.trans:
            if t.startswith(col_name):
                return self.trans[t].transform(data, scale=scale)
        
        return data
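
A quick illustration of what Trans does (toy numbers, purely illustrative): values are clipped to the [1st, 99th]-percentile range and then min-max scaled onto [0, 1].

s = pd.Series([0.0, 1.0, 2.0, 3.0, 100.0])  # toy series with one large outlier
t = Trans(s, 'toy')
print(t.transform(s))               # outlier clipped to the 99th percentile, then scaled to [0, 1]
print(t.transform(s, scale=False))  # clipped only, kept in original units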
trans_util = TransUtil(df_hour, exclude_cols=None) # data normalization
column "time" not init trans...
init trans column... flow_1
init trans column... flow_2
init trans column... flow_3
init trans column... flow_4
init trans column... flow_5
init trans column... flow_6
init trans column... flow_7
init trans column... flow_8
init trans column... flow_9
init trans column... flow_10
init trans column... flow_11
init trans column... flow_12
init trans column... flow_13
init trans column... flow_14
init trans column... flow_15
init trans column... flow_16
init trans column... flow_17
init trans column... flow_18
init trans column... flow_19
init trans column... flow_20
column "train or test" not init trans...
init trans column... day
init trans column... hour
init trans column... minute
init trans column... dayofweek
def generate_xy_pair(data, seq_len, trans_util, columns_x, columns_y):
    data_x = pd.DataFrame()
    for c in columns_x:
        data_x[c] = trans_util.transform(data[c].fillna(data[c].median()), c)

    data_y = pd.DataFrame()
    for c in columns_y:
        data_y[c] = trans_util.transform(data[c].fillna(data[c].median()), c, scale=False)

    data_x = data_x.values
    data_y = data_y.values
    
    print(data_x.shape, data_y.shape)

    d_x = []
    d_y = []
    for i in range(len(data_x)-seq_len*2+1):
        _x = data_x[i:i+seq_len]
        _y = data_y[i+seq_len:i+seq_len+seq_len]

        assert len(_x) == len(_y) == seq_len, (_x, _y, _x.shape, _y.shape, i, len(data_x))

        d_x.append(_x.T)
        d_y.append(_y.T)

    return np.asarray(d_x).transpose((0, 2, 1)), np.asarray(d_y).transpose((0, 2, 1))
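
For reference, the same rolling pairing can also be written with numpy's sliding_window_view (NumPy ≥ 1.20); a minimal sketch assuming the feature and target arrays are already filled and scaled:

from numpy.lib.stride_tricks import sliding_window_view

def windows_xy(x, y, seq_len):
    # x: (N, F) features, y: (N, C) targets
    wx = sliding_window_view(x, seq_len, axis=0)   # (N - T + 1, F, T)
    wy = sliding_window_view(y, seq_len, axis=0)   # (N - T + 1, C, T)
    n = len(x) - 2 * seq_len + 1                   # number of (X, Y) pairs
    # X windows start at i, the matching Y windows start at i + T
    return wx[:n].transpose(0, 2, 1), wy[seq_len:seq_len + n].transpose(0, 2, 1)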
data_x, data_y = generate_xy_pair(df_hour, seq_len=SEQ_LEN, trans_util=trans_util, columns_x=COLUMNS_X, columns_y=COLUMNS_Y)
(5736, 23) (5736, 20)
data_x.shape, data_y.shape
((5401, 168, 23), (5401, 168, 20))
data_x[0], data_y[0]
(array([[0.19510716, 0.2526096 , 0.26320132, ..., 0.        , 0.04347826,
         0.83333333],
        [0.11625556, 0.13569937, 0.12541254, ..., 0.        , 0.08695652,
         0.83333333],
        [0.06570966, 0.04175365, 0.05033003, ..., 0.        , 0.13043478,
         0.83333333],
        ...,
        [0.63687829, 0.98538622, 0.92739274, ..., 0.2       , 0.95652174,
         0.66666667],
        [0.92094622, 0.6993737 , 0.67986799, ..., 0.2       , 1.        ,
         0.66666667],
        [0.26991508, 0.44050104, 0.38118812, ..., 0.23333333, 0.        ,
         0.83333333]]),
 array([[ 23.6  ,  12.2  ,  40.6  , ...,   3.932,   1.15 ,   1.4  ],
        [ 15.6  ,   5.   ,  32.6  , ...,   1.575,   0.509,   0.3  ],
        [ 12.4  ,   3.9  ,  25.1  , ...,   1.042,   0.394,   0.3  ],
        ...,
        [ 71.3  ,  46.3  , 133.3  , ...,  14.968,   6.192,   4.8  ],
        [ 60.7  ,  37.   , 105.5  , ...,  12.944,   5.072,   4.   ],
        [ 35.   ,  19.8  ,  67.5  , ...,   8.908,   2.912,   2.4  ]]))
# Extract the training/test sample indices corresponding to each test segment
_train_idx_1 = df_hour[df_hour['time']<test_list1[0]].index.values.tolist()
_train_idx_2 = df_hour[(df_hour['time']>test_list1[1])&(df_hour['time']<test_list1[2])].index.values.tolist()
_train_idx_3 = df_hour[(df_hour['time']>test_list1[3])&(df_hour['time']<test_list1[4])].index.values.tolist()
_train_idx_4 = df_hour[(df_hour['time']>test_list1[5])&(df_hour['time']<test_list1[6])].index.values.tolist()

# Each segment's training data also includes all earlier segments
train_idx_1 = _train_idx_1[:-SEQ_LEN*2]
train_idx_2 = train_idx_1 + _train_idx_2[:-SEQ_LEN*2]
train_idx_3 = train_idx_2 + _train_idx_3[:-SEQ_LEN*2]
train_idx_4 = train_idx_3 + _train_idx_4[:-SEQ_LEN*2]

test_idx_1 = _train_idx_1[-SEQ_LEN]
test_idx_2 = _train_idx_2[-SEQ_LEN]
test_idx_3 = _train_idx_3[-SEQ_LEN]
test_idx_4 = _train_idx_4[-SEQ_LEN]
len(_train_idx_1), len(_train_idx_2), len(_train_idx_3), len(_train_idx_4)
(2880, 576, 1032, 576)
len(train_idx_1), len(train_idx_2), len(train_idx_3), len(train_idx_4)
(2544, 2784, 3480, 3720)
test_idx_1, test_idx_2, test_idx_3, test_idx_4
(2712, 3456, 4656, 5400)
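
Why index _train_idx_k[-SEQ_LEN]? Sample data_x[i] covers hours i to i+T−1 and its target covers hours i+T to i+2T−1, so taking i = _train_idx_k[-SEQ_LEN] selects exactly the sample whose input window is the last T hours before test segment k, and whose prediction window is the test segment itself.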
train_x_1 = data_x[train_idx_1]
train_y_1 = data_y[train_idx_1]
train_x_2 = data_x[train_idx_2]
train_y_2 = data_y[train_idx_2]
train_x_3 = data_x[train_idx_3]
train_y_3 = data_y[train_idx_3]
train_x_4 = data_x[train_idx_4]
train_y_4 = data_y[train_idx_4]

test_x_1 = data_x[test_idx_1]
test_x_2 = data_x[test_idx_2]
test_x_3 = data_x[test_idx_3]
test_x_4 = data_x[test_idx_4]

FEATURE_SIZE = train_x_1.shape[-1]
OUTPUT_SIZE = train_y_1.shape[-1]
train_x_1.shape, train_y_1.shape, test_x_1.shape
((2544, 168, 23), (2544, 168, 20), (168, 23))
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class Tt(nn.Layer):
    def __init__(self,
                 seq_len,
                 feature_size,
                 output_size,
                 use_model='lstm',
                 hidden_size=576,
                 num_hidden_layers=6,
                 num_attention_heads=6,
                 intermediate_size=3072,
                 hidden_act="gelu",
                 hidden_dropout_prob=0.1,
                 attention_probs_dropout_prob=0.1,
                 max_position_embeddings=512,
                 max_hour=25,
                 max_min=61,
                 max_dow=8,
                 max_ts=1441):
        super(Tt, self).__init__()

        self.use_model = use_model
        self.feature_size = feature_size

        # optional time embeddings, used when the corresponding ids are passed in
        self.th_embeddings = nn.Embedding(max_hour, hidden_size)
        self.tm_embeddings = nn.Embedding(max_min, hidden_size)
        self.td_embeddings = nn.Embedding(max_dow, hidden_size)
        self.tt_embeddings = nn.Embedding(max_ts, hidden_size)

        # positional encoding
        self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.fc_inputs = nn.Linear(feature_size, hidden_size)

        encoder_layer = nn.TransformerEncoderLayer(
            hidden_size,
            num_attention_heads,
            intermediate_size,
            dropout=hidden_dropout_prob,
            activation=hidden_act,
            attn_dropout=attention_probs_dropout_prob,
            act_dropout=0)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_hidden_layers)

        self.lstm = paddle.nn.LSTM(input_size=hidden_size, hidden_size=hidden_size, num_layers=2)

        self.fc_output_1 = nn.Linear(hidden_size, hidden_size)
        self.fc_output_2 = nn.Linear(hidden_size, hidden_size)
        self.fc_output_3 = nn.Linear(hidden_size, output_size)

    def forward(self,
                inputs,
                inputs_th=None,
                inputs_tm=None,
                inputs_td=None,
                inputs_tt=None,
                position_ids=None,
                attention_mask=None):

        if position_ids is None:
            ones = paddle.ones(inputs.shape[:2], dtype="int64")
            seq_length = paddle.cumsum(ones, axis=1)
            position_ids = seq_length - ones
            position_ids.stop_gradient = True

        position_embeddings = self.position_embeddings(position_ids)

        inputs = self.fc_inputs(inputs)
        inputs = nn.Tanh()(inputs)

        inputs = inputs + position_embeddings

        # add the optional time embeddings if the corresponding ids are provided
        if inputs_th is not None:
            inputs += self.th_embeddings(inputs_th)
        
        if inputs_tm is not None:
            inputs += self.tm_embeddings(inputs_tm)

        if inputs_td is not None:
            inputs += self.td_embeddings(inputs_td)

        if inputs_tt is not None:
            inputs += self.tt_embeddings(inputs_tt)

        inputs = self.layer_norm(inputs)

        # choose the LSTM or the Transformer encoder
        if self.use_model == 'lstm':
            encoder_outputs, (h, c) = self.lstm(inputs)
        elif self.use_model == 'transformer':
            if attention_mask is None:
                attention_mask = paddle.unsqueeze(
                    (paddle.zeros(inputs.shape[:2])).astype(
                        self.fc_inputs.weight.dtype) * -1e4,
                    axis=[1, 2])

            encoder_outputs = self.encoder(
                inputs,
                src_mask=attention_mask)

        output = self.fc_output_1(encoder_outputs)
        output = nn.ReLU()(output)
        output = self.fc_output_2(output)
        output = self.fc_output_3(output)

        return output
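
A quick shape check of the model (illustrative; the batch size of 2 is arbitrary):

model = Tt(seq_len=SEQ_LEN, feature_size=FEATURE_SIZE, output_size=OUTPUT_SIZE)
x = paddle.randn([2, SEQ_LEN, FEATURE_SIZE], dtype='float32')
print(model(inputs=x).shape)  # [2, 168, 20]: one value per hour per flow column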

import paddle
import paddle.nn.functional as F
from paddle.metric import Accuracy
from paddle.io import DataLoader, BatchSampler
from paddlenlp.datasets import MapDataset
from paddlenlp.data import DataCollatorWithPadding
from paddlenlp.data import Dict, Stack, Pad
def calc_score(y_true, y_pred):
    return 1/(1+msle(np.clip(np.reshape(y_true, -1), 0, None), np.clip(np.reshape(y_pred, -1), 0, None)))

def eval_model(model, data_loader):
    model.eval()

    y_pred = []
    y_true = []
    for step, batch in enumerate(data_loader, start=1):
        data = batch['data'].astype('float32')
        label = batch['label'].astype('float32')

        # compute the model output
        output = model(inputs=data)
        y_pred.extend(output.numpy())
        y_true.extend(label.numpy())
    
    score = calc_score(y_true, y_pred)
    model.train()
    return score

def make_data_loader(data_x, idx, batch_size, data_y=None, shuffle=False):

    data = [{
        'data': data_x[i], 
        'label': 0 if data_y is None else data_y[i]} 
        for i in idx]
    ds = MapDataset(data)
    batch_sampler = BatchSampler(ds, batch_size=batch_size, shuffle=shuffle)
    return DataLoader(dataset=ds, batch_sampler=batch_sampler)
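
The competition-style score used above is 1 / (1 + MSLE), so a perfect prediction gives MSLE = 0 and a score of exactly 1 (quick check):

y = np.ones((2, 168, 20))
print(calc_score(y, y))  # 1.0: MSLE is 0 for a perfect prediction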

EPOCHS = 30
BATCH_SIZE = 256
CKPT_DIR = 'work/output'
K_FOLD = 5
epoch_base = 0
step_eval = 5
step_log = 100

def do_train(train_x, train_y, prefix):
    print('-'*20)
    print('training ...', prefix)
    print('train x:', np.shape(train_x), 'train y:', np.shape(train_y))

    paddle.seed(2022)

    for kfold, tv_idx in enumerate(KFold(n_splits=K_FOLD, shuffle=True, random_state=2022).split(train_x)):
        print('training fold...', kfold)

        train_idx, valid_idx = tv_idx

        model = Tt(seq_len=SEQ_LEN, feature_size=FEATURE_SIZE, output_size=OUTPUT_SIZE)

        train_data_loader = make_data_loader(
            train_x, train_idx, BATCH_SIZE, data_y=train_y, shuffle=True)
        valid_data_loader = make_data_loader(
            train_x, valid_idx, BATCH_SIZE, data_y=train_y, shuffle=False)

        optimizer = paddle.optimizer.AdamW(learning_rate=1e-4, parameters=model.parameters())
        criterion = paddle.nn.MSELoss()

        epochs = EPOCHS # number of training epochs
        save_dir = CKPT_DIR # directory where model parameters are saved during training
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)

        global_step = 0 # iteration counter
        tic_train = time.time()

        model.train()

        best_score = 0
        for epoch in range(1+epoch_base, epochs+epoch_base+1):
            for step, batch in enumerate(train_data_loader, start=1):
                data = batch['data'].astype('float32')
                label = batch['label'].astype('float32')

                # compute the model output
                output = model(inputs=data)
                loss = criterion(output, label)
                # print(loss)

                # log the loss, validation score, and training speed
                global_step += 1
                if global_step % step_eval == 0:
                    score = eval_model(model, valid_data_loader)            
                    if score > best_score:
                        # print('saving best model...', score)
                        _save_dir = os.path.join(save_dir, '{}_kfold_{}_best_model.pdparams'.format(prefix, kfold))
                        paddle.save(
                            model.state_dict(),
                            _save_dir)
                        best_score = score
                    if global_step % step_log == 0:
                        print(
                            'global step %d, epoch: %d, batch: %d, loss: %.5f, valid score: %.5f, speed: %.2f step/s'
                            % (global_step, epoch, step, loss, score,
                                10 / (time.time() - tic_train)))
                        tic_train = time.time()

                # backpropagate and update the parameters
                loss.backward()
                optimizer.step()
                optimizer.clear_grad()

def do_pred(test_x, prefix):
    print('-'*20)
    print('predict ...', prefix)
    print('predict x:', np.shape(test_x))

    # prediction
    test_data_loader = make_data_loader(
            [test_x], [0], BATCH_SIZE, data_y=None, shuffle=False)

    sub_df = []
    save_dir = CKPT_DIR

    for kfold in range(K_FOLD):
        print('predict kfold...', kfold)
        model = Tt(seq_len=SEQ_LEN, feature_size=FEATURE_SIZE, output_size=OUTPUT_SIZE)
        model.set_dict(paddle.load(os.path.join(save_dir, '{}_kfold_{}_best_model.pdparams'.format(prefix, kfold))))
        model.eval()

        y_pred = []
        for step, batch in enumerate(test_data_loader, start=1):
            data = batch['data'].astype('float32')
            label = batch['label'].astype('float32')

            # compute the model output
            output = model(inputs=data)
            y_pred.extend(output.numpy())

        sub_df.append(np.clip(y_pred, 0, None))
    
    return sub_df
# Train the model for each test segment in turn
do_train(train_x_1, train_y_1, 'm1')
do_train(train_x_2, train_y_2, 'm2')
do_train(train_x_3, train_y_3, 'm3')
do_train(train_x_4, train_y_4, 'm4')
--------------------
training ... m1
train x: (2544, 168, 23) train y: (2544, 168, 20)
training fold... 0


W0928 21:34:13.226250   365 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0928 21:34:13.229223   365 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.


global step 100, epoch: 13, batch: 4, loss: 189.34042, valid score: 0.74267, speed: 0.67 step/s
global step 200, epoch: 25, batch: 8, loss: 26.75570, valid score: 0.94225, speed: 0.75 step/s
training fold... 1
global step 100, epoch: 13, batch: 4, loss: 179.81596, valid score: 0.75175, speed: 0.88 step/s
global step 200, epoch: 25, batch: 8, loss: 27.06740, valid score: 0.94496, speed: 0.75 step/s
training fold... 2
global step 100, epoch: 13, batch: 4, loss: 192.32230, valid score: 0.74129, speed: 0.91 step/s
global step 200, epoch: 25, batch: 8, loss: 27.35677, valid score: 0.94298, speed: 0.75 step/s
training fold... 3
global step 100, epoch: 13, batch: 4, loss: 176.71466, valid score: 0.75317, speed: 0.87 step/s
global step 200, epoch: 25, batch: 8, loss: 24.32207, valid score: 0.94430, speed: 0.75 step/s
training fold... 4
global step 100, epoch: 13, batch: 4, loss: 196.51141, valid score: 0.73796, speed: 0.88 step/s
global step 200, epoch: 25, batch: 8, loss: 27.48337, valid score: 0.94143, speed: 0.74 step/s
--------------------
training ... m2
train x: (2784, 168, 23) train y: (2784, 168, 20)
training fold... 0
global step 100, epoch: 12, batch: 1, loss: 192.12552, valid score: 0.74218, speed: 0.83 step/s
global step 200, epoch: 23, batch: 2, loss: 26.67301, valid score: 0.94218, speed: 0.73 step/s
training fold... 1
global step 100, epoch: 12, batch: 1, loss: 181.16043, valid score: 0.75225, speed: 0.85 step/s
global step 200, epoch: 23, batch: 2, loss: 26.28015, valid score: 0.94389, speed: 0.73 step/s
training fold... 2
global step 100, epoch: 12, batch: 1, loss: 194.71078, valid score: 0.74261, speed: 0.87 step/s
global step 200, epoch: 23, batch: 2, loss: 28.19350, valid score: 0.93948, speed: 0.72 step/s
training fold... 3
global step 100, epoch: 12, batch: 1, loss: 181.40471, valid score: 0.75267, speed: 0.86 step/s
global step 200, epoch: 23, batch: 2, loss: 27.63694, valid score: 0.94298, speed: 0.72 step/s
training fold... 4
global step 100, epoch: 12, batch: 1, loss: 194.80693, valid score: 0.73768, speed: 0.85 step/s
global step 200, epoch: 23, batch: 2, loss: 27.04206, valid score: 0.93785, speed: 0.73 step/s
--------------------
training ... m3
train x: (3480, 168, 23) train y: (3480, 168, 20)
training fold... 0
global step 100, epoch: 10, batch: 1, loss: 195.62051, valid score: 0.74132, speed: 0.80 step/s
global step 200, epoch: 19, batch: 2, loss: 29.17942, valid score: 0.93782, speed: 0.70 step/s
global step 300, epoch: 28, batch: 3, loss: 22.93004, valid score: 0.94732, speed: 0.90 step/s
training fold... 1
global step 100, epoch: 10, batch: 1, loss: 191.73341, valid score: 0.74899, speed: 0.85 step/s
global step 200, epoch: 19, batch: 2, loss: 28.48909, valid score: 0.94111, speed: 0.70 step/s
global step 300, epoch: 28, batch: 3, loss: 24.10351, valid score: 0.94549, speed: 0.83 step/s
training fold... 2
global step 100, epoch: 10, batch: 1, loss: 200.53751, valid score: 0.74166, speed: 0.84 step/s
global step 200, epoch: 19, batch: 2, loss: 32.34964, valid score: 0.93378, speed: 0.70 step/s
global step 300, epoch: 28, batch: 3, loss: 22.18238, valid score: 0.94529, speed: 0.86 step/s
training fold... 3
global step 100, epoch: 10, batch: 1, loss: 190.54114, valid score: 0.74929, speed: 0.83 step/s
global step 200, epoch: 19, batch: 2, loss: 29.43060, valid score: 0.93647, speed: 0.70 step/s
global step 300, epoch: 28, batch: 3, loss: 22.63792, valid score: 0.94633, speed: 0.84 step/s
training fold... 4
global step 100, epoch: 10, batch: 1, loss: 199.86848, valid score: 0.73911, speed: 0.82 step/s
global step 200, epoch: 19, batch: 2, loss: 30.84038, valid score: 0.93401, speed: 0.71 step/s
global step 300, epoch: 28, batch: 3, loss: 25.37951, valid score: 0.94664, speed: 0.82 step/s
--------------------
training ... m4
train x: (3720, 168, 23) train y: (3720, 168, 20)
training fold... 0
global step 100, epoch: 9, batch: 4, loss: 196.55203, valid score: 0.74267, speed: 0.81 step/s
global step 200, epoch: 17, batch: 8, loss: 31.35485, valid score: 0.93497, speed: 0.70 step/s
global step 300, epoch: 25, batch: 12, loss: 24.27215, valid score: 0.94545, speed: 0.80 step/s
training fold... 1
global step 100, epoch: 9, batch: 4, loss: 191.64560, valid score: 0.74758, speed: 0.83 step/s
global step 200, epoch: 17, batch: 8, loss: 30.92274, valid score: 0.93813, speed: 0.69 step/s
global step 300, epoch: 25, batch: 12, loss: 24.90816, valid score: 0.94470, speed: 0.90 step/s
training fold... 2
global step 100, epoch: 9, batch: 4, loss: 197.55722, valid score: 0.74337, speed: 0.84 step/s
global step 200, epoch: 17, batch: 8, loss: 31.99613, valid score: 0.93345, speed: 0.70 step/s
global step 300, epoch: 25, batch: 12, loss: 24.23726, valid score: 0.94481, speed: 0.77 step/s
training fold... 3
global step 100, epoch: 9, batch: 4, loss: 186.58867, valid score: 0.74806, speed: 0.79 step/s
global step 200, epoch: 17, batch: 8, loss: 29.82816, valid score: 0.93393, speed: 0.71 step/s
global step 300, epoch: 25, batch: 12, loss: 25.93081, valid score: 0.94440, speed: 0.84 step/s
training fold... 4
global step 100, epoch: 9, batch: 4, loss: 198.73732, valid score: 0.74012, speed: 0.81 step/s
global step 200, epoch: 17, batch: 8, loss: 31.71860, valid score: 0.92987, speed: 0.70 step/s
global step 300, epoch: 25, batch: 12, loss: 24.98176, valid score: 0.94471, speed: 0.83 step/s
# Predict each test segment in turn
pred_1 = do_pred(test_x_1, 'm1')
pred_2 = do_pred(test_x_2, 'm2')
pred_3 = do_pred(test_x_3, 'm3')
pred_4 = do_pred(test_x_4, 'm4')
--------------------
predict ... m1
predict x: (168, 23)
predict kfold... 0
predict kfold... 1
predict kfold... 2
predict kfold... 3
predict kfold... 4
--------------------
predict ... m2
predict x: (168, 23)
predict kfold... 0
predict kfold... 1
predict kfold... 2
predict kfold... 3
predict kfold... 4
--------------------
predict ... m3
predict x: (168, 23)
predict kfold... 0
predict kfold... 1
predict kfold... 2
predict kfold... 3
predict kfold... 4
--------------------
predict ... m4
predict x: (168, 23)
predict kfold... 0
predict kfold... 1
predict kfold... 2
predict kfold... 3
predict kfold... 4
np.shape(pred_1), np.shape(pred_2), np.shape(pred_3), np.shape(pred_4)
((5, 1, 168, 20), (5, 1, 168, 20), (5, 1, 168, 20), (5, 1, 168, 20))
result = np.vstack((
    np.mean(pred_1, axis=0).squeeze(),
    np.mean(pred_2, axis=0).squeeze(),
    np.mean(pred_3, axis=0).squeeze(),
    np.mean(pred_4, axis=0).squeeze()))

result[result<0] = 0
result = pd.concat([df_sub['time'], pd.DataFrame(result)], axis=1)
result.columns = df_sub.columns
os.makedirs('work/result', exist_ok=True)  # make sure the output directory exists
result.to_csv('work/result/result_0929_1.csv', index=False, encoding='utf-8')
result
| | time | flow_1 | flow_2 | flow_3 | flow_4 | flow_5 | flow_6 | flow_7 | flow_8 | flow_9 | ... | flow_11 | flow_12 | flow_13 | flow_14 | flow_15 | flow_16 | flow_17 | flow_18 | flow_19 | flow_20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2022-05-01 01:00:00 | 16.348139 | 9.175403 | 26.263027 | 18.721674 | 2.373459 | 34.867836 | 7.246955 | 1.099344 | 3.313838 | ... | 2.670277 | 1.593728 | 1.153737 | 1.514585 | 0.968933 | 2.303546 | 4.133782 | 3.025845 | 1.402229 | 0.971339 |
| 1 | 2022-05-01 02:00:00 | 14.053380 | 5.963273 | 22.574793 | 14.666861 | 1.698181 | 26.584387 | 4.510387 | 0.770431 | 2.162849 | ... | 1.651368 | 1.184766 | 0.758920 | 0.877874 | 0.657236 | 1.438656 | 2.712646 | 2.277040 | 1.038984 | 0.728588 |
| 2 | 2022-05-01 03:00:00 | 13.625262 | 4.783116 | 21.758783 | 13.371454 | 1.489840 | 23.987709 | 3.504470 | 0.691582 | 1.788223 | ... | 1.244085 | 1.079558 | 0.646173 | 0.639825 | 0.559387 | 1.123157 | 2.216079 | 2.084101 | 0.939994 | 0.660505 |
| 3 | 2022-05-01 04:00:00 | 14.628029 | 4.955190 | 23.244579 | 14.171738 | 1.586371 | 25.465771 | 3.612296 | 0.745096 | 1.890892 | ... | 1.251172 | 1.165931 | 0.689106 | 0.649021 | 0.590614 | 1.148339 | 2.308054 | 2.225137 | 1.003049 | 0.707679 |
| 4 | 2022-05-01 05:00:00 | 17.273390 | 6.258585 | 27.386667 | 17.034664 | 1.970203 | 30.908686 | 4.632016 | 0.918758 | 2.382508 | ... | 1.598803 | 1.420646 | 0.858239 | 0.846729 | 0.735128 | 1.465289 | 2.902800 | 2.669768 | 1.221339 | 0.861373 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 667 | 2022-08-27 20:00:00 | 66.468307 | 35.385902 | 107.975449 | 71.752434 | 9.243578 | 138.513672 | 27.510956 | 4.069358 | 12.509884 | ... | 10.101812 | 6.173742 | 4.863465 | 5.871843 | 3.652327 | 9.148096 | 15.805559 | 11.880441 | 5.471368 | 3.962896 |
| 668 | 2022-08-27 21:00:00 | 76.719620 | 42.754463 | 123.607361 | 82.205627 | 11.031697 | 161.878479 | 33.868797 | 4.929608 | 15.284616 | ... | 12.454828 | 7.429977 | 5.723071 | 7.472642 | 4.452061 | 11.248039 | 19.293243 | 14.126564 | 6.571022 | 4.579916 |
| 669 | 2022-08-27 22:00:00 | 77.418564 | 43.689445 | 125.960526 | 85.884750 | 11.060221 | 161.800690 | 34.703972 | 4.948043 | 15.392654 | ... | 12.925382 | 7.647626 | 5.894122 | 7.730872 | 4.586210 | 11.649742 | 19.840784 | 14.605158 | 6.642810 | 4.627126 |
| 670 | 2022-08-27 23:00:00 | 60.882912 | 33.174492 | 103.905174 | 76.558876 | 7.978610 | 120.090683 | 25.825283 | 3.456902 | 10.831713 | ... | 10.026538 | 5.887389 | 4.801626 | 5.655329 | 3.596707 | 9.055432 | 15.106519 | 11.837584 | 4.873275 | 3.687579 |
| 671 | 2022-08-28 00:00:00 | 39.200665 | 19.236624 | 71.477028 | 52.819008 | 4.418453 | 70.673370 | 14.343943 | 1.840522 | 5.580392 | ... | 5.988278 | 3.586876 | 3.149178 | 3.089876 | 2.144467 | 5.405599 | 8.690753 | 7.562930 | 2.856358 | 2.337228 |

672 rows × 21 columns

Result Analysis

Since paddle results fluctuate a little between runs, only a simple comparison is given here:

| model | epochs | score |
| --- | --- | --- |
| LSTM | 30 | 0.441 |
| LSTM | 50 | 0.442 |


Some analysis and remarks on the model were already discussed in 【赛事基线】"深水云脑"水质净化厂工艺控制-曝气量预测Baseline之DL, so they are not repeated here.

A few brief additions:

  1. With the Transformer structure, a Decoder step could also be added, similar to generative models in NLP, which would make the model more flexible; only the Encoder part is implemented here.
  2. To improve the score, try constructing more data features, e.g. time differences and nonlinear transforms (a sketch follows this list).
  3. The biggest advantage of deep learning models is structural flexibility: as the model above shows, it can output 168 × 20 = 3360 values at once and backpropagate through all of them in a single pass. Multi-task learning has been shown to work well in NLP, and it is worth trying in traditional regression problems too.
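
For point 2, a minimal sketch of such extra features (the lags and column names are hypothetical illustrations, not part of this baseline):

for c in COLUMNS_Y:
    df_hour[c + '_diff1'] = df_hour[c].diff(1)                  # 1-hour difference
    df_hour[c + '_diff24'] = df_hour[c].diff(24)                # difference vs. the same hour one day earlier
    df_hour[c + '_log1p'] = np.log1p(df_hour[c].clip(lower=0))  # simple nonlinear transform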

Finally, a more complex model does not necessarily score better: in one uploaded baseline, a simple mean strategy achieved a score far better than this model's, which is worth pondering ~~~ haha [facepalm]

OK, I hope this post helps; feel free to discuss any questions with me ~

Appendix:

Other open-source projects:

【赛事基线】"深水云脑"水质净化厂工艺控制-曝气量预测Baseline之DL

【实验分享】"字"还是"词"?这是个问题!

【比赛分享】讯飞-基于论文摘要的文本分类与查询性问答第4名(并列第3)的思考

I'm taking part in the AI Studio 4th-anniversary event: log in to the platform and complete the exploration tasks for a chance to win a Mac, an iPhone, cloud-drive membership, GPU compute, and other prizes. Click the link to support me, and you can win prizes too:
https://aistudio.baidu.com/aistudio/4th?invitation=1&sharedUserId=942478&sharedUserName=er_zhong0

