
Building an LSTM Model in PyTorch for Stock Price Prediction

This article is based on this Zhihu post: https://zhuanlan.zhihu.com/p/128927771, with parts of the code modified for readability.
The original code and data are available here: https://link.zhihu.com/?target=https%3A//github.com/yhannahwang/stock_prediction

This article was converted from a Jupyter notebook to Markdown before being posted here, so some code and text may carry a background color.
Before reading, you should be familiar with how an LSTM works and with the interface of PyTorch's nn.LSTM; otherwise the article will be hard to follow.
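As a quick refresher on that interface (a minimal sketch of my own, not part of the original article): with batch_first=True, nn.LSTM takes input of shape (batch, seq_len, input_size) and returns an output of shape (batch, seq_len, hidden_size) together with the final hidden and cell states of shape (num_layers, batch, hidden_size).

import torch
import torch.nn as nn

# Minimal shape check for the nn.LSTM interface (illustrative only)
lstm = nn.LSTM(input_size=4, hidden_size=32, num_layers=2, batch_first=True)
x = torch.randn(8, 20, 4)        # (batch, seq_len, input_size)
out, (hn, cn) = lstm(x)
print(out.shape)                 # torch.Size([8, 20, 32])
print(hn.shape, cn.shape)        # torch.Size([2, 8, 32]) each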

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Import and preprocess the data

dates = pd.date_range('2010-10-11', '2017-10-11', freq='B')   # date range at business-day frequency

# Create a DataFrame that contains only this index
df_main = pd.DataFrame(index=dates)
df_main
2010-10-11
2010-10-12
2010-10-13
2010-10-14
2010-10-15
...
2017-10-05
2017-10-06
2017-10-09
2017-10-10
2017-10-11

1828 rows × 0 columns

df_aaxj = pd.read_csv("data/ETFs/aaxj.us.txt", index_col=0)
df_aaxj
              Open    High     Low   Close   Volume  OpenInt
Date
2008-08-15  44.886  44.886  44.886  44.886      112        0
2008-08-18  44.564  44.564  43.875  43.875    28497        0
2008-08-19  43.283  43.283  43.283  43.283      112        0
2008-08-20  43.918  43.918  43.892  43.892     4468        0
2008-08-22  44.097  44.097  44.017  44.071     4006        0
...            ...     ...     ...     ...      ...      ...
2017-11-06  75.900  76.530  75.890  76.530  1313730        0
2017-11-07  76.490  76.580  76.090  76.185  1627277        0
2017-11-08  76.370  76.590  76.290  76.570   681128        0
2017-11-09  76.040  76.200  75.580  76.110  1261567        0
2017-11-10  76.110  76.150  75.870  76.080   619687        0

2325 rows × 6 columns

# Join the price data onto the business-day index
df_main = df_main.join(df_aaxj)
df_main
              Open    High     Low   Close     Volume  OpenInt
2010-10-11  55.971  56.052  55.863  56.052   268544.0      0.0
2010-10-12  55.676  55.792  55.362  55.667   817951.0      0.0
2010-10-13  56.472  56.867  56.401  56.569   999413.0      0.0
2010-10-14  56.733  56.742  56.293  56.579   661897.0      0.0
2010-10-15  56.893  56.893  56.194  56.552   245001.0      0.0
...            ...     ...     ...     ...        ...      ...
2017-10-05  73.500  74.030  73.500  73.970  2134323.0      0.0
2017-10-06  73.470  73.650  73.220  73.579  2092100.0      0.0
2017-10-09  73.500  73.795  73.480  73.770   879600.0      0.0
2017-10-10  74.150  74.490  74.150  74.480  1878845.0      0.0
2017-10-11  74.290  74.645  74.210  74.610  1168511.0      0.0

1828 rows × 6 columns

# Plot the closing price
df_main[['Close']].plot()
plt.ylabel("stock_price")
plt.title("aaxj ETFs")
plt.show()

[Figure: closing price ('Close') of the aaxj ETF over time]

# Keep four columns as the model's input features
sel_col = ['Open', 'High', 'Low', 'Close']
df_main = df_main[sel_col]
df_main.head()
              Open    High     Low   Close
2010-10-11  55.971  56.052  55.863  56.052
2010-10-12  55.676  55.792  55.362  55.667
2010-10-13  56.472  56.867  56.401  56.569
2010-10-14  56.733  56.742  56.293  56.579
2010-10-15  56.893  56.893  56.194  56.552
# Check for missing values
np.sum(df_main.isnull())
Open     65
High     65
Low      65
Close    65
dtype: int64
# Fill missing values by carrying forward the previous valid value
df_main = df_main.fillna(method='ffill')
np.sum(df_main.isnull())
Open     0
High     0
Low      0
Close    0
dtype: int64
# Scale the data to the range [-1, 1]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-1, 1))
for col in sel_col:   # scale column by column, since fit_transform returns a numpy array
    df_main[col] = scaler.fit_transform(df_main[col].values.reshape(-1, 1))
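Note that after this loop the single scaler object holds only the fit of the last column, 'Close'; the inverse_transform calls at the end of the article implicitly rely on this. A slightly more robust pattern (a sketch of mine, not the original author's code; the scalers dict is an illustrative name) keeps one fitted scaler per column:

# Sketch: one fitted scaler per column, so any column can be inverted later
scalers = {}
for col in sel_col:
    scalers[col] = MinMaxScaler(feature_range=(-1, 1))
    df_main[col] = scalers[col].fit_transform(df_main[col].values.reshape(-1, 1))
# e.g. scalers['Close'].inverse_transform(...) to recover prices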
# Use the next day's closing price as the current day's label
df_main['target'] = df_main['Close'].shift(-1)
df_main.head()
                Open      High       Low     Close    target
2010-10-11 -0.089800 -0.135104 -0.074936 -0.106322 -0.129274
2010-10-12 -0.107350 -0.150977 -0.104289 -0.129274 -0.075502
2010-10-13 -0.059996 -0.085348 -0.043415 -0.075502 -0.074905
2010-10-14 -0.044469 -0.092979 -0.049742 -0.074905 -0.076515
2010-10-15 -0.034950 -0.083761 -0.055543 -0.076515 -0.068407
df_main = df_main.dropna()            # shift(-1) leaves a NaN in the last row; drop that row
df_main = df_main.astype(np.float32)  # convert to float32 for PyTorch

Build the LSTM model

import torch
import torch.nn as nn

input_dim = 4      # number of input features
hidden_dim = 32    # number of neurons in the hidden layer
num_layers = 2     # number of stacked LSTM layers
output_dim = 1     # number of output features
                   # (here we predict a single price; for word prediction this
                   # would instead be the length of the one-hot encoding)

class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim):
        super(LSTM, self).__init__()
        # Hidden dimensions
        self.hidden_dim = hidden_dim

        # Number of hidden layers
        self.num_layers = num_layers

        # Building your LSTM
        # batch_first=True causes input/output tensors to be of shape (batch_dim, seq_dim, feature_dim)
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)

        # Readout layer: a fully connected layer after the LSTM. Since this is
        # a regression problem, no activation follows the linear layer.
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Initialize hidden state with zeros; x.size(0) is the batch size
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).requires_grad_()

        # Initialize cell state
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).requires_grad_()

        # We need to detach h0/c0 as we are doing truncated backpropagation
        # through time (BPTT); if we don't, we'll backprop all the way to the
        # start even after going through another batch
        out, (hn, cn) = self.lstm(x, (h0.detach(), c0.detach()))

        # Apply the readout layer to every time step
        out = self.fc(out)

        return out
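Before moving on, a quick sanity check (my addition, not in the original notebook; _model and _x are throwaway names) confirms that a dummy batch flows through the model with the expected shapes:

# Dummy forward pass: output should be (batch, seq, output_dim)
_model = LSTM(input_dim, hidden_dim, num_layers, output_dim)
_x = torch.randn(5, 20, input_dim)   # (batch, seq, features)
print(_model(_x).shape)              # torch.Size([5, 20, 1])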

Arrange the data to match the model's interface

# Two lists to hold the features and the labels
data_feat, data_target = [], []

# Each sample is a sequence of 20 consecutive trading days
seq = 20

for index in range(len(df_main) - seq):
    # Build the feature set
    data_feat.append(df_main[['Open', 'High', 'Low', 'Close']][index: index + seq].values)
    # Build the target set
    data_target.append(df_main['target'][index: index + seq])

# Convert the feature and target lists into numpy arrays
data_feat = np.array(data_feat)
data_target = np.array(data_target)
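With 1827 rows left after dropping the NaN row and seq = 20, the sliding window produces 1827 - 20 = 1807 sequences, so a quick check (my addition) should show shapes (1807, 20, 4) and (1807, 20):

# Expected: (1807, 20, 4) and (1807, 20)
print(data_feat.shape, data_target.shape)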

Split into training and test sets

# Split into training and test sets at roughly an 8:2 ratio.
# Note that the fraction is taken of the number of rows, not of the number
# of windows, so the split is only approximately 8:2.
test_set_size = int(np.round(0.2 * df_main.shape[0]))  # np.round rounds to the nearest integer
train_size = data_feat.shape[0] - test_set_size
print(test_set_size)  # size of the test set
print(train_size)     # size of the training set
365
1442
trainX = torch.from_numpy(data_feat[:train_size].reshape(-1, seq, 4)).type(torch.Tensor)
# The first dimension is inferred automatically; it is the batch dimension,
# since the LSTM was defined with batch_first=True
testX  = torch.from_numpy(data_feat[train_size:].reshape(-1, seq, 4)).type(torch.Tensor)
trainY = torch.from_numpy(data_target[:train_size].reshape(-1, seq, 1)).type(torch.Tensor)
testY  = torch.from_numpy(data_target[train_size:].reshape(-1, seq, 1)).type(torch.Tensor)
print('x_train.shape = ',trainX.shape)
print('y_train.shape = ',trainY.shape)
print('x_test.shape = ',testX.shape)
print('y_test.shape = ',testY.shape)
x_train.shape =  torch.Size([1442, 20, 4])
y_train.shape =  torch.Size([1442, 20, 1])
x_test.shape =  torch.Size([365, 20, 4])
y_test.shape =  torch.Size([365, 20, 1])

Build data loaders (not actually used)

# Since the dataset is small, we don't split it into mini-batches here, i.e.
# we effectively use batch_size = 1442. The loaders below only demonstrate
# the API; they are not used in training.
batch_size=1442
train = torch.utils.data.TensorDataset(trainX,trainY)
test = torch.utils.data.TensorDataset(testX,testY)
train_loader = torch.utils.data.DataLoader(dataset=train, 
                                           batch_size=batch_size, 
                                           shuffle=False)

test_loader = torch.utils.data.DataLoader(dataset=test, 
                                          batch_size=batch_size, 
                                          shuffle=False)
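For reference, if you did want to train in mini-batches, the loop would iterate over train_loader instead of feeding trainX all at once. A rough sketch (my addition, assuming the model, loss_fn, optimiser and num_epochs defined in the next section):

# Sketch of a mini-batch training loop using the loader (not used below)
for epoch in range(num_epochs):
    for batch_x, batch_y in train_loader:
        y_pred = model(batch_x)
        loss = loss_fn(y_pred, batch_y)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()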

Train the model

# Instantiate the model
model = LSTM(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim, num_layers=num_layers)

# Define the optimizer and the loss function
optimiser = torch.optim.Adam(model.parameters(), lr=0.01)  # Adam optimizer
loss_fn = torch.nn.MSELoss()                               # mean squared error (mean reduction by default)

# Number of passes over the data
num_epochs = 100

# Print the model structure
print(model)
LSTM(
  (lstm): LSTM(4, 32, num_layers=2, batch_first=True)
  (fc): Linear(in_features=32, out_features=1, bias=True)
)
# Print the size of each parameter tensor in the model
for i in range(len(list(model.parameters()))):
    print(list(model.parameters())[i].size())
torch.Size([128, 4])
torch.Size([128, 32])
torch.Size([128])
torch.Size([128])
torch.Size([128, 32])
torch.Size([128, 32])
torch.Size([128])
torch.Size([128])
torch.Size([1, 32])
torch.Size([1])
The leading dimension 128 in the LSTM weights is 4 × hidden_dim = 4 × 32: for each layer, PyTorch stacks the weight matrices of the four gates (input, forget, cell, output) into a single tensor.
# Train the model
hist = np.zeros(num_epochs)
for t in range(num_epochs):
    # Initialise hidden state
    # Don't do this if you want your LSTM to be stateful
    # model.hidden = model.init_hidden()

    # Forward pass
    y_train_pred = model(trainX)

    loss = loss_fn(y_train_pred, trainY)
    if t % 10 == 0 and t != 0:                 # print the MSE every ten epochs
        print("Epoch ", t, "MSE: ", loss.item())
    hist[t] = loss.item()

    # Zero out the gradients, else they will accumulate between epochs
    optimiser.zero_grad()

    # Backward pass
    loss.backward()

    # Update parameters
    optimiser.step()
Epoch  10 MSE:  0.01842750422656536
Epoch  20 MSE:  0.008485360071063042
Epoch  30 MSE:  0.004656758159399033
Epoch  40 MSE:  0.0032537723891437054
Epoch  50 MSE:  0.002434148220345378
Epoch  60 MSE:  0.0020096886437386274
Epoch  70 MSE:  0.0018414082005620003
Epoch  80 MSE:  0.0017679394222795963
Epoch  90 MSE:  0.0017151187639683485
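The hist array recorded the loss at every epoch; plotting it (my addition, not in the original notebook) gives a quick picture of convergence:

# Plot the training-loss curve stored in `hist`
plt.plot(hist, label="Training loss")
plt.xlabel("epoch")
plt.legend()
plt.show()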
# MSE of the trained model on the training set
y_train_pred = model(trainX)
loss_fn(y_train_pred, trainY).item()
0.0016758530400693417

Test the model

# Make predictions on the test set
y_test_pred = model(testX)
loss_fn(y_test_pred, testY).item()
0.004057767800986767
# The test-set MSE is noticeably higher than the training-set MSE, so the model appears to be somewhat overfitting.
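One habit worth adopting for the prediction step (my suggestion, not in the original code): switch the model to evaluation mode and disable gradient tracking, which avoids building an autograd graph we never use:

# Inference without gradient tracking (suggested practice)
model.eval()
with torch.no_grad():
    y_test_pred = model(testX)
print(loss_fn(y_test_pred, testY).item())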

Plot the results

"训练集效果图"
# 无论是真实值,还是模型的输出值,它们的维度均为(batch_size, seq, 1),seq=20
# 我们的目的是用前20天的数据预测今天的股价,所以我们只需要每个数据序列中第20天的标签即可
# 因为前面用了使用DataFrame中shift方法,所以第20天的标签,实际上就是第21天的股价
pred_value = y_train_pred.detach().numpy()[:,-1,0]       
true_value = trainY.detach().numpy()[:,-1,0] 

plt.plot(pred_value, label="Preds")    # 预测值
plt.plot(true_value, label="Data")    # 真实值
plt.legend()
plt.show()
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11

[Figure: training-set predictions vs. actual values, in scaled units]

# The y-axis still contains negative values because of the earlier scaling;
# invert the transformation to recover the original prices
pred_value = scaler.inverse_transform(pred_value.reshape(-1, 1))
true_value = scaler.inverse_transform(true_value.reshape(-1, 1))

plt.plot(pred_value, label="Preds")   # predictions
plt.plot(true_value, label="Data")    # ground truth
plt.legend()
plt.show()

[Figure: training-set predictions vs. actual values, in original price units]

"测试集效果图"
pred_value = y_test_pred.detach().numpy()[:,-1,0]    
true_value = testY.detach().numpy()[:,-1,0]

pred_value = scaler.inverse_transform(pred_value.reshape(-1, 1))
true_value = scaler.inverse_transform(true_value.reshape(-1, 1))

plt.plot(pred_value, label="Preds")    # 预测值
plt.plot(true_value, label="Data")    # 真实值
plt.legend()
plt.show()
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11

[Figure: test-set predictions vs. actual values, in original price units]

The fit is fairly close early in the test period but becomes less accurate later on, which may be related to the overfitting observed above.
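One simple remedy to try (my suggestion, not from the original article) is the dropout argument of nn.LSTM, which applies dropout between the stacked layers and therefore needs num_layers > 1; remember that model.eval() disables it at prediction time. The variable name and the probability 0.2 below are arbitrary illustrative choices:

# Sketch: a dropout-regularized variant of the recurrent layer
lstm_regularized = nn.LSTM(input_dim, hidden_dim, num_layers,
                           batch_first=True, dropout=0.2)

Other standard options include training on more data, reducing hidden_dim, or adding weight decay to the Adam optimizer.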
