
PTAN in Practice II || The agent classes

ptan.agent.DQNAgent and ptan.agent.PolicyAgent are ready-made agent classes in the ptan library: you pass in a neural network (NN) and an action selector, and they return action indices, which makes them very convenient to use.
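Conceptually these agents are thin wrappers: run the network on a batch of states, then hand the scores to the action selector. A minimal hand-rolled sketch of that idea, written against ptan.agent.BaseAgent and assuming the states already arrive as a torch tensor (a simplified illustration, not the library's actual implementation), might look like this:

import ptan
import torch

class MySimpleDQNAgent(ptan.agent.BaseAgent):
    # simplified sketch of what ptan.agent.DQNAgent does internally
    def __init__(self, model, selector):
        self.model = model
        self.selector = selector

    @torch.no_grad()
    def __call__(self, states, agent_states=None):
        if agent_states is None:
            agent_states = [None] * len(states)    # dummy internal state per batch entry
        q_values = self.model(states)              # (batch, n_actions) scores from the NN
        actions = self.selector(q_values.numpy())  # the selector picks action indices
        return actions, agent_states               # same (actions, agent_states) tuple as DQNAgent returns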

Code:

import ptan
import torch
import torch.nn as nn

# a custom NN model standing in for the DQN
class DQNNet(nn.Module):
    def __init__(self, actions: int):
        super(DQNNet, self).__init__()
        self.actions = actions

    def forward(self, x):
        # we always produce a diagonal tensor of shape (batch_size, actions)
        return torch.eye(x.size()[0], self.actions)

# a custom policy network; its output should be a probability distribution over the actions
class PolicyNet(nn.Module):
    def __init__(self, actions: int):
        super(PolicyNet, self).__init__()
        self.actions = actions

    def forward(self, x):
        # Now we produce the tensor with first two actions
        # having the same logit scores
        shape = (x.size()[0], self.actions)
        res = torch.zeros(shape, dtype=torch.float32)
        res[:, 0] = 1
        res[:, 1] = 1
        return res

net = DQNNet(actions=3)             # create the DQN network
net_out = net(torch.zeros(6, 10))
print("DQNNet:")
print(net_out)

selector = ptan.actions.ArgmaxActionSelector()  # define the action selector (greedy argmax)
agent = ptan.agent.DQNAgent(dqn_model=net, action_selector=selector)    # create the agent from the NN and the action selector
ag_out = agent(torch.zeros(2, 5))       # the agent is given a batch of state tensors: 2 is the batch size, 5 is the observation dimension
print("Argmax:", ag_out)                # first element: action indices; second: a list of the agent's internal states

# define an epsilon-greedy action selector: with probability epsilon a random action is taken,
# with probability 1-epsilon the argmax of the NN output is used (see the decay sketch after this listing)
selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=1.0)
agent = ptan.agent.DQNAgent(dqn_model=net, action_selector=selector)
ag_out = agent(torch.zeros(10, 5))[0]
print("eps=1.0:", ag_out)

selector.epsilon = 0.5
ag_out = agent(torch.zeros(10, 5))[0]
print("eps=0.5:", ag_out)

selector.epsilon = 0.1
ag_out = agent(torch.zeros(10, 5))[0]
print("eps=0.1:", ag_out)

net = PolicyNet(actions=5)          # create the policy network
net_out = net(torch.zeros(6, 10))
print("policy_net:")
print(net_out)

selector = ptan.actions.ProbabilityActionSelector()     # define the action selector that samples actions from a probability distribution
# apply_softmax=True means softmax does not have to be applied inside PolicyNet, but it must be applied somewhere:
# the NN output may contain negative values, so softmax is needed to turn it into a valid (positive, normalized)
# probability distribution (see the quick check after the output below)
agent = ptan.agent.PolicyAgent(model=net, action_selector=selector, apply_softmax=True)
ag_out = agent(torch.zeros(6, 5))[0]
print(ag_out)
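One practical note before the output: in the listing above epsilon is set by hand at three fixed values, but in real training it is usually decayed over time. Since epsilon is just an attribute on the selector, a hand-rolled linear schedule is enough; the constants below are illustrative only, not from the original listing:

EPS_START, EPS_FINAL, EPS_DECAY_STEPS = 1.0, 0.02, 10_000   # illustrative schedule

selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=EPS_START)
agent = ptan.agent.DQNAgent(dqn_model=DQNNet(actions=3), action_selector=selector)

for step in range(20_000):
    # linear decay from EPS_START down to EPS_FINAL over the first EPS_DECAY_STEPS steps
    selector.epsilon = max(EPS_FINAL, EPS_START - (EPS_START - EPS_FINAL) * step / EPS_DECAY_STEPS)
    # ... interact with the environment / run a training step here ...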

Output:

DQNNet:
tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])

Argmax: (array([0, 1], dtype=int64), [None, None])
    
eps=1.0: [1 2 0 0 2 2 2 2 0 0]
eps=0.5: [0 0 0 0 0 1 0 0 1 0]
eps=0.1: [0 1 2 0 0 0 0 1 0 0]
    
policy_net:
tensor([[1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.]])

[1 0 4 3 1 0]
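The last line shows that the sampled actions are not limited to 0 and 1. That is the effect of apply_softmax=True: the logits [1, 1, 0, 0, 0] produced by PolicyNet are first turned into a proper probability distribution, so actions 2 to 4 keep a non-zero probability. A quick check of that distribution (a small illustration, not part of the listing above):

import torch

probs = torch.softmax(torch.tensor([1., 1., 0., 0., 0.]), dim=0)
print(probs)   # roughly [0.32, 0.32, 0.12, 0.12, 0.12]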