
[Code Walkthrough] Spectrum Sharing in Vehicular Networks Based on Multi-Agent RL (Python Implementation)

Original paper: Spectrum Sharing in Vehicular Networks Based on Multi-Agent Reinforcement Learning

Paper translation & commentary: [Paper Notes] Spectrum Sharing in Vehicular Networks Based on Multi-Agent Reinforcement Learning

Code: https://github.com/le-liang/MARLspectrumSharingV2X

Visio flowcharts used in this post (drawn by the author; corrections are welcome): https://download.csdn.net/download/m0_37495408/12353933


Usage:

(from the original author's GitHub README)

  • To train the multi-agent RL model: main_marl_train.py + Environment_marl.py + replay_memory.py
  • To train the baseline single-agent RL model: main_sarl_train.py + Environment_marl.py + replay_memory.py
  • To test all models in the same environment: main_test.py + Environment_marl_test.py + replay_memory.py + '/model'
    • Figs. 3 and 4 of the paper can be reproduced directly by running main_test.py. Change the V2V payload size via self.demand_size in Environment_marl_test.py.
    • Fig. 5 can only be obtained from the returns recorded during training.
    • Figs. 6-7 show the performance of an arbitrary episode (one in which the random baseline fails while the MARL transmission succeeds). In fact, most such episodes exhibit some interesting behavior that hints at multi-agent cooperation; interpretation is left to the reader.
    • The "test" mode in main_marl_train.py is not recommended.

Basic class definitions

Environment_marl.py defines the four basic classes of the framework: V2Vchannels, V2Ichannels, Vehicle, and Environ. Environ has by far the most methods; Vehicle has no methods, only a few attributes; the other two classes each have two methods (one computing path loss and one computing shadow fading).

Vehicle

Initialization takes three arguments: start position, start direction, and velocity. The constructor also creates two lists, neighbors and destinations, which hold the neighbors and the V2V receivers respectively (the two are numerically identical here, because each V2V link's receiver is defined to be a neighbor).

    class Vehicle:
        # Vehicle simulator: include all the information for a vehicle
        def __init__(self, start_position, start_direction, velocity):
            self.position = start_position
            self.direction = start_direction
            self.velocity = velocity
            self.neighbors = []
            self.destinations = []

The meaning of destinations can be seen from the code below.

    def renew_neighbor(self):   # this method belongs to class Environ
        """ Determine the neighbors of each vehicle """
        for i in range(len(self.vehicles)):
            self.vehicles[i].neighbors = []
            self.vehicles[i].actions = []
        z = np.array([[complex(c.position[0], c.position[1]) for c in self.vehicles]])
        Distance = abs(z.T - z)
        for i in range(len(self.vehicles)):
            sort_idx = np.argsort(Distance[:, i])
            for j in range(self.n_neighbor):
                self.vehicles[i].neighbors.append(sort_idx[j + 1])
            destination = self.vehicles[i].neighbors
            self.vehicles[i].destinations = destination

V2Vchannels

Internal parameters: the BS and MS antenna heights are set to 1.5 m and the shadowing standard deviation to 3 dB, both from TR 36.885, Table A.1.4-1; the carrier frequency fc is 2 (in GHz); the decorrelation distance is 10 m.

    class V2Vchannels:
        # Simulator of the V2V Channels
        def __init__(self):
            self.t = 0
            self.h_bs = 1.5
            self.h_ms = 1.5
            self.fc = 2
            self.decorrelation_distance = 10
            self.shadow_std = 3

It contains two methods:

Computing the path loss

    def get_path_loss(self, position_A, position_B):
        d1 = abs(position_A[0] - position_B[0])
        d2 = abs(position_A[1] - position_B[1])
        d = math.hypot(d1, d2) + 0.001  # sqrt(x*x + y*y)
        d_bp = 4 * (self.h_bs - 1) * (self.h_ms - 1) * self.fc * (10 ** 9) / (3 * 10 ** 8)  # effective breakpoint (BP) distance

        def PL_Los(d):
            if d <= 3:
                return 22.7 * np.log10(3) + 41 + 20 * np.log10(self.fc / 5)
            else:
                if d < d_bp:
                    return 22.7 * np.log10(d) + 41 + 20 * np.log10(self.fc / 5)
                else:
                    return 40.0 * np.log10(d) + 9.45 - 17.3 * np.log10(self.h_bs) - 17.3 * np.log10(self.h_ms) + 2.7 * np.log10(self.fc / 5)

        def PL_NLos(d_a, d_b):
            n_j = max(2.8 - 0.0024 * d_b, 1.84)
            return PL_Los(d_a) + 20 - 12.5 * n_j + 10 * n_j * np.log10(d_b) + 3 * np.log10(self.fc / 5)

        if min(d1, d2) < 7:
            PL = PL_Los(d)
        else:
            PL = min(PL_NLos(d1, d2), PL_NLos(d2, d1))
        return PL  # + self.shadow_std * np.random.normal()

Note: the code above follows the stochastic model described in [2], p. 328.

The path loss uses the Manhattan-grid LOS model:

    PL_{LOS}(d)|_{dB} = 10 n_1 \log_{10}(d / 1\,\mathrm{m}) + 28.0 + 20 \log_{10}(f_c / 1\,\mathrm{GHz}) + PL_{1|dB}, \quad 10\,\mathrm{m} < d < d'_{BP}

and

    PL_{LOS}(d)|_{dB} = 10 n_2 \log_{10}(d / d'_{BP}) + PL_{LOS}(d'_{BP})|_{dB}, \quad d'_{BP} < d < 500\,\mathrm{m}

where n_1 = 2.2 and n_2 = 4.0 are the path-loss exponents before and after the breakpoint, and d'_{BP} is the effective breakpoint distance, written d_bp in the code.

Manhattan-grid NLOS model:

    PL_{NLOS} = PL_{LOS}(d_1)|_{dB} + 17.9 - 12.5 n_j + 10 n_j \log_{10}(d_2 / 1\,\mathrm{m}) + 3 \log_{10}(f_c / 1\,\mathrm{GHz}) + PL_{2|dB}

The min(...) in the last part of the code is described on p. 344 of [2]; it is how the path loss is estimated when the receiver is assumed to be on the perpendicular street.

The formulas in the code come from IST-4-027756 WINNER II D1.1.2 V1.2 (WINNER II).

That document contains a parameter table that matches the parameters in the code exactly (not reproduced here).

Updating the shadow fading

    def get_shadowing(self, delta_distance, shadowing):
        return np.exp(-1 * (delta_distance / self.decorrelation_distance)) * shadowing \
               + math.sqrt(1 - np.exp(-2 * (delta_distance / self.decorrelation_distance))) * np.random.normal(0, 3)  # standard dev is 3 dB

This update rule comes from [1], the part of Annex A.1.4 (Channel model) that follows the table.
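The referenced passage is not reproduced here; written out, the update implemented by get_shadowing (this is just the code above in formula form) is a first-order Gauss-Markov shadowing process:

    S_new = e^{-\Delta d / d_{corr}} \cdot S_old + \sqrt{1 - e^{-2 \Delta d / d_{corr}}} \cdot N(0, \sigma^2)

where Δd is the distance the link endpoints have moved since the last update (for V2V, renew_channel passes the sum of the two vehicles' displacements), d_corr is the decorrelation distance (10 m here), and σ = 3 dB for V2V; the V2I version below uses the same form with σ = 8 dB.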

V2Ichannels

It contains the same two methods as V2Vchannels, but the path-loss computation no longer distinguishes LOS from NLOS.

    def get_path_loss(self, position_A):
        d1 = abs(position_A[0] - self.BS_position[0])
        d2 = abs(position_A[1] - self.BS_position[1])
        distance = math.hypot(d1, d2)
        return 128.1 + 37.6 * np.log10(math.sqrt(distance ** 2 + (self.h_bs - self.h_ms) ** 2) / 1000)  # + self.shadow_std * np.random.normal()

    def get_shadowing(self, delta_distance, shadowing):
        nVeh = len(shadowing)
        self.R = np.sqrt(0.5 * np.ones([nVeh, nVeh]) + 0.5 * np.identity(nVeh))
        return np.multiply(np.exp(-1 * (delta_distance / self.Decorrelation_distance)), shadowing) \
               + np.sqrt(1 - np.exp(-2 * (delta_distance / self.Decorrelation_distance))) * np.random.normal(0, 8, nVeh)

Both methods implement [1], Table A.1.4-2 and the explanation that follows it (the table is not reproduced here).

Environ

(Figure: env data-flow diagram)

Initialization takes four lists (the coordinates of the down/up/left/right lanes): down_lane, up_lane, left_lane, right_lane; the map width and height; and the numbers of vehicles and neighbors. Besides these, it holds many internal parameters, as follows:

    class Environ:
        def __init__(self, down_lane, up_lane, left_lane, right_lane, width, height, n_veh, n_neighbor):
            self.V2Vchannels = V2Vchannels()
            self.V2Ichannels = V2Ichannels()
            self.vehicles = []

            self.demand = []
            self.V2V_Shadowing = []
            self.V2I_Shadowing = []
            self.delta_distance = []
            self.V2V_channels_abs = []
            self.V2I_channels_abs = []

            self.V2I_power_dB = 23  # dBm
            self.V2V_power_dB_List = [23, 15, 5, -100]  # the power levels
            self.V2I_power = 10 ** (self.V2I_power_dB)
            self.sig2_dB = -114
            self.bsAntGain = 8
            self.bsNoiseFigure = 5
            self.vehAntGain = 3
            self.vehNoiseFigure = 9
            self.sig2 = 10 ** (self.sig2_dB / 10)

            self.n_RB = n_veh
            self.n_Veh = n_veh
            self.n_neighbor = n_neighbor
            self.time_fast = 0.001
            self.time_slow = 0.1  # update slow fading/vehicle position every 100 ms
            self.bandwidth = int(1e6)  # bandwidth per RB, 1 MHz
            # self.bandwidth = 1500
            self.demand_size = int((4 * 190 + 300) * 8 * 2)  # V2V payload: 1060 Bytes every 100 ms
            # self.demand_size = 20

            self.V2V_Interference_all = np.zeros((self.n_Veh, self.n_neighbor, self.n_RB)) + self.sig2

Adding vehicles: there are two methods, add_new_vehicles (which takes a start position, direction, and velocity) and add_new_vehicles_by_number(n). The latter is interesting: it takes a single argument n, yet it adds 4n vehicles rather than n, one per direction (up/down/left/right), at random positions; a hedged sketch follows.
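A minimal sketch of what add_new_vehicles_by_number(n) might look like, based only on the description above (not copied from the repository; the attribute names self.down_lanes etc., the helper add_new_vehicles, and the 10-15 m/s speed range are assumptions):

    import numpy as np

    def add_new_vehicles_by_number(self, n):
        # Hedged sketch: each call adds 4*n vehicles, one per driving direction,
        # at a random position on a randomly chosen lane of that direction.
        for _ in range(n):
            for direction, lanes in (('d', self.down_lanes), ('u', self.up_lanes),
                                     ('l', self.left_lanes), ('r', self.right_lanes)):
                lane = lanes[np.random.randint(0, len(lanes))]
                if direction in ('u', 'd'):   # vertical street: x fixed to the lane, y random
                    start_position = [lane, np.random.randint(0, int(self.height))]
                else:                         # horizontal street: x random, y fixed to the lane
                    start_position = [np.random.randint(0, int(self.width)), lane]
                self.add_new_vehicles(start_position, direction, np.random.randint(10, 15))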

Updating vehicle positions: renew_positions() (no arguments) iterates over every vehicle and advances its position according to its direction and velocity; at intersections a vehicle turns with some probability, and at the map boundary it is turned clockwise so that it stays inside the map.

Updating neighbors: renew_neighbor(self), already described in the Vehicle section.

Updating channels: renew_channel(self). This defines an important quantity, channels_abs, the sum of path loss and shadow fading (it contains information for all vehicles).

    def renew_channel(self):
        """ Renew slow fading channel """
        self.V2V_pathloss = np.zeros((len(self.vehicles), len(self.vehicles))) + 50 * np.identity(len(self.vehicles))
        self.V2I_pathloss = np.zeros((len(self.vehicles)))
        self.V2V_channels_abs = np.zeros((len(self.vehicles), len(self.vehicles)))
        self.V2I_channels_abs = np.zeros((len(self.vehicles)))
        for i in range(len(self.vehicles)):
            for j in range(i + 1, len(self.vehicles)):
                self.V2V_Shadowing[j][i] = self.V2V_Shadowing[i][j] = self.V2Vchannels.get_shadowing(self.delta_distance[i] + self.delta_distance[j], self.V2V_Shadowing[i][j])
                self.V2V_pathloss[j, i] = self.V2V_pathloss[i][j] = self.V2Vchannels.get_path_loss(self.vehicles[i].position, self.vehicles[j].position)
        self.V2V_channels_abs = self.V2V_pathloss + self.V2V_Shadowing

        self.V2I_Shadowing = self.V2Ichannels.get_shadowing(self.delta_distance, self.V2I_Shadowing)
        for i in range(len(self.vehicles)):
            self.V2I_pathloss[i] = self.V2Ichannels.get_path_loss(self.vehicles[i].position)
        self.V2I_channels_abs = self.V2I_pathloss + self.V2I_Shadowing

Updating the fast-fading channel: renew_channels_fastfading(self). Its value is channels_abs minus a random term; before the subtraction, channels_abs is expanded by one dimension whose size is the number of RBs. A short restatement of the fading model follows the code.

    def renew_channels_fastfading(self):
        """ Renew fast fading channel """
        # 1 2, 3 4 --> 1 1 2 2 3 3 4 4  (repeat each element n_RB times along a new axis)
        V2V_channels_with_fastfading = np.repeat(self.V2V_channels_abs[:, :, np.newaxis], self.n_RB, axis=2)
        # A - 20*log10(|h|)
        self.V2V_channels_with_fastfading = V2V_channels_with_fastfading - 20 * np.log10(
            np.abs(np.random.normal(0, 1, V2V_channels_with_fastfading.shape) + 1j * np.random.normal(0, 1, V2V_channels_with_fastfading.shape)) / math.sqrt(2))

        # 1 2, 3 4 --> 1 1 2 2, 3 3 4 4
        V2I_channels_with_fastfading = np.repeat(self.V2I_channels_abs[:, np.newaxis], self.n_RB, axis=1)
        self.V2I_channels_with_fastfading = V2I_channels_with_fastfading - 20 * np.log10(
            np.abs(np.random.normal(0, 1, V2I_channels_with_fastfading.shape) + 1j * np.random.normal(0, 1, V2I_channels_with_fastfading.shape)) / math.sqrt(2))
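Restating the code above as a formula (no new modeling, just what the two statements compute): for every transmitter-receiver pair and every RB, an independent sample h = (h_r + j h_i)/\sqrt{2} with h_r, h_i ~ N(0, 1) is drawn, so |h| is unit-power Rayleigh, and

    G_{fast}|_{dB} = G_{abs}|_{dB} - 20 \log_{10} |h|

Since channels_abs stores losses in dB, a deep fade (|h| much smaller than 1) increases the stored value, i.e. worsens the link, independently on each RB.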

Computing the reward: Compute_Performance_Reward_Train(self, actions_power). The input here is crucial: it is the RL action, defined in main_marl_train.py as a 3-D array. Reading it as (layer, row, column): one layer per vehicle, one row per neighbor, and two columns holding the RB choice (the RB index) and the power choice (also an index, into V2V_power_dB_List), as shown below:

    for i in range(n_veh):
        for j in range(n_neighbor):
            state_old = get_state(env, [i, j], 1, epsi_final)
            action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
            action_all_testing[i, j, 0] = action % n_RB  # chosen RB
            action_all_testing[i, j, 1] = int(np.floor(action / n_RB))  # power level

The computation proceeds as follows:

  1. Extract the RB choice and the power choice from the action.
  2. Compute the V2I channel capacity V2I_Rate.  # the returned vector has length n_RB, but it really indexes V2I links, since the number of V2I links equals the number of RBs
  3. Compute the V2V channel capacity V2V_Rate.  # one entry per V2V link; this returns the rates of all V2V links
     1. For each RB, find from actions the vehicles sharing that RB.
     2. Compute the capacity in two steps: interference from V2I links onto V2V links, then interference among V2V links.
  4. Update the remaining demand and the remaining time within time_limit.
  5. Generate the reward (reward_elements = V2V_Rate / 10, with entries whose demand has reached 0 set to 1).
  6. Set active_links to 0 wherever the remaining demand is 0 (this is one of only two ways active_links is modified; the other is setting it all to 1 at initialization).
     1. Places where active_links is set to all ones:
        1. in env.py, in new_random_game (called once at the very beginning of *train.py)
        2. at the start of each episode in *train.py, where active_links is set to all ones directly

The code:

    def Compute_Performance_Reward_Train(self, actions_power):
        actions = actions_power[:, :, 0]  # the channel_selection_part
        power_selection = actions_power[:, :, 1]  # power selection

        # ------------ Compute V2I rate --------------------
        V2I_Rate = np.zeros(self.n_RB)
        V2I_Interference = np.zeros(self.n_RB)  # V2I interference
        for i in range(len(self.vehicles)):
            for j in range(self.n_neighbor):
                if not self.active_links[i, j]:
                    continue
                V2I_Interference[actions[i][j]] += 10 ** ((self.V2V_power_dB_List[power_selection[i, j]] - self.V2I_channels_with_fastfading[i, actions[i, j]]
                                                           + self.vehAntGain + self.bsAntGain - self.bsNoiseFigure) / 10)
        self.V2I_Interference = V2I_Interference + self.sig2
        V2I_Signals = 10 ** ((self.V2I_power_dB - self.V2I_channels_with_fastfading.diagonal() + self.vehAntGain + self.bsAntGain - self.bsNoiseFigure) / 10)
        V2I_Rate = np.log2(1 + np.divide(V2I_Signals, self.V2I_Interference))  # V2I channel capacity

        # ------------ Compute V2V rate -------------------------
        V2V_Interference = np.zeros((len(self.vehicles), self.n_neighbor))
        V2V_Signal = np.zeros((len(self.vehicles), self.n_neighbor))
        actions[(np.logical_not(self.active_links))] = -1  # inactive links will not transmit regardless of selected power levels
        for i in range(self.n_RB):  # scanning all bands
            indexes = np.argwhere(actions == i)  # find spectrum-sharing V2Vs
            for j in range(len(indexes)):
                receiver_j = self.vehicles[indexes[j, 0]].destinations[indexes[j, 1]]
                V2V_Signal[indexes[j, 0], indexes[j, 1]] = 10 ** ((self.V2V_power_dB_List[power_selection[indexes[j, 0], indexes[j, 1]]]
                                                                   - self.V2V_channels_with_fastfading[indexes[j][0], receiver_j, i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
                # V2I links interference to V2V links
                V2V_Interference[indexes[j, 0], indexes[j, 1]] = 10 ** ((self.V2I_power_dB - self.V2V_channels_with_fastfading[i, receiver_j, i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
                # V2V interference
                for k in range(j + 1, len(indexes)):  # spectrum-sharing V2Vs
                    receiver_k = self.vehicles[indexes[k][0]].destinations[indexes[k][1]]
                    V2V_Interference[indexes[j, 0], indexes[j, 1]] += 10 ** ((self.V2V_power_dB_List[power_selection[indexes[k, 0], indexes[k, 1]]]
                                                                              - self.V2V_channels_with_fastfading[indexes[k][0]][receiver_j][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
                    V2V_Interference[indexes[k, 0], indexes[k, 1]] += 10 ** ((self.V2V_power_dB_List[power_selection[indexes[j, 0], indexes[j, 1]]]
                                                                              - self.V2V_channels_with_fastfading[indexes[j][0]][receiver_k][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
        self.V2V_Interference = V2V_Interference + self.sig2
        V2V_Rate = np.log2(1 + np.divide(V2V_Signal, self.V2V_Interference))

        self.demand -= V2V_Rate * self.time_fast * self.bandwidth
        self.demand[self.demand < 0] = 0  # eliminate negative demands

        self.individual_time_limit -= self.time_fast

        reward_elements = V2V_Rate / 10
        reward_elements[self.demand <= 0] = 1

        self.active_links[np.multiply(self.active_links, self.demand <= 0)] = 0  # transmission finished, turned to "inactive"

        return V2I_Rate, V2V_Rate, reward_elements
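In equation form, the two rate computations above are as follows (this merely restates the code; the notation is mine). For V2I link m (the V2I transmitter index equals the RB index m):

    \gamma^{V2I}_m = \frac{P^{V2I} g_{m,B}[m]}{\sigma^2 + \sum_{(i,j):\, a_{i,j}=m} P^{V2V}_{i,j}\, g_{i,B}[m]}, \qquad C^{V2I}_m = \log_2(1 + \gamma^{V2I}_m)

For V2V link (i, j) with chosen RB m = a_{i,j} and destination r(i, j):

    \gamma^{V2V}_{i,j} = \frac{P^{V2V}_{i,j}\, g_{i,r(i,j)}[m]}{\sigma^2 + P^{V2I} g_{m,r(i,j)}[m] + \sum_{(k,l)\ne(i,j),\ a_{k,l}=m} P^{V2V}_{k,l}\, g_{k,r(i,j)}[m]}, \qquad C^{V2V}_{i,j} = \log_2(1 + \gamma^{V2V}_{i,j})

All powers and channel gains g here are in linear scale and already include the antenna gains and noise figures from the dB expressions, and σ² = sig2. The rates are in bit/s/Hz, which is why the demand update multiplies V2V_Rate by time_fast and bandwidth to get bits delivered in one 1 ms slot.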

Note: three values are returned here, and the last one is not the final reward; the final reward is a weighted combination built from these quantities.

Acting for training: act_for_training(self, actions) takes the actions and computes the final reward via Compute_Performance_Reward_Train:

    def act_for_training(self, actions):
        action_temp = actions.copy()
        V2I_Rate, V2V_Rate, reward_elements = self.Compute_Performance_Reward_Train(action_temp)
        lambdda = 0.
        reward = lambdda * np.sum(V2I_Rate) / (self.n_Veh * 10) + (1 - lambdda) * np.sum(reward_elements) / (self.n_Veh * self.n_neighbor)
        return reward

Acting for testing: act_for_testing(self, actions) is similar to the above and also uses Compute_Performance_Reward_Train, but it returns V2I_Rate, V2V_success, and V2V_Rate.

    def act_for_testing(self, actions):
        action_temp = actions.copy()
        V2I_Rate, V2V_Rate, reward_elements = self.Compute_Performance_Reward_Train(action_temp)
        V2V_success = 1 - np.sum(self.active_links) / (self.n_Veh * self.n_neighbor)  # V2V success rates
        return V2I_Rate, V2V_success, V2V_Rate

The three quantities above are the per-step outputs within an episode, as can be seen in the testing part of main_marl_train.py, excerpted here:

    for test_step in range(n_step_per_episode):
        # trained models
        action_all_testing = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
        for i in range(n_veh):
            for j in range(n_neighbor):
                state_old = get_state(env, [i, j], 1, epsi_final)
                action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
                action_all_testing[i, j, 0] = action % n_RB  # chosen RB
                action_all_testing[i, j, 1] = int(np.floor(action / n_RB))  # power level

        action_temp = action_all_testing.copy()
        V2I_rate, V2V_success, V2V_rate = env.act_for_testing(action_temp)
        V2I_rate_per_episode.append(np.sum(V2I_rate))  # sum V2I rate in bps
        rate_marl[idx_episode, test_step, :, :] = V2V_rate
        demand_marl[idx_episode, test_step+1, :, :] = env.demand

Computing interference: Compute_Interference(self, actions) accumulates V2V_Interference_all with +=:

    def Compute_Interference(self, actions):
        V2V_Interference = np.zeros((len(self.vehicles), self.n_neighbor, self.n_RB)) + self.sig2

        channel_selection = actions.copy()[:, :, 0]  # column 0 of every layer: RB choices
        power_selection = actions.copy()[:, :, 1]    # column 1 of every layer: power choices
        channel_selection[np.logical_not(self.active_links)] = -1  # mark inactive links as -1

        # interference from V2I links
        for i in range(self.n_RB):
            for k in range(len(self.vehicles)):
                for m in range(len(channel_selection[k, :])):
                    V2V_Interference[k, m, i] += 10 ** ((self.V2I_power_dB - self.V2V_channels_with_fastfading[i][self.vehicles[k].destinations[m]][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)

        # interference from peer V2V links
        for i in range(len(self.vehicles)):
            for j in range(len(channel_selection[i, :])):
                for k in range(len(self.vehicles)):
                    for m in range(len(channel_selection[k, :])):
                        # if i == k or channel_selection[i,j] >= 0:
                        if i == k and j == m or channel_selection[i, j] < 0:
                            continue
                        V2V_Interference[k, m, channel_selection[i, j]] += 10 ** ((self.V2V_power_dB_List[power_selection[i, j]]
                                                                                   - self.V2V_channels_with_fastfading[i][self.vehicles[k].destinations[m]][channel_selection[i, j]] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
        self.V2V_Interference_all = 10 * np.log10(V2V_Interference)

It is used in get_state in main_marl_train.py, to build the V2V_interference component of the state:

    def get_state(env, idx=(0, 0), ind_episode=1., epsi=0.02):
        """ Get state from the environment """
        # includes V2I/V2V fast fading, V2V interference, V2I/V2V channel info (PL + shadowing),
        # remaining time, remaining payload

        # V2I_channel = (env.V2I_channels_with_fastfading[idx[0], :] - 80) / 60
        V2I_fast = (env.V2I_channels_with_fastfading[idx[0], :] - env.V2I_channels_abs[idx[0]] + 10) / 35

        # V2V_channel = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - 80) / 60
        V2V_fast = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] + 10) / 35

        V2V_interference = (-env.V2V_Interference_all[idx[0], idx[1], :] - 60) / 60

        V2I_abs = (env.V2I_channels_abs[idx[0]] - 80) / 60.0
        V2V_abs = (env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] - 80) / 60.0

        load_remaining = np.asarray([env.demand[idx[0], idx[1]] / env.demand_size])
        time_remaining = np.asarray([env.individual_time_limit[idx[0], idx[1]] / env.time_slow])

        # return np.concatenate((np.reshape(V2V_channel, -1), V2V_interference, V2I_abs, V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
        return np.concatenate((V2I_fast, np.reshape(V2V_fast, -1), V2V_interference, np.asarray([V2I_abs]), V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
        # all the physical quantities of interest appear here: V2V_fast, V2I_fast, V2V_interference, V2I_abs, V2V_abs

Some readers may be puzzled here: why is V2V_Interference computed again? Didn't we compute it before? Yes: computing V2V_Rate also requires V2V_Interference, but as far as I can tell that computation is organized by RB assignment, whereas this one simply iterates over every vehicle (and every RB) to build the per-link, per-RB interference used in the state.

ReplayMemory

This part comes from replay_memory.py. There is not much in it: only one class, ReplayMemory. Note that every agent has its own memory, as can be seen in class Agent in main_marl_train.py:

    class Agent(object):
        def __init__(self, memory_entry_size):
            self.discount = 1
            self.double_q = True
            self.memory_entry_size = memory_entry_size
            self.memory = ReplayMemory(self.memory_entry_size)

Initialization: it takes entry_size, the size of each memory entry (the capacity itself is fixed at memory_size = 200000):

    class ReplayMemory:
        def __init__(self, entry_size):
            self.entry_size = entry_size
            self.memory_size = 200000
            self.actions = np.empty(self.memory_size, dtype=np.uint8)
            self.rewards = np.empty(self.memory_size, dtype=np.float64)
            self.prestate = np.empty((self.memory_size, self.entry_size), dtype=np.float16)
            self.poststate = np.empty((self.memory_size, self.entry_size), dtype=np.float16)
            self.batch_size = 2000
            self.count = 0
            self.current = 0

Adding a transition: add(self, prestate, poststate, reward, action). As the signature shows, the arguments are (previous state, next state, reward, action):

    def add(self, prestate, poststate, reward, action):
        self.actions[self.current] = action
        self.rewards[self.current] = reward
        self.prestate[self.current] = prestate
        self.poststate[self.current] = poststate
        self.count = max(self.count, self.current + 1)
        self.current = (self.current + 1) % self.memory_size

Each agent records its own state transition at every time step. The use of add can be seen in the Training part of main_marl_train.py, shown below. There is another for loop over episodes above this one, so in every step of every episode a transition is added for every agent (last line):

    for i_step in range(n_step_per_episode):  # 0.1 / 0.001 = 100 steps per episode
        time_step = i_episode * n_step_per_episode + i_step  # global step counter
        state_old_all = []
        action_all = []
        action_all_training = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
        for i in range(n_veh):
            for j in range(n_neighbor):
                state = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
                state_old_all.append(state)
                action = predict(sesses[i*n_neighbor+j], state, epsi)
                action_all.append(action)
                action_all_training[i, j, 0] = action % n_RB  # chosen RB
                action_all_training[i, j, 1] = int(np.floor(action / n_RB))  # power level

        # All agents take actions simultaneously, obtain shared reward, and update the environment.
        action_temp = action_all_training.copy()
        train_reward = env.act_for_training(action_temp)
        record_reward[time_step] = train_reward

        env.renew_channels_fastfading()
        env.Compute_Interference(action_temp)

        for i in range(n_veh):
            for j in range(n_neighbor):
                state_old = state_old_all[n_neighbor * i + j]
                action = action_all[n_neighbor * i + j]
                state_new = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
                agents[i * n_neighbor + j].memory.add(state_old, state_new, train_reward, action)  # add entry to this agent's memory

Sampling: sample(self). After many calls to add, each agent holds many transitions; during actual training, batch_size of them are drawn at a time:

    def sample(self):
        if self.count < self.batch_size:
            indexes = range(0, self.count)
        else:
            indexes = random.sample(range(0, self.count), self.batch_size)
        prestate = self.prestate[indexes]
        poststate = self.poststate[indexes]
        actions = self.actions[indexes]
        rewards = self.rewards[indexes]
        return prestate, poststate, actions, rewards

Main script: main_marl_train.py

Class Agent: Agent(object). Its constructor takes memory_entry_size and otherwise just stores a few algorithm parameters; note that memory is implemented with ReplayMemory, discussed above.

    class Agent(object):
        def __init__(self, memory_entry_size):
            self.discount = 1
            self.double_q = True
            self.memory_entry_size = memory_entry_size
            self.memory = ReplayMemory(self.memory_entry_size)

Parameter initialization: this part is written inline (no function). It roughly covers the map properties (lane coordinates, overall map size), the numbers of vehicles, neighbors, RBs, and episodes, and some algorithm parameters:

To understand the map parameters up_lanes / down_lanes / left_lanes / right_lanes, note that the system model is the urban case from 3GPP TR 36.885: each street has four lanes (two per direction), each lane is 3.5 m wide, the road-grid size is defined by the distance between yellow center lines and is 433 m x 250 m, and the overall area is 1299 m x 750 m. The simulation scales everything down by a factor of 2 (visible in the /2 in the width and height parameters), which shows up in the lane parameters as the i / 2.0 in the lanes list comprehensions.

Take up_lanes as an example. Since each lane is 3.5 m wide, a vehicle treated as a point moves along the middle of its 3.5 m lane, which is why the 3.5 inside the bracket after `in` is divided by 2; the 3.5 in the second entry moves to the middle of the second same-direction lane; the +250 in the third entry jumps past the building block to the first same-direction lane of the next street, and so on. (A quick numeric check of these values follows the parameter listing below.)

    up_lanes = [i/2.0 for i in [3.5/2, 3.5 + 3.5/2, 250+3.5/2, 250+3.5+3.5/2, 500+3.5/2, 500+3.5+3.5/2]]
    down_lanes = [i/2.0 for i in [250-3.5-3.5/2, 250-3.5/2, 500-3.5-3.5/2, 500-3.5/2, 750-3.5-3.5/2, 750-3.5/2]]
    left_lanes = [i/2.0 for i in [3.5/2, 3.5/2 + 3.5, 433+3.5/2, 433+3.5+3.5/2, 866+3.5/2, 866+3.5+3.5/2]]
    right_lanes = [i/2.0 for i in [433-3.5-3.5/2, 433-3.5/2, 866-3.5-3.5/2, 866-3.5/2, 1299-3.5-3.5/2, 1299-3.5/2]]

    width = 750/2
    height = 1298/2

    IS_TRAIN = 1
    IS_TEST = 1 - IS_TRAIN

    label = 'marl_model'
    n_veh = 4
    n_neighbor = 1
    n_RB = n_veh

    env = Environment_marl.Environ(down_lanes, up_lanes, left_lanes, right_lanes, width, height, n_veh, n_neighbor)
    env.new_random_game()  # initialize parameters in env

    # n_episode = 3000
    n_episode = 600
    n_step_per_episode = int(env.time_slow / env.time_fast)  # slow = 0.1, fast = 0.001
    epsi_final = 0.02
    epsi_anneal_length = int(0.8 * n_episode)
    mini_batch_step = n_step_per_episode
    target_update_step = n_step_per_episode * 4

    n_episode_test = 100  # test episodes
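As a quick numeric check of the lane-geometry explanation above (illustrative only, not repository code), the first few up_lanes values are the half-scaled lane centers:

    up_lanes = [i / 2.0 for i in [3.5/2, 3.5 + 3.5/2, 250 + 3.5/2,
                                  250 + 3.5 + 3.5/2, 500 + 3.5/2, 500 + 3.5 + 3.5/2]]
    print(up_lanes)
    # [0.875, 2.625, 125.875, 127.625, 250.875, 252.625]
    # 0.875   = (3.5/2)/2       -> center of the first upward lane (after halving the map)
    # 2.625   = (3.5 + 3.5/2)/2 -> center of the second upward lane of the same street
    # 125.875 = (250 + 3.5/2)/2 -> first upward lane of the next street, one 250 m block over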

Getting the state: get_state(env, idx=(0,0), ind_episode=1., epsi=0.02). The input is the environment env (plus a link index), and the returned state contains:

  1. V2V_fast: (PL + shadowing) minus the random fading term (see "Basic class definitions -- Environ -- Updating the fast-fading channel" above)
  2. V2I_fast: same as above
  3. V2V_interference (see "Basic class definitions -- Environ -- Computing interference" above)
  4. V2I_abs: (PL + shadowing)
  5. V2V_abs: (PL + shadowing)

Note the -80 and /60 applied to V2I_abs (and the similar offsets and scales elsewhere) in the code. The code author's explanation from the GitHub discussion board:

"This is to roughly normalize DQN inputs for the ease of training. The numbers are obtained from several trial runs."
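Concretely (my own arithmetic, only to illustrate the scaling): with

    \tilde{g} = (g_{dB} - 80) / 60

a large-scale channel value of 80 dB maps to 0 and 140 dB maps to 1, so typical path-loss-plus-shadowing values land roughly in [0, 1].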

    def get_state(env, idx=(0, 0), ind_episode=1., epsi=0.02):
        """ Get state from the environment """
        # includes V2I/V2V fast fading, V2V interference, V2I/V2V channel info (PL + shadowing),
        # remaining time, remaining payload

        # V2I_channel = (env.V2I_channels_with_fastfading[idx[0], :] - 80) / 60
        V2I_fast = (env.V2I_channels_with_fastfading[idx[0], :] - env.V2I_channels_abs[idx[0]] + 10) / 35

        # V2V_channel = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - 80) / 60
        V2V_fast = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] + 10) / 35

        V2V_interference = (-env.V2V_Interference_all[idx[0], idx[1], :] - 60) / 60

        V2I_abs = (env.V2I_channels_abs[idx[0]] - 80) / 60.0
        V2V_abs = (env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] - 80) / 60.0

        load_remaining = np.asarray([env.demand[idx[0], idx[1]] / env.demand_size])
        time_remaining = np.asarray([env.individual_time_limit[idx[0], idx[1]] / env.time_slow])

        # return np.concatenate((np.reshape(V2V_channel, -1), V2V_interference, V2I_abs, V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
        return np.concatenate((V2I_fast, np.reshape(V2V_fast, -1), V2V_interference, np.asarray([V2I_abs]), V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))

Defining the NN:

    with g.as_default():
        # ============== Training network ========================
        x = tf.placeholder(tf.float32, [None, n_input])  # input
        w_1 = tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1))
        w_2 = tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.1))
        w_3 = tf.Variable(tf.truncated_normal([n_hidden_2, n_hidden_3], stddev=0.1))
        w_4 = tf.Variable(tf.truncated_normal([n_hidden_3, n_output], stddev=0.1))
        b_1 = tf.Variable(tf.truncated_normal([n_hidden_1], stddev=0.1))
        b_2 = tf.Variable(tf.truncated_normal([n_hidden_2], stddev=0.1))
        b_3 = tf.Variable(tf.truncated_normal([n_hidden_3], stddev=0.1))
        b_4 = tf.Variable(tf.truncated_normal([n_output], stddev=0.1))

        layer_1 = tf.nn.relu(tf.add(tf.matmul(x, w_1), b_1))
        layer_1_b = tf.layers.batch_normalization(layer_1)
        layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1_b, w_2), b_2))
        layer_2_b = tf.layers.batch_normalization(layer_2)
        layer_3 = tf.nn.relu(tf.add(tf.matmul(layer_2_b, w_3), b_3))
        layer_3_b = tf.layers.batch_normalization(layer_3)
        y = tf.nn.relu(tf.add(tf.matmul(layer_3_b, w_4), b_4))
        g_q_action = tf.argmax(y, axis=1)

        # compute loss
        g_target_q_t = tf.placeholder(tf.float32, None, name="target_value")
        g_action = tf.placeholder(tf.int32, None, name='g_action')
        action_one_hot = tf.one_hot(g_action, n_output, 1.0, 0.0, name='action_one_hot')
        q_acted = tf.reduce_sum(y * action_one_hot, reduction_indices=1, name='q_acted')
        g_loss = tf.reduce_mean(tf.square(g_target_q_t - q_acted), name='g_loss')  # TD error
        optim = tf.train.RMSPropOptimizer(learning_rate=0.001, momentum=0.95, epsilon=0.01).minimize(g_loss)  # gradient descent (RMSProp)

        # ==================== Prediction network ========================
        x_p = tf.placeholder(tf.float32, [None, n_input])  # input
        w_1_p = tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1))
        w_2_p = tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.1))
        w_3_p = tf.Variable(tf.truncated_normal([n_hidden_2, n_hidden_3], stddev=0.1))
        w_4_p = tf.Variable(tf.truncated_normal([n_hidden_3, n_output], stddev=0.1))
        b_1_p = tf.Variable(tf.truncated_normal([n_hidden_1], stddev=0.1))
        b_2_p = tf.Variable(tf.truncated_normal([n_hidden_2], stddev=0.1))
        b_3_p = tf.Variable(tf.truncated_normal([n_hidden_3], stddev=0.1))
        b_4_p = tf.Variable(tf.truncated_normal([n_output], stddev=0.1))

        layer_1_p = tf.nn.relu(tf.add(tf.matmul(x_p, w_1_p), b_1_p))
        layer_1_p_b = tf.layers.batch_normalization(layer_1_p)
        layer_2_p = tf.nn.relu(tf.add(tf.matmul(layer_1_p_b, w_2_p), b_2_p))
        layer_2_p_b = tf.layers.batch_normalization(layer_2_p)
        layer_3_p = tf.nn.relu(tf.add(tf.matmul(layer_2_p_b, w_3_p), b_3_p))
        layer_3_p_b = tf.layers.batch_normalization(layer_3_p)
        y_p = tf.nn.relu(tf.add(tf.matmul(layer_3_p_b, w_4_p), b_4_p))

        g_target_q_idx = tf.placeholder('int32', [None, None], 'output_idx')  # input: an (n, 2) list of indices
        target_q_with_idx = tf.gather_nd(y_p, g_target_q_idx)  # gather the selected Q-values from y_p

        init = tf.global_variables_initializer()
        saver = tf.train.Saver()

Here only the overall structure is described; for the details see the "Sampling and computing the loss" section below, which explains the network structure together with the algorithm.

Overall there are three parts: the training network, the loss computation, and the prediction network, denoted N1, N2, and N3. N1 and N3 have identical structure; they are the DQN networks that output Q-values. The difference is that N1 is updated at every iteration while N3 is only updated periodically (the target network). N2 operates on N1's outputs and computes the loss that drives N1's iterative updates.

Prediction: predict(sess, s_t, ep, test_ep=False). This function drives the NN and produces an action:

    def predict(sess, s_t, ep, test_ep=False):
        n_power_levels = len(env.V2V_power_dB_List)
        if np.random.rand() < ep and not test_ep:
            pred_action = np.random.randint(n_RB * n_power_levels)
        else:
            pred_action = sess.run(g_q_action, feed_dict={x: [s_t]})[0]
        return pred_action

The action here is a single int, but it encodes both the RB and the power level; it appears later in both the Training and Testing parts, and is decoded as follows:

    action = predict(sesses[i*n_neighbor+j], state, epsi)
    action_all.append(action)
    action_all_training[i, j, 0] = action % n_RB  # chosen RB
    action_all_training[i, j, 1] = int(np.floor(action / n_RB))  # power level
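A quick worked example of this encoding (my own illustration, using the default n_RB = 4 and the four power levels defined in Environ):

    n_RB = 4                        # default: one RB per vehicle
    n_power_levels = 4              # len([23, 15, 5, -100])

    action = 9                      # one of the 16 possible joint actions (0..15)
    rb = action % n_RB              # 9 % 4  = 1  -> RB index 1
    power_level = action // n_RB    # 9 // 4 = 2  -> V2V_power_dB_List[2] = 5 dBm
    print(rb, power_level)          # 1 2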

Sampling and computing the loss: q_learning_mini_batch(current_agent, current_sess) takes a single agent and uses the memory's sample method discussed above. Double Q-learning is also handled here.

    def q_learning_mini_batch(current_agent, current_sess):
        """ Training a sampled mini-batch """
        batch_s_t, batch_s_t_plus_1, batch_action, batch_reward = current_agent.memory.sample()

        if current_agent.double_q:  # double q-learning
            pred_action = current_sess.run(g_q_action, feed_dict={x: batch_s_t_plus_1})
            q_t_plus_1 = current_sess.run(target_q_with_idx, {x_p: batch_s_t_plus_1, g_target_q_idx: [[idx, pred_a] for idx, pred_a in enumerate(pred_action)]})
            batch_target_q_t = current_agent.discount * q_t_plus_1 + batch_reward
        else:
            q_t_plus_1 = current_sess.run(y_p, {x_p: batch_s_t_plus_1})
            max_q_t_plus_1 = np.max(q_t_plus_1, axis=1)
            batch_target_q_t = current_agent.discount * max_q_t_plus_1 + batch_reward

        _, loss_val = current_sess.run([optim, g_loss], {g_target_q_t: batch_target_q_t, g_action: batch_action, x: batch_s_t})
        return loss_val

Update (Apr. 23): this function needs to be read together with the NN structure, and I find it a bit involved. As the if suggests, it covers both plain DQN and double Q-learning; note that both branches only compute the target part. Feeding the training network (the one in the upper-left of the algorithm diagram) and its iterative update are done by the final line:

    _, loss_val = current_sess.run([optim, g_loss], {g_target_q_t: batch_target_q_t, g_action: batch_action, x: batch_s_t})

This code is easier to follow alongside the diagrams in this post, so the algorithm schematics and the code flowcharts are included here (the code diagrams were drawn by the author in Visio, do not follow a standard notation, and may contain errors).

Plain DQN

(Figure: plain DQN's policy-update flow, from Prof. Hung-yi Lee's slides)

(Figure: data-flow diagram of plain DQN in the code; red labels are the actual arguments)

Double DQN

(Figure: illustration of the Double DQN algorithm, from another blog post)
(Figure: data-flow diagram of double DQN in the code)

The difference from plain DQN lies in how the target is formed: plain DQN builds the target directly from the prediction network (the "predict / updated periodically" box in the figure) followed by a max, whereas double DQN cascades the training network and the prediction network; the training network selects the next action and the prediction network evaluates it.
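In equation form, the two branches of q_learning_mini_batch compute the standard targets (θ are the training-network weights, θ⁻ the prediction/target-network weights, γ the discount, which is 1 in this code):

    y_{DQN}        = r + γ max_{a'} Q(s', a'; θ⁻)
    y_{DoubleDQN}  = r + γ Q(s', argmax_{a'} Q(s', a'; θ); θ⁻)

and the final session.run minimizes the mini-batch loss L(θ) = E[(y - Q(s, a; θ))²] with RMSProp.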

Training loop

for each episode (one full episode iteration):

  • Determine epsi from the episode index (annealed downward, then held constant)
  • Every 100 episodes, update positions, neighbors, fast fading, and channels.
  • Initialize demand, time_limit, and active_links (all ones)
  • for i_step in the episode (each step of the episode):
    • Initialize state_old_all, action_all, action_all_training

    • for loop over every link:
      • get the state of this link [per link]
      • get the action via predict (encodes both RB and power) [per link]

      • store it in action_all_training = [vehicle, neighbor, RB/power] [the per-link result is stored here]

    • get the reward via act_for_training [this is over all links]; for SARL, the reward computation moves inside the per-link loop above, everything else stays the same

    • append the reward to record_reward

    • update the fast fading

    • compute the interference from the actions

    • for loop over every link:

      • compute the new state

      • add (state_old, state_new, train_reward, action) to this agent's memory [so each memory entry is per link]

      • every mini_batch_step steps: compute the loss via q_learning_mini_batch

      • every target_update_step steps: update the target Q-network

    record_reward = np.zeros([n_episode*n_step_per_episode, 1])
    record_loss = []
    if IS_TRAIN:
        for i_episode in range(n_episode):
            print("-------------------------")
            print('Episode:', i_episode)
            if i_episode < epsi_anneal_length:
                epsi = 1 - i_episode * (1 - epsi_final) / (epsi_anneal_length - 1)  # epsilon decreases over each episode
            else:
                epsi = epsi_final

            # every 100 episodes, update positions, neighbors, slow fading and fast fading
            if i_episode % 100 == 0:
                env.renew_positions()  # update vehicle position
                env.renew_neighbor()
                env.renew_channel()  # update channel slow fading
                env.renew_channels_fastfading()  # update channel fast fading

            env.demand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
            env.individual_time_limit = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
            env.active_links = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')

            for i_step in range(n_step_per_episode):  # 0.1 / 0.001 = 100 steps per episode
                time_step = i_episode * n_step_per_episode + i_step  # global step counter
                state_old_all = []
                action_all = []
                action_all_training = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
                for i in range(n_veh):
                    for j in range(n_neighbor):
                        state = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
                        state_old_all.append(state)
                        action = predict(sesses[i*n_neighbor+j], state, epsi)
                        action_all.append(action)
                        action_all_training[i, j, 0] = action % n_RB  # chosen RB
                        action_all_training[i, j, 1] = int(np.floor(action / n_RB))  # power level

                # All agents take actions simultaneously, obtain shared reward, and update the environment.
                action_temp = action_all_training.copy()
                train_reward = env.act_for_training(action_temp)
                record_reward[time_step] = train_reward

                env.renew_channels_fastfading()
                env.Compute_Interference(action_temp)

                for i in range(n_veh):
                    for j in range(n_neighbor):
                        state_old = state_old_all[n_neighbor * i + j]
                        action = action_all[n_neighbor * i + j]
                        state_new = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
                        agents[i * n_neighbor + j].memory.add(state_old, state_new, train_reward, action)  # add entry to this agent's memory

                        # training this agent
                        if time_step % mini_batch_step == mini_batch_step - 1:
                            loss_val_batch = q_learning_mini_batch(agents[i*n_neighbor+j], sesses[i*n_neighbor+j])
                            record_loss.append(loss_val_batch)
                            if i == 0 and j == 0:
                                print('step:', time_step, 'agent', i*n_neighbor+j, 'loss', loss_val_batch)
                        if time_step % target_update_step == target_update_step - 1:
                            update_target_q_network(sesses[i*n_neighbor+j])
                            if i == 0 and j == 0:
                                print('Update target Q network...')

        print('Training Done. Saving models...')
        for i in range(n_veh):
            for j in range(n_neighbor):
                model_path = label + '/agent_' + str(i * n_neighbor + j)
                save_models(sesses[i * n_neighbor + j], model_path)

        current_dir = os.path.dirname(os.path.realpath(__file__))
        reward_path = os.path.join(current_dir, "model/" + label + '/reward.mat')
        scipy.io.savemat(reward_path, {'reward': record_reward})
        record_loss = np.asarray(record_loss).reshape((-1, n_veh*n_neighbor))
        loss_path = os.path.join(current_dir, "model/" + label + '/train_loss.mat')
        scipy.io.savemat(loss_path, {'train_loss': record_loss})

Testing loop

First, load the models saved during training.

for each episode (one full episode iteration):

  • Update positions, neighbors, fast fading, and channels.
  • Initialize demand, time_limit, and active_links (all ones)
  • for i_step in the episode (each step of the episode):
    • Initialize state_old_all, action_all, action_all_testing

    • get the action via predict (encodes both RB and power)

    • store it in action_all_testing = [vehicle, neighbor, RB/power]

    • get V2I_rate, V2V_success, V2V_rate via act_for_testing

    • sum V2I_rate and append it to V2I_rate_per_episode

    • store V2V_rate into rate_marl

    • update demand

    if IS_TEST:
        print("\nRestoring the model...")
        for i in range(n_veh):
            for j in range(n_neighbor):
                model_path = label + '/agent_' + str(i * n_neighbor + j)
                load_models(sesses[i * n_neighbor + j], model_path)

        V2I_rate_list = []
        V2V_success_list = []
        V2I_rate_list_rand = []
        V2V_success_list_rand = []
        rate_marl = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])
        rate_rand = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])
        demand_marl = env.demand_size * np.ones([n_episode_test, n_step_per_episode+1, n_veh, n_neighbor])
        demand_rand = env.demand_size * np.ones([n_episode_test, n_step_per_episode+1, n_veh, n_neighbor])
        power_rand = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])

        for idx_episode in range(n_episode_test):
            print('----- Episode', idx_episode, '-----')

            env.renew_positions()
            env.renew_neighbor()
            env.renew_channel()
            env.renew_channels_fastfading()

            env.demand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
            env.individual_time_limit = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
            env.active_links = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')

            env.demand_rand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
            env.individual_time_limit_rand = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
            env.active_links_rand = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')

            V2I_rate_per_episode = []
            V2I_rate_per_episode_rand = []

            for test_step in range(n_step_per_episode):
                # trained models
                action_all_testing = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
                for i in range(n_veh):
                    for j in range(n_neighbor):
                        state_old = get_state(env, [i, j], 1, epsi_final)
                        action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
                        action_all_testing[i, j, 0] = action % n_RB  # chosen RB
                        action_all_testing[i, j, 1] = int(np.floor(action / n_RB))  # power level

                action_temp = action_all_testing.copy()
                V2I_rate, V2V_success, V2V_rate = env.act_for_testing(action_temp)
                V2I_rate_per_episode.append(np.sum(V2I_rate))  # sum V2I rate in bps
                rate_marl[idx_episode, test_step, :, :] = V2V_rate
                demand_marl[idx_episode, test_step+1, :, :] = env.demand

                # random baseline
                action_rand = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
                action_rand[:, :, 0] = np.random.randint(0, n_RB, [n_veh, n_neighbor])  # band
                action_rand[:, :, 1] = np.random.randint(0, len(env.V2V_power_dB_List), [n_veh, n_neighbor])  # power

                V2I_rate_rand, V2V_success_rand, V2V_rate_rand = env.act_for_testing_rand(action_rand)
                V2I_rate_per_episode_rand.append(np.sum(V2I_rate_rand))  # sum V2I rate in bps
                rate_rand[idx_episode, test_step, :, :] = V2V_rate_rand
                demand_rand[idx_episode, test_step+1, :, :] = env.demand_rand

                for i in range(n_veh):
                    for j in range(n_neighbor):
                        power_rand[idx_episode, test_step, i, j] = env.V2V_power_dB_List[int(action_rand[i, j, 1])]

                # update the environment and compute interference
                env.renew_channels_fastfading()
                env.Compute_Interference(action_temp)

                if test_step == n_step_per_episode - 1:
                    V2V_success_list.append(V2V_success)
                    V2V_success_list_rand.append(V2V_success_rand)

            V2I_rate_list.append(np.mean(V2I_rate_per_episode))
            V2I_rate_list_rand.append(np.mean(V2I_rate_per_episode_rand))

            print(round(np.average(V2I_rate_per_episode), 2), 'rand', round(np.average(V2I_rate_per_episode_rand), 2))
            print(V2V_success_list[idx_episode], 'rand', V2V_success_list_rand[idx_episode])

References

[1] 3GPP TR 36.885.

[2] 5G Mobile Communication Technology (《5G移动通信技术》).
