Original paper: Spectrum Sharing in Vehicular Networks Based on Multi-Agent Reinforcement Learning
Paper translation & notes: [Paper notes] Spectrum Sharing in Vehicular Networks Based on Multi-Agent Reinforcement Learning
Code: https://github.com/le-liang/MARLspectrumSharingV2X
Visio flowcharts used in this post (drawn by the author; corrections are welcome): https://download.csdn.net/download/m0_37495408/12353933
(From the original authors' GitHub README.)
Environment_marl.py defines the four basic classes of the framework: V2Vchannels, V2Ichannels, Vehicle, and Environ. Environ has by far the most methods, Vehicle has only a few attributes and no methods, and the other two classes each have two methods (one computing path loss, one computing shadow fading).
Vehicle: its constructor takes three arguments: start position, start direction, and velocity. Internally it also defines two lists, neighbors and destinations, holding the neighboring vehicles and the V2V receivers respectively (numerically the two are identical here, because each vehicle's V2V destinations are defined to be its neighbors).
- class Vehicle:
- # Vehicle simulator: include all the information for a vehicle
-
- def __init__(self, start_position, start_direction, velocity):
- self.position = start_position
- self.direction = start_direction
- self.velocity = velocity
- self.neighbors = []
- self.destinations = []
The meaning of destinations can be seen from the code below:
- def renew_neighbor(self): # note: this method is defined in class Environ
- """ Determine the neighbors of each vehicle """
-
- for i in range(len(self.vehicles)):
- self.vehicles[i].neighbors = []
- self.vehicles[i].actions = []
- z = np.array([[complex(c.position[0], c.position[1]) for c in self.vehicles]])
- Distance = abs(z.T - z)
-
- for i in range(len(self.vehicles)):
- sort_idx = np.argsort(Distance[:, i])
- for j in range(self.n_neighbor):
- self.vehicles[i].neighbors.append(sort_idx[j + 1])
- destination = self.vehicles[i].neighbors
-
- self.vehicles[i].destinations = destination
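The complex-number trick in renew_neighbor is worth a note: each (x, y) position is packed into a complex number, so abs(z.T - z) produces the full matrix of pairwise Euclidean distances in one broadcasted operation. A minimal standalone illustration (toy positions, not taken from the simulator):
- import numpy as np
- 
- # toy (x, y) positions of three vehicles
- positions = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
- z = np.array([[complex(x, y) for x, y in positions]])  # shape (1, 3)
- Distance = abs(z.T - z)  # broadcasts to (3, 3): |p_i - p_j| for every pair
- print(Distance)
- # [[ 0.  5. 10.]
- #  [ 5.  0.  5.]
- #  [10.  5.  0.]]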
V2Vchannels: the BS and MS antenna heights are both set to 1.5 m and the shadowing standard deviation to 3 dB, following TR 36.885 A.1.4-1; the carrier frequency fc is 2 (in GHz).
- class V2Vchannels:
- # Simulator of the V2V Channels
-
- def __init__(self):
- self.t = 0
- self.h_bs = 1.5
- self.h_ms = 1.5
- self.fc = 2
- self.decorrelation_distance = 10
- self.shadow_std = 3
It contains two methods:
Path loss calculation:
- def get_path_loss(self, position_A, position_B):
- d1 = abs(position_A[0] - position_B[0])
- d2 = abs(position_A[1] - position_B[1])
- d = math.hypot(d1, d2) + 0.001 # sqrt(x*x + y*y)
- # the next line defines the effective breakpoint distance d_bp
- d_bp = 4 * (self.h_bs - 1) * (self.h_ms - 1) * self.fc * (10 ** 9) / (3 * 10 ** 8)
-
- def PL_Los(d):
- if d <= 3:
- return 22.7 * np.log10(3) + 41 + 20 * np.log10(self.fc / 5)
- else:
- if d < d_bp:
- return 22.7 * np.log10(d) + 41 + 20 * np.log10(self.fc / 5)
- else:
- return 40.0 * np.log10(d) + 9.45 - 17.3 * np.log10(self.h_bs) - 17.3 * np.log10(self.h_ms) + 2.7 * np.log10(self.fc / 5)
-
- def PL_NLos(d_a, d_b):
- n_j = max(2.8 - 0.0024 * d_b, 1.84)
- return PL_Los(d_a) + 20 - 12.5 * n_j + 10 * n_j * np.log10(d_b) + 3 * np.log10(self.fc / 5)
-
- if min(d1, d2) < 7:
- PL = PL_Los(d)
- else:
- PL = min(PL_NLos(d1, d2), PL_NLos(d2, d1))
- return PL # + self.shadow_std * np.random.normal()
Note: the code above follows the stochastic channel model described in [2], p. 328.
The path loss uses the Manhattan-grid LOS model:
$PL_{\text{LOS}}(d) = 22.7\log_{10}(d) + 41 + 20\log_{10}(f_c/5)$, for $3\,\text{m} \le d < d'_{BP}$
$PL_{\text{LOS}}(d) = 40\log_{10}(d) + 9.45 - 17.3\log_{10}(h'_{BS}) - 17.3\log_{10}(h'_{MS}) + 2.7\log_{10}(f_c/5)$, for $d \ge d'_{BP}$
The path-loss exponents before and after the breakpoint are $n_1 \approx 2.27$ and $n_2 = 4.0$ (the coefficients 22.7 and 40 divided by 10), and $d'_{BP} = 4\,h'_{BS}\,h'_{MS}\,f_c/c$ is the effective breakpoint distance (d_bp in the code), with effective antenna heights $h' = h - 1$ m.
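As a quick sanity check of the d_bp line: with $h_{BS} = h_{MS} = 1.5$ m and $f_c = 2$ GHz,
$d'_{BP} = 4 \times (1.5 - 1) \times (1.5 - 1) \times \dfrac{2 \times 10^9}{3 \times 10^8} \approx 6.7\ \text{m},$
so with these antenna heights the second LOS branch applies at essentially all distances beyond a few metres.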
The Manhattan-grid NLOS model is
$PL_{\text{NLOS}}(d_1, d_2) = PL_{\text{LOS}}(d_1) + 20 - 12.5\,n_j + 10\,n_j\log_{10}(d_2) + 3\log_{10}(f_c/5)$, with $n_j = \max(2.8 - 0.0024\,d_2,\ 1.84)$,
where $d_1$ is the distance along the transmitter's street and $d_2$ the distance along the perpendicular street.
The min() over the two argument orders near the end of the code is described in [2], p. 344: it is how the path loss is estimated when the receiver may lie on either of the two perpendicular streets.
The formulas in the code come from IST-4-027756 WINNER II D1.1.2 V1.2, whose parameter table matches the constants used in the code exactly.
Shadow fading update:
- def get_shadowing(self, delta_distance, shadowing):
- return np.exp(-1 * (delta_distance / self.decorrelation_distance)) * shadowing \
- + math.sqrt(1 - np.exp(-2 * (delta_distance / self.decorrelation_distance))) * np.random.normal(0, 3) # standard dev is 3 db
This update rule comes from [1], the text following the channel model table in Section A.1.4.
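Written out, the update implemented by get_shadowing is the first-order (exponentially correlated) log-normal shadowing model, matching the code above:
$S_{\text{new}} = e^{-\Delta d / d_{\text{corr}}}\, S_{\text{old}} + \sqrt{1 - e^{-2\Delta d / d_{\text{corr}}}}\; n, \qquad n \sim \mathcal{N}(0, \sigma^2),$
with decorrelation distance $d_{\text{corr}} = 10$ m and $\sigma = 3$ dB for the V2V links.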
V2Ichannels contains the same two methods as V2Vchannels, but its path-loss computation no longer distinguishes LOS from NLOS:
- def get_path_loss(self, position_A):
- d1 = abs(position_A[0] - self.BS_position[0])
- d2 = abs(position_A[1] - self.BS_position[1])
- distance = math.hypot(d1, d2)
- return 128.1 + 37.6 * np.log10(math.sqrt(distance ** 2 + (self.h_bs - self.h_ms) ** 2) / 1000) # + self.shadow_std * np.random.normal()
-
- def get_shadowing(self, delta_distance, shadowing):
- nVeh = len(shadowing)
- self.R = np.sqrt(0.5 * np.ones([nVeh, nVeh]) + 0.5 * np.identity(nVeh))
- return np.multiply(np.exp(-1 * (delta_distance / self.Decorrelation_distance)), shadowing) \
- + np.sqrt(1 - np.exp(-2 * (delta_distance / self.Decorrelation_distance))) * np.random.normal(0, 8, nVeh)
Both methods above implement [1], Table A.1.4-2 and the explanation that follows it.
Environ: the constructor takes four lists of lane coordinates (down_lane, up_lane, left_lane, right_lane, i.e. the downward/upward/leftward/rightward lanes), the map width and height, and the numbers of vehicles and neighbors. Besides these, it holds quite a few internal parameters, shown below:
- class Environ:
- def __init__(self, down_lane, up_lane, left_lane, right_lane, width, height, n_veh, n_neighbor):
- self.V2Vchannels = V2Vchannels()
- self.V2Ichannels = V2Ichannels()
- self.vehicles = []
-
- self.demand = []
- self.V2V_Shadowing = []
- self.V2I_Shadowing = []
- self.delta_distance = []
- self.V2V_channels_abs = []
- self.V2I_channels_abs = []
-
- self.V2I_power_dB = 23 # dBm
- self.V2V_power_dB_List = [23, 15, 5, -100] # the power levels
- self.V2I_power = 10 ** (self.V2I_power_dB)
- self.sig2_dB = -114
- self.bsAntGain = 8
- self.bsNoiseFigure = 5
- self.vehAntGain = 3
- self.vehNoiseFigure = 9
- self.sig2 = 10 ** (self.sig2_dB / 10)
-
- self.n_RB = n_veh
- self.n_Veh = n_veh
- self.n_neighbor = n_neighbor
- self.time_fast = 0.001
- self.time_slow = 0.1 # update slow fading/vehicle position every 100 ms
- self.bandwidth = int(1e6) # bandwidth per RB, 1 MHz
- # self.bandwidth = 1500
- self.demand_size = int((4 * 190 + 300) * 8 * 2) # V2V payload: 1060 Bytes every 100 ms
- # self.demand_size = 20
-
- self.V2V_Interference_all = np.zeros((self.n_Veh, self.n_neighbor, self.n_RB)) + self.sig2
Adding vehicles: there are two methods: add_new_vehicles (takes a start position, direction, and velocity) and add_new_vehicles_by_number(n). The latter is the interesting one: it takes a single argument n, but it does not add n vehicles; it adds 4n vehicles, one per driving direction (down/up/left/right) per iteration, each at a random position (see the sketch below).
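For concreteness, here is a minimal sketch of what these two methods plausibly look like, reconstructed from the description above rather than copied from the repository (the lane lists, the 'd'/'u'/'l'/'r' direction labels and the 10-15 m/s velocity range are assumptions):
- def add_new_vehicles(self, start_position, start_direction, start_velocity):
-     self.vehicles.append(Vehicle(start_position, start_direction, start_velocity))
- 
- def add_new_vehicles_by_number(self, n):
-     # each call adds 4*n vehicles: one per driving direction, each on a
-     # randomly chosen lane of that direction at a random point along it
-     for _ in range(n):
-         ind = np.random.randint(0, len(self.down_lanes))
-         self.add_new_vehicles([self.down_lanes[ind], np.random.randint(0, self.height)], 'd', np.random.randint(10, 15))
-         ind = np.random.randint(0, len(self.up_lanes))
-         self.add_new_vehicles([self.up_lanes[ind], np.random.randint(0, self.height)], 'u', np.random.randint(10, 15))
-         ind = np.random.randint(0, len(self.left_lanes))
-         self.add_new_vehicles([np.random.randint(0, self.width), self.left_lanes[ind]], 'l', np.random.randint(10, 15))
-         ind = np.random.randint(0, len(self.right_lanes))
-         self.add_new_vehicles([np.random.randint(0, self.width), self.right_lanes[ind]], 'r', np.random.randint(10, 15))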
Updating vehicle positions: renew_positions() (no arguments) iterates over all vehicles and advances each one according to its direction and speed; at an intersection a vehicle turns clockwise with a certain probability, and when it reaches the map boundary it turns clockwise so as to stay on the map.
Updating neighbors: renew_neighbor(self), already described in the Vehicle section above.
Updating the slow-fading channel: renew_channel(self) defines an important quantity, channels_abs: the sum of path loss and shadowing in dB [it covers every vehicle pair and every vehicle-to-BS link].
- def renew_channel(self):
- """ Renew slow fading channel """
-
- self.V2V_pathloss = np.zeros((len(self.vehicles), len(self.vehicles))) + 50 * np.identity(len(self.vehicles))
- self.V2I_pathloss = np.zeros((len(self.vehicles)))
- self.V2V_channels_abs = np.zeros((len(self.vehicles), len(self.vehicles)))
- self.V2I_channels_abs = np.zeros((len(self.vehicles)))
- for i in range(len(self.vehicles)):
- for j in range(i + 1, len(self.vehicles)):
- self.V2V_Shadowing[j][i] = self.V2V_Shadowing[i][j] = self.V2Vchannels.get_shadowing(self.delta_distance[i] + self.delta_distance[j], self.V2V_Shadowing[i][j])
- self.V2V_pathloss[j,i] = self.V2V_pathloss[i][j] = self.V2Vchannels.get_path_loss(self.vehicles[i].position, self.vehicles[j].position)
- self.V2V_channels_abs = self.V2V_pathloss + self.V2V_Shadowing
-
- self.V2I_Shadowing = self.V2Ichannels.get_shadowing(self.delta_distance, self.V2I_Shadowing)
- for i in range(len(self.vehicles)):
- self.V2I_pathloss[i] = self.V2Ichannels.get_path_loss(self.vehicles[i].position)
- self.V2I_channels_abs = self.V2I_pathloss + self.V2I_Shadowing
Updating the fast-fading channel: renew_channels_fastfading(self) starts from channels_abs, adds a new dimension with one layer per RB, and then subtracts a random fast-fading term (20·log10 of the magnitude of a unit-power complex Gaussian, i.e. Rayleigh fading) in dB.
- def renew_channels_fastfading(self):
- """ Renew fast fading channel """
-
- # 1 2, 3 4 --> 1 1 2 2 3 3 4 4: repeat element-wise along a new RB axis
- V2V_channels_with_fastfading = np.repeat(self.V2V_channels_abs[:, :, np.newaxis], self.n_RB, axis=2)
- # A - 20*log10(|h|): subtract the fast-fading term in dB
- self.V2V_channels_with_fastfading = V2V_channels_with_fastfading - 20 * np.log10(
- np.abs(np.random.normal(0, 1, V2V_channels_with_fastfading.shape) + 1j * np.random.normal(0, 1, V2V_channels_with_fastfading.shape)) / math.sqrt(2))
-
- # 1 2, 3 4 --> 1 1 2 2, 3 3 4 4
- V2I_channels_with_fastfading = np.repeat(self.V2I_channels_abs[:, np.newaxis], self.n_RB, axis=1)
- self.V2I_channels_with_fastfading = V2I_channels_with_fastfading - 20 * np.log10(
- np.abs(np.random.normal(0, 1, V2I_channels_with_fastfading.shape) + 1j * np.random.normal(0, 1, V2I_channels_with_fastfading.shape))/ math.sqrt(2))
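In formula form, for every transmitter-receiver pair and every RB the code above computes
$h^{\text{fast}}_{\text{dB}} = h^{\text{abs}}_{\text{dB}} - 20 \log_{10}\!\left(\frac{|x_r + j x_i|}{\sqrt{2}}\right), \qquad x_r, x_i \sim \mathcal{N}(0, 1),$
i.e. an independent unit-power Rayleigh fading term is drawn per link and per RB and applied in the dB domain on top of path loss plus shadowing.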
Computing the reward: Compute_Performance_Reward_Train(self, actions_power). The input here is the RL action, defined in main_marl_train.py. It is a 3-D array indexed as (layer, row, column): one layer per vehicle, one row per neighbor, and two columns holding the RB choice (as an RB index) and the power choice (also an index, into V2V_power_dB_List), as shown below:
- for i in range(n_veh):
- for j in range(n_neighbor):
- state_old = get_state(env, [i, j], 1, epsi_final)
- action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
- action_all_testing[i, j, 0] = action % n_RB # chosen RB
- action_all_testing[i, j, 1] = int(np.floor(action / n_RB)) # power level
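A quick worked example of this flattened action index, assuming n_RB = 4 and the four power levels of V2V_power_dB_List (so 16 discrete actions per agent):
- n_RB = 4                      # assumed, equal to n_veh in this code
- action = 9                    # some index in [0, 15] output by the DQN
- rb = action % n_RB            # 9 % 4  = 1  -> resource block 1
- power_level = action // n_RB  # 9 // 4 = 2  -> V2V_power_dB_List[2] = 5 dBm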
The computation proceeds in the following steps: (1) for each active V2V link, add its contribution to the interference seen by the V2I link on the RB it selected, then compute the V2I rates from the V2I signal powers; (2) RB by RB, collect the V2V links sharing that RB, compute each link's received signal power, accumulate the interference from the V2I transmitter on that RB and from the other V2V links sharing it, and compute the V2V rates; (3) decrease the remaining payload (demand) and the remaining time budget, and build reward_elements. The code:
- def Compute_Performance_Reward_Train(self, actions_power):
-
- actions = actions_power[:, :, 0] # the channel_selection_part
- power_selection = actions_power[:, :, 1] # power selection
-
- # ------------ Compute V2I rate --------------------
- V2I_Rate = np.zeros(self.n_RB)
- V2I_Interference = np.zeros(self.n_RB) # V2I interference
- for i in range(len(self.vehicles)):
- for j in range(self.n_neighbor):
- if not self.active_links[i, j]:
- continue
- V2I_Interference[actions[i][j]] += 10 ** ((self.V2V_power_dB_List[power_selection[i, j]] - self.V2I_channels_with_fastfading[i, actions[i, j]]
- + self.vehAntGain + self.bsAntGain - self.bsNoiseFigure) / 10)
- self.V2I_Interference = V2I_Interference + self.sig2
- V2I_Signals = 10 ** ((self.V2I_power_dB - self.V2I_channels_with_fastfading.diagonal() + self.vehAntGain + self.bsAntGain - self.bsNoiseFigure) / 10)
- V2I_Rate = np.log2(1 + np.divide(V2I_Signals, self.V2I_Interference)) # V2I spectral efficiency (bps/Hz per RB)
-
- # ------------ Compute V2V rate -------------------------
- V2V_Interference = np.zeros((len(self.vehicles), self.n_neighbor))
- V2V_Signal = np.zeros((len(self.vehicles), self.n_neighbor))
- actions[(np.logical_not(self.active_links))] = -1 # inactive links will not transmit regardless of selected power levels
- for i in range(self.n_RB): # scanning all bands
- indexes = np.argwhere(actions == i) # find spectrum-sharing V2Vs
- for j in range(len(indexes)):
- receiver_j = self.vehicles[indexes[j, 0]].destinations[indexes[j, 1]]
- V2V_Signal[indexes[j, 0], indexes[j, 1]] = 10 ** ((self.V2V_power_dB_List[power_selection[indexes[j, 0], indexes[j, 1]]]
- - self.V2V_channels_with_fastfading[indexes[j][0], receiver_j, i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
- # V2I links interference to V2V links
- V2V_Interference[indexes[j, 0], indexes[j, 1]] = 10 ** ((self.V2I_power_dB - self.V2V_channels_with_fastfading[i, receiver_j, i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
-
- # V2V interference
- for k in range(j + 1, len(indexes)): # spectrum-sharing V2Vs
- receiver_k = self.vehicles[indexes[k][0]].destinations[indexes[k][1]]
- V2V_Interference[indexes[j, 0], indexes[j, 1]] += 10 ** ((self.V2V_power_dB_List[power_selection[indexes[k, 0], indexes[k, 1]]]
- - self.V2V_channels_with_fastfading[indexes[k][0]][receiver_j][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
- V2V_Interference[indexes[k, 0], indexes[k, 1]] += 10 ** ((self.V2V_power_dB_List[power_selection[indexes[j, 0], indexes[j, 1]]]
- - self.V2V_channels_with_fastfading[indexes[j][0]][receiver_k][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
- self.V2V_Interference = V2V_Interference + self.sig2
- V2V_Rate = np.log2(1 + np.divide(V2V_Signal, self.V2V_Interference))
-
- self.demand -= V2V_Rate * self.time_fast * self.bandwidth
- self.demand[self.demand < 0] = 0 # eliminate negative demands
-
- self.individual_time_limit -= self.time_fast
-
- reward_elements = V2V_Rate/10
- reward_elements[self.demand <= 0] = 1
-
- self.active_links[np.multiply(self.active_links, self.demand <= 0)] = 0 # transmission finished, turned to "inactive"
-
- return V2I_Rate, V2V_Rate, reward_elements
Note: three values are returned here, and none of them is the final reward by itself; the final reward is a weighted combination of the V2I rates and reward_elements (see act_for_training below).
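For reference, the quantities assembled in the loops above correspond, in the code's dB bookkeeping, to
$S = 10^{(P_{\text{tx}} - L + G_{\text{tx}} + G_{\text{rx}} - NF)/10}, \qquad \gamma = \frac{S}{\sigma^2 + I}, \qquad C = \log_2(1 + \gamma),$
where $L$ is the channel with fast fading in dB, $G$ are the antenna gains, $NF$ the receiver noise figure, $\sigma^2$ the noise power and $I$ the summed interference on the same RB; the remaining payload is then reduced by $C \cdot B \cdot \Delta t_{\text{fast}}$ with $B = 1$ MHz per RB and $\Delta t_{\text{fast}} = 1$ ms.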
Training step: act_for_training(self, actions) takes the actions, calls Compute_Performance_Reward_Train, and combines the results into the final reward:
- def act_for_training(self, actions):
-
- action_temp = actions.copy()
- V2I_Rate, V2V_Rate, reward_elements = self.Compute_Performance_Reward_Train(action_temp)
-
- lambdda = 0.
- reward = lambdda * np.sum(V2I_Rate) / (self.n_Veh * 10) + (1 - lambdda) * np.sum(reward_elements) / (self.n_Veh * self.n_neighbor)
-
- return reward
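Written out, the shared training reward is
$r = \lambda \,\frac{\sum_m C^{V2I}_m}{10\, N_{\text{veh}}} + (1 - \lambda)\, \frac{\sum_k r^{V2V}_k}{N_{\text{veh}}\, K},$
where $r^{V2V}_k$ (reward_elements) equals the V2V rate divided by 10 while a payload is still pending and 1 once it has been delivered. Note that lambdda is hard-coded to 0 in this snippet, so only the V2V term actually contributes; a nonzero λ would reintroduce the V2I-rate term.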
Testing step: act_for_testing(self, actions) is similar to the above and also relies on Compute_Performance_Reward_Train, but it returns V2I_rate, V2V_success, and V2V_rate instead of a reward.
- def act_for_testing(self, actions):
-
- action_temp = actions.copy()
- V2I_Rate, V2V_Rate, reward_elements = self.Compute_Performance_Reward_Train(action_temp)
- V2V_success = 1 - np.sum(self.active_links) / (self.n_Veh * self.n_neighbor) # V2V success rates
-
- return V2I_Rate, V2V_success, V2V_Rate
The three quantities above are the per-step outputs within an episode; this can be seen in the testing part of main_marl_train.py, excerpted below:
- for test_step in range(n_step_per_episode):
- # trained models
- action_all_testing = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
- for i in range(n_veh):
- for j in range(n_neighbor):
- state_old = get_state(env, [i, j], 1, epsi_final)
- action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
- action_all_testing[i, j, 0] = action % n_RB # chosen RB
- action_all_testing[i, j, 1] = int(np.floor(action / n_RB)) # power level
-
- action_temp = action_all_testing.copy()
- V2I_rate, V2V_success, V2V_rate = env.act_for_testing(action_temp)
- V2I_rate_per_episode.append(np.sum(V2I_rate)) # sum V2I rate in bps
- rate_marl[idx_episode, test_step,:,:] = V2V_rate
- demand_marl[idx_episode, test_step+1,:,:] = env.demand
Computing interference: Compute_Interference(self, actions) accumulates V2V_Interference_all with +=, looping over all interferers:
- def Compute_Interference(self, actions):
- V2V_Interference = np.zeros((len(self.vehicles), self.n_neighbor, self.n_RB)) + self.sig2
-
- channel_selection = actions.copy()[:, :, 0] # column 0 of every layer: chosen RB
- power_selection = actions.copy()[:, :, 1] # column 1 of every layer: chosen power level
- channel_selection[np.logical_not(self.active_links)] = -1 # mark inactive links with -1
-
- # interference from V2I links
- for i in range(self.n_RB):
- for k in range(len(self.vehicles)):
- for m in range(len(channel_selection[k, :])):
- V2V_Interference[k, m, i] += 10 ** ((self.V2I_power_dB - self.V2V_channels_with_fastfading[i][self.vehicles[k].destinations[m]][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
-
- # interference from peer V2V links
- for i in range(len(self.vehicles)):
- for j in range(len(channel_selection[i, :])):
- for k in range(len(self.vehicles)):
- for m in range(len(channel_selection[k, :])):
- # if i == k or channel_selection[i,j] >= 0:
- if i == k and j == m or channel_selection[i, j] < 0:
- continue
- V2V_Interference[k, m, channel_selection[i, j]] += 10 ** ((self.V2V_power_dB_List[power_selection[i, j]]
- - self.V2V_channels_with_fastfading[i][self.vehicles[k].destinations[m]][channel_selection[i,j]] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
- self.V2V_Interference_all = 10 * np.log10(V2V_Interference)
It is used by get_state in main_marl_train.py to form the V2V_interference component of the state, as shown below:
- def get_state(env, idx=(0,0), ind_episode=1., epsi=0.02):
- """ Get state from the environment """
- # includes: V2I/V2V fast fading, V2V interference, V2I/V2V large-scale channel info (PL + shadowing),
- # remaining time, remaining payload
-
- # V2I_channel = (env.V2I_channels_with_fastfading[idx[0], :] - 80) / 60
- V2I_fast = (env.V2I_channels_with_fastfading[idx[0], :] - env.V2I_channels_abs[idx[0]] + 10)/35
-
- # V2V_channel = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - 80) / 60
- V2V_fast = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] + 10)/35
-
- V2V_interference = (-env.V2V_Interference_all[idx[0], idx[1], :] - 60) / 60
-
- V2I_abs = (env.V2I_channels_abs[idx[0]] - 80) / 60.0
- V2V_abs = (env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] - 80)/60.0
-
- load_remaining = np.asarray([env.demand[idx[0], idx[1]] / env.demand_size])
- time_remaining = np.asarray([env.individual_time_limit[idx[0], idx[1]] / env.time_slow])
-
- # return np.concatenate((np.reshape(V2V_channel, -1), V2V_interference, V2I_abs, V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
- return np.concatenate((V2I_fast, np.reshape(V2V_fast, -1), V2V_interference, np.asarray([V2I_abs]), V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
- # all quantities of interest appear here: V2V_fast, V2I_fast, V2V_interference, V2I_abs, V2V_abs
Some readers may be puzzled at this point: why is V2V interference being computed again; wasn't it already computed? Yes: V2V_Interference is also needed when computing V2V_rate, but there it is organized per RB according to the actual spectrum allocation, whereas Compute_Interference iterates over the vehicles directly and records, for every link, the interference it would experience on every RB, which is what the state representation needs.
This part comes from replay_memory.py. It is short and defines a single class, ReplayMemory. Note that every agent owns its own memory, as can be seen in class Agent in main_marl_train.py:
- class Agent(object):
- def __init__(self, memory_entry_size):
- self.discount = 1
- self.double_q = True
- self.memory_entry_size = memory_entry_size
- self.memory = ReplayMemory(self.memory_entry_size)
Initialization: it takes the size of a single memory entry, entry_size:
- class ReplayMemory:
- def __init__(self, entry_size):
- self.entry_size = entry_size
- self.memory_size = 200000
- self.actions = np.empty(self.memory_size, dtype = np.uint8)
- self.rewards = np.empty(self.memory_size, dtype = np.float64)
- self.prestate = np.empty((self.memory_size, self.entry_size), dtype = np.float16)
- self.poststate = np.empty((self.memory_size, self.entry_size), dtype = np.float16)
- self.batch_size = 2000
- self.count = 0
- self.current = 0
Adding a transition: add(self, prestate, poststate, reward, action); as the parameter list shows, each entry stores (previous state, next state, reward, action):
- def add(self, prestate, poststate, reward, action):
- self.actions[self.current] = action
- self.rewards[self.current] = reward
- self.prestate[self.current] = prestate
- self.poststate[self.current] = poststate
- self.count = max(self.count, self.current + 1)
- self.current = (self.current + 1) % self.memory_size
Every agent records its state transition at every time step. The use of add can be seen in the Training part of main_marl_train.py; the for loop below sits inside an outer loop over episodes, so at every step of every episode a transition is appended to every agent's memory [last line]:
- for i_step in range(n_step_per_episode): # n_step_per_episode = 0.1/0.001 = 100
- time_step = i_episode*n_step_per_episode + i_step # global step index across episodes
- state_old_all = []
- action_all = []
- action_all_training = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
- for i in range(n_veh):
- for j in range(n_neighbor):
- state = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
- state_old_all.append(state)
- action = predict(sesses[i*n_neighbor+j], state, epsi)
- action_all.append(action)
-
- action_all_training[i, j, 0] = action % n_RB # chosen RB
- action_all_training[i, j, 1] = int(np.floor(action / n_RB)) # power level
-
- # All agents take actions simultaneously, obtain shared reward, and update the environment.
- action_temp = action_all_training.copy()
- train_reward = env.act_for_training(action_temp)
- record_reward[time_step] = train_reward
-
- env.renew_channels_fastfading()
- env.Compute_Interference(action_temp)
-
- for i in range(n_veh):
- for j in range(n_neighbor):
- state_old = state_old_all[n_neighbor * i + j]
- action = action_all[n_neighbor * i + j]
- state_new = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
- agents[i * n_neighbor + j].memory.add(state_old, state_new, train_reward, action) # add entry to this agent's memory
Sampling: sample(self). After many calls to add, each agent's memory holds many transitions, but training draws batch_size of them at a time:
- def sample(self):
-
- if self.count < self.batch_size:
- indexes = range(0, self.count)
- else:
- indexes = random.sample(range(0,self.count), self.batch_size)
- prestate = self.prestate[indexes]
- poststate = self.poststate[indexes]
- actions = self.actions[indexes]
- rewards = self.rewards[indexes]
- return prestate, poststate, actions, rewards
Defining class Agent: Agent(object) takes the memory entry size as its only constructor argument; the rest are algorithm parameters. Note that its memory is an instance of ReplayMemory, introduced just above.
- class Agent(object):
- def __init__(self, memory_entry_size):
- self.discount = 1
- self.double_q = True
- self.memory_entry_size = memory_entry_size
- self.memory = ReplayMemory(self.memory_entry_size)
Parameter initialization: this part is written directly at module level, without a function. It covers roughly: map properties (lane coordinates, overall map size), the numbers of vehicles, neighbors, RBs and episodes, and some algorithm parameters:
To understand the map parameters up_lanes / down_lanes / left_lanes / right_lanes, note that the system model is the urban case of 3GPP TR 36.885: every street has four lanes (two per direction), each lane is 3.5 m wide, the road grid measured between the yellow center lines is 433 m x 250 m, and the whole simulation area is 1299 m x 750 m. The simulation scales everything down by a factor of 2 (visible in the width and height parameters being divided by 2), which is why every lane coordinate appears as i / 2.0 in the lanes lists.
Take up_lanes as an example. A lane is 3.5 m wide and a vehicle is treated as a point moving along the middle of its lane, so the first entry inside the brackets is 3.5/2; the second entry, 3.5 + 3.5/2, is the middle of the second lane in the same direction; the third, 250 + 3.5/2, is the first same-direction lane beyond the next building block; and so on.
- up_lanes = [i/2.0 for i in [3.5/2, 3.5 + 3.5/2, 250+3.5/2, 250+3.5+3.5/2, 500+3.5/2, 500+3.5+3.5/2]]
- down_lanes = [i/2.0 for i in [250-3.5-3.5/2,250-3.5/2,500-3.5-3.5/2,500-3.5/2,750-3.5-3.5/2,750-3.5/2]]
- left_lanes = [i/2.0 for i in [3.5/2,3.5/2 + 3.5,433+3.5/2, 433+3.5+3.5/2, 866+3.5/2, 866+3.5+3.5/2]]
- right_lanes = [i/2.0 for i in [433-3.5-3.5/2,433-3.5/2,866-3.5-3.5/2,866-3.5/2,1299-3.5-3.5/2,1299-3.5/2]]
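- # e.g. up_lanes evaluates to [0.875, 2.625, 125.875, 127.625, 250.875, 252.625] (metres, after the 1/2 scaling)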
-
- width = 750/2
- height = 1298/2
-
- IS_TRAIN = 1
- IS_TEST = 1-IS_TRAIN
-
- label = 'marl_model'
-
- n_veh = 4
- n_neighbor = 1
- n_RB = n_veh
-
- env = Environment_marl.Environ(down_lanes, up_lanes, left_lanes, right_lanes, width, height, n_veh, n_neighbor)
- env.new_random_game() # initialize parameters in env
-
- # n_episode = 3000
- n_episode = 600
- n_step_per_episode = int(env.time_slow/env.time_fast) # slow = 0.1, fast = 0.001
- epsi_final = 0.02
- epsi_anneal_length = int(0.8*n_episode)
- mini_batch_step = n_step_per_episode
- target_update_step = n_step_per_episode*4
-
- n_episode_test = 100 # test episodes
Getting the state: get_state(env, idx=(0,0), ind_episode=1., epsi=0.02) takes the environment env and assembles the state vector whose components are listed in the code below.
Note the '- 80' and '/ 60' style operations applied to V2I_abs and the other inputs. The code author's explanation from the GitHub discussion board: "This is to roughly normalize DQN inputs for the ease of training. The numbers are obtained from several trial runs."
- def get_state(env, idx=(0,0), ind_episode=1., epsi=0.02):
- """ Get state from the environment """
- # includes: V2I/V2V fast fading, V2V interference, V2I/V2V large-scale channel info (PL + shadowing),
- # remaining time, remaining payload
-
- # V2I_channel = (env.V2I_channels_with_fastfading[idx[0], :] - 80) / 60
- V2I_fast = (env.V2I_channels_with_fastfading[idx[0], :] - env.V2I_channels_abs[idx[0]] + 10)/35
-
- # V2V_channel = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - 80) / 60
- V2V_fast = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] + 10)/35
-
- V2V_interference = (-env.V2V_Interference_all[idx[0], idx[1], :] - 60) / 60
-
- V2I_abs = (env.V2I_channels_abs[idx[0]] - 80) / 60.0
- V2V_abs = (env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] - 80)/60.0
-
- load_remaining = np.asarray([env.demand[idx[0], idx[1]] / env.demand_size])
- time_remaining = np.asarray([env.individual_time_limit[idx[0], idx[1]] / env.time_slow])
-
- # return np.concatenate((np.reshape(V2V_channel, -1), V2V_interference, V2I_abs, V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
- return np.concatenate((V2I_fast, np.reshape(V2V_fast, -1), V2V_interference, np.asarray([V2I_abs]), V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
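With the defaults n_veh = 4, n_neighbor = 1 and n_RB = 4, the length of this state vector works out to
$\underbrace{4}_{\text{V2I fast}} + \underbrace{4 \times 4}_{\text{V2V fast}} + \underbrace{4}_{\text{V2V interf.}} + \underbrace{1}_{\text{V2I abs}} + \underbrace{4}_{\text{V2V abs}} + \underbrace{1}_{\text{time}} + \underbrace{1}_{\text{load}} + \underbrace{2}_{\text{episode},\ \epsilon} = 33,$
which should be the n_input used for the placeholders in the network definition below.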
Defining the NN:
- with g.as_default():
- # ============== Training network ========================
- x = tf.placeholder(tf.float32, [None, n_input]) # network input
-
- w_1 = tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1))
- w_2 = tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.1))
- w_3 = tf.Variable(tf.truncated_normal([n_hidden_2, n_hidden_3], stddev=0.1))
- w_4 = tf.Variable(tf.truncated_normal([n_hidden_3, n_output], stddev=0.1))
-
- b_1 = tf.Variable(tf.truncated_normal([n_hidden_1], stddev=0.1))
- b_2 = tf.Variable(tf.truncated_normal([n_hidden_2], stddev=0.1))
- b_3 = tf.Variable(tf.truncated_normal([n_hidden_3], stddev=0.1))
- b_4 = tf.Variable(tf.truncated_normal([n_output], stddev=0.1))
-
- layer_1 = tf.nn.relu(tf.add(tf.matmul(x, w_1), b_1))
- layer_1_b = tf.layers.batch_normalization(layer_1)
- layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1_b, w_2), b_2))
- layer_2_b = tf.layers.batch_normalization(layer_2)
- layer_3 = tf.nn.relu(tf.add(tf.matmul(layer_2_b, w_3), b_3))
- layer_3_b = tf.layers.batch_normalization(layer_3)
- y = tf.nn.relu(tf.add(tf.matmul(layer_3_b, w_4), b_4))
-
- g_q_action = tf.argmax(y, axis=1)
-
- # compute loss
- g_target_q_t = tf.placeholder(tf.float32, None, name="target_value")
-
- g_action = tf.placeholder(tf.int32, None, name='g_action')
- action_one_hot = tf.one_hot(g_action, n_output, 1.0, 0.0, name='action_one_hot')
- q_acted = tf.reduce_sum(y * action_one_hot, reduction_indices=1, name='q_acted')
-
- g_loss = tf.reduce_mean(tf.square(g_target_q_t - q_acted), name='g_loss') # mean squared TD error
- optim = tf.train.RMSPropOptimizer(learning_rate=0.001, momentum=0.95, epsilon=0.01).minimize(g_loss) # gradient descent step (RMSProp)
-
- # ==================== Prediction network ========================
- x_p = tf.placeholder(tf.float32, [None, n_input]) # input of the prediction (target) network
-
- w_1_p = tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1))
- w_2_p = tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.1))
- w_3_p = tf.Variable(tf.truncated_normal([n_hidden_2, n_hidden_3], stddev=0.1))
- w_4_p = tf.Variable(tf.truncated_normal([n_hidden_3, n_output], stddev=0.1))
-
- b_1_p = tf.Variable(tf.truncated_normal([n_hidden_1], stddev=0.1))
- b_2_p = tf.Variable(tf.truncated_normal([n_hidden_2], stddev=0.1))
- b_3_p = tf.Variable(tf.truncated_normal([n_hidden_3], stddev=0.1))
- b_4_p = tf.Variable(tf.truncated_normal([n_output], stddev=0.1))
-
- layer_1_p = tf.nn.relu(tf.add(tf.matmul(x_p, w_1_p), b_1_p))
- layer_1_p_b = tf.layers.batch_normalization(layer_1_p)
- layer_2_p = tf.nn.relu(tf.add(tf.matmul(layer_1_p_b, w_2_p), b_2_p))
- layer_2_p_b = tf.layers.batch_normalization(layer_2_p)
- layer_3_p = tf.nn.relu(tf.add(tf.matmul(layer_2_p_b, w_3_p), b_3_p))
- layer_3_p_b = tf.layers.batch_normalization(layer_3_p)
- y_p = tf.nn.relu(tf.add(tf.matmul(layer_3_p_b, w_4_p), b_4_p))
-
- g_target_q_idx = tf.placeholder('int32', [None, None], 'output_idx') # input: a list of (row, action) index pairs, shape (n, 2)
- target_q_with_idx = tf.gather_nd(y_p, g_target_q_idx) # gather the selected entries of y_p
-
- init = tf.global_variables_initializer()
- saver = tf.train.Saver()
Only the overall structure is described here; the detailed meaning is covered in the "Sampling and computing the loss" part below, which explains the network structure together with the algorithm.
The graph is organized into three parts: the training network, the loss computation, and the prediction network, denoted N1, N2 and N3. N1 and N3 have exactly the same structure; they are the DQN networks that output Q values. The difference is that N1 is updated at every iteration, while N3 is only updated once in a while (it plays the role of the target network). N2 takes N1's outputs together with the target values and computes the loss used to update N1.
Prediction: predict(sess, s_t, ep, test_ep=False) drives the NN to generate an action:
- def predict(sess, s_t, ep, test_ep = False):
-
- n_power_levels = len(env.V2V_power_dB_List)
- if np.random.rand() < ep and not test_ep:
- pred_action = np.random.randint(n_RB*n_power_levels)
- else:
- pred_action = sess.run(g_q_action, feed_dict={x: [s_t]})[0]
- return pred_action
The returned action is a single int, but it encodes both the RB and the power level; it appears again in both the Training and Testing parts of this file and is decoded as follows:
- action = predict(sesses[i*n_neighbor+j], state, epsi)
- action_all.append(action)
-
- action_all_training[i, j, 0] = action % n_RB # chosen RB
- action_all_training[i, j, 1] = int(np.floor(action / n_RB)) # power level
Sampling and computing the loss: q_learning_mini_batch(current_agent, current_sess) takes a single agent and uses the sample method of its memory, described above. This is also where double Q-learning is switched on or off.
- def q_learning_mini_batch(current_agent, current_sess):
- """ Training a sampled mini-batch """
-
- batch_s_t, batch_s_t_plus_1, batch_action, batch_reward = current_agent.memory.sample()
-
- if current_agent.double_q: # double q-learning
- pred_action = current_sess.run(g_q_action, feed_dict={x: batch_s_t_plus_1})
- q_t_plus_1 = current_sess.run(target_q_with_idx, {x_p: batch_s_t_plus_1, g_target_q_idx: [[idx, pred_a] for idx, pred_a in enumerate(pred_action)]})
- batch_target_q_t = current_agent.discount * q_t_plus_1 + batch_reward
- else:
- q_t_plus_1 = current_sess.run(y_p, {x_p: batch_s_t_plus_1})
- max_q_t_plus_1 = np.max(q_t_plus_1, axis=1)
- batch_target_q_t = current_agent.discount * max_q_t_plus_1 + batch_reward
-
- _, loss_val = current_sess.run([optim, g_loss], {g_target_q_t: batch_target_q_t, g_action: batch_action, x: batch_s_t})
- return loss_val
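In equation form, the two branches build the targets (with discount $\gamma$ = current_agent.discount) as
double Q-learning: $y = r + \gamma\, Q_{\text{target}}\big(s',\ \arg\max_a Q_{\text{train}}(s', a)\big)$,
plain DQN: $y = r + \gamma\, \max_a Q_{\text{target}}(s', a)$,
and the final sess.run then performs one RMSProp step on $(y - Q_{\text{train}}(s, a))^2$ averaged over the sampled mini-batch.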
Added 4.23: this function is best read together with the network structure; it is a bit involved. As the surface reading suggests, the if/else covers the two variants, plain DQN and double Q-learning, and both branches only evaluate the target-network side; feeding the training network (top-left of the algorithm diagram) and performing its update are done by the final line:
_, loss_val = current_sess.run([optim, g_loss], {g_target_q_t: batch_target_q_t, g_action: batch_action, x: batch_s_t})
This code is easier to follow alongside the corresponding figures; the algorithm diagram and the code flowchart (drawn by the author in Visio, not in any standard notation, so please forgive any mistakes) are reproduced here.
Plain DQN (figure)
Double DQN (figure)
The difference from plain DQN lies in the target network: plain DQN builds the target directly from the prediction network (the "predict / updated once in a while" box in the figure above) followed by a max, whereas double DQN cascades the training network and the prediction network to form the target: the training network selects the action and the prediction network evaluates it.
Training phase
for i in episode: (one complete episode iteration)
initialize state_old_all, action_all, action_all_training
obtain an action via predict (it encodes both the RB and the power) [per individual link]
store it into action_all_training = [vehicle, neighbor, RB/power] [collecting the per-link choices]
obtain the reward via act_for_training [this is shared across all links]; in a single-agent (SARL) variant the reward computation would move inside the per-link loop above, everything else unchanged
append the reward to record_reward
update the fast fading
compute the interference resulting from the chosen actions
then, looping over every link again:
compute the new state
add (state_old, state_new, train_reward, action) to that agent's memory [so each memory entry corresponds to a single link]
every time mini_batch_step new states have been collected: compute the loss via q_learning_mini_batch
every target_update_step steps: update the target Q network
- record_reward = np.zeros([n_episode*n_step_per_episode, 1])
- record_loss = []
- if IS_TRAIN:
- for i_episode in range(n_episode):
- print("-------------------------")
- print('Episode:', i_episode)
- if i_episode < epsi_anneal_length:
- epsi = 1 - i_episode * (1 - epsi_final) / (epsi_anneal_length - 1) # epsilon decreases over each episode
- else:
- epsi = epsi_final
-
- # every 100 episodes, update vehicle positions, neighbors, slow fading and fast fading
- if i_episode%100 == 0:
- env.renew_positions() # update vehicle position
- env.renew_neighbor()
- env.renew_channel() # update channel slow fading
- env.renew_channels_fastfading() # update channel fast fading
-
- env.demand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
- env.individual_time_limit = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
- env.active_links = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')
-
- for i_step in range(n_step_per_episode): # n_step_per_episode = 0.1/0.001 = 100
- time_step = i_episode*n_step_per_episode + i_step # global step index across episodes
- state_old_all = []
- action_all = []
- action_all_training = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
- for i in range(n_veh):
- for j in range(n_neighbor):
- state = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
- state_old_all.append(state)
- action = predict(sesses[i*n_neighbor+j], state, epsi)
- action_all.append(action)
-
- action_all_training[i, j, 0] = action % n_RB # chosen RB
- action_all_training[i, j, 1] = int(np.floor(action / n_RB)) # power level
-
- # All agents take actions simultaneously, obtain shared reward, and update the environment.
- action_temp = action_all_training.copy()
- train_reward = env.act_for_training(action_temp)
- record_reward[time_step] = train_reward
-
- env.renew_channels_fastfading()
- env.Compute_Interference(action_temp)
-
- for i in range(n_veh):
- for j in range(n_neighbor):
- state_old = state_old_all[n_neighbor * i + j]
- action = action_all[n_neighbor * i + j]
- state_new = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
- agents[i * n_neighbor + j].memory.add(state_old, state_new, train_reward, action) # add entry to this agent's memory
- # training this agent
- if time_step % mini_batch_step == mini_batch_step-1:
- loss_val_batch = q_learning_mini_batch(agents[i*n_neighbor+j], sesses[i*n_neighbor+j])
- record_loss.append(loss_val_batch)
- if i == 0 and j == 0:
- print('step:', time_step, 'agent',i*n_neighbor+j, 'loss', loss_val_batch)
- if time_step % target_update_step == target_update_step-1:
- update_target_q_network(sesses[i*n_neighbor+j])
- if i == 0 and j == 0:
- print('Update target Q network...')
- print('Training Done. Saving models...')
- for i in range(n_veh):
- for j in range(n_neighbor):
- model_path = label + '/agent_' + str(i * n_neighbor + j)
- save_models(sesses[i * n_neighbor + j], model_path)
- current_dir = os.path.dirname(os.path.realpath(__file__))
- reward_path = os.path.join(current_dir, "model/" + label + '/reward.mat')
- scipy.io.savemat(reward_path, {'reward': record_reward})
- record_loss = np.asarray(record_loss).reshape((-1, n_veh*n_neighbor))
- loss_path = os.path.join(current_dir, "model/" + label + '/train_loss.mat')
- scipy.io.savemat(loss_path, {'train_loss': record_loss})
Testing phase
First load the models obtained from training.
for i in episode: (one complete episode iteration)
initialize state_old_all, action_all, action_all_testing
obtain an action via predict (it encodes both the RB and the power)
store it into action_all_testing = [vehicle, neighbor, RB/power]
obtain V2I_rate, V2V_success, V2V_rate via act_for_testing
sum V2I_rate and append it to V2I_rate_per_episode
store V2V_rate into rate_marl
update demand
- if IS_TEST:
- print("\nRestoring the model...")
-
- for i in range(n_veh):
- for j in range(n_neighbor):
- model_path = label + '/agent_' + str(i * n_neighbor + j)
- load_models(sesses[i * n_neighbor + j], model_path)
-
- V2I_rate_list = []
- V2V_success_list = []
- V2I_rate_list_rand = []
- V2V_success_list_rand = []
- rate_marl = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])
- rate_rand = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])
- demand_marl = env.demand_size * np.ones([n_episode_test, n_step_per_episode+1, n_veh, n_neighbor])
- demand_rand = env.demand_size * np.ones([n_episode_test, n_step_per_episode+1, n_veh, n_neighbor])
- power_rand = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])
- for idx_episode in range(n_episode_test):
- print('----- Episode', idx_episode, '-----')
-
- env.renew_positions()
- env.renew_neighbor()
- env.renew_channel()
- env.renew_channels_fastfading()
-
- env.demand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
- env.individual_time_limit = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
- env.active_links = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')
-
- env.demand_rand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
- env.individual_time_limit_rand = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
- env.active_links_rand = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')
-
- V2I_rate_per_episode = []
- V2I_rate_per_episode_rand = []
- for test_step in range(n_step_per_episode):
- # trained models
- action_all_testing = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
- for i in range(n_veh):
- for j in range(n_neighbor):
- state_old = get_state(env, [i, j], 1, epsi_final)
- action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
- action_all_testing[i, j, 0] = action % n_RB # chosen RB
- action_all_testing[i, j, 1] = int(np.floor(action / n_RB)) # power level
-
- action_temp = action_all_testing.copy()
- V2I_rate, V2V_success, V2V_rate = env.act_for_testing(action_temp)
- V2I_rate_per_episode.append(np.sum(V2I_rate)) # sum V2I rate in bps
- rate_marl[idx_episode, test_step,:,:] = V2V_rate
- demand_marl[idx_episode, test_step+1,:,:] = env.demand
-
- # random baseline
- action_rand = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
- action_rand[:, :, 0] = np.random.randint(0, n_RB, [n_veh, n_neighbor]) # band
- action_rand[:, :, 1] = np.random.randint(0, len(env.V2V_power_dB_List), [n_veh, n_neighbor]) # power
-
- V2I_rate_rand, V2V_success_rand, V2V_rate_rand = env.act_for_testing_rand(action_rand)
- V2I_rate_per_episode_rand.append(np.sum(V2I_rate_rand)) # sum V2I rate in bps
- rate_rand[idx_episode, test_step, :, :] = V2V_rate_rand
- demand_rand[idx_episode, test_step+1,:,:] = env.demand_rand
- for i in range(n_veh):
- for j in range(n_neighbor):
- power_rand[idx_episode, test_step, i, j] = env.V2V_power_dB_List[int(action_rand[i, j, 1])]
-
- # update the environment and compute interference
- env.renew_channels_fastfading()
- env.Compute_Interference(action_temp)
-
- if test_step == n_step_per_episode - 1:
- V2V_success_list.append(V2V_success)
- V2V_success_list_rand.append(V2V_success_rand)
-
- V2I_rate_list.append(np.mean(V2I_rate_per_episode))
- V2I_rate_list_rand.append(np.mean(V2I_rate_per_episode_rand))
-
- print(round(np.average(V2I_rate_per_episode), 2), 'rand', round(np.average(V2I_rate_per_episode_rand), 2))
- print(V2V_success_list[idx_episode], 'rand', V2V_success_list_rand[idx_episode])
[1] 3GPP TR 36.885, "Study on LTE-based V2X services".
[2] 《5G移动通信技术》 (5G Mobile Communication Technology).