Article Reading

For DQN I chose to read arXiv:1312.5602, "Playing Atari with Deep Reinforcement Learning".

Motivation

  • Previous approaches struggle with high-dimensional inputs such as video and audio and usually rely on hand-crafted features, whereas convolutional neural networks, multi-layer perceptrons, and similar models can extract features directly from the raw high-dimensional input.
  • Deep learning typically needs large amounts of labeled data, while RL suffers from delayed rewards: the value of an action may only be known once an episode has finished.
  • Deep learning also relies on the assumption of i.i.d. samples, but consecutive RL transitions are highly correlated, violating that assumption.

Idea

How to handle image input

Put a CNN in front of the network: it convolves the image, extracts features, and effectively reduces the dimensionality of the input.

How to handle the problem of limited samples

Use an experience replay buffer: every time a (state, action, reward, next state) transition is obtained, it is not only used for the current update but also pushed into the buffer; at each training step a batch is sampled from the buffer and the network is trained on that batch.

Limitation

Training with a single network is unstable

In the original 2013 paper a single network is used; there is no split into a target network and a Q-network, so during training the loss fluctuates heavily and the reward shows no clear upward trend.

The epsilon decay rate of epsilon-greedy does not match the learning progress

In my experiments I found that, even with an exponential decay schedule, the reward does not rise noticeably while epsilon is still high, even when the loss has already dropped clearly.

Code Implementation

The complete project code is on my GitHub; the snippets below are only partial excerpts.

Project structure

.
├── clearlog.sh
├── DQN
│   ├── DQNAgent.py
├── logs/
├── main.py
├── README.md
├── requirements.txt
├── RL.yaml
└── utility
    ├── EnvConfig.py
    ├── NetWork.py
    └── ReplayBuffer.py

The entry point of the program is main.py. The utility/ directory contains the environment configuration, the neural network, and the experience replay buffer; the DQN/ directory implements the Agent; log files are written to logs/.

Environment interaction

The biggest problem I ran into in this experiment was interacting with the environment. The first issue was not using the right wrappers. The paper stacks four frames into a single observation before feeding it to the network; initially I called env.step(action) four times, took the four frames, and concatenated them into one big tensor, but that is wrong, because those are really four separate transitions. The proper way is to use the FrameStack() wrapper.

The second issue is that, of the tuple returned by env.reset(), only the first element is the image observation; the observation should therefore be obtained with state = env.reset()[0].

The correct environment setup is given below:

import gym
from gym.wrappers import AtariPreprocessing, FrameStack


def make_env(env_name):
    """
    Create and configure an environment for reinforcement learning.

    Parameters:
    - env_name (str): The name of the environment to create.

    Returns:
    - env (gym.Env): The configured environment.
    """
    env = gym.make(env_name)
    # Grayscale, downsample to 84x84, skip 4 frames per action; end the episode when a life is lost
    env = AtariPreprocessing(env, scale_obs=False, terminal_on_life_loss=True)
    # Stack the last 4 preprocessed frames into a single observation
    env = FrameStack(env, num_stack=4)
    return env

The environment chosen here is PongNoFrameskip-v4.
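
A quick sanity check of what the wrapped environment returns (an illustrative sketch, assuming make_env from above):

import numpy as np

env = make_env('PongNoFrameskip-v4')
obs = env.reset()[0]          # the first element of the tuple is the stacked observation (a LazyFrames object)
print(np.asarray(obs).shape)  # (4, 84, 84): four stacked 84x84 grayscale frames
print(env.action_space.n)     # number of discrete actions (6 for Pong)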

The training loop and the interaction with the environment (with the TensorBoard plotting code removed from this snippet) are as follows:

# Note: hyperparameters such as update, alpha, epsilon_min, batch_size, replay_start_size,
# the epsilon() schedule and the TensorBoard writer are module-level names defined in main.py;
# AG refers to the imported DQNAgent module (DQN/DQNAgent.py).
def train(env_name='PongNoFrameskip-v4', learning_rate=3e-4, gamma=0.99, memory_size=100000, total_frame=2000000):
    env = make_env(env_name)
    DQNAgent = AG.Agent(in_channels=env.observation_space.shape[0], num_actions=env.action_space.n,
                        reset_network_interval=update, lr=learning_rate, alpha=alpha, gamma=gamma,
                        epsilon=epsilon_min, replay_size=memory_size)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    frame = env.reset()[0]
    total_reward = 0
    Loss = []
    Reward = []
    episodes = 0

    for frame_num in range(total_frame):
        eps = epsilon(frame_num)
        state = DQNAgent.replay_buffer.transform(frame)
        action = DQNAgent.get_action(state, epsilon=eps)
        next_frame, reward, terminated, truncated, _ = env.step(action)
        DQNAgent.replay_buffer.push(frame, action, reward, next_frame, terminated)
        total_reward += reward
        frame = next_frame
        loss = 0

        # Only start learning once the buffer holds enough transitions
        if len(DQNAgent.replay_buffer) > replay_start_size:
            loss = DQNAgent.train(batch_size=batch_size)
            Loss.append(loss)

        # Periodically sync the target network with the Q-network
        if frame_num % DQNAgent.reset_network_interval == 0:
            DQNAgent.reset()

        if terminated:
            episodes += 1
            Reward.append(total_reward)
            print('episode {}: total reward {}'.format(episodes, total_reward))
            frame = env.reset()[0]
            total_reward = 0

        if frame_num % 1000 == 0:
            torch.cuda.empty_cache()

    writer.close()

The most important line here is

frame = env.reset()[0]

which extracts just the image observation from the tuple returned by env.reset().
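
The epsilon(frame_num) schedule called in the loop lives in main.py and is not shown here; a minimal sketch of the kind of exponential decay I am referring to (the constants are illustrative; see the hyperparameter list further down) would be:

import math

epsilon_begin = 1.0      # epsilon at frame 0
epsilon_end = 0.02       # floor that epsilon decays towards
epsilon_decay = 200000   # decay time constant in frames


def epsilon(frame_num):
    # Exponentially anneal epsilon from epsilon_begin towards epsilon_end
    return epsilon_end + (epsilon_begin - epsilon_end) * math.exp(-frame_num / epsilon_decay)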

Experience replay buffer

The experience replay buffer is implemented as follows:

import random
import torch


class ReplayBuffer:
    """
    A replay buffer class for storing and sampling experiences for reinforcement learning.

    Args:
        size (int): The maximum size of the replay buffer.

    Attributes:
        size (int): The maximum size of the replay buffer.
        buffer (list): A list to store the experiences.
        cur (int): The current index in the buffer.
        device (torch.device): The device to use for tensor operations.

    Methods:
        __len__(): Returns the number of experiences in the buffer.
        transform(lazy_frame): Transforms a lazy frame into a tensor.
        push(state, action, reward, next_state, done): Adds an experience to the buffer.
        sample(batch_size): Samples a batch of experiences from the buffer.

    """

    def __init__(self, size):
        self.size = size
        self.buffer = []
        self.cur = 0
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def __len__(self):
        return len(self.buffer)

    def transform(self, lazy_frame):
        # Convert a LazyFrames object into a [1, 4, 84, 84] float tensor scaled to [0, 1]
        state = torch.from_numpy(lazy_frame.__array__()[None] / 255).float()
        return state.to(self.device)

    def push(self, state, action, reward, next_state, done):
        """
        Adds an experience to the replay buffer.

        Args:
            state (numpy.ndarray): The current state.
            action (int): The action taken.
            reward (float): The reward received.
            next_state (numpy.ndarray): The next state.
            done (bool): Whether the episode is done.

        """
        if len(self.buffer) == self.size:
            # Buffer is full: overwrite the oldest entry (ring buffer)
            self.buffer[self.cur] = (state, action, reward, next_state, done)
        else:
            self.buffer.append((state, action, reward, next_state, done))
        self.cur = (self.cur + 1) % self.size

    def sample(self, batch_size):
        """
        Samples a batch of experiences from the replay buffer.

        Args:
            batch_size (int): The size of the batch to sample.

        Returns:
            tuple: A tuple containing the batch of states, actions, rewards, next states, and dones.

        """
        states, actions, rewards, next_states, dones = [], [], [], [], []
        for _ in range(batch_size):
            frame, action, reward, next_frame, done = self.buffer[random.randint(0, len(self.buffer) - 1)]
            state = self.transform(frame)
            next_state = self.transform(next_frame)
            state = torch.squeeze(state, 0)
            next_state = torch.squeeze(next_state, 0)
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            next_states.append(next_state)
            dones.append(done)
        return (torch.stack(states).to(self.device), torch.tensor(actions).to(self.device),
                torch.tensor(rewards).to(self.device),
                torch.stack(next_states).to(self.device), torch.tensor(dones).to(self.device))

What matters here is the conversion to torch.Tensor when returning a batch; this is the only place in the project where that conversion happens. Each LazyFrame is first turned into an individual tensor of shape [4, 84, 84], and the whole batch is then assembled into one large tensor with torch.stack().
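
As a quick shape check (an illustrative sketch, assuming make_env and ReplayBuffer from above), sampling a batch yields:

env = make_env('PongNoFrameskip-v4')
buffer = ReplayBuffer(size=1000)

frame = env.reset()[0]
for _ in range(100):  # collect a few random transitions
    action = env.action_space.sample()
    next_frame, reward, terminated, truncated, _ = env.step(action)
    buffer.push(frame, action, reward, next_frame, terminated)
    frame = env.reset()[0] if terminated else next_frame

states, actions, rewards, next_states, dones = buffer.sample(batch_size=32)
print(states.shape)   # torch.Size([32, 4, 84, 84])
print(actions.shape)  # torch.Size([32])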

Network definition

The network is defined as follows:

import torch.nn as nn
import torch.nn.functional as F
import torch


class NetWork(nn.Module):
    """
    Deep Q-Network (DQN) class.

    Args:
        in_channels (int): Number of input channels.
        num_actions (int): Number of possible actions.

    Attributes:
        conv1 (nn.Conv2d): First convolutional layer.
        conv2 (nn.Conv2d): Second convolutional layer.
        conv3 (nn.Conv2d): Third convolutional layer.
        fc4 (nn.Linear): Fourth fully connected layer.
        fc5 (nn.Linear): Fifth fully connected layer.
    """

    def __init__(self, in_channels, num_actions):
        super(NetWork, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=in_channels, out_channels=32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1)
        self.fc4 = nn.Linear(in_features=7 * 7 * 64, out_features=512)  # 84x84 input -> 7x7x64 after the convs
        self.fc5 = nn.Linear(in_features=512, out_features=num_actions)

    def forward(self, x):
        """
        Forward pass of the DQN.

        Args:
            x (torch.Tensor): Input tensor.

        Returns:
            torch.Tensor: Output tensor.
        """
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.fc4(x.reshape(x.size(0), -1)))  # flatten before the fully connected layers
        return self.fc5(x)

I simply use relu as the activation function; in my experiments switching to leaky_relu brought no noticeable improvement.
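
The in_features=7 * 7 * 64 of fc4 follows from the convolution arithmetic on an 84x84 input: (84 - 8) / 4 + 1 = 20, then (20 - 4) / 2 + 1 = 9, then (9 - 3) / 1 + 1 = 7. A throwaway shape check (num_actions=6 is just an illustrative value matching Pong's action set):

import torch

net = NetWork(in_channels=4, num_actions=6)
dummy = torch.zeros(1, 4, 84, 84)  # a batch with a single stacked observation
print(net(dummy).shape)            # torch.Size([1, 6]): one Q-value per action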

Agent implementation

The Agent mainly implements the synchronization between the target network and the Q-network, as well as a single learning step:

from utility.NetWork import NetWork
from utility.ReplayBuffer import ReplayBuffer
import torch
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import random


class Agent:
    """
    The Agent class represents a Deep Q-Network (DQN) agent for reinforcement learning.

    Args:
        in_channels (int): Number of input channels.
        num_actions (int): Number of possible actions.
        reset_network_interval (int): How often (in frames) to sync the target network with the Q-network.
        lr (float): Learning rate for the optimizer.
        alpha (float): RMSprop optimizer alpha value (only used with RMSprop).
        gamma (float): Discount factor for future rewards.
        epsilon (float): Numerical-stability eps passed to the optimizer.
        replay_size (int): Size of the replay buffer.

    Attributes:
        num_actions (int): Number of possible actions.
        replay_buffer (ReplayBuffer): Replay buffer for storing and sampling experiences.
        device (torch.device): Device (CPU or GPU) for running computations.
        reset_network_interval (int): Target network sync interval.
        gamma (float): Discount factor for future rewards.
        q_network (NetWork): Q-network for estimating action values.
        target_network (NetWork): Target network for estimating target action values.
        optimizer (torch.optim.Optimizer): Optimizer for updating the Q-network.

    Methods:
        get_action(state, epsilon): Selects an action using an epsilon-greedy policy.
        calculate_loss(states, actions, rewards, next_states, dones): Calculates the loss for a batch of experiences.
        reset(): Resets the target network to match the Q-network.
        train(batch_size): Performs a single learning step using a batch of experiences.
    """

    def __init__(self, in_channels, num_actions, reset_network_interval, lr, alpha, gamma, epsilon, replay_size):
        self.num_actions = num_actions
        self.replay_buffer = ReplayBuffer(replay_size)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(self.device)
        self.reset_network_interval = reset_network_interval
        self.gamma = gamma
        self.q_network = NetWork(in_channels, num_actions).to(self.device)
        self.target_network = NetWork(in_channels, num_actions).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.RMSprop(self.q_network.parameters(), lr=lr, eps=epsilon, alpha=alpha)
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr, eps=epsilon)

    def get_action(self, state, epsilon):
        """
        Selects an action using epsilon-greedy policy.

        Args:
            state (torch.Tensor): Current state.
            epsilon (float): Exploration rate.

        Returns:
            int: Selected action.
        """
        if random.random() < epsilon:
            action = random.randrange(self.num_actions)
        else:
            q_values = self.q_network(state).detach().cpu().numpy()
            action = np.argmax(q_values)
            del q_values
        return action

    def calculate_loss(self, states, actions, rewards, next_states, dones):
        """
        Calculates the loss for a batch of experiences.

        Args:
            states (torch.Tensor): Batch of states.
            actions (torch.Tensor): Batch of actions.
            rewards (torch.Tensor): Batch of rewards.
            next_states (torch.Tensor): Batch of next states.
            dones (torch.Tensor): Batch of done flags.

        Returns:
            torch.Tensor: Loss value.
        """
        tmp = self.q_network(states)
        rewards = rewards.to(self.device)
        q_values = tmp[range(states.shape[0]), actions.long()]
        default = rewards + self.gamma * self.target_network(next_states).max(dim=1)[0]
        # Use the plain reward as the target for terminal transitions, the bootstrapped value otherwise
        target = torch.where(dones.to(self.device), rewards, default).to(self.device).detach()
        return F.mse_loss(target, q_values)

    def reset(self):
        """
        Resets the target network to match the Q-network.
        """
        self.target_network.load_state_dict(self.q_network.state_dict())

    def train(self, batch_size):
        """
        Performs a single learning step using a batch of experiences.

        Args:
            batch_size (int): Size of the batch.

        Returns:
            float: Loss value.
        """
        if batch_size < len(self.replay_buffer):
            states, actions, rewards, next_states, dones = self.replay_buffer.sample(batch_size)
            loss = self.calculate_loss(states, actions, rewards, next_states, dones)
            self.optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), max_norm=20, norm_type=2)
            self.optimizer.step()
            return loss.item()
        return 0
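
For reference, the target that calculate_loss regresses onto is the standard one-step DQN target (the torch.where on dones handles the terminal case):

    y = r + gamma * max_a' Q_target(s', a')   if not done
    y = r                                     if done
    loss = (y - Q(s, a))^2

The target is detached so gradients only flow through the Q-network, and the target network is only refreshed every reset_network_interval frames via reset().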

Command-line argument handling in main.py

In main, the arguments are handled as follows:

import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--env_name", type=str, default='PongNoFrameskip-v4', help="Name of the environment")
    parser.add_argument("--gamma", type=float, default=0.99, help="Discount factor")
    parser.add_argument("--lr", type=float, default=3e-4, help="Learning rate")
    parser.add_argument("--memory_size", type=int, default=100000, help="Size of the replay buffer")
    parser.add_argument("--total_frame", type=int, default=5000000, help="Total number of frames to train")
    parser.add_argument("--eps-max", type=float, default=1, help="Max epsilon value")
    parser.add_argument("--eps-min", type=float, default=0.02, help="Min epsilon value")
    args = parser.parse_args()
    # epsilon_begin / epsilon_end are module-level values used by the epsilon() schedule
    epsilon_begin = args.eps_max
    epsilon_end = args.eps_min
    train(env_name=args.env_name,
          learning_rate=args.lr,
          gamma=args.gamma,
          memory_size=args.memory_size,
          total_frame=args.total_frame)

This makes it possible to try different hyperparameters just by changing the command-line arguments, for example python main.py --lr 1e-4 --eps-min 0.01.

Plotting with TensorBoard

First import the library and create a writer:

from torch.utils.tensorboard import SummaryWriter


writer = SummaryWriter(log_dir=f'./logs/SESSION_NAME')

The code below then plots the reward and loss curves and logs the current network parameter values and gradients:

if frame_num % print_interval == 0:
    cur_reward = -22  # below Pong's minimum score of -21, used as a placeholder before any episode has finished
    if len(Reward) > 0:
        cur_reward = np.mean(Reward[-10:])  # mean reward over the last 10 episodes
    writer.add_scalar('loss', loss, frame_num)
    writer.add_scalar('reward', cur_reward, frame_num)
    writer.add_scalar('epsilon', eps, frame_num)
    if len(DQNAgent.replay_buffer) > replay_start_size:
        for name, param in DQNAgent.q_network.named_parameters():
            writer.add_histogram(tag=name + '_grad', values=param.grad, global_step=frame_num // 1000)
            writer.add_histogram(tag=name + '_data', values=param.data, global_step=frame_num // 1000)

Results and hyperparameter study

Results

With the following configuration:

batch_size = 32
learning_rate = 3e-4
epsilon_begin = 1.0
epsilon_end = 0.2
epsilon_decay = 200000
epsilon_min = 0.001
alpha = 0.95
replay_start_size = 10000
update = 1000
print_interval = 1000

and the Adam optimizer, the final reward converges to above 20.


The plots here are shown without any smoothing.

Hyperparameter study

Effect of epsilon

First, with eps_min set to 0.2 and all other parameters unchanged, the reward only converges to below 10 and the loss cannot drop any further:

With eps_min at 0.01 or 0.02 there is no significant difference in reward, but 0.01 gives a noticeably better loss:


In the plot, blue is 0.01 and grey is 0.02.

Effect of gamma

I also explored the effect of the discount factor gamma. Around 0.99 (0.988-0.992) there is no clear difference in convergence speed or loss, but for values further away, namely 0.995, 0.999, and 0.95, the loss explodes and the results are poor:

Results with gamma = 0.999


Results with gamma = 0.995


The three runs with no significant difference (gamma around 0.99)