Continuous Control with Deep Reinforcement Learning
Paper link: https://arxiv.org/abs/1509.02971v2
Preface
Reinforcement Learning is a column started in September 2019. It aims to build a broad, professional view for future RL study and to contribute to the generation of new ideas.
Method
Significance: DDPG appeared at ICLR 2016. As a successor to Deep Q-learning, it extends Q-learning from discrete action spaces to continuous action spaces and realizes an end-to-end (image-to-action) decision framework.
Challenge: DDPG provides an algorithmic framework that takes unprocessed, high-dimensional sensory signals as input and solves complex tasks. "One of the primary goals of the field of artificial intelligence is to solve complex tasks from unprocessed, high-dimensional, sensory input."
Limitation of DQN: "However, while DQN solves problems with high-dimensional observation spaces, it can only handle discrete and low-dimensional action spaces."
Target application scenarios: "Many tasks of interest, most notably physical control tasks, have continuous (real valued) and high dimensional action spaces."
One way to apply DQN to a continuous action space is to discretize the actions, but this greatly inflates the size of the action space and makes decision-making harder. For example, for a robot with 3 degrees of freedom, discretizing each joint angle into 10 values already yields an action space of $10^3 = 1000$ discrete actions. The problem becomes even more severe when fine control is required ("the number of actions increases exponentially with the number of degrees of freedom"). Moreover, such naive discretization throws away information about the structure of the action space, which may be essential for solving the problem.
Essence of DDPG: "In this work we present a model-free, off-policy actor-critic algorithm using deep function approximators that can learn policies in high-dimensional, continuous action spaces."
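To make the combinatorial explosion concrete, here is a small Python sketch (my own illustration, not from the paper) that enumerates the discretized action set for the 3-DOF example above:

```python
import itertools

# Hypothetical numbers from the example above: a 3-DOF robot with each
# joint angle discretized into 10 values.
num_joints = 3
angles_per_joint = 10

# Every combination of joint settings becomes one discrete action for DQN.
discrete_actions = list(itertools.product(range(angles_per_joint), repeat=num_joints))
print(len(discrete_actions))           # 1000 == 10 ** 3
print(angles_per_joint ** num_joints)  # grows exponentially with the number of joints
```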
DDPG is built on the DPG algorithm. Because DPG merely adds a plain neural-network function approximator, the algorithm on its own is not stable.
DQN is able to learn value functions using such function approximators in a stable and robust way due to two innovations:
- the network is trained off-policy with samples from a replay buffer to minimize correlations between samples (a minimal buffer sketch follows this list);
- the network is trained with a target Q network to give consistent targets during temporal difference backups.
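As an illustration only (class and parameter names are my own, not the authors'), the replay buffer can be as simple as a bounded deque from which random minibatches are drawn:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity=1_000_000):
        # A bounded deque silently discards the oldest transitions when full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions, which is what stabilizes the value updates.
        return random.sample(self.buffer, batch_size)
```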
In the experiments, the observation is either a low-dimensional state description (e.g., Cartesian coordinates or joint angles) or the raw camera frames themselves: "using both low-dimensional observations (e.g. joint angles) and directly from pixels."
"Here, we assumed the environment is fully-observed so $s_t = x_t$."
The two ingredients that make training stable: replay buffer + soft target updates.
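Below is a minimal sketch of the soft ("Polyak") target update, assuming each network is represented simply as a list of NumPy weight arrays (the representation and names are my own simplification):

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.001):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    for target_w, online_w in zip(target_params, online_params):
        # In-place update so the target network drifts slowly towards the
        # online network instead of being copied over all at once (as in DQN).
        target_w *= (1.0 - tau)
        target_w += tau * online_w

# Hypothetical usage: each network represented as a list of weight matrices.
online = [np.random.randn(4, 64), np.random.randn(64, 1)]
target = [w.copy() for w in online]
soft_update(target, online, tau=0.001)
```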
DDPG algorithm

Randomly initialize critic network $Q(s, a|\theta^Q)$ and actor $\mu(s|\theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$
Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
Initialize replay buffer $R$
For episode $= 1, M$ do
---- Initialize a random process $\mathcal{N}$ for action exploration
---- Receive initial observation state $s_1$
---- For $t = 1, T$ do
---- ---- Select action $a_t = \mu(s_t|\theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise
---- ---- Execute action $a_t$ and observe reward $r_t$ and observe new state $s_{t+1}$
---- ---- Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
---- ---- Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
---- ---- Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$
---- ---- Update critic by minimizing the loss: $L = \frac{1}{N}\sum_i \left(y_i - Q(s_i, a_i|\theta^Q)\right)^2$
---- ---- Update the actor policy using the sampled policy gradient:
---- ---- ---- $\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a|\theta^Q)\big|_{s=s_i, a=\mu(s_i)} \nabla_{\theta^\mu} \mu(s|\theta^\mu)\big|_{s_i}$
---- ---- Update the target networks:
---- ---- ---- $\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'}$
---- ---- ---- $\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}$
---- end For
end For
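To connect the pseudocode to an implementation, here is a hedged PyTorch sketch of a single DDPG update on one sampled minibatch. It assumes actor(s), critic(s, a) and their target copies are ordinary torch.nn.Module networks with matching torch.optim optimizers; terminal-state masking and exploration noise are omitted for brevity, and all names are mine rather than the authors'.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.001):
    """One DDPG gradient step on a minibatch (a sketch, not the authors' code)."""
    states, actions, rewards, next_states = batch  # torch.Tensors of shape (N, ...)

    # Critic target: y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})).
    with torch.no_grad():
        y = rewards + gamma * target_critic(next_states, target_actor(next_states))

    # Update critic by minimizing the mean squared Bellman error L.
    critic_loss = F.mse_loss(critic(states, actions), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Update actor with the sampled deterministic policy gradient:
    # ascend Q(s, mu(s)), i.e. minimize its negative mean.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft-update both target networks: theta' <- tau*theta + (1 - tau)*theta'.
    for target_net, online_net in ((target_critic, critic), (target_actor, actor)):
        for t_p, p in zip(target_net.parameters(), online_net.parameters()):
            t_p.data.mul_(1.0 - tau).add_(tau * p.data)
```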
Summary
Somehow, after reading the paper, DDPG turned out to be less complicated than I had imagined: there is no heavy derivation, and the architecture is simple and clear, while the extensive experiments demonstrate the generality of the algorithm. Perhaps these are exactly the qualities that make an RL algorithm valuable.