Reinforcement Learning | Deep Deterministic Policy Gradient algorithm (DDPG)

Continuous Control with Deep Reinforcement Learning

Paper link: https://arxiv.org/abs/1509.02971v2

Preface

Reinforcement Learning is a column started in September 2019. It aims to provide a broad professional perspective for future RL study and to contribute to the generation of new ideas.

Method

  • Significance: DDPG was introduced at ICLR 2016. As a successor to Deep Q-learning, it extends Q-learning from discrete action spaces to continuous action spaces and realizes an end-to-end (image to action) decision-making framework.

  • Challenge: DDPG provides an algorithmic framework that solves complex tasks directly from unprocessed, high-dimensional sensory signals. One of the primary goals of the field of artificial intelligence is to solve complex tasks from unprocessed, high-dimensional, sensory input.

  • Limitations of DQN: However, while DQN solves problems with high-dimensional observation spaces, it can only handle discrete and low-dimensional action spaces.

  • Target application scenarios: Many tasks of interest, most notably physical control tasks, have continuous (real valued) and high dimensional action spaces.

  • If one wants to use DQN in a continuous action space, one option is to discretize the actions, but this greatly enlarges the action space and makes decision-making harder. For example, for a robot with 3 degrees of freedom, discretizing each joint angle into 10 values already yields an action space of size $10^3 = 1000$. The problem becomes even worse when fine control is required (the number of actions increases exponentially with the number of degrees of freedom). Moreover, such naive discretization throws away information about the structure of the action space, which may be important for solving the problem.

  • The essence of DDPG: In this work we present a model-free, off-policy actor-critic algorithm using deep function approximators that can learn policies in high-dimensional, continuous action spaces.

  • DDPG builds on the DPG algorithm; because DPG only adds a simple neural-network function approximator, it is not stable by itself.

  • DQN is able to learn value functions using such function approximators in a stable and robust way due to two innovations:

    • the network is trained off-policy with samples from a replay buffer to minimize correlations between samples;
    • the network is trained with a target Q network to give consistent targets during temporal difference backups.
  • In the experiments, the observation is either low-dimensional information extracted from the scene (e.g. Cartesian coordinates or joint angles) or the raw camera image itself: using both low-dimensional observations (e.g. joint angles) and directly from pixels.

  • Here, we assumed the environment is fully-observed so $s_t = x_t$.

  • Replay buffer + soft updates (a minimal code sketch of both mechanisms follows this list).
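
To make these two ingredients concrete, here is a minimal sketch of a uniform-sampling replay buffer and of the soft target update $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$. It assumes PyTorch, and the names (`ReplayBuffer`, `soft_update`) and default values are illustrative, not taken from the paper:

```python
import random
from collections import deque

import numpy as np
import torch


class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s') transitions; uniform sampling
    breaks the temporal correlations between consecutive samples."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return (torch.as_tensor(np.asarray(states), dtype=torch.float32),
                torch.as_tensor(np.asarray(actions), dtype=torch.float32),
                torch.as_tensor(np.asarray(rewards), dtype=torch.float32).unsqueeze(1),
                torch.as_tensor(np.asarray(next_states), dtype=torch.float32))


def soft_update(target_net, online_net, tau=1e-3):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta',
    so the target network drifts slowly and gives consistent TD targets."""
    for p_targ, p in zip(target_net.parameters(), online_net.parameters()):
        p_targ.data.copy_(tau * p.data + (1.0 - tau) * p_targ.data)
```

Compared with DQN's periodic hard copy of the target network, the soft update with a small $\tau$ changes the targets slowly, which is what keeps learning with deep function approximators stable.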

DDPG algorithm


Randomly initialize critic network $Q(s, a \mid \theta^Q)$ and actor $\mu(s \mid \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$

Initialize target network $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$

Initialize replay buffer $R$

For episode $= 1, \dots, M$ do

---- Initialize a random process $\mathcal{N}$ for action exploration

---- Receive initial observation state $s_1$

---- For $t = 1, \dots, T$ do

---- ---- Select action $a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise

---- ---- Execute action $a_t$ and observe reward $r_t$ and observe new state $s_{t+1}$

---- ---- Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$

---- ---- Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$

---- ---- Set $y_i = r_i + \gamma Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$

---- ---- Update critic by minimizing the loss: $L = \frac{1}{N} \sum_i \big(y_i - Q(s_i, a_i \mid \theta^Q)\big)^2$

---- ---- Update the actor policy using the sampled policy gradient:

---- ---- ---- $\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s = s_i, a = \mu(s_i)} \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s_i}$

---- ---- Update the target networks:

---- ---- ---- $\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau) \theta^{Q'}$

---- ---- ---- $\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau) \theta^{\mu'}$

---- end For

end For
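
The inner loop above maps almost line by line onto code. The following is a sketch of one DDPG update step in PyTorch; the 400/300-unit MLPs mirror the paper's low-dimensional setting, while the class names, the `ddpg_update` helper and the hyperparameter defaults are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Deterministic policy mu(s | theta_mu); tanh bounds the action."""

    def __init__(self, obs_dim, act_dim, act_limit=1.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 400), nn.ReLU(),
                                 nn.Linear(400, 300), nn.ReLU(),
                                 nn.Linear(300, act_dim), nn.Tanh())
        self.act_limit = act_limit

    def forward(self, s):
        return self.act_limit * self.net(s)


class Critic(nn.Module):
    """Action-value function Q(s, a | theta_Q)."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 400), nn.ReLU(),
                                 nn.Linear(400, 300), nn.ReLU(),
                                 nn.Linear(300, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


def ddpg_update(batch, actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, gamma=0.99, tau=1e-3):
    """One gradient step of the inner loop; `batch` is (s, a, r, s')
    with r shaped (N, 1), e.g. as returned by the replay buffer above."""
    s, a, r, s2 = batch

    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        y = r + gamma * critic_targ(s2, actor_targ(s2))

    # Critic: minimize L = 1/N * sum_i (y_i - Q(s_i, a_i))^2
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: the sampled policy gradient is the gradient of the average
    # Q(s, mu(s)) w.r.t. theta_mu, so minimize its negation.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target updates: theta' <- tau * theta + (1 - tau) * theta'
    for targ, net in ((critic_targ, critic), (actor_targ, actor)):
        for p_targ, p in zip(targ.parameters(), net.parameters()):
            p_targ.data.copy_(tau * p.data + (1.0 - tau) * p_targ.data)
```

Note that the exploration noise $\mathcal{N}_t$ (the paper uses an Ornstein-Uhlenbeck process) is added only when selecting actions in the environment, not inside this update step.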


Summary

Somehow, after reading the paper, DDPG does not feel as complex as I had imagined: there is no lengthy derivation, the architecture is simple and clear, and the large number of experiments in the paper demonstrates the algorithm's generality. Perhaps these are exactly the qualities that make an RL algorithm valuable.