Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective
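The term "intrinsically motivated" was first introduced in a 1950 paper, which argued that an intrinsic manipulation drive is needed to explain why monkeys will work energetically and persistently for hours to solve complicated mechanical puzzles without any extrinsic reward. Intrinsically motivated behavior is vital to intellectual growth.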
Defining interestingness: Lenat's AM system [18], for example, focused on heuristic definitions of “interestingness.”
Building a curiosity framework: Schmidhuber [32] [33] [34] [35] [36] [37] introduced methods for implementing forms of curiosity using the framework of computational reinforcement learning (RL) [47].
Reward function design: Other researchers have reported interesting results of computational experiments involving evolutionary search for RL reward functions [1], [8], [19], [31], [43], but they did not directly address the motivational issues on which we focus.
Searching for intrinsic reward: Uchibe and Doya [51] do address intrinsic reward in an evolutionary context, but their aim and approach differ significantly from ours.
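The work closest to ours is that of Elfwing et al. [11], in which a genetic algorithm is used to search for shaping rewards [23] and other learning-algorithm parameters that improve the performance of an RL learning system.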
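The article [3] uses the term intrinsic reward and combines it with the reinforcement learning framework. An intrinsic reward is a reward function generated by intrinsic motivation; an extrinsic reward is the reward function of the conventional RL framework.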
A reward signal is the signal emitted by reward neurons in the brain.
Schultz [38], [39] writes that “Rewards are objects or events that make us come back for more,” whereas reward signals are produced by reward neurons in the brain.
The environment in RL should be divided into an external environment and an internal environment.
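Primary and Secondary Reward: the point of intrinsic motivation is to serve as a secondary reward signal that works together with the primary, extrinsic reward signal to shape behavior.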
Among the most influential theories of motivation in psychology is the drive theory of Hull [13] [14] [15]. According to Hull's theory, all behavior is motivated either by an organism's survival and reproductive needs giving rise to primary drives (such as hunger, thirst, sex, and the avoidance of pain), or by derivative drives that have acquired their motivational significance through learning. Primary drives are the result of physiological deficits—“tissue needs”— and they energize behavior whose result is to reduce the deficit. A key additional feature of Hull's theory is that a need reduction, and hence a drive reduction, acts as a primary reinforcer for learning: behavior that reduces a primary drive is reinforced. Additionally, through the process of secondary reinforcement in which a neutral stimulus is paired with a primary reinforcer, the formerly neutral stimulus becomes a secondary reinforcer, i.e., acquires the reinforcing power of the primary reinforcer. In this way, stimuli that predict primary reward, i.e., predict a reduction in a primary drive, become rewarding themselves. According to this influential theory (in its several variants), all behavior is energized and directed by its relevance to primal drives, either directly or as the result of learning through secondary reinforcement.
The framework uses an explicit fitness function and some distribution of environments of interest. The fitness can take forms such as the cumulative sum of extrinsic rewards.
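A 6x6 grid world is divided into four 3x3 subspaces, and the walls between the subspaces are not fully closed. Two of the subspaces contain boxes, one open and one closed (box locations do not change during the agent's lifetime). An open box closes with probability 0.1 at each time step, and a closed box always contains food. In this maze the agent must find the boxes and obtain food from a closed box. The agent can move in four directions: up, down, left, and right.
When the agent eats food, it feels satiated for one time step; it is hungry at all other time steps. Each time the agent eats food, its fitness increases by 1.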
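- constant condition: food always appears in closed boxes throughout the agent's lifetime of 10,000 steps;
- step condition: the agent's lifetime is 20,000 steps, and food appears in closed boxes only after the first 10,000 steps.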
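The two conditions and the fitness rule can be sketched in a few lines. This is a minimal illustration based only on these notes, not the paper's simulator: the function names, the 0-indexed step counter, and the representation of a history as a list of event strings are all assumptions made for the example, and the grid, walls, and agent actions are not modeled.

```python
import random

P_CLOSE = 0.1  # probability that an open box closes on a given step (from the notes)


def food_available(t, condition):
    """Whether closed boxes contain food at step t (0-indexed; indexing is an assumption)."""
    if condition == "constant":
        return True            # food throughout the 10,000-step lifetime
    if condition == "step":
        return t >= 10_000     # no food during the first 10,000 of 20,000 steps
    raise ValueError(f"unknown condition: {condition!r}")


def step_box(is_open, rng, p_close=P_CLOSE):
    """Advance one box by one time step; only the random closing is modeled."""
    if is_open and rng.random() < p_close:
        return False
    return is_open


def fitness(history):
    """Fitness of a lifetime: +1 for every step on which the agent ate.

    `history` is a hypothetical list of event strings such as "move" or "eat".
    """
    return sum(1 for event in history if event == "eat")
```

For instance, `fitness(["move", "eat", "move", "eat"])` counts two eating events, and under the step condition `food_available` only becomes true in the second half of the lifetime.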
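- $A$: the agent
- $\mathcal{R}_A$: a space of reward functions
- $r_A \in \mathcal{R}_A$: a specific reward function
- $E$: a sampled environment
- $h$: the history of agent $A$ adapting to environment $E$ over its lifetime using the reward function $r_A$, i.e., $h \sim \langle A(r_A), E \rangle$
- $F$: the fitness function, which produces a scalar evaluation $F(h)$ for each history $h$
- $r_A^*$: the optimal reward function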
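In this framework the optimal reward function is the member of the reward space that maximizes expected fitness over sampled environments and the histories they produce, $r_A^* = \arg\max_{r_A \in \mathcal{R}_A} \mathbb{E}_{E}\, \mathbb{E}_{h \sim \langle A(r_A), E \rangle} [F(h)]$. For a finite reward space this can be approximated by exhaustive search, as in the sketch below. The function name and the three callables (`sample_environment`, `run_lifetime`, `fitness`) are hypothetical placeholders for this example, and the number of sampled environments is an arbitrary choice.

```python
import random
from statistics import mean


def search_optimal_reward(reward_space, sample_environment, run_lifetime,
                          fitness, n_envs=20, seed=0):
    """Return (best_reward, best_mean_fitness) over a finite reward space.

    reward_space:        iterable of candidate reward functions (or parameters)
    sample_environment:  rng -> environment
    run_lifetime:        (reward, environment, rng) -> history
    fitness:             history -> scalar
    """
    rng = random.Random(seed)
    # Sample environments once so every candidate is scored on the same set.
    envs = [sample_environment(rng) for _ in range(n_envs)]
    best, best_score = None, float("-inf")
    for reward in reward_space:
        # Monte Carlo estimate of expected fitness for this candidate.
        scores = [fitness(run_lifetime(reward, env, rng)) for env in envs]
        score = mean(scores)
        if score > best_score:
            best, best_score = reward, score
    return best, best_score
```

Note that the agent only ever sees the candidate reward during its lifetime; fitness is used solely by the outer search to rank candidates.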
- Agent: the learning algorithm is lookup-table $\epsilon$-greedy Q-learning [52].
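- Environment: a three-corridor foraging environment in which a worm appears each time at the end of one randomly chosen corridor. Because the worm's location is random and unknown each time, the agent's past experience cannot directly guide the next search, so the environment is non-Markovian.
- Task: the agent must eat as many worms as possible within a fixed lifetime of 10,000 steps.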
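The lookup-table $\epsilon$-greedy Q-learning agent mentioned above can be sketched as follows. This is a generic textbook version, not the authors' code: the class name, the hyperparameter defaults, the zero initialization of the table, and the random tie-breaking are assumptions for illustration.

```python
import random
from collections import defaultdict


class QLearningAgent:
    """Tabular Q-learning with epsilon-greedy action selection."""

    def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha      # step size (assumed value)
        self.gamma = gamma      # discount factor (assumed value)
        self.epsilon = epsilon  # exploration probability (assumed value)
        self.rng = random.Random(seed)
        self.q = defaultdict(float)  # lookup table keyed by (state, action)

    def act(self, state):
        # With probability epsilon take a uniformly random action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        # Otherwise act greedily, breaking ties at random.
        best = max(self.q[(state, a)] for a in self.actions)
        ties = [a for a in self.actions if self.q[(state, a)] == best]
        return self.rng.choice(ties)

    def update(self, state, action, reward, next_state):
        # One-step Q-learning backup toward reward + gamma * max_a' Q(s', a').
        target = reward + self.gamma * max(
            self.q[(next_state, a)] for a in self.actions
        )
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

The `reward` passed to `update` is whatever the candidate reward function produces, which is how a searched reward function shapes the agent's behavior.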
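$n_{s,a,s'}$ is the number of times the agent transitioned to $s'$ after choosing $a$ in state $s$, and $n_{s,a}$ is the number of times it chose $a$ in state $s$.
Reward function: $r_A(s, a, h) = \beta_F \phi_F(s) + \beta_c \phi_c(s, a, h)$, where $\beta_F$ and $\beta_c$ are weight coefficients. The feature $\phi_F(s)$ is 1 when the agent is satiated and 0 otherwise. The feature $\phi_c(s, a, h) = 1 - \frac{1}{c(s, a, h)}$, where $c(s, a, h)$ is the number of time steps since the agent last executed action $a$ in state $s$ within history $h$.
When the parameter $\beta_c$ is positive, the agent is rewarded for taking actions that it has not taken recently from the current state. This reward is not a fixed (stationary) function of the external environment. The feature $\phi_F$ is a hunger-status feature, and thus when $\beta_F = 1$ and $\beta_c = 0$, the reward function is the fitness-based reward function.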
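A minimal sketch of these two pieces follows: the visit counts, and a two-feature reward that adds a weighted inverse-recency bonus to a weighted satiation indicator. Several details are assumptions made for this example rather than anything stated in the notes: the class names, the use of the counts to form an empirical transition estimate $n_{s,a,s'} / n_{s,a}$, and the treatment of a never-tried state-action pair as maximally stale (bonus feature equal to 1).

```python
from collections import defaultdict


class CountModel:
    """Visit counts n[s,a,s'] and n[s,a], with an empirical transition estimate."""

    def __init__(self):
        self.n_sas = defaultdict(int)
        self.n_sa = defaultdict(int)

    def observe(self, s, a, s_next):
        self.n_sas[(s, a, s_next)] += 1
        self.n_sa[(s, a)] += 1

    def prob(self, s, a, s_next):
        # Assumed use of the counts: relative frequency of s' after (s, a).
        n = self.n_sa.get((s, a), 0)
        return self.n_sas.get((s, a, s_next), 0) / n if n else 0.0


class InverseRecencyReward:
    """r = beta_f * phi_f + beta_c * phi_c, with phi_c = 1 - 1/c(s, a, h)."""

    def __init__(self, beta_f, beta_c):
        self.beta_f = beta_f
        self.beta_c = beta_c
        self.last_taken = {}  # (s, a) -> time step of the most recent execution

    def __call__(self, s, a, satiated, t):
        """Reward for taking action a in state s at time step t (t must increase)."""
        phi_f = 1.0 if satiated else 0.0
        last = self.last_taken.get((s, a))
        if last is None:
            phi_c = 1.0  # assumption: a never-tried pair counts as maximally stale
        else:
            if t <= last:
                raise ValueError("time step t must strictly increase")
            phi_c = 1.0 - 1.0 / (t - last)  # c = steps since (s, a) was last taken
        self.last_taken[(s, a)] = t
        return self.beta_f * phi_f + self.beta_c * phi_c
```

With `InverseRecencyReward(1.0, 0.0)` the bonus term vanishes and the reward reduces to the satiation indicator, which matches the fitness-based special case described above. Because the reward depends on the agent's own history through `last_taken`, the same state-action pair can yield different rewards at different times.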