
DRN: A Deep Reinforcement Learning Framework for News Recommendation

Paper link: http://www.personal.psu.edu/~gjz5038/paper/www2018_reinforceRec/www2018_reinforceRec.pdf

Main content

  • Regarding the dynamic nature of news and of user preferences, current methods mainly suffer from three problems:

    • Models only consider the current reward (e.g., CTR, click-through rate)
    • Few models consider using other forms of user feedback beyond the click / no click labels (e.g., how frequently users return to the service)
    • Models tend to keep recommending similar items, which makes users bored
  • Some typical approaches:

    • Content-based methods [19, 22, 33]
    • Collaborative filtering based methods [11, 28, 34]
    • Hybrid methods [12, 24, 25]
    • Deep learning models [8, 45, 52]
  • Three challenges:

    • First, the dynamic changes in news recommendation are difficult to handle.
      • First, news becomes outdated very fast.
      • Second, users’ interest in different news might evolve over time.
    • Second, current recommendation methods [23, 35, 36, 43] usually only consider the click / no click labels or ratings as users’ feedback.
    • Third, current recommendation methods tend to keep recommending similar items to users, which might decrease users’ interest in similar topics.
  • Typical reinforcement-learning recommendation methods use ε-greedy or Upper Confidence Bound (UCB) strategies to increase the exploration ability of the recommender.

    However, ε-greedy may harm the performance of the current recommendation, and UCB can only obtain a relatively accurate reward estimate for an item after it has been tried several times. A more efficient exploration strategy is therefore needed. (Minimal sketches of these two classical strategies are given below.)
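For contrast, here are minimal sketches of the two classical exploration strategies mentioned above. This is toy NumPy code, not from the paper; all names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(predicted_scores, eps=0.1):
    """With probability eps recommend a uniformly random candidate, which may be
    completely unrelated to the user's interests and hurt current performance."""
    if rng.random() < eps:
        return int(rng.integers(len(predicted_scores)))
    return int(np.argmax(predicted_scores))

def ucb(total_rewards, counts, t):
    """UCB scores an item by its empirical mean reward plus an uncertainty bonus;
    the mean only becomes reliable after the item has been tried several times,
    which is slow when the candidate pool (news) changes quickly."""
    mean = total_rewards / np.maximum(counts, 1)
    bonus = np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(counts, 1))
    return int(np.argmax(mean + bonus))

# toy usage: 10 candidates, no click history yet
scores = rng.normal(size=10)
print(epsilon_greedy(scores), ucb(np.zeros(10), np.zeros(10), t=1))
```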

  • The Dueling Bandit Gradient Descent (DBGD) algorithm is used as the exploration strategy: it randomly chooses the items to recommend from the candidate items or from their neighborhood. This exploration strategy avoids recommending completely unrelated items and therefore preserves the accuracy of the current policy.

  • Model framework diagram:

  • The algorithm is updated every hour: the agent uses the logs stored in memory to update its recommendation model.

  • Main contributions:

    • User activeness is taken into account to improve recommendation accuracy; it provides additional information beyond simply using users' click labels.
    • A more effective exploration method is applied, which avoids the drop in recommendation accuracy caused by classical exploration methods (e.g., ε-greedy and Upper Confidence Bound).
    • The system has been deployed online in a commercial news recommendation application. Extensive offline and online experiments show the superior performance of the method.
  • Characteristics of some recommendation methods:

    • Content-based methods: model the frequency of content features and recommend similar content;
    • Collaborative filtering methods: make predictions using the user's own past ratings, the ratings of similar users, or a combination of both;
    • Hybrid methods: improve the modeling of user profiles;
    • Deep learning models: model complex user-item relationships.

Algorithm and model:

  • Algorithm steps (a minimal code sketch of this loop follows the list):

    1. PUSH: At each timestamp (t1, t2, t3, t4, t5, ...), when a user sends a news request to the system, the recommendation agent G takes the feature representation of the current user and the news candidates as input, and generates a top-k list L of news to recommend. L is generated by combining the exploitation of the current model (discussed in Section 4.3) and the exploration of novel items (discussed in Section 4.5).
    2. FEEDBACK: User u who has received the recommended news list L gives feedback B through their clicks on this set of news.
    3. MINOR UPDATE: After each timestamp (e.g., after timestamp t1), with the feature representation of the previous user u, the news list L, and the feedback B, agent G updates the model by comparing the recommendation performance of the exploitation network Q and the exploration network Q˜ (discussed in Section 4.5). If Q˜ gives a better recommendation result, the current network is updated towards Q˜; otherwise, Q is kept unchanged. A minor update can happen after every recommendation impression.
    4. MAJOR UPDATE: After a certain period of time T_R (e.g., after timestamp t3), agent G uses the user feedback B and the user activeness stored in the memory to update the network Q. Here, the experience replay technique [31] is used to update the network. Specifically, agent G maintains a memory of recent historical click and user activeness records. When an update happens, agent G samples a batch of records to update the model. A major update usually happens after a certain time interval, such as one hour, during which thousands of recommendation impressions are conducted and their feedback is collected.
    5. Repeat steps (1)-(4).
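Below is a minimal, self-contained sketch of this loop in toy Python. A linear scorer stands in for the Q network, the clicks are simulated, and all names and constants are illustrative assumptions rather than the paper's implementation; the exploration/interleaving step is sketched in more detail in the DBGD section further down.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, K = 16, 3
w = rng.normal(scale=0.1, size=DIM)            # exploitation network Q (toy linear model)
memory = []                                    # replay memory used by the MAJOR UPDATE

def rank(weights, feats):
    """Rank candidate items by their score under the given weights."""
    return list(np.argsort(-(feats @ weights)))

for t in range(100):                           # impressions between two major updates
    feats = rng.normal(size=(20, DIM))         # features of the candidate news for this request

    # 1. PUSH: combine exploitation (Q) with exploration (a perturbed copy Q~).
    w_tilde = w + 0.1 * rng.uniform(-1, 1, size=DIM) * w
    top_q, top_qt = rank(w, feats)[:K], rank(w_tilde, feats)[:K]
    pushed = list(dict.fromkeys(top_q + top_qt))[:K]   # crude stand-in for interleaving

    # 2. FEEDBACK: clicks on the pushed list (simulated here).
    clicks = {i: int(rng.random() < 0.2) for i in pushed}

    # 3. MINOR UPDATE: move towards Q~ only if its items collected more clicks.
    reward_q = sum(clicks[i] for i in pushed if i in top_q)
    reward_qt = sum(clicks[i] for i in pushed if i in top_qt)
    if reward_qt > reward_q:
        w = w + 0.5 * (w_tilde - w)

    memory.extend((feats[i], clicks[i]) for i in pushed)

# 4. MAJOR UPDATE (every T_R, e.g. one hour): experience replay on a sampled batch.
for x, r in (memory[i] for i in rng.choice(len(memory), size=64, replace=False)):
    w += 0.01 * (r - x @ w) * x                # one SGD step towards the observed reward (toy)
```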
  • Feature design (a quick dimension check in code follows this list):

    • News features include 417-dimensional one-hot features that describe whether certain properties appear in this piece of news, including headline, provider, ranking, entity name, category, topic category, and the click counts in the last 1 hour, 6 hours, 24 hours, 1 week, and 1 year respectively.

    • User features mainly describe the features (i.e., headline, provider, ranking, entity name, category, and topic category) of the news that the user clicked in the last 1 hour, 6 hours, 24 hours, 1 week, and 1 year respectively. There is also a total click count for each time granularity. Therefore, there are 413 × 5 = 2065 dimensions in total.

    • User-news features. These 25-dimensional features describe the interaction between the user and one particular piece of news, i.e., the frequency with which the entity (as well as the category, topic category, and provider) appears in the user's reading history.

    • Context features. These 32-dimensional features describe the context in which a news request happens, including the time, the weekday, and the freshness of the news (the gap between the request time and the news publish time).
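As a quick sanity check on these dimensions, here is a small sketch. The grouping of user and context features into the state, and of news and user-news features into the action, follows my reading of the paper; the variable names are mine.

```python
import numpy as np

# Toy placeholder vectors with the dimensions listed above.
news_features      = np.zeros(417)    # one-hot news properties + click counts
user_features      = np.zeros(2065)   # 413 x 5 time granularities of clicked-news features
user_news_features = np.zeros(25)     # user-news interaction frequencies
context_features   = np.zeros(32)     # time, weekday, freshness

# State = user + context features; action = news + user-news features (assumed split).
state   = np.concatenate([user_features, context_features])     # 2097 dims
action  = np.concatenate([news_features, user_news_features])   # 442 dims
q_input = np.concatenate([state, action])                        # 2539 dims
print(len(state), len(action), len(q_input))
```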

  • Model framework, including an offline part and an online part

Model framework

Q network
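A small sketch of the dueling-style structure that the paper describes for the Q network: a value function V(s) computed from the user and context features, and an advantage function A(s, a) computed from all four feature groups, with Q(s, a) = V(s) + A(s, a). The layer sizes and helper names below are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dims):
    """Toy fully connected network: a list of weight matrices, ReLU between layers."""
    return [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    for i, W in enumerate(layers):
        x = x @ W
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)    # ReLU on hidden layers
    return x

# Layer sizes are illustrative, not the paper's hyper-parameters.
value_net     = mlp([2097, 64, 1])    # V(s): user + context features
advantage_net = mlp([2539, 64, 1])    # A(s, a): all four feature groups

def q_value(state, action):
    """Dueling-style combination Q(s, a) = V(s) + A(s, a)."""
    return forward(value_net, state) + forward(advantage_net, np.concatenate([state, action]))

print(q_value(np.zeros(2097), np.zeros(442)))
```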

User Activeness

User activeness estimation: the maximum user activeness is truncated to 1.
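Schematically, the paper estimates activeness with a survival-style model: the activeness score decays over time and jumps by a fixed amount each time the user returns, truncated at 1. In my own notation (the exact parameterization is in the paper), with hazard rate $\lambda(t)$, initial activeness $S_0$, and a per-return increment $S_a$:

$$ S(t) = S_0 \, e^{-\int_0^t \lambda(x)\,dx}, \qquad S(t) \leftarrow \min\big(S(t) + S_a,\ 1\big) \ \text{at each user return}. $$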

The click / no click label and the user activeness are combined into a single reward for updating the network.
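A plausible form of this combination, with $r_{click}$ the immediate click reward, $r_{active}$ the user-activeness reward, and $\beta$ a trade-off weight (my notation; see the paper for the exact definition):

$$ r_{total} = r_{click} + \beta \, r_{active}. $$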

  • Network architecture
Exploration by Dueling Bandit Gradient Descent

First, probabilistic interleaving is used to merge the recommendation lists produced by the exploitation network Q and the exploration network Q˜: roughly, the items to push are drawn at random from the two lists, and the network is then updated towards Q˜ if the items it contributed receive better feedback. A sketch of this step follows.
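This is a compact, toy sketch of one such exploration step; the perturbation scale alpha, the step size eta, and the simplified interleaving are my assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def dbgd_step(w, candidates, clicks_fn, alpha=0.1, eta=0.5, k=5):
    """One DBGD-style exploration step (sketch).

    w          -- weights of the current (exploitation) network Q
    candidates -- candidate feature matrix, one row per news item
    clicks_fn  -- callback returning 0/1 click feedback for a pushed item
    """
    # Exploration network Q~: perturb the current weights within a small neighborhood.
    w_tilde = w + alpha * rng.uniform(-1, 1, size=w.shape) * w

    rank_q = list(np.argsort(-(candidates @ w)))
    rank_qt = list(np.argsort(-(candidates @ w_tilde)))

    # Probabilistic interleaving (simplified): fill each slot from Q or Q~ with equal probability.
    pushed, owner, i, j = [], {}, 0, 0
    while len(pushed) < k:
        if rng.random() < 0.5:
            item, src, i = rank_q[i], "Q", i + 1
        else:
            item, src, j = rank_qt[j], "Q~", j + 1
        if item not in owner:
            pushed.append(item)
            owner[item] = src

    # Compare feedback and move towards Q~ only if its items collected more clicks.
    clicks = {item: clicks_fn(item) for item in pushed}
    reward_q = sum(c for it, c in clicks.items() if owner[it] == "Q")
    reward_qt = sum(c for it, c in clicks.items() if owner[it] == "Q~")
    if reward_qt > reward_q:
        w = w + eta * (w_tilde - w)
    return w

# toy usage: 20 candidate items with 16-dim features and random click feedback
w0 = rng.normal(size=16)
cands = rng.normal(size=(20, 16))
w1 = dbgd_step(w0, cands, clicks_fn=lambda item: int(rng.random() < 0.3))
```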

Summary

Overall, the framework is quite easy to understand: most of the improvements are tailored to the news recommendation task, while the changes to the reinforcement learning framework itself are relatively small.

The construction of the two networks is also similar to the target network in DQN, and the high-dimensional input features play a role analogous to the image input in DQN.