Background Knowledge of Reinforcement Learning

High-level idea

[Figure: reinforcement learning diagram (agent-environment interaction loop)]

  • Given a state s, the reward r (produced by the reward function) tells the agent how good or bad an action a is.
  • Based on the received rewards, the agent learns to take good actions more often and gradually filters out bad ones (see the interaction-loop sketch below).
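
As a concrete illustration of this loop, here is a minimal sketch, assuming the gymnasium package and its CartPole-v1 environment; a random policy stands in for the learned agent:

```python
# Minimal agent-environment interaction loop: the agent observes state s,
# takes action a, and receives reward r from the environment.
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # a random policy stands in for the agent
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # reward r signals how good/bad the action was
    done = terminated or truncated

print(f"episode return: {total_reward}")
env.close()
```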

Categorization of methods

  • Value-based method: evaluate the goodness of an action given a state using the Q-value function.
    • becomes inefficient or impractical when the number of states or actions is large or infinite.
  • Policy-gradient method: derive actions directly by learning a policy π(s,a) that is a probability distribution over all possible actions.
    • suffers from large fluctuations (high variance) in its gradient estimates.
  • Actor-critic method (combination of value-based and policy-gradient methods): the actor attempts to learn a policy by receiving feedback from the critic (see the loss sketch below).
    • critic-value loss function: L1 = (R - V(s))², where the discounted future reward R = r + γV(s')
      • γ ∈ [0, 1] is the discount factor that manages the importance levels of future rewards.
      • V(s) represents the expected (scalar) reward of a given state.
    • actor-policy loss function: L2 = -log(π(a|s))·A(s) - θ·H(π), where the estimated advantage function A(s) = R - V(s)
      • H(π) is the entropy term, weighted by the hyperparameter θ, which encourages exploration.
      • A(s) measures how much better the received return R is than the value V(s) expected in that state.
[Figure cited from "Deep Reinforcement Learning for Cyber Security"]
