High-level idea
- Given a state s and an action a, the reward r tells the agent how good or bad that action is.
- Based on the rewards it receives, the agent learns to take good actions more often and gradually filters out bad ones.
Categorization of methods
- Value-based method: evaluates how good an action is in a given state using the Q-value function Q(s, a).
- becomes inefficient or impractical when the number of states or actions is large or infinite (a small tabular sketch follows after this list).
- Policy-gradient method: derives actions directly by learning a policy π(a|s), a probability distribution over all possible actions given the state.
- suffers from high variance, i.e., large fluctuations in its gradient estimates.
- Actor-critic method (combination of value-based and policy-gradient methods): the actor learns a policy by receiving feedback from the critic; a minimal loss-computation sketch follows after this list.
- critic-value loss function: L1 = ∑(R − V(s))², where the discounted future reward R = r + γV(s′)
- γ ∈ [0, 1] is the discount factor that controls how strongly future rewards are weighted.
- V(s) represents the expected (scalar) reward of a given state
- actor-policy loss function: L2 = −log(π(a|s)) · A(s) − θH(π), where the estimated advantage function A(s) = R − V(s)
- H(π) is the entropy term, weighted by the hyperparameter θ, which encourages exploration.
- A(s) measures how much better the observed return R is than the critic's estimate V(s), i.e., how advantageous the taken action turned out to be in that state.
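
To make the value-based idea concrete, here is a minimal tabular Q-learning sketch in Python. The toy "walk right" chain environment and all hyperparameters are illustrative assumptions rather than part of the notes above; the explicit Q-table is exactly what becomes impractical when the state or action space is large or infinite.

```python
# Minimal tabular Q-learning sketch (illustrative; the toy chain environment
# and the hyperparameters below are assumptions, not from the notes above).
import random
from collections import defaultdict

N_STATES, N_ACTIONS = 5, 2            # tiny toy problem
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1

def step(s, a):
    """Action 1 moves right, action 0 moves left; reward 1 for reaching the last state."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

# One Q-value per (state, action) pair -- this explicit table is what becomes
# impractical when the number of states or actions is large or infinite.
Q = defaultdict(lambda: [0.0] * N_ACTIONS)

for episode in range(500):
    s, done = 0, False
    while not done:
        if random.random() < EPSILON:                  # explore
            a = random.randrange(N_ACTIONS)
        else:                                          # exploit, with random tie-breaking
            best = max(Q[s])
            a = random.choice([i for i in range(N_ACTIONS) if Q[s][i] == best])
        s_next, r, done = step(s, a)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        target = r + GAMMA * max(Q[s_next]) * (not done)
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s_next

print({s: [round(q, 2) for q in Q[s]] for s in range(N_STATES)})
```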

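Below is also a minimal sketch of how the two actor-critic losses above could be computed for a single transition (s, a, r, s′); the actor loss is the policy-gradient part. PyTorch, the network sizes, the learning rate, and the entropy weight θ = 0.01 are assumptions made for illustration; in practice the losses are summed over batches of transitions, as the ∑ in L1 suggests.

```python
# Actor-critic loss computation for one transition (s, a, r, s').
# PyTorch is assumed here; network sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2
GAMMA, THETA = 0.99, 0.01            # discount factor and entropy weight

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def update(s, a, r, s_next, done):
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # Discounted future reward R = r + gamma * V(s'), bootstrapped from the critic.
    with torch.no_grad():
        R = r + GAMMA * critic(s_next).squeeze() * (1.0 - float(done))

    V_s = critic(s).squeeze()
    advantage = R - V_s                          # A(s) = R - V(s)

    dist = torch.distributions.Categorical(logits=actor(s))
    # Actor-policy loss: L2 = -log(pi(a|s)) * A(s) - theta * H(pi).
    # A(s) is detached so the policy gradient does not flow into the critic.
    actor_loss = -dist.log_prob(torch.as_tensor(a)) * advantage.detach() - THETA * dist.entropy()
    # Critic-value loss: L1 = (R - V(s))^2 for this single transition.
    critic_loss = advantage.pow(2)

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()

# Example call with a dummy transition:
update(s=[0.1, -0.2, 0.0, 0.3], a=1, r=0.5, s_next=[0.0, 0.1, 0.1, 0.2], done=False)
```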