RL - Overview

Reinforcement Learning

Model-Based vs Model-Free#

A score function is used to measure performance.

A model \cal M is a representation of an \text{MDP} \lang {\cal S, A, P, R} \rang

\cal S: state space
\cal A: action space
\cal P: state transitions
\cal R: rewards

In practice, a solution rarely depends on a single specific method; multiple methods are combined, and because the environment is usually complex, it is not easy to pick one appropriate method.


Value Function#

Bellman Optimality Equation - (state)#

V^*(s) = \max_a \sum_{s^{\prime}} T(s, a, s^{\prime})[R(s, a, s^{\prime}){\small +}\gamma\cdot V^*(s^\prime)]
  • Iteratively, V_{k+1}(s) \gets \max_a \sum_{s^{\prime}} T(s, a, s^{\prime})[R(s, a, s^{\prime}){\small +}\gamma\cdot V_{k}(s^\prime)]
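A minimal Value Iteration sketch of this update in NumPy; the toy transition tensor T, reward tensor R, and discount gamma below are made-up illustration values, not from the text:

import numpy as np

# Toy MDP with 3 states and 2 actions (illustrative numbers only).
# T[s, a, s'] = transition probability, R[s, a, s'] = reward.
T = np.random.dirichlet(np.ones(3), size=(3, 2))  # each T[s, a, :] sums to 1
R = np.random.rand(3, 2, 3)
gamma = 0.95

V = np.zeros(3)
for _ in range(1000):
    # V_{k+1}(s) <- max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma * V_k(s')]
    V = np.max(np.sum(T * (R + gamma * V), axis=2), axis=1)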

Quality(Q)-Value Iteration algorithm - (state, action)#

Q_{k+1}(s, a) \gets \sum_{s^{\prime}} T(s, a, s^{\prime})[R(s, a, s^{\prime}){\small +}\gamma\cdot \max_{a^\prime}Q_{k}(s^\prime, a^\prime)]

Optimal policy: \pi^*(s) = {\argmax}_{a}Q^*(s, a)
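The same toy MDP works for Q-Value Iteration, and the optimal policy is then read off with argmax; again a sketch with made-up T and R:

import numpy as np

T = np.random.dirichlet(np.ones(3), size=(3, 2))  # T[s, a, s']
R = np.random.rand(3, 2, 3)                       # R[s, a, s']
gamma = 0.95

Q = np.zeros((3, 2))
for _ in range(1000):
    # Q_{k+1}(s,a) <- sum_{s'} T(s,a,s') [R(s,a,s') + gamma * max_{a'} Q_k(s',a')]
    Q = np.sum(T * (R + gamma * Q.max(axis=1)), axis=2)

pi_star = Q.argmax(axis=1)  # optimal policy: argmax_a Q*(s, a)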


Temporal Difference (TD) Learning algorithm#

Unknown transitions and rewards

\begin{aligned}
V_{k+1}(s) &\xleftarrow{}\ (1-\alpha)V_k(s)+\alpha(r + \gamma\cdot V_{k}(s^\prime))\\
V(s) &\xleftarrow[\alpha]{}\ r + \gamma\cdot V(s^\prime),\ \text{\small equivalent to the above eq.}
\end{aligned}
  • Notation \xleftarrow[\alpha]{}:
    • a \xleftarrow[\alpha]{} b := a_{k+1} \larr (1 - \alpha) \cdot a_k + \alpha \cdot b_k
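A minimal TD(0) sketch of the update above for a single observed transition (s, r, s'); the values are illustrative:

alpha, gamma = 0.1, 0.95
V = {0: 0.0, 1: 0.0, 2: 0.0}      # tabular value estimates
s, r, s_next = 0, 1.0, 2          # one sampled transition
# V(s) <-(alpha)- r + gamma * V(s'): running average with step size alpha
V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])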

Q-Learning algorithm#

Unknown transitions and rewards

Q(s, a) \xleftarrow[\alpha]{}\ r + \gamma\cdot \max_{a^\prime}Q(s^\prime, a^\prime)
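A tabular Q-Learning sketch of the same running-average update for one sampled transition (s, a, r, s'); the values are illustrative:

import numpy as np

alpha, gamma = 0.1, 0.95
Q = np.zeros((3, 2))              # Q[s, a]
s, a, r, s_next = 0, 1, 1.0, 2    # one sampled transition
# Q(s,a) <-(alpha)- r + gamma * max_{a'} Q(s', a')
Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())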

Approximate Q-Learning Q_\theta(s, a) uses a manageable number of parameters (given by the parameter vector \theta) instead of covering every state explicitly, which is infeasible when the number of possible states is huge (e.g. 2^{150} \approx 10^{45}).

A DNN used to estimate Q-Values is called a Deep Q-Network (DQN).

Using a DQN for Approximate Q-Learning is called Deep Q-Learning; it refers to the whole training process that updates the DQN.


Target Q-Value#

Q_\text{target}(s, a) = r + \gamma\cdot \max_{a^\prime}Q_\theta(s^\prime, a^\prime)

During training, the squared error between the estimated Q-value and the target Q-value is generally minimized.
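A sketch of the target and the squared-error loss for a small batch, using plain NumPy arrays as stand-ins for the network outputs (batch size, action count, and values are made up):

import numpy as np

gamma = 0.99
q_values = np.random.rand(4, 2)        # Q_theta(s, a) for a batch of 4 states
next_q_values = np.random.rand(4, 2)   # Q_theta(s', a') for the next states
actions = np.array([0, 1, 1, 0])       # actions actually taken
rewards = np.random.rand(4)

q_target = rewards + gamma * next_q_values.max(axis=1)  # r + gamma * max_a' Q(s', a')
q_pred = q_values[np.arange(4), actions]                # Q(s, a) for the taken actions
loss = np.mean((q_pred - q_target) ** 2)                # squared error to minimize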

For state exploration, the \epsilon-greedy strategy is commonly used: early in training the agent explores diverse states at random, and later in training it concentrates on states that have shown good value.
For example, early on you try many different foods to find out which ones taste good; once that phase is over, you mostly eat the tasty ones and savor them more deeply. The former is called exploration and the latter exploitation.
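A minimal \epsilon-greedy sketch; the decay schedule below is an assumption for illustration, not from the text:

import numpy as np

def epsilon_greedy(q_row, epsilon):
    """Explore with probability epsilon, otherwise exploit the best action."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_row))  # explore: random action
    return int(np.argmax(q_row))              # exploit: greedy action

for episode in range(1000):
    epsilon = max(0.05, 1.0 - episode / 500)  # explore early, exploit later
    action = epsilon_greedy(np.array([0.1, 0.5, 0.3]), epsilon)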

Reference

  • Hands-On Machine Learning, 2nd ed., Aurélien Géron

Policy#

Monte-Carlo Policy Gradient (REINFORCE)#

p(x;\theta): probability distribution

f(x): value function; returns a scalar value for x drawn from the distribution p

\nabla_\theta \log(z) = \frac{1}{z}\nabla_\theta z

\begin{aligned}
J(\theta) &= {\Bbb E}_{x\sim p(x|\theta)}[f(x)],\\
\nabla_\theta {\Bbb E}_{x\sim p(x|\theta)}[f(x)] &= \nabla_\theta \sum_x p(x)f(x)\\
&= \sum_x \nabla_\theta p(x)f(x)\\
&= \sum_x p(x)\frac{\nabla_\theta p(x)}{p(x)}f(x)\\
&= \sum_x p(x)\nabla_\theta \log p(x)f(x)\\
&= {\Bbb E}_x[f(x)\nabla_\theta \log p(x)]
\end{aligned}
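A single-sample sketch of this score-function estimator for a categorical p(x|\theta) parameterized by logits \theta; \nabla_\theta \log p(x) is the softmax gradient (one-hot minus probabilities), and f is an arbitrary made-up score, not from the text:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(3)                  # logits of p(x | theta)
f = np.array([1.0, 0.0, 2.0])        # f(x): a scalar score per outcome (illustrative)

p = softmax(theta)
x = np.random.choice(3, p=p)         # sample x ~ p(x | theta)

grad_log_p = np.eye(3)[x] - p        # grad_theta log p(x) for a softmax
grad_J = f[x] * grad_log_p           # single-sample estimate of grad_theta E[f(x)]
theta += 0.1 * grad_J                # gradient ascent on J(theta)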

Actor-Critic algorithms#

  • Critic: Updates action-value function parameters w

    Introducing a critic network reduces the dependence on the policy alone: Q^{\pi_\theta}(s, a) \approx Q_{\rm w}(s, a), i.e., the value function of PG is approximated with a DNN.

  • Actor: Updates policy parameters θ, in direction suggested by critic

\nabla_\theta J(\theta) \approx {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot Q_{\rm w}(s, a)]

https://www.davidsilver.uk/wp-content/uploads/2020/03/pg.pdf

However, if the approximated value function is simply minimized toward the original policy's (state-action) value function, the result is the same as the original PG and can still have high variance.

Advantage Function Critic#

import numpy as np

x = np.random.rand(10, 5) * 5             # e.g. returns for 10 states x 5 samples
baseline = x.mean(axis=1)                 # baseline: mean over the samples of each state
x_with_baseline = x - baseline[:, None]   # subtract the baseline from each sample

Applying a baseline can reduce the variance.

\nabla_\theta J(\theta) = {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot A^{\pi_\theta}(s, a)]

A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s), the advantage function

The value function is used as the baseline. Since it depends only on the state, it can be estimated more stably. In addition, Q_{\rm w} and V_{\rm v} are each estimated with their own parameters.
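A sketch of the advantage computed from two separate parameter sets, w for Q and v for V, here as linear approximators over a made-up state feature vector:

import numpy as np

n_features, n_actions = 4, 2
w = np.random.rand(n_features, n_actions)  # parameters of Q_w(s, a)
v = np.random.rand(n_features)             # parameters of V_v(s)

phi = np.random.rand(n_features)           # feature vector of state s (illustrative)
a = 1

Q_sa = phi @ w[:, a]                       # Q_w(s, a)
V_s = phi @ v                              # V_v(s), the baseline
A_sa = Q_sa - V_s                          # A(s, a) = Q(s, a) - V(s)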

Policy Gradient (PG) with TD#

\nabla_\theta J(\theta) = {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot \delta^{\pi_\theta}]

\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s^\prime) - V^{\pi_\theta}(s), the TD error

\Delta\theta = \alpha(r + \gamma V_{\rm v}(s_{t+1}) - V_{\rm v}(s_t))\nabla_\theta \log \pi_\theta(s_t, a_t), AC PG with the one-step TD error
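A one-step actor-critic sketch of this update; the linear critic, softmax policy, and random features are all illustrative assumptions:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_features, n_actions = 4, 2
theta = np.zeros((n_features, n_actions))  # policy parameters
v = np.zeros(n_features)                   # critic parameters
alpha, gamma = 0.01, 0.99

phi_s = np.random.rand(n_features)         # features of s_t (illustrative)
phi_s_next = np.random.rand(n_features)    # features of s_{t+1}
r = 1.0

pi = softmax(phi_s @ theta)                # pi_theta(. | s_t)
a = np.random.choice(n_actions, p=pi)

delta = r + gamma * (phi_s_next @ v) - (phi_s @ v)        # one-step TD error
grad_log_pi = np.outer(phi_s, np.eye(n_actions)[a] - pi)  # grad_theta log pi(a | s)
theta += alpha * delta * grad_log_pi       # actor: Delta theta = alpha * delta * grad log pi
v += alpha * delta * phi_s                 # critic: semi-gradient TD update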

Eligibility Traces#

\Delta\theta = \alpha({\rm v}_t^\lambda - V_{\rm v}(s_t))\nabla_\theta \log \pi_\theta(s_t, a_t)

The parameter λ\lambda in TD(λ)\text{TD}(\lambda) controls the degree of bootstrapping and the decay rate of the eligibility traces.

\begin{aligned}
\delta &= r_{t+1} + \gamma V_\text{v}(s_{t+1}) - V_\text{v}(s_t)\\
e_{t+1} &= \lambda e_t + \nabla_\theta \log \pi_\theta(s_t, a_t)\\
\Delta\theta &= \alpha\delta e_t
\end{aligned}
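A sketch of this trace accumulation over a few steps; the environment is faked with random features and rewards, and the linear critic and softmax policy are assumptions:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_features, n_actions = 4, 2
theta = np.zeros((n_features, n_actions))
v = np.zeros(n_features)
e = np.zeros_like(theta)                   # eligibility trace for theta
alpha, gamma, lam = 0.01, 0.99, 0.9

phi = np.random.rand(n_features)
for t in range(10):                        # a faked 10-step episode
    pi = softmax(phi @ theta)
    a = np.random.choice(n_actions, p=pi)
    r, phi_next = np.random.rand(), np.random.rand(n_features)

    delta = r + gamma * (phi_next @ v) - (phi @ v)           # TD error
    e = lam * e + np.outer(phi, np.eye(n_actions)[a] - pi)   # e <- lambda*e + grad log pi
    theta += alpha * delta * e                               # Delta theta = alpha * delta * e
    v += alpha * delta * phi                                 # critic update
    phi = phi_next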

Summary#

\begin{aligned}
\nabla_\theta J(\theta)
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot{\rm v}_t] &\text{\small REINFORCE}\\
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot Q^{\rm w}(s, a)] &\text{\small Q Actor-Critic (AC)}\\
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot A^{\rm w}(s, a)] &\text{\small Advantage AC}\\
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot \delta] &\text{\small TD AC}\\
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot \delta e] &\text{\small TD}(\lambda)\ \text{\small AC}\\
\end{aligned}

Reference

  • Policy Gradient lecture slides, David Silver: https://www.davidsilver.uk/wp-content/uploads/2020/03/pg.pdf