Model-Based vs Model-Free#
A score (objective) function is used to measure performance.
A model $\cal M$ is a representation of an MDP $\langle \cal S, A, P, R \rangle$:
$\cal S$: state space
$\cal A$: action space
$\cal P$: state transitions
$\cal R$: rewards
Model-Based RL:
$\langle \cal S, A, P, R \rangle$
Model-based RL involves learning a model of the environment, which represents the transition dynamics and the reward function.
Advantages:
Can be more sample-efficient than model-free RL.
Can plan ahead and avoid exploring ineffective actions.
Can provide insights into the underlying structure of the environment.
Disadvantages:
Requires an accurate model of the environment.
Can be computationally expensive to learn and simulate the model.
Can be sensitive to errors in the model.
Model-Free RL:
Model-free RL does not require a model of the environment. Instead, it learns a policy directly from interactions with the environment.
Two main approaches:
Value-based methods involve learning a value function. Once the value function is learned, the optimal policy can be derived by selecting the action that maximizes the value function in each state.
Policy-based methods learn a policy directly by optimizing the policy parameters to maximize the expected cumulative reward.
Advantages:
More flexible than model-based RL, because it does not require a model of the environment.
Can be easier to implement and less computationally expensive than model-based RL.
Can learn policies for complex environments with high-dimensional state and action spaces.
Disadvantages:
Can be more sample-intensive than model-based RL.
Can suffer from high variance in the estimates of the value function or policy gradient.
May require careful selection of the learning rate and exploration strategy.
Search methods:
Forward search involves generating a tree of possible future states and actions, and selecting the action that leads to the highest expected cumulative reward. Forward search can be computationally expensive, but can be effective in environments with a small number of states and actions.
Simulation-based search involves simulating the environment using a model and selecting the action that leads to the highest expected cumulative reward. Simulation-based search can be more computationally efficient than forward search, because it can use the model to simulate possible outcomes of actions. However, it requires an accurate model of the environment, and can be sensitive to errors in the model.
Deep learning can be used in combination with both forward search and simulation-based search to improve their effectiveness and efficiency. For example, deep neural networks can be used to approximate the value function or policy in forward search, or to learn a model of the environment in simulation-based search.
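For illustration, here is a minimal sketch of depth-limited forward search (expectimax over a known model). The state/action counts and the model tensors `T` and `R` are made-up placeholders, not anything prescribed above.

```python
import numpy as np

# Hypothetical known model of a tiny MDP (placeholder numbers, for illustration only).
n_states, n_actions, gamma = 3, 2, 0.95
T = np.random.rand(n_states, n_actions, n_states)   # T[s, a, s']: transition probabilities
T /= T.sum(axis=2, keepdims=True)                    # normalize over s'
R = np.random.rand(n_states, n_actions)              # R[s, a]: expected reward

def forward_search(s, depth):
    """Return (value, action) for state s by expanding the search tree `depth` steps ahead."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = -np.inf, None
    for a in range(n_actions):
        # Expected return of taking a in s, then acting optimally for depth - 1 more steps.
        value = R[s, a] + gamma * sum(
            T[s, a, s2] * forward_search(s2, depth - 1)[0] for s2 in range(n_states)
        )
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

value, action = forward_search(s=0, depth=3)   # cost grows with (n_actions * n_states)^depth
```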
In practice, solutions rarely rest on a single method; multiple methods are usually combined to fit the structure of the problem, and because real environments tend to be complex, it is seldom easy to single out one appropriate method.
Value Function#
Bellman Optimality Equation - (state)#
$$V^*(s) = \max_a \sum_{s^{\prime}} T(s, a, s^{\prime})\left[R(s, a, s^{\prime}) + \gamma\cdot V^*(s^\prime)\right]$$
Iteratively,
$$V_{k+1}(s) \gets \max_a \sum_{s^{\prime}} T(s, a, s^{\prime})\left[R(s, a, s^{\prime}) + \gamma\cdot V_{k}(s^\prime)\right]$$
Quality (Q)-Value Iteration algorithm - (state, action)#
$$Q_{k+1}(s, a) \gets \sum_{s^{\prime}} T(s, a, s^{\prime})\left[R(s, a, s^{\prime}) + \gamma\cdot \max_{a^\prime}Q_{k}(s^\prime, a^\prime)\right]$$
Optimal policy: $\pi^*(s) = \argmax_{a}Q^*(s, a)$
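As a concrete sketch of the Q-Value Iteration update above, the snippet below runs it on a tiny randomly generated MDP; the state/action counts, the tensors `T` and `R`, and the iteration count are all made-up placeholders.

```python
import numpy as np

# Toy MDP with made-up dynamics (placeholders, not from the text).
n_states, n_actions, gamma = 3, 2, 0.95
T = np.random.rand(n_states, n_actions, n_states)   # T[s, a, s']
T /= T.sum(axis=2, keepdims=True)
R = np.random.rand(n_states, n_actions, n_states)   # R[s, a, s']

Q = np.zeros((n_states, n_actions))
for _ in range(100):
    # Q_{k+1}(s, a) = sum_{s'} T(s, a, s') [R(s, a, s') + gamma * max_{a'} Q_k(s', a')]
    Q = np.sum(T * (R + gamma * Q.max(axis=1)), axis=2)

pi_star = Q.argmax(axis=1)   # optimal policy: pi*(s) = argmax_a Q*(s, a)
```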
Temporal Difference (TD) Learning algorithm#
Unknown transitions and rewards.
$$
\begin{aligned}
V_{k+1}(s) &\xleftarrow{}\ (1-\alpha)V_k(s)+\alpha(r + \gamma\cdot V_{k}(s^\prime))\\
V(s) &\xleftarrow[\alpha]{}\ r + \gamma\cdot V(s^\prime),\ \text{\small equivalent to the above eq.}
\end{aligned}
$$
Notation $\xleftarrow[\alpha]{}$:
$$a \xleftarrow[\alpha]{} b := a_{k+1} \gets (1 - \alpha) \cdot a_k + \alpha \cdot b_k$$
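A minimal sketch of the tabular TD(0) update above on a handful of recorded transitions; the transition tuples, state count, and step size are illustrative placeholders.

```python
import numpy as np

n_states = 5
V = np.zeros(n_states)
alpha, gamma = 0.1, 0.95

# Hypothetical recorded experience: (s, r, s') transitions observed under some policy.
transitions = [(0, 1.0, 1), (1, 0.0, 2), (2, 2.0, 3), (3, 0.0, 4)]

for s, r, s_next in transitions:
    # V(s) <-alpha- r + gamma * V(s'), i.e. nudge V(s) toward the TD target by step alpha.
    td_target = r + gamma * V[s_next]
    V[s] += alpha * (td_target - V[s])
```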
Q-Learning algorithm#
Unknown transitions and rewards.
$$Q(s, a) \xleftarrow[\alpha]{}\ r + \gamma\cdot \max_{a^\prime}Q(s^\prime, a^\prime)$$
Approximate Q-Learning: $Q_\theta(s, a)$ uses a manageable number of parameters (given by the parameter vector $\theta$) instead of covering every state explicitly, because the number of possible states can be enormous (e.g., $2^{150} \approx 10^{45}$).
A DNN used to estimate Q-Values is called a Deep Q-Network (DQN).
Using a DQN for Approximate Q-Learning is called Deep Q-Learning; it refers to the training process that updates the DQN.
Target Q-Value#
$$Q_\text{target}(s, a) = r + \gamma\cdot \max_{a^\prime}Q_\theta(s^\prime, a^\prime)$$
During training, the squared error between the estimated Q-Value and the target Q-Value is generally minimized.
For exploring states, the $\epsilon$-greedy strategy is commonly used: early in training the agent explores a variety of states at random, and later in training it focuses on states with good values. For example, at first you try many different foods to find out which ones are tasty; after that phase, you mostly eat the tasty ones and savor them more deeply. The former is called exploration, the latter exploitation.
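Putting the Q-Learning update and $\epsilon$-greedy exploration together, here is a minimal tabular sketch; the toy MDP, the decay schedule, and all constants are illustrative choices rather than anything prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP used only to *generate* experience; the Q-Learning update never reads T or R directly.
n_states, n_actions = 4, 2
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.95

s = 0
for step in range(5000):
    eps = max(0.05, 1.0 - step / 4000)           # decaying epsilon: explore early, exploit later
    if rng.random() < eps:
        a = int(rng.integers(n_actions))         # explore: random action
    else:
        a = int(Q[s].argmax())                   # exploit: greedy action
    s_next = rng.choice(n_states, p=T[s, a])     # the environment samples the next state
    r = R[s, a]
    # Q(s, a) <-alpha- r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```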
Reference
Hands On Machine Learning, 2nd ed., Aurélien Géron
Policy#
Monte-Carlo Policy Gradient (REINFORCE)#
$p(x;\theta)$: probability distribution
$f(x)$: value function; returns a scalar value along the distribution of $p$
$\nabla_\theta \log(z) = \frac{1}{z}\nabla_\theta z$
$$
\begin{aligned}
J(\theta) &= {\Bbb E}_{x\sim p(x|\theta)}[f(x)],\\
\nabla_\theta {\Bbb E}_{x\sim p(x|\theta)}[f(x)] &= \nabla_\theta \sum_x p(x)f(x)\\
&= \sum_x \nabla_\theta p(x)f(x)\\
&= \sum_x p(x)\frac{\nabla_\theta p(x)}{p(x)}f(x)\\
&= \sum_x p(x)\nabla_\theta \log p(x)f(x)\\
&= {\Bbb E}_x[f(x)\nabla_\theta \log p(x)]
\end{aligned}
$$
Why does the REINFORCE algorithm have high variance?
Because the Monte Carlo method involves sampling from the policy, the quality of the estimate depends on the number of samples used and the quality of the samples. With a small number of samples or poor-quality samples, the estimate can be noisy and have high variance, which can make it difficult to learn an optimal policy.
The Gaussian distribution is a popular choice for modeling the policy in the REINFORCE algorithm, as it can produce a smooth and continuous output that can be easily sampled to generate actions.
However, the use of a Gaussian distribution can lead to high variance in the policy gradient estimate, particularly if the standard deviation of the distribution is high.
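The variance issue can be seen numerically with the score-function (log-derivative) estimator derived above, using a Gaussian over a scalar $x$; the objective $f(x)$, the parameters, and the sample sizes below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return -(x - 2.0) ** 2       # made-up scalar objective, maximized at x = 2

mu, sigma = 0.0, 1.0             # Gaussian "policy" parameters (sigma held fixed here)

def grad_estimate(n_samples):
    x = rng.normal(mu, sigma, size=n_samples)
    score = (x - mu) / sigma**2                   # d/d_mu of log N(x; mu, sigma^2)
    return np.mean(f(x) * score)                  # estimate of E_x[f(x) * grad_mu log p(x)]

# Few samples -> noisy, widely scattered estimates (the high-variance issue);
# many samples -> estimates concentrate near the true gradient.
print([grad_estimate(10) for _ in range(3)])
print([grad_estimate(100_000) for _ in range(3)])
```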
Actor-Critic algorithms#
Critic: Updates action-value function parameters w
Introducing a critic network reduces the dependence on the policy's own returns: $Q^{\pi_\theta}(s, a) \approx Q_{\rm w}(s, a)$, i.e., the value function in PG is approximated with a DNN.
Actor: Updates policy parameters θ, in the direction suggested by the critic
$$\nabla_\theta J(\theta) \approx {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot Q_{\rm w}(s, a)]$$
https://www.davidsilver.uk/wp-content/uploads/2020/03/pg.pdf
However, if the approximated value function is simply fit (minimized) toward the (state-action) value function of the current policy, this is equivalent to the original PG and can still have high variance.
Advantage Function Critic#
```python
import numpy as np

x = np.random.rand(10, 5) * 5            # e.g. value samples for 10 states, 5 actions
baseline = x.mean(axis=1)                # per-state baseline (mean over actions)
x_with_baseline = x - baseline[:, None]  # subtract the baseline from each row
```
Applying a baseline reduces the variance.
$$\nabla_\theta J(\theta) = {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot A^{\pi_\theta}(s, a)]$$
$A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$, the advantage function
The value function is used as the baseline. Since the value function depends only on the state, it can be more stable. In addition, $Q_{\rm w}$ and $V_{\rm v}$ are each estimated with their own parameters.
Policy Gradient (PG) with TD#
$$\nabla_\theta J(\theta) = {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot \delta^{\pi_\theta}]$$
$\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s^\prime) - V^{\pi_\theta}(s)$, the TD error
$$\Delta\theta = \alpha(r + \gamma V_{\rm v}(s_{t+1}) - V_{\rm v}(s_t))\nabla_\theta \log \pi_\theta(s_t, a_t)$$
the AC PG update with the one-step TD error
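A minimal sketch of this one-step TD actor-critic update, using a tabular softmax actor and a tabular critic; the state/action counts, learning rates, and the single hypothetical transition are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
alpha_actor, alpha_critic, gamma = 0.01, 0.1, 0.95

theta = np.zeros((n_states, n_actions))   # actor: tabular softmax policy parameters
v = np.zeros(n_states)                    # critic: state-value estimates V_v(s)

def policy(s):
    logits = theta[s]
    p = np.exp(logits - logits.max())     # softmax over actions for state s
    return p / p.sum()

# One hypothetical transition (s, a, r, s') sampled by interacting with the environment.
s, r, s_next = 0, 1.0, 1
a = int(rng.choice(n_actions, p=policy(s)))

# TD error: delta = r + gamma * V(s') - V(s)
delta = r + gamma * v[s_next] - v[s]

# Critic update: move V(s) toward the one-step TD target.
v[s] += alpha_critic * delta

# Actor update: Delta(theta) = alpha * delta * grad_theta log pi_theta(s, a).
grad_log_pi = -policy(s)                  # grad of log softmax w.r.t. theta[s, :] ...
grad_log_pi[a] += 1.0                     # ... is onehot(a) - pi(.|s)
theta[s] += alpha_actor * delta * grad_log_pi
```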
Eligibility Traces#
$$\Delta\theta = \alpha({\rm v}_t^\lambda - V_{\rm v}(s_t))\nabla_\theta \log \pi_\theta(s_t, a_t)$$
The parameter $\lambda$ in $\text{TD}(\lambda)$ controls the degree of bootstrapping and the decay rate of the eligibility traces.
$$
\begin{aligned}
\delta &= r_{t+1} + \gamma V_\text{v}(s_{t+1}) - V_\text{v}(s_t)\\
e_{t+1} &= \lambda e_t + \nabla_\theta \log \pi_\theta(s_t, a_t)\\
\Delta\theta &= \alpha\delta e_t
\end{aligned}
$$
Summary#
$$
\begin{aligned}
\nabla_\theta J(\theta)
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot{\rm v}_t] &\text{\small REINFORCE}\\
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot Q^{\rm w}(s, a)] &\text{\small Q Actor-Critic (AC)}\\
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot A^{\rm w}(s, a)] &\text{\small Advantage AC}\\
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot \delta] &\text{\small TD AC}\\
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot \delta e] &{\small \text{TD}(\lambda)\ \text{AC}}
\end{aligned}
$$
Reference