Model-Based vs Model-Free#
A score (objective) function is used to measure performance.
A model $\cal M$ is a representation of an MDP $\langle \cal S, A, P, R \rangle$:
$\cal S$: state space
$\cal A$: action space
$\cal P$: state transitions
$\cal R$: rewards
Model-Based RL:
$\langle \cal S, A, P, R \rangle$
Model-based RL involves learning a model of the environment, which represents the transition dynamics and the reward function.
Advantages:
Can be more sample-efficient than model-free RL.
Can plan ahead and avoid exploring ineffective actions.
Can provide insights into the underlying structure of the environment.
Disadvantages:
Requires an accurate model of the environment.
Can be computationally expensive to learn and simulate the model.
Can be sensitive to errors in the model.
Model-Free RL:
Model-free RL does not require a model of the environment. Instead, it learns a policy directly from interactions with the environment.
Two main approaches:
Value-based methods involve learning a value function. Once the value function is learned, the optimal policy can be derived by selecting the action that maximizes the value function in each state.
Policy-based methods learn a policy directly by optimizing the policy parameters to maximize the expected cumulative reward.
Advantages:
More flexible than model-based RL, because it does not require a model of the environment.
Can be easier to implement and less computationally expensive than model-based RL.
Can learn policies for complex environments with high-dimensional state and action spaces.
Disadvantages:
Can be more sample-intensive than model-based RL.
Can suffer from high variance in the estimates of the value function or policy gradient.
May require careful selection of the learning rate and exploration strategy.
Search methods:
Forward search involves generating a tree of possible future states and actions, and selecting the action that leads to the highest expected cumulative reward. Forward search can be computationally expensive, but can be effective in environments with a small number of states and actions.
Simulation-based search involves simulating the environment using a model and selecting the action that leads to the highest expected cumulative reward. Simulation-based search can be more computationally efficient than forward search, because it can use the model to simulate possible outcomes of actions. However, it requires an accurate model of the environment, and can be sensitive to errors in the model.
Deep learning can be used in combination with both forward search and simulation-based search to improve their effectiveness and efficiency. For example, deep neural networks can be used to approximate the value function or policy in forward search, or to learn a model of the environment in simulation-based search.
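For illustration, here is a minimal sketch of depth-limited forward search (expectimax over a known model). The state/action counts and the model tensors `T` and `R` are made-up placeholders, not anything prescribed above.

```python
import numpy as np

# Hypothetical known model of a tiny MDP (placeholder numbers, for illustration only).
n_states, n_actions, gamma = 3, 2, 0.95
T = np.random.rand(n_states, n_actions, n_states)   # T[s, a, s']: transition probabilities
T /= T.sum(axis=2, keepdims=True)                    # normalize over s'
R = np.random.rand(n_states, n_actions)              # R[s, a]: expected reward

def forward_search(s, depth):
    """Return (value, action) for state s by expanding the search tree `depth` steps ahead."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = -np.inf, None
    for a in range(n_actions):
        # Expected return of taking a in s, then acting optimally for depth - 1 more steps.
        value = R[s, a] + gamma * sum(
            T[s, a, s2] * forward_search(s2, depth - 1)[0] for s2 in range(n_states)
        )
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

value, action = forward_search(s=0, depth=3)   # cost grows with (n_actions * n_states)^depth
```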
In practice, solutions rarely rest on a single method; multiple methods are usually combined to fit the structure of the problem, and because real environments tend to be complex, it is seldom easy to single out one appropriate method.
Value Function#
Bellman Optimality Equation - (state)#
$$V^*(s) = \max_a \sum_{s^{\prime}} T(s, a, s^{\prime})\left[R(s, a, s^{\prime}) + \gamma\cdot V^*(s^\prime)\right]$$
Iteratively,
$$V_{k+1}(s) \gets \max_a \sum_{s^{\prime}} T(s, a, s^{\prime})\left[R(s, a, s^{\prime}) + \gamma\cdot V_{k}(s^\prime)\right]$$
Quality (Q)-Value Iteration algorithm - (state, action)#
$$Q_{k+1}(s, a) \gets \sum_{s^{\prime}} T(s, a, s^{\prime})\left[R(s, a, s^{\prime}) + \gamma\cdot \max_{a^\prime}Q_{k}(s^\prime, a^\prime)\right]$$
Optimal policy: $\pi^*(s) = \argmax_{a}Q^*(s, a)$
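As a concrete sketch of the Q-Value Iteration update above, the snippet below runs it on a tiny randomly generated MDP; the state/action counts, the tensors `T` and `R`, and the iteration count are all made-up placeholders.

```python
import numpy as np

# Toy MDP with made-up dynamics (placeholders, not from the text).
n_states, n_actions, gamma = 3, 2, 0.95
T = np.random.rand(n_states, n_actions, n_states)   # T[s, a, s']
T /= T.sum(axis=2, keepdims=True)
R = np.random.rand(n_states, n_actions, n_states)   # R[s, a, s']

Q = np.zeros((n_states, n_actions))
for _ in range(100):
    # Q_{k+1}(s, a) = sum_{s'} T(s, a, s') [R(s, a, s') + gamma * max_{a'} Q_k(s', a')]
    Q = np.sum(T * (R + gamma * Q.max(axis=1)), axis=2)

pi_star = Q.argmax(axis=1)   # optimal policy: pi*(s) = argmax_a Q*(s, a)
```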
Temporal Difference (TD) Learning algorithm#
Unknown transitions and rewards.
$$
\begin{aligned}
V_{k+1}(s) &\xleftarrow{}\ (1-\alpha)V_k(s)+\alpha(r + \gamma\cdot V_{k}(s^\prime))\\
V(s) &\xleftarrow[\alpha]{}\ r + \gamma\cdot V(s^\prime),\ \text{\small equivalent to the above eq.}
\end{aligned}
$$
Notation $\xleftarrow[\alpha]{}$:
$$a \xleftarrow[\alpha]{} b := a_{k+1} \gets (1 - \alpha) \cdot a_k + \alpha \cdot b_k$$
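A minimal sketch of the tabular TD(0) update above on a handful of recorded transitions; the transition tuples, state count, and step size are illustrative placeholders.

```python
import numpy as np

n_states = 5
V = np.zeros(n_states)
alpha, gamma = 0.1, 0.95

# Hypothetical recorded experience: (s, r, s') transitions observed under some policy.
transitions = [(0, 1.0, 1), (1, 0.0, 2), (2, 2.0, 3), (3, 0.0, 4)]

for s, r, s_next in transitions:
    # V(s) <-alpha- r + gamma * V(s'), i.e. nudge V(s) toward the TD target by step alpha.
    td_target = r + gamma * V[s_next]
    V[s] += alpha * (td_target - V[s])
```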
Q-Learning algorithm#
Unknown transitions and rewards.
$$Q(s, a) \xleftarrow[\alpha]{}\ r + \gamma\cdot \max_{a^\prime}Q(s^\prime, a^\prime)$$
Approximate Q-Learning: $Q_\theta(s, a)$ uses a manageable number of parameters (given by the parameter vector $\theta$) instead of covering every state explicitly, because the number of possible states can be enormous (e.g., $2^{150} \approx 10^{45}$).
A DNN used to estimate Q-Values is called a Deep Q-Network (DQN).
Using a DQN for Approximate Q-Learning is called Deep Q-Learning; it refers to the training process that updates the DQN.
Target Q-Value#
$$Q_\text{target}(s, a) = r + \gamma\cdot \max_{a^\prime}Q_\theta(s^\prime, a^\prime)$$
During training, the squared error between the estimated Q-Value and the target Q-Value is generally minimized.
For exploring states, the $\epsilon$-greedy strategy is commonly used: early in training the agent explores a variety of states at random, and later in training it focuses on states with good values. For example, at first you try many different foods to find out which ones are tasty; after that phase, you mostly eat the tasty ones and savor them more deeply. The former is called exploration, the latter exploitation.
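Putting the Q-Learning update and $\epsilon$-greedy exploration together, here is a minimal tabular sketch; the toy MDP, the decay schedule, and all constants are illustrative choices rather than anything prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP used only to *generate* experience; the Q-Learning update never reads T or R directly.
n_states, n_actions = 4, 2
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.95

s = 0
for step in range(5000):
    eps = max(0.05, 1.0 - step / 4000)           # decaying epsilon: explore early, exploit later
    if rng.random() < eps:
        a = int(rng.integers(n_actions))         # explore: random action
    else:
        a = int(Q[s].argmax())                   # exploit: greedy action
    s_next = rng.choice(n_states, p=T[s, a])     # the environment samples the next state
    r = R[s, a]
    # Q(s, a) <-alpha- r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```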
Reference
Hands On Machine Learning, 2nd ed., Aurélien Géron
Policy#
Monte-Carlo Policy Gradient (REINFORCE)#
$p(x;\theta)$: probability distribution
$f(x)$: value function; returns a scalar value along the distribution of $p$
$\nabla_\theta \log(z) = \frac{1}{z}\nabla_\theta z$
$$
\begin{aligned}
J(\theta) &= {\Bbb E}_{x\sim p(x|\theta)}[f(x)],\\
\nabla_\theta {\Bbb E}_{x\sim p(x|\theta)}[f(x)] &= \nabla_\theta \sum_x p(x)f(x)\\
&= \sum_x \nabla_\theta p(x)f(x)\\
&= \sum_x p(x)\frac{\nabla_\theta p(x)}{p(x)}f(x)\\
&= \sum_x p(x)\nabla_\theta \log p(x)f(x)\\
&= {\Bbb E}_x[f(x)\nabla_\theta \log p(x)]
\end{aligned}
$$
Why does the REINFORCE algorithm have high variance?
Because the Monte Carlo method involves sampling from the policy, the quality of the estimate depends on the number of samples used and the quality of the samples. With a small number of samples or poor-quality samples, the estimate can be noisy and have high variance, which can make it difficult to learn an optimal policy.
The Gaussian distribution is a popular choice for modeling the policy in the REINFORCE algorithm, as it can produce a smooth and continuous output that can be easily sampled to generate actions.
However, the use of a Gaussian distribution can lead to high variance in the policy gradient estimate, particularly if the standard deviation of the distribution is high.
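The variance issue can be seen numerically with the score-function (log-derivative) estimator derived above, using a Gaussian over a scalar $x$; the objective $f(x)$, the parameters, and the sample sizes below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return -(x - 2.0) ** 2       # made-up scalar objective, maximized at x = 2

mu, sigma = 0.0, 1.0             # Gaussian "policy" parameters (sigma held fixed here)

def grad_estimate(n_samples):
    x = rng.normal(mu, sigma, size=n_samples)
    score = (x - mu) / sigma**2                   # d/d_mu of log N(x; mu, sigma^2)
    return np.mean(f(x) * score)                  # estimate of E_x[f(x) * grad_mu log p(x)]

# Few samples -> noisy, widely scattered estimates (the high-variance issue);
# many samples -> estimates concentrate near the true gradient.
print([grad_estimate(10) for _ in range(3)])
print([grad_estimate(100_000) for _ in range(3)])
```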
Actor-Critic algorithms#
Critic: Updates action-value function parameters w
Introducing a critic network reduces the dependence on the policy's own returns: $Q^{\pi_\theta}(s, a) \approx Q_{\rm w}(s, a)$, i.e., the value function in PG is approximated with a DNN.
Actor: Updates policy parameters θ, in the direction suggested by the critic
$$\nabla_\theta J(\theta) \approx {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot Q_{\rm w}(s, a)]$$
https://www.davidsilver.uk/wp-content/uploads/2020/03/pg.pdf
However, if the approximated value function is simply fit (minimized) toward the (state-action) value function of the current policy, this is equivalent to the original PG and can still have high variance.
Advantage Function Critic#
```python
import numpy as np

x = np.random.rand(10, 5) * 5            # e.g. value samples for 10 states, 5 actions
baseline = x.mean(axis=1)                # per-state baseline (mean over actions)
x_with_baseline = x - baseline[:, None]  # subtract the baseline from each row
```
Applying a baseline reduces the variance.
$$\nabla_\theta J(\theta) = {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot A^{\pi_\theta}(s, a)]$$
$A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$, the advantage function
The value function is used as the baseline. Since the value function depends only on the state, it can be more stable. In addition, $Q_{\rm w}$ and $V_{\rm v}$ are each estimated with their own parameters.
Policy Gradient (PG) with TD#
$$\nabla_\theta J(\theta) = {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot \delta^{\pi_\theta}]$$
$\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s^\prime) - V^{\pi_\theta}(s)$, the TD error
$$\Delta\theta = \alpha(r + \gamma V_{\rm v}(s_{t+1}) - V_{\rm v}(s_t))\nabla_\theta \log \pi_\theta(s_t, a_t)$$
the AC PG update with the one-step TD error
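A minimal sketch of this one-step TD actor-critic update, using a tabular softmax actor and a tabular critic; the state/action counts, learning rates, and the single hypothetical transition are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
alpha_actor, alpha_critic, gamma = 0.01, 0.1, 0.95

theta = np.zeros((n_states, n_actions))   # actor: tabular softmax policy parameters
v = np.zeros(n_states)                    # critic: state-value estimates V_v(s)

def policy(s):
    logits = theta[s]
    p = np.exp(logits - logits.max())     # softmax over actions for state s
    return p / p.sum()

# One hypothetical transition (s, a, r, s') sampled by interacting with the environment.
s, r, s_next = 0, 1.0, 1
a = int(rng.choice(n_actions, p=policy(s)))

# TD error: delta = r + gamma * V(s') - V(s)
delta = r + gamma * v[s_next] - v[s]

# Critic update: move V(s) toward the one-step TD target.
v[s] += alpha_critic * delta

# Actor update: Delta(theta) = alpha * delta * grad_theta log pi_theta(s, a).
grad_log_pi = -policy(s)                  # grad of log softmax w.r.t. theta[s, :] ...
grad_log_pi[a] += 1.0                     # ... is onehot(a) - pi(.|s)
theta[s] += alpha_actor * delta * grad_log_pi
```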
Eligibility Traces#
$$\Delta\theta = \alpha({\rm v}_t^\lambda - V_{\rm v}(s_t))\nabla_\theta \log \pi_\theta(s_t, a_t)$$
The parameter $\lambda$ in $\text{TD}(\lambda)$ controls the degree of bootstrapping and the decay rate of the eligibility traces.
$$
\begin{aligned}
\delta &= r_{t+1} + \gamma V_\text{v}(s_{t+1}) - V_\text{v}(s_t)\\
e_{t+1} &= \lambda e_t + \nabla_\theta \log \pi_\theta(s_t, a_t)\\
\Delta\theta &= \alpha\delta e_t
\end{aligned}
$$
Summary#
$$
\begin{aligned}
\nabla_\theta J(\theta)
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot{\rm v}_t] &\text{\small REINFORCE}\\
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot Q^{\rm w}(s, a)] &\text{\small Q Actor-Critic (AC)}\\
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot A^{\rm w}(s, a)] &\text{\small Advantage AC}\\
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot \delta] &\text{\small TD AC}\\
&= {\Bbb E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\cdot \delta e] &{\small \text{TD}(\lambda)\ \text{AC}}
\end{aligned}
$$
Reference