# Chapter 05
Notes
- Q or V? Why? How to remove model?
- Advantages of MC
- Bias and bounds, and how to deal with them
- No bootstrapping
- First-visit and Every-visit
- Exploring policy: how to meet the requirements
- MCES = Monte Carlo Exploring Starts
- DP converge? Converge to optimal?
5.1
- MC is used to learn v(s).
- Both first-visit and every-visit MC converge to vπ(s). The first-visit average is an unbiased estimate whose standard deviation of error falls as 1/√n; the every-visit estimate is biased, but the bias falls to zero asymptotically.
- The estimate for each state is independent of the estimates for other states, which means:
    - No bootstrapping
    - Just experience, no Bellman equation
    - Can focus on the states of interest and ignore the others
- Advantages of MC (a minimal prediction sketch follows this list):
    - Learn from actual experience
    - Learn from simulated experience
    - Only the states of interest need to be estimated
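To make this concrete, here is a minimal sketch of tabular first-visit MC prediction. The environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) and the `policy(state)` callable are assumptions for illustration, not from the book's pseudocode.

```python
from collections import defaultdict

def first_visit_mc_prediction(env, policy, num_episodes, gamma=1.0):
    """Estimate V(s) for a given policy by averaging first-visit returns."""
    returns = defaultdict(list)   # all first-visit returns observed for each state
    V = defaultdict(float)

    for _ in range(num_episodes):
        # Generate one episode following the policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Earliest index at which each state appears in the episode.
        first = {}
        for t, (s, _) in enumerate(episode):
            first.setdefault(s, t)

        # Backward pass: G accumulates the return from each step onward.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first[s] == t:                 # only the first visit contributes
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```

Note that each V(s) depends only on returns observed from s itself, which is the no-bootstrapping point above.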
5.2
- Given a learned V(s), the greedy policy is π(s) = argmax_a ∑_{s',r} p(s',r|s,a)[r + γV(s')], which still requires the model p(s',r|s,a). So Q(s,a) is the more direct choice for policy learning (the contrast is sketched after this list).
- To learn Q(s,a), we need an estimate for every action of a state in order to choose the best one for policy improvement.
- But MC following a deterministic π observes returns for only one action per state, so we must maintain exploration.
- Exploring starts are useful, but sometimes not practical.
- Alternatives must still ensure that all state-action pairs are encountered.
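To illustrate the Q-versus-V point above: greedy improvement from V needs the transition model, while greedy improvement from Q needs only the learned action values. The `transitions[s][a]` representation (a list of `(prob, next_state, reward)` triples) is an assumed, illustrative model format.

```python
def greedy_from_v(V, transitions, s, gamma=1.0):
    """argmax_a sum_{s',r} p(s',r|s,a) * [r + gamma * V(s')] -- requires the model."""
    def backup(a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a])
    return max(transitions[s], key=backup)   # transitions[s] maps action -> outcome list

def greedy_from_q(Q, s):
    """argmax_a Q(s, a) -- model-free, uses only the learned action values."""
    return max(Q[s], key=Q[s].get)           # Q[s] maps action -> value
```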
5.3
MCES (Monte Carlo Exploring Starts) — a sketch follows this list:
- Start from a randomly chosen (s, a) pair (exploring start)
- Run an episode from that start pair to the terminal state
- Record each (s, a) and the corresponding reward at every step
- After each episode, traverse the steps in backward order: update G, update Q(s, a) (first-visit average), and update π(s) greedily with respect to Q
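A minimal sketch of Monte Carlo ES along these lines, assuming a tabular environment that exposes `states`, `actions(s)`, a hypothetical `reset_to(s)` hook to force the exploring start, and `step(a)` returning `(next_state, reward, done)`:

```python
import random
from collections import defaultdict

def mc_es(env, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (first-visit), tabular."""
    Q = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    pi = {}  # deterministic greedy policy, filled in as states are visited

    for _ in range(num_episodes):
        # Exploring start: every (s, a) pair has nonzero probability of starting an episode.
        s0 = random.choice(env.states)
        a0 = random.choice(env.actions(s0))
        state = env.reset_to(s0)              # hypothetical hook to force the start state
        action, done, episode = a0, False, []

        while not done:
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            if not done:
                action = pi.get(state, random.choice(env.actions(state)))

        # Backward pass: first-visit updates of Q, then greedy policy improvement.
        first = {}
        for t, (s, a, _) in enumerate(episode):
            first.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first[(s, a)] == t:
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]
                pi[s] = max(Q[s], key=Q[s].get)
    return Q, pi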
5.4
- To remove the exploring-starts assumption, there are two approaches: on-policy and off-policy learning.
- On-policy MC control with an ϵ-greedy policy achieves the best policy among all ϵ-soft policies (the improvement step is sketched below).
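A minimal sketch of the ϵ-greedy pieces used by on-policy MC control (tabular Q stored as a dict from actions to values; names are illustrative):

```python
import random

def epsilon_greedy_action(Q_s, epsilon):
    """Sample an action from an epsilon-greedy distribution over Q(s, .)."""
    actions = list(Q_s)
    if random.random() < epsilon:
        return random.choice(actions)          # explore: any action, uniformly
    return max(actions, key=Q_s.get)           # exploit: greedy action

def epsilon_greedy_probs(Q_s, epsilon):
    """Explicit epsilon-soft probabilities: pi(a|s) >= epsilon/|A(s)| for all a."""
    actions = list(Q_s)
    best = max(actions, key=Q_s.get)
    probs = {a: epsilon / len(actions) for a in actions}
    probs[best] += 1.0 - epsilon
    return probs
```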
5.5
- The on-policy approach is actually a compromise – it learns action values not for the optimal policy, but for a near-optimal policy that still explores.
- π is the target policy, b is the behavior policy; both are considered fixed and given
- Assumption of coverage
- Almost all off-policy methods utilize importance sampling.
- Importance sampling: a general technique for estimating expected values under one distribution given samples from another
| Visit | Property | Ordinary IS | Weighted IS | Comments |
|---|---|---|---|---|
| first-visit | bias | No: its expectation is vπ(s) | Yes: the estimate also carries the ratio factor ρ | bias of the weighted estimator converges to 0 |
| first-visit | variance | can be unbounded, since the ratios ρ are unconstrained | bounded: the largest weight on any single return is 1 | variance of the weighted estimator converges to 0, provided returns are bounded |
| every-visit | bias | Yes | Yes | bias falls to 0 in both cases |
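A minimal sketch of the two estimators given per-episode importance ratios and returns for a single state (NumPy, illustrative names):

```python
import numpy as np

def ordinary_is(rhos, returns):
    """Ordinary importance sampling: sum(rho_t * G_t) / n.  Unbiased, possibly huge variance."""
    rhos, returns = np.asarray(rhos, float), np.asarray(returns, float)
    return np.sum(rhos * returns) / len(returns)

def weighted_is(rhos, returns):
    """Weighted importance sampling: sum(rho_t * G_t) / sum(rho_t).  Biased, bounded variance."""
    rhos, returns = np.asarray(rhos, float), np.asarray(returns, float)
    denom = np.sum(rhos)
    return np.sum(rhos * returns) / denom if denom > 0 else 0.0
```

With a single return, the weighted estimate equals that return (hence the bias), while the ordinary estimate can be far off but is correct in expectation.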
5.7
The behavior policy b can be anything, but to assure convergence of π to the optimal policy, an infinite number of returns must be obtained for every state-action pair. This can be assured by choosing b to be ϵ-soft.
Exercises
5.1
The last two rows correspond to player sums of 20 and 21, which give the player a high probability of winning. For sums below 20 the policy tells the player to hit; then with a small sum the player is still likely to lose the game, and with a large sum the player is likely to go bust.
The player and the dealer have the same chance of reaching a sum of 20 or 21. In the left corner the dealer shows an Ace, which raises the dealer's chance of winning for two reasons:
- The dealer can reach 20/21 in fewer steps. Since more steps mean a larger sum and a higher chance of going bust, fewer steps mean a better chance of reaching 20/21.
- It reduces the chance that the player holds an Ace.
For similar reasons as listed above.
5.2
No. Because the player's sum can only increase within an episode, no state can be visited more than once, so first-visit and every-visit MC give identical results.
5.3
5.4
$$
\begin{aligned}
Q_n(S,A) &= \frac{1}{n}\sum_{i=1}^{n} G_i \\
         &= \frac{1}{n}\Big(\sum_{i=1}^{n-1} G_i + G_n\Big) \\
         &= \frac{1}{n}\big((n-1)\,Q_{n-1} + G_n\big) \\
         &= Q_{n-1} + \frac{1}{n}\big(G_n - Q_{n-1}\big)
\end{aligned}
$$
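A quick numeric check of the incremental form (the return values are illustrative):

```python
returns = [3.0, -1.0, 4.0, 2.0]

# Batch average vs. incremental update Q_n = Q_{n-1} + (G_n - Q_{n-1}) / n.
q = 0.0
for n, g in enumerate(returns, start=1):
    q += (g - q) / n

assert abs(q - sum(returns) / len(returns)) < 1e-12
print(q)  # 2.0
```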
5.5
With γ = 1 and a reward of +1 on every transition, an episode of length 10 visits the nonterminal state at times t = 0, 1, ..., 9, and the return from time t is Gt = 10 - t.
First-visit
Only the visit at t = 0 counts, so v(s) = G0 = 10.
Every-visit
All 10 visits count, so
v(s) = (10 + 9 + ... + 1) / 10 = 55 / 10 = 5.5
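A quick check of both estimators on this single episode (the episode structure follows the exercise; the code is illustrative):

```python
T = 10
rewards = [1.0] * T                               # +1 on every transition, gamma = 1

# Return from time t is the sum of the remaining rewards.
returns = [sum(rewards[t:]) for t in range(T)]    # [10, 9, ..., 1]

first_visit_estimate = returns[0]                 # only t = 0 is a first visit
every_visit_estimate = sum(returns) / len(returns)

print(first_visit_estimate)  # 10.0
print(every_visit_estimate)  # 5.5
```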
5.6
Given S_t and A_t, the probability of the subsequent trajectory is

$$
\Pr\{S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_t\}
= p(S_{t+1}\mid S_t, A_t)\prod_{k=t+1}^{T-1}\pi(A_k\mid S_k)\,p(S_{k+1}\mid S_k, A_k)
$$

so the importance-sampling ratio starts one step later than for state values:

$$
\rho_{t+1:T-1}
= \frac{\prod_{k=t+1}^{T-1}\pi(A_k\mid S_k)\,p(S_{k+1}\mid S_k, A_k)}
       {\prod_{k=t+1}^{T-1}b(A_k\mid S_k)\,p(S_{k+1}\mid S_k, A_k)}
= \prod_{k=t+1}^{T-1}\frac{\pi(A_k\mid S_k)}{b(A_k\mid S_k)}
$$

and the analogue of (5.6) is

$$
Q(s,a) = \frac{\sum_{t\in\mathcal{T}(s,a)}\rho_{t+1:T(t)-1}\,G_t}
              {\sum_{t\in\mathcal{T}(s,a)}\rho_{t+1:T(t)-1}}
$$
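A minimal sketch of this ratio for a single (state, action) pair; `pi(a, s)` and `b(a, s)` are assumed callables returning action probabilities:

```python
def rho_for_action_value(episode, t, pi, b):
    """Importance ratio for Q(S_t, A_t): the product starts at k = t + 1.

    `episode` is a list of (state, action) pairs; `pi(a, s)` and `b(a, s)`
    return action probabilities under the target and behavior policies.
    """
    ratio = 1.0
    for s, a in episode[t + 1:]:
        ratio *= pi(a, s) / b(a, s)
    return ratio
```

Plugging these ratios and the corresponding returns into the weighted estimator of Section 5.5 gives Q(s, a).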
5.7
The weighted estimator is biased, and its bias comes from the ratios ρ. At the very beginning, with only a few episodes, that bias is large (the estimate leans toward the behavior-policy value), so the error first increases; as more episodes accumulate the bias falls toward zero and the error decreases.
5.8
Let $g_k$ denote the importance-weighted return contributed by a visit whose remaining trajectory has length $k$: $g_k = \rho_k \cdot G_k$.
For example, an episode with $T = 2$ contains one remaining trajectory of length 1 and one of length 2.
Since the reward is $+1$ only at the end of the episode and $\gamma = 1$, an episode of length $T$ (i.e., $T$ transitions) contains $T$ visits to $s$, each with return 1. That is, for an episode of length $T$, the every-visit estimate of $V(s)$ is updated with $(g_1 + g_2 + \dots + g_T)/T$.
Let $p_k$ be the probability that an episode lasts exactly $k$ steps. Over all episodes,

$$
\text{Sum} = p_1 g_1 + p_2 (g_1 + g_2) + \dots + p_T \sum_{k=1}^{T} g_k
           = \sum_{i=1}^{T} g_i \sum_{k=i}^{T} p_k
$$

Taking the within-episode average into account:

$$
\text{Average} = \sum_{i=1}^{T} \frac{1}{i}\, g_i \sum_{k=i}^{T} p_k
$$

For the second-moment calculation on this MDP, $g_i = 2^i$ and $p_k \propto 0.9^k$ (the factors of $0.5$ from $b$ are absorbed into $g_i$), so

$$
\text{Average} = \sum_{i=1}^{T} \frac{1}{i}\, 2^i\, 0.9^i \sum_{k=0}^{T-i} 0.9^k
               > \sum_{i=1}^{T} \frac{1}{i}\, 1.8^i \cdot \text{const} = \infty
$$

Then $E(X^2) \geq \text{Average} = \infty$, so the variance of the every-visit estimator is still infinite.
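An empirical way to see this (a simulation sketch under the setup of Example 5.5, not part of the exercise): generate episodes under the equiprobable behavior policy and form the every-visit ordinary importance-sampling estimate; independent runs keep fluctuating wildly around vπ(s) = 1 because the second moment is unbounded.

```python
import random

def run_episode():
    """One episode of the one-state MDP under b (left/right with prob 0.5 each)."""
    actions = []
    while True:
        a = random.choice(("left", "right"))
        actions.append(a)
        if a == "right":
            return actions, 0.0                 # 'right' terminates with reward 0
        if random.random() < 0.1:
            return actions, 1.0                 # 'left' terminates with prob 0.1, reward +1
        # otherwise 'left' loops back to s with reward 0

def every_visit_ordinary_is(num_episodes):
    """Every-visit ordinary IS estimate of v_pi(s); pi always chooses 'left'."""
    weighted_returns, visits = 0.0, 0
    for _ in range(num_episodes):
        actions, terminal_reward = run_episode()
        T = len(actions)
        for t in range(T):                      # every visit to s
            if all(a == "left" for a in actions[t:]):
                rho = 2.0 ** (T - t)            # pi/b = 1/0.5 per remaining 'left' step
            else:
                rho = 0.0                       # pi never chooses 'right'
            weighted_returns += rho * terminal_reward
            visits += 1
    return weighted_returns / visits

random.seed(0)
print([every_visit_ordinary_is(10_000) for _ in range(5)])  # estimates fluctuate widely
```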
5.9
Modify the boxed algorithm of Section 5.6 (off-policy MC prediction) to be first-visit:

Loop forever (for each episode):
    b <- any policy with coverage of π
    Generate an episode following b: S0, A0, R1, ..., S(T-1), A(T-1), R(T)
    G <- 0
    W <- 1
    Loop for each step of episode, t = T-1, T-2, ..., 0, while W != 0:
        G <- γG + R(t+1)
        If the pair (S(t), A(t)) does not appear in S0, A0, ..., S(t-1), A(t-1):
            C(S(t), A(t)) += W
            Q(S(t), A(t)) += W / C(S(t), A(t)) * [G - Q(S(t), A(t))]
        W *= π(A(t)|S(t)) / b(A(t)|S(t))
Modify the boxed algorithm of Section 5.1 (first-visit MC prediction) to use incremental averages:

Initialize:
    V(s) arbitrary, N(s) = 0, for all s

Loop forever (for each episode):
    Generate an episode following π: S0, A0, R1, ..., S(T-1), A(T-1), R(T)
    G <- 0
    visit(s) <- -1, for all s
    g(t) <- 0, for all t
    // These record, during the backward traverse, the first-visit step of each state
    Loop for each step of episode, t = T-1, T-2, ..., 0:
        G <- γG + R(t+1)
        g(t) <- G
        visit(S(t)) <- t
    Loop for each state s:
        if visit(s) >= 0:
            N(s) += 1
            V(s) += [g(visit(s)) - V(s)] / N(s)
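The same modification as a Python sketch (the environment interface and `policy(state)` are illustrative assumptions, mirroring the backward-overwrite trick in the pseudocode above):

```python
from collections import defaultdict

def first_visit_mc_prediction_incremental(env, policy, num_episodes, gamma=1.0):
    """First-visit MC prediction with incremental sample averages (Exercise 5.9)."""
    V = defaultdict(float)
    N = defaultdict(int)          # number of first-visit returns recorded per state

    for _ in range(num_episodes):
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Backward traverse; overwriting g_of_first[s] each time leaves the
        # first-visit return for every state once the loop finishes.
        G, g_of_first = 0.0, {}
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            g_of_first[s] = G

        # Incremental update: V(s) <- V(s) + (G - V(s)) / N(s).
        for s, g in g_of_first.items():
            N[s] += 1
            V[s] += (g - V[s]) / N[s]
    return V
```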
5.10
$$
\begin{aligned}
V_n &= \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k} \\
V_{n+1} &= \frac{\sum_{k=1}^{n} W_k G_k}{\sum_{k=1}^{n} W_k}
         = \frac{\sum_{k=1}^{n-1} W_k G_k + W_n G_n}{\sum_{k=1}^{n} W_k} \\
        &= \frac{V_n \sum_{k=1}^{n-1} W_k + V_n W_n + W_n G_n - V_n W_n}{\sum_{k=1}^{n} W_k} \\
        &= \frac{V_n \sum_{k=1}^{n} W_k + W_n (G_n - V_n)}{\sum_{k=1}^{n} W_k} \\
        &= V_n + \frac{W_n}{\sum_{k=1}^{n} W_k}(G_n - V_n)
         = V_n + \frac{W_n}{C_n}(G_n - V_n)
\end{aligned}
$$
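A quick numeric check that the incremental rule reproduces the batch weighted average (the weights and returns are illustrative):

```python
weights = [1.0, 4.0, 0.5, 2.0]
returns = [3.0, -1.0, 4.0, 2.0]

# Batch weighted average.
batch = sum(w * g for w, g in zip(weights, returns)) / sum(weights)

# Incremental rule: V <- V + (W_n / C_n) * (G_n - V), with C_n the running weight sum.
V, C = 0.0, 0.0
for w, g in zip(weights, returns):
    C += w
    V += (w / C) * (g - V)

assert abs(V - batch) < 1e-12
print(V)
```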
5.11
Because π is the greedy policy, A = π(S) is deterministic, and the algorithm says:
"If At != π(St) then exit inner Loop (proceed to next episode)".
So on every step that reaches the W update, π(At|St) = 1, and the ratio π(At|St)/b(At|St) reduces to 1/b(At|St).