# Chapter05

## Notes

### 5.1

| Visit | Property | Ordinary importance sampling | Weighted importance sampling | Comments |
|---|---|---|---|---|
| first-visit | bias | No: its expectation is E[ρG] = v_π(s) | Yes: the normalizing sum of ratios ρ introduces bias | bias of the weighted estimator converges to 0 |
| first-visit | variance | Unbounded: there is no constraint on b, so the ratios ρ can be arbitrarily large | Bounded: each normalized weight ρ/Σρ ≤ 1 | variance of the weighted estimator converges to 0, provided returns are bounded |
| every-visit | bias | Yes | Yes | bias falls to 0 in both cases |
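A minimal simulation sketch of this bias/variance contrast (the contrived one-step task and all names below are my own, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Contrived one-step task: behavior policy b picks between two actions
# uniformly; the target policy pi always picks action 0. Action 0 pays
# N(1, 1), action 1 pays N(0, 1), so the true v_pi is 1.
n = 10_000
actions = rng.integers(0, 2, size=n)                 # sampled from b
returns = rng.normal(loc=(actions == 0) * 1.0, scale=1.0)
rho = np.where(actions == 0, 1.0 / 0.5, 0.0)         # pi(a)/b(a)

ordinary = np.sum(rho * returns) / n                 # unbiased, larger variance
weighted = np.sum(rho * returns) / np.sum(rho)       # biased, normalized weights <= 1

print(f"ordinary: {ordinary:.3f}  weighted: {weighted:.3f}  (true v_pi = 1)")
```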

### 5.7

The behavior policy b can be anything, but to assure convergence of π to the optimal policy, an infinite number of returns must be obtained for each state-action pair. This can be assured by choosing b to be ε-soft.
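A small sketch of ε-soft (here, ε-greedy) action selection, with names of my own choosing; every action keeps probability at least ε/|A(s)|, which is what guarantees continued exploration:

```python
import numpy as np

def epsilon_soft_action(q_values, epsilon, rng):
    """Pick an action so every action keeps probability >= epsilon / |A|.

    With probability 1 - epsilon exploit the greedy action; otherwise
    choose uniformly at random. Repeating this forever visits every
    state-action pair infinitely often.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
print(epsilon_soft_action(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng))
```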

## Exercises

### 5.1

The last two rows correspond to player sums of 20 and 21, where the player has a high probability of winning, so the value jumps up. For sums below 20 the policy tells the player to hit; then a small sum leaves the player likely to lose the game, while a large sum leaves the player likely to go bust.

The player and the dealer have the same probability of reaching sum = {20, 21}. In the last row on the left the dealer shows an ace, which gives the dealer a high probability of winning, mainly because the ace is flexible: it can count as either 1 or 11.

The frontmost values are higher in the usable-ace plots for similar reasons as listed above: the ace's flexibility now works in the player's favor.

### 5.2

No, the results would be essentially the same. Because the same state can never be revisited within one episode (the player's sum effectively only increases), every first visit is also the only visit, so first-visit and every-visit MC coincide.

### 5.3

The backup diagram for q_π is the same as the one for v_π, except that it is rooted at a state-action pair (s, a): a single line of sampled transitions from that pair down to the terminal state.

### 5.4

$$
\begin{aligned}
Q_n(s,a) &= \frac{1}{n} \sum_{i=1}^{n} G_i \\
&= \frac{\sum_{i=1}^{n-1} G_i + G_n}{n} \\
&= \frac{(n-1)\,Q_{n-1} + G_n}{n} \\
&= Q_{n-1} + \frac{G_n - Q_{n-1}}{n}
\end{aligned}
$$
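A minimal sketch of this incremental rule in Python (the dict-based bookkeeping and names are mine, not the book's pseudocode):

```python
from collections import defaultdict

Q = defaultdict(float)   # running average of returns per (state, action)
N = defaultdict(int)     # number of returns seen per (state, action)

def update(state, action, G):
    """Incremental sample average: Q_n = Q_{n-1} + (G_n - Q_{n-1}) / n."""
    key = (state, action)
    N[key] += 1
    Q[key] += (G - Q[key]) / N[key]

for G in (4.0, 6.0, 8.0):            # three returns for the same pair
    update("s0", "a0", G)
print(Q[("s0", "a0")])               # 6.0, the plain average
```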

### 5.5

A single-state machine: one nonterminal state that loops back to itself, reward +1 on every transition, γ = 1, and the observed episode lasts T = 10 steps.

The return from time t counts the remaining rewards:

$$G_t = T - t = 10 - t, \quad t = 0, 1, \dots, 9$$

First-visit

The state is first visited at t = 0, so V(s) = G_0 = 10.

Every-visit

The state is visited 10 times, at t = {0, 1, …, 9}:

$$V(s) = \frac{1}{10} \sum_{t=0}^{9} G_t = \frac{10 + 9 + \dots + 1}{10} = 5.5$$
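A quick check of both estimators in Python (the episode encoding is my own):

```python
# One observed episode: the single state is occupied at t = 0..9,
# reward +1 per transition, gamma = 1, so G_t = 10 - t.
T = 10
returns = [T - t for t in range(T)]    # [10, 9, ..., 1]

first_visit = returns[0]               # only the first occurrence counts
every_visit = sum(returns) / len(returns)
print(first_visit, every_visit)        # 10 and 5.5
```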

### 5.6

Given S_t and A_t, the probability of the subsequent trajectory is:

$$\Pr\{S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_t\} = p(S_{t+1} \mid S_t, A_t) \prod_{k=t+1}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)$$

Because the first action is already given, the importance-sampling ratio starts at k = t + 1, and the transition probabilities cancel:

$$\rho_{t+1:T-1} = \frac{\prod_{k=t+1}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t+1}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t+1}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

The weighted estimator analogous to (5.6) is then:

$$Q(s,a) = \frac{\sum_{t \in \mathcal{T}(s,a)} \rho_{t+1:T-1}\, G_t}{\sum_{t \in \mathcal{T}(s,a)} \rho_{t+1:T-1}}$$
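A sketch of this weighted estimator in Python (the episode format and function names are my own assumptions); note the ratio product skips the first step, since A_t is given:

```python
def weighted_is_q(episodes, pi, b, gamma=1.0):
    """Weighted importance-sampling estimate of Q(s0, a0).

    `episodes` is a list of episodes, each a list of (state, action,
    reward) triples beginning with the (s, a) pair of interest; pi(a, s)
    and b(a, s) return action probabilities under the two policies.
    """
    num = den = 0.0
    for steps in episodes:
        G, rho, discount = 0.0, 1.0, 1.0
        for t, (s, a, r) in enumerate(steps):
            G += discount * r
            discount *= gamma
            if t > 0:                # ratio starts at t+1: A_t is given
                rho *= pi(a, s) / b(a, s)
        num += rho * G
        den += rho
    return num / den if den else 0.0
```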

### 5.7

The weighted-average estimator is explicitly biased, and the bias enters through ρ. At the very beginning the bias is large: with a single return the estimate equals that return regardless of ρ, so its expectation is v_b rather than v_π. The error therefore first rises as this bias asserts itself, then falls as the bias converges to 0 with more returns.

### 5.8

Let g_k denote the weighted return for a trajectory of length k from the start state: g_k = ρ_k · G_k. For example, an episode with T = 2 contains one trajectory of length 1 and one of length 2.

Since the reward is 1 at the end of the episode and γ = 1, an episode of (T + 1) steps includes T visits to s, each with return 1. That is, for each such episode, every-visit MC updates V(s) by (g_1 + g_2 + … + g_T) / T.

Let p_k be the probability that an episode lasts k steps. Summing over all episodes:

$$\text{Sum} = p_1 g_1 + p_2 (g_1 + g_2) + \dots + p_T \sum_{k=1}^{T} g_k = \sum_{k=1}^{T} p_k g_1 + \sum_{k=2}^{T} p_k g_2 + \dots + p_T g_T = \sum_{i=1}^{T} g_i \sum_{k=i}^{T} p_k$$

Taking the average over the number of visits:

$$\text{Average} = \sum_{i=1}^{T} g_i \cdot \frac{1}{i} \sum_{k=i}^{T} p_k$$

In Example 5.5, g_i = 2^i (each π-consistent step contributes an importance ratio of 2) and p_k ∝ 0.9^k, so:

$$\text{Average} = \sum_{i=1}^{T} \frac{1}{i}\, 2^i\, 0.9^i \sum_{k=0}^{T-i} 0.9^k > \sum_{i=1}^{T} \frac{1}{i}\, 1.8^i \cdot \text{Constant} \to \infty$$

Then E(X²) ≥ (Average)² = ∞, so the variance would still be infinite for every-visit MC.
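The divergence can also be checked numerically. Below is a rough simulation sketch (names are mine) of the one-state MDP of Example 5.5 under the uniform behavior policy: the sample variance of the ordinary importance-sampling target ρG keeps drifting upward as more episodes arrive instead of settling, which is the signature of infinite variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_return():
    """One episode of Example 5.5 under b; returns the ordinary-IS target rho*G.

    b picks left/right uniformly; pi always picks left, so a 'right'
    action zeroes the ratio. 'left' loops back with prob 0.9 (reward 0)
    or terminates with prob 0.1 (reward +1); gamma = 1, so G = 1 on
    pi-consistent episodes.
    """
    rho = 1.0
    while True:
        if rng.random() < 0.5:       # b chose 'right': pi(right|s) = 0
            return 0.0
        rho *= 1.0 / 0.5             # pi(left|s) / b(left|s) = 2
        if rng.random() < 0.1:       # episode terminates with reward +1
            return rho

samples = np.array([weighted_return() for _ in range(200_000)])
for n in (1_000, 10_000, 100_000, 200_000):
    print(n, round(samples[:n].var(), 2))  # sample variance keeps growing
```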

### 5.9

Modified from the off-policy MC prediction algorithm of Section 5.6:

Loop forever (for each episode):
  b <- any policy with coverage of π
  Generate an episode following b: S(0), A(0), R(1), ..., S(T-1), A(T-1), R(T)
  G <- 0
  W <- 1
  Loop for each step of episode, t = T-1, T-2, ..., 0, while W != 0:
    G <- γG + R(t+1)
    Unless (S(t), A(t)) appears in S(0), A(0), ..., S(t-1), A(t-1):  // first-visit check
      C(S(t), A(t)) += W
      Q(S(t), A(t)) += W / C(S(t), A(t)) * [G - Q(S(t), A(t))]
    W *= π(A(t)|S(t)) / b(A(t)|S(t))

Modified from the first-visit MC prediction algorithm of Section 5.1, using incremental averages:

Initialize:
  V(s) arbitrary and N(s) = 0, for all s
Loop forever (for each episode):
  Generate an episode following π: S(0), A(0), R(1), ..., S(T-1), A(T-1), R(T)
  G <- 0
  visit(s) <- -1, for all s
  g(t) <- 0, for all t
  // The backward pass stores the return at every step; the last write to
  // visit(s) happens at the earliest time step, i.e., the first visit
  Loop for each step of episode, t = T-1, T-2, ..., 0:
    G <- γG + R(t+1)
    g(t) <- G
    visit(S(t)) <- t
  Loop for each state s:
    if (visit(s) >= 0):
      N(s) += 1
      V(s) += [g(visit(s)) - V(s)] / N(s)
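A runnable Python sketch of the algorithm above (the episode encoding, a list of (S_t, R_{t+1}) pairs, is my own assumption):

```python
from collections import defaultdict

V = defaultdict(float)
N = defaultdict(int)

def first_visit_mc_update(episode, gamma=1.0):
    """One incremental first-visit update; episode = [(S_t, R_{t+1}), ...]."""
    G = 0.0
    first_return = {}                 # state -> return from its first visit
    for s, r in reversed(episode):
        G = gamma * G + r
        first_return[s] = G           # backward pass: last write is first visit
    for s, g in first_return.items():
        N[s] += 1
        V[s] += (g - V[s]) / N[s]     # incremental sample average

first_visit_mc_update([("a", 0.0), ("b", 0.0), ("a", 1.0)])
print(dict(V))                        # V['a'] = 1.0 (return from t = 0), V['b'] = 1.0
```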

### 5.10

$$
\begin{aligned}
V_n &= \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k} \\[4pt]
V_{n+1} &= \frac{\sum_{k=1}^{n} W_k G_k}{\sum_{k=1}^{n} W_k} \\
&= \frac{\sum_{k=1}^{n-1} W_k G_k + W_n G_n}{\sum_{k=1}^{n} W_k} \\
&= \frac{V_n \sum_{k=1}^{n-1} W_k + V_n W_n + W_n G_n - V_n W_n}{\sum_{k=1}^{n} W_k} \\
&= \frac{V_n \sum_{k=1}^{n} W_k + W_n (G_n - V_n)}{\sum_{k=1}^{n} W_k} \\
&= V_n + \frac{W_n (G_n - V_n)}{\sum_{k=1}^{n} W_k} \\
&= V_n + \frac{W_n}{C_n} (G_n - V_n), \quad \text{where } C_n = \sum_{k=1}^{n} W_k
\end{aligned}
$$
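A quick numeric check in Python that the incremental rule reproduces the batch ratio (the weights and returns below are arbitrary test values):

```python
Ws = [0.5, 2.0, 1.0, 4.0]
Gs = [1.0, 3.0, 2.0, 5.0]

V, C = 0.0, 0.0
for W, G in zip(Ws, Gs):
    C += W                      # C_n = C_{n-1} + W_n
    V += (W / C) * (G - V)      # V_{n+1} = V_n + (W_n / C_n)(G_n - V_n)

batch = sum(w * g for w, g in zip(Ws, Gs)) / sum(Ws)
print(V, batch)                 # identical up to float rounding
```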

### 5.11

Because π is the greedy policy, A = π(S) is deterministic, and the algorithm says:

“If A_t ≠ π(S_t) then exit inner Loop (proceed to next episode)”.

So whenever the update is reached, π(A_t | S_t) = 1, and the numerator of the importance-sampling ratio can be replaced by 1.