Chapter07
Exercises
7.1
Since the value estimates do not change from step to step, $V_{t+n} = V_t = V$. Define the one-step TD error

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t).$$

The n-step return is

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}),$$

so the n-step error is

$$
\begin{aligned}
G_{t:t+n} - V(S_t)
&= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) + \gamma\bigl(R_{t+2} + \dots + \gamma^{n-1} V(S_{t+n}) - V(S_{t+1})\bigr)\\
&= \delta_t + \gamma\bigl(G_{t+1:t+n} - V(S_{t+1})\bigr)\\
&= \delta_t + \gamma\delta_{t+1} + \dots + \gamma^{n-1}\delta_{t+n-1}
 = \sum_{k=t}^{t+n-1} \gamma^{k-t}\,\delta_k,
\end{aligned}
$$

where the last line follows by applying the same decomposition recursively; the final residual $G_{t+n:t+n} - V(S_{t+n})$ vanishes because $G_{t+n:t+n} = V(S_{t+n})$.
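As a quick numeric sanity check of this identity (not part of the exercise answer), here is a minimal sketch: with a fixed value table `V` and an arbitrary random segment of experience, the n-step error equals the discounted sum of one-step TD errors. The state-space size, `gamma`, `n`, and the random values are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 4
V = rng.normal(size=10)                    # fixed value estimates, never updated
states = rng.integers(0, 10, size=n + 1)   # S_t, ..., S_{t+n}
rewards = rng.normal(size=n)               # R_{t+1}, ..., R_{t+n}

# n-step return: R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})
G = sum(gamma**k * rewards[k] for k in range(n)) + gamma**n * V[states[n]]

# discounted sum of TD errors delta_k = R_{k+1} + gamma V(S_{k+1}) - V(S_k)
td_sum = sum(
    gamma**k * (rewards[k] + gamma * V[states[k + 1]] - V[states[k]])
    for k in range(n)
)

assert np.isclose(G - V[states[0]], td_sum)   # the two sides agree exactly
```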
7.2
Roughly speaking, the difference between the two algorithms is a single-buffer versus double-buffer one: the ordinary n-step-error algorithm computes the whole error from one consistent set of value estimates at update time, while the sum-of-TD-errors algorithm keeps each $\delta_k$ as it was computed with the estimates current at step $k$.

As discussed in earlier chapters, using the freshly updated values speeds up the propagation of new information, which accelerates convergence but also propagates any bias in the intermediate estimates. It is hard to say definitively which is better, but the version that uses the updated values (the ordinary n-step error) seems preferable:
- In simple problems both variants converge with probability 1 regardless of which error is used, so the faster one is preferred.
- In hard problems the step size $\alpha$ is typically very small, so the updated version differs very little from the non-updated one and the extra bootstrapping effect is slight; the updated version is still preferred.

A small experiment comparing the two is sketched below.
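This is a minimal sketch assuming a 5-state random walk under the equiprobable policy; the helper name `run` and the flag `use_td_sum` are made up for the sketch, and the choices of `n`, `alpha`, episode count, and the RMS-error measure are arbitrary rather than the book's reference setup.

```python
import numpy as np

N_STATES, GAMMA = 5, 1.0                          # 5-state random walk, undiscounted
TRUE_V = np.arange(1, N_STATES + 1) / (N_STATES + 1)

def run(n, alpha, episodes, use_td_sum, seed=0):
    """n-step TD prediction on the random walk under the equiprobable policy.

    use_td_sum=False: update with the ordinary n-step error  G - V(S_tau).
    use_td_sum=True : update with the sum of one-step TD errors, each computed
                      with the value estimates current at the time of that step.
    Returns the RMS value error averaged over the episodes.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(N_STATES + 2)                    # states 0 and N_STATES+1 are terminal
    errors = []
    for _ in range(episodes):
        states, rewards, deltas = [(N_STATES + 1) // 2], [0.0], []
        T, t = np.inf, 0
        while True:
            if t < T:                             # act and record the transition
                s2 = states[t] + rng.choice([-1, 1])
                r = 1.0 if s2 == N_STATES + 1 else 0.0
                deltas.append(r + GAMMA * V[s2] - V[states[t]])   # TD error "as of now"
                states.append(s2)
                rewards.append(r)
                if s2 == 0 or s2 == N_STATES + 1:
                    T = t + 1
            tau = t - n + 1                       # time whose estimate is updated
            if tau >= 0:
                h = min(tau + n, T)
                if use_td_sum:
                    err = sum(GAMMA ** (k - tau) * deltas[k] for k in range(tau, h))
                else:
                    G = sum(GAMMA ** (i - tau - 1) * rewards[i] for i in range(tau + 1, h + 1))
                    if tau + n < T:
                        G += GAMMA ** n * V[states[tau + n]]
                    err = G - V[states[tau]]
                V[states[tau]] += alpha * err
            if tau == T - 1:
                break
            t += 1
        errors.append(np.sqrt(np.mean((V[1:N_STATES + 1] - TRUE_V) ** 2)))
    return float(np.mean(errors))

for flag, name in [(False, "n-step error"), (True, "sum of TD errors")]:
    avg = np.mean([run(n=4, alpha=0.4, episodes=10, use_td_sum=flag, seed=s)
                   for s in range(100)])
    print(f"{name:>17}: average RMS error over first 10 episodes = {avg:.3f}")
```

Averaging over many seeds is what makes the comparison meaningful, since on any single run the two variants differ only through the few value updates that happen inside an episode.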
7.3
- With a smaller walk, episodes are short, so even a modest n makes the method behave much like a Monte Carlo method; it also skews the sampling, with fewer n-step samples coming from long episodes and more from short ones.
- For these reasons, the best value of n would be smaller on a smaller problem.
- If the left-side outcome were 0 (the same as the initial value estimates), the task would in effect be an n/2 problem, since there is little to learn on the left half of the walk; making the left-side outcome -1 means the agent also has to learn accurate values near the left side, which makes it a full n problem.

A rough empirical probe of these claims follows.
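The sketch below sweeps n and alpha for a 5-state walk with left outcome 0 and a 19-state walk with left outcome -1, and reports the n with the lowest average RMS value error over the first 10 episodes. All settings (the n and alpha grids, episode and seed counts, the helper name `rms_error`) are arbitrary choices for illustration, not a reproduction of the book's figure.

```python
import numpy as np

def rms_error(n_states, left_reward, n, alpha, episodes=10, seeds=30):
    """Average RMS value error of n-step TD on a random walk of the given size,
    with the given outcome on the left terminal (right terminal always +1)."""
    p_right = np.arange(1, n_states + 1) / (n_states + 1)   # P(terminate on the right)
    true_v = p_right + (1 - p_right) * left_reward
    total = 0.0
    for seed in range(seeds):
        rng = np.random.default_rng(seed)
        V = np.zeros(n_states + 2)                           # first/last entries are terminal
        for _ in range(episodes):
            states, rewards = [(n_states + 1) // 2], [0.0]
            T, t = np.inf, 0
            while True:
                if t < T:
                    s2 = states[t] + rng.choice([-1, 1])
                    r = 1.0 if s2 == n_states + 1 else (left_reward if s2 == 0 else 0.0)
                    states.append(s2)
                    rewards.append(r)
                    if s2 == 0 or s2 == n_states + 1:
                        T = t + 1
                tau = t - n + 1
                if tau >= 0:
                    h = min(tau + n, T)
                    G = sum(rewards[i] for i in range(tau + 1, h + 1))   # undiscounted
                    if tau + n < T:
                        G += V[states[tau + n]]
                    V[states[tau]] += alpha * (G - V[states[tau]])
                if tau == T - 1:
                    break
                t += 1
            total += np.sqrt(np.mean((V[1:-1] - true_v) ** 2))
    return total / (seeds * episodes)

for n_states, left_reward in [(5, 0.0), (19, -1.0)]:
    best = min(((rms_error(n_states, left_reward, n, a), n, a)
                for n in (1, 2, 4, 8, 16) for a in (0.2, 0.4, 0.8)),
               key=lambda x: x[0])
    print(f"{n_states}-state walk (left outcome {left_reward}): best n = {best[1]}, alpha = {best[2]}")
```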
7.4
Assume, as in 7.1, that the action-value estimates do not change during the n steps, so $Q_{t+n-1} = \dots = Q_{t-1} = Q$, and define

$$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t).$$

The n-step Sarsa return (7.4) is

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n}),$$

so

$$
\begin{aligned}
G_{t:t+n} - Q(S_t, A_t)
&= R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) + \gamma\bigl(G_{t+1:t+n} - Q(S_{t+1}, A_{t+1})\bigr)\\
&= \delta_t + \gamma\bigl(G_{t+1:t+n} - Q(S_{t+1}, A_{t+1})\bigr)\\
&= \sum_{k=t}^{\min(t+n,\,T)-1} \gamma^{k-t}\,\delta_k,
\end{aligned}
$$

by applying the same decomposition recursively; the recursion ends at $G_{t+n:t+n} = Q(S_{t+n}, A_{t+n})$, or earlier at the terminal time $T$, where all remaining terms are zero. The same telescoping goes through when $Q$ does change between steps: with the novel TD error $\delta_k = R_{k+1} + \gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k, A_k)$ and the bootstrap term $\gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})$ in (7.4), one obtains exactly $G_{t:t+n} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{\min(t+n,\,T)-1} \gamma^{k-t}\,\delta_k$, which is the form stated in the exercise.
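As a numeric check of that exact time-indexed form (not part of the derivation above), the sketch below fills a random trajectory segment and a sequence of random Q tables, letting the values change from step to step, and verifies the identity. Everything here is a random placeholder; no environment or learning is involved.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, n, nS, nA = 0.9, 5, 6, 3
S = rng.integers(0, nS, size=n + 1)            # S_t, ..., S_{t+n}
A = rng.integers(0, nA, size=n + 1)            # A_t, ..., A_{t+n}
R = rng.normal(size=n)                         # R_{t+1}, ..., R_{t+n}
Q = rng.normal(size=(n + 1, nS, nA))           # Q_{t-1}, Q_t, ..., Q_{t+n-1}

# n-step Sarsa return (7.4): bootstraps from Q_{t+n-1}(S_{t+n}, A_{t+n})
G = sum(gamma**k * R[k] for k in range(n)) + gamma**n * Q[n, S[n], A[n]]

# discounted sum of the novel TD errors, each mixing two successive Q tables
td_sum = sum(
    gamma**k * (R[k] + gamma * Q[k + 1, S[k + 1], A[k + 1]] - Q[k, S[k], A[k]])
    for k in range(n)
)

assert np.isclose(G, Q[0, S[0], A[0]] + td_sum)   # identity holds exactly
```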
7.5
Only the computation of the return changes relative to the n-step TD pseudocode; with τ = t - n + 1:

if τ ≥ 0:
    G = V(S(min(τ + n, T)))          (taken as 0 if that state is terminal)
    for k = min(τ + n, T) - 1 down to τ:
        G = ρ(k) * (R(k+1) + γ * G) + (1 - ρ(k)) * V(S(k))
    V(S(τ)) += α * (G - V(S(τ)))
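Below is a minimal runnable sketch of the full algorithm that the pseudocode above summarizes: off-policy n-step TD prediction with the control-variate return (7.13), on a small two-action chain. The environment, the target and behavior policies, and all hyperparameters are illustrative choices rather than anything from the book; the learned values are compared against v_pi obtained by solving the Bellman equations directly.

```python
import numpy as np

N = 7                                   # states 0..6; 0 and 6 are terminal
GAMMA = 1.0
TARGET_P_RIGHT = 0.7                    # pi(right | s), the policy being evaluated
BEHAV_P_RIGHT = 0.5                     # b(right | s), the data-generating policy

def step(s, a):
    """Deterministic chain: action is +1/-1, reward 1 only on entering state 6."""
    s2 = s + a
    return s2, (1.0 if s2 == N - 1 else 0.0), (s2 == 0 or s2 == N - 1)

def true_values():
    """Solve v_pi exactly (linear system over the nonterminal states) for comparison."""
    A_mat = np.eye(N - 2)
    b_vec = np.zeros(N - 2)
    for i, s in enumerate(range(1, N - 1)):
        for p, a in ((TARGET_P_RIGHT, 1), (1 - TARGET_P_RIGHT, -1)):
            s2, r, done = step(s, a)
            b_vec[i] += p * r
            if not done:
                A_mat[i, s2 - 1] -= GAMMA * p
    return np.linalg.solve(A_mat, b_vec)

def off_policy_nstep_td(n=3, alpha=0.02, episodes=30000, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(N)                                      # V(terminal) stays 0
    for _ in range(episodes):
        states, rewards, rhos = [N // 2], [0.0], []      # start in the middle
        T, t = np.inf, 0
        while True:
            if t < T:
                a = 1 if rng.random() < BEHAV_P_RIGHT else -1
                pi = TARGET_P_RIGHT if a == 1 else 1 - TARGET_P_RIGHT
                b = BEHAV_P_RIGHT if a == 1 else 1 - BEHAV_P_RIGHT
                rhos.append(pi / b)                      # importance ratio rho_t
                s2, r, done = step(states[t], a)
                states.append(s2)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1
            if tau >= 0:
                h = min(tau + n, T)
                G = V[states[h]]                         # G_{h:h} = V(S_h); 0 if terminal
                for k in range(h - 1, tau - 1, -1):      # control-variate recursion (7.13)
                    G = rhos[k] * (rewards[k + 1] + GAMMA * G) + (1 - rhos[k]) * V[states[k]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V[1:N - 1]

print("learned v:", np.round(off_policy_nstep_td(), 3))   # should roughly match
print("true v:   ", np.round(true_values(), 3))
```

Because ρ(k) multiplies only the reward-plus-bootstrap part while (1 - ρ(k)) falls back on the current estimate V(S(k)), the update has lower variance than the plain importance-sampling version while keeping the same expected target.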