Skip to the content.

Chapter07

Exercises

7.1

As no update from step to step, Vt+n = Vt = V

δt = Rt+1 + γV(St+1) - V(St)

Ht = Gt:t+n = Rt+1 + γRt+2 + … + γnV(St+n)

Gt:t+n - V(St) = Rt+1 + γV(St+1) - V(St) - γV(St+1) + γRt+2 + … + γnV(St+n)

= δt + γ(Rt+2 + … + γn-1V(St+n - V(St+1))

= δt + γδt + … + γnδt+n

7.2

Generally, the difference between 2 algorithms are double-buffer and single-buffer.

As discussed in previous chapters, using of the updated Vt+k accelerated propagation of update. Which accelerated convergence and also the bias propagation. It is hard to determine which one is better, while updated case would be better as

δt = Gt:t+n - Q(St, At)

= Rt+1 + γQ(St+1, At+1) - Q(St, At) + γGt+1:t+n - γQ(St+1, At+1)

= Rt+1 + γQ(St+1, At+1) - Q(St, At) + γ(Rt+1 + γQ(St+2, At+2) - Q(St+1, At+1)) + γ2Q(St+3, At+3)

= ∑k=t:min(t+n,T)-1γk-t(Rk+1 + γQ(Sk+1, Ak+1) - Q(Sk, Ak)

7.5

if τ ≥ 0:
    G = 0
    for k = t ... t + τ - 1:
        G = ρ(k) * (R(k+1) + γG) + (1 - ρ(k)) * V(S(k))
    V(S(τ)) += α * (G - V(S(τ))