Chapter07
Exercises
7.1
Since the value estimates do not change from step to step, $V_{t+n} = V_t = V$. Define the one-step TD error

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t).$$

The n-step return is

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}),$$

so the n-step error is

$$
\begin{aligned}
G_{t:t+n} - V(S_t)
&= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) + \gamma\bigl(R_{t+2} + \dots + \gamma^{n-1} V(S_{t+n}) - V(S_{t+1})\bigr)\\
&= \delta_t + \gamma\bigl(G_{t+1:t+n} - V(S_{t+1})\bigr)\\
&= \delta_t + \gamma\delta_{t+1} + \dots + \gamma^{n-1}\delta_{t+n-1}
 = \sum_{k=t}^{t+n-1} \gamma^{k-t}\,\delta_k,
\end{aligned}
$$

where the last line follows by applying the same decomposition recursively; the final residual $G_{t+n:t+n} - V(S_{t+n})$ vanishes because $G_{t+n:t+n} = V(S_{t+n})$.
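As a quick numeric sanity check of this identity (not part of the exercise answer), here is a minimal sketch: with a fixed value table `V` and an arbitrary random segment of experience, the n-step error equals the discounted sum of one-step TD errors. The state-space size, `gamma`, `n`, and the random values are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 4
V = rng.normal(size=10)                    # fixed value estimates, never updated
states = rng.integers(0, 10, size=n + 1)   # S_t, ..., S_{t+n}
rewards = rng.normal(size=n)               # R_{t+1}, ..., R_{t+n}

# n-step return: R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})
G = sum(gamma**k * rewards[k] for k in range(n)) + gamma**n * V[states[n]]

# discounted sum of TD errors delta_k = R_{k+1} + gamma V(S_{k+1}) - V(S_k)
td_sum = sum(
    gamma**k * (rewards[k] + gamma * V[states[k + 1]] - V[states[k]])
    for k in range(n)
)

assert np.isclose(G - V[states[0]], td_sum)   # the two sides agree exactly
```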
7.2
Roughly speaking, the difference between the two algorithms is a single-buffer versus double-buffer one: the ordinary n-step-error algorithm computes the whole error from one consistent set of value estimates at update time, while the sum-of-TD-errors algorithm keeps each $\delta_k$ as it was computed with the estimates current at step $k$.

As discussed in earlier chapters, using the freshly updated values speeds up the propagation of new information, which accelerates convergence but also propagates any bias in the intermediate estimates. It is hard to say definitively which is better, but the version that uses the updated values (the ordinary n-step error) seems preferable:
- In simple problems both variants converge with probability 1 regardless of which error is used, so the faster one is preferred.
- In hard problems the step size $\alpha$ is typically very small, so the updated version differs very little from the non-updated one and the extra bootstrapping effect is slight; the updated version is still preferred.

A small experiment comparing the two is sketched below.
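This is a minimal sketch assuming a 5-state random walk under the equiprobable policy; the helper name `run` and the flag `use_td_sum` are made up for the sketch, and the choices of `n`, `alpha`, episode count, and the RMS-error measure are arbitrary rather than the book's reference setup.

```python
import numpy as np

N_STATES, GAMMA = 5, 1.0                          # 5-state random walk, undiscounted
TRUE_V = np.arange(1, N_STATES + 1) / (N_STATES + 1)

def run(n, alpha, episodes, use_td_sum, seed=0):
    """n-step TD prediction on the random walk under the equiprobable policy.

    use_td_sum=False: update with the ordinary n-step error  G - V(S_tau).
    use_td_sum=True : update with the sum of one-step TD errors, each computed
                      with the value estimates current at the time of that step.
    Returns the RMS value error averaged over the episodes.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(N_STATES + 2)                    # states 0 and N_STATES+1 are terminal
    errors = []
    for _ in range(episodes):
        states, rewards, deltas = [(N_STATES + 1) // 2], [0.0], []
        T, t = np.inf, 0
        while True:
            if t < T:                             # act and record the transition
                s2 = states[t] + rng.choice([-1, 1])
                r = 1.0 if s2 == N_STATES + 1 else 0.0
                deltas.append(r + GAMMA * V[s2] - V[states[t]])   # TD error "as of now"
                states.append(s2)
                rewards.append(r)
                if s2 == 0 or s2 == N_STATES + 1:
                    T = t + 1
            tau = t - n + 1                       # time whose estimate is updated
            if tau >= 0:
                h = min(tau + n, T)
                if use_td_sum:
                    err = sum(GAMMA ** (k - tau) * deltas[k] for k in range(tau, h))
                else:
                    G = sum(GAMMA ** (i - tau - 1) * rewards[i] for i in range(tau + 1, h + 1))
                    if tau + n < T:
                        G += GAMMA ** n * V[states[tau + n]]
                    err = G - V[states[tau]]
                V[states[tau]] += alpha * err
            if tau == T - 1:
                break
            t += 1
        errors.append(np.sqrt(np.mean((V[1:N_STATES + 1] - TRUE_V) ** 2)))
    return float(np.mean(errors))

for flag, name in [(False, "n-step error"), (True, "sum of TD errors")]:
    avg = np.mean([run(n=4, alpha=0.4, episodes=10, use_td_sum=flag, seed=s)
                   for s in range(100)])
    print(f"{name:>17}: average RMS error over first 10 episodes = {avg:.3f}")
```

Averaging over many seeds is what makes the comparison meaningful, since on any single run the two variants differ only through the few value updates that happen inside an episode.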
7.3
- With a smaller walk, episodes are short, so even a modest n makes the method behave much like a Monte Carlo method; it also skews the sampling, with fewer n-step samples coming from long episodes and more from short ones.
- For these reasons, the best value of n would be smaller on a smaller problem.
- If the left-side outcome were 0 (the same as the initial value estimates), the task would in effect be an n/2 problem, since there is little to learn on the left half of the walk; making the left-side outcome -1 means the agent also has to learn accurate values near the left side, which makes it a full n problem.

A rough empirical probe of these claims follows.
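The sketch below sweeps n and alpha for a 5-state walk with left outcome 0 and a 19-state walk with left outcome -1, and reports the n with the lowest average RMS value error over the first 10 episodes. All settings (the n and alpha grids, episode and seed counts, the helper name `rms_error`) are arbitrary choices for illustration, not a reproduction of the book's figure.

```python
import numpy as np

def rms_error(n_states, left_reward, n, alpha, episodes=10, seeds=30):
    """Average RMS value error of n-step TD on a random walk of the given size,
    with the given outcome on the left terminal (right terminal always +1)."""
    p_right = np.arange(1, n_states + 1) / (n_states + 1)   # P(terminate on the right)
    true_v = p_right + (1 - p_right) * left_reward
    total = 0.0
    for seed in range(seeds):
        rng = np.random.default_rng(seed)
        V = np.zeros(n_states + 2)                           # first/last entries are terminal
        for _ in range(episodes):
            states, rewards = [(n_states + 1) // 2], [0.0]
            T, t = np.inf, 0
            while True:
                if t < T:
                    s2 = states[t] + rng.choice([-1, 1])
                    r = 1.0 if s2 == n_states + 1 else (left_reward if s2 == 0 else 0.0)
                    states.append(s2)
                    rewards.append(r)
                    if s2 == 0 or s2 == n_states + 1:
                        T = t + 1
                tau = t - n + 1
                if tau >= 0:
                    h = min(tau + n, T)
                    G = sum(rewards[i] for i in range(tau + 1, h + 1))   # undiscounted
                    if tau + n < T:
                        G += V[states[tau + n]]
                    V[states[tau]] += alpha * (G - V[states[tau]])
                if tau == T - 1:
                    break
                t += 1
            total += np.sqrt(np.mean((V[1:-1] - true_v) ** 2))
    return total / (seeds * episodes)

for n_states, left_reward in [(5, 0.0), (19, -1.0)]:
    best = min(((rms_error(n_states, left_reward, n, a), n, a)
                for n in (1, 2, 4, 8, 16) for a in (0.2, 0.4, 0.8)),
               key=lambda x: x[0])
    print(f"{n_states}-state walk (left outcome {left_reward}): best n = {best[1]}, alpha = {best[2]}")
```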
7.4
Assume, as in 7.1, that the action-value estimates do not change during the n steps, so $Q_{t+n-1} = \dots = Q_{t-1} = Q$, and define

$$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t).$$

The n-step Sarsa return (7.4) is

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n}),$$

so

$$
\begin{aligned}
G_{t:t+n} - Q(S_t, A_t)
&= R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) + \gamma\bigl(G_{t+1:t+n} - Q(S_{t+1}, A_{t+1})\bigr)\\
&= \delta_t + \gamma\bigl(G_{t+1:t+n} - Q(S_{t+1}, A_{t+1})\bigr)\\
&= \sum_{k=t}^{\min(t+n,\,T)-1} \gamma^{k-t}\,\delta_k,
\end{aligned}
$$

by applying the same decomposition recursively; the recursion ends at $G_{t+n:t+n} = Q(S_{t+n}, A_{t+n})$, or earlier at the terminal time $T$, where all remaining terms are zero. The same telescoping goes through when $Q$ does change between steps: with the novel TD error $\delta_k = R_{k+1} + \gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k, A_k)$ and the bootstrap term $\gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})$ in (7.4), one obtains exactly $G_{t:t+n} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{\min(t+n,\,T)-1} \gamma^{k-t}\,\delta_k$, which is the form stated in the exercise.
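As a numeric check of that exact time-indexed form (not part of the derivation above), the sketch below fills a random trajectory segment and a sequence of random Q tables, letting the values change from step to step, and verifies the identity. Everything here is a random placeholder; no environment or learning is involved.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, n, nS, nA = 0.9, 5, 6, 3
S = rng.integers(0, nS, size=n + 1)            # S_t, ..., S_{t+n}
A = rng.integers(0, nA, size=n + 1)            # A_t, ..., A_{t+n}
R = rng.normal(size=n)                         # R_{t+1}, ..., R_{t+n}
Q = rng.normal(size=(n + 1, nS, nA))           # Q_{t-1}, Q_t, ..., Q_{t+n-1}

# n-step Sarsa return (7.4): bootstraps from Q_{t+n-1}(S_{t+n}, A_{t+n})
G = sum(gamma**k * R[k] for k in range(n)) + gamma**n * Q[n, S[n], A[n]]

# discounted sum of the novel TD errors, each mixing two successive Q tables
td_sum = sum(
    gamma**k * (R[k] + gamma * Q[k + 1, S[k + 1], A[k + 1]] - Q[k, S[k], A[k]])
    for k in range(n)
)

assert np.isclose(G, Q[0, S[0], A[0]] + td_sum)   # identity holds exactly
```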
7.5
Only the computation of the return changes relative to the n-step TD pseudocode; with τ = t - n + 1:

if τ ≥ 0:
    G = V(S(min(τ + n, T)))          (taken as 0 if that state is terminal)
    for k = min(τ + n, T) - 1 down to τ:
        G = ρ(k) * (R(k+1) + γ * G) + (1 - ρ(k)) * V(S(k))
    V(S(τ)) += α * (G - V(S(τ)))
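Below is a minimal runnable sketch of the full algorithm that the pseudocode above summarizes: off-policy n-step TD prediction with the control-variate return (7.13), on a small two-action chain. The environment, the target and behavior policies, and all hyperparameters are illustrative choices rather than anything from the book; the learned values are compared against v_pi obtained by solving the Bellman equations directly.

```python
import numpy as np

N = 7                                   # states 0..6; 0 and 6 are terminal
GAMMA = 1.0
TARGET_P_RIGHT = 0.7                    # pi(right | s), the policy being evaluated
BEHAV_P_RIGHT = 0.5                     # b(right | s), the data-generating policy

def step(s, a):
    """Deterministic chain: action is +1/-1, reward 1 only on entering state 6."""
    s2 = s + a
    return s2, (1.0 if s2 == N - 1 else 0.0), (s2 == 0 or s2 == N - 1)

def true_values():
    """Solve v_pi exactly (linear system over the nonterminal states) for comparison."""
    A_mat = np.eye(N - 2)
    b_vec = np.zeros(N - 2)
    for i, s in enumerate(range(1, N - 1)):
        for p, a in ((TARGET_P_RIGHT, 1), (1 - TARGET_P_RIGHT, -1)):
            s2, r, done = step(s, a)
            b_vec[i] += p * r
            if not done:
                A_mat[i, s2 - 1] -= GAMMA * p
    return np.linalg.solve(A_mat, b_vec)

def off_policy_nstep_td(n=3, alpha=0.02, episodes=30000, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(N)                                      # V(terminal) stays 0
    for _ in range(episodes):
        states, rewards, rhos = [N // 2], [0.0], []      # start in the middle
        T, t = np.inf, 0
        while True:
            if t < T:
                a = 1 if rng.random() < BEHAV_P_RIGHT else -1
                pi = TARGET_P_RIGHT if a == 1 else 1 - TARGET_P_RIGHT
                b = BEHAV_P_RIGHT if a == 1 else 1 - BEHAV_P_RIGHT
                rhos.append(pi / b)                      # importance ratio rho_t
                s2, r, done = step(states[t], a)
                states.append(s2)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1
            if tau >= 0:
                h = min(tau + n, T)
                G = V[states[h]]                         # G_{h:h} = V(S_h); 0 if terminal
                for k in range(h - 1, tau - 1, -1):      # control-variate recursion (7.13)
                    G = rhos[k] * (rewards[k + 1] + GAMMA * G) + (1 - rhos[k]) * V[states[k]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V[1:N - 1]

print("learned v:", np.round(off_policy_nstep_td(), 3))   # should roughly match
print("true v:   ", np.round(true_values(), 3))
```

Because ρ(k) multiplies only the reward-plus-bootstrap part while (1 - ρ(k)) falls back on the current estimate V(S(k)), the update has lower variance than the plain importance-sampling version while keeping the same expected target.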