
# Chapter06

Notes

6.1

Let V_t denote the array of value estimates in use at time t, so that δ_t = R_{t+1} + γV_t(S_{t+1}) - V_t(S_t), and let d_t = V_t(S_{t+1}) - V_{t+1}(S_{t+1}) be the change that the update at time t makes to the estimate of S_{t+1} (for TD(0) this is nonzero only when S_{t+1} = S_t). Then

G_t - V_t(S_t) = R_{t+1} + γG_{t+1} - V_t(S_t)

= δ_t + γ(G_{t+1} - V_t(S_{t+1}))

= δ_t + γ(G_{t+1} - V_{t+1}(S_{t+1})) - γd_t

= ∑_{k=t}^{T-1} (γ^{k-t}δ_k - γ^{k-t+1}d_k)

So the Monte Carlo error is the sum of TD errors plus the additional term -∑_{k=t}^{T-1} γ^{k-t+1}d_k, which vanishes when V does not change during the episode.
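As a numeric sanity check, here is a small Python sketch (the 4-state chain, its rewards, γ, and α are arbitrary assumptions made up for this check, not anything from the book) that runs TD(0) on one episode, records δ_k and d_k, and verifies the identity above at every t:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 4-state chain (state 3 is terminal) purely to check the algebra above.
gamma, alpha = 0.9, 0.1
states = [0, 1, 1, 2, 3]          # S_0 .. S_T (the repeated state makes some d_t nonzero)
rewards = [0.5, -1.0, 2.0, 1.0]   # R_1 .. R_T
T = len(rewards)

V = np.zeros(4)                   # V(terminal) stays 0
V[:3] = rng.normal(size=3)        # arbitrary initial estimates
V_hist, deltas, ds = [V.copy()], [], []

for t in range(T):
    s, s_next, r = states[t], states[t + 1], rewards[t]
    delta = r + gamma * V[s_next] - V[s]      # TD error computed with V_t
    V_new = V.copy()
    V_new[s] += alpha * delta                 # TD(0) update produces V_{t+1}
    ds.append(V[s_next] - V_new[s_next])      # d_t = V_t(S_{t+1}) - V_{t+1}(S_{t+1})
    deltas.append(delta)
    V = V_new
    V_hist.append(V.copy())

# G_t - V_t(S_t) should equal sum_k (gamma^(k-t) * delta_k - gamma^(k-t+1) * d_k).
for t in range(T):
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
    lhs = G - V_hist[t][states[t]]
    rhs = sum(gamma ** (k - t) * deltas[k] - gamma ** (k - t + 1) * ds[k]
              for k in range(t, T))
    assert np.isclose(lhs, rhs)
print("identity verified for every t")
```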

6.2

Some states already have accurate, relatively stable value estimates (e.g. the familiar highway portion of the drive home after moving to a new building). TD can bootstrap from those stable next-state values, so the new or changed states are updated quickly and accurately, whereas a Monte Carlo method must wait for a complete return at the end of each episode.

6.3

δ_t = R_{t+1} + γV(S_{t+1}) - V(S_t)

Δ = α δ_t

All values are initialized to 0.5, γ = 1, and every reward is 0 except the +1 for entering the right terminal state.

So on every transition between nonterminal states the TD error is 0 (equal estimates, zero reward) and nothing is updated; the only nonzero update happens on the terminating transition, which changes V(S_{T-1}) alone. Since the value that changed is V(A) and it decreased, the first episode must have ended ... -> A -> left terminal, with reward 0.

The change is Δ = α δ = 0.1 * (0 + 1 * 0 - 0.5) = -0.05.
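A minimal sketch of that first episode in Python (the particular trajectory C -> B -> A -> left terminal is an assumption; a longer wandering episode would behave the same way, since every pre-terminal TD error is zero):

```python
# First-episode TD(0) updates for the 5-state random walk (Example 6.2).
alpha, gamma = 0.1, 1.0
V = {s: 0.5 for s in "ABCDE"}
V["L"] = V["R"] = 0.0                                    # terminal states

episode = [("C", 0, "B"), ("B", 0, "A"), ("A", 0, "L")]  # (S, R, S') triples

for s, r, s_next in episode:
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    print(f"V({s}): delta = {delta:+.2f}, new value = {V[s]:.2f}")

# Only the last step has a nonzero TD error:
# V(A) <- 0.5 + 0.1 * (0 + 0 - 0.5) = 0.45, a change of -0.05.
```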

6.4

Intuitively, no. A larger α makes each update depend more heavily on a single noisy step, so the estimates fluctuate and may fail to converge, while smaller values of α behave like the small-α curves already shown in the figure.

I am not able to prove this rigorously, though.

6.5

After a long run the value estimates have largely converged and are close to the true values. With a large α, however, each update still depends heavily on a single noisy step, so the estimates keep being pushed away from the true values and the RMS error rises again.

This should have little to do with the initialization, since after a long run the estimates are near the true values regardless of where they started.

6.6

a = (0 + b) / 2

b = (a + c) / 2

c = (b + d) / 2

d = (c + e) / 2

e = (d + 1) / 2

Solving this linear system gives a = 1/6, b = 2/6, c = 3/6, d = 4/6, e = 5/6.
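Carrying out that last step with NumPy (a sketch; the unknowns are ordered a through e):

```python
import numpy as np

# Bellman equations of the random walk written as A x = b with x = (a, b, c, d, e).
# Each row encodes v(s) = 0.5 * v(left neighbor) + 0.5 * v(right neighbor),
# with v(left terminal) = 0 and a reward of +1 for reaching the right terminal.
A = np.array([
    [ 1.0, -0.5,  0.0,  0.0,  0.0],
    [-0.5,  1.0, -0.5,  0.0,  0.0],
    [ 0.0, -0.5,  1.0, -0.5,  0.0],
    [ 0.0,  0.0, -0.5,  1.0, -0.5],
    [ 0.0,  0.0,  0.0, -0.5,  1.0],
])
b = np.array([0.0, 0.0, 0.0, 0.0, 0.5])

print(np.linalg.solve(A, b))   # -> [0.1667 0.3333 0.5 0.6667 0.8333], i.e. 1/6 .. 5/6
```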

6.7

Δ = α δ_t

The one-step target R_{t+1} + γV(S_{t+1}) comes from the behavior policy b, so it has to be re-weighted toward the target policy π. Because the update of V(S_t) is conditioned on already being in S_t, only the action A_t taken at time t needs correcting, and the ratio of trajectory probabilities reduces to the single-step importance sampling ratio ρ_t = π(A_t|S_t) / b(A_t|S_t). The off-policy TD(0) error is then

δ_t = ρ_t (R_{t+1} + γV(S_{t+1})) - V(S_t)
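A minimal sketch of that update rule (representing the policies as dicts of action probabilities is an assumption made here for illustration):

```python
def off_policy_td0_update(V, s, a, r, s_next, pi, b, alpha=0.1, gamma=1.0):
    """One off-policy TD(0) update of the state-value table V.

    V      : dict state -> value estimate (terminal states map to 0)
    pi, b  : nested dicts, pi[s][a] and b[s][a] give the action probabilities
             under the target and behavior policies (assumed representation)
    """
    rho = pi[s][a] / b[s][a]                      # single-step importance ratio
    delta = rho * (r + gamma * V[s_next]) - V[s]  # importance-weighted TD error
    V[s] += alpha * delta
    return V
```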

6.8

Assuming Q does not change during the episode and δ_t = R_{t+1} + γQ(S_{t+1}, A_{t+1}) - Q(S_t, A_t):

G_t - Q(S_t, A_t) = R_{t+1} + γG_{t+1} - Q(S_t, A_t)

= R_{t+1} + γQ(S_{t+1}, A_{t+1}) - Q(S_t, A_t) + γ(G_{t+1} - Q(S_{t+1}, A_{t+1}))

= δ_t + γ(G_{t+1} - Q(S_{t+1}, A_{t+1}))

= δ_t + γδ_{t+1} + γ^2(G_{t+2} - Q(S_{t+2}, A_{t+2}))

= ∑_{k=t}^{T-1} γ^{k-t}δ_k

6.9 6.10

Programming exercises; a Sarsa sketch for the windy gridworld is given below.
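A compact sketch for both exercises (the grid layout, winds, start, goal, α, and ε follow Example 6.5; the two flags, the episode count, and the final greedy rollout are choices made here for illustration):

```python
import numpy as np

# Sarsa on the windy gridworld of Example 6.5, extended for Exercises 6.9/6.10:
# toggle KINGS_MOVES and STOCHASTIC_WIND below.
ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]     # upward push per column
START, GOAL = (3, 0), (3, 7)
KINGS_MOVES = True                        # Exercise 6.9: 8 moves incl. diagonals
STOCHASTIC_WIND = True                    # Exercise 6.10: wind varies by -1/0/+1
ALPHA, EPS, GAMMA = 0.5, 0.1, 1.0

if KINGS_MOVES:
    ACTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
else:
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

rng = np.random.default_rng(0)
Q = np.zeros((ROWS, COLS, len(ACTIONS)))

def step(state, a_idx):
    r, c = state
    dr, dc = ACTIONS[a_idx]
    wind = WIND[c]
    if STOCHASTIC_WIND and wind > 0:
        wind += rng.integers(-1, 2)       # -1, 0, or +1, each with probability 1/3
    nr = min(max(r + dr - wind, 0), ROWS - 1)
    nc = min(max(c + dc, 0), COLS - 1)
    return (nr, nc), -1                   # -1 reward per step until the goal

def epsilon_greedy(state):
    if rng.random() < EPS:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax(Q[state]))

for _ in range(200):                      # Sarsa episodes
    s, a = START, epsilon_greedy(START)
    while s != GOAL:
        s_next, r = step(s, a)
        a_next = epsilon_greedy(s_next)
        # Sarsa: bootstrap from the action the behavior policy actually chose.
        Q[s][a] += ALPHA * (r + GAMMA * Q[s_next][a_next] - Q[s][a])
        s, a = s_next, a_next

# Rough check: length of a greedy episode after training.
s, steps = START, 0
while s != GOAL and steps < 1000:
    s, _ = step(s, int(np.argmax(Q[s])))
    steps += 1
print("greedy episode length:", steps)
```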

6.11

Because Q(S_t, A_t) is updated toward R_{t+1} + γ max_a Q(S_{t+1}, a), i.e. toward the value of the greedy policy, while the behavior policy that actually selects the next action (e.g. ε-greedy) does not always choose that argmax. The policy being learned therefore differs from the policy generating the data, which is what makes Q-learning off-policy.
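Side by side, the two bootstrap targets (a sketch; `Q` is assumed to be a NumPy array indexed by state then action, and `a_next` is the action the ε-greedy behavior policy actually takes):

```python
import numpy as np

def td_targets(Q, r, s_next, a_next, gamma=1.0):
    q_learning_target = r + gamma * np.max(Q[s_next])    # value of the greedy action
    sarsa_target = r + gamma * Q[s_next][a_next]         # value of the action actually taken
    return q_learning_target, sarsa_target
```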

6.12

Mostly yes: with greedy selection the Q-learning target max_a Q(S_{t+1}, a) equals Q(S_{t+1}, A_{t+1}), so the two algorithms make the same updates. One subtle difference remains: Sarsa selects A_{t+1} before updating Q(S_t, A_t) while Q-learning selects it afterwards, so when S_{t+1} = S_t the chosen actions can differ.

6.13

On each step pick i = 0 or i = 1 with equal probability, then

δ_i = R_{t+1} + γ ∑_a π_i(a|S_{t+1}) Q_{1-i}(S_{t+1}, a) - Q_i(S_t, A_t)

and Q_i(S_t, A_t) is updated by Δ = α δ_i, where π_i is the ε-greedy policy with respect to Q_i.
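A sketch of that rule (the table layout and the ε-greedy helper are assumptions made for illustration; `Q` stacks the two estimates Q_0 and Q_1):

```python
import numpy as np

def eps_greedy_probs(q_row, eps):
    """Epsilon-greedy action probabilities for one state's action values."""
    n = len(q_row)
    probs = np.full(n, eps / n)
    probs[np.argmax(q_row)] += 1.0 - eps
    return probs

def double_expected_sarsa_update(Q, s, a, r, s_next, rng, alpha=0.1, gamma=1.0, eps=0.1):
    """Q has shape (2, n_states, n_actions), holding the two estimates Q_0 and Q_1."""
    i = rng.integers(2)                                # update Q_i ...
    pi = eps_greedy_probs(Q[i, s_next], eps)           # ... with pi_i epsilon-greedy w.r.t. Q_i
    target = r + gamma * np.dot(pi, Q[1 - i, s_next])  # ... and values taken from Q_{1-i}
    Q[i, s, a] += alpha * (target - Q[i, s, a])
    return Q
```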

6.14

Define the afterstate as the number of cars at each location after the overnight moves have been made, and learn V(afterstate) with TD(0); the policy then chooses the move whose resulting afterstate has the highest value.

It speeds up convergence because many different (state, action) pairs lead to the same afterstate and therefore share a single value estimate, so experience from all of them is pooled and far fewer values have to be learned.