# Chapter 06
Notes
- TD learning is a combination of MC ideas and DP ideas. Like MC, TD can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD updates estimates based in part on other learned estimates, without waiting for a final outcome.
- TD is a bootstrapping method, since each update is based in part on an existing estimate (see the TD(0) sketch below).
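A minimal sketch of tabular TD(0) prediction. The `env.reset()` / `env.step(action)` interface and the `policy(state)` function are placeholder assumptions, not anything from the book:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(S) <- V(S) + alpha * (R + gamma*V(S') - V(S))."""
    V = defaultdict(float)  # value estimate per state, default 0 for unseen states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)  # assumed (S', R, done) signature
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])  # bootstrap from the current estimate of V(S')
            state = next_state
    return V
```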
6.2
Advantages:
- No model of the environment required
- Fully on-line and incremental (updates after every step)
- Bootstrapping, so complete trajectories are not needed (works with incomplete episodes and continuing tasks)
- For a fixed policy π, TD(0) converges to vπ: in the mean for a sufficiently small constant step-size, and with probability 1 if the step-size decreases according to the usual stochastic approximation conditions
6.3
Batch MC finds the value estimates that minimize mean-squared error on the training set; batch TD(0) finds the estimates that would be exactly correct for the maximum-likelihood (certainty-equivalence) model of the Markov process.
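A small sketch contrasting the two on the A/B batch data of the book's Example 6.4 (one episode A,0,B,0; six episodes B,1; one episode B,0). Repeated small-step sweeps are used here only to approximate the true batch-TD fixed point:

```python
# Each episode is a list of (state, reward-on-leaving-that-state) pairs, gamma = 1.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch MC: average the observed returns from each state.
returns = {"A": [], "B": []}
for ep in episodes:
    rewards = [r for _, r in ep]
    for t, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[t:]))
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}

# Batch TD(0): sweep the same episodes repeatedly with a small alpha until values settle.
V_td = {"A": 0.0, "B": 0.0}
alpha = 0.01
for _ in range(5000):
    for ep in episodes:
        for t, (s, r) in enumerate(ep):
            v_next = V_td[ep[t + 1][0]] if t + 1 < len(ep) else 0.0  # terminal value is 0
            V_td[s] += alpha * (r + v_next - V_td[s])

print(V_mc)  # {'A': 0.0, 'B': 0.75}  -- minimizes squared error on the data
print(V_td)  # roughly {'A': 0.75, 'B': 0.75}  -- the certainty-equivalence values
```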
6.5
Requirements for Sarsa to converge to an optimal policy (see the sketch after this list):
- All state-action pairs (S, A) continue to be visited and updated
- The policy converges in the limit to the greedy policy (e.g. ε-greedy with ε = 1/t)
- The usual stochastic approximation conditions hold for the sequence of step-size parameters
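A minimal tabular Sarsa sketch, assuming the same placeholder `env.reset()` / `env.step()` interface as the TD(0) sketch above and a finite list `actions`:

```python
import random
from collections import defaultdict

def sarsa(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    """On-policy TD control: Q(S,A) <- Q(S,A) + alpha * (R + gamma*Q(S',A') - Q(S,A))."""
    Q = defaultdict(float)  # Q[(state, action)]

    def eps_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = eps_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = None if done else eps_greedy(next_state)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])  # uses the action actually taken next
            state, action = next_state, next_action
    return Q
```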
6.8
Afterstate value functions: evaluate the position immediately after the agent's move (the afterstate) rather than the (state, action) pair; useful when many different pairs lead to the same afterstate, as in tic-tac-toe.
Exercises
6.1
Let Vt be the value array in use at time t (i.e. before the update made at step t), so δt = Rt+1 + γVt(St+1) - Vt(St).
Define dt = Vt+1(St+1) - Vt(St+1), the change the step-t update makes to V(St+1) (nonzero only if St+1 = St).
Gt - Vt(St) = Rt+1 + γGt+1 - Vt(St) + γVt(St+1) - γVt(St+1)
= δt + γ(Gt+1 - Vt(St+1))
= δt + γdt + γ(Gt+1 - Vt+1(St+1))
= ∑_{k=t}^{T-1} γ^{k-t} (δk + γdk)
So the Monte Carlo error exceeds the sum of TD errors by ∑_{k=t}^{T-1} γ^{k-t+1} (Vk+1(Sk+1) - Vk(Sk+1)).
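A quick numerical check of this identity on an arbitrary made-up episode, applying the TD(0) updates as it goes (all numbers below are invented purely for the verification):

```python
alpha, gamma = 0.5, 0.9
states  = [0, 1, 1, 2, 0, 2]                      # S_0 .. S_{T-1}; note the repeated state
rewards = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0]        # R_1 .. R_T
T = len(states)
V = {0: 0.3, 1: -0.2, 2: 0.7, "terminal": 0.0}

snapshots, deltas, ds = [], [], []
for t in range(T):
    s = states[t]
    s_next = states[t + 1] if t + 1 < T else "terminal"
    snapshots.append(dict(V))                      # V_t, the table before the step-t update
    delta = rewards[t] + gamma * V[s_next] - V[s]  # delta_t computed with V_t
    v_next_before = V[s_next]
    V[s] += alpha * delta                          # this produces V_{t+1}
    deltas.append(delta)
    ds.append(V[s_next] - v_next_before)           # d_t = V_{t+1}(S_{t+1}) - V_t(S_{t+1})

for t in range(T):
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
    lhs = G - snapshots[t][states[t]]
    rhs = sum(gamma ** (k - t) * (deltas[k] + gamma * ds[k]) for k in range(t, T))
    assert abs(lhs - rhs) < 1e-10
print("identity holds at every t")
```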
6.2
Many states' properties are relatively static and their values are already well estimated (in the driving example, most of the original route is unchanged), so updating the current state from the stable, accurate next-state value gives a quick and good correction, whereas MC has to wait for the complete return.
6.3
δt = Rt+1 + γV(St+1) - V(St)
ΔV(St) = α δt
All nonterminal values are initialized to 0.5, terminal values to 0, γ = 1, and every reward in this first episode was 0 (only termination on the right gives +1).
So before the terminal transition every TD error is 0 (R = 0 and V(St+1) = V(St) = 0.5) and nothing is updated; only V(ST-1) changes, on the transition into the terminal state. Since only V(A) changed, the first episode ended …->A->terminal, and since the change was negative it was the left terminal state (value 0).
ΔV(A) = 0.1 * (0 + 1 * 0 - 0.5) = -0.05
6.4
Intuitively, no. A larger α causes bigger fluctuations and can prevent either algorithm from converging, and smaller values of α are already covered by the figure, so the qualitative conclusion should not change.
I am not able to prove this rigorously.
6.5
After a long run the estimates are close to the true values and relatively stable, but a large α makes each update depend heavily on a single (noisy) step, so the estimates keep being pushed away from the true values and the error rises again.
It should have little to do with the initialization, since after a long run the estimates are near the true values regardless of how they were initialized.
6.6
a = (0 + b) / 2
b = (a + c) / 2
c = (b + d) / 2
d = (c + e) / 2
e = (1 + d) / 2
Solving the equations gives a = 1/6, b = 2/6, c = 3/6, d = 4/6, e = 5/6.
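A quick verification of that linear system (same structure as above, nothing new assumed):

```python
import numpy as np

# v = P v + r for the five nonterminal states A..E of the random walk, gamma = 1
P = 0.5 * np.array([
    [0, 1, 0, 0, 0],   # A: left -> terminal (value 0), right -> B
    [1, 0, 1, 0, 0],   # B
    [0, 1, 0, 1, 0],   # C
    [0, 0, 1, 0, 1],   # D
    [0, 0, 0, 1, 0],   # E: right -> terminal with reward +1
])
r = np.array([0, 0, 0, 0, 0.5])  # expected one-step reward; only E can earn the +1 (prob 0.5)
v = np.linalg.solve(np.eye(5) - P, r)
print(v)  # [1/6, 2/6, 3/6, 4/6, 5/6]
```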
6.7
Use the per-step importance sampling ratio ρt = π(At|St) / b(At|St).
Off-policy TD(0): V(St) ← V(St) + α δt, with
δt = ρt (Rt+1 + γV(St+1)) - V(St)
Only the single action At needs correcting, because V(St+1) already estimates vπ: under the behavior policy b,
E[ρt (Rt+1 + γvπ(St+1)) | St = s] = ∑_a π(a|s) qπ(s, a) = vπ(s).
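A sketch of that update. The behavior and target policies are assumed to be given as functions returning a dict of action probabilities, and the env interface is the same placeholder as above:

```python
import random
from collections import defaultdict

def off_policy_td0(env, actions, b_probs, pi_probs, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Estimate v_pi from data generated by behavior policy b,
    using the one-step importance sampling ratio rho_t = pi(A|S) / b(A|S)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            probs_b = b_probs(state)                                   # action -> prob under b
            action = random.choices(actions, weights=[probs_b[a] for a in actions])[0]
            next_state, reward, done = env.step(action)
            rho = pi_probs(state)[action] / probs_b[action]
            target = rho * (reward + (0.0 if done else gamma * V[next_state]))
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```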
6.8
With δt = Rt+1 + γQ(St+1, At+1) - Q(St, At) and Q held fixed during the episode (terminal Q treated as 0):
Gt - Q(St, At) = Rt+1 + γGt+1 - Q(St, At) + γQ(St+1, At+1) - γQ(St+1, At+1)
= δt + γ(Gt+1 - Q(St+1, At+1))
= δt + γδt+1 + γ²(Gt+2 - Q(St+2, At+2))
= ∑_{k=t}^{T-1} γ^{k-t} δk
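A quick numerical check on arbitrary made-up data, with Q held fixed as assumed above:

```python
gamma = 0.9
sa      = [(0, "a"), (1, "b"), (0, "b"), (2, "a")]   # (S_t, A_t) for t = 0..T-1
rewards = [1.0, -0.5, 2.0, 0.3]                      # R_1 .. R_T
Q = {(0, "a"): 0.2, (1, "b"): -0.1, (0, "b"): 0.4, (2, "a"): 0.7}
T = len(sa)

q = lambda t: Q[sa[t]] if t < T else 0.0             # terminal Q treated as 0
deltas = [rewards[t] + gamma * q(t + 1) - q(t) for t in range(T)]
for t in range(T):
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
    assert abs((G - q(t)) - sum(gamma ** (k - t) * deltas[k] for k in range(t, T))) < 1e-12
print("action-value identity holds")
```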
6.9 6.10
6.11
Because Q(St, At) is always updated toward Rt+1 + γ max_a Q(St+1, a), i.e. toward the greedy (target) policy, while the behavior policy that actually generates the data (e.g. ε-greedy) does not always choose the argmax action as the next action. The policy being learned about differs from the policy used to select actions.
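A minimal Q-learning sketch (same placeholder interface as the Sarsa sketch above); the target uses the max over actions even though behavior is ε-greedy:

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Off-policy TD control: the target is greedy in Q, the behavior is eps-greedy."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # behavior policy: eps-greedy with respect to Q
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # target policy: greedy, regardless of which action will actually be taken next
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```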
6.12
Almost, but not exactly. With greedy selection, Sarsa's target Rt+1 + γQ(St+1, At+1) equals Q-learning's target Rt+1 + γ max_a Q(St+1, a), so the updates have the same form. They can still differ when St+1 = St: Sarsa chooses At+1 before updating Q(St, At), while Q-learning chooses the next action after the update, and the update may change which action is greedy.
6.13
On each step pick i ∈ {0, 1} with probability 0.5 and update
δi = Rt+1 + γ ∑_a πi(a|St+1) Q1-i(St+1, a) - Qi(St, At)
Qi(St, At) ← Qi(St, At) + α δi
where πi is the ε-greedy policy with respect to Qi (the expectation is evaluated with the other table, Q1-i).
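A sketch of that update: the ε-greedy target policy is built from the table being updated and evaluated with the other table; behavior uses the sum of the two tables (an assumption carried over from Double Q-learning, not spelled out in the exercise). Same placeholder env interface as above:

```python
import random
from collections import defaultdict

def double_expected_sarsa(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = [defaultdict(float), defaultdict(float)]      # Q_0 and Q_1

    def eps_greedy_probs(q, state):
        """Epsilon-greedy action probabilities with respect to table q."""
        greedy = max(actions, key=lambda a: q[(state, a)])
        return {a: epsilon / len(actions) + (1 - epsilon) * (a == greedy) for a in actions}

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # behave eps-greedily with respect to Q_0 + Q_1
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[0][(state, a)] + Q[1][(state, a)])
            next_state, reward, done = env.step(action)
            i = random.randrange(2)                   # which table to update this step
            if done:
                expected = 0.0
            else:
                probs = eps_greedy_probs(Q[i], next_state)            # pi_i from Q_i
                expected = sum(probs[a] * Q[1 - i][(next_state, a)]   # evaluated with Q_{1-i}
                               for a in actions)
            Q[i][(state, action)] += alpha * (reward + gamma * expected - Q[i][(state, action)])
            state = next_state
    return Q
```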
6.14
Define the afterstate as the number of cars at each location after the overnight moves, i.e. after the action has been applied but before the random rentals and returns, then learn V(afterstate) with TD(0).
It speeds up convergence because (see the sketch after this list):
- In DP, each round of policy evaluation requires many sweeps; every sweep recomputes V(s) for all states, and each V(s) computation has to sum over all possible successors s'.
- In TD, each update of a state needs only the single observed successor s'; the transition is just s -> s'. The randomness is supplied by the policy and the environment themselves, so rarely visited states and transitions are largely ignored during learning.
- In addition, many different (state, action) pairs lead to the same afterstate (different morning counts and moves producing the same post-move configuration), so their estimates are pooled into one value.
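A sketch of afterstate TD(0) for this task under a fixed (here: random placeholder) policy. The parameters are the usual ones from Example 4.2 (20 cars per lot, at most 5 moved at $2 each, $10 per rental, Poisson requests 3,4 and returns 3,2, γ = 0.9), and the day simulation is my own simplification, not code from the book:

```python
import random
from collections import defaultdict

import numpy as np

MAX_CARS, MAX_MOVE, GAMMA, ALPHA = 20, 5, 0.9, 0.1
REQ, RET = (3, 4), (3, 2)   # Poisson means for rental requests and returns at the two lots

def afterstate(state, move):
    """Cars at each lot after moving `move` cars from lot 1 to lot 2 (negative = 2 -> 1)."""
    c1, c2 = state
    return (min(MAX_CARS, c1 - move), min(MAX_CARS, c2 + move))

def policy(state):
    """Placeholder policy: a random legal move (a real agent would act greedily w.r.t. V)."""
    c1, c2 = state
    return random.randint(-min(MAX_MOVE, c2), min(MAX_MOVE, c1))

V = defaultdict(float)      # one value per afterstate, shared by every (state, action) pair that reaches it
state = (10, 10)
after = afterstate(state, policy(state))
for _ in range(100_000):
    # simulate one day starting from the afterstate: rentals earn $10 each, then cars come back
    revenue, next_state = 0, []
    for i, cars in enumerate(after):
        rented = min(cars, np.random.poisson(REQ[i]))
        revenue += 10 * rented
        next_state.append(min(MAX_CARS, cars - rented + np.random.poisson(RET[i])))
    next_state = tuple(next_state)
    next_move = policy(next_state)
    next_after = afterstate(next_state, next_move)
    # the cost of the next move is paid at the start of the next step, hence inside the discount
    target = revenue + GAMMA * (-2 * abs(next_move) + V[next_after])
    V[after] += ALPHA * (target - V[after])   # TD(0) on afterstate values
    after = next_after
print(len(V), "afterstates visited")
```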