# Chapter 06
Notes
- TD learning is a combination of MC ideas and DP ideas. Like MC, TD can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD updates estimates based in part on other learned estimates, without waiting for a final outcome.
- TD is a bootstrapping method, since each update is based in part on an existing estimate (see the TD(0) sketch below).
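A minimal sketch of tabular TD(0) prediction. The `env.reset()` / `env.step(action)` interface and the `policy(state)` function are placeholder assumptions, not anything from the book:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(S) <- V(S) + alpha * (R + gamma*V(S') - V(S))."""
    V = defaultdict(float)  # value estimate per state, default 0 for unseen states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)  # assumed (S', R, done) signature
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])  # bootstrap from the current estimate of V(S')
            state = next_state
    return V
```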
6.2
Advantages:
- No model of the environment required
- Fully on-line and incremental (updates after every step)
- Bootstrapping, so complete trajectories are not needed (works with incomplete episodes and continuing tasks)
- For a fixed policy π, TD(0) converges to vπ: in the mean for a sufficiently small constant step-size, and with probability 1 if the step-size decreases according to the usual stochastic approximation conditions
6.3
Batch MC finds the value estimates that minimize mean-squared error on the training set; batch TD(0) finds the estimates that would be exactly correct for the maximum-likelihood (certainty-equivalence) model of the Markov process.
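A small sketch contrasting the two on the A/B batch data of the book's Example 6.4 (one episode A,0,B,0; six episodes B,1; one episode B,0). Repeated small-step sweeps are used here only to approximate the true batch-TD fixed point:

```python
# Each episode is a list of (state, reward-on-leaving-that-state) pairs, gamma = 1.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch MC: average the observed returns from each state.
returns = {"A": [], "B": []}
for ep in episodes:
    rewards = [r for _, r in ep]
    for t, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[t:]))
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}

# Batch TD(0): sweep the same episodes repeatedly with a small alpha until values settle.
V_td = {"A": 0.0, "B": 0.0}
alpha = 0.01
for _ in range(5000):
    for ep in episodes:
        for t, (s, r) in enumerate(ep):
            v_next = V_td[ep[t + 1][0]] if t + 1 < len(ep) else 0.0  # terminal value is 0
            V_td[s] += alpha * (r + v_next - V_td[s])

print(V_mc)  # {'A': 0.0, 'B': 0.75}  -- minimizes squared error on the data
print(V_td)  # roughly {'A': 0.75, 'B': 0.75}  -- the certainty-equivalence values
```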
6.5
Requirements for Sarsa to converge to an optimal policy (see the sketch after this list):
- All state-action pairs (S, A) continue to be visited and updated
- The policy converges in the limit to the greedy policy (e.g. ε-greedy with ε = 1/t)
- The usual stochastic approximation conditions hold for the sequence of step-size parameters
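A minimal tabular Sarsa sketch, assuming the same placeholder `env.reset()` / `env.step()` interface as the TD(0) sketch above and a finite list `actions`:

```python
import random
from collections import defaultdict

def sarsa(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    """On-policy TD control: Q(S,A) <- Q(S,A) + alpha * (R + gamma*Q(S',A') - Q(S,A))."""
    Q = defaultdict(float)  # Q[(state, action)]

    def eps_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = eps_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = None if done else eps_greedy(next_state)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])  # uses the action actually taken next
            state, action = next_state, next_action
    return Q
```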
6.8
Afterstate value functions: evaluate the position immediately after the agent's move (the afterstate) rather than the (state, action) pair; useful when many different pairs lead to the same afterstate, as in tic-tac-toe.
Exercises
6.1
Let Vt be the value array in use at time t (i.e. before the update made at step t), so δt = Rt+1 + γVt(St+1) - Vt(St).
Define dt = Vt+1(St+1) - Vt(St+1), the change the step-t update makes to V(St+1) (nonzero only if St+1 = St).
Gt - Vt(St) = Rt+1 + γGt+1 - Vt(St) + γVt(St+1) - γVt(St+1)
= δt + γ(Gt+1 - Vt(St+1))
= δt + γdt + γ(Gt+1 - Vt+1(St+1))
= ∑_{k=t}^{T-1} γ^{k-t} (δk + γdk)
So the Monte Carlo error exceeds the sum of TD errors by ∑_{k=t}^{T-1} γ^{k-t+1} (Vk+1(Sk+1) - Vk(Sk+1)).
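A quick numerical check of this identity on an arbitrary made-up episode, applying the TD(0) updates as it goes (all numbers below are invented purely for the verification):

```python
alpha, gamma = 0.5, 0.9
states  = [0, 1, 1, 2, 0, 2]                      # S_0 .. S_{T-1}; note the repeated state
rewards = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0]        # R_1 .. R_T
T = len(states)
V = {0: 0.3, 1: -0.2, 2: 0.7, "terminal": 0.0}

snapshots, deltas, ds = [], [], []
for t in range(T):
    s = states[t]
    s_next = states[t + 1] if t + 1 < T else "terminal"
    snapshots.append(dict(V))                      # V_t, the table before the step-t update
    delta = rewards[t] + gamma * V[s_next] - V[s]  # delta_t computed with V_t
    v_next_before = V[s_next]
    V[s] += alpha * delta                          # this produces V_{t+1}
    deltas.append(delta)
    ds.append(V[s_next] - v_next_before)           # d_t = V_{t+1}(S_{t+1}) - V_t(S_{t+1})

for t in range(T):
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
    lhs = G - snapshots[t][states[t]]
    rhs = sum(gamma ** (k - t) * (deltas[k] + gamma * ds[k]) for k in range(t, T))
    assert abs(lhs - rhs) < 1e-10
print("identity holds at every t")
```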
6.2
Many states' properties are relatively static and their values are already well estimated (in the driving example, most of the original route is unchanged), so updating the current state from the stable, accurate next-state value gives a quick and good correction, whereas MC has to wait for the complete return.
6.3
δt = Rt+1 + γV(St+1) - V(St)
ΔV(St) = α δt
All nonterminal values are initialized to 0.5, terminal values to 0, γ = 1, and every reward in this first episode was 0 (only termination on the right gives +1).
So before the terminal transition every TD error is 0 (R = 0 and V(St+1) = V(St) = 0.5) and nothing is updated; only V(ST-1) changes, on the transition into the terminal state. Since only V(A) changed, the first episode ended …->A->terminal, and since the change was negative it was the left terminal state (value 0).
ΔV(A) = 0.1 * (0 + 1 * 0 - 0.5) = -0.05
6.4
Intuitively, no. A larger α causes bigger fluctuations and can prevent either algorithm from converging, and smaller values of α are already covered by the figure, so the qualitative conclusion should not change.
I am not able to prove this rigorously.
6.5
After a long run the estimates are close to the true values and relatively stable, but a large α makes each update depend heavily on a single (noisy) step, so the estimates keep being pushed away from the true values and the error rises again.
It should have little to do with the initialization, since after a long run the estimates are near the true values regardless of how they were initialized.
6.6
a = (0 + b) / 2
b = (a + c) / 2
c = (b + d) / 2
d = (c + e) / 2
e = (1 + d) / 2
Solving the equations gives a = 1/6, b = 2/6, c = 3/6, d = 4/6, e = 5/6.
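A quick verification of that linear system (same structure as above, nothing new assumed):

```python
import numpy as np

# v = P v + r for the five nonterminal states A..E of the random walk, gamma = 1
P = 0.5 * np.array([
    [0, 1, 0, 0, 0],   # A: left -> terminal (value 0), right -> B
    [1, 0, 1, 0, 0],   # B
    [0, 1, 0, 1, 0],   # C
    [0, 0, 1, 0, 1],   # D
    [0, 0, 0, 1, 0],   # E: right -> terminal with reward +1
])
r = np.array([0, 0, 0, 0, 0.5])  # expected one-step reward; only E can earn the +1 (prob 0.5)
v = np.linalg.solve(np.eye(5) - P, r)
print(v)  # [1/6, 2/6, 3/6, 4/6, 5/6]
```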
6.7
Use the per-step importance sampling ratio ρt = π(At|St) / b(At|St).
Off-policy TD(0): V(St) ← V(St) + α δt, with
δt = ρt (Rt+1 + γV(St+1)) - V(St)
Only the single action At needs correcting, because V(St+1) already estimates vπ: under the behavior policy b,
E[ρt (Rt+1 + γvπ(St+1)) | St = s] = ∑_a π(a|s) qπ(s, a) = vπ(s).
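A sketch of that update. The behavior and target policies are assumed to be given as functions returning a dict of action probabilities, and the env interface is the same placeholder as above:

```python
import random
from collections import defaultdict

def off_policy_td0(env, actions, b_probs, pi_probs, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Estimate v_pi from data generated by behavior policy b,
    using the one-step importance sampling ratio rho_t = pi(A|S) / b(A|S)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            probs_b = b_probs(state)                                   # action -> prob under b
            action = random.choices(actions, weights=[probs_b[a] for a in actions])[0]
            next_state, reward, done = env.step(action)
            rho = pi_probs(state)[action] / probs_b[action]
            target = rho * (reward + (0.0 if done else gamma * V[next_state]))
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```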
6.8
With δt = Rt+1 + γQ(St+1, At+1) - Q(St, At) and Q held fixed during the episode (terminal Q treated as 0):
Gt - Q(St, At) = Rt+1 + γGt+1 - Q(St, At) + γQ(St+1, At+1) - γQ(St+1, At+1)
= δt + γ(Gt+1 - Q(St+1, At+1))
= δt + γδt+1 + γ²(Gt+2 - Q(St+2, At+2))
= ∑_{k=t}^{T-1} γ^{k-t} δk
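A quick numerical check on arbitrary made-up data, with Q held fixed as assumed above:

```python
gamma = 0.9
sa      = [(0, "a"), (1, "b"), (0, "b"), (2, "a")]   # (S_t, A_t) for t = 0..T-1
rewards = [1.0, -0.5, 2.0, 0.3]                      # R_1 .. R_T
Q = {(0, "a"): 0.2, (1, "b"): -0.1, (0, "b"): 0.4, (2, "a"): 0.7}
T = len(sa)

q = lambda t: Q[sa[t]] if t < T else 0.0             # terminal Q treated as 0
deltas = [rewards[t] + gamma * q(t + 1) - q(t) for t in range(T)]
for t in range(T):
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
    assert abs((G - q(t)) - sum(gamma ** (k - t) * deltas[k] for k in range(t, T))) < 1e-12
print("action-value identity holds")
```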
6.9 6.10
6.11
Because Q(St, At) is always updated toward Rt+1 + γ max_a Q(St+1, a), i.e. toward the greedy (target) policy, while the behavior policy that actually generates the data (e.g. ε-greedy) does not always choose the argmax action as the next action. The policy being learned about differs from the policy used to select actions.
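A minimal Q-learning sketch (same placeholder interface as the Sarsa sketch above); the target uses the max over actions even though behavior is ε-greedy:

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Off-policy TD control: the target is greedy in Q, the behavior is eps-greedy."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # behavior policy: eps-greedy with respect to Q
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # target policy: greedy, regardless of which action will actually be taken next
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```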
6.12
Almost, but not exactly. With greedy selection, Sarsa's target Rt+1 + γQ(St+1, At+1) equals Q-learning's target Rt+1 + γ max_a Q(St+1, a), so the updates have the same form. They can still differ when St+1 = St: Sarsa chooses At+1 before updating Q(St, At), while Q-learning chooses the next action after the update, and the update may change which action is greedy.
6.13
On each step pick i ∈ {0, 1} with probability 0.5 and update
δi = Rt+1 + γ ∑_a πi(a|St+1) Q1-i(St+1, a) - Qi(St, At)
Qi(St, At) ← Qi(St, At) + α δi
where πi is the ε-greedy policy with respect to Qi (the expectation is evaluated with the other table, Q1-i).
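A sketch of that update: the ε-greedy target policy is built from the table being updated and evaluated with the other table; behavior uses the sum of the two tables (an assumption carried over from Double Q-learning, not spelled out in the exercise). Same placeholder env interface as above:

```python
import random
from collections import defaultdict

def double_expected_sarsa(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = [defaultdict(float), defaultdict(float)]      # Q_0 and Q_1

    def eps_greedy_probs(q, state):
        """Epsilon-greedy action probabilities with respect to table q."""
        greedy = max(actions, key=lambda a: q[(state, a)])
        return {a: epsilon / len(actions) + (1 - epsilon) * (a == greedy) for a in actions}

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # behave eps-greedily with respect to Q_0 + Q_1
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[0][(state, a)] + Q[1][(state, a)])
            next_state, reward, done = env.step(action)
            i = random.randrange(2)                   # which table to update this step
            if done:
                expected = 0.0
            else:
                probs = eps_greedy_probs(Q[i], next_state)            # pi_i from Q_i
                expected = sum(probs[a] * Q[1 - i][(next_state, a)]   # evaluated with Q_{1-i}
                               for a in actions)
            Q[i][(state, action)] += alpha * (reward + gamma * expected - Q[i][(state, action)])
            state = next_state
    return Q
```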
6.14
Define the afterstate as the number of cars at each location after the overnight moves, i.e. after the action has been applied but before the random rentals and returns, then learn V(afterstate) with TD(0).
It speeds up convergence because (see the sketch after this list):
- In DP, each round of policy evaluation requires many sweeps; every sweep recomputes V(s) for all states, and each V(s) computation has to sum over all possible successors s'.
- In TD, each update of a state needs only the single observed successor s'; the transition is just s -> s'. The randomness is supplied by the policy and the environment themselves, so rarely visited states and transitions are largely ignored during learning.
- In addition, many different (state, action) pairs lead to the same afterstate (different morning counts and moves producing the same post-move configuration), so their estimates are pooled into one value.
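A sketch of afterstate TD(0) for this task under a fixed (here: random placeholder) policy. The parameters are the usual ones from Example 4.2 (20 cars per lot, at most 5 moved at $2 each, $10 per rental, Poisson requests 3,4 and returns 3,2, γ = 0.9), and the day simulation is my own simplification, not code from the book:

```python
import random
from collections import defaultdict

import numpy as np

MAX_CARS, MAX_MOVE, GAMMA, ALPHA = 20, 5, 0.9, 0.1
REQ, RET = (3, 4), (3, 2)   # Poisson means for rental requests and returns at the two lots

def afterstate(state, move):
    """Cars at each lot after moving `move` cars from lot 1 to lot 2 (negative = 2 -> 1)."""
    c1, c2 = state
    return (min(MAX_CARS, c1 - move), min(MAX_CARS, c2 + move))

def policy(state):
    """Placeholder policy: a random legal move (a real agent would act greedily w.r.t. V)."""
    c1, c2 = state
    return random.randint(-min(MAX_MOVE, c2), min(MAX_MOVE, c1))

V = defaultdict(float)      # one value per afterstate, shared by every (state, action) pair that reaches it
state = (10, 10)
after = afterstate(state, policy(state))
for _ in range(100_000):
    # simulate one day starting from the afterstate: rentals earn $10 each, then cars come back
    revenue, next_state = 0, []
    for i, cars in enumerate(after):
        rented = min(cars, np.random.poisson(REQ[i]))
        revenue += 10 * rented
        next_state.append(min(MAX_CARS, cars - rented + np.random.poisson(RET[i])))
    next_state = tuple(next_state)
    next_move = policy(next_state)
    next_after = afterstate(next_state, next_move)
    # the cost of the next move is paid at the start of the next step, hence inside the discount
    target = revenue + GAMMA * (-2 * abs(next_move) + V[next_after])
    V[after] += ALPHA * (target - V[after])   # TD(0) on afterstate values
    after = next_after
print(len(V), "afterstates visited")
```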