Chapter09
Notes
In fact, all the theoretical results for methods using function approximation presented in this part of the book apply equally well to cases of partial observability. What function approximation can’t do, however, is augment the state representation with memories of past observations.
9.1
For RL, the target functions are nonstationary. For example, in control methods based on GPI we often seek to learn q_π while π changes. Even when the policy stays fixed, the target values of training examples are nonstationary if they are generated by bootstrapping methods (DP, TD). Methods that cannot easily handle such nonstationarity are less suitable for RL.
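A minimal sketch of semi-gradient TD(0) with linear function approximation, showing that the bootstrapped target r + γ·v̂(S', w) is computed from the current weights and therefore keeps shifting as w is updated; env_step and features are hypothetical placeholders, not functions from the book.

```python
import numpy as np

def v_hat(x_s, w):
    """Linear value estimate: v_hat(s, w) = w^T x(s)."""
    return w @ x_s

def semi_gradient_td0(env_step, features, n_features,
                      gamma=0.99, alpha=0.01, n_steps=10_000):
    # env_step(s) -> (s_next, reward, done) and features(s) -> feature vector
    # are hypothetical placeholders for illustration only.
    w = np.zeros(n_features)
    s = 0                                    # assume an integer start state 0
    for _ in range(n_steps):
        s_next, r, done = env_step(s)
        x_s = features(s)
        # The bootstrapped target uses the *current* w, so the training
        # targets are nonstationary even when the policy is fixed.
        target = r if done else r + gamma * v_hat(features(s_next), w)
        w += alpha * (target - v_hat(x_s, w)) * x_s   # semi-gradient update
        s = 0 if done else s_next
    return w
```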
9.4
Critical to these convergence results is that states are updated according to the on-policy distribution. For other update distributions, bootstrapping methods using function approximation may actually diverge to infinity.
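A small sketch in the spirit of the classic two-state "w → 2w" divergence example: with features x(s1) = 1 and x(s2) = 2 and a zero-reward transition s1 → s2, repeatedly applying the semi-gradient TD(0) update only to that transition (a skewed, off-policy-like update distribution) makes the single weight grow without bound whenever γ > 0.5. The constants below are illustrative choices.

```python
# Two states with features x(s1) = 1, x(s2) = 2, reward 0 on s1 -> s2.
# Updating only this transition gives td_error = (2*gamma - 1) * w, so w
# grows geometrically by a factor (1 + alpha*(2*gamma - 1)) per update.
gamma, alpha = 0.9, 0.1
x_s1, x_s2, reward = 1.0, 2.0, 0.0

w = 1.0
for step in range(1, 51):
    td_error = reward + gamma * (w * x_s2) - (w * x_s1)
    w += alpha * td_error * x_s1              # semi-gradient TD(0) update
    if step % 10 == 0:
        print(f"step {step:3d}: w = {w:.3e}")  # w keeps growing
```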
MC targets (full returns) have higher variance than TD targets.
Exercises
9.1
9.2
For a k-dimensional state, each dimension’s exponent c_{i,j} can take any of the n + 1 values {0, 1, …, n}, so there are (n + 1)^k distinct features.
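A quick check of the count, assuming the order-n polynomial basis where each of the k dimensions gets an exponent in {0, …, n}; the n and k values below are arbitrary examples.

```python
from itertools import product

# Each of the k state dimensions gets an exponent in {0, 1, ..., n}, so
# enumerating all exponent tuples gives (n + 1)**k distinct features.
n, k = 2, 4
exponent_tuples = list(product(range(n + 1), repeat=k))
print(len(exponent_tuples), (n + 1) ** k)   # 81 81
```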
9.3
n = 2; c_{i,j} ∈ {0, 1, 2}
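A small sketch enumerating the (2 + 1)^2 = 9 monomials this choice produces for a two-dimensional state (s1, s2); the helper poly_features is just for illustration.

```python
from itertools import product

# n = 2, k = 2: exponents in {0, 1, 2} for each component give the 9 features
# 1, s2, s2^2, s1, s1*s2, s1*s2^2, s1^2, s1^2*s2, s1^2*s2^2.
def poly_features(s1, s2, n=2):
    return [s1 ** c1 * s2 ** c2 for c1, c2 in product(range(n + 1), repeat=2)]

print(poly_features(2.0, 3.0))   # [1.0, 3.0, 9.0, 2.0, 6.0, 18.0, 4.0, 12.0, 36.0]
```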
9.4
Use tilings that are dense along the important dimension: for example, narrow stripes along the important dimension and coarse stripes along the other dimension.
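A rough sketch of such asymmetric stripe tilings for a state in [0, 1)^2; the tile counts and per-tiling offsets are hypothetical illustration values, and no hashing is used.

```python
# Narrow stripes along the important dimension 0, coarse stripes along
# dimension 1. Tile counts and offsets are hypothetical illustration values.
def active_tiles(s, n_tilings=8, n_fine=50, n_coarse=5):
    """Return one active (tiling, fine_index, coarse_index) tile per tiling."""
    tiles = []
    for t in range(n_tilings):
        offset = t / n_tilings                             # shift each tiling
        fine = int(s[0] * n_fine + offset) % n_fine        # dense along dim 0
        coarse = int(s[1] * n_coarse + offset) % n_coarse  # coarse along dim 1
        tiles.append((t, fine, coarse))
    return tiles

print(active_tiles((0.372, 0.81)))   # 8 active tiles, one per tiling
```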
9.5
Suppose each tiling i is represented by a sub-vector T_i = [b_0, b_1, …, b_{n_i}], with b_j ∈ {0, 1} and n_i the number of tiles in tiling i.
The full encoding of a state is the concatenation X = [T_0, T_1, …, T_97].
X^T X = T_0^T T_0 + T_1^T T_1 + … + T_97^T T_97
Since each T_i is a one-hot vector (exactly one tile per tiling is active), T_i^T T_i = 1, so X^T X = ∑_i T_i^T T_i = 98.
τ = 10
α = (τ E[X^T X])^{-1} = (10 × 98)^{-1} ≈ 0.00102
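The same arithmetic as a short check: the 98 tilings come from 7 × 8 single-dimension stripe tilings plus 21 × 2 pairwise tilings, and tile coding activates exactly one feature per tiling.

```python
# E[x^T x] = number of active features = number of tilings.
n_tilings = 7 * 8 + 21 * 2          # = 98
tau = 10                            # desired ~10 presentations before asymptote
alpha = 1 / (tau * n_tilings)
print(alpha)                        # 0.001020408...
```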