Chapter08
Notes
8.1
- Plan-space methods are difficult to apply efficiently to the stochastic sequential decision problems that are the focus of reinforcement learning.
8.3
In a planning context, exploration means trying actions that improve the model, whereas exploitation means behaving in the optimal way given the current model.
Exercises
8.1
Yes, it could do better than one-step Dyna planning; for example, a multi-step Expected Q-learning method could be a candidate.
- n-step Q-learning is itself a bootstrapping algorithm.
- The planning policy can be treated as a random policy that starts from a previously visited (S, A) pair, much as Dyna's planning step does.
- In this deterministic maze the learned Model is in effect identical to the real environment.
- n-step Dyna planning has been shown to perform better than one-step planning (a one-step tabular Dyna-Q sketch follows for comparison).
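For comparison, here is a minimal one-step tabular Dyna-Q sketch. The `env` interface (`reset()` returning the start state, `step(s, a)` returning `(reward, next_state, done)`), `n_actions`, and the hyper-parameter defaults are assumptions for illustration, not code from the book.

```python
import random
from collections import defaultdict

def dyna_q(env, n_actions, episodes=50, n_planning=5,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """One-step tabular Dyna-Q for a deterministic environment (sketch)."""
    Q = defaultdict(float)   # Q[(state, action)]; the terminal state stays at 0
    model = {}               # model[(state, action)] = (reward, next_state)

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()                      # assumed env interface
        done = False
        while not done:
            a = epsilon_greedy(s)
            r, s2, done = env.step(s, a)     # assumed: (reward, next_state, done)

            # Direct RL: one-step Q-learning update from real experience.
            target = r + gamma * max(Q[(s2, b)] for b in range(n_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            # Model learning (deterministic: remember the last observed outcome).
            model[(s, a)] = (r, s2)

            # Planning: n extra one-step updates from simulated experience.
            for _ in range(n_planning):
                ps, pa = random.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                ptarget = pr + gamma * max(Q[(ps2, b)] for b in range(n_actions))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])

            s = s2
    return Q
```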
8.2
For a new or newly changed environment, the initial Model and Q values fail to reflect the real environment. The agent needs extra exploration to discover the more promising policies, and the exploration bonus of Dyna-Q+ provides exactly that (see the sketch below).
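A minimal sketch of the Dyna-Q+ ingredient that supplies this exploration: during planning, the simulated reward gets a bonus κ·√τ, where τ counts how many real time steps have passed since the pair was last tried in the environment. The function signature and the `Q`, `model`, `tau` dictionaries are assumed structures in the style of the Dyna-Q sketch above.

```python
import math
import random

def dyna_q_plus_planning(Q, model, tau, n_planning, n_actions,
                         alpha=0.1, gamma=0.95, kappa=1e-3):
    """Planning step of Dyna-Q+ (sketch): simulated rewards get a bonus
    kappa * sqrt(tau), where tau[(s, a)] counts the real time steps since
    (s, a) was last tried in the environment."""
    for _ in range(n_planning):
        s, a = random.choice(list(model))
        r, s2 = model[(s, a)]
        bonus = kappa * math.sqrt(tau[(s, a)])   # stale pairs look more attractive
        target = (r + bonus) + gamma * max(Q[(s2, b)] for b in range(n_actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
```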
8.3
Two possible explanations:
- The longer path requires more exploration, and exploration carries an extra cost in steps.
- The path is easy to find, so the additional exploration is largely wasted.
8.4
Total steps = 8000, with the environment changed at step 4000; red = Dyna-Q, green = Dyna-Q+.
[Figure: blocking-maze experiment ("Move The Block")]
[Figure: shortcut-maze experiment ("Shortcut")]
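The alternative this exercise asks to test uses the κ·√τ bonus only at action-selection time. A minimal sketch of that selection rule, assuming the same `Q` and `tau` dictionaries as above:

```python
import math

def select_action(Q, tau, s, n_actions, kappa=1e-3):
    """Exercise 8.4 variant (sketch): always pick the action maximizing
    Q(s, a) + kappa * sqrt(tau(s, a)). The bonus only steers action selection;
    the Q-learning and planning updates are left unchanged."""
    return max(range(n_actions),
               key=lambda a: Q[(s, a)] + kappa * math.sqrt(tau[(s, a)]))
```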
8.5
8.5.1
It requires another table for the transition probabilities P(S'|S,A) (and the expected rewards). During planning, Q(S,A) is then given an expected update:
Q(S,A) ← ∑_{S'} P(S'|S,A) · [R(S,A,S') + γ · max_a Q(S',a)]
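A sketch of this modification, assuming visit counts are used to estimate both P(S'|S,A) and the expected reward; the class and method names are made up for illustration.

```python
from collections import defaultdict

class StochasticModel:
    """Tabular model for a stochastic environment (sketch): visit counts give
    estimates of P(s'|s,a) and of the expected reward R(s,a,s')."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[(s,a)][s']
        self.reward_sum = defaultdict(float)                 # summed rewards for (s,a,s')

    def update(self, s, a, r, s2):
        self.counts[(s, a)][s2] += 1
        self.reward_sum[(s, a, s2)] += r

    def expected_update(self, Q, s, a, n_actions, gamma=0.95):
        """Q(s,a) <- sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * max_b Q(s',b)];
        Q is assumed to be a defaultdict(float) keyed by (state, action)."""
        total = sum(self.counts[(s, a)].values())
        if total == 0:
            return
        new_q = 0.0
        for s2, n in self.counts[(s, a)].items():
            p = n / total
            r = self.reward_sum[(s, a, s2)] / n
            new_q += p * (r + gamma * max(Q[(s2, b)] for b in range(n_actions)))
        Q[(s, a)] = new_q
```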
8.5.2
Because learning from the real environment and learning from the model are separate processes, the model (now an average over past transitions) may lag behind changes in the real environment, so the agent keeps planning with stale probabilities.
8.5.3
Adjust the ratio between direct learning from the environment and planning from the model.
8.6
It would strengthen the case for sample updates over expected updates because:
- The number of next states that contribute almost nothing to the expected update increases, yet the expected update still pays for every one of them.
- The estimate produced by a few sample updates gets closer to the expected value, since the samples come mostly from the high-probability next states (see the simulation sketch below).
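A small simulation sketch of the second point (all numbers are made up for illustration): with a highly skewed next-state distribution, a modest number of sample updates already lands close to the full expected update over all b next states.

```python
import random

def sample_vs_expected(b=1000, n_samples=100, seed=0):
    """Compare a sample-update estimate with the full expected update when the
    b next-state values are reached with a highly skewed distribution."""
    rng = random.Random(seed)
    values = [rng.gauss(0.0, 1.0) for _ in range(b)]      # value of each next state
    # Skewed distribution: a handful of states get almost all the probability mass.
    weights = [1.0 / (i + 1) ** 2 for i in range(b)]
    total = sum(weights)
    probs = [w / total for w in weights]

    expected = sum(p * v for p, v in zip(probs, values))   # expected update (all b terms)
    samples = rng.choices(values, weights=probs, k=n_samples)
    sample_estimate = sum(samples) / n_samples             # sample estimate (n_samples terms)

    print(f"expected update : {expected:+.4f}  (looked at all {b} states)")
    print(f"sample estimate : {sample_estimate:+.4f}  (only {n_samples} samples)")

if __name__ == "__main__":
    sample_vs_expected()
```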
8.7
A uniform random policy behaves like a greedy policy with a huge exploration factor. When b = 1 the environment is deterministic, so a greedy policy with a small exploration factor is more suitable; the uniform policy spends too much time exploring non-optimal candidates.
- The gap between on-policy and uniform sampling increases as b increases.
- The gap between on-policy and uniform sampling increases as the number of states increases.
8.8
No explicit test result yet.
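As a starting point, here is a sketch of the random-task generator described for Figure 8.8 (two actions per state, b equally likely next states per action, 0.1 probability of termination on every transition, expected rewards from N(0, 1)); the function names and the zero reward on termination are assumptions.

```python
import random

def make_random_task(n_states=10_000, b=3, seed=0):
    """Random task in the style of Figure 8.8 (sketch): two actions per state,
    each action leads to one of b equally likely next states; expected rewards
    are drawn from N(0, 1). Names and interface here are assumptions."""
    rng = random.Random(seed)
    successors = {}   # successors[(s, a)] = list of b possible next states
    rewards = {}      # rewards[(s, a, s2)] = expected reward
    for s in range(n_states):
        for a in range(2):
            nxt = [rng.randrange(n_states) for _ in range(b)]
            successors[(s, a)] = nxt
            for s2 in nxt:
                rewards[(s, a, s2)] = rng.gauss(0.0, 1.0)
    return successors, rewards

def step(successors, rewards, s, a, rng, p_terminate=0.1):
    """Sample one transition: 0.1 probability of ending the episode (assumed
    zero reward on termination), otherwise a uniformly chosen successor."""
    if rng.random() < p_terminate:
        return 0.0, None
    s2 = rng.choice(successors[(s, a)])
    return rewards[(s, a, s2)], s2
```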