# Chapter 02
2.1
- Suppose a bandit with 10 arms. The true values of the arms are generated from a standard normal distribution N(0, 1).
- Suppose the true values of the arms are V = {v0, v1, …, v9}.
- Suppose that among all 10 true values, M are positive; denote them PV = {pv0, pv1, …, pv(M-1)}.
- Suppose the maximum value of V is Vmax.
For the greedy algorithm with ε = 0, the agent can get trapped at a locally optimal arm: the value it finally settles on is essentially a random member of PV, since greedy keeps exploiting the first arm whose estimate turns positive and never explores the rest.
For algorithms with ε > 0, the agent keeps exploring and eventually finds the globally optimal value, i.e. Vmax.
The expected improvement is therefore E[Vmax] - E[a random member of PV]. I don't know how to solve this analytically, but it can be estimated numerically, as in the sketch below.
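This is not part of the original answer, but the quantity can be estimated with a quick Monte Carlo sketch (assumptions: the 10 true values are drawn i.i.d. from N(0, 1), and the greedy agent ends up on a uniformly random positive-valued arm, as argued above):

```python
import numpy as np

rng = np.random.default_rng(0)

def improvement_sample(n_arms=10):
    v = rng.normal(0.0, 1.0, n_arms)          # true values of the arms, ~ N(0, 1)
    positives = v[v > 0]                      # the set PV
    if positives.size == 0:
        return None                           # no positive arm: the scenario above does not apply
    return v.max() - rng.choice(positives)    # Vmax minus a random member of PV

samples = [s for s in (improvement_sample() for _ in range(100_000)) if s is not None]
print("estimated E[Vmax] - E[random member of PV] ~", round(float(np.mean(samples)), 3))
```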
The experimental results are shown below:
- #arms = 10, true values generated from the standard normal distribution
- reward = (true value) + (standard normal noise)
- #samples = 1000
- #epochs = 8
- red line: ε = 0.1
- green line: ε = 0.01
- blue line: ε = 0
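The original experiment code is not included in this section; the following is a minimal sketch of the setup described above (10-armed testbed, sample-average estimates, ε-greedy action selection). The function and parameter names are my own, and the plot is replaced by a printed summary.

```python
import numpy as np

def run_bandit(eps, n_arms=10, n_steps=1000, n_runs=8, seed=0):
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(n_steps)
    for _ in range(n_runs):
        true_values = rng.normal(0.0, 1.0, n_arms)   # true arm values ~ N(0, 1)
        q = np.zeros(n_arms)                         # sample-average estimates
        counts = np.zeros(n_arms)
        for t in range(n_steps):
            if rng.random() < eps:
                a = int(rng.integers(n_arms))        # explore: random arm
            else:
                a = int(np.argmax(q))                # exploit: current best estimate
            r = true_values[a] + rng.normal()        # reward = true value + unit-variance noise
            counts[a] += 1
            q[a] += (r - q[a]) / counts[a]           # incremental sample-average update
            avg_reward[t] += r
    return avg_reward / n_runs

for eps in (0.1, 0.01, 0.0):
    curve = run_bandit(eps)
    print(f"eps={eps}: average reward over the last 100 steps = {curve[-100:].mean():.3f}")
```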
2.2
Refer to the greedy model code.
2.3
Qk+1 = Qk + αk+1 * (Rk - Qk)
= αk+1 * Rk + (1 - αk+1) * Qk
= αk+1 * Rk + (1 - αk+1) * (αk * Rk-1 + (1 - αk) * Qk-1)
= αk+1 * Rk + αk * (1 - αk+1) * Rk-1 + … + α1 * (1 - αk+1) * (1 - αk) * … * (1 - α2) * R0 + (1 - αk+1) * (1 - αk) * … * (1 - α1) * Q0
So each reward Ri is weighted by αi+1 times the (1 - αj) factors of all later updates, and the initial estimate Q0 is weighted by the product of all the (1 - αj) factors.
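As a quick check of this expansion (my own sketch, using an arbitrary step-size sequence αk = 1/k), the recursive and expanded forms can be compared numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 20
rewards = rng.normal(size=k)                  # R0 .. R(k-1)
alphas = [1.0 / (i + 1) for i in range(k)]    # alpha_1 .. alpha_k, here 1/k as an example
q0 = 0.0

# Recursive form: Q(i+1) = Q(i) + alpha_(i+1) * (R(i) - Q(i))
q = q0
for i in range(k):
    q += alphas[i] * (rewards[i] - q)

# Expanded form: each R(i) weighted by alpha_(i+1) times the later (1 - alpha) factors,
# plus Q0 weighted by the product of all (1 - alpha) factors.
expanded = q0 * np.prod([1 - a for a in alphas])
for i in range(k):
    expanded += alphas[i] * rewards[i] * np.prod([1 - a for a in alphas[i + 1:]])

print(q, expanded)  # the two values agree up to floating-point error
```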
2.4
Refer to the 2.4 code.
The result is shown below:
2.5
In the case described in the question, Qinitial >> (true values).
At the very beginning, whichever arm the agent tries, its Qk is pulled down toward the true value, so every arm has roughly the same chance of being tried. The curve therefore starts with a low hit probability.
After a round of trials, the best arm has been identified and tried, so an explicit upward jump is observed. But the more the promising arm is chosen, the further its Qk is pulled down, so the curve then drops sharply.
At the same time, the relatively bad choices get a chance to be chosen again. After more rounds of trials the effect of the initial values disappears, the best choice is recognized, and the curve climbs slowly.
Injecting prior knowledge of the task into the initial values can therefore have a relatively large effect on learning; a small sketch of this setup follows.
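A minimal sketch of this optimistic-initial-value behaviour, under my own assumptions (10 arms, purely greedy selection, constant step size α = 0.1, Qinitial = 5), printing the hit rate early versus late in learning:

```python
import numpy as np

def optimistic_run(q_init=5.0, n_arms=10, n_steps=1000, n_runs=200, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    hits = np.zeros(n_steps)  # fraction of runs picking the truly best arm at each step
    for _ in range(n_runs):
        true_values = rng.normal(0.0, 1.0, n_arms)
        best = int(np.argmax(true_values))
        q = np.full(n_arms, q_init)        # optimistic initial estimates, Qinitial >> true values
        for t in range(n_steps):
            a = int(np.argmax(q))          # purely greedy; optimism alone drives the exploration
            r = true_values[a] + rng.normal()
            q[a] += alpha * (r - q[a])     # constant step size pulls Q toward the true value
            hits[t] += (a == best)
    return hits / n_runs

curve = optimistic_run()
print("hit rate, steps 1-10:    ", curve[:10].round(2))
print("hit rate, steps 991-1000:", curve[-10:].round(2))
```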
2.6
Suppose the probability that action 1 is chosen is p, and each case occurs with probability 1/2. The expected reward is:
p * (0.1 + 0.9) / 2 + (1 - p) * (0.2 + 0.8) / 2 = 0.5.
The expected reward is 0.5 regardless of p, so no specific strategy seems to make a difference.
If the agent is told which case it faces before taking an action, it can learn the task as two independent bandit problems and do better, as in the sketch below.
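A small worked sketch of this argument, assuming case A has action values (0.1, 0.2), case B has (0.9, 0.8), and each case occurs with probability 1/2:

```python
# Action values per case: case A -> (action1, action2) = (0.1, 0.2), case B -> (0.9, 0.8).
cases = {"A": (0.1, 0.2), "B": (0.9, 0.8)}

# Case unknown: any fixed probability p of choosing action 1 gives the same expected reward.
for p in (0.0, 0.5, 1.0):
    expected = 0.5 * (p * cases["A"][0] + (1 - p) * cases["A"][1]) \
             + 0.5 * (p * cases["B"][0] + (1 - p) * cases["B"][1])
    print(f"case unknown, p={p}: expected reward = {expected:.2f}")   # always 0.50

# Case known: treat it as two independent bandits and pick the best action in each case.
best_per_case = {name: max(values) for name, values in cases.items()}
print(f"case known: best expected reward = {0.5 * sum(best_per_case.values()):.2f}")  # 0.55
```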