Chapter 8 - Planning and Learning with Tabular Methods

So far we have seen example problems where we have a full model of the environment in the form of the probability transition function, as well as environments where we can only sample trajectories. In this chapter we integrate model-based and model-free techniques and show how they can be intermixed.

8.1 Models and Planning

A model is anything that an agent can use to predict how the environment will respond to its actions. If the model provides a full description of all possible transitions and their probabilities, it is called a distribution model, as opposed to a sample model, which can only generate one of those possibilities according to the correct probability distribution. In dynamic programming we used a distribution model, while for certain example problems such as blackjack we only had a sample model.
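
To make the distinction concrete, here is a minimal sketch in Julia of the two kinds of model for a toy coin-flip transition; the function names and the toy outcomes are purely illustrative and are not used elsewhere in this notebook.

#hypothetical interfaces contrasting the two kinds of model (illustrative only)
#a distribution model returns every possible outcome of (s, a) with its probability
function distribution_model_example(s, a)
    #two possible successor states, each with probability 0.5: (probability, next state, reward)
    [(0.5, :heads_state, 1.0), (0.5, :tails_state, 0.0)]
end

#a sample model generates a single outcome according to that same probability distribution
function sample_model_example(s, a)
    outcomes = distribution_model_example(s, a)
    u = rand()
    cumulative = 0.0
    for (p, s′, r) in outcomes
        cumulative += p
        u <= cumulative && return (s′, r)
    end
    return outcomes[end][2:3] #guard against floating point round-off
end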

A model can be used to create simulated experience of the environment, such as a trajectory. The common thread across all of these techniques is computing a value function in order to improve a policy, using some update process applied to data, for example data collected from simulated experience. For the learning methods considered so far, we have assumed that the trajectory data is generated by the environment itself, while in planning methods this experience comes instead from a model. However, the learning techniques largely still apply to planning since the nature of the data is the same. Consider a planning method analogous to Q-learning called random-sample one-step tabular Q-planning. This technique applies the Q-learning update to a transition sampled from a model. Instead of interacting with the environment in an episodic or continuing task, it simply selects a state-action pair at random and observes the resulting transition from the model. Just like Q-learning, this method converges to the optimal policy under the assumption that all state-action pairs are visited an infinite number of times, but the policy will only be optimal for the model of the environment.
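
As a rough sketch of the idea (distinct from the Dyna-Q implementation later in this notebook), the loop below applies the Q-learning update $Q(S,A) \leftarrow Q(S,A) + \alpha \left [ R + \gamma \max_a Q(S^\prime, a) - Q(S,A) \right ]$ to transitions drawn from a hypothetical sample_model(i_s, i_a) that returns a reward and next-state index; the Q table follows the same convention used later, with actions along the rows and states along the columns.

#minimal sketch of random-sample one-step tabular Q-planning; sample_model is an assumed placeholder
function q_planning_sketch!(Q::Matrix{Float64}, sample_model; α = 0.1, γ = 0.95, num_updates = 1000)
    num_actions, num_states = size(Q)
    for _ in 1:num_updates
        #select a state-action pair at random rather than by interacting with the environment
        i_s = rand(1:num_states)
        i_a = rand(1:num_actions)
        #query the model for a sampled reward and next-state index
        (r, i_s′) = sample_model(i_s, i_a)
        #apply the one-step tabular Q-learning update to the simulated transition
        Q[i_a, i_s] += α * (r + γ * maximum(Q[:, i_s′]) - Q[i_a, i_s])
    end
    return Q
end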

Performing updates on single transitions highlights another theme of planning methods: they need not involve exhaustive solutions to the whole environment. We can direct the method to specific states of interest, which may be important for problems with very large state spaces.

8.2 Dyna: Integrated Planning, Acting, and Learning

A planning agent can use real experience in at least two ways: 1) it can be used to improve the model to make it a better match for the real environment (model-learning) and 2) it can be used directly to improve the value function using the learning methods from previous chapters (direct reinforcement learning). If a better model is then used to improve the value function, this is also called indirect reinforcement learning.

Indirect methods can make better use of a limited amount of experience, but direct methods are much simpler and are not affected by biases in the design of the model. Dyna-Q includes all the processes of planning, acting, model-learning, and direct RL. The planning method is the random-sample one-step tabular Q-planning described above. The direct RL method is one-step tabular Q-learning. The model-learning method is also table-based and assumes the environment is deterministic. After each transition $S_t,A_t \longrightarrow R_{t+1},S_{t+1}$, the model records in its table entry for $S_t,A_t$ the prediction that $R_{t+1},S_{t+1}$ will deterministically follow. Thus if the model is queried with a state-action pair that has been experienced before, it simply returns the last-observed next state and next reward as its prediction.
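
A minimal sketch of such a deterministic table model is shown below, here using a Dict keyed by state-action index pairs rather than the matrix representation used in the implementation that follows; the names are illustrative only.

#minimal sketch of a deterministic, table-based model (the implementation below uses matrices instead)
#keys are (state index, action index) pairs; values are the last observed (reward, next state index)
const DeterministicModelSketch = Dict{Tuple{Int64, Int64}, Tuple{Float64, Int64}}

#model learning: after each transition record that R, S′ deterministically follows S, A
update_model!(model::DeterministicModelSketch, i_s, i_a, r, i_s′) = (model[(i_s, i_a)] = (r, i_s′))

#model query: return the last observed reward and next state for a previously experienced pair
query_model(model::DeterministicModelSketch, i_s, i_a) = model[(i_s, i_a)]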

During planning, the Q-planning algorithm randomly samples only from state-action pairs that have previously been experienced, so the model is never queried with a pair about which it has no information. The learning and planning portions of the algorithm are connected in that they use the same type of update. The only difference is the source of the experience used.

The collection of real experience and planning could occur simultaneously in these agents, but for a serial implementation it is assumed that the acting, model-learning, and direct RL processes are very fast, while the planning process is the most computation-intensive. Let us assume that after each step of acting, model-learning, and direct RL there is time for $n$ iterations of the Q-planning algorithm. Without the model update and the $n$-loop planning step, this algorithm is identical to one-step tabular Q-learning. An example implementation is below along with an example applying it to a maze environment.

function tabular_dynaQ(mdp::MDP_TD{S, A, F, G, H}, α::T, γ::T; num_episodes = 1000,
    max_steps = 10000, n = 100, qinit = zero(T), ϵinit = one(T)/10,
    Qinit = initialize_state_action_value(mdp; qinit=qinit), πinit = create_ϵ_greedy_policy(Qinit, ϵinit),
    decay_ϵ = false, history_state::S = first(mdp.states),
    save_history = false, update_policy! = (v, ϵ, s) -> make_ϵ_greedy_policy!(v, ϵ),
    history = Dict{Int64, Set{Int64}}(),
    #each column contains the index of the state reached from the state represented by the column index while taking the action represented by the row index; the values are initialized at 0 which represents a transition that has not occurred
    state_transition_map = zeros(Int64, length(mdp.actions), length(mdp.states)),
    #each column contains the reward received from the state represented by the column index while taking the action represented by the row index; state_transition_map can be used to determine whether any of these values have been updated from the zero initialization
    reward_transition_map = zeros(T, length(mdp.actions), length(mdp.states)),
    init_step = 0,
    isoptimal = l -> false) where {S, A, F, G, H, T<:AbstractFloat}

    π = copy(πinit)
    Q = copy(Qinit)
    ϵ = ϵinit
    terminds = findall(mdp.isterm(s) for s in mdp.states)
    Q[:, terminds] .= zero(T)
    num_updates = 0
    vhold = zeros(T, length(mdp.actions))
    #keep track of rewards per step and steps per episode
    rewards = zeros(T, max_steps)
    steps = zeros(Int64, num_episodes)

    #one-step tabular Q-learning update applied to the transition (i_s, i_a) -> r, i_s′
    function updateQ!(Q, i_s, i_a, i_s′, r)
        qmax = maximum(Q[i, i_s′] for i in eachindex(mdp.actions))
        Q[i_a, i_s] += α*(r + γ*qmax - Q[i_a, i_s])
        num_updates += 1
    end

    #random-sample one-step tabular Q-planning restricted to previously experienced state-action pairs
    function q_planning!(Q, history)
        for _ in 1:n
            i_s = rand(keys(history))
            i_a = rand(history[i_s])
            i_s′ = state_transition_map[i_a, i_s]
            r = reward_transition_map[i_a, i_s]
            updateQ!(Q, i_s, i_a, i_s′, r)
        end
    end

    ep = 1
    step = 1
    optimal_path = false
    while (step <= max_steps) && (ep <= num_episodes) && !optimal_path
        s = mdp.state_init()
        l = 0
        while !mdp.isterm(s) && step <= max_steps
            (i_s, i_s′, r, s′, a, i_a) = takestep(mdp, π, s)
            #direct RL: apply the Q-learning update to the real transition
            updateQ!(Q, i_s, i_a, i_s′, r)
            #model learning: record the deterministic next state and reward for this state-action pair
            state_transition_map[i_a, i_s] = i_s′
            reward_transition_map[i_a, i_s] = r
            #save the state-action pair visited
            if haskey(history, i_s)
                push!(history[i_s], i_a)
            else
                history[i_s] = Set([i_a])
            end
            #planning: n Q-planning updates using the model
            q_planning!(Q, history)
            π = create_ϵ_greedy_policy(Q, ϵ; π = π)
            s = s′
            rewards[step] = r
            step += 1
            l += 1
        end
        steps[ep] = l
        optimal_path = isoptimal(l)
        ep += 1
    end
    return Q, steps[1:ep-1], rewards[1:step-1], history, state_transition_map, reward_transition_map, num_updates
end

Example 8.1: Dyna Maze

begin
    const maze_walls = Set(GridworldState(x, y) for (x, y) in [(3, 3), (3, 4), (3, 5), (6, 2), (8, 4), (8, 5), (8, 6)])
    const dyna_maze = make_gridworld(;xmax = 9, ymax = 6, sterm = GridworldState(9, 6), iswall = s -> in(s, maze_walls))
end
function make_dyna_maze(scale)
    scale_value(v) = ceil(Int64, v*scale)
    xmax = scale_value(9)
    ymax = scale_value(6)
    sterm = GridworldState(xmax, ymax)
    wall1 = Set(GridworldState(scale_value(3), y) for y in scale_value(3):scale_value(5))
    wall2 = Set([GridworldState(scale_value(6), scale_value(3) - 1)])
    wall3 = Set(GridworldState(xmax - 1, y) for y in scale_value(4):ymax)
    walls = union(wall1, wall2, wall3)
    iswall(s) = in(s, walls)
    start = GridworldState(1, scale_value(4))
    optimal_length = xmax - 1 + (ymax - scale_value(3) + 1) + start.y - scale_value(3) + 1
    (maze = make_gridworld(;start = start, xmax=xmax, ymax=ymax, sterm = sterm, iswall = iswall), iswall = iswall, optimal_length = optimal_length)
end
@htl("""
<div style = "background-color: white; color: black; display: flex; align-items: center; justify-content: center;">
<div>$(plot_path(dyna_maze; title = "Random policy path example in Dyna Maze", max_steps = 10000, iswall = s -> in(s, maze_walls)))</div>
<div>$rook_action_display</div>
</div>
""")

Consider the simple maze shown above, where the agent can take four actions that move it deterministically into the neighboring square unless that square is blocked by a wall, marked with a W in the diagram, in which case the agent remains in its original state. Reward is zero on all transitions except those into the goal state, on which it is +1. After reaching the goal, the agent returns to the start to begin a new episode. This is a discounted episodic task with $\gamma = 0.95$.


Figure 8.2:

Learning curves for Dyna-Q agents in the maze, varying the number of planning steps per real step

figure8_2()

Figure 8.3:

Policies found by planning and nonplanning Dyna-Q agents after a given number of training episodes. The arrows indicate the greedy action with respect to the learned state-action value function. Without planning, the only states with an updated policy are those within a number of steps of the goal equal to the number of episodes experienced so far in training. For the planning agent, after one episode at least one value will have been updated, so during future planning steps that information can propagate to the other states through bootstrapping, since we are free to sample transitions from states into ones that already have an updated value.

Figure 8.3 panels: Without planning (n = 0) and With planning (n = 50)

function figure8_2(;num_episodes = 50, α = 0.1f0, nlist = [0, 5, 50], γ = 0.95f0, samples = 5)
    traces = [begin
            step_avg = zeros(Int64, num_episodes)
            for s in 1:samples
                Random.seed!(s)
                (Q, steps, rewards, _, _) = tabular_dynaQ(dyna_maze, α, γ; num_episodes = num_episodes, n = n)
                step_avg .+= steps
            end
            scatter(x = 2:num_episodes, y = step_avg[2:end] ./ samples, name = "$n planning steps")
        end
        for n in nlist]
    plot(traces, Layout(xaxis_title = "Episodes", yaxis_title = "Steps per episode"))
end
function figure8_3(num_episodes; α = 0.1f0, nlist = [0, 50], γ = 0.95f0)
    (Q1, _) = tabular_dynaQ(dyna_maze, α, γ; num_episodes = num_episodes, n = first(nlist))
    d1 = show_grid_policy(dyna_maze, create_greedy_policy(Q1), "test")
    (Q2, _) = tabular_dynaQ(dyna_maze, α, γ; num_episodes = num_episodes, n = last(nlist))
    d2 = show_grid_policy(dyna_maze, create_greedy_policy(Q2), "test")
    @htl("""
    <div style = "display: flex;">
        <div>Without planning (n = $(first(nlist)))$d1</div>
        <div>With planning (n = $(last(nlist)))$d2</div>
    </div>
    """)
end

Because planning proceeds incrementally, it is trivial to intermix planning and acting: learning from the model can inform interaction with the environment in parallel, and any new information from the environment can be used to update the model as soon as it is received.


Exercise 8.1

The nonplanning method looks particularly poor in Figure 8.3 because it is a one-step method; a method using multi-step bootstrapping would do better. Do you think one of the multi-step bootstrapping methods from Chapter 7 could do as well as the Dyna method? Explain why or why not.

The n = 50 agent can learn a policy that covers nearly the entire maze during the second episode. In the extreme case of a multi-step method we would attempt to solve the maze with Monte Carlo sampling, in which case a single episode would produce action-value updates for every state visited along the random trajectory. However, those estimates would be high-variance estimates of the value of the randomly initialized policy. In contrast, after one random episode the Dyna method has observed most or all of the transitions in the maze. During the long planning phase its Q updates can, for large enough n, converge toward the optimal action values using bootstrapping and the single reward from reaching the goal. As long as some transition close to the goal is sampled, that information propagates to the rest of the states, and each update simultaneously improves the ϵ-greedy policy.

Multi-step bootstrapping can extend updates back along the trajectory some distance, but in the extreme case we are just sampling returns from the random policy without any bootstrapping, or we bootstrap to a limited degree close to the goal and still lack information farther away from it. Since this environment is deterministic, having the sampled transitions is equivalent to having a full model of the environment; they could be used explicitly with value iteration to obtain the Q function as well. The planning step is effectively performing this computation, but only over the known transitions, and over time it is focused on the states visited by the optimal or nearly optimal policy. No n-step method can take advantage of previously observed transitions this well. It should be noted, however, that Dyna-Q only works this well because the environment is deterministic, which means the model is known to be correct for every transition already observed.
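
To illustrate the claim that the observed transitions could be used explicitly with value iteration, the sketch below sweeps repeatedly over only the entries recorded in the state_transition_map and reward_transition_map tables produced by tabular_dynaQ above; it is an illustration of the equivalence, not part of the chapter's algorithms.

#sketch of Q value iteration restricted to the transitions recorded in the deterministic model tables
#assumes the state_transition_map and reward_transition_map produced by tabular_dynaQ above,
#where an entry of 0 in state_transition_map marks a transition that was never observed
function q_value_iteration_sketch(state_transition_map::Matrix{Int64}, reward_transition_map::Matrix{T}; γ = 0.95f0, tol = 1f-6, max_sweeps = 10_000) where T <: AbstractFloat
    Q = zeros(T, size(state_transition_map))
    for _ in 1:max_sweeps
        delta = zero(T)
        for i_s in axes(Q, 2), i_a in axes(Q, 1)
            i_s′ = state_transition_map[i_a, i_s]
            i_s′ == 0 && continue #skip state-action pairs that were never experienced
            q_new = reward_transition_map[i_a, i_s] + γ * maximum(Q[:, i_s′])
            delta = max(delta, abs(q_new - Q[i_a, i_s]))
            Q[i_a, i_s] = q_new
        end
        delta < tol && break
    end
    return Q
end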


8.3 When the Model Is Wrong

In the maze example the model started out empty and then was only filled in with correct information. In general, we cannot expect to be so fortunate. Models may be incorrect because the environment is stochastic and only a limited number of samples have been observed, or simply because the environment has changed. When the model is incorrect, the planning process will likely compute a suboptimal policy.

In some cases, the suboptimal policy computed by planning quickly leads to the discovery and correction of the modeling error. This tends to happen when the model is optimistic in the sense of predicting greater reward or better state transitions than are actually possible. The planned policy attempts to exploit these opportunities and in doing so it discovers that they do not exist.

Example 8.2: Blocking Maze

Consider a maze with a barrier that separates the start from the goal except for an opening on one end. After a set number of time steps this path is blocked and replaced with a longer path open at the other end. Below is an example of the environment with a random trajectory before and after the blocking.

begin
    const block1 = Set(GridworldState(x, 3) for x in 1:8)
    const block2 = Set(GridworldState(x, 3) for x in 2:9)
    const block_maze_base_args = (xmax = 9, ymax = 6, sterm = GridworldState(9, 6), start = GridworldState(4, 1))
    const blocking_maze1 = make_gridworld(;iswall = s -> in(s, block1), block_maze_base_args...)
    const blocking_maze2 = make_gridworld(;iswall = s -> in(s, block2), block_maze_base_args...)
    [plot_path(blocking_maze1, iswall = s -> in(s, block1), max_steps = 2000, title = "Blocking Maze Steps <= 1000") plot_path(blocking_maze2, iswall = s -> in(s, block2), max_steps = 1000, title = "Blocking Maze Steps > 1000")]
end

Figure 8.4:

Average performance of a Dyna-Q agent on the blocking task. After 1000 time steps of training, the environment is switched to the second version. The flat portion of the cumulative reward curve shows the agent getting stuck following the old optimal policy and needing many time steps to find a path through the new environment.

function figure_8_4(; max_steps1 = 1000, max_steps2 = 2000, ntrials = 10, n = 10, α = 0.1f0, ϵ = 0.1f0)
    y = zeros(max_steps1 + max_steps2)
    x = 1:(max_steps1 + max_steps2)
    for _ in 1:ntrials
        (Q, steps, rewards, history, stm, rtm) = tabular_dynaQ(blocking_maze1, α, 0.95f0; ϵinit = ϵ, max_steps = max_steps1, n = n)
        (Q2, steps2, rewards2, _, _) = tabular_dynaQ(blocking_maze2, α, 0.95f0; ϵinit = ϵ, max_steps = max_steps2, n = n, Qinit = Q, history = history, state_transition_map = stm, reward_transition_map = rtm)
        y .+= cumsum([rewards; rewards2])
    end
    plot(scatter(x = x, y = y ./ ntrials))
end