Chapter 8 - Planning and Learning with Tabular Methods
So far we have seen example problems where we have a full model of the environment in the form of the transition probability function, as well as environments from which we can only sample trajectories. In this chapter we integrate model-based and model-free techniques and show how they can be intermixed.
8.1 Models and Planning
A model is anything an agent can use to predict how the environment will respond to its actions. If the model provides a full description of all possible transitions and their probabilities, it is called a distribution model; a sample model, by contrast, generates just one of those possibilities, drawn according to the correct probability distribution. In dynamic programming we used a distribution model, while for certain example problems such as blackjack we only had a sample model.
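To make the distinction concrete, here is a minimal sketch (not from the text above) contrasting the two kinds of model for a single state-action pair; the states, actions, and probabilities are made up for illustration.

```julia
# A distribution model enumerates every possible outcome of a state-action pair
# with its probability; a sample model only draws one outcome according to
# those probabilities.
distribution_model = Dict(
    (1, :right) => [(0.8, 2, 0.0), (0.2, 1, 0.0)],  # (probability, next state, reward)
)

# A sample model built on top of the distribution model: return a single
# (next state, reward) pair drawn with the correct probability.
function sample_model(s, a)
    outcomes = distribution_model[(s, a)]
    u, cum = rand(), 0.0
    for (p, s′, r) in outcomes
        cum += p
        u <= cum && return (s′, r)
    end
    return outcomes[end][2], outcomes[end][3]   # guard against round-off
end

sample_model(1, :right)   # either (2, 0.0) or (1, 0.0)
```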
A model can be used to create simulated experience of the environment, such as a trajectory. The common thread across all the techniques is the computation of a value function to improve a policy, using some update process applied to experience, for example the data collected as simulated experience. For the learning methods considered so far we have assumed that the trajectories are generated by the environment itself, whereas in planning methods this experience comes instead from a model. The learning techniques still largely apply to planning, however, since the nature of the data is the same. Consider a planning method analogous to Q-learning called random-sample one-step tabular Q-planning, which applies the Q-learning update to a transition sampled from a model. Instead of interacting with the environment in an episodic or continuing task, this technique simply selects a state-action pair at random and observes the transition the model predicts. Just like Q-learning, this method converges to the optimal policy under the assumption that all state-action pairs are updated an infinite number of times, but the policy will only be optimal with respect to the model of the environment.
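As a sketch (not the notebook's code), one iteration of random-sample one-step tabular Q-planning could look like the following; the names Q, model, and actions, and the step-size and discount values, are hypothetical placeholders.

```julia
# One iteration of random-sample one-step tabular Q-planning (sketch).
# `Q` maps (state, action) to a value estimate and `model` maps (state, action)
# to the predicted (reward, next state).
function q_planning_step!(Q, model, actions; α = 0.1, γ = 0.95)
    s, a = rand(collect(keys(model)))                 # random previously seen pair
    r, s′ = model[(s, a)]                             # model-predicted transition
    target = r + γ * maximum(Q[(s′, a′)] for a′ in actions)
    Q[(s, a)] += α * (target - Q[(s, a)])             # Q-learning update
    return Q
end

# tiny usage example with two states and two actions
actions = [:left, :right]
Q = Dict((s, a) => 0.0 for s in 1:2, a in actions)
model = Dict((1, :right) => (1.0, 2))
q_planning_step!(Q, model, actions)
```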
Performing updates on single transitions highlights another theme of planning methods: they need not involve exhaustive solutions to the whole environment. We can direct the method toward specific states of interest, which may be important for problems with very large state spaces.
8.2 Dyna: Integrated Planning, Acting, and Learning
A planning agent can use real experience in at least two ways: 1) it can be used to improve the model to make it a better match for the real environment (model-learning) and 2) it can be used directly to improve the value function using the previous learning methods (direct reinforcement learning). If a better model is then used to improve the value function this is also called indirect reinforcement learning.
Indirect methods can make better use of a limited amount of experience, but direct methods are much simpler and are not affected by the biases in the design of the model. Dyna-Q includes all the processes of planning, acting, model-learning, and direct RL. The planning method is the random-sample one-step tabular Q-planning described above. The direct RL method is one-step tabular Q-learning. The model-learning method is also table-based and assumes the environment is deterministic. After each transition $S_t,A_t \longrightarrow R_{t+1},S_{t+1}$, the model records in its table entry for $S_t,A_t$ the prediction that $R_{t+1},S_{t+1}$ will deterministically follow. Thus if the model is queried with a state-action pair that has been experienced before, it simply returns the last-observed next state and next reward as its prediction.
During planning, the Q-planning algorithm randomly samples only from state-action pairs that have previously been experienced, so the model is never queried with a pair about which it has no information. The learning and planning portions of the algorithm are connected in that they use the same type of update. The only difference is the source of the experience used.
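A minimal sketch of the deterministic table model and the restriction to previously experienced pairs is shown below; the names and types are hypothetical, not the notebook's.

```julia
# The model is just a table: (state, action) => (reward, next state).
# Because the environment is assumed deterministic, a new observation of the
# same pair simply overwrites the old entry.
model = Dict{Tuple{Int,Symbol},Tuple{Float64,Int}}()

update_model!(model, s, a, r, s′) = (model[(s, a)] = (r, s′); model)

# Planning only queries pairs that have actually been experienced, so the
# model is never asked about a pair for which it has no information.
function sample_experienced(model)
    s, a = rand(collect(keys(model)))
    r, s′ = model[(s, a)]
    return s, a, r, s′
end

update_model!(model, 1, :right, 0.0, 2)
sample_experienced(model)   # (1, :right, 0.0, 2)
```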
The collection of real experience and planning could occur simultaneously in these agents, but for a serial implementation it is assumed that the acting, model-learning, and direct RL processes are very fast while the planning process is the most computation-intensive. Let us assume that after each step of acting, model-learning, and direct RL there is time for $n$ iterations of the Q-planning algorithm. Without the model update and the $n$-loop planning step, this algorithm is identical to one-step tabular Q-learning. An example implementation is below along with an example applying it to a maze environment.
(Notebook code cell: definition of tabular_dynaQ.)
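Since the code of the tabular_dynaQ cell is not rendered in this export, here is a minimal sketch of the algorithm under the assumptions above. It is not the notebook's own implementation: the name dyna_q, the step(s, a) interface returning (reward, next state, done), and all default parameters are hypothetical.

```julia
# Tabular Dyna-Q sketch: act ε-greedily, apply the Q-learning update to the real
# transition, record it in the deterministic table model, then perform n
# Q-planning updates on randomly sampled experienced pairs. The model also
# stores whether the transition was terminal so planning never bootstraps
# past the goal.
function dyna_q(step, start, states, actions; n = 10, episodes = 50,
                α = 0.1, γ = 0.95, ϵ = 0.1)
    Q = Dict((s, a) => 0.0 for s in states, a in actions)
    model = Dict{Any,Any}()                       # (s, a) => (r, s′, done)
    greedy(s) = actions[argmax([Q[(s, a)] for a in actions])]
    behavior(s) = rand() < ϵ ? rand(actions) : greedy(s)
    backup(r, s′, done) = r + γ * (done ? 0.0 : maximum(Q[(s′, a′)] for a′ in actions))
    for _ in 1:episodes
        s, done = start, false
        while !done
            a = behavior(s)                                     # acting
            r, s′, done = step(s, a)                            # real experience
            Q[(s, a)] += α * (backup(r, s′, done) - Q[(s, a)])  # direct RL
            model[(s, a)] = (r, s′, done)                       # model learning
            for _ in 1:n                                        # planning
                ps, pa = rand(collect(keys(model)))
                pr, ps′, pdone = model[(ps, pa)]
                Q[(ps, pa)] += α * (backup(pr, ps′, pdone) - Q[(ps, pa)])
            end
            s = s′
        end
    end
    return Q
end
```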
Example 8.1: Dyna Maze
(Interactive maze diagram: a gridworld with walls marked W, a start state, and a goal state.)
Consider the simple maze shown above, where the agent can take four actions that move it deterministically into the neighboring square unless it is blocked by a wall, represented by a W in the diagram, in which case the agent remains in its original state. Reward is zero on all transitions except those into the goal state, on which it is +1. After reaching the goal the agent returns to the start to begin a new episode. This is a discounted episodic task with $\gamma = 0.95$.
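The sketch below shows one way the maze dynamics just described could be encoded as a step function compatible with the dyna_q sketch above; the grid size, wall cells, start, and goal are made up for illustration and are not the notebook's actual layout.

```julia
# Deterministic gridworld step function: four actions, moves into a wall or off
# the grid leave the state unchanged, and only the transition into the goal is
# rewarded (+1), which also ends the episode.
const MOVES = Dict(:up => (0, 1), :down => (0, -1), :left => (-1, 0), :right => (1, 0))

function make_maze_step(width, height, walls, goal)
    function step(s, a)
        dx, dy = MOVES[a]
        s′ = (clamp(s[1] + dx, 1, width), clamp(s[2] + dy, 1, height))
        s′ in walls && (s′ = s)              # blocked: stay in place
        r = s′ == goal ? 1.0 : 0.0
        return r, s′, s′ == goal
    end
    return step
end

# hypothetical 9×6 layout; run Dyna-Q with n = 50 planning updates per step
walls = Set([(3, 2), (3, 3), (3, 4)])
maze_step = make_maze_step(9, 6, walls, (9, 6))
states = vec([(x, y) for x in 1:9, y in 1:6])
Q = dyna_q(maze_step, (1, 3), states, [:up, :down, :left, :right]; n = 50)
```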
Figure 8.2: Learning curves for a Dyna-Q agent in the maze
Figure 8.3: Policies found by planning and nonplanning Dyna-Q agents halfway through the second episode
Because planning proceeds incrementally, it is trivial to intermix planning and acting: learning from the model can inform the ongoing interaction with the environment, and any new information from the environment can be used to update the model as soon as it is received.
Exercise 8.1
The nonplanning method looks particularly poor in Figure 8.3 because it is a one-step method; a method using multi-step bootstrapping would do better. Do you think one of the multi-step bootstrapping methods from Chapter 7 could do as well as the Dyna method? Explain why or why not.
For the n = 50 agent, it can learn a policy that covers nearly the entire maze during the second episode. In the extreme case of a multi-step method we would attempt to solve the maze with Monte Carlo sampling, in which case after a single episode we would have action-value updates for every state visited along the random trajectory. However, these action-value estimates would be high-variance estimates of the randomly initialized policy. In contrast, after one random episode the Dyna method has observed most or all of the transitions in the maze. During the long planning phase its Q updates can, given a large enough n, effectively converge to the optimal policy by bootstrapping from the single reward obtained at the goal. As long as some sampled pair lies close to the goal, that information propagates back through the rest of the states, and every update simultaneously improves the ϵ-greedy policy.

With multi-step bootstrapping we can extend updates back along the trajectory a certain distance, but in the extreme case we just sample values from the random policy without any bootstrapping, or we bootstrap to a limited degree close to the goal and still lack information further away from it. Since this environment is deterministic, having the sampled transitions is equivalent to having a full model of the environment; this could be used explicitly with value iteration to obtain the Q function as well. The planning step is effectively performing this computation, but only over the known transitions, and over time it focuses on the states visited by the optimal or nearly optimal policy. No n-step method can take advantage of previously observed transitions this well, but it should be noted that Dyna-Q only works this well because the environment is deterministic, so the model is guaranteed to be correct for the transitions already observed.
8.3 When the Model Is Wrong
In the maze example the model started out empty and then was only filled in with correct information. In general, we cannot expect to be so fortunate. Models may be incorrect because the environment is stochastic and only a limited number of samples have been observed, or simply because the environment has changed. When the model is incorrect, the planning process will likely compute a suboptimal policy.
In some cases, the suboptimal policy computed by planning quickly leads to the discovery and correction of the modeling error. This tends to happen when the model is optimistic in the sense of predicting greater reward or better state transitions than are actually possible. The planned policy attempts to exploit these apparent opportunities and in doing so discovers that they do not exist.
Example 8.2: Blocking Maze
Consider a maze with a barrier that separates the start from the goal except for an opening on one end. After a set number of time steps this path is blocked and replaced with a longer path open at the other end. Below is an example of the environment with a random trajectory before and after the blocking.
Figure 8.4: Average performance of a Dyna-Q agent on a blocking task. After 1000 episodes of training the environment is switched to the second version. The flat portion of the graph shows the agent getting stuck following the old optimal policy and needing many time steps to learn the new environment.
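The sketch below shows one way such a nonstationary environment could be set up, reusing the hypothetical make_maze_step helper from the earlier sketch; the wall layouts and the switch time are made up for illustration rather than taken from the notebook's experiment.

```julia
# Blocking-maze style environment: wrap two wall layouts and swap which one is
# active after a fixed number of real time steps.
mutable struct BlockingMaze
    step_count::Int
    switch_at::Int
    early_step      # dynamics before the switch (gap at the right end)
    late_step       # dynamics after the switch (gap at the left end)
end

function (env::BlockingMaze)(s, a)
    env.step_count += 1
    active = env.step_count <= env.switch_at ? env.early_step : env.late_step
    return active(s, a)     # same (reward, next state, done) interface as before
end

early = make_maze_step(9, 6, Set((x, 3) for x in 1:8), (9, 6))
late  = make_maze_step(9, 6, Set((x, 3) for x in 2:9), (9, 6))
env = BlockingMaze(0, 1000, early, late)   # arbitrary switch point for illustration
```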