
6.2 Advantages of TD Prediction Methods

TD methods can learn before an episode terminates, which is an advantage in environments with very long episodes. In continuing problems, Monte Carlo methods may not be suitable at all because there is no termination condition. Furthermore, in off-policy learning, Monte Carlo methods must ignore returns whenever exploratory actions (ones never taken by the target policy) are taken later in the episode, whereas TD methods can learn from individual non-exploratory steps regardless of what happens later on.

For any fixed policy $\pi$, TD(0) has been proved to converge to $v_\pi$: in the mean for a constant step-size parameter if it is sufficiently small, and with probability 1 if the step-size parameter decreases according to the usual stochastic approximation conditions (2.7). Since both TD and Monte Carlo methods converge, a natural question is which converges faster, that is, which makes more efficient use of limited data. This question has no mathematical answer so far, nor is it clear how to even pose it formally; however, in practice TD methods have usually been found to converge faster than constant-α MC methods on stochastic tasks, as illustrated in Example 6.2.
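To make the "learn before the episode ends" point concrete, here is a minimal sketch of tabular TD(0) prediction. It assumes a five-state random walk in the spirit of Example 6.2 (start in the middle, move left or right with equal probability, reward 1 only on exiting to the right); the function name `td0_random_walk` and all parameter defaults are illustrative choices, not taken from the notebook.

```python
import random

def td0_random_walk(num_states=5, episodes=1000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) for the equiprobable random-walk policy.

    States 1..num_states are non-terminal; 0 and num_states + 1 are terminal.
    The reward is 1 on reaching the right terminal state and 0 otherwise.
    """
    rng = random.Random(seed)
    values = [0.5] * (num_states + 2)
    values[0] = values[-1] = 0.0  # terminal states have value 0
    for _ in range(episodes):
        state = (num_states + 1) // 2  # start in the middle state
        while 0 < state < num_states + 1:
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == num_states + 1 else 0.0
            # The estimate is updated right after each step,
            # without waiting for the episode to terminate.
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
            state = next_state
    return values[1:-1]

estimates = td0_random_walk()
true_values = [i / 6 for i in range(1, 6)]  # exact values for 5 states
```

With a constant step size the estimates keep fluctuating around the true values rather than settling exactly, which is consistent with the convergence-in-the-mean statement above.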


6.6 Expected Sarsa

Consider the learning algorithm that is just like Q-learning except that instead of the maximum over next state-action pairs it uses the expected value, taking into account how likely each action is under the current policy. That is, consider the algorithm with the update rule

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left [ R_{t+1} + \gamma \text{E}_\pi [Q(S_{t+1}, A_{t+1})|S_{t+1}] - Q(S_t, A_t) \right ]$$

$$= Q(S_t, A_t) + \alpha \left [ R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})Q(S_{t+1}, a) - Q(S_t, A_t) \right ]$$

but that otherwise follows the scheme of Q-learning. Given the next state, $S_{t+1}$, this algorithm moves deterministically in the same direction as Sarsa moves in expectation, and accordingly it is called Expected Sarsa. Although more computationally complex than Sarsa, it eliminates the variance due to the random selection of $A_{t+1}$.

In general, Expected Sarsa might use a policy different from the target policy π to generate behavior, in which case it becomes an off-policy algorithm. For example, suppose π is the greedy policy while behavior is more exploratory; then Expected Sarsa is exactly Q-learning. In this sense Expected Sarsa subsumes and generalizes Q-learning while reliably improving over Sarsa.
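The relationship between the two targets can be checked numerically. The sketch below assumes an ϵ-greedy target policy with ties broken toward the first maximal action; the helper names (`epsilon_greedy_probs`, `expected_sarsa_target`, `q_learning_target`) and the sample values are hypothetical and only illustrate the update rule above.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy (first max wins ties)."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs

def expected_sarsa_target(reward, next_q, gamma, epsilon):
    """R + gamma * sum_a pi(a|S') Q(S', a) for an epsilon-greedy pi."""
    probs = epsilon_greedy_probs(next_q, epsilon)
    return reward + gamma * sum(p * q for p, q in zip(probs, next_q))

def q_learning_target(reward, next_q, gamma):
    """R + gamma * max_a Q(S', a)."""
    return reward + gamma * max(next_q)

next_q = [1.0, 3.0, 2.0]  # illustrative action values in the next state
greedy_case = expected_sarsa_target(-1.0, next_q, gamma=0.9, epsilon=0.0)
soft_case = expected_sarsa_target(-1.0, next_q, gamma=0.9, epsilon=0.3)
```

With `epsilon=0.0` the target policy is greedy and the Expected Sarsa target coincides with the Q-learning target; with `epsilon > 0` the target is pulled toward the average action value.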


Double Learning Implementation


Consider a square gridworld with side length $l$ in which the reward for each step is -1.2 or 1.0 with equal probability. There is no wind and the allowed moves are just up, down, left, and right. The start is the lower left corner and the finish is the upper right corner. The expected reward for a step is $0.5 \times (-1.2) + 0.5 \times 1.0 = -0.1$, so the optimal policy is to move to the goal as quickly as possible, which takes $(l-1) \times 2$ steps. For a 3x3 grid, this is 4 steps, so $\mathbb{E} \{ G_0 \} = 4 \times (-0.1) = -0.4$.

Because the positive reward is so much larger than the expected value, we might expect a large maximization bias to confuse the training method and favor long episodes whose estimated values are positive. Below are example solutions after thousands of episodes for each of the previously discussed methods. The first solution shown is the correct optimal policy and value function, computed with value iteration.
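The source of the bias can be seen in a single step, without running any training. The sketch below enumerates every outcome exactly; it assumes each action's value is estimated by the mean of a few independent reward draws, and the function names and the choice of 2 or 4 actions are illustrative only. Every action has the same true expected reward of -0.1, so any positive value of the maximum is pure bias.

```python
from itertools import product

REWARDS = (-1.2, 1.0)  # each value occurs with probability 0.5

def expected_reward():
    """True expected one-step reward."""
    return sum(REWARDS) / len(REWARDS)

def expected_max_of_sample_means(num_actions, samples_per_action):
    """Exact E[max_a mean_a] when each action's estimate is the mean of
    `samples_per_action` i.i.d. reward draws, by enumerating all outcomes."""
    total_draws = num_actions * samples_per_action
    outcomes = list(product(REWARDS, repeat=total_draws))
    total = 0.0
    for draws in outcomes:
        means = [
            sum(draws[a * samples_per_action:(a + 1) * samples_per_action])
            / samples_per_action
            for a in range(num_actions)
        ]
        total += max(means)
    return total / len(outcomes)

bias_two_actions = expected_max_of_sample_means(2, 1)
bias_four_actions = expected_max_of_sample_means(4, 1)
```

The maximum over noisy estimates is positive on average even though no action has a positive expected reward, and the effect grows with the number of actions being compared.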


Exercise 6.13

What are the update equations for Double Expected Sarsa with an ϵ-greedy target policy?

For Q-learning the action-value update is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$$

For Expected Sarsa the action-value update is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [ R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})Q(S_{t+1}, a) - Q(S_t, A_t)]$$

For Double Q-learning, the twin action-value updates are:

$$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha [ R_{t+1} + \gamma Q_2(S_{t+1}, \text{argmax}_a Q_1(S_{t+1}, a)) - Q_1(S_t, A_t)]$$

$$Q_2(S_t, A_t) \leftarrow Q_2(S_t, A_t) + \alpha [ R_{t+1} + \gamma Q_1(S_{t+1}, \text{argmax}_a Q_2(S_{t+1}, a)) - Q_2(S_t, A_t)]$$

For Double Expected Sarsa, we keep two action-value estimates as in Double Q-learning, but the bootstrap term is an expected value under each value function's target policy. Here that target policy is $\epsilon$-greedy rather than the greedy policy used in Q-learning. The expectation uses the action probabilities derived from the value function being updated but the action values from the other one:

With probability 0.5:

$$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha [ R_{t+1} + \gamma \sum_a \pi_1(a|S_{t+1}) Q_2(S_{t+1}, a) - Q_1(S_t, A_t)]$$

where $\pi_1$ is $\epsilon$-greedy with respect to $Q_1$.

Otherwise (also with probability 0.5):

$$Q_2(S_t, A_t) \leftarrow Q_2(S_t, A_t) + \alpha [ R_{t+1} + \gamma \sum_a \pi_2(a|S_{t+1}) Q_1(S_{t+1}, a) - Q_2(S_t, A_t)]$$

where $\pi_2$ is $\epsilon$-greedy with respect to $Q_2$.
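As a rough illustration of the two updates, the following sketch applies one Double Expected Sarsa step to a pair of tables. It assumes the tables are dictionaries mapping a state to a list of action values; the function names, the `terminal` flag, and the `rng` argument are hypothetical conveniences rather than the notebook's implementation.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy (first max wins ties)."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs

def double_expected_sarsa_update(q1, q2, state, action, reward, next_state,
                                 alpha, gamma, epsilon, rng, terminal=False):
    """Apply one Double Expected Sarsa update in place.

    A fair coin picks which table learns. The learner supplies the
    epsilon-greedy probabilities; the other table supplies the values.
    Returns True if q1 was updated and False if q2 was updated.
    """
    if rng.random() < 0.5:
        learner, evaluator = q1, q2
    else:
        learner, evaluator = q2, q1
    if terminal:
        expected_next = 0.0  # no bootstrap term from a terminal state
    else:
        probs = epsilon_greedy_probs(learner[next_state], epsilon)
        expected_next = sum(p * q for p, q in zip(probs, evaluator[next_state]))
    target = reward + gamma * expected_next
    learner[state][action] += alpha * (target - learner[state][action])
    return learner is q1

q1 = {0: [0.0, 0.0], 1: [1.0, 2.0]}
q2 = {0: [0.0, 0.0], 1: [4.0, 0.0]}
updated_q1 = double_expected_sarsa_update(
    q1, q2, state=0, action=1, reward=-1.0, next_state=1,
    alpha=0.5, gamma=1.0, epsilon=0.2, rng=random.Random(0))
```

Only one table changes per step. When acting, a common choice is to behave ϵ-greedily with respect to the sum or average of the two tables, though that is a design decision outside the update itself.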


Batch Training of Random Walk Task

$\alpha$: 0.01

Number of States: 5

Maximum Episodes: 100


Example 6.8: Noisy Gridworld


Windy Gridworld Solutions with Q-Learning


$$\cdots \: S_t,\ A_t,\ R_{t+1},\ S_{t+1},\ A_{t+1},\ R_{t+2},\ S_{t+2},\ A_{t+2},\ R_{t+3},\ S_{t+3} \:\cdots$$


Number of States: 5

Animation Interval (s): 0.5

mimetext/htmlrootassigneelast_run_timestampA *v\persist_js_state·has_pluto_hook_features§cell_id$5455fc97-55cb-4b0e-a3be-9433ccc96fc0depends_on_disabled_cells§runtime gpublished_object_keysdepends_on_skipped_cells§errored$24a441c8-7aaf-4642-b245-5e1201456d67queued¤logsrunning¦outputbody-check_policy (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA &-persist_js_state·has_pluto_hook_features§cell_id$24a441c8-7aaf-4642-b245-5e1201456d67depends_on_disabled_cells§runtime published_object_keysdepends_on_skipped_cells§errored$1e45a661-c2e1-40c2-b27b-5f80f95efdabqueued¤logsrunning¦outputbody
[Figure: windy gridworld plot of a 7×10 grid of state values. Wind values by column: 0, 0, 0, 1, 1, 1, 2, 2, 1, 0. Labels: Actions, Wind Values.]
mimetext/htmlrootassigneelast_run_timestampA ipersist_js_state·has_pluto_hook_features§cell_id$1e45a661-c2e1-40c2-b27b-5f80f95efdabdepends_on_disabled_cells§runtime published_object_keys59c6be96e-38f7-11f0-2d30-a71f02755abc/2b03c89d05785d10depends_on_skipped_cells§errored$21fbdc3b-4444-4f56-9934-fb58e184d685queued¤logsrunning¦outputbodyٖ

Load existing figure:

mimetext/htmlrootassigneelast_run_timestampA {persist_js_state·has_pluto_hook_features§cell_id$21fbdc3b-4444-4f56-9934-fb58e184d685depends_on_disabled_cells§runtimebdpublished_object_keysdepends_on_skipped_cells§errored$30e663da-282c-42ff-8171-dbe3c5c467c6queued¤logsrunning¦outputbody5makepolicyvalueplots (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA Opersist_js_state·has_pluto_hook_features§cell_id$30e663da-282c-42ff-8171-dbe3c5c467c6depends_on_disabled_cells§runtime*\published_object_keysdepends_on_skipped_cells§errored$9651f823-e1cd-4e6e-9ce0-be9ea1c3f0a4queued¤logsrunning¦outputbody4display_king_policy (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA vpersist_js_state·has_pluto_hook_features§cell_id$9651f823-e1cd-4e6e-9ce0-be9ea1c3f0a4depends_on_disabled_cells§runtime*ٵpublished_object_keysdepends_on_skipped_cells§errored$84a71bf8-0d66-42cd-ac7b-589d63a16edaqueued¤logsrunning¦outputbody5create_greedy_policy (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA ^d=persist_js_state·has_pluto_hook_features§cell_id$84a71bf8-0d66-42cd-ac7b-589d63a16edadepends_on_disabled_cells§runtime%hӵpublished_object_keysdepends_on_skipped_cells§errored$c9f7646a-ec01-4d90-9215-5027b7c1c885queued¤logsrunning¦outputbody

Q-learning Instability at Higher Learning Rate

Learning Rate $\alpha$ 0.3

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$c9f7646a-ec01-4d90-9215-5027b7c1c885depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$8e34202a-f841-4464-9017-cd50194f7987queued¤logsrunning¦outputbody3make_random_policy (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA Zհpersist_js_state·has_pluto_hook_features§cell_id$8e34202a-f841-4464-9017-cd50194f7987depends_on_disabled_cells§runtime`published_object_keysdepends_on_skipped_cells§errored$95245673-2c29-401e-bb4b-a39dc8172297queued¤logsrunning¦outputbody5create_gridworld_mdp (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA 2!qpersist_js_state·has_pluto_hook_features§cell_id$95245673-2c29-401e-bb4b-a39dc8172297depends_on_disabled_cells§runtime7Npublished_object_keysdepends_on_skipped_cells§errored$c34678f6-53bb-4f2a-96f0-a7b16f894dddqueued¤logsrunning¦outputbody
[Figure: 3×3 gridworld value plots with wind values 0, 0, 0 in every panel. Panels: Value Iteration Solution, Sarsa Solution, Expected Sarsa Solution, Double Expected Sarsa Solution, Q-learning Solution, Double Q-learning Solution. Labels: Actions, Wind Values.]
mimetext/htmlrootassigneelast_run_timestampA  炰persist_js_state·has_pluto_hook_features§cell_id$e4e80015-40ce-4f8a-aac7-4a9584da4baadepends_on_disabled_cells§runtimeTc!published_object_keys59c6be96e-38f7-11f0-2d30-a71f02755abc/72ba1d0790a4c52459c6be96e-38f7-11f0-2d30-a71f02755abc/3f5340d82e7339da59c6be96e-38f7-11f0-2d30-a71f02755abc/4cf46394be540b7349c6be96e-38f7-11f0-2d30-a71f02755abc/d3a9386ca62c61859c6be96e-38f7-11f0-2d30-a71f02755abc/f97aed3be1675ad659c6be96e-38f7-11f0-2d30-a71f02755abc/93bf178085e446c559c6be96e-38f7-11f0-2d30-a71f02755abc/895c5d874ea742c3depends_on_skipped_cells§errored$64fe8336-d1c2-41fe-a522-1b6f63260fc9queued¤logsrunning¦outputbody31×6 Matrix{Float32}: 1.0 1.0 1.0 1.0 1.0 1.0mimetext/plainrootassigneeconst π_mrplast_run_timestampA аpersist_js_state·has_pluto_hook_features§cell_id$64fe8336-d1c2-41fe-a522-1b6f63260fc9depends_on_disabled_cells§runtimep:Opublished_object_keysdepends_on_skipped_cells§errored$dea61907-d4fb-492d-b2bb-c037c7f785cbqueued¤logsrunning¦outputbody8bellman_optimal_value! 
(generic function with 2 methods)mimetext/plainrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$dea61907-d4fb-492d-b2bb-c037c7f785cbdepends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$678cad7a-1abb-4fcc-91ba-b5abcbb914cbqueued¤logsrunning¦outputbody1show_grid_value (generic function with 2 methods)mimetext/plainrootassigneelast_run_timestampA ضذpersist_js_state·has_pluto_hook_features§cell_id$678cad7a-1abb-4fcc-91ba-b5abcbb914cbdepends_on_disabled_cells§runtimepkpublished_object_keysdepends_on_skipped_cells§errored$d299d800-a64e-4ba2-9603-efa833343405queued¤logsrunning¦outputbody,example_6_5 (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA >persist_js_state·has_pluto_hook_features§cell_id$d299d800-a64e-4ba2-9603-efa833343405depends_on_disabled_cells§runtimep=published_object_keysdepends_on_skipped_cells§errored$c5718459-2323-4615-b2c4-f92a0fa189d9queued¤logsrunning¦outputbody

Let $\mathcal{M}$ be the set of labels of estimators that maximize the expected values of $X$:

$$\mathcal{M} \doteq \left \{ j \mid \mathbb{E} \{ X_j \} = \max_i \mathbb{E} \{ X_i \} \right \}$$

Let $Max(S)$ be the set of labels of estimators that yield the maximum estimate for some set of samples S:

$$Max(S) \doteq \left \{ j \mid \mu_j(S) = \max_i \mu_i(S) \right \}$$

The claim is that for all $j \in \mathcal{M}$

$$\mathbb{E} \{ \max_i \mu_i \} \geq \mathbb{E} \{ \mu_j \} = \mathbb{E} \{ X_j \} \doteq \max_i \mathbb{E} \{ X_i \} \tag{d}$$

Proof. Assume $j \in \mathcal{M}$, i.e. $\mu_j$ is any estimator whose expected value is the maximal. Then

$$\begin{flalign} \mathbb{E} \{ \max_i \mu_i \} &= P(j \in Max) \mathbb{E} \{ \max_i \mu_i \vert j \in Max \} + P(j \notin Max) \mathbb{E} \{ \max_i \mu_i \vert j \notin Max \} \\ &= P(j \in Max) \mathbb{E} \{\mu_j \vert j \in Max \} + P(j \notin Max) \mathbb{E} \{ \max_i \mu_i \vert j \notin Max \} \\ &\geq P(j \in Max) \mathbb{E} \{\mu_j \vert j \in Max \} + P(j \notin Max) \mathbb{E} \{ \mu_j \vert j \notin Max \} \\ &=\mathbb{E} \{ \mu_j \} = \mathbb{E} \{X_j\} \doteq \max_i \mathbb{E} \{ X_i \} \end{flalign}$$

The third line in the proof follows from the definition of $Max$: whenever $j \notin Max$ we have $\max_i \mu_i \gt \mu_j$, which implies $\mathbb{E} \{ \max_i \mu_i \vert j \notin Max \} \gt \mathbb{E} \{ \mu_j \vert j \notin Max \}$ for any $j$. Therefore the inequality is strict if and only if $P(j \notin Max) \gt 0$ for some $j \in \mathcal{M}$. If we do not know whether this is the case, we do not know if the inequality in $(d)$ is strict, and therefore in general we write $\mathbb{E} \{ \max_i \mu_i \} \geq \max_i \mathbb{E} \{ \mu_i \}$, so the claim has been proven.

Recall that $j$ is assumed to be in the set $\mathcal{M}$, meaning it has a maximizing expected value, while the set $Max(S)$ contains the variables that produce the maximum estimate over some sample $S$. So, intuitively, the proof says that the expected value of the maximum of the estimators always has a positive bias, unless there is zero probability that the variables that produce the highest estimates over a given sample differ from the true set of maximizing variables. This means that unless the underlying distributions of the variables have zero overlap (in which case the ranking of estimates will match the ranking of true expected values), there is always an expected positive bias.
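The bias described above can be checked numerically. The following is a minimal sketch, not part of the original notebook: the function name `expected_max_estimate`, the unit-variance normal distributions, and the sample counts are all assumptions chosen for illustration. It estimates $\mathbb{E} \{ \max_i \mu_i \}$ by simulation and compares it with $\max_i \mathbb{E} \{ X_i \}$, once when the distributions overlap completely and once when one variable is far above the others.

```julia
using Random, Statistics

# Estimate E{max_i μ_i} by simulation. Each μ_i is the sample mean of
# `n_samples` draws from a Normal(true_means[i], 1) distribution (an assumption).
function expected_max_estimate(true_means; n_samples=10, n_trials=20_000,
                               rng=MersenneTwister(1))
    total = 0.0
    for _ in 1:n_trials
        # One sample mean per variable, i.e. μ_i(S) for a fresh sample set S
        μ = [mean(m .+ randn(rng, n_samples)) for m in true_means]
        total += maximum(μ)
    end
    return total / n_trials
end

# Fully overlapping distributions: every variable is in 𝓜, but which one
# attains the sample maximum changes from trial to trial.
overlapping = [0.0, 0.0, 0.0]
bias_overlap = expected_max_estimate(overlapping) - maximum(overlapping)

# Well-separated distributions: variable 3 is essentially always in Max(S).
separated = [0.0, 0.0, 100.0]
bias_separated = expected_max_estimate(separated) - maximum(separated)
```

Under these assumptions `bias_overlap` should come out clearly positive while `bias_separated` should be close to zero, matching the condition on $P(j \notin Max)$ in the proof.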

mimetext/htmlrootassigneelast_run_timestampA ޽apersist_js_state·has_pluto_hook_features§cell_id$c5718459-2323-4615-b2c4-f92a0fa189d9depends_on_disabled_cells§runtime published_object_keysdepends_on_skipped_cells§errored$c306867b-f137-44f2-97dd-3d10c226ca5cqueued¤logsrunning¦outputbody

Consider instead policy improvement with afterstate value estimates $W_\pi(y)$ where we seek to choose a policy that is greedy with respect to the afterstate values:

$$\pi^\prime(s) = \mathrm{argmax}_a \left ( f_2(s, a) + W_\pi(f_1(s, a)) \right )$$

where $f_1$ and $f_2$ are the deterministic functions defined above that determine which afterstate is reached from $(s, a)$ and whether any intermediate reward is received. This looks much closer to the policy improvement that occurs with $Q(s, a)$ and that is because $Q_\pi(s, a) = f_2(s, a) + W_\pi(f_1(s, a))$. So, if we use afterstates, we can have the benefits of learning the state action value function while only saving values for the afterstates. The functions $f_1$ and $f_2$ provide all the extra information needed to recover those values.
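As a small illustration of the relation $Q_\pi(s, a) = f_2(s, a) + W_\pi(f_1(s, a))$, here is a sketch of greedy policy improvement using only a table of afterstate values. The toy problem (3 states, 2 actions, 4 afterstates), the particular `f1` and `f2`, and the numbers in `W` are hypothetical and are not taken from the notebook.

```julia
# Hypothetical toy problem: states 1:3, actions 1:2, afterstates 1:4.
f1(s, a) = s + a - 1             # deterministic afterstate reached from (s, a)
f2(s, a) = a == 2 ? -1.0 : 0.0   # intermediate reward for taking a in s

# Afterstate value estimates W_π(y), one entry per afterstate (made-up numbers)
W = [0.0, 0.5, 2.0, 1.0]

# Recover the action value from afterstate values: Q(s, a) = f2(s, a) + W(f1(s, a))
q_from_afterstate(W, s, a) = f2(s, a) + W[f1(s, a)]

# Greedy improvement step: pick the action with the largest recovered Q value
function greedy_action(W, s, actions)
    q = [q_from_afterstate(W, s, a) for a in actions]
    return actions[argmax(q)]
end

greedy_policy = [greedy_action(W, s, 1:2) for s in 1:3]
```

Only the vector `W` is stored; the action values are rebuilt on demand from `f1` and `f2`, which is the storage saving described above.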

Continuing the comparison to value iteration, recall that we adapted the Bellman optimality equation for the state value function to have a single update rule to estimate $V^*(s)$:

$$V^*(s) = \max_a Q^*(s, a) = \max_a \sum_{r, s^\prime} p(r, s^\prime \vert s, a) (r + \gamma V^*(s^\prime))$$

We can only apply this update rule if we have $p(r, s^\prime \vert s, a)$ or if we instead estimate $Q^*$ and sample the transitions from the environment. To estimate $W^*(y)$, we need to represent the Bellman optimality equation for the afterstate value function instead of the state value function:

$$\begin{flalign} W^*(y) &= \sum_{r, s^\prime} p(r, s^\prime \vert y)(r + \gamma \max_a(f_2(s^\prime, a) + W^*(f_1(s^\prime, a)))) \\ &= \sum_{r, s^\prime} p(r, s^\prime \vert y)r + \gamma \sum_{s^\prime} p(s^\prime \vert y) \max_a(f_2(s^\prime, a) + W^*(f_1(s^\prime, a))) \end{flalign}$$

where $p(s^\prime \vert y) = \sum_r p(r, s^\prime \vert y)$

The outer sum just represents an expected value over the transitions out of $y$, so if we don't have access to $p(r, s^\prime \vert y)$, we could sample the transitions from the environment. The $\max_a$ term can now be calculated explicitly: it involves finding the maximum of a vector for each transition state and does not depend on the reward. Using state values, the maximization step involves evaluating a double sum every time, so each update with afterstates is less costly. Also, the afterstates themselves might be more informative in the sense that they all have distinct values. If many of the actions from a given state lead to the same afterstate, this method will immediately treat them all as equal, whereas with the usual value iteration that equivalence would have to be calculated with the probability transition function. The benefits of using an afterstate value function depend entirely on how effectively the environment transitions can be separated into informative deterministic steps and limited stochastic dynamics.
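The backup for $W^*(y)$ can be sketched as follows. This is an illustrative version under stated assumptions, not the notebook's implementation: the name `afterstate_backup`, the representation of $p(r, s^\prime \vert y)$ as a vector of `(probability, reward, next_state)` tuples, and the two-afterstate toy problem are all hypothetical, and terminal states are left out for brevity.

```julia
# One Bellman optimality backup for a single afterstate y:
#   W(y) ← Σ p(r, s′ | y) [r + γ max_a (f2(s′, a) + W(f1(s′, a)))]
# `transitions[y]` is a vector of (probability, reward, next_state) tuples.
function afterstate_backup(W, y, transitions, f1, f2, actions, γ)
    total = 0.0
    for (p, r, s_next) in transitions[y]
        # The max over actions only needs table lookups into W,
        # not another sum over stochastic outcomes.
        best = maximum(f2(s_next, a) + W[f1(s_next, a)] for a in actions)
        total += p * (r + γ * best)
    end
    return total
end

# Hypothetical toy problem: 2 states, 2 actions, and the afterstate is
# determined entirely by the chosen action, with no intermediate reward.
toy_f1(s, a) = a
toy_f2(s, a) = 0.0
toy_transitions = [
    [(0.5, 1.0, 1), (0.5, 0.0, 2)],   # outcomes from afterstate 1
    [(1.0, 2.0, 1)],                  # outcomes from afterstate 2
]

W0 = [0.0, 0.0]
W1 = [afterstate_backup(W0, y, toy_transitions, toy_f1, toy_f2, 1:2, 0.9) for y in 1:2]
W2 = [afterstate_backup(W1, y, toy_transitions, toy_f1, toy_f2, 1:2, 0.9) for y in 1:2]
```

Because both actions are available in every state of this toy problem, the inner maximum is the same for every `s_next`; in general it would differ by state, but it would still be a lookup rather than a second expectation.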

mimetext/htmlrootassigneelast_run_timestampA ޿{Bpersist_js_state·has_pluto_hook_features§cell_id$c306867b-f137-44f2-97dd-3d10c226ca5cdepends_on_disabled_cells§runtime D:published_object_keysdepends_on_skipped_cells§errored$a4c4d5f2-d76d-425e-b8c9-9047fe53c4f0queued¤logsrunning¦outputbodyS
mimetext/htmlrootassigneelast_run_timestampA 5(persist_js_state·has_pluto_hook_features§cell_id$a4c4d5f2-d76d-425e-b8c9-9047fe53c4f0depends_on_disabled_cells§runtimeΡbpublished_object_keys59c6be96e-38f7-11f0-2d30-a71f02755abc/bc25cbf31a6c694259c6be96e-38f7-11f0-2d30-a71f02755abc/5b7c97cc5c268b2e59c6be96e-38f7-11f0-2d30-a71f02755abc/4d752609bc5b03a959c6be96e-38f7-11f0-2d30-a71f02755abc/6aa5ac91f9de9235depends_on_skipped_cells§errored$410abe1d-04a6-4434-9abf-0d29dd6498e6queued¤logsrunning¦outputbodyJ

Tabular TD(0) Implementation

mimetext/htmlrootassigneelast_run_timestampA ްՑpersist_js_state·has_pluto_hook_features§cell_id$410abe1d-04a6-4434-9abf-0d29dd6498e6depends_on_disabled_cells§runtimeIpublished_object_keysdepends_on_skipped_cells§errored$aa0791a5-8cf1-499b-9900-4d0c59be808cqueued¤logsrunning¦outputbody0stochastic_wind (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA e|persist_js_state·has_pluto_hook_features§cell_id$aa0791a5-8cf1-499b-9900-4d0c59be808cdepends_on_disabled_cells§runtime _0published_object_keysdepends_on_skipped_cells§errored$510761f6-66c7-4faf-937b-e1422ec829a6queued¤logsrunning¦outputbody mimetext/htmlrootassigneelast_run_timestampA Lpersist_js_state·has_pluto_hook_features§cell_id$510761f6-66c7-4faf-937b-e1422ec829a6depends_on_disabled_cells§runtime+published_object_keysdepends_on_skipped_cells§errored$0b9c6dbd-4eb3-4167-886e-64db9ec7ff04queued¤logsrunning¦outputbody

Exercise 6.3

From the results shown in the left graph of the random walk example it appears that the first episode results in a change only in $V(A)$. What does this tell you about what happened on the first episode? Why was only the estimate for this one state changed? By exactly how much was it changed?

The update rule with TD(0) learning is given by

$$V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$$

All states, A, B, C, D, E are initialized at 0.5 with the terminal state initialized at 0. During the first episode for all transitions before the end, the reward is 0 and the difference between adjacent states would be 0 resulting in no change to the value function. Since the value estimate for state A decreases from the initial value, this means that the first episode terminated to the left. For this final transition we have the following update.

$$V(A) \leftarrow V(A) + \alpha[0 + \gamma V(\text{Term}) - V(A)]$$

We know that prior to the update $V(A) = 0.5$, $V(\text{Term}) = 0$ and $\gamma=1$ so the update is

$$V(A) \leftarrow 0.5 + \alpha[0 - 0.5]$$

For this plot, $\alpha=0.1$, so the updated value for $V(A)$ is $0.5+0.1(-0.5)=0.5-0.05=0.45$
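The arithmetic above can be reproduced with a few lines of code. This is a minimal sketch: the function name `td0_update!` and the choice to store the terminal state at index 6 are assumptions made for this example and need not match the notebook's own TD(0) implementation.

```julia
# One tabular TD(0) update: V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]
function td0_update!(V, s, r, s_next, α, γ)
    V[s] += α * (r + γ * V[s_next] - V[s])
    return V
end

# Indices 1:5 stand for states A..E; index 6 is a slot for the terminal state.
V = [0.5, 0.5, 0.5, 0.5, 0.5, 0.0]
α, γ = 0.1, 1.0

td0_update!(V, 3, 0.0, 2, α, γ)   # C → B, reward 0: TD error is 0, no change
td0_update!(V, 2, 0.0, 1, α, γ)   # B → A, reward 0: TD error is 0, no change
td0_update!(V, 1, 0.0, 6, α, γ)   # A → left terminal, reward 0: V(A) drops
```

Only the last transition has a nonzero TD error, so only `V[1]` changes, and it moves from 0.5 to approximately 0.45.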

mimetext/htmlrootassigneelast_run_timestampA ޷

Random Walk MDP Setup

mimetext/htmlrootassigneelast_run_timestampA ޷Epersist_js_state·has_pluto_hook_features§cell_id$a9dda9b5-f568-481c-9e8f-9bb887468775depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$ad03500a-bd42-4216-a9cb-3f923152af79queued¤logsrunning¦outputbodyAcreate_car_rental_afterstate_mdp (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA l'persist_js_state·has_pluto_hook_features§cell_id$ad03500a-bd42-4216-a9cb-3f923152af79depends_on_disabled_cells§runtime opublished_object_keysdepends_on_skipped_cells§errored$de50f95f-984e-4387-958c-64e0265f5953queued¤logsrunning¦outputbody,render_walk (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA 彗ppersist_js_state·has_pluto_hook_features§cell_id$de50f95f-984e-4387-958c-64e0265f5953depends_on_disabled_cells§runtime/صpublished_object_keysdepends_on_skipped_cells§errored$c8500b89-644d-407f-881a-bcbd7da23502queued¤logsrunning¦outputbody

Figure 6.3 Interim and asymptotic performance shown for TD control methods on the cliff-walking task as a function of α. Dashed lines represent interim performance and solid lines are asymptotic.

mimetext/htmlrootassigneelast_run_timestampA ޼Vpersist_js_state·has_pluto_hook_features§cell_id$c8500b89-644d-407f-881a-bcbd7da23502depends_on_disabled_cells§runtime{wpublished_object_keysdepends_on_skipped_cells§errored$84d81413-6334-4965-8632-8a763cd3f28aqueued¤logsrunning¦outputbody8

Comparison of all learning methods with their double estimator counterparts on the simple MDP described in 6.7. Q-learning initially learns to take the left action much more often than the right action, and always takes it significantly more often than the 5% minimum probability enforced by $\epsilon$-greedy action selection with $\epsilon$=0.1. In contrast, Double Q-learning is essentially unaffected by maximization bias, as is Double Expected Sarsa. Sarsa and Expected Sarsa also exhibit maximization bias. All of the Sarsa methods eventually take the left action more than Q-learning even though the behavior policy should be the same for both. Even Double Expected Sarsa, without maximization bias, shows the same tendency. The only difference between this method and Double Q-learning is the use of the $\epsilon$-greedy policy in the value calculation. So the action value estimates are for the $\epsilon$-greedy policy rather than for the greedy policy under Double Q-learning. Under this policy, sometimes the right action selection goes left and vice versa. Even under the $\epsilon$-greedy policy, the optimal policy would be to select right, but due to the variance in value estimates introduced by $\epsilon$, it will take longer for the behavior policy based on the Q values to converge to the correct values. That slower convergence is apparent in the graph above.

mimetext/htmlrootassigneelast_run_timestampA ޾-persist_js_state·has_pluto_hook_features§cell_id$84d81413-6334-4965-8632-8a763cd3f28adepends_on_disabled_cells§runtimewpublished_object_keysdepends_on_skipped_cells§errored$33d69db9-fa2b-40a3-bbed-21d5fd60f302queued¤logsrunning¦outputbody,example_6_8 (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA W-persist_js_state·has_pluto_hook_features§cell_id$33d69db9-fa2b-40a3-bbed-21d5fd60f302depends_on_disabled_cells§runtime%published_object_keysdepends_on_skipped_cells§errored$3f3ebc9b-b070-4d73-8be9-823b399c664cqueued¤logsrunning¦outputbody0batch_value_est (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA fpersist_js_state·has_pluto_hook_features§cell_id$3f3ebc9b-b070-4d73-8be9-823b399c664cdepends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$d5b612d8-82a1-4586-b721-1baaea2101cfqueued¤logsrunning¦outputbody7

Value iteration with afterstates converged in 10 fewer steps than state value iteration, and its total runtime is less than 25% of the state value iteration runtime. So, as expected, the afterstate method converges in fewer steps, each of which is more efficient to compute than a step using the state value function.

mimetext/htmlrootassigneelast_run_timestampA ޿persist_js_state·has_pluto_hook_features§cell_id$d5b612d8-82a1-4586-b721-1baaea2101cfdepends_on_disabled_cells§runtime&Եpublished_object_keysdepends_on_skipped_cells§errored$dee6b500-0ba1-4bbc-b217-cbb9ad47ad06queued¤logsrunning¦outputbodyS"
[Figure: windy gridworld plots of 7×10 grids of state values. Panels: Sarsa Solution, Value Iteration Solution. Wind values by column: 0, 0, 0, 1, 1, 1, 2, 2, 1, 0. Labels: Actions, Wind Values.]
mimetext/htmlrootassigneelast_run_timestampA (/persist_js_state·has_pluto_hook_features§cell_id$dee6b500-0ba1-4bbc-b217-cbb9ad47ad06depends_on_disabled_cells§runtime^published_object_keys59c6be96e-38f7-11f0-2d30-a71f02755abc/ae6d04b38d0be15f59c6be96e-38f7-11f0-2d30-a71f02755abc/59425f0a6271854659c6be96e-38f7-11f0-2d30-a71f02755abc/ac757a3486dcd2e159c6be96e-38f7-11f0-2d30-a71f02755abc/a7c05c6ee7bae052depends_on_skipped_cells§errored$897fde24-9a4a-465e-96f2-dd9e8baab294queued¤logsrunning¦outputbodyD
[Figure: windy gridworld plot of a 7×10 grid of state values. Wind values by column: 0, 0, 0, 1, 1, 1, 2, 2, 1, 0. Labels: Actions, Wind Values.]
mimetext/htmlrootassigneelast_run_timestampA Fpersist_js_state·has_pluto_hook_features§cell_id$897fde24-9a4a-465e-96f2-dd9e8baab294depends_on_disabled_cells§runtime INpublished_object_keys59c6be96e-38f7-11f0-2d30-a71f02755abc/d6339d133c128c5bdepends_on_skipped_cells§errored$1e3d231a-4065-48ce-a74e-018066fb232aqueued¤logsrunning¦outputbody,example_6_3 (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA 1persist_js_state·has_pluto_hook_features§cell_id$1e3d231a-4065-48ce-a74e-018066fb232adepends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$0f22e85f-ed31-49df-a7c7-0579298f05fequeued¤logsrunning¦outputbodyl

For Monte Carlo learning each state estimate is updated with the error shown by the red arrows only after the episode is finished. For TD(0) learning, as soon as the feedback from the subsequent state is received, the error can be calculated and it is only based on the new information from one state into the future.

mimetext/htmlrootassigneelast_run_timestampA ޶`persist_js_state·has_pluto_hook_features§cell_id$0f22e85f-ed31-49df-a7c7-0579298f05fedepends_on_disabled_cells§runtimeMpublished_object_keysdepends_on_skipped_cells§errored$9017093c-a9c3-40ea-a9c6-881ee62fc379queued¤logsrunning¦outputbody

Exercise 6.2

This is an exercise to help develop your intuition about why TD methods are often more efficient than Monte Carlo methods. Consider the driving home example and how it is addressed by TD and Monte Carlo methods. Can you imagine a scenario in which a TD update would be better on average than a Monte Carlo update? Give an example scenario - a description of past experience and a current state - in which you would expect the TD update to be better. Here's a hint: Suppose you have lots of experience driving home from work. Then you move to a new building and a new parking lot (but you still enter the highway at the same place). Now you are starting to learn predictions for the new building. Can you see why TD updates are likely to be much better, at least initially, in this case? Might the same sort of thing happen in the original scenario?

Originally, from the starting state, the expected total time to reach home is 30 minutes. Now suppose we change the route so that it takes on average 5 more minutes to reach the car, while the expected elapsed time for every other leg of the journey is unchanged. Our total time estimate should now be 35 minutes from the starting state on average. Let's say we reach the car and nothing out of the ordinary is happening. The predicted time to go will be 25 minutes and the predicted total time will be 35 minutes. If nothing further out of the ordinary occurs, then only the first state will be corrected. For the Monte Carlo method, the only state with an estimate error will be the first state, but this update will not occur until after we've arrived at our destination. Either way, the next time we drive we will have a new, more accurate estimate reflecting the longer time required to reach the car.

In the example, several events occur during the journey that change the predicted and actual time from the average. For simplicity let's assume that when we enter our home street there is a garbage truck blocking our path. Normally it only takes 3 minutes to arrive at home, but with the truck present we estimate it will take 5 minutes (2 minutes longer). Now the total predicted time will be increased from 35 minutes to 37 minutes. In the case of Monte Carlo learning, this additional 2 minutes will propagate backwards to all of the previous states because we experienced a true travel time of 37 minutes rather than the 35 minutes predicted after the 2nd state and the 30 minutes predicted after the first state. For TD(0) learning, however, this delay will only impact the previous state after a single update. Effectively it will increase the predicted time spent on the final leg of the journey only. The prediction from the starting state will only be increased by the 5 minute increase from the walk to the car, not the delay from the garbage truck. Since we are actually starting from a new point, that feedback will be consistent and does reflect a true change in the expected time from the starting state. The garbage truck, however, may be a rare occurrence. By the time this change propagates backwards through the states to the starting state, a lot more experience will be accumulated at all the other states, and if α is some reasonable value, this delay will not be counted nearly as much as the updates from the first leg of the journey. Since TD(0) immediately uses feedback from only one step into the future, if changes are made to the environment, those changes will only affect the most closely related states immediately. In this example, all of the accurate predictions we still have about the later legs of the journey will be used to keep the predictions more stable.

The opposite extreme though could create a situation where the Monte Carlo updates were better. Imagine instead that you moved houses in the same neighborhood such that once you enter the home street, it takes 5 minutes to reach your home instead of 3 minutes. In this case, the Monte Carlo updates would move all of the state predictions up towards the 2 minute increase since all of the predictions would be too short. The TD(0) update though would initially only increase the prediction for the final leg of the journey and we would have to wait for this change to propagate backwards to all the other states. So the efficiency of updates for each method depends on where in the episode environmental changes occur.

Actual environment change at the end of the route

Now there is a randomly experienced shorter leg at the start of the journey which won't affect most of the Monte Carlo updates.

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$9017093c-a9c3-40ea-a9c6-881ee62fc379depends_on_disabled_cells§runtimemVpublished_object_keys59c6be96e-38f7-11f0-2d30-a71f02755abc/a1553d03eb64404459c6be96e-38f7-11f0-2d30-a71f02755abc/56740ad756b57fb449c6be96e-38f7-11f0-2d30-a71f02755abc/d59b9cec394378459c6be96e-38f7-11f0-2d30-a71f02755abc/9eed5a2466d73029depends_on_skipped_cells§errored$4b0d96d0-25d1-4fed-b105-c65fa2883a61queued¤logsrunning¦outputbodyprefixKMDP_TD{Int64, Int64, var"#step#28"{Int64}, var"#26#29"{Int64}, var"#27#30"}elementsstatesprefixInt64elements0text/plain1text/plain2text/plain3text/plain4text/plain5text/plaintypeArrayprefix_shortobjectid9b8acb7b7f1ff624!application/vnd.pluto.tree+objectstatelookupprefixDict{Int64, Int64}elements0text/plain1text/plain4text/plain5text/plain5text/plain6text/plain2text/plain3text/plain3text/plain4text/plain1text/plain2text/plaintypeDictprefix_shortDictobjectid9af93f222051e2eb!application/vnd.pluto.tree+objectactionsprefixInt64elements1text/plaintypeArrayprefix_shortobjectid11391f426a7c9ef8!application/vnd.pluto.tree+objectactionlookupprefixDict{Int64, Int64}elements1text/plain1text/plaintypeDictprefix_shortDictobjectidf725d08c58aa899e!application/vnd.pluto.tree+objectstate_init$#26 (generic function with 1 method)text/plainstepO(::Main.var"workspace#3".var"#step#28"{Int64}) (generic function with 1 method)text/plainisterm$#27 (generic function with 1 method)text/plaintypestructprefix_shortMDP_TDobjectidf635a3c4fb5f8d4amime!application/vnd.pluto.tree+objectrootassigneeconst mrp_6_2last_run_timestampA \persist_js_state·has_pluto_hook_features§cell_id$4b0d96d0-25d1-4fed-b105-c65fa2883a61depends_on_disabled_cells§runtimeԵpublished_object_keysdepends_on_skipped_cells§errored$1115f3ec-f4b2-4fba-bd5e-321a63b10a6dqueued¤logsrunning¦outputbody
[Figure: windy gridworld plot of a 7×10 grid of state values. Wind values by column: 0, 0, 0, 1, 1, 1, 2, 2, 1, 0. Labels: Actions, Wind Values.]
mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$1115f3ec-f4b2-4fba-bd5e-321a63b10a6ddepends_on_disabled_cells§runtime r-published_object_keys59c6be96e-38f7-11f0-2d30-a71f02755abc/afbc8d42c8c4fc44depends_on_skipped_cells§errored$1e3b3234-3fe1-46c9-82b7-f729c656eb25queued¤logsrunning¦outputbody.

$$\begin{flalign} G_t - V_t(S_t) &= \delta_t + \gamma \eta_{t} + \gamma \left [\delta_{t+1} + \gamma \eta_{t+1} + \gamma (G_{t+2} - V_{t+2}(S_{t+2}) ) \right ] \\ &= \delta_t + \gamma \eta_{t} + \gamma \delta_{t+1} + \gamma^2 \eta_{t+1} + \gamma^2 \left [G_{t+2} - V_{t+2}(S_{t+2}) \right ] \\ &= (\delta_t + \gamma \eta_t) + \gamma (\delta_{t+1} + \gamma \eta_{t+1}) + \cdots + \gamma^{T-t-1}(\delta_{T-1} + \gamma \eta_{T-1}) + \gamma^{T-t} \left [G_T - V_T(S_T) \right ]\\ &= (\delta_t + \gamma \eta_t) + \gamma (\delta_{t+1} + \gamma \eta_{t+1}) + \cdots + \gamma^{T-t-1}(\delta_{T-1} + \gamma \eta_{T-1})\\ &=\sum_{k=t}^{T-1} \gamma^{k-t} (\delta_k + \gamma \eta_k)\\ \end{flalign}$$
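The final identity can be sanity-checked numerically. The sketch below assumes the definitions $\delta_k \doteq R_{k+1} + \gamma V_k(S_{k+1}) - V_k(S_k)$ and $\eta_k \doteq V_{k+1}(S_{k+1}) - V_k(S_{k+1})$, which are inferred from the form of the derivation rather than quoted from it, and it uses a made-up three-step episode with arbitrary value tables whose terminal entry is fixed at 0.

```julia
γ = 0.9
states  = [1, 2, 3, 4]        # S_t, ..., S_T, with state 4 terminal (hypothetical episode)
rewards = [1.0, -2.0, 0.5]    # R_{t+1}, ..., R_T

# Vs[k] is the value table in force at the k-th step of the episode.
# The numbers are arbitrary; only the terminal entry must be 0.
Vs = [[0.3, -0.1,  0.8, 0.0],
      [0.5,  0.2,  0.7, 0.0],
      [0.4,  0.6, -0.3, 0.0],
      [0.1,  0.9,  0.2, 0.0]]

n = length(rewards)                                  # number of transitions, T − t
G = sum(γ^(k - 1) * rewards[k] for k in 1:n)         # return G_t

# Assumed definitions of the TD error and of the change in the value estimate
δ(k) = rewards[k] + γ * Vs[k][states[k + 1]] - Vs[k][states[k]]
η(k) = Vs[k + 1][states[k + 1]] - Vs[k][states[k + 1]]

lhs = G - Vs[1][states[1]]                           # G_t − V_t(S_t)
rhs = sum(γ^(k - 1) * (δ(k) + γ * η(k)) for k in 1:n)
```

Under these assumptions `lhs` and `rhs` agree up to floating-point rounding, because the value terms telescope and the remaining term $\gamma^{T-t} [G_T - V_T(S_T)]$ is zero at the terminal state.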

mimetext/htmlrootassigneelast_run_timestampA ޱذpersist_js_state·has_pluto_hook_features§cell_id$1e3b3234-3fe1-46c9-82b7-f729c656eb25depends_on_disabled_cells§runtimeߵpublished_object_keysdepends_on_skipped_cells§errored$6029990b-eb31-45ae-a869-b789fba673a6queued¤logsrunning¦outputbody

To use afterstates with generalized policy iteration, we need to modify our MDP framework by considering the following trajectory:

$$(S, A) \longrightarrow (Y, P) \longrightarrow (S^\prime, R) \longrightarrow \cdots \longrightarrow (S_T, R_T)$$

where $(S, A, R)$ are the usual state, action, and reward. We introduce $(Y, P)$ to indicate the afterstate and any intermediate reward that is received from the afterstate transition.

The probability transition function for a normal MDP is written as $p(s^\prime, r \vert s, a)$ and represents the probability of transitioning to state $s^\prime$ with reward $r$ under the condition that an agent takes action $a$ from state $s$.

When using afterstates, transitions can be represented with two functions:

$$p(y, \rho \vert s, a) \tag{a}$$

is the probability of transitioning to afterstate $y$ with intermediate reward $\rho$ given an agent takes action $a$ from state $s$

$$p(s^\prime, r \vert y) \tag{b}$$

is the probability of transitioning to state $s^\prime$ with reward $r$ given an agent starts in afterstate $y$.

Moreover, when an environment is modified to use afterstates, there are usually known deterministic dynamics that immediately follow an action, followed in turn by some stochastic behavior. A good example is tic-tac-toe, where we fully know the immediate result of making a move, but there could be some unknown behavior from the opponent. In this situation, the afterstate probability transition (a) is deterministic, so it could instead be represented by a mapping function that returns an afterstate and an intermediate reward given a state action pair.

$$f_1(s, a) = y \tag{b1′}$$

$$f_2(s, a) = \rho \tag{b2′}$$

where $y$ and $\rho$ are the afterstate and intermediate reward, respectively, after taking action $a$ in state $s$. Now all of the stochastic dynamics of the environment are captured in (b), and that function has only 3 arguments instead of the usual 4. We can now apply all of the previous techniques to the afterstate example and even combine dynamic programming and trajectory sampling.
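The split between the deterministic mapping and the stochastic transition (b) can be sketched in a few lines of Julia. Everything here is a hypothetical toy problem invented for illustration (integer states, the functions `f1` and `f2`, the `afterstate_dynamics` table, and the `action_value` helper), not code from the notebook.

```julia
# Hypothetical toy problem: states and afterstates are integers, actions are +1 or -1.
f1(s::Int, a::Int) = s + a      # deterministic afterstate y
f2(s::Int, a::Int) = -1.0       # deterministic intermediate reward ρ (a unit move cost)

# Stochastic part (b): for each afterstate y, a list of (s_next, r, probability).
afterstate_dynamics = Dict(
    1 => [(1, 0.0, 0.5), (2, 1.0, 0.5)],
    2 => [(2, 0.0, 1.0)],
)

# Expected one-step return of taking action a in state s, given state values v.
# Only the afterstate y is needed to look up the stochastic transition.
function action_value(s, a, v, dynamics, discount)
    y = f1(s, a)
    rho = f2(s, a)
    return rho + sum(p * (r + discount * v[s_next]) for (s_next, r, p) in dynamics[y])
end

v = Dict(1 => 0.0, 2 => 10.0)
q = action_value(0, 1, v, afterstate_dynamics, 0.9)
```

Because the table is keyed by afterstate alone, two different state-action pairs that lead to the same afterstate share one entry, which is the saving that the 3-argument form of (b) expresses.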


Chapter 6 Temporal-Difference Learning

TD methods combine the Monte Carlo concept of learning from experience with the self-consistency ideas from dynamic programming. Unlike the pure Monte Carlo methods of Chapter 5, TD methods do not require waiting for the final outcome of an episode to start learning. In other words, they bootstrap learning by exploiting what is known about the properties of the value function. Eventually we will see that different degrees of bootstrapping can be used to bridge the gap between the techniques in Chapters 5 and 6.

6.1 TD Prediction


Exercise 6.4

The specific results shown in the right graph of the random walk example are dependent on the value of the step-size parameter $\alpha$. Do you think the conclusions about which algorithm is better would be affected if a wider range of values were used? Is there a different, fixed value of $\alpha$ at which either algorithm would have performed significantly better than shown? Why or why not?

Both algorithms should theoretically converge to the true values with a sufficiently small $\alpha$ and a large enough number of samples. Over this limited window of 100 episodes, an $\alpha$ that is too small might result in convergence so slow that it does not reach an error as low as that of a larger $\alpha$. For the MC method, $\alpha=0.01$ is the smallest value and it has the slowest convergence over this range. $\alpha=0.04$ is the largest value tested, and it results in approximately the same error as $\alpha=0.01$ after 100 episodes. The intermediate values show better performance over this number of episodes, indicating that the best possible performance is already captured in this interval.

For the TD method, the best results shown are for $\alpha=0.05$ which is already the smallest value with the slowest convergence rate. An even smaller value might result in a better outcome over 100 episodes, but this performance is already better than anything observed for the MC method.


TD(0) update rule for action values:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})-Q(S_t, A_t)]$$

This update is done after every transition from a nonterminal state $S_t$. If $S_{t+1}$ is terminal, then $Q(S_{t+1}, A_{t+1})$ is defined as zero. This rule uses every element of the quintuple of events, $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, that make up a transition from one state-action pair to the next. This quintuple gives rise to the name Sarsa for the algorithm. Each update only uses the immediate reward and the value of the state-action pair in the subsequent state as illustrated in the backup diagram shown below.
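A single application of this update can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's `sarsa` function: `sarsa_update!` is a hypothetical helper, and `Q` is assumed to be a dense matrix indexed by integer state and action.

```julia
# Hypothetical helper: Q is a (num_states × num_actions) matrix of action-value estimates.
function sarsa_update!(Q::Matrix{Float64}, s::Int, a::Int, r::Float64,
                       s_next::Int, a_next::Int, alpha::Float64, discount::Float64;
                       terminal::Bool = false)
    # Q(S_{t+1}, A_{t+1}) is defined as zero when S_{t+1} is terminal.
    q_next = terminal ? 0.0 : Q[s_next, a_next]
    td_error = r + discount * q_next - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q
end

Q = zeros(3, 2)
Q[2, 1] = 1.0
# Quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) = (1, 2, -1.0, 2, 1)
sarsa_update!(Q, 1, 2, -1.0, 2, 1, 0.5, 0.9)
```

Only the entry for the visited pair $(S_t, A_t)$ changes; every other entry of `Q` is left untouched by the update.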

Actions

Dynamic Programming Code


King Actions


Solutions on Noisy Gridworld

Load Existing Results if Present:

If file does not load correctly, uncheck this box to produce new results.


Exercise 6.6

In Example 6.2 we stated that the true values for the random walk example are 1/6, 2/6, 3/6, 4/6, and 5/6, for states A through E. Describe at least two different ways that these could have been computed. Which would you guess we actually used? Why?

Method 1: Set up the following system of equations that represent the relationship between state values

$$\begin{flalign} V(A) &= \frac{0+V(B)}{2} \implies 2V(A)=V(B) \\ V(B) &= \frac{V(A)+V(C)}{2} \implies 2V(B) = V(A)+V(C)\\ V(C) &= \frac{V(B)+V(D)}{2} \implies 2V(C)=V(B)+V(D)\\ V(D) &= \frac{V(C)+V(E)}{2} \implies 2V(D)=V(C)+V(E)\\ V(E) &= \frac{V(D)+1}{2} \implies 2V(E)=V(D)+1\\ \end{flalign}$$

We can work down from the top equation expressing everything in terms of A. For shorter expressions $V(A)$ will be written below as $A$ and likewise for other states:

$$\begin{flalign} B&=2A \\ 2B&=A+C \implies C = 3A \\ 2C&=B+D \implies D = 6A-2A=4A \\ 2D&=C+E \implies E = 8A-3A = 5A \\ 2E &= D + 1 \implies 10A = 4A + 1 \implies A = \frac{1}{6} \end{flalign}$$

Now that we have the value for A, all the others are trivial multiplications of it from 2 to 5.
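As a numerical check on Method 1, the same five equations can be written in matrix form and solved directly. This is a small sketch using Julia's standard `LinearAlgebra` library; the names `M`, `b`, and `v` are chosen here for illustration.

```julia
using LinearAlgebra

# Unknowns ordered as [V(A), V(B), V(C), V(D), V(E)].
# Each row rewrites one equation as 2V(s) - V(left) - V(right) = constant.
M = Float64[
     2 -1  0  0  0
    -1  2 -1  0  0
     0 -1  2 -1  0
     0  0 -1  2 -1
     0  0  0 -1  2
]
b = Float64[0, 0, 0, 0, 1]  # only E's equation has the +1 from the right terminal

v = M \ b  # solve the linear system M * v = b
```

The solution agrees with the hand derivation: each entry of `v` is a multiple of $V(A) = 1/6$.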

Method 2: Calculate each value from probability of each trajectory

With this method to get $V(A)$ we would write down every possible trajectory to a terminal state with the associated probability of each. Since trajectories terminating to the left have a value of 0, we only need to add up the trajectories that terminate to the right. Below are some examples for state A.

$$V(A) = 0.5^5 + 4 \times 0.5^7 + \cdots$$

This equation represents the single trajectory that takes 5 steps to the right, each with probability one half, and the 4 possible trajectories that turn around once on the way right, resulting in 7 steps. This sum is infinitely long because it must account for all of the trajectories that bounce back and forth an arbitrarily large number of times. This method is significantly harder to calculate for each state compared to the first method and is more in line with how estimates are calculated with MC sampling. The first method is more analogous to TD sampling using the bootstrapped form of the Bellman equation.
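The terms of this series can be generated mechanically by tracking, step by step, how much probability mass remains in each nonterminal state and how much exits to the right. The sketch below is one possible way to do that; the function name `right_absorption_by_length` and the cutoff of 2000 steps are assumptions made for this illustration.

```julia
# States 1..5 stand for A..E; stepping left from 1 or right from 5 terminates.
function right_absorption_by_length(start::Int, max_steps::Int)
    occupancy = zeros(5)         # probability of currently being in each nonterminal state
    occupancy[start] = 1.0
    absorbed = zeros(max_steps)  # absorbed[n] = probability of terminating right on step n
    for n in 1:max_steps
        next = zeros(5)
        for s in 1:5
            p = 0.5 * occupancy[s]  # mass sent in each direction
            if s < 5
                next[s + 1] += p
            else
                absorbed[n] += p    # right termination, reward 1
            end
            if s > 1
                next[s - 1] += p
            end                     # a left move from s == 1 terminates with reward 0
        end
        occupancy = next
    end
    return absorbed
end

terms = right_absorption_by_length(1, 2000)
partial_sum = sum(terms)
```

Here `terms[5]` and `terms[7]` reproduce the two terms written out above, and the truncated sum approaches $1/6$ as the cutoff grows.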

e+objectaction_indexprefix2Dict{Main.var"workspace#3".GridworldAction, Int64}elementsprefixDownelementstypestructprefix_shortDownobjectidffffffff0a19a748!application/vnd.pluto.tree+object2text/plainprefixLeftelementstypestructprefix_shortLeftobjectidffffffff13951cc2!application/vnd.pluto.tree+object3text/plainprefixUpRightelementstypestructprefix_shortUpRightobjectidffffffff65dca132!application/vnd.pluto.tree+object5text/plainprefixDownRightelementstypestructprefix_shortDownRightobjectidffffffff97f641f9!application/vnd.pluto.tree+object7text/plainprefixDownLeftelementstypestructprefix_shortDownLeftobjectidffffffffd243dd41!application/vnd.pluto.tree+object8text/plainprefixRightelementstypestructprefix_shortRightobjectidffffffffa8d2f4c6!application/vnd.pluto.tree+object4text/plainprefixUpLeftelementstypestructprefix_shortUpLeftobjectidffffffff68f3503e!application/vnd.pluto.tree+object6text/plainprefixUpelementstypestructprefix_shortUpobjectidffffffff92511601!application/vnd.pluto.tree+object1text/plaintypeDictprefix_shortDictobjectid69b123a92ea18d23!application/vnd.pluto.tree+objecttypestructprefix_shortFiniteMDPobjectid311b7b18ca3c1a72mime!application/vnd.pluto.tree+objectrootassigneeconst king_gridworld_mdp_dplast_run_timestampA Gpersist_js_state·has_pluto_hook_features§cell_id$2f4e2da2-b1a1-41b1-8904-39b59f426da4depends_on_disabled_cells§runtimeՅpublished_object_keysdepends_on_skipped_cells§errored$bc8bad61-a49a-47d6-8fa6-7dcf6c221910queued¤logsrunning¦outputbody,example_6_1 (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA Jpersist_js_state·has_pluto_hook_features§cell_id$bc8bad61-a49a-47d6-8fa6-7dcf6c221910depends_on_disabled_cells§runtimeipublished_object_keysdepends_on_skipped_cells§errored$2455742f-dc18-4d6b-9f58-5666adac6919queued¤logsrunning¦outputbody6create_car_rental_mdp (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA 
*persist_js_state·has_pluto_hook_features§cell_id$2455742f-dc18-4d6b-9f58-5666adac6919depends_on_disabled_cells§runtimeRֵpublished_object_keysdepends_on_skipped_cells§errored$f474fcbd-e3c3-49fd-a6b7-6d6a8a7dda09queued¤logsrunning¦outputbody=

Informal Proof for Bias

example_6_7_mdp (generic function with 1 method)

create_gridworld_mdp (generic function with 2 methods)

Number of Variables:

make_ϵ_greedy_policy! (generic function with 1 method)

The right graph shows learning curves for the two methods for various values of α. The performance measure shown is the root mean square (RMS) error between the learned value function and the true value function, averaged over the five states and then over 100 runs. In all cases the approximate value function was initialized to the intermediate value 0.5. The TD method was consistently better than the MC method on this task.

FiniteMDP{Float32, Tuple{Int64, Int64}, Int64} (display truncated). states: (0, 0), (0, 1), (0, 2), …, (20, 20). actions: -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5. rewards: -10.0, -8.0, -6.0, -4.0, -2.0, 0.0, 2.0, 4.0, 6.0, …, 380.0. ptf: 441×189×11×441 Array{Float32, 4}.
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.80134f-5 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 2, 441] = 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ ⋮ ⋱ ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00012341 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00024682 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00024682 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000164546 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000130287 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 3, 441] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ ⋮ ⋱ ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00012341 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00024682 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00024682 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000294833 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ;;; … [:, :, 9, 441] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ ⋮ ⋱ ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000525983 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 10, 441] = 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ ⋮ ⋱ ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000321683 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 11, 441] = 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ ⋱ ⋮ 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000168458 
0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0text/plainaction_scratchprefixFloat32elements-1.55978f29text/plain-2.64806f36text/plain-1.69975f38text/plainNaNtext/plainNaNtext/plainNaNtext/plainNaNtext/plainNaNtext/plain NaNtext/plain NaNtext/plain NaNtext/plaintypeArrayprefix_shortobjectide9a7fe1919d4eb!application/vnd.pluto.tree+objectstate_scratchprefixFloat32elements0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain 0.0text/plainmore0.0text/plaintypeArrayprefix_shortobjectid90562f95b7f2e378!application/vnd.pluto.tree+objectreward_scratchprefixFloat32elements0.037517text/plain4.5677f-41text/plain0.0375508text/plain4.5677f-41text/plain5.53055f-32text/plain4.5677f-41text/plain1.06974f-31text/plain4.5677f-41text/plain 0.0106894text/plainmore̽0.0text/plaintypeArrayprefix_shortobjectid257631496232e689!application/vnd.pluto.tree+objectstate_indexprefix Dict{Tuple{Int64, Int64}, Int64}elementselements11text/plain17text/plaintypeTupleobjectid49ec9371b177a25d!application/vnd.pluto.tree+object249text/plainelements16text/plain14text/plaintypeTupleobjectidd93d095a02371a59!application/vnd.pluto.tree+object351text/plainelements18text/plain16text/plaintypeTupleobjectidaeb6f295858259db!application/vnd.pluto.tree+object395text/plainelements17text/plain12text/plaintypeTupleobjectid68544eea78f6641!application/vnd.pluto.tree+object370text/plainelements8text/plain15text/plaintypeTupleobjectidceff527f41a09840!application/vnd.pluto.tree+object184text/plainelements16text/plain16text/plaintypeTupleobjectid3164689f12bc7404!application/vnd.pluto.tree+object353text/plainelements19text/plain14text/plaintypeTupleobjectidcb90bf273945b2c8!application/vnd.pluto.tree+object414text/plainelements7text/plain18text/plaintypeTupleobjectidf3c6affef4f32144!application/vnd.pluto.tree+object166text/plainelements7text/plain8text/plaintypeTupleobjectid300559d2f34a9666!application/vnd.pluto.tree+object156text/plainelements14text/plain15text/plaintypeTup
leobjectidac753ed572b44c1d!application/vnd.pluto.tree+object310text/plainmoretypeDictprefix_shortDictobjectidc3ad687634c83340!application/vnd.pluto.tree+objectaction_indexprefixDict{Int64, Int64}elements5text/plain11text/plain-3text/plain3text/plain1text/plain7text/plain0text/plain6text/plain4text/plain10text/plain-5text/plain1text/plain-1text/plain5text/plain2text/plain8text/plain-2text/plain4text/plain-4text/plain2text/plainmoretypeDictprefix_shortDictobjectidf407545a157c0a2b!application/vnd.pluto.tree+objecttypestructprefix_shortFiniteMDPobjectid15204eb150d1284mime!application/vnd.pluto.tree+objectrootassigneeconst jacks_car_mdplast_run_timestampA ɰpersist_js_state·has_pluto_hook_features§cell_id$c2f56287-9a3e-454a-9ec1-53184b788db9depends_on_disabled_cells§runtime4Hmpublished_object_keysdepends_on_skipped_cells§errored$18e60b1d-97ec-432c-a388-003e7fae415fqueued¤logsrunning¦outputbody7bellman_optimal_value! (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA H(persist_js_state·has_pluto_hook_features§cell_id$18e60b1d-97ec-432c-a388-003e7fae415fdepends_on_disabled_cells§runtimeY"published_object_keysdepends_on_skipped_cells§errored$12c5efe4-d64d-4b82-877c-29b0e537fee6queued¤logsrunning¦outputbodyelementsprefixInt64elements3text/plain2text/plain3text/plain4text/plain3text/plain4text/plain3text/plain4text/plain 3text/plainmore1text/plaintypeArrayprefix_shortobjectid3a70a1eb8ec67a61!application/vnd.pluto.tree+objectprefixInt64elements1text/plain1text/plain1text/plain1text/plain1text/plain1text/plain1text/plain1text/plain 1text/plainmore1text/plaintypeArrayprefix_shortobjectidfd5795729494610c!application/vnd.pluto.tree+objectprefixFloat32elements0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain 
0.0text/plainmore0.0text/plaintypeArrayprefix_shortobjectid17b12935c947880a!application/vnd.pluto.tree+object0text/plaintypeTupleobjectide1b5fb187f6eb83dmime!application/vnd.pluto.tree+objectrootassigneelast_run_timestampA 妫persist_js_state·has_pluto_hook_features§cell_id$12c5efe4-d64d-4b82-877c-29b0e537fee6depends_on_disabled_cells§runtimedpublished_object_keysdepends_on_skipped_cells§errored$a72d07bf-e337-4bd4-af5c-44d74d163b6bqueued¤logsrunning¦outputbodyj mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$a72d07bf-e337-4bd4-af5c-44d74d163b6bdepends_on_disabled_cells§runtimeppublished_object_keys59c6be96e-38f7-11f0-2d30-a71f02755abc/c69864c8f78f9c34depends_on_skipped_cells§errored$0201ae9f-4a31-497e-86ab-62b454ca85dequeued¤logsrunning¦outputbody

Notice that above about $\alpha = 0.25$, Q-learning sometimes produces diverging value estimates, and therefore episodes that fail to terminate, whereas Double Q-learning avoids that problem even at large learning rates.


Let's first consider the prediction problem for afterstates and see how to compute the afterstate value function and how it could be used for policy improvement. We will use the notation $W(y)$ to represent the value of afterstate $y$, while $V(s)$ still means the value of state $s$. From the earlier definitions, we can show the relationship between the state and afterstate value functions.

Recall that:

$$\begin{flalign} G_t &\doteq R_t + \gamma R_{t+1} + \cdots \\ V_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\ & = \mathbb{E}_\pi[R_t + \gamma V_\pi(S_{t+1}) \mid S_t = s] \\ &= \sum_a \pi(a \vert s) \sum_{r, s^\prime} p(r, s^\prime \vert s, a) \left ( r + \gamma V_\pi(s^\prime) \right ) \end{flalign}$$

Representing the trajectory with afterstates and only considering the reward following an afterstate, we also know that:

$$\begin{flalign} G_t &\doteq R_t + \gamma(P_{t+1} + R_{t+1} + \gamma(P_{t+2} + R_{t+2} + \cdots))\\ W_\pi(y) &\doteq \mathbb{E}_\pi[G_t \mid Y_t = y] \\ & = \mathbb{E}_\pi[R_t + \gamma \left (P_{t+1} + W_\pi(Y_{t+1}) \right ) \mid Y_t = y] \\ &= \sum_{r, s^\prime} p(r, s^\prime \vert y) \left [r + \gamma \sum_{a^\prime} \left [ \pi(a^\prime \vert s^\prime) \left ( f_2(s^\prime, a^\prime) + W_\pi(f_1(s^\prime, a^\prime)) \right ) \right ] \right ] \end{flalign}$$

Notice that, compared to the state value function, the policy only matters for this expected value when we consider the action taken from the transition state $s^\prime$. The initial transition from the afterstate to $s^\prime$ depends only on our new transition function, which is conditioned only on the afterstate.

Recall that to improve a policy $\pi$ for which we have a value function $V_\pi$, we must select the greedy policy with respect to $V_\pi$, meaning $\pi^{\prime} (s) = \mathrm{argmax}_a \sum_{r, s^\prime} p(r, s^\prime \vert s, a)(r + \gamma V_\pi(s^\prime))$. If we do not have access to the full probability transition function, we cannot compute this explicitly. Furthermore, we cannot estimate it from a single trajectory either, because from each state we would have just a single transition based on the behavior policy at the time. That's why, for MDPs that do not provide the full transition function, we prefer to estimate the state-action value function $Q(s, a)$: with that function, policy improvement is much simpler: $\pi^{\prime} (s) = \mathrm{argmax}_a Q(s, a)$.
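As a minimal sketch of why $Q$ makes policy improvement simple, the snippet below picks the greedy action for each state directly from a table of action values. The table `Q`, its layout (actions in rows, states in columns), and the name `greedy_policy` are assumptions made for illustration and are not functions defined elsewhere in this notebook.

```julia
# Hypothetical tabular action values: Q[a, s] for 3 actions and 4 states.
Q = [0.1 0.5 0.0 0.2;
     0.4 0.2 0.3 0.2;
     0.3 0.9 0.1 0.7]

# Greedy improvement: for each state (column), pick the action (row) with the
# largest estimated value. No transition function p(r, s' | s, a) is needed.
greedy_policy(Q::AbstractMatrix) = [argmax(view(Q, :, s)) for s in axes(Q, 2)]

greedy = greedy_policy(Q)
```

Doing the same with only $V_\pi$ would require the one-step lookahead sum over $p(r, s^\prime \vert s, a)$ for every action, which is exactly what is unavailable without the model.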


Maximization Bias Visualization for a Single Estimator

Maximum Number of Variables:
Maximum Number of Samples Per Variable:
Number of Runs:

Random Walk Visualization Code


Normal Actions

[Figure: windy gridworld state values in two panels, "Sarsa Solution" and "Value Iteration Solution". Each panel shows the grid of state values with legends labeled "Actions" and "Wind Values"; the wind values for the 10 columns are 0, 0, 0, 1, 1, 1, 2, 2, 1, 0.]

The following argument is taken from "Double Q-learning" by Hado van Hasselt published in Advances in Neural Information Processing Systems 23 (NIPS 2010):

Consider a set of $M$ random variables $X=\{X_1, \dots, X_M\}$. We would like to calculate:

$$\max_i \mathbb{E} \{X_i\} \tag{a}$$

Without any knowledge of the underlying distribution of each $X_i$ it is impossible to determine $(a)$ exactly. Most often we would approximate it by first constructing approximations for $\mathbb{E} \{ X_i \} \: \forall \: i$. Let $S = \bigcup_{i=1}^M S_i$ denote the set of samples, where $S_i$ is the subset containing samples for the variable $X_i$. We assume that the samples in $S_i$ are independent and identically distributed (iid). Unbiased estimates for the expected values can be obtained by computing the sample average for each variable: $\mathbb{E} \{ X_i \} = \mathbb{E} \{ \mu_i \} \approx \mu_i(S) \doteq \frac{1}{\vert S_i \vert } \sum_{s \in S_i} s$, where $\mu_i$ is an estimator for the variable $X_i$. This approximation is unbiased since every sample $s \in S_i$ is an unbiased estimate for the value of $\mathbb{E} \{ X_i \}$. The error in the approximation thus consists solely of the variance in the estimator and decreases when we obtain more samples. We use the following notation: $f_i$ denotes the probability density function (PDF) of the $i^{th}$ variable $X_i$ and $F_i(x) = \int_{-\infty}^{x} f_i(x)dx$ is the cumulative distribution function (CDF) of this PDF. Similarly, the PDF and CDF of the $i^{th}$ estimator are denoted $f_i^\mu$ and $F_i^\mu$. The maximum expected value can be expressed in terms of the underlying PDFs as $\max_i \mathbb{E} \{ X_i \} = \max_i \int_{-\infty}^\infty x f_i(x)dx$.

An obvious way to approximate the value of $(a)$ is to use the value of the maximal estimator:

$$\max_i \mathbb{E} \{ X_i \} = \max_i \mathbb{E} \{ \mu_i \} \approx \max_i \mu_i(S) \tag{b}$$

and this is the estimator employed in ordinary Q-learning. This estimator is distributed according to some PDF $f_{\max}^\mu$ that depends on the PDFs of the estimators $f_i^\mu$. To determine this PDF, consider the CDF $F_{\max}^\mu(x)$, which gives the probability that the maximum estimate is less than or equal to $x$. This probability is equal to the probability that all the estimates are less than or equal to $x$: $F_{\max}^\mu(x) \doteq P(\max_i \mu_i \leq x) = \prod_{i=1}^M P(\mu_i\leq x) \doteq \prod_{i=1}^M F_i ^\mu (x)$. The value $\max_i \mu_i(S)$ is an unbiased estimate for $\mathbb{E} \{ \max_j \mu_j \} = \int_{-\infty}^{\infty} x f_{\max}^\mu(x)dx$, which can thus be given by:

$$\mathbb{E} \{ \max_j \mu_j \} = \int_{-\infty}^{\infty} x \frac{d}{dx} \prod_{i=1}^M F_i ^ \mu (x) dx = \sum_{j=1}^M \int_{-\infty}^{\infty}x f_j ^ \mu (x) \prod_{i \neq j}^M F_i ^ \mu(x) dx \tag{c}$$

However, in $(a)$ the order of the max operator and the expectation operator is reversed. The following illustrates why $(c)$ has a positive bias.


Adding the no-movement action doesn't seem to change the shortest path of 7 steps


Exercise 6.12

Suppose action selection is greedy. Is Q-learning then exactly the same algorithm as Sarsa? Will they make exactly the same action selections and weight updates?

Consider both updates when the greedy policy is followed during training.

Sarsa Update:

$$Q_\pi(S_t, A_t) = \text{E}_\pi [R_{t+1} + \gamma Q_\pi(S_{t+1}, A_{t+1})]$$

with $A_{t+1}$ chosen by the greedy policy according to $\mathrm{argmax}_a Q_\pi(S_{t+1}, a)$ for the estimates prior to this update.

Q-Learning Update:

$$Q_\pi(S_t, A_t) = \text{E}_\pi [R_{t+1} + \gamma \text{max}_a Q_\pi(S_{t+1}, a)]$$

The value updates are identical since the Q estimate used in both cases will be based on the maximizing action at state $S_{t+1}$. In the case of Sarsa, $A_{t+1}$ has already been selected prior to this update occurring, so this value update will properly reflect the next step in the trajectory. In Q-learning, the action selection at $S_{t+1}$ will occur after the update step. Notice that we only updated $Q_\pi(S_t, A_t)$ and did not touch $Q_\pi(S_{t+1}, A_{t+1})$, so our next action selection should be unaffected by this update. However, there is one exception: the case where the state is unchanged by the transition, $S_t = S_{t+1}$. In this case, the update could actually affect the next action selection. For example, suppose a very low reward was received on this step. That would lower the estimate for the action selected on step $t$, and it may no longer be the maximizing action on step $t+1$. Sarsa would have already chosen the same action ahead of the update, but Q-learning would choose a different action on the next step even though the state is unchanged. Despite this difference, both methods are still computing the state-action value function for the optimal policy, but neither is guaranteed to converge to this function because purely greedy selection violates the assumption that all state-action pairs continue to be visited during training.
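The self-transition exception can be checked with a tiny numerical sketch. The single state, the two action values in `Q`, the step size, and the reward below are all hypothetical values chosen so that the update flips the greedy action; they are not taken from any experiment in this notebook.

```julia
# Hypothetical single state with two actions and a self-transition (S_t == S_{t+1}).
Q = [1.0, 0.9]            # Q[a] for the one state
α, γ = 0.5, 1.0
reward = -1.0             # a very low reward on this step

a_t = argmax(Q)           # greedy action at step t

# Sarsa commits to A_{t+1} before the update, using the old estimates.
a_next_sarsa = argmax(Q)
sarsa_target = reward + γ * Q[a_next_sarsa]

# Q-learning uses the max over actions, so the target is identical.
q_target = reward + γ * maximum(Q)

# Both methods therefore apply the same value update to Q(S_t, A_t).
Q_after = copy(Q)
Q_after[a_t] += α * (q_target - Q[a_t])

# Q-learning selects its next action only after the update.
a_next_q = argmax(Q_after)
```

Here the targets agree, but the updated table makes the second action greedy, so Q-learning switches actions while Sarsa is already committed to the first one.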


At each state or checkpoint you try to predict how much longer it will take to get home using any information that is relevant. Notice that regardless of how inaccurate we were on previous steps, we can still make an accurate prediction for the time to go.

| State | Elapsed Time (minutes) | Predicted Time to Go | Predicted Total Time |
|---|---|---|---|
| leaving office, friday at 6 | 0 | 30 | 30 |
| reach car, raining | 5 | 35 | 40 |
| exiting highway | 20 | 15 | 35 |
| 2ndary road, behind truck | 30 | 10 | 40 |
| entering home street | 40 | 3 | 43 |
| arriving home | 43 | 0 | 43 |

The rewards in this example are the elapsed times on each leg of the journey and there is no discounting, thus the return for each state is the actual time to go from that state. The value of each state is the expected time to go. The second column of numbers gives the current estimated value for the state encountered.

A simple way to view the operation of Monte Carlo methods is to plot the predicted total time (the last column) over the sequence. For each state we would compare that value with the actual total elapsed time, which was 43 minutes.
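The comparison described above can be sketched numerically from the table. The variable names are assumptions for illustration; the numbers are the elapsed times and predicted times to go from the driving example.

```julia
# Values from the driving-home table (minutes).
elapsed        = [0, 5, 20, 30, 40, 43]
predicted_togo = [30, 35, 15, 10, 3, 0]
actual_total   = 43

# With rewards equal to elapsed time per leg and no discounting, the return
# from each state is the actual time to go from that state.
actual_togo = actual_total .- elapsed

# Monte Carlo error for each state: return minus current value estimate.
mc_errors = actual_togo .- predicted_togo

# Equivalent view: actual total time minus predicted total time.
predicted_total = elapsed .+ predicted_togo
same_errors = actual_total .- predicted_total
```

Both views give the same per-state errors, which is why plotting the predicted total time against the 43 minute outcome shows the Monte Carlo corrections directly.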


Afterstate Implementation


To understand the origin of the bias, consider a case with two variables $X$ and $Y$, each following a standard normal distribution, where we only have a single sample from each. In this case our estimate of the maximum expected value is just $\max(x, y)$, where $x$ and $y$ are samples from $X$ and $Y$ respectively. The expected value of this estimator can be calculated using the distribution of the maximum of two standard normal random variables:

$$\mathbb{E}\left [ \text{max}(\mathcal{N}(0, 1), \mathcal{N}(0, 1)) \right ] = \frac{1}{\sqrt{\pi}} \approx 0.564$$

Indeed, on the plot for 2 variables after 1 sample collected for each, the average observed value is 0.56, and the value increases with the number of variables in our list. So our estimate has a positive bias despite the fact that all of the underlying variables have exactly the same distribution. If we had more samples for each variable, then we would use the distribution of the sample average rather than a single sample, and that distribution has a variance proportional to the inverse of the number of samples. So the bias will converge to zero in the limit of infinite samples, and in the graph the bias does in fact converge to zero as more samples are collected.
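The $1/\sqrt{\pi}$ value and the two trends described above can be checked with a short Monte Carlo sketch. The function name `single_estimator_bias`, the seed, and the run counts are assumptions chosen for illustration and are not the code behind the plot.

```julia
using Random, Statistics

# Average of the maximum over `n_vars` independent sample means, each built
# from `n_samples` standard normal draws. The true max expected value is 0,
# so the returned average is an estimate of the bias.
function single_estimator_bias(rng, n_vars, n_samples; n_runs = 200_000)
    total = 0.0
    for _ in 1:n_runs
        best = -Inf
        for _ in 1:n_vars
            best = max(best, mean(randn(rng, n_samples)))
        end
        total += best
    end
    return total / n_runs
end

rng = MersenneTwister(1234)
bias_2_vars    = single_estimator_bias(rng, 2, 1)   # theory: 1/sqrt(pi)
bias_5_vars    = single_estimator_bias(rng, 5, 1)   # more variables
bias_2_vars_16 = single_estimator_bias(rng, 2, 16; n_runs = 50_000)  # more samples
```

Under these assumptions the first value should land near $1/\sqrt{\pi}$, the second should be larger, and the third should be smaller but still positive.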

There is a method of eliminating this positive bias using a so-called double estimator, and this method was first introduced by Hado van Hasselt in a paper published during NIPS 2010. Below is a more thorough overview of the paper, but first I will provide a conceptual sketch of the proof.

First consider a set of $M$ random variables $X = \{X_1, \dots, X_M \}$ and our goal is to estimate: $\max_i \mathbb{E} \{ X_i \}$.

In the single estimator case, we draw samples from each variable and construct an unbiased estimator of each mean, $\mu_i$. After we have collected some set of samples, this method assumes that whichever estimator (or set of estimators) has the maximum value corresponds to the variable with the true maximum expected value. If there is zero overlap in the distributions of the random variables, then these estimators will always be ranked in the same order as the true expected values and our estimate will be unbiased. However, if there is any overlap in the underlying distributions (this includes the case where all distributions are identical), then there is some non-zero probability that the true maximizing index is NOT in the set of indices of the maximum estimators. Let's say the apparent maximizing index from the sample is $s^*$ while one of the true maximizing indices is $j \neq s^*$, so our final estimate for the maximum expected value will be $\mu_{s^*}$. We already know that $\mathbb{E} \{ X_j \} = \max_i \mathbb{E} \{X_i \}$ by assumption. We also know that $\mu_{s^*} > \mu_j$ in the sample and that $\mathbb{E} \{ \mu_j\} = \max_i \mathbb{E} \{X_i \}$, which is the true value that we want. So we would expect this estimator to be larger than the true answer, or equal to it in the case where the selected index is correct. This is true even if all the variables share the same distribution, because every estimate has the same expected value, which is the true answer, yet the one estimate we use to calculate the maximum is by construction at least as large as all of those unbiased alternatives. The underlying reason for the overestimate is that in any finite sample we are not guaranteed to identify the correct maximizing index, and any variable whose samples are high enough to exceed those of the true maximizer will always be selected to represent the maximum.

In the double estimator case, we split the samples into two sets $\mathcal{A}$ and $\mathcal{B}$ such that $\mathcal{A} \cap \mathcal{B} = \emptyset$, and have a set of estimators for each set: $\mu_i^\mathcal{A}$ and $\mu_i^\mathcal{B}$. Let $a^*$ be in the set of indices with the maximum estimated values in set $\mathcal{A}$. Again, if the underlying distributions overlap at all, then there is some probability that this index is not in the set of true maximizing indices. However, now if all the distributions are equal, then whichever index we pick is still guaranteed to be correct. To estimate the actual value of the maximum, we take $\mu_{a^*}^\mathcal{B}$, which is the estimate from set $\mathcal{B}$ at the maximizing index from set $\mathcal{A}$. Just like in the single estimator case, if this happens to be a correct index, then we have an unbiased estimate for the true value. However, if the index is wrong, we are estimating the expected value of a non-maximizing index from a new set of samples. By the definition of the maximizing indices, we know that in this case $\mathbb{E} \{ \mu_{a^*}^\mathcal{B} \} < \max_i \mathbb{E} \{ X_i \}$, resulting in a negative bias for our estimate. Just like in the single estimator case, this estimate will be unbiased if there is no overlap in the underlying probability distributions for each variable. Unlike the single estimator case, this estimate will also be unbiased if all the underlying distributions are equal.
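The contrast between the two estimators in the iid case can be sketched with a small simulation. Everything here is an illustrative assumption: the function name `estimator_biases`, the seed, the sample sizes, and the choice to give the single estimator the pooled samples from both sets so that the two methods see the same amount of data.

```julia
using Random, Statistics

# Compare the single and double estimators of max_i E[X_i] when every X_i is
# standard normal, so the true answer is 0 and the averages estimate the bias.
function estimator_biases(rng, n_vars, n_samples; n_runs = 100_000)
    single_total, double_total = 0.0, 0.0
    for _ in 1:n_runs
        # Sample means per variable from two disjoint sample sets A and B.
        μA = [mean(randn(rng, n_samples)) for _ in 1:n_vars]
        μB = [mean(randn(rng, n_samples)) for _ in 1:n_vars]
        # Single estimator: pool both sets (equal sizes), take the max mean.
        single_total += maximum((μA .+ μB) ./ 2)
        # Double estimator: choose the index with A, evaluate it with B.
        double_total += μB[argmax(μA)]
    end
    return single_total / n_runs, double_total / n_runs
end

rng = MersenneTwister(2010)
single_bias, double_bias = estimator_biases(rng, 5, 2)
```

With identical distributions, the index chosen from $\mathcal{A}$ is independent of the samples in $\mathcal{B}$, so the double estimator should average close to zero while the single estimator stays clearly positive.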

See below for a visualization of the bias removal for the iid case as well as the more formal proof for both methods.


The double estimator methods are the only ones that don't show an initial increase in the number of episodes. After enough time, though, every method starts to converge to the policy that takes a direct path. If $\alpha$ is not low enough, Q-learning fails to converge towards the optimal policy and has diverging value estimates. Both double methods are very stable and correctly estimate every state to have a negative value.


Returning to the definition of $\eta_t$, we can simplify further:

$$\eta_{t} \doteq V_{t+1}(S_{t+1}) - V_t(S_{t+1})$$

This quantity is the change in value estimate at a state between two time steps. Note that at time $t+1$ we have only performed an update for the value at state $S_t$ using the equation:

$$V_{t+1}(S_t) = V_t(S_t) + \alpha \delta_t$$

If $S_{t+1} \neq S_t$, then no update to the value estimate at $S_{t+1}$ occurs on time step $t$, so $V_{t+1}(S_{t+1}) = V_t(S_{t+1}) \implies \eta_{t} = 0$

The only case in which $V_{t+1}(S_{t+1}) \neq V_t(S_{t+1})$ is when $S_t = S_{t+1} = S$. In this case, $V_{t+1}(S) = V_t(S) + \alpha \delta_t \implies V_{t+1}(S) - V_t(S) = \alpha \delta_t$

So we can rewrite $\eta_{t} = \alpha \delta_t \mathbb{1}_{t}$ where $\mathbb{1}_{t} = \begin{cases} 1 & \text{if } S_{t+1} = S_t \\ 0 & \text{otherwise} \end{cases}$

So the original equation can be written as:

$$\begin{flalign} G_t - V_t(S_t) &= \sum_{k=t}^{T-1} \gamma^{k-t} (\delta_k + \gamma \alpha \delta_k \mathbb{1}_k) \\ &= \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k (1 + \gamma \alpha \mathbb{1}_k) \\ \end{flalign}$$

Here the first term in the parentheses is the value from the original derivation, and the second term is non-zero only when a state appears twice consecutively in an episode.
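The identity above can be verified numerically on a toy episode. This is a minimal sketch under stated assumptions: the function name `check_td_error_identity`, the three nonterminal states with uniform random transitions (self-loops allowed), the random rewards, and the initial values are all hypothetical and not taken from the notebook.

```julia
using Random

# Hypothetical toy problem: states 1:3 are nonterminal, state 4 is terminal.
# Runs TD(0) online during one episode and accumulates both sides of the identity
# for t = 0.
function check_td_error_identity(; α = 0.1, γ = 0.9, rng = MersenneTwister(2))
    V = [0.5, -0.3, 0.8, 0.0]      # arbitrary initial values; V[4] (terminal) stays 0
    s = 1
    v_start = V[s]                 # V_t(S_t) at t = 0
    G, rhs, discount = 0.0, 0.0, 1.0
    while s != 4
        s_next = rand(rng, 1:4)    # uniform transitions, self-loops allowed
        r = randn(rng)             # random reward for this step
        δ = r + γ * V[s_next] - V[s]                         # TD error with the current table V_k
        G += discount * r                                    # accumulate the return G_0
        rhs += discount * δ * (1 + γ * α * (s_next == s))    # γ^k δ_k (1 + γ α 1_k)
        V[s] += α * δ              # TD(0) update performed during the episode
        discount *= γ
        s = s_next
    end
    return G - v_start, rhs
end

lhs, rhs = check_td_error_identity()
```

The two returned values should agree up to floating point error, including on episodes where a state is visited twice in a row.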


Example 6.5: Windy Gridworld


6.5 Q-learning: Off-policy TD Control

One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning (Watkins, 1989), defined by

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$$
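As an illustration of the update rule, here is a minimal tabular sketch. The helper name `q_learning_update!`, the `Q[action, state]` matrix layout, and the `terminal` keyword are assumptions made for this example and may differ from how the notebook's own implementation is organized.

```julia
# Hypothetical tabular helper; Q is a matrix indexed as Q[action, state].
function q_learning_update!(Q::Matrix{Float64}, s::Int, a::Int, r::Float64, s_next::Int;
                            α = 0.5, γ = 1.0, terminal = false)
    # The target bootstraps from the greedy action value in the next state,
    # regardless of which action the behavior policy actually takes there.
    target = terminal ? r : r + γ * maximum(view(Q, :, s_next))
    Q[a, s] += α * (target - Q[a, s])
    return Q
end

Q = zeros(2, 3)                          # 2 actions, 3 states
Q[:, 2] .= [1.0, 3.0]                    # action values in state 2
q_learning_update!(Q, 1, 1, -1.0, 2)     # Q[1, 1] moves halfway toward -1 + 3 = 2
```

Because the target uses the maximum over next-state actions rather than the action actually selected, the learned values approximate $q_*$ independently of the policy being followed, which is what makes the method off-policy.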


Exercise 6.9 Windy Gridworld with King's Moves (programming)

Re-solve the windy gridworld assuming eight possible actions, including the diagonal moves, rather than four. How much better can you do with the extra actions? Can you do even better by including a ninth action that causes no movement at all other than that caused by the wind?


Sarsa Implementation


Example 6.1: Driving Home


Step: 1 / 17


Figure 6.5:


With the value estimates initialized to 0, the RMS error converges monotonically to its minimum, since the state values never pass through the correct values on their way to overshooting.


Let $X = \{ X_1, \dots, X_M \}$ be a set of random variables and let $\mu^A = \{\mu_1^A, \dots, \mu_M^A \}$ and $\mu^B = \{\mu_1^B, \dots, \mu_M^B\}$ be two sets of unbiased estimators such that $\mathbb{E} \{ \mu_i^A \} = \mathbb{E} \{ \mu_i^B \} = \mathbb{E} \{ X_i \}$ for all $i$. Let $\mathcal{M} \doteq \left \{ j \mid \mathbb{E} \{ X_j \} = \max_i \mathbb{E} \{ X_i \} \right \}$ be the set of indices of the variables that maximize the expected values of $X$. Let $a^*$ be an index that maximizes $\mu^A$: $\mu_{a^*}^A = \max_i \mu_i^A$. The claim is that:

$$\mathbb{E} \{ \mu_{a^*}^B \} = \mathbb{E} \{ X_{a^*} \} \leq \max_i \mathbb{E} \{ X_i \}$$

Furthermore, the inequality is strict if and only if $P(a^* \notin \mathcal{M}) \gt 0$.

Proof. Assume $a^* \in \mathcal{M}$. Then $\mathbb{E} \{ \mu_{a^*}^B\} = \mathbb{E} \{ X_{a^*}\} \doteq \max_i \mathbb{E} \{ X_i \}$. Now assume $a^* \notin \mathcal{M}$ and choose $j \in \mathcal{M}$. Then $\mathbb{E} \{ \mu_{a^*}^B \} = \mathbb{E} \{ X_{a^*}\} \lt \mathbb{E} \{ X_j \} \doteq \max_i \mathbb{E} \{ X_i \}$. These two possibilities are mutually exclusive, so the combined expression can be written as:

$$\begin{flalign} \mathbb{E} \{ \mu_{a^*}^B \} &= P(a^* \in \mathcal{M}) \mathbb{E} \{ \mu_{a^*}^B \vert a^* \in \mathcal{M} \} + P(a^* \notin \mathcal{M}) \mathbb{E} \{ \mu_{a^*}^B \vert a^* \notin \mathcal{M} \} \\ &= P(a^* \in \mathcal{M}) \max_i \mathbb{E} \{X_i \} + P(a^* \notin \mathcal{M}) \mathbb{E} \{ \mu_{a^*}^B \vert a^* \notin \mathcal{M} \} \\ &\leq P(a^* \in \mathcal{M}) \max_i \mathbb{E} \{X_i \} + P(a^* \notin \mathcal{M}) \max_i \mathbb{E} \{ X_i \} \\ &=\max_i \mathbb{E} \{ X_i \} \end{flalign}$$

The inequality is strict if and only if $P(a^* \notin \mathcal{M}) \gt 0$, where $\mathcal{M}$ is the true set of maximizing variables. This happens when variables have different expected values but their distributions overlap. In contrast with the single estimator, the double estimator is unbiased when the variables are iid, since then all expected values are equal and $P(a^* \in \mathcal{M}) = 1$.


Value Iteration Results for Jack's Car Rental


Jack's Car Rental Code


Exercise 6.7

Design an off-policy version of the TD(0) update that can be used with arbitrary target policy $\pi$ and covering behavior policy $b$, using at each step $t$ the importance sampling ratio $\rho_{t:t}$ (5.3).

Recall that equation 5.3 defines:

$$\rho_{t:T-1} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$

with the property that:

$$\mathbb{E}[\rho_{t:T-1}G_t \mid S_t = s] = v_\pi(s)$$

when $G_t$ is generated by the behavior policy.

The TD(0) update rule is given by:

$$V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$$

based on the following form of the Bellman equation:

$$v_\pi (s)=\mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$

In the off-policy case, the reward $R_{t+1}$ and the subsequent state $S_{t+1}$ would be generated from the behavior policy, but the subsequent value would still be based on the target policy value function. Consider instead the quantity $q_\pi(s, a) = \mathbb{E} [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = a]$, where we have removed the policy from the expectation since nothing in the bracket depends on sampling from the policy. Even if we choose actions $a$ based on a behavior policy that differs from the target policy, these estimates will be correct because we are directly calculating the value for choosing that action, regardless of what its probability is. Suppose we are following some behavior policy $b$ and recall that:

$$\begin{flalign} v_\pi(s) &= \sum_a \pi(a \vert s) q_\pi (s, a) \\ &= \sum_a \pi(a \vert s) \mathbb{E} [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = a]\\ &= \mathbb{E}_\pi [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s]\\ v_b(s) &= \sum_a b(a \vert s) q_\pi (s, a) \\ &= \sum_a b(a \vert s) \mathbb{E} [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = a] \\ &= \mathbb{E}_b [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s]\\ \end{flalign}$$

In the TD(0) update we do not calculate this expected value directly but instead average samples together that are drawn from the target policy. This sampling will produce samples weighted by the target policy probabilities thus mimicking the expected value sum. If instead, our samples are drawn from the behavior policy, then the samples will mimic the behavior policy probability weights instead of the target policy. So in order to correctly calculate the expected value we must multiply each behavior policy sample by $\frac{\pi(a \vert s)}{b(a \vert s)} = \frac{\pi(A_t \vert S_t)}{b(A_t \vert S_t)} = \rho_{t:t}$ resulting in the following update rule:

$$V(S_t) \leftarrow V(S_t) + \alpha [\rho_{t:t} \left ( R_{t+1} + \gamma V(S_{t+1}) \right ) - V(S_t)]$$
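A minimal sketch of this update for a tabular value function is given below. The function name `off_policy_td0_update!` and the convention of storing both policies as matrices of probabilities indexed `[action, state]` are assumptions made for illustration, not part of the exercise.

```julia
# Hypothetical tabular helper; π_target[a, s] and b[a, s] hold action probabilities.
# The behavior policy b is assumed to cover π_target, so b[a, s] > 0 for any
# action it can actually select.
function off_policy_td0_update!(V::Vector{Float64}, s::Int, a::Int, r::Float64, s_next::Int,
                                π_target::Matrix{Float64}, b::Matrix{Float64};
                                α = 0.1, γ = 1.0)
    ρ = π_target[a, s] / b[a, s]                      # single-step ratio ρ_{t:t}
    V[s] += α * (ρ * (r + γ * V[s_next]) - V[s])      # reweighted TD(0) target
    return V
end

V = [0.0, 1.0]
π_target = [1.0 1.0; 0.0 0.0]     # target policy always takes action 1
b = fill(0.5, 2, 2)               # behavior policy picks either action uniformly
off_policy_td0_update!(V, 1, 1, 1.0, 2, π_target, b)   # ρ = 2, so V[1] becomes 0.1 * 4 = 0.4
```

When the behavior policy selects an action the target policy never takes, $\rho_{t:t} = 0$ and the sampled target contributes nothing, so the update only shrinks $V(S_t)$ by the factor $1 - \alpha$.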

Figure: Sarsa Solution and Value Iteration Solution panels. Each panel shows a grid of value estimates with the selected actions, annotated with the wind values 0, 0, 0, 1, 1, 1, 2, 2, 1, 0.

Sarsa vs Q-learning vs Expected Sarsa Performance on Cliff Walking Example


6.7 Maximization Bias and Double Learning

All the control algorithms that we have discussed so far involve maximization in the construction of their target policies. For example, in Q-learning the target policy is the greedy policy given the current action values, which is defined with a max, and in Sarsa the policy is often $\epsilon$-greedy, which also involves a maximization operation. In these algorithms, a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to significant positive bias. To see why, consider a single state $s$ where there are many actions $a$ whose true values, $q(s, a)$, are all zero, but whose estimated values, $Q(s, a)$, are uncertain and thus distributed some above and some below zero. The maximum of the true values is zero, but the maximum of the estimates is positive, a positive bias. We call this maximization bias.

To elaborate on the bias, consider just two random variables $X \sim \mathcal{N}(\theta_1, 1)$ and $Y \sim \mathcal{N}(\theta_2, 1)$. We would like to estimate $\text{max} \left ( \mathbb{E}[X], \mathbb{E}[Y] \right ) = \text{max}(\theta_1, \theta_2)$ and, using the approach analogous to our learning algorithms, we would calculate $\max(\overline{X}, \overline{Y}) = \text{max} \left ( \sum_{i=1}^N \frac{x_i}{N}, \sum_{i=1}^M \frac{y_i}{M} \right )$. The problem with this approach is that for small numbers of samples the variance of each estimator is high, and we are using the same estimates both to select which random variable has the higher expected value and to estimate what that value is. Empirically, this results in a positive bias which gets worse the more variables we consider, as illustrated in the plot below.
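
The positive bias can be checked with a small simulation. The sketch below is an illustration rather than the notebook's own experiment: it assumes every variable has true mean 0 (so the quantity being estimated is exactly 0), and the helper name `mean_of_max` is hypothetical.

```julia
using Random, Statistics

# Average of max(sample means) over many trials, for k variables that all
# have true mean 0 and unit variance, each estimated from n samples.
# The true maximum of the means is 0, so any positive average is bias.
function mean_of_max(rng::AbstractRNG, k::Int, n::Int; trials::Int = 10_000)
    total = 0.0
    for _ in 1:trials
        sample_means = [mean(randn(rng, n)) for _ in 1:k]
        total += maximum(sample_means)
    end
    return total / trials
end

rng = MersenneTwister(0)
bias_2 = mean_of_max(rng, 2, 5)    # two variables, as in the text
bias_10 = mean_of_max(rng, 10, 5)  # more variables, larger positive bias
```

Both averages come out above zero, and the one for ten variables is the larger of the two, which matches the qualitative claim in the paragraph above.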


Load Existing File if Present:


Exercise 6.5

In the right graph of the random walk example, the RMS error of the TD method seems to go down and then up again, particularly at high $\alpha$’s. What could have caused this? Do you think this always occurs, or might it be a function of how the approximate value function was initialized?

Since the value function was initialized at the correct value for the center state, all of the values to the right must be increased and the values to the left must be decreased to reach the true values. Episodes that terminate on the right receive a reward of 1 and push up the rightmost estimate, while episodes that terminate on the left receive a reward of 0 and push down the leftmost estimate. The correct values for the leftmost and rightmost states are $\frac{1}{6}$ and $\frac{5}{6}$ respectively. Since there is an equal probability of exiting the walk on the right or the left, both ends of the value estimates will be updated at roughly the same rate. That means that both ends of the chain will move towards the correct values at about the same time, and if those updates stay somewhat synchronized, all of the states will pass through their correct values at a similar time. What happens if $\alpha=0.15$ at the time when the values are roughly accurate? Consider an update for state E on a transition that exits to the right, assuming the estimate is already the correct value: $V(E) \leftarrow \frac{5}{6} + 0.15[1 - \frac{5}{6}] \approx 0.858 \gt \frac{5}{6}$. A similar effect happens with state A, pushing it below the correct value. The larger $\alpha$ is, the more each such transition over-corrects, and the feedback from the neighboring states won't be enough to bring the estimate back to the correct value. Since we pass through or very close to the correct values on the way, the error reaches a minimum before the estimates overshoot or undershoot.

If we had instead initialized the state values at 0, then the estimate at A would already be too low and would not get corrected until information from the right side propagated through. State E, however, would receive large updates for each episode that exits to the right, while the values for the states to its left would still be too low. Since the state value estimates are not moving symmetrically, we won't have the same synchronized pass through the minimum error: at the time the E estimate is correct, A will still have high error. In this case, we are more likely to see the error continue to fall as more updates occur. Below is a visualization of the state estimates at different stages in the training with the original initialization and a 0 initialization. In the 0 case, you can see the left-side estimates take a long time to reach the correct values, but with the original initialization, all the estimates approach the correct values roughly together.


Example 6.7: Maximization Bias Example

Consider an MDP with two non-terminal states A and B. Episodes always start in state A and there are two actions, left and right. Choosing right will always result in a reward of 0 and the episode terminating. Choosing left will transition into state B, from which there are many actions, all of which result in a terminal transition with random rewards. The distribution of rewards for each of these actions is $\mathcal{N}(-0.1, 1)$. The estimated value of (A, right) will always be 0 since that is the only possible sample to be collected. The estimated value of (A, left), however, will have higher variance around an expected value of -0.1. The problem with Q-learning is that, due to the maximization bias, (A, left) will have a higher value estimate when few samples have been collected, since it is very likely that one of the state-action pairs from B will have produced a reward greater than 0. The more of these actions exist, the worse the bias and the more samples must be collected to remove it. If we employ Double Q-learning instead, however, we can remove this positive bias, because one set of estimates selects the maximizing action while an independent set evaluates it.
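
The idea behind double learning can be illustrated on state B alone. The sketch below is not Double Q-learning itself; it only compares the single (max) estimator with a double estimator on sample means, assuming 10 actions and 5 reward samples per action per estimate. The function name `compare_estimators` and these sizes are hypothetical.

```julia
using Random, Statistics

# Compare the single (max) estimator with the double estimator for the value
# of state B, where every action has true mean reward -0.1.
function compare_estimators(rng::AbstractRNG; n_actions::Int = 10, n::Int = 5,
                            trials::Int = 20_000)
    single = 0.0
    double = 0.0
    for _ in 1:trials
        # Two independent sets of sample-mean estimates, Q1 and Q2.
        Q1 = [mean(-0.1 .+ randn(rng, n)) for _ in 1:n_actions]
        Q2 = [mean(-0.1 .+ randn(rng, n)) for _ in 1:n_actions]
        single += maximum(Q1)      # the same estimate selects and evaluates
        double += Q2[argmax(Q1)]   # Q1 selects the action, Q2 evaluates it
    end
    return single / trials, double / trials
end

single_est, double_est = compare_estimators(MersenneTwister(0))
```

Under these assumptions the single estimator averages well above 0, while the double estimator stays close to the true value of -0.1, so (A, left) would no longer look better than (A, right).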


6.8 Games, Afterstates, and Other Special Cases

In the tic-tac-toe example we considered learning a value function for a state after the player's move but before the opponent's response. This type of state is called an afterstate, and it is useful when we know one portion of the dynamics of an environment while the rest is stochastic or unknown. For example, we typically know the immediate effect of our moves, but not necessarily what happens after that.

It can be more efficient to learn based on afterstates because there are fewer values to represent than if we need to learn the full action value function. All state-action pairs that map to the same afterstate are represented by a single value. These afterstate value functions can also be learned with generalized policy iteration.
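
A small tic-tac-toe sketch shows two different state-action pairs collapsing to one afterstate. The board encoding (a 9-tuple of characters) and the helper `afterstate` are assumptions for illustration only.

```julia
# Hypothetical encoding: a board is a 9-tuple of Chars, with ' ' for an empty cell.
# The afterstate is the board immediately after our own mark is placed.
afterstate(board, pos, mark) = ntuple(i -> i == pos ? mark : board[i], 9)

board_a = ('X', ' ', ' ', ' ', 'O', ' ', ' ', ' ', ' ')  # X already in cell 1
board_b = (' ', ' ', ' ', ' ', 'O', ' ', ' ', ' ', 'X')  # X already in cell 9

# Two different state-action pairs...
as_a = afterstate(board_a, 9, 'X')
as_b = afterstate(board_b, 1, 'X')

# ...index the same entry of an afterstate value table.
afterstate_values = Dict{NTuple{9, Char}, Float64}()
afterstate_values[as_a] = 0.5
same_entry = haskey(afterstate_values, as_b)
```

An action value table would need two separate entries for these pairs, whereas the afterstate table needs only one.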

Sarsa backup diagram. Black circles represent actions and white circles represent states.

The following represents a trajectory taken by a policy in an environment. We seek to estimate $q_\pi(s, a)$ for the current behavior policy $\pi$ using the same TD method we introduced above. The update rule now, however, estimates the values of state-action pairs rather than of the states themselves.
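
As a minimal sketch, the standard Sarsa update $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$ could be written as follows. The function name `sarsa_update!` and the `Q[action, state]` matrix layout are assumptions, not the notebook's own implementation.

```julia
# One Sarsa update for the transition (s, a, r, s_next, a_next).
# Q is a matrix of action-value estimates indexed as Q[action, state];
# this layout and all names are hypothetical.
function sarsa_update!(Q::Matrix{Float64}, s::Int, a::Int, r::Float64,
                       s_next::Int, a_next::Int; α::Float64 = 0.1, γ::Float64 = 1.0)
    td_error = r + γ * Q[a_next, s_next] - Q[a, s]
    Q[a, s] += α * td_error
    return Q
end

Q = zeros(2, 3)   # 2 actions, 3 states
Q[1, 2] = 1.0
sarsa_update!(Q, 1, 2, -1.0, 2, 1; α = 0.5, γ = 0.9)
```

The update uses the action actually selected in the next state, which is what makes the estimate track the current policy.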


Figure: Q-learning Solution (displayed values: 380.0 repeated eight times, then 0.0) and Double Q-learning Solution (displayed values: -1.2, -1.2, -0.59, -0.43, -0.75, -0.26, 0.014, -0.053, 0.0). Each panel is annotated with actions and wind values of 0, 0, 0.


Because TD(0) bases its update in part on an existing estimate, we say that it is a bootstrapping method, like DP. We know from Chapter 3 that

$$\begin{flalign} v_\pi(s) & \doteq \mathbb{E}_\pi[G_t \mid S_t = s] \tag{6.3}\\ &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \tag{from (3.9)}\\ &=\mathbb{E}_\pi[R_{t+1} + \gamma v_\pi (S_{t+1}) \mid S_t = s] \tag{6.4} \end{flalign}$$

Roughly speaking, Monte Carlo methods use an estimate of (6.3) as a target whereas DP methods use an estimate of (6.4) as a target. The Monte Carlo target is an estimate because the expected value in (6.3) is not known; a sample return is used in place of the real expected return. The DP target is an estimate not because of the expected values, which are assumed to be completely provided by a model of the environment, but because $v_\pi(S_{t+1})$ is not known and the current estimate, $V(S_{t+1})$, is used instead. The TD target is an estimate for both reasons: it samples the expected values in (6.4) and it uses the current estimate $V$ instead of the true $v_\pi$. Thus, TD methods combine the sampling of Monte Carlo with the bootstrapping of DP.

TD and Monte Carlo updates are both referred to as sample updates because they involve looking ahead to a sample successor state (or state-action pair). Expected updates, used in DP methods, use the complete distribution of all possible successor states rather than a single sample.

Note that the quantity in the brackets in (6.2) is a sort of error, measuring the difference between the estimated value of $S_t$ and the better estimate $R_{t+1} + \gamma V(S_{t+1})$. This quantity is called the TD error:

$$\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \tag{6.5}$$

The TD error depends on the subsequent state so it is not available until one step later. That is to say $\delta_t$ is not known until time $t+1$. Also note that if we do not update $V$ during an episode (as we do not in Monte Carlo methods), then the Monte Carlo error can be written as the sum of TD errors:

$$\begin{flalign} G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) \tag{from (3.9)} \\ &=\delta_t + \gamma(G_{t+1} - V(S_{t+1})) \tag{a}\\ &=\delta_t + \gamma \left ( \delta_{t+1} + \gamma(G_{t+2} - V(S_{t+2})) \right ) \tag{using (a)}\\ &=\delta_t + \gamma \delta_{t+1} + \gamma^2 \left ( G_{t+2} - V(S_{t+2}) \right ) \\ &=\delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}(G_T - V(S_T)) \tag{applying (a) until termination}\\ &=\delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}(0-0) \tag{definition of terminal state}\\ &=\sum_{k=t}^{T-1} \gamma^{k-t} \delta_k \tag{6.6} \end{flalign}$$

This identity is not exact if $V$ is updated during the episode (as it is in TD(0)), but if the step size is small then it may still hold approximately.
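
Identity (6.6) can be checked numerically when $V$ is held fixed for the whole episode. The sketch below uses a randomly generated episode; the episode length, number of states, and all variable names are hypothetical choices for illustration.

```julia
using Random

# Check identity (6.6) on one random episode with V held fixed.
rng = MersenneTwister(1)
γ = 0.9
T = 6                                  # episode length (hypothetical)
states = rand(rng, 1:4, T)             # S_0 ... S_{T-1}, drawn from 4 non-terminal states
rewards = randn(rng, T)                # R_1 ... R_T
V = randn(rng, 4)                      # fixed value estimates
value(k) = k > T ? 0.0 : V[states[k]]  # the terminal state has value 0

# TD errors: reward plus discounted next value minus current value (1-based index k).
δ = [rewards[k] + γ * value(k + 1) - value(k) for k in 1:T]

# Monte Carlo error from the first step, and the discounted sum of TD errors.
G = sum(γ^(k - 1) * rewards[k] for k in 1:T)
mc_error = G - value(1)
td_sum = sum(γ^(k - 1) * δ[k] for k in 1:T)
```

With a fixed `V`, `mc_error` and `td_sum` agree up to floating-point rounding, because the intermediate value terms telescope.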


Exercise 6.1

If $V$ changes during the episode, then (6.6) only holds approximately; what would the difference be between the two sides? Let $V_t$ denote the array of state values used at time $t$ in the TD error (6.5) and in the TD update (6.2). Redo the derivation above to determine the additional amount that must be added to the sum of TD errors in order to equal the Monte Carlo error.


Now we can rewrite the Monte Carlo error using (3.9) again and proceed with the derivation, keeping track of the time index of the value estimates:

$$\begin{flalign} G_t - V_t(S_t) &= R_{t+1} + \gamma G_{t+1} - V_t(S_t) + \gamma V_{t}(S_{t+1}) - \gamma V_{t}(S_{t+1}) \tag{from (3.9)}\\ &= \delta_t + \gamma \left [ G_{t+1} - V_t(S_{t+1}) \right ] \\ &= \delta_t + \gamma \left [ G_{t+1} - V_{t+1}(S_{t+1}) + V_{t+1}(S_{t+1}) - V_t(S_{t+1}) \right ] \\ \end{flalign}$$

Define the following

$$\eta_{t} \doteq V_{t+1}(S_{t+1}) - V_t(S_{t+1})$$

which lets us rewrite the equation as

$$G_t - V_t(S_t) = \delta_t + \gamma \eta_{t} + \gamma \left [ G_{t+1} - V_{t+1}(S_{t+1})\right ]$$

Notice that the term in the brackets is equivalent to the left-hand side but shifted forward one time step. That implies the equation can be expanded recursively as we did with the original derivation.
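
One way to carry out that recursive expansion is sketched here, assuming $\delta_k \doteq R_{k+1} + \gamma V_k(S_{k+1}) - V_k(S_k)$ as in the lines above and that the terminal state has value zero under every $V_k$, so that $G_T - V_T(S_T) = 0$:

$$\begin{flalign} G_t - V_t(S_t) &= \delta_t + \gamma \eta_t + \gamma \left [ \delta_{t+1} + \gamma \eta_{t+1} + \gamma \left ( G_{t+2} - V_{t+2}(S_{t+2}) \right ) \right ] \\ &= \sum_{k=t}^{T-1} \gamma^{k-t} \left ( \delta_k + \gamma \eta_k \right ) + \gamma^{T-t} \left ( G_T - V_T(S_T) \right ) \\ &= \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k + \sum_{k=t}^{T-1} \gamma^{k-t+1} \left [ V_{k+1}(S_{k+1}) - V_k(S_{k+1}) \right ] \end{flalign}$$

Under these assumptions, the additional amount is the second sum, which vanishes when $V$ is not updated during the episode and so recovers (6.6).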


Dependencies and Settings


Formal Proof for Bias

Cell output: FiniteAfterstateMDP{Float32, Tuple{Int64, Int64}, Tuple{Int64, Int64}, Int64} with fields states, afterstates, actions (Int64 values -5 to 5), rewards (Float32 values 0.0, 10.0, ..., 380.0), ptf (441×39×441 Array{Float32, 3}), afterstate_map (11×441 Matrix{Int64}), reward_interim_map (11×441 Matrix{Float32}), state_index (Dict{Tuple{Int64, Int64}, Int64}), and afterstate_index.
Int64}elementselements11text/plain17text/plaintypeTupleobjectid49ec9371b177a25d!application/vnd.pluto.tree+object249text/plainelements16text/plain14text/plaintypeTupleobjectidd93d095a02371a59!application/vnd.pluto.tree+object351text/plainelements18text/plain16text/plaintypeTupleobjectidaeb6f295858259db!application/vnd.pluto.tree+object395text/plainelements17text/plain12text/plaintypeTupleobjectid68544eea78f6641!application/vnd.pluto.tree+object370text/plainelements8text/plain15text/plaintypeTupleobjectidceff527f41a09840!application/vnd.pluto.tree+object184text/plainelements16text/plain16text/plaintypeTupleobjectid3164689f12bc7404!application/vnd.pluto.tree+object353text/plainelements19text/plain14text/plaintypeTupleobjectidcb90bf273945b2c8!application/vnd.pluto.tree+object414text/plainelements7text/plain18text/plaintypeTupleobjectidf3c6affef4f32144!application/vnd.pluto.tree+object166text/plainelements7text/plain8text/plaintypeTupleobjectid300559d2f34a9666!application/vnd.pluto.tree+object156text/plainelements14text/plain15text/plaintypeTupleobjectidac753ed572b44c1d!application/vnd.pluto.tree+object310text/plainmoretypeDictprefix_shortDictobjectida7eb00b000659f24!application/vnd.pluto.tree+objectaction_indexprefixDict{Int64, Int64}elements5text/plain11text/plain-3text/plain3text/plain1text/plain7text/plain0text/plain6text/plain4text/plain10text/plain-5text/plain1text/plain-1text/plain5text/plain2text/plain8text/plain-2text/plain4text/plain-4text/plain2text/plainmoretypeDictprefix_shortDictobjectid2c82fa6389959ad1!application/vnd.pluto.tree+objecttypestructprefix_shortFiniteAfterstateMDPobjectide4c5edf99f3d49c7mime!application/vnd.pluto.tree+objectrootassigneeconst jacks_car_afterstate_mdplast_run_timestampA y persist_js_state·has_pluto_hook_features§cell_id$7de9b6a4-49ce-4dc3-9d5b-cecfcb98bba1depends_on_disabled_cells§runtimeΰpublished_object_keysdepends_on_skipped_cells§errored$c4719c42-87aa-482a-95aa-a1492d42835dqueued¤logsrunning¦outputbody:

Stochastic Gridworld

mimetext/htmlrootassigneelast_run_timestampA ޻Qpersist_js_state·has_pluto_hook_features§cell_id$c4719c42-87aa-482a-95aa-a1492d42835ddepends_on_disabled_cells§runtimePpublished_object_keysdepends_on_skipped_cells§errored$495f5606-0567-47ad-a266-d21320eecfc6queued¤logsrunning¦outputbodyx

Monte Carlo nonstationary update rule for value function

$$V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)] \tag{6.1}$$

where $G_t$ is the actual return following time $t$, and $\alpha$ is a constant step-size parameter. Call this method constant-α MC. Using a constant step size α instead of the usual sample average is what makes this estimation method suitable for nonstationary problems. Because the update needs $G_t$, it cannot be made until the episode has ended and the full return is known.

In contrast, TD methods need only wait for results from the following timestep to perform an update. The following is the simplest TD method update rule:

$$V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] \tag{6.2}$$

where the update can be made immediately on transition to $S_{t+1}$ after receiving $R_{t+1}$. This TD method is called $TD(0)$, or one-step TD. See below for code implementing this.
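As a minimal sketch of the two update rules, the following Julia functions apply (6.1) and (6.2) to a plain vector of value estimates. The names `mc_update!` and `td0_update!`, the integer state indexing, and the `terminal` keyword are assumptions made for illustration and are not the notebook's own implementation.

```julia
# Hypothetical helpers; V is a vector of state-value estimates indexed by state.

# Constant-α MC update (6.1): needs the full return G observed after visiting s.
function mc_update!(V::Vector{Float64}, s::Int, G::Float64, α::Float64)
    V[s] += α * (G - V[s])
    return V
end

# TD(0) update (6.2): needs only the next reward r and the next state s_next.
# `terminal` marks s_next as terminal, whose value is taken to be zero.
function td0_update!(V::Vector{Float64}, s::Int, r::Float64, s_next::Int,
                     α::Float64, γ::Float64; terminal::Bool = false)
    target = r + (terminal ? 0.0 : γ * V[s_next])
    V[s] += α * (target - V[s])
    return V
end

V = fill(0.5, 5)
td0_update!(V, 3, 0.0, 4, 0.1, 1.0)   # target equals V[4], so V[3] is unchanged
mc_update!(V, 5, 1.0, 0.1)            # V[5] moves 10% of the way toward G = 1
```

The TD(0) call can run as soon as one transition has been observed, while the MC call has to wait until `G` is available at the end of the episode.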

mimetext/htmlrootassigneelast_run_timestampA ްspersist_js_state·has_pluto_hook_features§cell_id$495f5606-0567-47ad-a266-d21320eecfc6depends_on_disabled_cells§runtimecpublished_object_keysdepends_on_skipped_cells§errored$0a4ed8c7-27ca-45cb-af15-70ddd86240fbqueued¤logsrunning¦outputbodyL

Batch Method Estimation Implementation

mimetext/htmlrootassigneelast_run_timestampA ޹Mpersist_js_state·has_pluto_hook_features§cell_id$0a4ed8c7-27ca-45cb-af15-70ddd86240fbdepends_on_disabled_cells§runtimespublished_object_keysdepends_on_skipped_cells§errored$cdedd35e-52b8-40a5-938d-2d36f6f93217queued¤logsrunning¦outputbody
Actions
mimetext/htmlrootassigneeconst king_action_displaylast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$cdedd35e-52b8-40a5-938d-2d36f6f93217depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$3756a3f8-18e8-4d62-afa1-cfeb4183820cqueued¤logsrunning¦outputbody6double_expected_sarsa (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA cpersist_js_state·has_pluto_hook_features§cell_id$3756a3f8-18e8-4d62-afa1-cfeb4183820cdepends_on_disabled_cells§runtime]published_object_keysdepends_on_skipped_cells§errored$04a0be81-ee5f-4eeb-963a-ad930392d50bqueued¤logsrunning¦outputbodyZ
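Sarsa Solution

Wind values by column (x = 1 to 10): 0, 0, 0, 1, 1, 1, 2, 2, 1, 0

| y \ x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -15.0 | -14.0 | -14.0 | -13.0 | 0.0 | 0.0 | 0.0 | 0.0 | -2.0 | -3.0 |
| 2 | -16.0 | -15.0 | -14.0 | -13.0 | -12.0 | 0.0 | 0.0 | -2.1 | -2.3 | -4.0 |
| 3 | -16.0 | -15.0 | -14.0 | -14.0 | -13.0 | -11.0 | 0.0 | -1.0 | -1.0 | -2.0 |
| 4 | -17.0 | -16.0 | -15.0 | -14.0 | -13.0 | -12.0 | -9.9 | 0.0 | -5.6 | -3.0 |
| 5 | -17.0 | -16.0 | -15.0 | -13.0 | -13.0 | -12.0 | -11.0 | -5.9 | -6.9 | -4.6 |
| 6 | -16.0 | -16.0 | -15.0 | -14.0 | -12.0 | -12.0 | -12.0 | -8.7 | -8.2 | -5.9 |
| 7 | -16.0 | -16.0 | -15.0 | -13.0 | -12.0 | -12.0 | -11.0 | -9.1 | -8.0 | -6.7 |

Value Iteration Solution

Wind values by column (x = 1 to 10): 0, 0, 0, 1, 1, 1, 2, 2, 1, 0

| y \ x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -15.0 | -14.0 | -13.0 | -12.0 | -11.0 | -2.0 | -2.0 | -1.0 | -2.0 | -3.0 |
| 2 | -15.0 | -14.0 | -13.0 | -12.0 | -11.0 | -10.0 | -1.0 | -2.0 | -2.0 | -3.0 |
| 3 | -15.0 | -14.0 | -13.0 | -12.0 | -11.0 | -10.0 | -9.0 | -1.0 | -1.0 | -2.0 |
| 4 | -15.0 | -14.0 | -13.0 | -12.0 | -11.0 | -10.0 | -9.0 | 0.0 | -5.0 | -3.0 |
| 5 | -15.0 | -14.0 | -13.0 | -12.0 | -11.0 | -10.0 | -9.0 | -8.0 | -6.0 | -4.0 |
| 6 | -15.0 | -14.0 | -13.0 | -12.0 | -11.0 | -10.0 | -9.0 | -8.0 | -7.0 | -5.0 |
| 7 | -15.0 | -14.0 | -13.0 | -12.0 | -11.0 | -10.0 | -9.0 | -8.0 | -7.0 | -6.0 |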
mimetext/htmlrootassigneelast_run_timestampA Upersist_js_state·has_pluto_hook_features§cell_id$04a0be81-ee5f-4eeb-963a-ad930392d50bdepends_on_disabled_cells§runtime6published_object_keys59c6be96e-38f7-11f0-2d30-a71f02755abc/c925358b3c3d408e49c6be96e-38f7-11f0-2d30-a71f02755abc/13d8f542ac69f8759c6be96e-38f7-11f0-2d30-a71f02755abc/90f5c347caa747c859c6be96e-38f7-11f0-2d30-a71f02755abc/d2eeaee44f48b8a0depends_on_skipped_cells§errored$136d1d96-b590-4f03-9e42-2337efc560ccqueued¤logsrunning¦outputbody mimetext/htmlrootassigneelast_run_timestampA vpersist_js_state·has_pluto_hook_features§cell_id$136d1d96-b590-4f03-9e42-2337efc560ccdepends_on_disabled_cells§runtime*published_object_keysdepends_on_skipped_cells§errored$6bffb08c-704a-4b7c-bfce-b3d099cf35c0queued¤logsrunning¦outputbody;gridworld_Q_vs_sarsa_solve (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA eBpersist_js_state·has_pluto_hook_features§cell_id$6bffb08c-704a-4b7c-bfce-b3d099cf35c0depends_on_disabled_cells§runtimefpublished_object_keysdepends_on_skipped_cells§errored$f95ceb98-f12e-4650-9ad3-0609b7ecd0f3queued¤logsrunning¦outputbody

Exercise 6.14

Describe how the task of Jack's Car Rental (Example 4.2) could be reformulated in terms of afterstates. Why, in terms of this specific task, would such a reformulation be likely to speed convergence?

In the original problem the state is the number of cars at each location at the end of the day. The actions are the net numbers of cars moved between the two locations overnight. With an afterstate approach, the value function would only consider the number of cars after the movement is performed. This would be equivalent to valuing the state the following morning when customers begin to return and rent new cars.

The random processes that occur the following day will have a good/bad outcome based on the cars available at each location at the start of the day. This approach would likely converge faster because we are only modeling the value of the state that is directly related to whether or not cars will be available. Similar to the tic-tac-toe example, many actions will result in the same afterstate, but equivalent afterstates should have the same value. See below for code that creates the car rental MDP and solves it using value iteration with afterstates.
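To make the "many actions lead to the same afterstate" point concrete, here is a small Julia sketch. The state encoding, the sign convention for the action, the clamping at capacity, and the helper names `afterstate`, `is_legal`, and `pairs_reaching` are all assumptions chosen for illustration; the limits of 20 cars and 5 moves are taken from Example 4.2.

```julia
# Hypothetical encoding: a state is (cars at location 1, cars at location 2) at
# the end of the day, and an action a is the net number of cars moved from
# location 1 to location 2 overnight (negative means the other direction).
const MAX_CARS = 20   # capacity per location in Example 4.2
const MAX_MOVE = 5    # at most five cars moved per night in Example 4.2

# Afterstate: the morning car counts once the overnight move is done.
# Clamping at capacity is an assumption about how overflow is handled.
function afterstate(state::Tuple{Int,Int}, a::Int)
    n1, n2 = state
    return (min(n1 - a, MAX_CARS), min(n2 + a, MAX_CARS))
end

# A move is legal when it respects the move limit and the cars on hand.
is_legal(state::Tuple{Int,Int}, a::Int) =
    abs(a) <= MAX_MOVE && -state[2] <= a <= state[1]

# Collect every legal (state, action) pair that produces a given afterstate.
function pairs_reaching(target::Tuple{Int,Int})
    found = Tuple{Tuple{Int,Int},Int}[]
    for n1 in 0:MAX_CARS, n2 in 0:MAX_CARS, a in -MAX_MOVE:MAX_MOVE
        s = (n1, n2)
        if is_legal(s, a) && afterstate(s, a) == target
            push!(found, (s, a))
        end
    end
    return found
end

pairs = pairs_reaching((10, 10))
```

Under these assumptions, eleven different state-action pairs all lead to the morning configuration `(10, 10)`. An afterstate value function stores one number for all of them, whereas an action-value function would have to learn eleven separate entries that differ only by the known cost of moving cars.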

mimetext/htmlrootassigneelast_run_timestampA ޿Npersist_js_state·has_pluto_hook_features§cell_id$f95ceb98-f12e-4650-9ad3-0609b7ecd0f3depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$8787a5fd-d0ab-46b5-a7df-e7bc103a7378queued¤logsrunning¦outputbody3value_iteration_v! (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA ]persist_js_state·has_pluto_hook_features§cell_id$8787a5fd-d0ab-46b5-a7df-e7bc103a7378depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$03a06e10-f68a-403c-97bf-7a7627f2c5d6queued¤logsrunning¦outputbody

Hasselt, in his paper, proposes an alternative, the Double Estimator, to correct this bias in approximating $\max_i \mathbb{E} \{ X_i \}$. It uses two sets of estimators: $\mu^A = \{ \mu_1^A, \dots, \mu_M^A \}$ and $\mu^B = \{ \mu_1^B, \dots, \mu_M^B \}$.

Each set of estimators is updated with its own subset of the samples we draw, such that $S = S^A \cup S^B$, $S^A \cap S^B = \emptyset$, $\mu_i^A(S) = \frac{1}{\vert S_i^A \vert } \sum_{s \in S_i^A} s$, and $\mu_i^B(S) = \frac{1}{\vert S_i^B \vert } \sum_{s \in S_i^B} s$. Like the single estimator $\mu_i$, both $\mu_i^A$ and $\mu_i^B$ are unbiased if we assume that samples are split in a proper manner, for instance randomly over the two sets of estimators. Let $Max^A (S) \doteq \{ j \mid \mu_j^A (S) = \max_i \mu_i^A (S) \}$ be the set of maximal estimates in $\mu^A(S)$. Since $\mu^B$ is an independent, unbiased set of estimators, we have $\mathbb{E} \{ \mu_j^B \} = \mathbb{E} \{ X_j \}$ for all $j$, including all $j \in Max^A$. Let $a^*$ be an estimator that maximizes $\mu^A$: $\mu_{a^*}^A(S) \doteq \max_i \mu_i^A (S)$. If there are multiple estimators that maximize $\mu^A$, we can, for instance, pick one at random. Then we can use $\mu_{a^*}^B$ as an estimate for $\max_i \mathbb{E} \{ \mu_i^B \}$ and therefore also for $\max_i \mathbb{E} \{ X_i \}$, and we obtain the approximation

$$\max_i \mathbb{E} \{ X_i \} = \max_i \mathbb{E} \{ \mu_i^B \} \approx \mu_{a^*}^B \tag{e}$$

As we gain more samples the variance of the estimators decreases. In the limit, $\mu_i^A(S) = \mu_i^B(S) = \mathbb{E} \{ X_i \}$ for all $i$ and the approximation in $(e)$ converges to the correct result.

Assume that the underlying PDFs are continuous. The probability $P(j = a^*)$ for any $j$ is then equal to the probability that all $i \neq j$ give lower estimates. Thus $\mu_j^A(S) = x$ is maximal for some value $x$ with probability $\prod_{i \neq j}^M P(\mu_i^A < x)$. Integrating out $x$ gives $P(j = a^*) = \int_{-\infty}^\infty P(\mu_j^A = x) \prod_{i \neq j}^M P(\mu_i^A < x)dx \doteq \int_{-\infty}^\infty f_j^A(x) \prod_{i \neq j}^M F_i^A(x) dx$, where $f_i^A$ and $F_i^A$ are the PDF and CDF of $\mu_i^A$. The expected value of the approximation by the double estimator can thus be given by

$$\sum_j^M P(j = a^*) \mathbb{E} \{ \mu_j^B \} = \sum_j^M \mathbb{E} \{ \mu_j ^B \} \int_{-\infty}^\infty f_j^A(x) \prod_{i \neq j} F_i^A(x)dx \tag{f}$$

For discrete PDFs the probability that two or more estimators are equal should be taken into account and the integrals should be replaced with sums.

Comparing (f) to (c), we see the difference is that the double estimator uses $\mathbb{E} \{ \mu_j^B \}$ in place of $x$. The single estimator overestimates, because $x$ is within the integral and therefore correlates with the monotonically increasing product $\prod_{i \neq j} F_i^\mu(x)$. The double estimator underestimates because the probabilities $P(j = a^*)$ sum to one and therefore the approximation is a weighted average of unbiased expected values, which must be lower than or equal to the maximum expected value. In the following lemma, which holds in both the discrete and the continuous case, we prove in general that the estimate $\mathbb{E} \{ \mu_{a^*}^B \}$ is not an unbiased estimate of $\max_i \mathbb{E} \{ X_i \}$.
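The contrast between the two estimators can be checked numerically. The sketch below is a hypothetical experiment, not taken from the paper: it assumes $M = 10$ standard normal variables that all have true mean 0, so that $\max_i \mathbb{E} \{ X_i \} = 0$, and it splits the samples into the first and second half rather than at random (equivalent here because the samples are i.i.d.). The function name `compare_estimators` and its parameters are illustrative.

```julia
using Random, Statistics

# Hypothetical experiment: M variables, all with true mean 0, so max_i E{X_i} = 0.
function compare_estimators(rng::AbstractRNG; M::Int = 10, n::Int = 10,
                            trials::Int = 10_000)
    single = zeros(trials)
    double = zeros(trials)
    half = n ÷ 2
    for t in 1:trials
        samples = randn(rng, n, M)                        # column i holds samples of X_i
        μ  = vec(mean(samples; dims = 1))                 # single estimator, all samples
        μA = vec(mean(samples[1:half, :]; dims = 1))      # estimators built from S^A
        μB = vec(mean(samples[half+1:end, :]; dims = 1))  # estimators built from S^B
        single[t] = maximum(μ)       # max_i μ_i
        double[t] = μB[argmax(μA)]   # μ^B evaluated at a* = argmax_i μ_i^A
    end
    return mean(single), mean(double)
end

single_mean, double_mean = compare_estimators(MersenneTwister(1))
```

In this setup the single estimator averages clearly above zero, while the double estimator averages close to zero. Because every variable has the same mean, the weighted average in (f) equals the maximum, so no underestimation shows up here; with unequal means the double estimator would fall below the true maximum.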

mimetext/htmlrootassigneelast_run_timestampA ޽persist_js_state·has_pluto_hook_features§cell_id$03a06e10-f68a-403c-97bf-7a7627f2c5d6depends_on_disabled_cells§runtime U-published_object_keysdepends_on_skipped_cells§errored$0d6a11af-b146-4bbc-997e-a11b897269a7queued¤logsrunning¦outputbodyE

6.4 Sarsa: On-policy TD Control

mimetext/htmlrootassigneelast_run_timestampA ޹persist_js_state·has_pluto_hook_features§cell_id$0d6a11af-b146-4bbc-997e-a11b897269a7depends_on_disabled_cells§runtime$Spublished_object_keysdepends_on_skipped_cells§errored$72b4d8d5-464c-4561-8c69-28ef3f59630bqueued¤logsrunning¦outputbody/update_value! (generic function with 2 methods)mimetext/plainrootassigneelast_run_timestampA  persist_js_state·has_pluto_hook_features§cell_id$72b4d8d5-464c-4561-8c69-28ef3f59630bdepends_on_disabled_cells§runtime%tpublished_object_keysdepends_on_skipped_cells§errored$47c2cbdd-f6db-4ce5-bae2-8141f30aacbcqueued¤logsrunning¦outputbody/

Example 6.2 Random Walk

In this example we empirically compare the prediction abilities of TD(0) and constant-α MC when applied to the following Markov reward process:

In this MRP the agent's actions are irrelevant: at each step the state transitions either to the left or to the right with equal probability. An episode ends when a transition exits the chain on the left or right side. If the agent exits to the right, it receives a reward of 1; all other transitions receive a reward of 0. Below is an animation of the agent randomly moving through an episode. Longer chains have longer episodes on average, with the expected length growing roughly quadratically with the length of the chain. Underneath the visualizations is the code.
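A minimal sketch of one episode of this process is given here, assuming nonterminal states numbered `1..n`, terminal positions `0` and `n + 1`, and a start in the centre of the chain. The function names `random_walk_episode` and `true_values` and the returned data layout are illustrative assumptions rather than the notebook's code.

```julia
using Random

# Hypothetical layout: nonterminal states 1..n, terminals at 0 (left) and n + 1 (right).
# Returns the visited nonterminal states and the reward received on each transition.
function random_walk_episode(rng::AbstractRNG, n::Int = 5)
    s = (n + 1) ÷ 2                      # assumed start: centre of the chain
    states, rewards = Int[], Float64[]
    while 1 <= s <= n
        push!(states, s)
        s += rand(rng, Bool) ? 1 : -1    # left or right with equal probability
        push!(rewards, s == n + 1 ? 1.0 : 0.0)   # reward 1 only on exiting right
    end
    return states, rewards
end

# With γ = 1 the true value of state i is its probability of exiting on the right.
true_values(n::Int) = [i / (n + 1) for i in 1:n]

states, rewards = random_walk_episode(MersenneTwister(1), 5)
```

For an odd chain length `n` and a centre start, the expected number of steps is `((n + 1) / 2)^2`, which is 9 for the five-state chain and is the quadratic growth mentioned above.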

mimetext/htmlrootassigneelast_run_timestampA ޷+wpersist_js_state·has_pluto_hook_features§cell_id$47c2cbdd-f6db-4ce5-bae2-8141f30aacbcdepends_on_disabled_cells§runtime&published_object_keysdepends_on_skipped_cells§errored$8224b808-5778-458b-b683-ea2603c82117queued¤logsrunning¦outputbody@

Example 6.6: Cliff Walking

mimetext/htmlrootassigneelast_run_timestampA ޻persist_js_state·has_pluto_hook_features§cell_id$8224b808-5778-458b-b683-ea2603c82117depends_on_disabled_cells§runtime AGpublished_object_keysdepends_on_skipped_cells§errored$c4919d14-8cba-43e6-9369-efc52bcb9b23queued¤logsrunning¦outputbody5make_greedy_policy! (generic function with 3 methods)mimetext/plainrootassigneelast_run_timestampA Wpersist_js_state·has_pluto_hook_features§cell_id$c4919d14-8cba-43e6-9369-efc52bcb9b23depends_on_disabled_cells§runtimeEV>published_object_keysdepends_on_skipped_cells§errored$05664aaf-575b-4249-974c-d8a2e63f380aqueued¤logsrunning¦outputbodyP

Exercise 6.11

Why is Q-learning considered an off-policy control method?

If we compare to the on-policy update rule, the expected value being calculated at each state action pair should be:

$$Q_\pi(S_t, A_t) = \text{E}_\pi [R_{t+1} + \gamma Q_\pi(S_{t+1}, A_{t+1})]$$

which we estimate with sampling. In Q-learning, the expected value being estimated is instead:

$$Q_*(S_t, A_t) = \text{E} [R_{t+1} + \gamma \max_a Q_*(S_{t+1}, a)]$$

Since the behavior policy used to select the next action from state $S_{t+1}$ is $\epsilon$-greedy, there is a nonzero probability that the next action will not match the maximizing action. So the Q-learning update is computing the optimal greedy state-action value function rather than the optimal $\epsilon$-greedy value function of the behavior policy. Sarsa, in contrast, follows the same policy whose value function it computes, making it a true on-policy method.
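The difference comes down to which value is used to bootstrap. The following sketch writes the two targets side by side, assuming a `states × actions` matrix `Q`; the function names and this layout are illustrative assumptions.

```julia
# Q is a (states × actions) matrix of action-value estimates (a hypothetical layout).

# Sarsa target: bootstraps from the action a_next actually chosen by the behavior policy.
sarsa_target(Q::Matrix{Float64}, r::Float64, s_next::Int, a_next::Int, γ::Float64) =
    r + γ * Q[s_next, a_next]

# Q-learning target: bootstraps from the greedy action, whatever is taken next.
qlearning_target(Q::Matrix{Float64}, r::Float64, s_next::Int, γ::Float64) =
    r + γ * maximum(view(Q, s_next, :))

Q = [0.0 0.0; 1.0 3.0]                        # in state 2, action 2 is greedy
t_sarsa = sarsa_target(Q, -1.0, 2, 1, 1.0)    # exploratory next action: -1 + 1
t_q     = qlearning_target(Q, -1.0, 2, 1.0)   # greedy bootstrap: -1 + 3
```

When the behavior policy happens to explore, as in this example, the Sarsa target reflects the exploratory action while the Q-learning target ignores it.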

mimetext/htmlrootassigneelast_run_timestampA ޻:persist_js_state·has_pluto_hook_features§cell_id$05664aaf-575b-4249-974c-d8a2e63f380adepends_on_disabled_cells§runtimelEpublished_object_keysdepends_on_skipped_cells§errored$dda222ef-8178-40bb-bf20-d242924c4fabqueued¤logsrunning¦outputbodyprefixMDP_TD{GridworldState, GridworldAction, var"#tr#115"{var"#110#119", var"#step#114"{typeof(apply_wind), Vector{Int64}, var"#boundstate#113"{Int64, Int64}}}, var"#108#117"{GridworldState}, var"#isterm#116"{GridworldState}}elementsstatesprefix$Main.var"workspace#3".GridworldStateelementsprefixGridworldStateelementsx1text/plainy1text/plaintypestructprefix_shortGridworldStateobjectid78e123e4d06443c5!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy2text/plaintypestructprefix_shortGridworldStateobjectide3e6b18864c38362!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy3text/plaintypestructprefix_shortGridworldStateobjectid7d75a915b81b9730!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy4text/plaintypestructprefix_shortGridworldStateobjectid32586272439d3588!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy5text/plaintypestructprefix_shortGridworldStateobjectid593769200b7ddf14!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy6text/plaintypestructprefix_shortGridworldStateobjectidd7705072ebc67732!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy7text/plaintypestructprefix_shortGridworldStateobjectid32fa797472e0a83!application/vnd.pluto.tree+objectprefixGridworldStateelementsx2text/plainy1text/plaintypestructprefix_shortGridworldStateobjectidef30e57ae60bdc38!application/vnd.pluto.tree+object 
prefixGridworldStateelementsx2text/plainy2text/plaintypestructprefix_shortGridworldStateobjectid74f49756a2864a57!application/vnd.pluto.tree+objectmoreFprefixGridworldStateelementsx10text/plainy7text/plaintypestructprefix_shortGridworldStateobjectid91d5970141de4b2d!application/vnd.pluto.tree+objecttypeArrayprefix_shortobjectide8e8cb666e91b5c0!application/vnd.pluto.tree+objectstatelookupprefix1Dict{Main.var"workspace#3".GridworldState, Int64}elementsprefixGridworldStateelementsx8text/plainy5text/plaintypestructprefix_shortGridworldStateobjectid14e5eae9a48c6749!application/vnd.pluto.tree+object54text/plainprefixGridworldStateelementsx6text/plainy4text/plaintypestructprefix_shortGridworldStateobjectid4e5052ac2b36c8be!application/vnd.pluto.tree+object39text/plainprefixGridworldStateelementsx7text/plainy4text/plaintypestructprefix_shortGridworldStateobjectid6d65389daed97014!application/vnd.pluto.tree+object46text/plainprefixGridworldStateelementsx8text/plainy4text/plaintypestructprefix_shortGridworldStateobjectidb85af438304886c5!application/vnd.pluto.tree+object53text/plainprefixGridworldStateelementsx10text/plainy1text/plaintypestructprefix_shortGridworldStateobjectiddad6dff35c9621ff!application/vnd.pluto.tree+object64text/plainprefixGridworldStateelementsx6text/plainy7text/plaintypestructprefix_shortGridworldStateobjectid4e4b90239eb3be65!application/vnd.pluto.tree+object42text/plainprefixGridworldStateelementsx8text/plainy1text/plaintypestructprefix_shortGridworldStateobjectid6d43cd1ca99a553e!application/vnd.pluto.tree+object50text/plainprefixGridworldStateelementsx2text/plainy3text/plaintypestructprefix_shortGridworldStateobjectid166e372c47e8ffa6!application/vnd.pluto.tree+object10text/plainprefixGridworldStateelementsx5text/plainy3text/plaintypestructprefix_shortGridworldStateobjectidf8402269233868c7!application/vnd.pluto.tree+object31text/plainprefixGridworldStateelementsx8text/plainy7text/plaintypestructprefix_shortGridworldStateobjectidb08053c76dcd8072!application/
vnd.pluto.tree+object56text/plainmoretypeDictprefix_shortDictobjectid367cac091827c280!application/vnd.pluto.tree+objectactionsprefix%Main.var"workspace#3".GridworldActionelementsprefixUpelementstypestructprefix_shortUpobjectidffffffff92511601!application/vnd.pluto.tree+objectprefixDownelementstypestructprefix_shortDownobjectidffffffff0a19a748!application/vnd.pluto.tree+objectprefixLeftelementstypestructprefix_shortLeftobjectidffffffff13951cc2!application/vnd.pluto.tree+objectprefixRightelementstypestructprefix_shortRightobjectidffffffffa8d2f4c6!application/vnd.pluto.tree+objectprefixUpRightelementstypestructprefix_shortUpRightobjectidffffffff65dca132!application/vnd.pluto.tree+objectprefixUpLeftelementstypestructprefix_shortUpLeftobjectidffffffff68f3503e!application/vnd.pluto.tree+objectprefixDownRightelementstypestructprefix_shortDownRightobjectidffffffff97f641f9!application/vnd.pluto.tree+objectprefixDownLeftelementstypestructprefix_shortDownLeftobjectidffffffffd243dd41!application/vnd.pluto.tree+objecttypeArrayprefix_shortobjectidd84fdc99910d1e41!application/vnd.pluto.tree+objectactionlookupprefix2Dict{Main.var"workspace#3".GridworldAction, 
Int64}elementsprefixDownelementstypestructprefix_shortDownobjectidffffffff0a19a748!application/vnd.pluto.tree+object2text/plainprefixLeftelementstypestructprefix_shortLeftobjectidffffffff13951cc2!application/vnd.pluto.tree+object3text/plainprefixUpRightelementstypestructprefix_shortUpRightobjectidffffffff65dca132!application/vnd.pluto.tree+object5text/plainprefixDownRightelementstypestructprefix_shortDownRightobjectidffffffff97f641f9!application/vnd.pluto.tree+object7text/plainprefixDownLeftelementstypestructprefix_shortDownLeftobjectidffffffffd243dd41!application/vnd.pluto.tree+object8text/plainprefixRightelementstypestructprefix_shortRightobjectidffffffffa8d2f4c6!application/vnd.pluto.tree+object4text/plainprefixUpLeftelementstypestructprefix_shortUpLeftobjectidffffffff68f3503e!application/vnd.pluto.tree+object6text/plainprefixUpelementstypestructprefix_shortUpobjectidffffffff92511601!application/vnd.pluto.tree+object1text/plaintypeDictprefix_shortDictobjectid4cc9d28144c214f4!application/vnd.pluto.tree+objectstate_init%#108 (generic function with 1 method)text/plainstep(::Main.var"workspace#3".var"#tr#115"{Main.var"workspace#3".var"#110#119", Main.var"workspace#3".var"#step#114"{typeof(Main.var"workspace#3".apply_wind), Vector{Int64}, Main.var"workspace#3".var"#boundstate#113"{Int64, Int64}}}) (generic function with 1 method)text/plainistermq(::Main.var"workspace#3".var"#isterm#116"{Main.var"workspace#3".GridworldState}) (generic function with 1 method)text/plaintypestructprefix_shortMDP_TDobjectidf3ad8e4ba985532cmime!application/vnd.pluto.tree+objectrootassigneeconst king_gridworldlast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$dda222ef-8178-40bb-bf20-d242924c4fabdepends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$48b557e3-e239-45e9-ab15-105bcca96492queued¤logsrunning¦outputbody p

6.3 Optimality of TD(0)

Suppose there is available only a finite amount of experience, say 10 episodes or 100 time steps. In this case, a common approach with incremental learning methods is to present the experience repeatedly until the method converges upon an answer. Given an approximate value function $V$, the increments specified by (6.1) or (6.2) are computed for every time step $t$ at which a nonterminal state is visited, but the value function is changed only once, by the sum of all the increments. Then all the available experience is processed again with the new value function to produce a new overall increment, and so on, until the value function converges. We call this batch updating because updates are made only after processing each complete batch of training data.

Under batch updating, TD(0) converges deterministically to a single answer independent of the step-size parameter, $\alpha$, as long as $\alpha$ is chosen to be sufficiently small. The constant-$\alpha$ MC method also converges deterministically under the same conditions, but to a different answer. Understanding these two answers will help us understand the difference between the two methods. Under normal updating the methods do not move all the way to their respective batch answers, but in some sense they take steps in these directions. Before trying to understand the two answers in general, for all possible tasks, we first look at a few examples.
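The batch procedure described above can be sketched for TD(0) as follows. This is an illustrative sketch under assumed conventions, not the notebook's implementation: each episode is a vector of `(s, r, s_next)` transitions, `s_next == 0` marks termination, and the function name `batch_td0` and its keyword arguments are hypothetical.

```julia
# Hypothetical data format: each episode is a vector of (s, r, s_next) transitions,
# with s_next == 0 marking the terminal state (value fixed at zero).
function batch_td0(episodes, nstates::Int; α::Float64 = 0.01, γ::Float64 = 1.0,
                   tol::Float64 = 1e-8, maxsweeps::Int = 100_000)
    V = zeros(nstates)
    for _ in 1:maxsweeps
        Δ = zeros(nstates)
        for episode in episodes, (s, r, s_next) in episode
            v_next = s_next == 0 ? 0.0 : V[s_next]
            Δ[s] += α * (r + γ * v_next - V[s])   # increment from (6.2), not yet applied
        end
        V .+= Δ                                   # apply the summed increments once
        maximum(abs, Δ) < tol && break            # stop when a sweep changes nothing
    end
    return V
end

# Toy batch: state 1 is followed once by state 2 with reward 0;
# state 2 terminates with reward 1 six times and with reward 0 twice.
episodes = vcat([[(1, 0.0, 2), (2, 0.0, 0)]],
                [[(2, 1.0, 0)] for _ in 1:6],
                [[(2, 0.0, 0)]])
V = batch_td0(episodes, 2)
```

On this toy batch the procedure settles at a value of 0.75 for both states, whatever small `α` is used. A batch Monte Carlo estimate for state 1 would instead be 0, the only return observed from it, which is one concrete instance of the two methods converging to different answers.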

Example 6.3: Random walk under batch updating

Batch-updating versions of TD(0) and constant-$\alpha$ MC were applied as follows to the random walk prediction example (Example 6.2). After each new episode, all episodes seen so far were treated as a batch. They were repeatedly presented to the algorithm, either TD(0) or constant-$\alpha$ MC, with $\alpha$ sufficiently small that the value function converged. The resulting value function was then compared with $v_\pi$, and the average root mean square error across the five states (and across 100 independent repetitions of the whole experiment) was plotted to obtain the learning curves shown in Figure 6.2. Note that the batch TD method was consistently better than the batch Monte Carlo method.

Under batch training, constant-$\alpha$ MC converges to the values, $V(s)$, that are sample averages of the actual returns experienced after visiting each state $s$. These are optimal estimates in the sense that they minimize the mean square error from the actual returns in the training set. In this sense it is surprising that the batch TD method was able to perform better according to the root mean square error measure shown in Figure 6.2. How is it that batch TD was able to perform better than this optimal method? The answer is that the Monte Carlo method is optimal only in a limited way, and that TD is optimal in a way that is more relevant to predicting returns.

Below is code implementing both batch methods in general for arbitrary MDPs.

mimetext/htmlrootassigneelast_run_timestampA ޹persist_js_state·has_pluto_hook_features§cell_id$48b557e3-e239-45e9-ab15-105bcca96492depends_on_disabled_cells§runtime !published_object_keysdepends_on_skipped_cells§errored$846720cc-550a-4a3c-a80e-40b99671f4e2queued¤logsrunning¦outputbodyprefixInt64elements-1text/plain1text/plaintypeArrayprefix_shortobjectidf06672b348e7aaamime!application/vnd.pluto.tree+objectrootassigneeconst mrp_moveslast_run_timestampA =persist_js_state·has_pluto_hook_features§cell_id$846720cc-550a-4a3c-a80e-40b99671f4e2depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$6556dafb-04fa-434c-868a-8d7bb7b5b196queued¤logsrunning¦outputbody0make_cliffworld (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA 3{persist_js_state·has_pluto_hook_features§cell_id$6556dafb-04fa-434c-868a-8d7bb7b5b196depends_on_disabled_cells§runtimeypublished_object_keysdepends_on_skipped_cells§errored$3f4f078a-9fc4-4b02-b499-a805fd5f1071queued¤logsrunning¦outputbody
Actions
mimetext/htmlrootassigneelast_run_timestampA 豫 persist_js_state·has_pluto_hook_features§cell_id$75bfe913-8757-4789-b708-7d400c225218depends_on_disabled_cells§runtime E~published_object_keys59c6be96e-38f7-11f0-2d30-a71f02755abc/536a0a4e512619dadepends_on_skipped_cells§errored$fe2ebf39-4ab3-4aa8-abbd-23389eaf400equeued¤logsrunning¦outputbodyF

Sarsa converges with probability 1 to an optimal policy and action-value function, under the usual conditions on step sizes (2.7), as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy (which can be arranged, for example, with $\epsilon$-greedy policies by setting $\epsilon = 1/t$). Below is code that implements Sarsa using the $\epsilon$-greedy method for exploration.
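The exploration scheme mentioned above can be sketched as follows, assuming a `states × actions` matrix `Q` and a time index `t` that starts at 1. The names `epsilon_greedy` and `epsilon_schedule` are hypothetical, and ties in the greedy choice are broken by taking the first maximizing action.

```julia
using Random

# ε-greedy selection over row s of a (states × actions) matrix Q (hypothetical layout).
function epsilon_greedy(rng::AbstractRNG, Q::Matrix{Float64}, s::Int, ε::Float64)
    rand(rng) < ε && return rand(rng, 1:size(Q, 2))   # explore: uniform random action
    return argmax(view(Q, s, :))                      # exploit: first greedy action
end

# Decaying exploration, ε = 1/t, so the policy becomes greedy in the limit.
epsilon_schedule(t::Int) = 1 / t

rng = MersenneTwister(1)
Q = [0.0 2.0 1.0; 3.0 0.0 0.0]
a_greedy = epsilon_greedy(rng, Q, 1, 0.0)                    # ε = 0 always exploits
a_late   = epsilon_greedy(rng, Q, 2, epsilon_schedule(1000)) # almost always greedy
```

With `ε = 1/t` every action keeps a nonzero selection probability at each finite `t`, while the policy approaches the greedy one as `t` grows.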

mimetext/htmlrootassigneelast_run_timestampA ޺@!persist_js_state·has_pluto_hook_features§cell_id$fe2ebf39-4ab3-4aa8-abbd-23389eaf400edepends_on_disabled_cells§runtimeĵpublished_object_keysdepends_on_skipped_cells§errored$98bec66e-d8f3-4d4d-b4ec-5838489164e5queued¤logsrunning¦outputbodyprefixMDP_TD{GridworldState, GridworldAction, var"#tr#115"{var"#221#223", var"#step#114"{var"#220#222", Vector{Int64}, var"#boundstate#113"{Int64, Int64}}}, var"#108#117"{GridworldState}, var"#isterm#116"{GridworldState}}elementsstatesprefix$Main.var"workspace#3".GridworldStateelementsprefixGridworldStateelementsx1text/plainy1text/plaintypestructprefix_shortGridworldStateobjectid78e123e4d06443c5!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy2text/plaintypestructprefix_shortGridworldStateobjectide3e6b18864c38362!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy3text/plaintypestructprefix_shortGridworldStateobjectid7d75a915b81b9730!application/vnd.pluto.tree+objectprefixGridworldStateelementsx2text/plainy1text/plaintypestructprefix_shortGridworldStateobjectidef30e57ae60bdc38!application/vnd.pluto.tree+objectprefixGridworldStateelementsx2text/plainy2text/plaintypestructprefix_shortGridworldStateobjectid74f49756a2864a57!application/vnd.pluto.tree+objectprefixGridworldStateelementsx2text/plainy3text/plaintypestructprefix_shortGridworldStateobjectid166e372c47e8ffa6!application/vnd.pluto.tree+objectprefixGridworldStateelementsx3text/plainy1text/plaintypestructprefix_shortGridworldStateobjectidec7c7c34244569a4!application/vnd.pluto.tree+objectprefixGridworldStateelementsx3text/plainy2text/plaintypestructprefix_shortGridworldStateobjectidc1258421535f88fc!application/vnd.pluto.tree+object 
prefixGridworldStateelementsx3text/plainy3text/plaintypestructprefix_shortGridworldStateobjectid3ed622ab169cc67c!application/vnd.pluto.tree+objecttypeArrayprefix_shortobjectidf181cfeac924fd67!application/vnd.pluto.tree+objectstatelookupprefix1Dict{Main.var"workspace#3".GridworldState, Int64}elementsprefixGridworldStateelementsx2text/plainy2text/plaintypestructprefix_shortGridworldStateobjectid74f49756a2864a57!application/vnd.pluto.tree+object5text/plainprefixGridworldStateelementsx3text/plainy2text/plaintypestructprefix_shortGridworldStateobjectidc1258421535f88fc!application/vnd.pluto.tree+object8text/plainprefixGridworldStateelementsx2text/plainy1text/plaintypestructprefix_shortGridworldStateobjectidef30e57ae60bdc38!application/vnd.pluto.tree+object4text/plainprefixGridworldStateelementsx1text/plainy3text/plaintypestructprefix_shortGridworldStateobjectid7d75a915b81b9730!application/vnd.pluto.tree+object3text/plainprefixGridworldStateelementsx3text/plainy1text/plaintypestructprefix_shortGridworldStateobjectidec7c7c34244569a4!application/vnd.pluto.tree+object7text/plainprefixGridworldStateelementsx1text/plainy1text/plaintypestructprefix_shortGridworldStateobjectid78e123e4d06443c5!application/vnd.pluto.tree+object1text/plainprefixGridworldStateelementsx3text/plainy3text/plaintypestructprefix_shortGridworldStateobjectid3ed622ab169cc67c!application/vnd.pluto.tree+object9text/plainprefixGridworldStateelementsx1text/plainy2text/plaintypestructprefix_shortGridworldStateobjectide3e6b18864c38362!application/vnd.pluto.tree+object2text/plainprefixGridworldStateelementsx2text/plainy3text/plaintypestructprefix_shortGridworldStateobjectid166e372c47e8ffa6!application/vnd.pluto.tree+object6text/plaintypeDictprefix_shortDictobjectidd73841a2172a4792!application/vnd.pluto.tree+objectactionsprefix%Main.var"workspace#3".GridworldActionelementsprefixUpelementstypestructprefix_shortUpobjectidffffffff92511601!application/vnd.pluto.tree+objectprefixDownelementstypestructprefix_shortDownobje
ctidffffffff0a19a748!application/vnd.pluto.tree+objectprefixLeftelementstypestructprefix_shortLeftobjectidffffffff13951cc2!application/vnd.pluto.tree+objectprefixRightelementstypestructprefix_shortRightobjectidffffffffa8d2f4c6!application/vnd.pluto.tree+objecttypeArrayprefix_shortobjectid952f6adeb23ade52!application/vnd.pluto.tree+objectactionlookupprefix2Dict{Main.var"workspace#3".GridworldAction, Int64}elementsprefixDownelementstypestructprefix_shortDownobjectidffffffff0a19a748!application/vnd.pluto.tree+object2text/plainprefixLeftelementstypestructprefix_shortLeftobjectidffffffff13951cc2!application/vnd.pluto.tree+object3text/plainprefixRightelementstypestructprefix_shortRightobjectidffffffffa8d2f4c6!application/vnd.pluto.tree+object4text/plainprefixUpelementstypestructprefix_shortUpobjectidffffffff92511601!application/vnd.pluto.tree+object1text/plaintypeDictprefix_shortDictobjectidbfde935e6876dc7b!application/vnd.pluto.tree+objectstate_init%#108 (generic function with 1 method)text/plainstep(::Main.var"workspace#3".var"#tr#115"{Main.var"workspace#3".var"#221#223", Main.var"workspace#3".var"#step#114"{Main.var"workspace#3".var"#220#222", Vector{Int64}, Main.var"workspace#3".var"#boundstate#113"{Int64, Int64}}}) (generic function with 1 method)text/plainistermq(::Main.var"workspace#3".var"#isterm#116"{Main.var"workspace#3".GridworldState}) (generic function with 1 method)text/plaintypestructprefix_shortMDP_TDobjectid4b7046f5bb96df87mime!application/vnd.pluto.tree+objectrootassigneeconst noisy_gridworldlast_run_timestampA /a1persist_js_state·has_pluto_hook_features§cell_id$98bec66e-d8f3-4d4d-b4ec-5838489164e5depends_on_disabled_cells§runtime_published_object_keysdepends_on_skipped_cells§errored$b59eacf8-7f78-4015-bf2c-66f89bf0e24equeued¤logsrunning¦outputbody

Exercise 6.10: Stochastic Wind (programming)

Re-solve the windy gridworld task with King's moves, assuming the effect of the wind, if there is any, is stochastic, sometimes varying by 1 from the mean values given for each column. That is, a third of the time you move exactly according to these values, as in the previous exercise, but also a third of the time you move one cell above that, and another third of the time you move one cell below that. For example, if you are one cell to the right of the goal and you move left, then one-third of the time you move one cell above the goal, one-third of the time you move two cells above the goal, and one-third of the time you move to the goal.
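
The stochastic wind rule above can be sketched as follows. This is a minimal illustration, not the notebook's own implementation: the names `WIND_MEAN_SKETCH`, `sample_wind_sketch`, and `step_sketch` are hypothetical, the wind strengths are the column values from the standard windy gridworld (Example 6.5), and the sketch assumes that `y` increases upward and that the wind of the departure column applies.

```julia
using Random

# Mean upward wind per column of the 10×7 windy gridworld (Example 6.5).
const WIND_MEAN_SKETCH = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]

# Sample the wind for a column: windless columns stay at 0; otherwise the
# wind is the mean shifted by -1, 0, or +1 with probability 1/3 each.
function sample_wind_sketch(rng::AbstractRNG, column::Integer)
    w = WIND_MEAN_SKETCH[column]
    w == 0 && return 0
    return w + rand(rng, -1:1)
end

# Apply a move (dx, dy) plus the sampled wind of the departure column,
# clamping the result to the grid (assumption: y increases upward).
function step_sketch(rng::AbstractRNG, x, y, dx, dy; width = 10, height = 7)
    w = sample_wind_sketch(rng, x)
    return (clamp(x + dx, 1, width), clamp(y + dy + w, 1, height))
end
```

Under these assumptions, moving left from the cell one to the right of the goal, `step_sketch(rng, 9, 4, -1, 0)`, lands on the goal `(8, 4)`, one cell above it, or two cells above it, matching the example in the exercise statement.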

mimetext/htmlrootassigneelast_run_timestampA ޺?persist_js_state·has_pluto_hook_features§cell_id$b59eacf8-7f78-4015-bf2c-66f89bf0e24edepends_on_disabled_cells§runtime۵published_object_keysdepends_on_skipped_cells§errored$1ae30f5d-b25b-4dcb-800f-45c463641ec5queued¤logsrunning¦outputbody ?

Exercise 6.8

Show that an action-value version of (6.6) holds for the action-value form of the TD error $\delta_t=R_{t+1}+\gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$, again assuming that the values don't change from step to step.

The derivation in (6.6) starts with the definition in (3.9):

$$G_t = R_{t+1} + \gamma G_{t+1}$$

and, given the definition of the TD error, derives the identity that follows it:

$$\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

$$G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k$$

Now we have the action-value form of the TD error:

$$\delta_t \doteq R_{t+1}+\gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$$

Let us transform (3.9) in a similar manner to derive the rule:

$$\begin{flalign} G_t - Q(S_t, A_t) &= R_{t+1} + \gamma G_{t+1} - Q(S_t, A_t) + \gamma Q(S_{t+1}, A_{t+1}) - \gamma Q(S_{t+1}, A_{t+1}) \\ &= \delta_t + \gamma (G_{t+1} - Q(S_{t+1}, A_{t+1})) \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 (G_{t+2} - Q(S_{t+2}, A_{t+2})) \tag{using recursion} \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}(G_T - Q(S_T, A_T)) \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}(0-0) \tag{terminal value} \\ &= \sum_{k=t}^{T-1}\gamma^{k-t}\delta_k \end{flalign}$$
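
As a sanity check, the identity can be verified numerically on an arbitrary episode. The sketch below is illustrative only: `td_error_identity_sketch` is a hypothetical name, the rewards and action values are random numbers rather than the output of any learning method, and the terminal action value is set to zero as the derivation assumes.

```julia
using Random

# Compare G_0 - Q(S_0, A_0) against the discounted sum of action-value TD errors
# for one random episode of length T with fixed (non-updating) values.
function td_error_identity_sketch(rng::AbstractRNG; T = 6, γ = 0.9)
    rewards = randn(rng, T)      # rewards[k] stands for R_k, k = 1..T
    q = randn(rng, T + 1)        # q[k + 1] stands for Q(S_k, A_k), k = 0..T
    q[end] = 0.0                 # terminal action value is zero

    # Return from time 0: G_0 = Σ_{k=1}^{T} γ^(k-1) R_k
    G0 = sum(γ^(k - 1) * rewards[k] for k in 1:T)

    # δ_k = R_{k+1} + γ Q(S_{k+1}, A_{k+1}) - Q(S_k, A_k) for k = 0..T-1
    δ = [rewards[k + 1] + γ * q[k + 2] - q[k + 1] for k in 0:T-1]

    lhs = G0 - q[1]
    rhs = sum(γ^k * δ[k + 1] for k in 0:T-1)
    return lhs, rhs
end
```

Calling `td_error_identity_sketch(MersenneTwister(0))` should return two numbers that agree up to floating-point rounding, because the intermediate $\gamma^k Q(S_k, A_k)$ terms telescope.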

mimetext/htmlrootassigneelast_run_timestampA ޺c԰persist_js_state·has_pluto_hook_features§cell_id$1ae30f5d-b25b-4dcb-800f-45c463641ec5depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$7d3be915-9092-4261-8435-dd546a7db144queued¤logsrunning¦outputbody(cum_max (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA tpersist_js_state·has_pluto_hook_features§cell_id$7d3be915-9092-4261-8435-dd546a7db144depends_on_disabled_cells§runtimeZ=published_object_keysdepends_on_skipped_cells§errored$71774d5f-7841-403f-bc6b-1a0cbbb72d6dqueued¤logsrunning¦outputbodyprefix3FiniteMDP{Float32, GridworldState, GridworldAction}elementsstatesprefix$Main.var"workspace#3".GridworldStateelementsprefixGridworldStateelementsx1text/plainy1text/plaintypestructprefix_shortGridworldStateobjectid78e123e4d06443c5!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy2text/plaintypestructprefix_shortGridworldStateobjectide3e6b18864c38362!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy3text/plaintypestructprefix_shortGridworldStateobjectid7d75a915b81b9730!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy4text/plaintypestructprefix_shortGridworldStateobjectid32586272439d3588!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy5text/plaintypestructprefix_shortGridworldStateobjectid593769200b7ddf14!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy6text/plaintypestructprefix_shortGridworldStateobjectidd7705072ebc67732!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy7text/plaintypestructprefix_shortGridworldStateobjectid32fa797472e0a83!application/vnd.pluto.tree+objectprefixGridworldStateelementsx2text/plainy1text/plaintypestructprefix_shortGridworldStateobjectidef30e57ae60bdc38!application/vnd.pluto.tree+object 
prefixGridworldStateelementsx2text/plainy2text/plaintypestructprefix_shortGridworldStateobjectid74f49756a2864a57!application/vnd.pluto.tree+objectmoreFprefixGridworldStateelementsx10text/plainy7text/plaintypestructprefix_shortGridworldStateobjectid91d5970141de4b2d!application/vnd.pluto.tree+objecttypeArrayprefix_shortobjectidee66261cf47f9401!application/vnd.pluto.tree+objectactionsprefix%Main.var"workspace#3".GridworldActionelementsprefixUpelementstypestructprefix_shortUpobjectidffffffff92511601!application/vnd.pluto.tree+objectprefixDownelementstypestructprefix_shortDownobjectidffffffff0a19a748!application/vnd.pluto.tree+objectprefixLeftelementstypestructprefix_shortLeftobjectidffffffff13951cc2!application/vnd.pluto.tree+objectprefixRightelementstypestructprefix_shortRightobjectidffffffffa8d2f4c6!application/vnd.pluto.tree+objecttypeArrayprefix_shortobjectid952f6adeb23ade52!application/vnd.pluto.tree+objectrewardsprefixFloat32elements0.0text/plain-1.0text/plaintypeArrayprefix_shortobjectid781de2e275c431b!application/vnd.pluto.tree+objectptf70×2×4×70 Array{Float32, 4}: [:, :, 1, 1] = 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 2, 1] = 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 3, 1] = 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 4, 1] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 1, 2] = 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 2, 2] = 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 3, 2] = 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 4, 2] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 1, 3] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 2, 3] = 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 3, 3] = 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 4, 3] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ;;;; … [:, :, 1, 68] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 [:, :, 2, 68] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 3, 68] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 4, 68] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 [:, :, 1, 69] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 [:, :, 2, 69] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 [:, :, 3, 69] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 4, 69] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 [:, :, 1, 70] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 [:, :, 2, 70] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 [:, :, 3, 70] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 4, 70] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0.0 1.0text/plainaction_scratchprefixFloat32elements334.268text/plain332.964text/plain333.09text/plain333.289text/plaintypeArrayprefix_shortobjectidb3ed763e003373d8!application/vnd.pluto.tree+objectstate_scratchprefixFloat32elements272.455text/plain273.38text/plain0.000616613text/plain4.5677f-41text/plain2.08591f-22text/plain4.5677f-41text/plain1.32f-43text/plain0.0text/plain -3.27565f35text/plainmoreG1.4f-43text/plaintypeArrayprefix_shortobjectid3d1f60899297466d!application/vnd.pluto.tree+objectreward_scratchprefixFloat32elements4.0f-45text/plain0.0text/plaintypeArrayprefix_shortobjectid2d531d9ef2febf74!application/vnd.pluto.tree+objectstate_indexprefix1Dict{Main.var"workspace#3".GridworldState, Int64}elementsprefixGridworldStateelementsx8text/plainy5text/plaintypestructprefix_shortGridworldStateobjectid14e5eae9a48c6749!application/vnd.pluto.tree+object54text/plainprefixGridworldStateelementsx6text/plainy4text/plaintypestructprefix_shortGridworldStateobjectid4e5052ac2b36c8be!application/vnd.pluto.tree+object39text/plainprefixGridworldStateelementsx7text/plainy4text/plaintypestructprefix_shortGridworldStateobjectid6d65389daed97014!application/vnd.pluto.tree+object46text/plainprefixGridworldStateelementsx8text/plainy4text/plaintypestructprefix_shortGridworldStateobjectidb85af438304886c5!application/vnd.pluto.tree+object53text/plainprefixGridworldStateelementsx10text/plainy1text/plaintypestructprefix_shortGridworldStateobjectiddad6dff35c9621ff!application/vnd.pluto.tree+object64text/plainprefixGridworldStateelementsx6text/plainy7text/plaintypestructprefix_shortGridworldStateobjectid4e4b90239eb3be65!application/vnd.pluto.tree+object42text/plainprefixGridworldStateelementsx8text/plainy1text/plaintypestructprefix_shortGridworldStateobjectid6d43cd1ca99a553e!application/vnd.pluto.tree+object50text/plainprefixGridworldStateelementsx2text/plainy3text/plaintypestructprefix_shortGridworldStateobjectid166e372c47e8ffa6!application/vnd.pluto.tree+object10text/plainprefixGridw
orldStateelementsx5text/plainy3text/plaintypestructprefix_shortGridworldStateobjectidf8402269233868c7!application/vnd.pluto.tree+object31text/plainprefixGridworldStateelementsx8text/plainy7text/plaintypestructprefix_shortGridworldStateobjectidb08053c76dcd8072!application/vnd.pluto.tree+object56text/plainmoretypeDictprefix_shortDictobjectidefe2304abefe6e4c!application/vnd.pluto.tree+objectaction_indexprefix2Dict{Main.var"workspace#3".GridworldAction, Int64}elementsprefixDownelementstypestructprefix_shortDownobjectidffffffff0a19a748!application/vnd.pluto.tree+object2text/plainprefixLeftelementstypestructprefix_shortLeftobjectidffffffff13951cc2!application/vnd.pluto.tree+object3text/plainprefixRightelementstypestructprefix_shortRightobjectidffffffffa8d2f4c6!application/vnd.pluto.tree+object4text/plainprefixUpelementstypestructprefix_shortUpobjectidffffffff92511601!application/vnd.pluto.tree+object1text/plaintypeDictprefix_shortDictobjectidea6ca163b4135382!application/vnd.pluto.tree+objecttypestructprefix_shortFiniteMDPobjectidb5f665419f0d8ca4mime!application/vnd.pluto.tree+objectrootassigneeconst windy_gridworld_mdp_dplast_run_timestampA }cpersist_js_state·has_pluto_hook_features§cell_id$71774d5f-7841-403f-bc6b-1a0cbbb72d6ddepends_on_disabled_cells§runtime_ published_object_keysdepends_on_skipped_cells§errored$22c2213e-5b9b-410f-a0ef-8f1e3db3c532queued¤logsrunning¦outputbodyk'

Figure 6.2

Performance of TD(0) and constant-α MC under batch training on the random walk task with 5 states

mimetext/htmlrootassigneelast_run_timestampA bpersist_js_state·has_pluto_hook_features§cell_id$22c2213e-5b9b-410f-a0ef-8f1e3db3c532depends_on_disabled_cells§runtimeLpublished_object_keys59c6be96e-38f7-11f0-2d30-a71f02755abc/d3030aa42e1dd0c8depends_on_skipped_cells§errored$39470c74-e554-4f6c-919d-97bec1eec0f3queued¤logsrunning¦outputbody

With King's move actions added, the optimal policy reaches the goal in 7 steps, versus 15 with the original four actions. What happens after adding a ninth action that causes no movement?
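
The step counts can be cross-checked with a breadth-first search over the deterministic windy gridworld. This is a sketch under stated assumptions, not the notebook's solver: all names below are hypothetical, the start is taken as `(1, 4)` and the goal as `(8, 4)` with `y` increasing upward, and the wind of the departure column is added to each move before clamping to the grid.

```julia
# Mean upward wind per column of the 10×7 windy gridworld (Example 6.5).
const WIND_BFS_SKETCH = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]

# Hypothetical move sets as (dx, dy) pairs; y increases upward.
const ROOK_MOVES_SKETCH = [(0, 1), (0, -1), (-1, 0), (1, 0)]
const KING_MOVES_SKETCH = vcat(ROOK_MOVES_SKETCH, [(1, 1), (-1, 1), (1, -1), (-1, -1)])
const KING_STAY_MOVES_SKETCH = vcat(KING_MOVES_SKETCH, [(0, 0)])

# Breadth-first search for the fewest steps from start to goal.
function shortest_steps_sketch(moves; start = (1, 4), goal = (8, 4), width = 10, height = 7)
    dist = Dict(start => 0)
    queue = [start]
    while !isempty(queue)
        (x, y) = popfirst!(queue)
        (x, y) == goal && return dist[goal]
        for (dx, dy) in moves
            # Move, add the wind of the departure column, then clamp to the grid.
            nxt = (clamp(x + dx, 1, width), clamp(y + dy + WIND_BFS_SKETCH[x], 1, height))
            if !haskey(dist, nxt)
                dist[nxt] = dist[(x, y)] + 1
                push!(queue, nxt)
            end
        end
    end
    return nothing  # goal unreachable
end
```

Because the goal lies seven columns to the right of the start and every action changes the column by at most one, no action set of this kind can do better than 7 steps, so a no-movement action cannot shorten the optimal path in the deterministic-wind case.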

mimetext/htmlrootassigneelast_run_timestampA ޺%persist_js_state·has_pluto_hook_features§cell_id$39470c74-e554-4f6c-919d-97bec1eec0f3depends_on_disabled_cells§runtime\published_object_keysdepends_on_skipped_cells§errored$9da5fd84-800d-4b3e-8627-e90ce8f20297queued¤logsrunning¦outputbody1show_grid_policy (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA ߇'persist_js_state·has_pluto_hook_features§cell_id$9da5fd84-800d-4b3e-8627-e90ce8f20297depends_on_disabled_cells§runtimeg8published_object_keysdepends_on_skipped_cells§errored$415ea466-2038-48fe-9d24-39a90182f1ebqueued¤logsrunning¦outputbody3monte_carlo_pred_V (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA Cpersist_js_state·has_pluto_hook_features§cell_id$415ea466-2038-48fe-9d24-39a90182f1ebdepends_on_disabled_cells§runtimeipublished_object_keysdepends_on_skipped_cells§errored$0e488135-49e5-4e71-83b1-05d8e61f0510queued¤logsrunning¦outputbodyprefix3FiniteMDP{Float32, GridworldState, GridworldAction}elementsstatesprefix$Main.var"workspace#3".GridworldStateelementsprefixGridworldStateelementsx1text/plainy1text/plaintypestructprefix_shortGridworldStateobjectid78e123e4d06443c5!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy2text/plaintypestructprefix_shortGridworldStateobjectide3e6b18864c38362!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy3text/plaintypestructprefix_shortGridworldStateobjectid7d75a915b81b9730!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy4text/plaintypestructprefix_shortGridworldStateobjectid32586272439d3588!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy5text/plaintypestructprefix_shortGridworldStateobjectid593769200b7ddf14!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy6text/plaintypestructprefix_shortGridworldStateobjectidd7705072ebc67732!application/vnd.pluto.tree+objectprefixGridworldStateelementsx1text/plainy7
text/plaintypestructprefix_shortGridworldStateobjectid32fa797472e0a83!application/vnd.pluto.tree+objectprefixGridworldStateelementsx2text/plainy1text/plaintypestructprefix_shortGridworldStateobjectidef30e57ae60bdc38!application/vnd.pluto.tree+object prefixGridworldStateelementsx2text/plainy2text/plaintypestructprefix_shortGridworldStateobjectid74f49756a2864a57!application/vnd.pluto.tree+objectmoreFprefixGridworldStateelementsx10text/plainy7text/plaintypestructprefix_shortGridworldStateobjectid91d5970141de4b2d!application/vnd.pluto.tree+objecttypeArrayprefix_shortobjectid6f5984a96b457d74!application/vnd.pluto.tree+objectactionsprefix%Main.var"workspace#3".GridworldActionelementsprefixUpelementstypestructprefix_shortUpobjectidffffffff92511601!application/vnd.pluto.tree+objectprefixDownelementstypestructprefix_shortDownobjectidffffffff0a19a748!application/vnd.pluto.tree+objectprefixLeftelementstypestructprefix_shortLeftobjectidffffffff13951cc2!application/vnd.pluto.tree+objectprefixRightelementstypestructprefix_shortRightobjectidffffffffa8d2f4c6!application/vnd.pluto.tree+objectprefixUpRightelementstypestructprefix_shortUpRightobjectidffffffff65dca132!application/vnd.pluto.tree+objectprefixUpLeftelementstypestructprefix_shortUpLeftobjectidffffffff68f3503e!application/vnd.pluto.tree+objectprefixDownRightelementstypestructprefix_shortDownRightobjectidffffffff97f641f9!application/vnd.pluto.tree+objectprefixDownLeftelementstypestructprefix_shortDownLeftobjectidffffffffd243dd41!application/vnd.pluto.tree+object prefixStayelementstypestructprefix_shortStayobjectidffffffff40b55070!application/vnd.pluto.tree+objecttypeArrayprefix_shortobjectid6fc895826c2337b2!application/vnd.pluto.tree+objectrewardsprefixFloat32elements0.0text/plain-1.0text/plaintypeArrayprefix_shortobjectid8ba99ab12ffe5417!application/vnd.pluto.tree+objectptf 70×2×9×70 Array{Float32, 4}: [:, :, 1, 1] = 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
[:, :, 2, 1] = 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 3, 1] = 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 4, 1] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 5, 1] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 6, 1] = 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 7, 1] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 8, 1] = 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 9, 1] = 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 1, 2] = 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 2, 2] = 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 3, 2] = 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 4, 2] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 5, 2] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 6, 2] = 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 7, 2] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 8, 2] = 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 9, 2] = 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 1, 3] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 2, 3] = 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 3, 3] = 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 4, 3] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 5, 3] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 6, 3] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 7, 3] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 8, 3] = 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 9, 3] = 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ;;;; … [:, :, 1, 68] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 [:, :, 2, 68] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 3, 68] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 4, 68] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 [:, :, 5, 68] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 [:, :, 6, 68] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 7, 68] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 
0.0 0.0 [:, :, 8, 68] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 9, 68] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 [:, :, 1, 69] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 [:, :, 2, 69] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 [:, :, 3, 69] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 4, 69] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 [:, :, 5, 69] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 [:, :, 6, 69] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 7, 69] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 [:, :, 8, 69] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 9, 69] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 [:, :, 1, 70] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 [:, :, 2, 70] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 [:, :, 3, 70] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 4, 70] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 [:, :, 5, 70] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 [:, :, 6, 70] = 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 7, 70] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 [:, :, 8, 70] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 [:, :, 9, 70] = 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0text/plainaction_scratchprefixFloat32elements1.0text/plain0.5text/plain1.0f-45text/plain0.0text/plain3.0f-45text/plain0.0text/plain4.0f-45text/plain0.0text/plain 0.0text/plaintypeArrayprefix_shortobjectid1a6888bacd7c30e0!application/vnd.pluto.tree+objectstate_scratchprefixFloat32elements1.0f-45text/plain0.0text/plain1.0f-45text/plain0.0text/plain1.0f-45text/plain0.0text/plain1.0f-45text/plain0.0text/plain 1.0f-45text/plainmoreG0.0text/plaintypeArrayprefix_shortobjectid2e9e2ed475e79827!application/vnd.pluto.tree+objectreward_scratchprefixFloat32elements1.0f-45text/plain0.0text/plaintypeArrayprefix_shortobjectid9d5f2048d3697c1e!application/vnd.pluto.tree+objectstate_indexprefix1Dict{Main.var"workspace#3".GridworldState, 
Int64}elementsprefixGridworldStateelementsx8text/plainy5text/plaintypestructprefix_shortGridworldStateobjectid14e5eae9a48c6749!application/vnd.pluto.tree+object54text/plainprefixGridworldStateelementsx6text/plainy4text/plaintypestructprefix_shortGridworldStateobjectid4e5052ac2b36c8be!application/vnd.pluto.tree+object39text/plainprefixGridworldStateelementsx7text/plainy4text/plaintypestructprefix_shortGridworldStateobjectid6d65389daed97014!application/vnd.pluto.tree+object46text/plainprefixGridworldStateelementsx8text/plainy4text/plaintypestructprefix_shortGridworldStateobjectidb85af438304886c5!application/vnd.pluto.tree+object53text/plainprefixGridworldStateelementsx10text/plainy1text/plaintypestructprefix_shortGridworldStateobjectiddad6dff35c9621ff!application/vnd.pluto.tree+object64text/plainprefixGridworldStateelementsx6text/plainy7text/plaintypestructprefix_shortGridworldStateobjectid4e4b90239eb3be65!application/vnd.pluto.tree+object42text/plainprefixGridworldStateelementsx8text/plainy1text/plaintypestructprefix_shortGridworldStateobjectid6d43cd1ca99a553e!application/vnd.pluto.tree+object50text/plainprefixGridworldStateelementsx2text/plainy3text/plaintypestructprefix_shortGridworldStateobjectid166e372c47e8ffa6!application/vnd.pluto.tree+object10text/plainprefixGridworldStateelementsx5text/plainy3text/plaintypestructprefix_shortGridworldStateobjectidf8402269233868c7!application/vnd.pluto.tree+object31text/plainprefixGridworldStateelementsx8text/plainy7text/plaintypestructprefix_shortGridworldStateobjectidb08053c76dcd8072!application/vnd.pluto.tree+object56text/plainmoretypeDictprefix_shortDictobjectidb59e95c60e92f130!application/vnd.pluto.tree+objectaction_indexprefix2Dict{Main.var"workspace#3".GridworldAction, 
Int64}elementsprefixDownelementstypestructprefix_shortDownobjectidffffffff0a19a748!application/vnd.pluto.tree+object2text/plainprefixLeftelementstypestructprefix_shortLeftobjectidffffffff13951cc2!application/vnd.pluto.tree+object3text/plainprefixUpRightelementstypestructprefix_shortUpRightobjectidffffffff65dca132!application/vnd.pluto.tree+object5text/plainprefixDownRightelementstypestructprefix_shortDownRightobjectidffffffff97f641f9!application/vnd.pluto.tree+object7text/plainprefixDownLeftelementstypestructprefix_shortDownLeftobjectidffffffffd243dd41!application/vnd.pluto.tree+object8text/plainprefixRightelementstypestructprefix_shortRightobjectidffffffffa8d2f4c6!application/vnd.pluto.tree+object4text/plainprefixUpLeftelementstypestructprefix_shortUpLeftobjectidffffffff68f3503e!application/vnd.pluto.tree+object6text/plainprefixStayelementstypestructprefix_shortStayobjectidffffffff40b55070!application/vnd.pluto.tree+object9text/plainprefixUpelementstypestructprefix_shortUpobjectidffffffff92511601!application/vnd.pluto.tree+object1text/plaintypeDictprefix_shortDictobjectid7a41f62b29e0578!application/vnd.pluto.tree+objecttypestructprefix_shortFiniteMDPobjectid9e9f13d9855b28dbmime!application/vnd.pluto.tree+objectrootassigneeconst kingplus_gridworld_mdp_dplast_run_timestampA 37persist_js_state·has_pluto_hook_features§cell_id$0e488135-49e5-4e71-83b1-05d8e61f0510depends_on_disabled_cells§runtime \published_object_keysdepends_on_skipped_cells§errored$1f28280e-ba3b-4ca5-89e4-6ca4a90f5893queued¤logsrunning¦outputbodyj

Afterstate Value Iteration Results for Jack's Car Rental

mimetext/htmlrootassigneelast_run_timestampA !>Spersist_js_state·has_pluto_hook_features§cell_id$1f28280e-ba3b-4ca5-89e4-6ca4a90f5893depends_on_disabled_cells§runtimeδ@published_object_keys59c6be96e-38f7-11f0-2d30-a71f02755abc/d8c715e8e34d7d99depends_on_skipped_cells§errored$6d9ae541-cf8c-4687-9f0a-f008944657e3queued¤logsrunning¦outputbody+figure_6_3 (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$6d9ae541-cf8c-4687-9f0a-f008944657e3depends_on_disabled_cells§runtimenSpublished_object_keysdepends_on_skipped_cells§errored$d4e39164-9833-4deb-84ca-22f49a1c33d8queued¤logsrunning¦outputbodyq

Reference equations:

$$\begin{flalign} V(S_t) &\leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] \tag{6.2} \\ \delta_t &\doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \tag{6.5} \end{flalign}$$

Rewrite equation (6.5) using only the values known at time $t$, where $V_t$ denotes the value function estimate at time $t$.

$$\delta_t \doteq R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)$$

Now equation (6.2) becomes

$$V_{t+1}(S_t) = V_t(S_t) + \alpha \delta_t$$
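
The time-indexed form makes it explicit that the TD error is computed entirely from $V_t$ before the update is applied. A minimal sketch, assuming states are integer indices into a value vector and that a terminal state is stored with value zero; `td0_step_sketch` is a hypothetical name, not one of the notebook's functions:

```julia
# One TD(0) step: compute δ_t from V_t, then return δ_t and V_{t+1}.
# V is left unmodified so that V_t and V_{t+1} remain distinct objects.
function td0_step_sketch(V::AbstractVector{<:Real}, s::Integer, r::Real, s_next::Integer;
                         α = 0.1, γ = 1.0)
    δ = r + γ * V[s_next] - V[s]   # δ_t = R_{t+1} + γ V_t(S_{t+1}) - V_t(S_t)
    V_next = float.(V)             # broadcasting allocates a fresh array
    V_next[s] += α * δ             # V_{t+1}(S_t) = V_t(S_t) + α δ_t
    return δ, V_next
end
```

Only the entry for $S_t$ changes between $V_t$ and $V_{t+1}$; every other state keeps its previous estimate.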

mimetext/htmlrootassigneelast_run_timestampA ޱGIpersist_js_state·has_pluto_hook_features§cell_id$d4e39164-9833-4deb-84ca-22f49a1c33d8depends_on_disabled_cells§runtimeUpublished_object_keysdepends_on_skipped_cells§errored$f2115666-86ce-4c80-9eb7-490cc7a7715cqueued¤logsrunning¦outputbodyپ

With the original value initialization, the error passes through a minimum early on due to the symmetry of the value updates created by the initial value.

mimetext/htmlrootassigneelast_run_timestampA ޷ɰpersist_js_state·has_pluto_hook_features§cell_id$f2115666-86ce-4c80-9eb7-490cc7a7715cdepends_on_disabled_cells§runtimeʵpublished_object_keysdepends_on_skipped_cells§errored$2155adfa-7a93-4960-950e-1b123da9eea4queued¤logsrunning¦outputbodyprefix%Main.var"workspace#3".GridworldActionelementsprefixUpelementstypestructprefix_shortUpobjectidffffffff92511601!application/vnd.pluto.tree+objectprefixDownelementstypestructprefix_shortDownobjectidffffffff0a19a748!application/vnd.pluto.tree+objectprefixLeftelementstypestructprefix_shortLeftobjectidffffffff13951cc2!application/vnd.pluto.tree+objectprefixRightelementstypestructprefix_shortRightobjectidffffffffa8d2f4c6!application/vnd.pluto.tree+objectprefixUpRightelementstypestructprefix_shortUpRightobjectidffffffff65dca132!application/vnd.pluto.tree+objectprefixUpLeftelementstypestructprefix_shortUpLeftobjectidffffffff68f3503e!application/vnd.pluto.tree+objectprefixDownRightelementstypestructprefix_shortDownRightobjectidffffffff97f641f9!application/vnd.pluto.tree+objectprefixDownLeftelementstypestructprefix_shortDownLeftobjectidffffffffd243dd41!application/vnd.pluto.tree+objecttypeArrayprefix_shortobjectidd84fdc99910d1e41mime!application/vnd.pluto.tree+objectrootassigneelast_run_timestampA 4Opersist_js_state·has_pluto_hook_features§cell_id$2155adfa-7a93-4960-950e-1b123da9eea4depends_on_disabled_cells§runtime"published_object_keysdepends_on_skipped_cells§errored±cell_dependencies$8ddf6b9d-d76d-401f-96ad-2a0b5c114fa4precedence_heuristic cell_id$8ddf6b9d-d76d-401f-96ad-2a0b5c114fa4downstream_cells_mapcreate_noisy_gridworld_mdp$297f1606-4ec2-4075-9f81-926dc517b76fupstream_cells_maplength:FiniteMDP$d7566d1b-8938-4e2c-8c54-124f790e72aeFloat32zerosMDP_TD$3e767962-7339-4f35-a039-b5521a098ed5$5290ae65-6f56-4849-a842-fe347315c6dcprecedence_heuristic 
cell_id$5290ae65-6f56-4849-a842-fe347315c6dcdownstream_cells_mapupstream_cells_map@md_strgetindex$b3d4117f-7db4-43a6-8427-c08f3542d71fprecedence_heuristic cell_id$b3d4117f-7db4-43a6-8427-c08f3542d71fdownstream_cells_mappoisson$ad03500a-bd42-4216-a9cb-3f923152af79$2455742f-dc18-4d6b-9f58-5666adac6919upstream_cells_mapexp-^/factorial*$3ed12c33-ab0a-49b1-b9e7-c4305ba35767precedence_heuristic cell_id$3ed12c33-ab0a-49b1-b9e7-c4305ba35767downstream_cells_mapinit_step$61bbf9db-49a0-4709-83f4-44f228be09c0upstream_cells_mapsample_action$7a5ff8f7-70d4-46f1-a4a7-bbfcec4f6e3fMatrixMDP_TD$3e767962-7339-4f35-a039-b5521a098ed5FunctionReal$209881b3-3ac8-490e-97bd-fa5ae24a39f5precedence_heuristic cell_id$209881b3-3ac8-490e-97bd-fa5ae24a39f5downstream_cells_mapupdate_value!$3f3ebc9b-b070-4d73-8be9-823b399c664cupstream_cells_map:zeromaxislessFunctionVectorlength-+TD0$620a6426-cb29-4010-997b-aa4f9d5f8fb0lastMDP_TD$3e767962-7339-4f35-a039-b5521a098ed5calc_error$3d8b1ccd-9bb3-42f2-a77a-6afdb72c1ff8*AbstractFloat$6e06bd39-486f-425a-bbca-bf363b58988cprecedence_heuristic cell_id$6e06bd39-486f-425a-bbca-bf363b58988cdownstream_cells_mapupstream_cells_map@md_strgetindex$e039a5be-4b59-4023-be97-2d1de970be27precedence_heuristic cell_id$e039a5be-4b59-4023-be97-2d1de970be27downstream_cells_mapupstream_cells_map@md_strgetindex$2786101e-d365-4d6a-8de7-b9794499efb4precedence_heuristic cell_id$2786101e-d365-4d6a-8de7-b9794499efb4downstream_cells_mapexample_6_2$9db7a268-1e6d-4366-a0ec-ebf54916d3b0upstream_cells_mapstringmake_mrp$4ddcd409-c31c-444c-8fcf-7cc45b68d93bsqrtIterators.takeeachcol@htlmake_random_policy$8e34202a-f841-4464-9017-cd50194f7987tabular_TD0_pred_V$eb735ead-978b-409c-8990-b5fa7a027ebfeachindexscatter/^HypertextLiteral.Resultlast==mean:IteratorsHypertextLiteral.Bypasscollect|>HypertextLiteral.contentmonte_carlo_pred_V$415ea466-2038-48fe-9d24-39a90182f1eb-plotHypertextLiteral$639840dc-976a-4e5c-987f-a92afb2d99d8+Layout$14b456f9-5fd1-4340-a3c7-ab9b91b4e3e0precedence_heuristic 
cell_id$14b456f9-5fd1-4340-a3c7-ab9b91b4e3e0downstream_cells_mapupstream_cells_mapBaseBase.Docs.HTML@html_str$ec285c96-4a75-4af6-8898-ec3176fa34c6precedence_heuristic cell_id$ec285c96-4a75-4af6-8898-ec3176fa34c6downstream_cells_mapmake_windy_gridworld$ab331778-f892-4690-8bb3-26464e3fc05f$dda222ef-8178-40bb-bf20-d242924c4fab$dee6b500-0ba1-4bbc-b217-cbb9ad47ad06$4ddc7d99-0b79-4689-bd93-8798b105c0a2$64b210e8-223f-41f7-a6b7-8af6183ddf87$95245673-2c29-401e-bb4b-a39dc8172297$07c57f37-22be-4c39-8279-d80addcea0c5upstream_cells_maprook_actions$e19db54c-4b3c-42d1-b016-9620daf89bfbapply_wind$e19db54c-4b3c-42d1-b016-9620daf89bfb:Int64GridworldAction$e19db54c-4b3c-42d1-b016-9620daf89bfbwind_vals$e19db54c-4b3c-42d1-b016-9620daf89bfbclamp==move$e19db54c-4b3c-42d1-b016-9620daf89bfb$031e1106-7408-4c7e-b78e-b713c19123d1$e9359ca3-4d11-4365-bc6e-7babc6fcc7deGridworldState$e19db54c-4b3c-42d1-b016-9620daf89bfbMDP_TD$3e767962-7339-4f35-a039-b5521a098ed5$cafedde8-be94-4697-a511-510a5fea0155precedence_heuristic cell_id$cafedde8-be94-4697-a511-510a5fea0155downstream_cells_mapupstream_cells_mapfig_6_3_load$21fbdc3b-4444-4f56-9934-fb58e184d685figure_6_3$6d9ae541-cf8c-4687-9f0a-f008944657e3cliffworld$6faa3015-3ac4-44af-a78c-10b175822441$d526a3a4-63cc-4f94-8f55-98c9a4a9d134precedence_heuristic 
cell_id$d526a3a4-63cc-4f94-8f55-98c9a4a9d134downstream_cells_mapdouble_q_learning$69eedbfd-396f-4461-b7a1-c36abc094581$b5e06f59-33b5-414e-9a81-43e8abd07aa3$33d69db9-fa2b-40a3-bbed-21d5fd60f302upstream_cells_mapAbstractFloatzerofirstdouble_expected_sarsa$3756a3f8-18e8-4d62-afa1-cfeb4183820conecreate_greedy_policy$84a71bf8-0d66-42cd-ac7b-589d63a16eda/Matrixmake_greedy_policy!$cb07a6a5-c50a-4900-9e5b-a17dc7ee5710$685a7ba3-0f94-4663-a68a-73fa03bd9445$c4919d14-8cba-43e6-9369-efc52bcb9b23initialize_state_action_value$c5d32889-634b-4b00-8ba7-0d1ecaf94f05MDP_TD$3e767962-7339-4f35-a039-b5521a098ed5make_ϵ_greedy_policy!$6b496582-cc0e-4195-87ef-94792b0fff54create_ϵ_greedy_policy$4d4577b5-3753-450d-a247-ebd8c3e8f799$02f34da1-551f-4ce5-a588-7f3a14afd716precedence_heuristic cell_id$02f34da1-551f-4ce5-a588-7f3a14afd716downstream_cells_mapwind_var$aa0791a5-8cf1-499b-9900-4d0c59be808cupstream_cells_map$f11dca8f-5557-49fc-9720-35034eadba57precedence_heuristic cell_id$f11dca8f-5557-49fc-9720-35034eadba57downstream_cells_mapupstream_cells_map@md_strgetindex$4ddc7d99-0b79-4689-bd93-8798b105c0a2precedence_heuristic cell_id$4ddc7d99-0b79-4689-bd93-8798b105c0a2downstream_cells_mapstochastic_gridworld$ed4e863b-22dd-4d2b-88d0-b3a56d6713b7$1e45a661-c2e1-40c2-b27b-5f80f95efdabupstream_cells_mapking_actions$031e1106-7408-4c7e-b78e-b713c19123d1make_windy_gridworld$ec285c96-4a75-4af6-8898-ec3176fa34c6stochastic_wind$aa0791a5-8cf1-499b-9900-4d0c59be808c$bd1029f9-d6a8-4c68-98cd-8af94297b521precedence_heuristic cell_id$bd1029f9-d6a8-4c68-98cd-8af94297b521downstream_cells_mapplot_path$75bfe913-8757-4789-b708-7d400c225218$d299d800-a64e-4ba2-9603-efa833343405$c34678f6-53bb-4f2a-96f0-a7b16f894ddd$6bffb08c-704a-4b7c-bfce-b3d099cf35c0$84584793-8274-4aa1-854f-b167c7434548upstream_cells_mapmake_random_policy$8e34202a-f841-4464-9017-cd50194f7987$cb07a6a5-c50a-4900-9e5b-a17dc7ee5710precedence_heuristic 
cell_id$cb07a6a5-c50a-4900-9e5b-a17dc7ee5710downstream_cells_mapmake_greedy_policy!$84a71bf8-0d66-42cd-ac7b-589d63a16eda$d526a3a4-63cc-4f94-8f55-98c9a4a9d134$4019c974-dcaa-46c8-ac90-e6566a376ea1upstream_cells_mapextremazeroexpAbstractVectorsumoneReallength-/*==abs$ddf3bb61-16c9-48c4-95d4-263260309762precedence_heuristic cell_id$ddf3bb61-16c9-48c4-95d4-263260309762downstream_cells_mapexercise_6_5$e8f94345-9ad5-48d4-8709-d796fb55db3f$a72d07bf-e337-4bd4-af5c-44d74d163b6bupstream_cells_mapstring:Iteratorsmake_mrp$4ddcd409-c31c-444c-8fcf-7cc45b68d93bsqrtcollect|>Iterators.takeeachcolmake_random_policy$8e34202a-f841-4464-9017-cd50194f7987tabular_TD0_pred_V$eb735ead-978b-409c-8990-b5fa7a027ebf-scatterplot/^+lastLayoutmean$d7566d1b-8938-4e2c-8c54-124f790e72aeprecedence_heuristic cell_id$d7566d1b-8938-4e2c-8c54-124f790e72aedownstream_cells_mapFiniteMDP$c4919d14-8cba-43e6-9369-efc52bcb9b23$95245673-2c29-401e-bb4b-a39dc8172297$07c57f37-22be-4c39-8279-d80addcea0c5$7ac99619-5232-4db8-8553-d79ea5415d29$8ddf6b9d-d76d-401f-96ad-2a0b5c114fa4$dea61907-d4fb-492d-b2bb-c037c7f785cb$3134e913-1e86-495d-a558-c3ec4828bf7b$2455742f-dc18-4d6b-9f58-5666adac6919CompleteMDP$393cd9d2-dd97-496e-b260-ec6e8b1c13b5$d7566d1b-8938-4e2c-8c54-124f790e72ae$0748902c-ffc0-4634-9a1b-e642b3dfb77b$4019c974-dcaa-46c8-ac90-e6566a376ea1$30e663da-282c-42ff-8171-dbe3c5c467c6$7ed07ddc-1c63-4ce7-bfd3-6da54304d297upstream_cells_mapDictzipVectorRealInt64neweachindexlengthCompleteMDP$d7566d1b-8938-4e2c-8c54-124f790e72ae+undefArray$42799973-9884-4a0e-b29a-039890e92d21precedence_heuristic cell_id$42799973-9884-4a0e-b29a-039890e92d21downstream_cells_mapupstream_cells_map@md_strgetindex$187fc682-2282-46ca-b988-c9de438f36fdprecedence_heuristic 
cell_id$187fc682-2282-46ca-b988-c9de438f36fddownstream_cells_mapparams_6_2$22c2213e-5b9b-410f-a0ef-8f1e3db3c532upstream_cells_map@md_strCore:PlutoUI$639840dc-976a-4e5c-987f-a92afb2d99d8Base.get@bindSliderBasePlutoRunnerPlutoRunner.create_bondconfirmCore.applicablePlutoUI.combinegetindex$8fe856ec-5f0a-4483-bb7d-3f6fe270b6f3precedence_heuristic cell_id$8fe856ec-5f0a-4483-bb7d-3f6fe270b6f3downstream_cells_mapupstream_cells_map@md_strgetindex$8e15f4b5-0dc7-47a5-9477-9f4d8807b331precedence_heuristic cell_id$8e15f4b5-0dc7-47a5-9477-9f4d8807b331downstream_cells_mapstochastic_gridworld_mdp_dp$d299d800-a64e-4ba2-9603-efa833343405upstream_cells_mapking_actions$031e1106-7408-4c7e-b78e-b713c19123d1wind_vals$e19db54c-4b3c-42d1-b016-9620daf89bfbGridworldState$e19db54c-4b3c-42d1-b016-9620daf89bfbcreate_stochastic_gridworld_mdp$07c57f37-22be-4c39-8279-d80addcea0c5$9d01c0ef-6313-4091-b444-3e9765aba90cprecedence_heuristic cell_id$9d01c0ef-6313-4091-b444-3e9765aba90cdownstream_cells_mapupstream_cells_map@md_strgetindex$62a9a36a-bedb-4f5a-80a4-2d4111a65c12precedence_heuristic cell_id$62a9a36a-bedb-4f5a-80a4-2d4111a65c12downstream_cells_mapupstream_cells_map@md_strBase.getindexBaseHypertextLiteral.BypassHypertextLiteral.ResultHypertextLiteral$639840dc-976a-4e5c-987f-a92afb2d99d8HypertextLiteral.content@htl$2651af2d-56a8-4f7e-a56a-45cabd665c72precedence_heuristic cell_id$2651af2d-56a8-4f7e-a56a-45cabd665c72downstream_cells_mapupstream_cells_mapmax_visual_params2$0163763b-a15f-447e-b3d2-32d4bf9d2605max_bias_visualization_comp$3f4f078a-9fc4-4b02-b499-a805fd5f1071$620a6426-cb29-4010-997b-aa4f9d5f8fb0precedence_heuristic 
cell_id$620a6426-cb29-4010-997b-aa4f9d5f8fb0downstream_cells_mapBatchMethod$620a6426-cb29-4010-997b-aa4f9d5f8fb0$3f3ebc9b-b070-4d73-8be9-823b399c664cMC$72b4d8d5-464c-4561-8c69-28ef3f59630b$1e3d231a-4065-48ce-a74e-018066fb232aTD0$209881b3-3ac8-490e-97bd-fa5ae24a39f5$3f3ebc9b-b070-4d73-8be9-823b399c664c$1e3d231a-4065-48ce-a74e-018066fb232aupstream_cells_mapBatchMethod$620a6426-cb29-4010-997b-aa4f9d5f8fb0$889611fb-7dac-4769-9251-9a90e3a1422fprecedence_heuristic cell_id$889611fb-7dac-4769-9251-9a90e3a1422fdownstream_cells_mapstatestyle$902738c3-2f7b-49cb-8580-29359c857027upstream_cells_map$5455fc97-55cb-4b0e-a3be-9433ccc96fc0precedence_heuristic cell_id$5455fc97-55cb-4b0e-a3be-9433ccc96fc0downstream_cells_mapnstates$e4c6456c-867d-4ade-a3c8-310c1e065f14$9db7a268-1e6d-4366-a0ec-ebf54916d3b0$4b0d96d0-25d1-4fed-b105-c65fa2883a61delay$53145cc2-784c-468b-8e91-9bb7866db218start_mrp$12c5efe4-d64d-4b82-877c-29b0e537fee6upstream_cells_map@md_strCore:Base.get@bindSliderBasePlutoRunnerPlutoRunner.create_bondCore.applicableButtongetindex$24a441c8-7aaf-4642-b245-5e1201456d67precedence_heuristic cell_id$24a441c8-7aaf-4642-b245-5e1201456d67downstream_cells_mapcheck_policy$eb735ead-978b-409c-8990-b5fa7a027ebf$415ea466-2038-48fe-9d24-39a90182f1eb$3f3ebc9b-b070-4d73-8be9-823b399c664cupstream_cells_mapMain.Base.inferencebarrier@assertsizenothinglengthMainthrowAssertionErrorMatrix==MDP_TD$3e767962-7339-4f35-a039-b5521a098ed5AbstractFloat$1e45a661-c2e1-40c2-b27b-5f80f95efdabprecedence_heuristic cell_id$1e45a661-c2e1-40c2-b27b-5f80f95efdabdownstream_cells_mapupstream_cells_mapking_action_display$cdedd35e-52b8-40a5-938d-2d36f6f93217q_learning$2034fd1e-5171-4eda-85d5-2de62d7a1e8bdisplay_king_policy$9651f823-e1cd-4e6e-9ce0-be9ea1c3f0a4stochastic_gridworld$4ddc7d99-0b79-4689-bd93-8798b105c0a2show_gridworld_policy_value$c34678f6-53bb-4f2a-96f0-a7b16f894ddd$21fbdc3b-4444-4f56-9934-fb58e184d685precedence_heuristic 
cell_id$21fbdc3b-4444-4f56-9934-fb58e184d685downstream_cells_mapfig_6_3_load$cafedde8-be94-4697-a511-510a5fea0155upstream_cells_mapCore@md_strBasePlutoRunner.create_bondPlutoRunnerCheckBoxCore.applicable@bindBase.getgetindex$30e663da-282c-42ff-8171-dbe3c5c467c6precedence_heuristic cell_id$30e663da-282c-42ff-8171-dbe3c5c467c6downstream_cells_mapmakepolicyvalueplots$bb085f2e-83cb-45b2-adf6-c07da892d6e1$1f28280e-ba3b-4ca5-89e4-6ca4a90f5893upstream_cells_mapLaTeXStrings.latexstring:@L_strrelayoutIntegerVectorRealmakepolicyvaluemaps$7ed07ddc-1c63-4ce7-bfd3-6da54304d297-plotlatexifyheatmapattrCompleteMDP$d7566d1b-8938-4e2c-8c54-124f790e72aeMatrixLaTeXStrings$639840dc-976a-4e5c-987f-a92afb2d99d8Layout$9651f823-e1cd-4e6e-9ce0-be9ea1c3f0a4precedence_heuristic cell_id$9651f823-e1cd-4e6e-9ce0-be9ea1c3f0a4downstream_cells_mapdisplay_king_policy$f0f9d3d5-e76a-4472-bfb1-da29d73a7916$dee6b500-0ba1-4bbc-b217-cbb9ad47ad06$ed4e863b-22dd-4d2b-88d0-b3a56d6713b7$1115f3ec-f4b2-4fba-bd5e-321a63b10a6d$1e45a661-c2e1-40c2-b27b-5f80f95efdabupstream_cells_map HypertextLiteral.attribute_valueHypertextLiteral.BypassHypertextLiteral.ResultHypertextLiteral$639840dc-976a-4e5c-987f-a92afb2d99d8@htlAbstractFloatVector$84a71bf8-0d66-42cd-ac7b-589d63a16edaprecedence_heuristic cell_id$84a71bf8-0d66-42cd-ac7b-589d63a16edadownstream_cells_mapcreate_greedy_policy$6bffb08c-704a-4b7c-bfce-b3d099cf35c0$84584793-8274-4aa1-854f-b167c7434548$3756a3f8-18e8-4d62-afa1-cfeb4183820c$d526a3a4-63cc-4f94-8f55-98c9a4a9d134upstream_cells_mapReal:Matrixzerosmake_greedy_policy!$cb07a6a5-c50a-4900-9e5b-a17dc7ee5710$685a7ba3-0f94-4663-a68a-73fa03bd9445$c4919d14-8cba-43e6-9369-efc52bcb9b23sizecopy$c9f7646a-ec01-4d90-9215-5027b7c1c885precedence_heuristic cell_id$c9f7646a-ec01-4d90-9215-5027b7c1c885downstream_cells_mapα_6_8$b5e06f59-33b5-414e-9a81-43e8abd07aa3upstream_cells_map@md_strCore:Base.get@bindSliderBasePlutoRunnerPlutoRunner.create_bondCore.applicablegetindex$8e34202a-f841-4464-9017-cd50194f7987precedence_heuristic 
cell_id$8e34202a-f841-4464-9017-cd50194f7987downstream_cells_mapmake_random_policy$7035c082-6e50-4df5-919f-5f09d2011b4a$64fe8336-d1c2-41fe-a522-1b6f63260fc9$2786101e-d365-4d6a-8de7-b9794499efb4$ddf3bb61-16c9-48c4-95d4-263260309762$1e3d231a-4065-48ce-a74e-018066fb232a$bd1029f9-d6a8-4c68-98cd-8af94297b521upstream_cells_maplength/onesMDP_TD$3e767962-7339-4f35-a039-b5521a098ed5AbstractFloat$95245673-2c29-401e-bb4b-a39dc8172297precedence_heuristic cell_id$95245673-2c29-401e-bb4b-a39dc8172297downstream_cells_mapcreate_gridworld_mdp$d299d800-a64e-4ba2-9603-efa833343405$71774d5f-7841-403f-bc6b-1a0cbbb72d6d$2f4e2da2-b1a1-41b1-8904-39b59f426da4$0e488135-49e5-4e71-83b1-05d8e61f0510upstream_cells_maplengthapply_wind$e19db54c-4b3c-42d1-b016-9620daf89bfb:make_windy_gridworld$ec285c96-4a75-4af6-8898-ec3176fa34c6FiniteMDP$d7566d1b-8938-4e2c-8c54-124f790e72aeFloat32wind_vals$e19db54c-4b3c-42d1-b016-9620daf89bfbzeros$c34678f6-53bb-4f2a-96f0-a7b16f894dddprecedence_heuristic cell_id$c34678f6-53bb-4f2a-96f0-a7b16f894ddddownstream_cells_mapshow_gridworld_policy_value$897fde24-9a4a-465e-96f2-dd9e8baab294$1115f3ec-f4b2-4fba-bd5e-321a63b10a6d$1e45a661-c2e1-40c2-b27b-5f80f95efdab$b5e06f59-33b5-414e-9a81-43e8abd07aa3$33d69db9-fa2b-40a3-bbed-21d5fd60f302upstream_cells_map:show_grid_policy$9da5fd84-800d-4b3e-8627-e90ce8f20297HypertextLiteral.BypassHypertextLiteral.contentplot_path$9f28772c-9afe-4253-ab3b-055b0f48be6e$bd1029f9-d6a8-4c68-98cd-8af94297b521randString@htldisplay_rook_policy$4556cf44-4a1c-4ca4-bfb8-4841301a2ce6show_grid_value$8bc54c94-9c92-4904-b3a6-13ff3f0110bb$678cad7a-1abb-4fcc-91ba-b5abcbb914cbrook_action_display$500d8dd4-fc53-4021-b797-114224ca4debHypertextLiteral$639840dc-976a-4e5c-987f-a92afb2d99d8wind_vals$e19db54c-4b3c-42d1-b016-9620daf89bfbHypertextLiteral.Result$e4e80015-40ce-4f8a-aac7-4a9584da4baaprecedence_heuristic 
cell_id$e4e80015-40ce-4f8a-aac7-4a9584da4baadownstream_cells_mapupstream_cells_mapexample_6_8$33d69db9-fa2b-40a3-bbed-21d5fd60f302ex_6_8_load$d83ff60f-8973-4dc1-9358-5ad109ea5490$64fe8336-d1c2-41fe-a522-1b6f63260fc9precedence_heuristic cell_id$64fe8336-d1c2-41fe-a522-1b6f63260fc9downstream_cells_mapπ_mrp$12c5efe4-d64d-4b82-877c-29b0e537fee6upstream_cells_mapmrp_6_2$4b0d96d0-25d1-4fed-b105-c65fa2883a61make_random_policy$8e34202a-f841-4464-9017-cd50194f7987$dea61907-d4fb-492d-b2bb-c037c7f785cbprecedence_heuristic cell_id$dea61907-d4fb-492d-b2bb-c037c7f785cbdownstream_cells_mapbellman_optimal_value!$8787a5fd-d0ab-46b5-a7df-e7bc103a7378upstream_cells_mapzero@fastmathtypeminBase.FastMath.sub_fastisless@inboundsnothingVectorpoisson$b3d4117f-7db4-43a6-8427-c08f3542d71fislesszipReallengthIntegermakelookup$834e5810-77ea-4dfd-9f37-9d9dbf6585a4BasefindallInt64-enumerate+*$de50f95f-984e-4387-958c-64e0265f5953precedence_heuristic cell_id$de50f95f-984e-4387-958c-64e0265f5953downstream_cells_maprender_walk$e4c6456c-867d-4ade-a3c8-310c1e065f14upstream_cells_map:IteratorserrorHypertextLiteral.Bypass>isless|>Iterators.takeHypertextLiteral.contentceilmapreduce@htlcollectHypertextLiteral.contentmapreduceisfileenumerateHTMLplotfoldxtHypertextLiteral$639840dc-976a-4e5c-987f-a92afb2d99d8+*MapLayoutserialize$3f3ebc9b-b070-4d73-8be9-823b399c664cprecedence_heuristic cell_id$3f3ebc9b-b070-4d73-8be9-823b399c664cdownstream_cells_mapbatch_value_est$1e3d231a-4065-48ce-a74e-018066fb232aupstream_cells_mapzeroTuple>islessVectorlengthpoisson$b3d4117f-7db4-43a6-8427-c08f3542d71fislesszipReallengthIntegerBasefindall-FiniteMDP$d7566d1b-8938-4e2c-8c54-124f790e72aeenumerate+*$f474fcbd-e3c3-49fd-a6b7-6d6a8a7dda09precedence_heuristic cell_id$f474fcbd-e3c3-49fd-a6b7-6d6a8a7dda09downstream_cells_mapupstream_cells_map@md_strgetindex$69eedbfd-396f-4461-b7a1-c36abc094581precedence_heuristic 
cell_id$69eedbfd-396f-4461-b7a1-c36abc094581downstream_cells_mapexample_6_7_mdp$00d67a93-437c-4cda-899a-9daa1102e1f2upstream_cells_maprandnTerm$4382928c-6325-4ecd-b7cf-282525a270absarsa$61bbf9db-49a0-4709-83f4-44f228be09c0double_expected_sarsa$3756a3f8-18e8-4d62-afa1-cfeb4183820cdeserializeq_learning$2034fd1e-5171-4eda-85d5-2de62d7a1e8bB$4382928c-6325-4ecd-b7cf-282525a270abscatterFloat32/double_q_learning$d526a3a4-63cc-4f94-8f55-98c9a4a9d134lastfill==meanMDP_TD$3e767962-7339-4f35-a039-b5521a098ed5expected_sarsa$292d9018-b550-4278-a8e0-78dd6a6853f1A$4382928c-6325-4ecd-b7cf-282525a270ab:collectzerosisfileInteger-plotmake_ϵ_greedy_policy!$6b496582-cc0e-4195-87ef-94792b0fff54Layoutcreate_ϵ_greedy_policy$4d4577b5-3753-450d-a247-ebd8c3e8f799serialize$7ac99619-5232-4db8-8553-d79ea5415d29precedence_heuristic cell_id$7ac99619-5232-4db8-8553-d79ea5415d29downstream_cells_mapcreate_gridworld_mdp$d299d800-a64e-4ba2-9603-efa833343405$71774d5f-7841-403f-bc6b-1a0cbbb72d6d$2f4e2da2-b1a1-41b1-8904-39b59f426da4$0e488135-49e5-4e71-83b1-05d8e61f0510upstream_cells_maplength:FiniteMDP$d7566d1b-8938-4e2c-8c54-124f790e72aeFloat32zerosMDP_TD$3e767962-7339-4f35-a039-b5521a098ed5$0163763b-a15f-447e-b3d2-32d4bf9d2605precedence_heuristic cell_id$0163763b-a15f-447e-b3d2-32d4bf9d2605downstream_cells_mapmax_visual_params2$2651af2d-56a8-4f7e-a56a-45cabd665c72upstream_cells_map@md_strCore:PlutoUI$639840dc-976a-4e5c-987f-a92afb2d99d8|>Base.get@bindBasePlutoRunnerPlutoRunner.create_bondNumberFieldconfirmCore.applicablePlutoUI.combinegetindex$53145cc2-784c-468b-8e91-9bb7866db218precedence_heuristic 
cell_id$53145cc2-784c-468b-8e91-9bb7866db218downstream_cells_mapt$54d97122-2d01-46ec-aafe-00bfc9f2d6d1$1dd1ba55-548a-41f6-903e-70742fd60e3dupstream_cells_mapCorePlutoUI$639840dc-976a-4e5c-987f-a92afb2d99d8delay$5455fc97-55cb-4b0e-a3be-9433ccc96fc0Base.get@bindPlutoUI.ClocklengthBasePlutoRunnerPlutoRunner.create_bondmrp_trajectory$12c5efe4-d64d-4b82-877c-29b0e537fee6Core.applicable+$6b496582-cc0e-4195-87ef-94792b0fff54precedence_heuristic cell_id$6b496582-cc0e-4195-87ef-94792b0fff54downstream_cells_mapmake_ϵ_greedy_policy!$4d4577b5-3753-450d-a247-ebd8c3e8f799$61bbf9db-49a0-4709-83f4-44f228be09c0$2034fd1e-5171-4eda-85d5-2de62d7a1e8b$292d9018-b550-4278-a8e0-78dd6a6853f1$3756a3f8-18e8-4d62-afa1-cfeb4183820c$d526a3a4-63cc-4f94-8f55-98c9a4a9d134$69eedbfd-396f-4461-b7a1-c36abc094581upstream_cells_mapsumisapproxAbstractVectoroneReallengtheachindex-/+*==maximum$9db7a268-1e6d-4366-a0ec-ebf54916d3b0precedence_heuristic cell_id$9db7a268-1e6d-4366-a0ec-ebf54916d3b0downstream_cells_mapupstream_cells_mapexample_6_2$2786101e-d365-4d6a-8de7-b9794499efb4nstates$5455fc97-55cb-4b0e-a3be-9433ccc96fc0$c2f56287-9a3e-454a-9ec1-53184b788db9precedence_heuristic cell_id$c2f56287-9a3e-454a-9ec1-53184b788db9downstream_cells_mapjacks_car_mdp$bb085f2e-83cb-45b2-adf6-c07da892d6e1upstream_cells_mapcreate_car_rental_mdp$2455742f-dc18-4d6b-9f58-5666adac6919$18e60b1d-97ec-432c-a388-003e7fae415fprecedence_heuristic cell_id$18e60b1d-97ec-432c-a388-003e7fae415fdownstream_cells_mapbellman_optimal_value!$8787a5fd-d0ab-46b5-a7df-e7bc103a7378upstream_cells_mapzero@fastmathBase.FastMath.sub_fastisless@inboundsnothinglengthBase.get@bindBasePlutoRunnerPlutoRunner.create_bondNumberFieldconfirmCore.applicablePlutoUI.combinegetindex$a5009785-64b4-489b-a967-f7840b4a9463precedence_heuristic cell_id$a5009785-64b4-489b-a967-f7840b4a9463downstream_cells_mapupstream_cells_map@md_strgetindex$eb735ead-978b-409c-8990-b5fa7a027ebfprecedence_heuristic 
cell_id$eb735ead-978b-409c-8990-b5fa7a027ebfdownstream_cells_maptabular_TD0_pred_V$2786101e-d365-4d6a-8de7-b9794499efb4$ddf3bb61-16c9-48c4-95d4-263260309762upstream_cells_mapzero:!zerosIntegerVectorlengthfindalltakestep$d5abd922-a8c2-4f5c-9a6e-d2490a8ad7dc-enumerateinitialize_state_value$401831c3-3925-465c-a093-28686f0dad2eMatrix+check_policy$24a441c8-7aaf-4642-b245-5e1201456d67*MDP_TD$3e767962-7339-4f35-a039-b5521a098ed5AbstractFloat$2034fd1e-5171-4eda-85d5-2de62d7a1e8bprecedence_heuristic cell_id$2034fd1e-5171-4eda-85d5-2de62d7a1e8bdownstream_cells_mapq_learning$897fde24-9a4a-465e-96f2-dd9e8baab294$1115f3ec-f4b2-4fba-bd5e-321a63b10a6d$1e45a661-c2e1-40c2-b27b-5f80f95efdab$6bffb08c-704a-4b7c-bfce-b3d099cf35c0$84584793-8274-4aa1-854f-b167c7434548$6d9ae541-cf8c-4687-9f0a-f008944657e3$69eedbfd-396f-4461-b7a1-c36abc094581$b5e06f59-33b5-414e-9a81-43e8abd07aa3$33d69db9-fa2b-40a3-bbed-21d5fd60f302upstream_cells_mapzero!oneVectorlengthcopyeachindex/initialize_state_action_value$c5d32889-634b-4b00-8ba7-0d1ecaf94f05==MDP_TD$3e767962-7339-4f35-a039-b5521a098ed5AbstractFloat:firstzerosfindallInt64takestep$d5abd922-a8c2-4f5c-9a6e-d2490a8ad7dc-+undef*make_ϵ_greedy_policy!$6b496582-cc0e-4195-87ef-94792b0fff54maximumcreate_ϵ_greedy_policy$4d4577b5-3753-450d-a247-ebd8c3e8f799$4382928c-6325-4ecd-b7cf-282525a270abprecedence_heuristic cell_id$4382928c-6325-4ecd-b7cf-282525a270abdownstream_cells_mapB$69eedbfd-396f-4461-b7a1-c36abc094581A$69eedbfd-396f-4461-b7a1-c36abc094581Term$69eedbfd-396f-4461-b7a1-c36abc094581MaxBiasStates$4382928c-6325-4ecd-b7cf-282525a270abupstream_cells_mapMaxBiasStates$4382928c-6325-4ecd-b7cf-282525a270ab$8bc54c94-9c92-4904-b3a6-13ff3f0110bbprecedence_heuristic 
cell_id$8bc54c94-9c92-4904-b3a6-13ff3f0110bbdownstream_cells_mapshow_grid_value$d299d800-a64e-4ba2-9603-efa833343405$c34678f6-53bb-4f2a-96f0-a7b16f894dddupstream_cells_mapking_action_display$cdedd35e-52b8-40a5-938d-2d36f6f93217findfirst:HypertextLiteral.BypassHypertextLiteral.contentmapreduce@htlVectoreachindex-HTML HypertextLiteral.attribute_valueHypertextLiteral.ResultHypertextLiteral$639840dc-976a-4e5c-987f-a92afb2d99d8+HypertextLiteral.StyleTagMatrix*roundmaximum$4b1a4c14-3c2b-40c0-995c-cd0334ed8b3aprecedence_heuristic cell_id$4b1a4c14-3c2b-40c0-995c-cd0334ed8b3adownstream_cells_mapupstream_cells_map@md_strgetindex$f0f9d3d5-e76a-4472-bfb1-da29d73a7916precedence_heuristic cell_id$f0f9d3d5-e76a-4472-bfb1-da29d73a7916downstream_cells_mapupstream_cells_mapking_action_display$cdedd35e-52b8-40a5-938d-2d36f6f93217king_gridworld$dda222ef-8178-40bb-bf20-d242924c4fabdisplay_king_policy$9651f823-e1cd-4e6e-9ce0-be9ea1c3f0a4example_6_5$d299d800-a64e-4ba2-9603-efa833343405$4c1b286c-2ba9-4293-81e1-bf360baa75faprecedence_heuristic cell_id$4c1b286c-2ba9-4293-81e1-bf360baa75fadownstream_cells_mapupstream_cells_map@md_strgetindex$3134e913-1e86-495d-a558-c3ec4828bf7bprecedence_heuristic cell_id$3134e913-1e86-495d-a558-c3ec4828bf7bdownstream_cells_mapbegin_value_iteration_v$d299d800-a64e-4ba2-9603-efa833343405$33d69db9-fa2b-40a3-bbed-21d5fd60f302$bb085f2e-83cb-45b2-adf6-c07da892d6e1$1f28280e-ba3b-4ca5-89e4-6ca4a90f5893upstream_cells_mapzeroFiniteMDP$d7566d1b-8938-4e2c-8c54-124f790e72aeones*sizeReal$db31579e-3e56-4271-8fc3-eb13bc95ac27precedence_heuristic cell_id$db31579e-3e56-4271-8fc3-eb13bc95ac27downstream_cells_mapupstream_cells_map@md_strgetindex$943b6d7e-14a4-4532-90c7-dd5080be0c6eprecedence_heuristic cell_id$943b6d7e-14a4-4532-90c7-dd5080be0c6edownstream_cells_mapnoisy_rewards$64b210e8-223f-41f7-a6b7-8af6183ddf87$297f1606-4ec2-4075-9f81-926dc517b76fupstream_cells_map$84584793-8274-4aa1-854f-b167c7434548precedence_heuristic 
cell_id$84584793-8274-4aa1-854f-b167c7434548downstream_cells_map,gridworld_Q_vs_sarsa_vs_expected_sarsa_solve$667666b9-3ab6-4836-953d-9878208103c9upstream_cells_mapsarsa$61bbf9db-49a0-4709-83f4-44f228be09c0Tuplezip@htlq_learning$2034fd1e-5171-4eda-85d5-2de62d7a1e8beachindexscatter/HypertextLiteral.Resultfillexpected_sarsa$292d9018-b550-4278-a8e0-78dd6a6853f1:HypertextLiteral.Bypassplot_path$9f28772c-9afe-4253-ab3b-055b0f48be6e$bd1029f9-d6a8-4c68-98cd-8af94297b521HypertextLiteral.contentmapreducecreate_greedy_policy$84a71bf8-0d66-42cd-ac7b-589d63a16edaplotHypertextLiteral$639840dc-976a-4e5c-987f-a92afb2d99d8+attrLayout$9f28772c-9afe-4253-ab3b-055b0f48be6eprecedence_heuristic cell_id$9f28772c-9afe-4253-ab3b-055b0f48be6edownstream_cells_mapplot_path$75bfe913-8757-4789-b708-7d400c225218$d299d800-a64e-4ba2-9603-efa833343405$c34678f6-53bb-4f2a-96f0-a7b16f894ddd$6bffb08c-704a-4b7c-bfce-b3d099cf35c0$84584793-8274-4aa1-854f-b167c7434548upstream_cells_map*findfirst:maxislessendlength-scatterplot+attrlastfillLayoutmaximumrunepisode$bfe71b40-3157-47df-8494-67f8eb8e4e93$7035c082-6e50-4df5-919f-5f09d2011b4a$1dd1ba55-548a-41f6-903e-70742fd60e3dprecedence_heuristic cell_id$1dd1ba55-548a-41f6-903e-70742fd60e3ddownstream_cells_mapupstream_cells_mapmrp_trajectory$12c5efe4-d64d-4b82-877c-29b0e537fee6t$53145cc2-784c-468b-8e91-9bb7866db218show_mrp_state$87fadfc0-2cdb-4be2-81ad-e8fdeffb690c$2a3e4617-efbb-4bbc-9c61-8535628e439cprecedence_heuristic cell_id$2a3e4617-efbb-4bbc-9c61-8535628e439cdownstream_cells_mapupstream_cells_map@md_strgetindex$5f32fed0-c921-4cbb-85fe-ade54d4c6c95precedence_heuristic cell_id$5f32fed0-c921-4cbb-85fe-ade54d4c6c95downstream_cells_mapupstream_cells_map@md_strgetindex$a3d10753-2ec3-4252-9629-834145678b6aprecedence_heuristic cell_id$a3d10753-2ec3-4252-9629-834145678b6adownstream_cells_mapupstream_cells_map@md_strgetindex$12aac612-758b-4655-8ede-daddd4af6d3eprecedence_heuristic 
cell_id$12aac612-758b-4655-8ede-daddd4af6d3edownstream_cells_mapsarsa_step$61bbf9db-49a0-4709-83f4-44f228be09c0upstream_cells_mapsample_action$7a5ff8f7-70d4-46f1-a4a7-bbfcec4f6e3fMatrixMDP_TD$3e767962-7339-4f35-a039-b5521a098ed5FunctionReal$2c49900b-3c57-4d9a-b3dc-ef9cc20c30c1precedence_heuristic cell_id$2c49900b-3c57-4d9a-b3dc-ef9cc20c30c1downstream_cells_mapupstream_cells_map@md_strgetindex$e26f788e-f602-403e-929e-6c98a6e6bf79precedence_heuristic cell_id$e26f788e-f602-403e-929e-6c98a6e6bf79downstream_cells_mapupstream_cells_map@md_strgetindex$c09530bc-f37e-4d57-a267-14d4027147daprecedence_heuristic cell_id$c09530bc-f37e-4d57-a267-14d4027147dadownstream_cells_mapupstream_cells_map@md_strgetindex$0c0b875e-69f8-46ed-ad06-df9c36088fbeprecedence_heuristic cell_id$0c0b875e-69f8-46ed-ad06-df9c36088fbedownstream_cells_mapgridsize$b5e06f59-33b5-414e-9a81-43e8abd07aa3$98bec66e-d8f3-4d4d-b4ec-5838489164e5$33d69db9-fa2b-40a3-bbed-21d5fd60f302upstream_cells_map$8d05403a-adeb-40ac-a98a-87586d5a5170precedence_heuristic cell_id$8d05403a-adeb-40ac-a98a-87586d5a5170downstream_cells_mapupstream_cells_map@md_strgetindex$44c49006-e210-4f97-916e-fe62f36c593fprecedence_heuristic cell_id$44c49006-e210-4f97-916e-fe62f36c593fdownstream_cells_mapupstream_cells_map@md_strgetindex$0ad739c9-8aca-4b82-bf20-c73584d29535precedence_heuristic cell_id$0ad739c9-8aca-4b82-bf20-c73584d29535downstream_cells_mapupstream_cells_map@md_strgetindex$0748902c-ffc0-4634-9a1b-e642b3dfb77bprecedence_heuristic cell_id$0748902c-ffc0-4634-9a1b-e642b3dfb77bdownstream_cells_mapform_random_policy$4019c974-dcaa-46c8-ac90-e6566a376ea1upstream_cells_maplength/CompleteMDP$d7566d1b-8938-4e2c-8c54-124f790e72aeones$6a1503c6-c77b-4e3a-9f07-74b2af1a5ff7precedence_heuristic cell_id$6a1503c6-c77b-4e3a-9f07-74b2af1a5ff7downstream_cells_mapupstream_cells_map@md_strgetindex$292d9018-b550-4278-a8e0-78dd6a6853f1precedence_heuristic 
cell_id$292d9018-b550-4278-a8e0-78dd6a6853f1downstream_cells_mapexpected_sarsa$84584793-8274-4aa1-854f-b167c7434548$6d9ae541-cf8c-4687-9f0a-f008944657e3$69eedbfd-396f-4461-b7a1-c36abc094581$33d69db9-fa2b-40a3-bbed-21d5fd60f302upstream_cells_mapzerosum!oneVectorlengthcopyeachindex/initialize_state_action_value$c5d32889-634b-4b00-8ba7-0d1ecaf94f05==MDP_TD$3e767962-7339-4f35-a039-b5521a098ed5AbstractFloat:firstzerosfindallInt64takestep$d5abd922-a8c2-4f5c-9a6e-d2490a8ad7dc-+undef*make_ϵ_greedy_policy!$6b496582-cc0e-4195-87ef-94792b0fff54create_ϵ_greedy_policy$4d4577b5-3753-450d-a247-ebd8c3e8f799$07c57f37-22be-4c39-8279-d80addcea0c5precedence_heuristic cell_id$07c57f37-22be-4c39-8279-d80addcea0c5downstream_cells_mapcreate_stochastic_gridworld_mdp$8e15f4b5-0dc7-47a5-9477-9f4d8807b331upstream_cells_mapapply_wind$e19db54c-4b3c-42d1-b016-9620daf89bfb:make_windy_gridworld$ec285c96-4a75-4af6-8898-ec3176fa34c6maxzerosislesslength-FiniteMDP$d7566d1b-8938-4e2c-8c54-124f790e72aeFloat32/min+wind_vals$e19db54c-4b3c-42d1-b016-9620daf89bfb==GridworldState$e19db54c-4b3c-42d1-b016-9620daf89bfb$b5187232-d808-49b6-9f7e-a4cbeb6c2b3eprecedence_heuristic cell_id$b5187232-d808-49b6-9f7e-a4cbeb6c2b3edownstream_cells_mapupstream_cells_map@md_strgetindex$54d97122-2d01-46ec-aafe-00bfc9f2d6d1precedence_heuristic cell_id$54d97122-2d01-46ec-aafe-00bfc9f2d6d1downstream_cells_mapupstream_cells_map@md_strlengthminmrp_trajectory$12c5efe4-d64d-4b82-877c-29b0e537fee6islessfirstt$53145cc2-784c-468b-8e91-9bb7866db218getindex$926ec37d-b969-4dc9-99b2-a6b29c6d880cprecedence_heuristic cell_id$926ec37d-b969-4dc9-99b2-a6b29c6d880cdownstream_cells_mapupstream_cells_map@md_strgetindex$c360945e-f8b2-4c6f-a70c-6ab4ddcf5b54precedence_heuristic cell_id$c360945e-f8b2-4c6f-a70c-6ab4ddcf5b54downstream_cells_mapupstream_cells_map@md_strgetindex$573a9919-bd7e-4a56-b830-4e40e91288efprecedence_heuristic 
cell_id$573a9919-bd7e-4a56-b830-4e40e91288efdownstream_cells_mapupstream_cells_map@md_strgetindex$4556cf44-4a1c-4ca4-bfb8-4841301a2ce6precedence_heuristic cell_id$4556cf44-4a1c-4ca4-bfb8-4841301a2ce6downstream_cells_mapdisplay_rook_policy$d299d800-a64e-4ba2-9603-efa833343405$c34678f6-53bb-4f2a-96f0-a7b16f894dddupstream_cells_map HypertextLiteral.attribute_valueHypertextLiteral.BypassHypertextLiteral.ResultHypertextLiteral$639840dc-976a-4e5c-987f-a92afb2d99d8@htlAbstractFloatVector$bb085f2e-83cb-45b2-adf6-c07da892d6e1precedence_heuristic cell_id$bb085f2e-83cb-45b2-adf6-c07da892d6e1downstream_cells_mapv_carcar_resultsπ_carupstream_cells_map@md_strmakepolicyvalueplots$30e663da-282c-42ff-8171-dbe3c5c467c6endlengthbegin_value_iteration_v$e947f86e-8dc3-4ce7-a9d4-0a7b675a9fa9$4019c974-dcaa-46c8-ac90-e6566a376ea1$3134e913-1e86-495d-a558-c3ec4828bf7bjacks_car_mdp$c2f56287-9a3e-454a-9ec1-53184b788db9getindex$e9359ca3-4d11-4365-bc6e-7babc6fcc7deprecedence_heuristic cell_id$e9359ca3-4d11-4365-bc6e-7babc6fcc7dedownstream_cells_mapmove$ec285c96-4a75-4af6-8898-ec3176fa34c6$6556dafb-04fa-434c-868a-8d7bb7b5b196Stay$e9359ca3-4d11-4365-bc6e-7babc6fcc7de$dee6b500-0ba1-4bbc-b217-cbb9ad47ad06$0e488135-49e5-4e71-83b1-05d8e61f0510upstream_cells_mapGridworldAction$e19db54c-4b3c-42d1-b016-9620daf89bfbStay$e9359ca3-4d11-4365-bc6e-7babc6fcc7de$639840dc-976a-4e5c-987f-a92afb2d99d8precedence_heuristiccell_id$639840dc-976a-4e5c-987f-a92afb2d99d8downstream_cells_mapStatisticsStatsBaseTransducersPlutoUI$53145cc2-784c-468b-8e91-9bb7866db218$187fc682-2282-46ca-b988-c9de438f36fd$4862942b-d1e2-4ac8-8e88-65205e91a070$0163763b-a15f-447e-b3d2-32d4bf9d2605ThreadsLatexifySerializationLinearAlgebraHypertextLiteral$de50f95f-984e-4387-958c-64e0265f5953$902738c3-2f7b-49cb-8580-29359c857027$2786101e-d365-4d6a-8de7-b9794499efb4$62a9a36a-bedb-4f5a-80a4-2d4111a65c12$4d7619ee-933f-452a-9202-e95a8f3da20f$75bfe913-8757-4789-b708-7d400c225218$500d8dd4-fc53-4021-b797-114224ca4deb$4556cf44-4a1c-4ca4-bfb8-4841301a2ce6$d29
Notebook: Chapter_06_Temporal_Difference_Learning.jl (Pluto v0.20.8)
[Figure: "Optimal policy path example". Gridworld plot with start (S) and goal (G) markers and a blue "Optimal Path" trace; x-axis: "Wind Values".]
[Figure: second gridworld plot; x-axis: "Wind Values".]
linewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthheightCRmarginlBHbBHrBHtBpyaxisshowgridèshowlineégridcolorblackgridwidth?tickvals?@@@@@@@range?AmirrorèticktextlinecolorblacktitlefontsizeA`text Optimal policy
path examplex?widthCconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatashowlegend¤modetexttextpositionlefty@typescattertextSx?showlegend¤modetexttextpositionlefty@typescattertextGxAshowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx? @showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx @`@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx`@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey@@typescatternameOptimal PathxA(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@`@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey`@`@typescatternameOptimal Pathx(AAshowlegend¤modelineslinecolorbluey`@@typescatternameOptimal PathxAA59c6be96e-38f7-11f0-2d30-a71f02755abc/93bf178085e446c5layoutautosize§paddingxaxisshowlineégridcolorblacktickvals ?@@@range?@ticktext(???@@?linecolorblackshowgridègridwith?zerolineåtitleWind Valuesmirrorípaper_bgcolorrgba(0, 0, 0, 
0)templatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0
887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthheightCHmarginlBHbBHrBHtBpyaxisshowgridèshowlineégridcolorblackgridwidth?tickvals ?@@@range?@mirrorèticktextlinecolorblacktitlefontsizeA`text Optimal policy
path examplex?widthCHconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatashowlegend¤modetexttextpositionlefty?typescattertextSx?showlegend¤modetexttextpositionlefty`@typescattertextGx`@showlegend¤modelineslinecolorbluey? @typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey @`@typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey`@`@typescatternameOptimal Pathx? @showlegend¤modelineslinecolorbluey`@`@typescatternameOptimal Pathx @`@59c6be96e-38f7-11f0-2d30-a71f02755abc/5b7c97cc5c268b2elayoutautosize§paddingxaxisshowlineégridcolorblacktickvals0?@@@@@@@AA A0A@Arange?PAticktextlinecolorblackshowgridègridwith?zerolineåtitlemirrorípaper_bgcolorrgba(0, 0, 0, 0)templatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitel
inecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpo
lartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthheightCHmarginlBHbBHrBHtBpyaxisshowgridèshowlineégridcolorblackgridwidth?tickvals?@@@@range?@mirrorèticktextlinecolorblacktitlefontsizeA`textCliff Walking Sarsa Pathx?widthCconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatashowlegend¤modetexttextpositionlefty?typescattertextSx?showlegend¤modetexttextpositionlefty?typescattertextGxHAshowlegend¤modelineslinecolorbluey? @typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey @`@typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey`@@typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx? @showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx @`@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx`@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey@@typescatternameOptimal PathxA(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A8Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx8AHAshowlegend¤modelineslinecolorbluey@`@typescatternameOptimal PathxHAHAshowlegend¤modelineslinecolorbluey`@ @typescatternameOptimal PathxHAHAshowlegend¤modelineslinecolorbluey @?typescatternameOptimal 
PathxHAHA59c6be96e-38f7-11f0-2d30-a71f02755abc/6021fa627daa4cd3layouttemplatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dc
ontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthmarginlBHbBHrBHtBpaxistitleαyaxistitletextSum of rewards per episoderangeconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatamodelines+markerslinedashdashyL%˷”Ρ—/–wz­l»a‘qXFQ¯JTEO@L>>>33>>ff>? 
??ff&?333?@?L?Y?fff?33s??modelines+markerslinedashdashyLI el;_H,ٞHb8¦I°.¼yPbd&w~Ÿp³¤typescatternameIntermim SarsaxL=>L>>>33>>ff>? ??ff&?333?@?L?Y?fff?33s??modelines+markerslinedashdashyLW€e©@G@_twϣ«q X$T7Lb/BU³P¤typescatternameIntermim Q-learningxL=>L>>>33>>ff>? ??ff&?333?@?L?Y?fff?33s??modelines+markerslinedashdotyLFvZl2~ӥ(8qѥ6JڥڞAӥͼ%typescatternameAsymptotic Expected SarsaxL=>L>>>33>>ff>? ??ff&?333?@?L?Y?fff?33s??modelines+markerslinedashdotyL@אdn`tK vd1 Q؀…±}x̃ĤtypescatternameAsymptotic SarsaxL=>L>>>33>>ff>? ??ff&?333?@?L?Y?fff?33s??modelines+markerslinedashdotyLDLsKKKjK:KKK:FKeK†K K]LK€KRK|KKޚK¤typescatternameAsymptotic Q-learningxL=>L>>>33>>ff>? ??ff&?333?@?L?Y?fff?33s??59c6be96e-38f7-11f0-2d30-a71f02755abc/d3030aa42e1dd0c8layoutxaxistitletextWalks / Episodestemplatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwh
itegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dline
colorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthmarginlBHbBHrBHtBpyaxistitletextRMS error, averaged over statestitleBatch TrainingconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatay[q>n>k>6f>`>Y>Q>I>NA>&9>1>c)>X#>M>u>r>V>>:>K==P=^=@4=Q==H==e====]=^==,=B=*=i==Rn====w= ==XN=:=bX=U=]֩==3=====)-===Л===h=Ǘ===)P=睒===U==_==İ=Dԋ==*=ނ=׌=G=a1=0=EV==„=<=p=v=b=j=$===6==_=m=v=4==typescatternameMCx?@@@@@@@AA A0A@APA`ApAAAAAAAAAAAAAAAAABBB BBBBB B$B(B,B0B4B8Bo>m>h>Nnc>\>+U>iL>:;C>39>.>#>>8 >>=5==x=)="=x=BM=[u=]gh=8^=X=++T=_R=[R=DT=>U=X=\Z=W\=g^=ez`= \b=5d=e=Je=ce=d=[c=a=Z_=Ν]=ΐ[=Z=:Y=UW=|U=jT=S=Q=O=gN=,L=fLJ=G=E=B=s@=>=˒==O<=;=r:;=*:=RM9=^8=$7=6=|6=6=s5=#5=R3=3=2=0=H70=/=-=,=+=*=)=\'=b%=A|%= %= $=$=m$=#= "=!= =|==typescatternameTDx?@@@@@@@AA A0A@APA`ApAAAAAAAAAAAAAAAAABBB BBBBB B$B(B,B0B4B8Bticktextleaving officereach carexiting highway2ndary roadhome streetarrive homeanchory1yaxis1titletext Predicted total
travel timedomain?anchorx1templatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogra
m2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthxaxis2tickvalsleavingreach_carexit_highwaysnd_rdhome_starrivetitletextStatedomain ??ticktextleaving officereach carexiting highway2ndary roadhome streetarrive homeanchory2marginlBHbBHrBHtBpyaxis2domain?anchorx2configshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatalinecolorblackxaxisx1yAAAAAAtypescatternameactual 
outcomeyaxisy1xleavingreach_carexit_highwaysnd_rdhome_starrivemodelineslinecolorblackdashdashxaxisx1yAAAAAAtypescatternameMonte Carlo Predictionyaxisy1xleavingreach_carexit_highwaysnd_rdhome_starriveshowlegend¡xleavingleavinglinecolorredxaxisx1yAAtypescatternameMone Carlo Erroryaxisy1markersymbolarrow-bar-upanglerefpreviousshowlegend¡xreach_carreach_carlinecolorredxaxisx1yAAtypescatternameMone Carlo Erroryaxisy1markersymbolarrow-bar-upanglerefpreviousshowlegend¡xexit_highwayexit_highwaylinecolorredxaxisx1yAAtypescatternameMone Carlo Erroryaxisy1markersymbolarrow-bar-upanglerefpreviousshowlegend¡xsnd_rdsnd_rdlinecolorredxaxisx1yAAtypescatternameMone Carlo Erroryaxisy1markersymbolarrow-bar-upanglerefpreviousshowlegend¡xhome_sthome_stlinecolorredxaxisx1yAAtypescatternameMone Carlo Erroryaxisy1markersymbolarrow-bar-upanglerefpreviousshowlegend¡xarrivearrivelinecolorredxaxisx1yAAtypescatternameMone Carlo Erroryaxisy1markersymbolarrow-bar-upanglerefpreviousshowlegend¤linecolorblackxaxisx2yAAAAAAtypescatternameactual outcomeyaxisy2xleavingreach_carexit_highwaysnd_rdhome_starrivemodelineslinecolorblackdashdashshapehvxaxisx2yAAAAAAtypescatternameTD(0) Predictionyaxisy2xleavingreach_carexit_highwaysnd_rdhome_starriveshowlegend¡xleavingleavinglinecolorredxaxisx2yAAtypescatternameTD(0) Erroryaxisy2markersymbolarrow-bar-upanglerefpreviousshowlegend¡xreach_carreach_carlinecolorredxaxisx2yAAtypescatternameTD(0) Erroryaxisy2markersymbolarrow-bar-upanglerefpreviousshowlegend¡xexit_highwayexit_highwaylinecolorredxaxisx2yAAtypescatternameTD(0) Erroryaxisy2markersymbolarrow-bar-upanglerefpreviousshowlegend¡xsnd_rdsnd_rdlinecolorredxaxisx2yAAtypescatternameTD(0) Erroryaxisy2markersymbolarrow-bar-upanglerefpreviousshowlegend¡xhome_sthome_stlinecolorredxaxisx2yAAtypescatternameTD(0) Erroryaxisy2markersymbolarrow-bar-upanglerefpreviousshowlegend¡xarrivearrivelinecolorredxaxisx2yAAtypescatternameTD(0) 
Erroryaxisy2markersymbolarrow-bar-upanglerefprevious59c6be96e-38f7-11f0-2d30-a71f02755abc/a7c05c6ee7bae052layoutautosize§paddingxaxisshowlineégridcolorblacktickvals(?@@@@@@@AA Arange?0Aticktext(???@@?linecolorblackshowgridègridwith?zerolineåtitleWind Valuesmirrorípaper_bgcolorrgba(0, 0, 0, 0)templatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fm
[Figure: "Value Iteration Policy Path Example". Grid plot with start (S) and goal (G) markers; trace: Optimal Path.]
[Figure: x-axis "Time steps"; y-axis "Steps Per Episode" (log scale).]
[Figure: x-axis "Time steps"; y-axis "Episodes".]
[Figure: "Optimal policy path example". Grid plot with start (S) and goal (G) markers; x-axis "Wind Values"; trace: Optimal Path.]
[Figure: "Cliff Walking Sarsa Path". Grid plot with start (S) and goal (G) markers; trace: Optimal Path.]
[Figure: "Optimal policy path example". Grid plot with start (S) and goal (G) markers; x-axis "Wind Values"; trace: Optimal Path.]
[Figure: two panels. x-axis "State" (leaving office, reach car, exiting highway, 2ndary road, home street, arrive home); y-axis "Predicted total travel time". Left panel traces: actual outcome, Monte Carlo Prediction, Monte Carlo Error. Right panel traces: actual outcome, TD(0) Prediction, TD(0) Error.]
[Figure: two panels, same axes and traces as the previous figure, with different data values.]
Erroryaxisy2markersymbolarrow-bar-upanglerefprevious59c6be96e-38f7-11f0-2d30-a71f02755abc/c69864c8f78f9c34layoutxaxis1titletextStatedomainff>anchory1yaxis1domain?anchorx1templatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordsli
necolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthxaxis2titletextWalks / Episodesdomain ??anchory2marginlBHbBHrBHtBpyaxis2domain?anchorx2annotationsyanchorbottomxanchorcentery?fontsizeAshowarrow¤yrefpapertext-Estimated Value with TD(0)
with α = 0.2xrefpaperx>fffyanchorbottomxanchorcentery?fontsizeAshowarrow¤yrefpapertext)Empirical RMS error, averaged over statesxrefpaperx?FffconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatalinecolorblackxaxisx1y*>>?*?UUU?typescatternameTrue valuesyaxisy1xABCDExaxisx1ytypescattername0 episodesyaxisy1xABCDExaxisx1ytypescattername1 episodesyaxisy1xABCDExaxisx1yŧ:=q>*,?typescattername7 episodesyaxisy1xABCDExaxisx1yݡn>w ?c#3?typescattername15 episodesyaxisy1xABCDExaxisx1yU=J`6>j6>6?kUW?typescattername99 episodesyaxisy1xABCDEshowlegend¥xaxisx2yc ??>c>>>0>5=>>.>>o>%>1>?>}>r>h>^>9X>kN>n E>;>3>T,>'> ,!>:1>>> >>=f=4===M====v===Ap=޽= I=F= r=$o=*=o =e=f==/=<=1= t={==]p==P=F=Ht=} ==;q=ڶ=\=/=ٝ=2=====.=f=c=v==9="=yx==y=R=9=ץ=>=l]=N==]=ê=n== =g%=typescatternameRMS erroryaxisy2x?@@@@@@@AA A0A@APA`ApAAAAAAAAAAAAAAAAABBB BBBBB B$B(B,B0B4B8Bc9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc9
6#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogra
m2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthheightCHmarginlBHbBHrBHtBpyaxisshowgridèshowlineégridcolorblackgridwidth?tickvals ?@@@range?@mirrorèticktextlinecolorblacktitlefontsizeA`text Optimal policy
path examplex?widthCHconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatafshowlegend¤modetexttextpositionlefty?typescattertextSx?showlegend¤modetexttextpositionlefty`@typescattertextGx`@showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal 
Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal 
Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal 
Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx??59c6be96e-38f7-11f0-2d30-a71f02755abc/bc25cbf31a6c6942layoutxaxistitletextEpisodestemplatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwid
th?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3
a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthmarginlBHbBHrBHtBpyaxistitleSum of rewards during episoderangepconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatayą+ 4ffø^# f¯šff\Hz¸HsףmR8šh33~rr{,(Z C Ksq= …(33_®G {&33*= L338-= ((&R ff (ףH&R¸#¸¤p{33 6…ff&{4= ףz¤p{{\= +   ffQ"q=33 (33{z{p= (zף H{ ff …%)\)\R${®GpšG\q=zRzR {ffQzff= …33Qz{({(G= " RRffG)\R( š (q=\…)\ffףp\HR{ff33Hף z ®G ff\RffHRp= \QHRH{ffHQq=\GRHp» 33HQ\ף{33 RG\)\q=ff33Qff= q= \33(pGףp{= zR\q=z33p33ףz(QH33HQffR\ ¸)\\{zz š3333( ףpף= zףpG\(Gz¹\RQGQq=)\\)\ffšG= G¸ Gp¹Q(ףG= zG{)\QGµRףRq=G \(q=H(\ףffRQHQzף(= )\= ףHR)\{QG )\\= ½33 \R)\q== = 33ff{q=»H)\Q{ff)\Rp(()\ )\ = R33{= Qp)\z 33¤pp{ףQtypescatternameSarsax?@@@@@@@AA A0A@APA`ApAAAAAAAAAAAAAAAAABBB BBBBB B$B(B,B0B4B8BC?C@CACBCCCDCECFCGCHCICJCKCLCMCNCOCPCQCRCSCTCUCVCWCXCYCZC[C\C]C^C_C`CaCbCcCdCeCfCgChCiCjCkClCmCnCoCpCqCrCsCtCuCvCwCxCyCzC{C|C}C~CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCyďB@))LHa¸f¤p¸®ǜףR8…kffuQz)ܙ\®ǀr¸}= `LffE33fq=XHFšf K¸J\3|¤p)q=@33Wš:{Oq=Q Iף4QNף@ OYףR\:…f(f(HšNV(< >URY+{+ziC33I®G?q=<¸6(;Lš8)…@\R@…A¤pY~šG®G^ff/z\…xHj….ףW)\ 1:¸W.(…u# S(*= `q=\š!ffyQ2fQ8¤pI= 2 K33-H3¤p6{D®G9a{n= I(Jšf…(¸E(EV\>= >…{¸mP\Y(F¸Dffmš9R¤pu<®G1ףr{"R#= šq¸B)\=q=hQ[zNIVR5Wz/¸ ¸-33GHRdL(:HMF\C:®Gkq=?{?z\)\B 6…9ףLX)\N®GI ff5 
<33.RXG\1\Xz(1= R{mq=f;S(m®GZšAš$ףChffr)\®GH JH{)\%)\-v= iף?ףW0…$N33zףCz azhA\(\VB()\Cff))\XG)\IR&H72¤p*33Jff5ffDff?)\gq=j(ףp®GN 9ff$¤p` v(")\2(X tffMi¸JRm\K)\2¤p/Qe336q=ZRPšK33IQbnLq=G)\N= H33V($…S®GXHPRv ~(W;J= Pff)\7\8\-Q#¤pq4{+ףA¤pMURcQmaTffHHGףUף_z5]R1q=> C{ (6)\Bף|š]¸JrQG0Q;Qf{Y¤pK= R lr?š]>6šdQN= 6%R>GHf= =5(V)\pRYQHQ'q=xR ffKq=S{*¤p6š1¸XHdš+= 9ff9= ;\I…5zSN8=HzO{n(!(gš…B= = V hq=>(Wq=W33EbzS W{JQS\5…$MP \a]q=Azzp{r)\8¸[B(*= S®GdffO®G{9H Aa33e¤p)qDRffI= ?ף-q=Zzš.šJJA¸A(BzcJ4H9¸(,…tšU…qQ=(5®G:\(fH6{=z+Y= D…7= ,zGHdq=šš@q=P(\šD¤p[\G(8¸e6F)\czJq=: R¤p{q=L\5H8\¸)ףfףd\Q0…kffM~HQ W\0Q))\4{Zz\  ¤typescatternameQ-learningx?@@@@@@@AA A0A@APA`ApAAAAAAAAAAAAAAAAABBB BBBBB B$B(B,B0B4B8BC?C@CACBCCCDCECFCGCHCICJCKCLCMCNCOCPCQCRCSCTCUCVCWCXCYCZC[C\C]C^C_C`CaCbCcCdCeCfCgChCiCjCkClCmCnCoCpCqCrCsCtCuCvCwCxCyCzC{C|C}C~CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC59c6be96e-38f7-11f0-2d30-a71f02755abc/b3ded7d596cbc23flayoutxaxistitletextEpisodestemplatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorw
hiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fd
ca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthmarginlBHbBHrBHtBpyaxistitle.Average steps per episode
[Figure: y-axis label ending "during training". Traces: Sarsa, Expected Sarsa, and one further trace whose name is not recoverable.]

[Figure: x-axis "Time steps", y-axis "Episodes". One red line trace.]

[Figure: "Optimal policy path example". Grid with start "S" and goal "G"; blue line "Optimal Path".]

[Figure: x-axis "Episodes", y-axis "Sum of rewards during episode". Traces: Sarsa, Q-learning, and one further trace whose name is not recoverable.]

[Figure: "Empirical RMS error, averaged over states". x-axis "Walks / Episodes". Traces: TD (α = 0.05, α = 0.15, and others whose labels are not recoverable) and MC (α = 0.01, 0.02, 0.03, 0.04).]

[Figure, two versions with different data: x-axis "State" (leaving office, reach car, exiting highway, 2ndary road, home street, arrive home), y-axis "Predicted total travel time". Left panel: actual outcome, Monte Carlo Prediction, Monte Carlo Error. Right panel: actual outcome, TD(0) Prediction, TD(0) Error.]

[Figure: grid with x-axis "Wind Values".]
inorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthheightCRmarginlBHbBHrBHtBpyaxisshowgridèshowlineégridcolorblackgridwidth?tickvals?@@@@@@@range?Amir
rorèticktextlinecolorblacktitlefontsizeA`text(Value Iteration Policy
Path Examplex?widthCconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatashowlegend¤modetexttextpositionlefty@typescattertextSx?showlegend¤modetexttextpositionlefty@typescattertextGxAshowlegend¤modelineslinecolorbluey@`@typescatternameOptimal Pathx? @showlegend¤modelineslinecolorbluey`@@typescatternameOptimal Pathx @`@showlegend¤modelineslinecolorbluey@`@typescatternameOptimal Pathx`@@showlegend¤modelineslinecolorbluey`@`@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey`@`@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey`@`@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey`@@typescatternameOptimal Pathx@A59c6be96e-38f7-11f0-2d30-a71f02755abc/7b6adbf2145966c9layoutautosize§paddingxaxisshowlineégridcolorblacktickvals(?@@@@@@@AA Arange?0Aticktext(???@@?linecolorblackshowgridègridwith?zerolineåtitleWind Valuesmirrorípaper_bgcolorrgba(0, 0, 0, 0)templatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#
E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetc
olorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthheightCRmarginlBHbBHrBHtBpyaxisshowgridèshowlineégridcolorblackgridwidth?tickvals?@@@@@@@range?AmirrorèticktextlinecolorblacktitlefontsizeA`textSarsa policy
Path Examplex?widthCconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdata showlegend¤modetexttextpositionlefty@typescattertextSx?showlegend¤modetexttextpositionlefty@typescattertextGxAshowlegend¤modelineslinecolorbluey@`@typescatternameOptimal Pathx? @showlegend¤modelineslinecolorbluey`@`@typescatternameOptimal Pathx @`@showlegend¤modelineslinecolorbluey`@ @typescatternameOptimal Pathx`@`@showlegend¤modelineslinecolorbluey @ @typescatternameOptimal Pathx`@ @showlegend¤modelineslinecolorbluey @?typescatternameOptimal Pathx @`@showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx`@@showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey? @typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey @@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey@@typescatternameOptimal PathxA(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@`@typescatternameOptimal Pathx(AAshowlegend¤modelineslinecolorbluey`@ @typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey @`@typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey`@@typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey@@typescatternameOptimal PathxA(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal 
Pathx(A(Ashowlegend¤modelineslinecolorbluey@`@typescatternameOptimal Pathx(AAshowlegend¤modelineslinecolorbluey`@`@typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey`@`@typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey`@@typescatternameOptimal PathxAA59c6be96e-38f7-11f0-2d30-a71f02755abc/4d752609bc5b03a9layoutxaxistitletextEpisodestemplatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#
2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthmarginlBHbBHrBHtBpyaxistitle.A
verage steps per episode
during trainingrangeBconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdataygDn5CB=B´B±BB)ܐBBLBtBq=rB= xB33WBHVB= ^BZB(4BH:B\:B,Bq=2BG7BG%B#Bz#B#Bff BHBff Bz B7BBRA{B)\AAAAA\AAAffAA)\Aq=AffAA)\ARAHAHAq=AzA½AHB33AARA)\A= BffA A)\ApAAAAAA= AQAGA{A\AffAHAHAףApA33AGAq=A ױAzAA\A= A\AA(AA(Aq=AAףA= AA= A)\AAAffA= AAAGBA33Aq=AAzA AA{ApApAA̸AzARA)\A{A̸A)\ApA(AffApA̸AzAHARA{AARAq=A(A©AףA̮Aq=AףA\AAA\A\AAq=A33AA{AAA= A= A= A ׳AGAGAA(AffARAGAARA33AHAAffA{A³A»AAAAHA³AAףA)\A= A\ARAA)\A{ApAHA AzA)\AARApA)\A שA)\AHARA)\AGAHA33ARAA(A\AA\AA\AffAAA= A{A)\AAAffAA)\AGA̶A ׭AףAAGAzAQAA(Aq=B33Aq=AGA קA)\A\A{ApAAףApARAGA33AQAA33ĄAA(A{AA)\AffA¥AAq=AHAAq=A33A̪A33AףAffA= AGAA¥A(AAAA̰A= ARA33AAGAq=AAGA= ApA= AAzAApA33A(Aq=A{AffAA¿AzA ױAAA\AAq=A(A= A ׫A)\AAAAAףA)\AzAA«AףA\AA̢AAA(AAffAzA¡AHAףA)\A ױAHA ׭AAAAGAz B½AAq=AA ׫AQA³AARAGA{ARAffĄAApAA(AAffA{AQAA{A= A§AA= A33AHA= AffAA¡AQA̪A\ARA33A)\AףAAAQAA(AA= AA= A\AzAA AףARAףAq=A= A33A(A(AAA)\AAffAHApA שA{AQAA·Aq=A(AA33AA{AAffAףAA(A{AA̪AAffApAA{AAq=AA= A33AffA= A{A(AףAAffAA)\A33ARA\ARAAzAQAq=AAA(A\AQARA­AHAAA\AGA33A= A­AAAGAQA A(A)\A= AAffA)\A ױAףAAA̴AAq=A33AAA{AQA̢AzAtypescatternameSarsax?@@@@@@@AA A0A@APA`ApAAAAAAAAAAAAAAAAABBB BBBBB B$B(B,B0B4B8BC?C@CACBCCCDCECFCGCHCICJCKCLCMCNCOCPCQCRCSCTCUCVCWCXCYCZC[C\C]C^C_C`CaCbCcCdCeCfCgChCiCjCkClCmCnCoCpCqCrCsCtCuCvCwCxCyCzC{C|C}C~CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCyC#C WBBBB̜BfBkBRvBQqB{VB[BRFBHB\AB= 2BR/B(*Bq=Bq=B(BB Bp BB)\BA BffA\AzApAAA{A= AףAq=AffAQARA= AffAHAAAApAףA(A{AA)\AHAAAAq=ARA(AffAffApAq=Aq=A{A= A= A‰AHA\ARA)\A‹AAAQAffAAffA ׍A33A{A ׉AAAApAAAq=AQAApA33A(AA)\AA\AffA33AAAQAHAAAAQAffA\AA33ARAAAARA= AyA33AǍAAAA\ApAA ׋A{AÅA(AAGAffAffĀARA{ApAAףĂA33AQA33AA‡A{AHA…AA33AApAffA= AףARAq=ARA\A(AA‡A בAAףAAGAAq=AAQAAGA= AAffA(A\AffAQA\A)\A{AHA\AHAA= AA‰ÀAzAAA…A33A 
׉AQAq=AxA)\AA33AffA)\A(ARAHAHA)\A‰Aq=A ׃A ׃ARAA ׉ARAHAq=AA33A{AA(AGAHAA\A)\ApA{Aq=AAAA)\AGAAA {ARA)\A{ÂAA}A33Aq=AAGA33A)\wA ׅA33AA̎A ׉AA= A ׍ARAAq=AAAA= A‹AAARA)\AQARAA\AHAAAGAGAffA{AAzA= AAGAףAQA)\Aq=A ׅAAAףAAAHApAffA(AffApAGA{AAAQA{AףAGAAq=AA)\AGA(A= A ׏AAAA= ApAAq=AzA)\AffApAffAAAA(AAQA\Aq=AGA(AzAGARAA)\AAAq=Aף|A33ApAQAAA)\AzA33AzA ׅA(A ׉A ׇAAA)\A(AAq=AAHAffAA= AAAGApA= A\A ׉AGAHAHAApA ׃AffAzA{zAq=AףA= A\AffAA= AGAAffAA‹AA33A33A\AAARAAAzAq=AAA\ApAA}AH~AARA\ĀAHA(Aq=~AףAq=A(AAffA)\A{A= ApAAףAR~AAq=AA{AףAA= AףAq=Aq=AAAףAG}AƒAGA)\A ׃AA‰ARApA ׇAAAAHA\ÀAAA33ApA(AAQAzAzAAHApAAAףAA)\ARA= A33AAA= AtypescatternameQ-learningx?@@@@@@@AA A0A@APA`ApAAAAAAAAAAAAAAAAABBB BBBBB B$B(B,B0B4B8BC?C@CACBCCCDCECFCGCHCICJCKCLCMCNCOCPCQCRCSCTCUCVCWCXCYCZC[C\C]C^C_C`CaCbCcCdCeCfCgChCiCjCkClCmCnCoCpCqCrCsCtCuCvCwCxCyCzC{C|C}C~CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC59c6be96e-38f7-11f0-2d30-a71f02755abc/90f5c347caa747c8layoutautosize§paddingxaxisshowlineégridcolorblacktickvals(?@@@@@@@AA Arange?0Aticktext(???@@?linecolorblackshowgridègridwith?zerolineåtitleWind Valuesmirrorípaper_bgcolorrgba(0, 0, 0, 
0)templatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0
887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthheightCRmarginlBHbBHrBHtBpyaxisshowgridèshowlineégridcolorblackgridwidth?tickvals?@@@@@@@range?AmirrorèticktextlinecolorblacktitlefontsizeA`textSarsa policy
Path Examplex?widthCconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatashowlegend¤modetexttextpositionlefty@typescattertextSx?showlegend¤modetexttextpositionlefty@typescattertextGxAshowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx? @showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx @`@showlegend¤modelineslinecolorbluey@`@typescatternameOptimal Pathx`@`@showlegend¤modelineslinecolorbluey`@`@typescatternameOptimal Pathx`@@showlegend¤modelineslinecolorbluey`@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey@@typescatternameOptimal PathxA(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@`@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey`@`@typescatternameOptimal Pathx(AAshowlegend¤modelineslinecolorbluey`@@typescatternameOptimal PathxAA59c6be96e-38f7-11f0-2d30-a71f02755abc/d2eeaee44f48b8a0layoutxaxistitletextTime 
stepstemplatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#
0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthmarginlBHbBHrBHtBpyaxistypelogtitletextSteps Per EpisodeconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatayD@DBBCBAC BAB,CAAAA CCBlBBBBB(BBHBBAABBBXBBpBBBBHBBBABABtBAXBXBBB BlBAB4BB8BLBATB@BA(BABBA,BABDBlBAHBABAAAABAA BAAAABAAA B8B BABBBABABABA$BAAAAAAA BAAAAAAAAAAA BpAAAAAAAAAAAAAAAA4BAAAAAABAAAAAAABAAAAAAAAAtypescatterx?@@@@@@@AA A0A@APA`ApAAAAAAAAAAAAAAAAABBB 
BBBBB B$B(B,B0B4B8Bc9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksout
linewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthmarginlBHbBHrBHtBpyaxistitletextEpisodesconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatalinecolorredy@?@@@@@@@AA A0A@APA`ApAAAAAAAAAAAAAAAAABBB BBBBB B$B(B,B0B4B8BC?C@CACBCCCDCECFCGCHCICJCKCLCMCNCOCPCQCRCSCTCUCVCWCXCYCZC[C\C]C^C_C`CaCbCcCdCeCfCgChCiCjCkClCmCnCoCpCqCrCsCtCuCvCwCxCyCzC{C|C}C~CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCtypescatterx@{D`DDD` E@EEE3E7E?E0EE QETEZE`\E^EqE{E0EEhEhE؆E EpE`E@EE8ExEEhEEEEPEEhEEPEEEpEHEEE@E8EEHEEEhE8EEEEE(EpE0ETFDF@FF\FFFFLF@ F< FF,FlFFFxFFF|FFFFFFFLF8F 
[Figure: "Random policy path example". Gridworld path from start (S) to goal (G).]
[Figure: x-axis "Time steps"; y-axis "Steps Per Episode" (log scale).]
[Figure: "Maximization Bias for IID Variables with Zero Mean". x-axis "Number of Samples Per Variable"; y-axis "Estimate of Maximum Mean"; traces: "True Value" (dashed), "2 variables", "3 variables", "4 variables".]
[Figure: "Optimal policy path example". Gridworld path from start (S) to goal (G).]
[Figure: "Sarsa policy Path Example". x-axis "Wind Values"; gridworld path from start (S) to goal (G).]
[Figure: "Cliff Walking Q Learning Path". Gridworld path from start (S) to goal (G).]
[Figure: "Optimal policy path example". x-axis "Wind Values"; gridworld path from start (S) to goal (G).]
[Figure: x-axis "# Cars at second location"; y-axis "# Cars at first location".]
olorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8
>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthxaxis2yanchorbottomtickvalsAtitlefontsizeA standoff?text# Cars at second locationautomarginædomain ??linewidth@mirroræanchory2linecolorwhitemarginlBHbBHrBHtBpyaxis2tickvalsAtitlepadlstandoff?text# Cars at first locationautomarginædomain?linewidth@mirroræanchorx2linecolorwhiteannotationsyanchorbottomxanchorcentery?fontsizeAshowarrow¤yrefpapertext$\pi_{41}$xrefpaperx>fffyanchorbottomxanchorcentery?fontsizeAshowarrow¤yrefpapertext$v_{\pi_{41}}$xrefpaperx?FffconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatacolorbarthickness@xaxisx1yT?@@@@@@@AA A0A@APA`ApAAAAAAtypeheatmapcolorscaleRdBuyaxisy1zT?@@@@@@@@@@@@@@@@T??@@@@@@@@@@@@@@@@T?@@@@@@@@@@@@@@@@@@T??@@@@@@@@@@@@@@@@@@@T????@@@@@@@@@@@@@@T??????@@@@@@T????@@@T?@@T??@T?@T?@T??T@?T@?T@?T@?T@@T@T@T@T@@transposeáxT?@@@@@@@AA A0A@APA`ApAAAAAAcolorbarthickness@xaxisx2yT?@@@@@@@AA A0A@APA`ApAAAAAAtypeheatmapcolorscaleBlueredyaxisy2zTCCMCC:CCvC.wC)CC-CeCD;D̡DgD;DTpDȖDe D DTCCPJC\ClCCvCH)CCCeCBDD(-DDeD D6 DuR D_ D] DTCzC7C[C|CfvC(CpCC2DD=DDD[RDD D4 DX D D DT&XCDHCC*mCuC}(CCOCDBDD-DpDGD Dx< D)[ DOk Dl D_D3FDTCCCCq CCCq2DkDp=DDMD'RD| Da D D6 D DD&DDT[C6KCChCgC|D9D5BDFD,D$DD\ DA< DZ Dk DlD_DFDDnDTCJvCE/CuCCCDDD=DDDQ D4 D D D DDhDDDYbDQDTzCiC!C >De6DkD֛DVDD D D; DZ DjDlD<_DED^D<D9DCDTAiCC)CCD_DDgDD8 D} DV DJ DDDDD6D"bDDD^DT(CC-DWD'D1DR D} D Dj D<D SD ZDJRD)<D(DDDCDDXlDTC5DfD~DjzD1 D| D Dg DDoDDDnDɇDFWDDsD^D;DoDTDbD( D&D D DC 
Dw DBDDDD DD^D݇D@<DoDlDcDgDT*fDDqDDJ D.= D DDGDyfD-pDgDND&DDЩDSDDBoDVDSDT D*&DH,DD Dz DӦ DD,lDODղD׳DBDD[ND DD^DD3gDD4DTDD Dۂ DY^ DDnDDDDJD-D DfgDDD[DODxSDD DT+D D D DDSFDDDDD DDzDOrDD+DLDD}4D^D DTDJ D] Dk+DDR}DDD[5D5D$DZD2DnDDʪDI1DD< DYD DT D D& DhD9DGD!D.3DGDY@D#DDD\DDDD_wDDDBUDT}] D* DMDD.DD De8DEDo8DDDD:D&DYDD9DbDDDTت DhDDTD5DDWD &D-DDDD|bDDDDDYD;D{DDT D{D.DD0DfD*DDDDDkDD±DV=DڹD)'DDDDC:DtransposeáxT?@@@@@@@AA A0A@APA`ApAAAAAA59c6be96e-38f7-11f0-2d30-a71f02755abc/f51c1fa00f167ddflayoutautosize§paddingxaxisshowlineégridcolorblacktickvals0?@@@@@@@AA A0A@Arange?PAticktextlinecolorblackshowgridègridwith?zerolineåtitlemirrorípaper_bgcolorrgba(0, 0, 0, 0)templatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@
backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width
?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthheightCHmarginlBHbBHrBHtBpyaxisshowgridèshowlineégridcolorblackgridwidth?tickvals?@@@@range?@mirrorèticktextlinecolorblacktitlefontsizeA`textCliff Walking Q Learning Pathx?widthCconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatashowlegend¤modetexttextpositionlefty?typescattertextSx?showlegend¤modetexttextpositionlefty?typescattertextGxHAshowlegend¤modelineslinecolorbluey? @typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey @ @typescatternameOptimal Pathx? @showlegend¤modelineslinecolorbluey @ @typescatternameOptimal Pathx @`@showlegend¤modelineslinecolorbluey @ @typescatternameOptimal Pathx`@@showlegend¤modelineslinecolorbluey @ @typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey @ @typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey @ @typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey @ @typescatternameOptimal Pathx@Ashowlegend¤modelineslinecolorbluey @ @typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey @ @typescatternameOptimal PathxA(Ashowlegend¤modelineslinecolorbluey @ @typescatternameOptimal Pathx(A8Ashowlegend¤modelineslinecolorbluey @ @typescatternameOptimal Pathx8AHAshowlegend¤modelineslinecolorbluey @?typescatternameOptimal 
PathxHAHA49c6be96e-38f7-11f0-2d30-a71f02755abc/d3a9386ca62c618layoutxaxistitletextEpisodestemplatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksout
linewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthmarginlBHbBHrBHtBptitle"Episode Length for Noisy GridworldyaxistypelogtitletextSteps per EpisodeconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatayP,TpB=BBBnB4^B HBAB;B4Bc'B%B&BB BNBYBoBBBtypescatternameSarsaxP?@@@@@@@AA A0A@APA`ApAAAAAAyPyfB B4`B~BƜjB8PBMTB9B*7B_G.B})BgU$B- B2BB.?B 
BpB,BzBtypescatternameExpected SarsaxP?@@@@@@@AA A0A@APA`ApAAAAAAyP_B)BIB! BV_BBBgA/AMsAFA?WA#[AA qAAAA}.AAtypescatternameDouble Expected SarsaxP?@@@@@@@AA A0A@APA`ApAAAAAAyPeBҊBɈBXBlB\Bo0JB@BH;B h/B`-B#'BB(BX#BBB'BffB$BVBtypescatternameQ-learningxP?@@@@@@@AA A0A@APA`ApAAAAAAyP\Bk)BABBR'B?B~BBB*BASA6+AOAiAAqA]AASAtypescatternameDouble Q-learningxP?@@@@@@@AA A0A@APA`ApAAAAAA59c6be96e-38f7-11f0-2d30-a71f02755abc/c925358b3c3d408elayoutautosize§paddingxaxisshowlineégridcolorblacktickvals(?@@@@@@@AA Arange?0Aticktext(???@@?linecolorblackshowgridègridwith?zerolineåtitleWind Valuesmirrorípaper_bgcolorrgba(0, 0, 0, 0)templatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundí
zerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticks
outlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthheightCRmarginlBHbBHrBHtBpyaxisshowgridèshowlineégridcolorblackgridwidth?tickvals?@@@@@@@range?AmirrorèticktextlinecolorblacktitlefontsizeA`text(Value Iteration Policy
Path Examplex?widthCconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatashowlegend¤modetexttextpositionlefty@typescattertextSx?showlegend¤modetexttextpositionlefty@typescattertextGxAshowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx? @showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx @`@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx`@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx@Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey@@typescatternameOptimal PathxA(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@`@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey`@`@typescatternameOptimal Pathx(AAshowlegend¤modelineslinecolorbluey`@@typescatternameOptimal PathxAA59c6be96e-38f7-11f0-2d30-a71f02755abc/5e790add5f7b1844layoutautosize§paddingxaxisshowlineégridcolorblacktickvals(?@@@@@@@AA Arange?0Aticktext(???@@?linecolorblackshowgridègridwith?zerolineåtitleWind Valuesmirrorípaper_bgcolorrgba(0, 0, 0, 
0)templatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0
887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthheightCRmarginlBHbBHrBHtBpyaxisshowgridèshowlineégridcolorblackgridwidth?tickvals?@@@@@@@range?AmirrorèticktextlinecolorblacktitlefontsizeA`text(Value Iteration Policy
Path Examplex?widthCconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatashowlegend¤modetexttextpositionlefty@typescattertextSx?showlegend¤modetexttextpositionlefty@typescattertextGxAshowlegend¤modelineslinecolorbluey@`@typescatternameOptimal Pathx? @showlegend¤modelineslinecolorbluey`@ @typescatternameOptimal Pathx @`@showlegend¤modelineslinecolorbluey @?typescatternameOptimal Pathx`@@showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey?`@typescatternameOptimal Pathx@Ashowlegend¤modelineslinecolorbluey`@@typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey@@typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey@@typescatternameOptimal PathxA(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@@typescatternameOptimal Pathx(A(Ashowlegend¤modelineslinecolorbluey@`@typescatternameOptimal Pathx(AAshowlegend¤modelineslinecolorbluey`@ @typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey @ @typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey @`@typescatternameOptimal PathxAAshowlegend¤modelineslinecolorbluey`@@typescatternameOptimal PathxAA59c6be96e-38f7-11f0-2d30-a71f02755abc/ada388116d66970blayoutautosize§paddingxaxisshowlineégridcolorblacktickvals0?@@@@@@@AA A0A@Arange?PAticktextlinecolorblackshowgridègridwith?zerolineåtitlemirrorípaper_bgcolorrgba(0, 0, 0, 
[Figure: "Cliff Walking Expected Sarsa Path". Grid plot with start (S) and goal (G) markers; trace: "Optimal Path".]

[Figure: two panels. Left panel: "Estimated Value with TD(0) with α = 0.2"; x-axis: "State" (A to E); traces: "True values", "0 episodes", "1 episodes", "7 episodes", "15 episodes", "99 episodes". Right panel: "Empirical RMS error, averaged over states"; x-axis: "Walks / Episodes"; trace: "RMS error".]

[Figure: y-axis: "% left actions from A"; traces: "Sarsa", "Q-learning", "Double Q-learning", "Expected Sarsa", "Double Expected Sarsa", "optimal" (dashed line).]

[Figure: "Optimal policy path example". x-axis: "Wind Values"; grid plot with start (S) and goal (G) markers; trace: "Optimal Path".]

[Figure: x-axis: "Time steps"; y-axis: "Episodes"; single red line trace.]

[Figure: "Maximization Bias for 2 Variables with Zero Mean". x-axis: "Number of Samples Per Variable"; y-axis: "Estimate of Maximum Mean"; traces: "Max of Means Estimate", "Double Max Estimate".]

[Figure: x-axis: "Time steps".]
necolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthmarginlBHbBHrBHtBpyaxistypelogtitletextSteps Per 
EpisodeconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatay@DCC6CC?CAC$B BB4B BLBBBBAC?C@CACBCCCDCECFCGCHCICJCKCLCMCNCOCPCQCRCSCTCUCVCWCXCYCZC[C\C]C^C_C`CaCbCcCdCeCfCgChCiCjCkClCmCnCoCpCqCrCsCtCuCvCwCxCyCzC{C|C}C~CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC59c6be96e-38f7-11f0-2d30-a71f02755abc/d8c715e8e34d7d99layoutxaxis1yanchorbottomtickvalsAtitlefontsizeA standoff?text# Cars at second locationautomarginædomainff>linewidth@mirroræanchory1linecolorwhiteyaxis1tickvalsAtitlepadlstandoff?text# Cars at first locationautomarginædomain?linewidth@mirroræanchorx1linecolorwhitetemplatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolo
rwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartype
barpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthxaxis2yanchorbottomtickvalsAtitlefontsizeA standoff?text# Cars at second locationautomarginædomain ??linewidth@mirroræanchory2linecolorwhitemarginlBHbBHrBHtBpyaxis2tickvalsAtitlepadlstandoff?text# Cars at first locationautomarginædomain?linewidth@mirroræanchorx2linecolorwhiteannotationsyanchorbottomxanchorcentery?fontsizeAshowarrow¤yrefpapertext$\pi_{30}$xrefpaperx>fffyanchorbottomxanchorcentery?fontsizeAshowarrow¤yrefpapertext$v_{\pi_{30}}$xrefpaperx?FffconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatacolorbarthickness@xaxisx1yT?@@@@@@@AA A0A@APA`ApAAAAAAtypeheatmapcolorscaleRdBuyaxisy1zT?@@@@@@@@@@@@@@@@T??@@@@@@@@@@@@@@@@T?@@@@@@@@@@@@@@@@@@T??@@@@@@@@@@@@@@@@@@@T????@@@@@@@@@@@@@@T??????@@@@@@T????@@@T?@@T??@T?@T?@T??T@?T@?T@?T@?T@@T@T@T@T@@transposeáxT?@@@@@@@AA A0A@APA`ApAAAAAAcolorbarthickness@xaxisx2yT?@@@@@@@AA A0A@APA`ApAAAAAAtypeheatmapcolorscaleBlueredyaxisy2zTCCJC CTCyCCD0CHCs;C CҿCRCC D(D<1D7$DfDD:DT C&CnGC{CCuCYC;+CwBCl5CCC}%D@_DDDsD3D5vD-*DfDTCwC4CCC:`CCC_&CCD0JDDDD D#D D D D DT9UCbEC CbjCmsC%CCCmD6DFDDtDD9DO DT DB D D DiP DT'C*CCC& CTCC%DpDDrD7DjD- DM D ^ D] DF DDDHDTXCpHCCfC>eCrD,D$ADcD: DtDD D_ D- DU5 Dv,DD D7~DPDT1CsC,CCBDDD<DDDP D D3 D DN D~DDDbDKD}DTxCgCpC<D@5DYDښDrD~D D D ; DY DiDkDx^D@D;D%D\DDT%8C&C8CD>DDgDD7 D| D D D5DNDDD|D^D} DDJDTLCpCD.DD*DW D D D D;D]RDHYDQDy;DDXDFD:DӿD%DT,C DeDVDXyD0 D D Dg DҗDD"DջDDDVDDDYD<DB5DTn3DԩDlD%D D DB D DjDDDD) DFDDAD;DDjDD6DT½D3DDDF D8< D DDFDeDqoDfD$ND&D6DLDWSDDnDDQ,DTg6DDD> D D DDRkD{DD!DDhDMD DgD]DhDfDDkDTWDoDjf D D]] D DmD)DDDDDDfDDD3[DDKSDDDTRDjD) DA DƪD{EDDADDCD^ D6DܴDqDXDDILDDY4D?DDT>Dܲ D$ DDDx|DDD4D4DDDD{nDDaD0DD" DYDDTvDz D7 D[HDdDyDYDj2D^FD?D]"DrDDf\DxDTDDwDDDTDTg D DZDSgD-DLD\ D7D 
ED7DDIDD9DDXD6D]9DGDD~DT D* DfDnD/DdDDc%Dw,DDDbDaD`D\D6DDD;D|DDT< Da DRDWDDDD~DDeDD-DkDPDTD<DD&DDDD:DtransposeáxT?@@@@@@@AA A0A@APA`ApAAAAAA59c6be96e-38f7-11f0-2d30-a71f02755abc/24aa7574d5705350layoutxaxistitletextStatetemplatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatt
erpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthmarginlBHbBHrBHtBptitleEstimated Value with TD(0)configshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatalinecolorblacky*>>?*?UUU?typescatternameTrue 
valuesxABCDEy?????typescattername0 episodesxABCDEyff>????typescattername1 episodesxABCDEyi]>Hf>?կ?Ȍ*?typescattername10 episodesxABCDEyH>> ?%M?Ei?typescattername100 episodesxABCDE59c6be96e-38f7-11f0-2d30-a71f02755abc/ac757a3486dcd2e1layoutautosize§paddingxaxisshowlineégridcolorblacktickvals(?@@@@@@@AA Arange?0Aticktext(???@@?linecolorblackshowgridègridwith?zerolineåtitleWind Valuesmirrorípaper_bgcolorrgba(0, 0, 0, 0)templatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationdefaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorw
hiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymark
ercolorbarticksoutlinewidthheightCRmarginlBHbBHrBHtBpyaxisshowgridèshowlineégridcolorblackgridwidth?tickvals?@@@@@@@range?AmirrorèticktextlinecolorblacktitlefontsizeA`textSarsa policy
Path Examplex?widthCconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatashowlegend¤modetexttextpositionlefty@typescattertextSx?showlegend¤modetexttextpositionlefty@typescattertextGxAshowlegend¤modelineslinecolorbluey@`@typescatternameOptimal Pathx??showlegend¤modelineslinecolorbluey`@ @typescatternameOptimal Pathx? @showlegend¤modelineslinecolorbluey @?typescatternameOptimal Pathx @`@showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx`@@showlegend¤modelineslinecolorbluey? @typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey @ @typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey @ @typescatternameOptimal Pathx@@showlegend¤modelineslinecolorbluey @@typescatternameOptimal Pathx@A59c6be96e-38f7-11f0-2d30-a71f02755abc/b5c0b7878012e9e3layoutautosize§paddingxaxisshowlineégridcolorblacktickvals ?@@@range?@ticktext(???@@?linecolorblackshowgridègridwith?zerolineåtitleWind Valuesmirrorípaper_bgcolorrgba(0, 0, 0, 0)templatelayoutcoloraxiscolorbarticksoutlinewidthxaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhitehovermodeclosestpaper_bgcolorwhitegeoshowlakesèshowlandélandcolor#E5ECF6bgcolorwhitesubunitcolorwhitelakecolorwhitecolorscalesequential#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921diverging#8e0152=ͧ#c51b7d>Lͧ#de77ae>#f1b6da>ͧ#fde0ef?#f7f7f7?#e6f5d0?333#b8e186?Lͧ#7fbc41?fff#4d9221?#276419sequentialminus#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921yaxisgridcolorwhitezerolinewidth@titlestandoffAptickszerolinecolorwhiteautomarginélinecolorwhiteshapedefaultslinecolor#2a3f5fhoverlabelalignleftmapboxstylelightpolarangularaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6radialaxisgridcolorwhitetickslinecolorwhiteautotypenumbersstrictfontcolor#2a3f5fternaryaaxisgridcolorwhitetickslinecolorwhitebgcolor#E5ECF6caxisgridcolorwhitetickslinecolorwhitebaxisgridcolorwhitetickslinecolorwhiteannotationd
efaultsarrowheadarrowwidth?arrowcolor#2a3f5fplot_bgcolor#E5ECF6titlex=Lͥscenexaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitezaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhiteyaxisgridcolorwhitegridwidth@backgroundcolor#E5ECF6ticksshowbackgroundízerolinecolorwhitelinecolorwhitecolorway#636efa#EF553B#00cc96#ab63fa#FFA15A#19d3f3#FF6692#B6E880#FF97FF#FECB52datascatterpolargltypescatterpolarglmarkercolorbarticksoutlinewidthcarpetbaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitetypecarpetaaxisgridcolorwhiteendlinecolor#2a3f5fminorgridcolorwhitestartlinecolor#2a3f5flinecolorwhitescatterpolartypescatterpolarmarkercolorbarticksoutlinewidthparcoordslinecolorbarticksoutlinewidthtypeparcoordsscattertypescattermarkercolorbarticksoutlinewidthhistogram2dcontourcolorbarticksoutlinewidthtypehistogram2dcontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcolorbarticksoutlinewidthtypecontourcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattercarpettypescattercarpetmarkercolorbarticksoutlinewidthmesh3dcolorbarticksoutlinewidthtypemesh3dsurfacecolorbarticksoutlinewidthtypesurfacecolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scattermapboxtypescattermapboxmarkercolorbarticksoutlinewidthscattergeotypescattergeomarkercolorbarticksoutlinewidthhistogramtypehistogrammarkercolorbarticksoutlinewidthpietypepieautomarginêchoroplethcolorbarticksoutlinewidthtypechoroplethheatmapglcolorbarticksoutlinewidthtypeheatmapglcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921bartypebarerror_ycolor#2a3f5ferror_xcolor#2a3f5fmarkerlinecolor#E5ECF6width?heatmapcolorbarticksoutlinewidthtypeheatmapcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#
d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921contourcarpetcolorbarticksoutlinewidthtypecontourcarpettabletypetableheaderlinecolorwhitefillcolor#C8D4E3cellslinecolorwhitefillcolor#EBF0F8scatter3dlinecolorbarticksoutlinewidthtypescatter3dmarkercolorbarticksoutlinewidthbarpolartypebarpolarmarkerlinecolor#E5ECF6width?scattergltypescatterglmarkercolorbarticksoutlinewidthhistogram2dcolorbarticksoutlinewidthtypehistogram2dcolorscale#0d0887=9#46039f>c9#7201a8>#9c179e>9#bd3786?8#d8576b?*#ed7953?Gr#fb9f3a?c9#fdca26?#f0f921scatterternarytypescatterternarymarkercolorbarticksoutlinewidthheightCHmarginlBHbBHrBHtBpyaxisshowgridèshowlineégridcolorblackgridwidth?tickvals ?@@@range?@mirrorèticktextlinecolorblacktitlefontsizeA`text Optimal policy
path examplex?widthCHconfigshowLink¨editableªresponsiveêstaticPlotªscrollZoomæframesdatashowlegend¤modetexttextpositionlefty?typescattertextSx?showlegend¤modetexttextpositionlefty`@typescattertextGx`@showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx? @showlegend¤modelineslinecolorbluey??typescatternameOptimal Pathx @`@showlegend¤modelineslinecolorbluey? @typescatternameOptimal Pathx`@`@showlegend¤modelineslinecolorbluey @`@typescatternameOptimal Pathx`@`@nbpkginstall_time_ns4tinstantiatedòinstalled_versionsSerializationstdlibStatisticsstdlibStatsBase0.34.3Transducers0.4.84LinearAlgebrastdlibPlutoUI0.7.60HypertextLiteral0.9.5Latexify0.16.5LaTeXStrings1.3.1PlutoPlotly0.4.6terminal_outputsStatistics@ Resolving... ===  No Changes to `/tmp/jl_wBFXpg/Project.toml`  No Changes to `/tmp/jl_wBFXpg/Manifest.toml` Instantiating... === Precompiling... ===  Activating project at `/tmp/jl_wBFXpg`Transducers@ Resolving... ===  No Changes to `/tmp/jl_wBFXpg/Project.toml`  No Changes to `/tmp/jl_wBFXpg/Manifest.toml` Instantiating... === Precompiling... ===  Activating project at `/tmp/jl_wBFXpg`PlutoUI@ Resolving... ===  No Changes to `/tmp/jl_wBFXpg/Project.toml`  No Changes to `/tmp/jl_wBFXpg/Manifest.toml` Instantiating... === Precompiling... ===  Activating project at `/tmp/jl_wBFXpg`Serialization@ Resolving... ===  No Changes to `/tmp/jl_wBFXpg/Project.toml`  No Changes to `/tmp/jl_wBFXpg/Manifest.toml` Instantiating... === Precompiling... ===  Activating project at `/tmp/jl_wBFXpg`LinearAlgebra@ Resolving... ===  No Changes to `/tmp/jl_wBFXpg/Project.toml`  No Changes to `/tmp/jl_wBFXpg/Manifest.toml` Instantiating... === Precompiling... ===  Activating project at `/tmp/jl_wBFXpg`LaTeXStrings@ Resolving... ===  No Changes to `/tmp/jl_wBFXpg/Project.toml`  No Changes to `/tmp/jl_wBFXpg/Manifest.toml` Instantiating... === Precompiling... ===  Activating project at `/tmp/jl_wBFXpg`StatsBase@ Resolving... 
function create_noisy_gridworld_mdp(mdp::MDP_TD, min_reward, max_reward)
    #this only works when the mdp is deterministic. add a version for the stochastic wind example
    #ptf[i_s′, i_r, i_a, i_s] is the probability of reaching state s′ with reward index i_r after taking action a in state s
    ptf = zeros(Float32, length(mdp.states), 3, length(mdp.actions), length(mdp.states))
    for s in mdp.states
        i_s = mdp.statelookup[s]
        if mdp.isterm(s)
            #terminal states transition to themselves with the first reward (0) with probability 1
            ptf[i_s, 1, :, i_s] .= 1.0f0
        else
            for a in mdp.actions
                (r, s′) = mdp.step(s, a)
                i_a = mdp.actionlookup[a]
                i_s′ = mdp.statelookup[s′]
                #the deterministic reward is replaced by min_reward or max_reward, each with probability 0.5
                ptf[i_s′, 2, i_a, i_s] = 0.5f0
                ptf[i_s′, 3, i_a, i_s] = 0.5f0
            end
        end
    end
    FiniteMDP(mdp.states, mdp.actions, [0.0f0, min_reward, max_reward], ptf)
end

md"""
## 6.2 Advantages of TD Prediction Methods

TD methods can learn before an episode terminates, which is an advantage in environments with very long episodes. In continuing problems, Monte Carlo methods may not be suitable at all because there is no termination condition. Furthermore, in off-policy learning, Monte Carlo methods must ignore returns whenever exploratory actions (ones never taken by the target policy) are taken later in the episode, whereas TD methods can learn from individual non-exploratory steps regardless of what happens later on.

For any fixed policy $\pi$, TD(0) has been proved to converge to $v_\pi$: in the mean for a sufficiently small constant step-size parameter, and with probability 1 if the step-size parameter decreases according to the usual stochastic approximation conditions (2.7). Since both TD and Monte Carlo methods converge, a natural question is which converges faster, that is, which makes more efficient use of limited data. No one has proved mathematically that one method converges faster than the other, nor is it clear how to even pose the question formally; however, TD methods have usually been found to converge faster than constant-α MC methods on stochastic tasks, as illustrated in Example 6.2.
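
The difference in when each method can learn can be made concrete with a minimal sketch. The helper names `td0_step!` and `mc_episode_update!` below are hypothetical and are not part of this notebook; states are assumed to be plain integer indices, and `rewards[t]` is assumed to be the reward received on leaving `states[t]`.

```julia
# Hypothetical helpers for illustration only; the names are not from the notebook.

# TD(0): V[s] can be updated as soon as the transition (s, r, s′) is observed.
function td0_step!(V::Vector{T}, s::Int, r::T, s′::Int, α::T, γ::T; terminal::Bool = false) where {T<:AbstractFloat}
    # the value of a terminal successor is taken to be zero
    target = r + (terminal ? zero(T) : γ * V[s′])
    V[s] += α * (target - V[s])
    return V
end

# Constant-α every-visit Monte Carlo: the full episode is needed to compute returns.
function mc_episode_update!(V::Vector{T}, states::Vector{Int}, rewards::Vector{T}, α::T, γ::T) where {T<:AbstractFloat}
    g = zero(T)
    for t in reverse(eachindex(states))
        g = rewards[t] + γ * g              # return following time t
        V[states[t]] += α * (g - V[states[t]])
    end
    return V
end

α = 0.1f0
γ = 1.0f0

# One episode visiting states 1, 2, 3 and terminating with a reward of 1.
V_td = fill(0.5f0, 3)
td0_step!(V_td, 1, 0.0f0, 2, α, γ)
td0_step!(V_td, 2, 0.0f0, 3, α, γ)
td0_step!(V_td, 3, 1.0f0, 3, α, γ; terminal = true)

V_mc = fill(0.5f0, 3)
mc_episode_update!(V_mc, [1, 2, 3], Float32[0, 0, 1], α, γ)
```

After this single episode, TD(0) has changed only the value of the last state visited, since the earlier transitions had zero TD error, while the Monte Carlo update has moved all three values toward the observed return of 1. Each TD update was available immediately after its transition; the Monte Carlo update had to wait for the episode to end.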
"""metadatashow_logsèdisabled®skip_as_script«code_folded$b3d4117f-7db4-43a6-8427-c08f3542d71fcell_id$b3d4117f-7db4-43a6-8427-c08f3542d71fcode1poisson(n, λ) = exp(-λ) * (λ^n) / factorial(n)metadatashow_logsèdisabled®skip_as_script«code_folded$3ed12c33-ab0a-49b1-b9e7-c4305ba35767cell_id$3ed12c33-ab0a-49b1-b9e7-c4305ba35767codeZ#take a step in the environment from state s using policy π and generate the subsequent action selection as well function init_step(mdp::MDP_TD{S, A, F, G, H}, π::Matrix{T}, s::S) where {S, A, F<:Function, G<:Function, H<:Function, T<:Real} i_s = mdp.statelookup[s] i_a = sample_action(π, i_s) a = mdp.actions[i_a] return (i_s, i_a, a) endmetadatashow_logsèdisabled®skip_as_script«code_folded$209881b3-3ac8-490e-97bd-fa5ae24a39f5cell_id$209881b3-3ac8-490e-97bd-fa5ae24a39f5code#update the value function with the TD0 method using a single episode function update_value!(V::Vector{T}, ::TD0, α::T, γ::T, mdp::MDP_TD{S, A, F, G, H}, states::Vector{S}, actions::Vector{A}, rewards::Vector{T}) where {T<:AbstractFloat, S, A, F<:Function, G<:Function, H<:Function} l = length(states) err = zero(T) for i in 1:l-1 s = states[i] s′ = states[i+1] i_s = mdp.statelookup[s] v_old = V[i_s] i_s′ = mdp.statelookup[s′] v_new = v_old + α*(rewards[i] + γ*V[i_s′] - v_old) err = max(err, calc_error(v_old, v_new)) V[i_s] = v_new end #perform update for terminal state s = last(states) i_s = mdp.statelookup[s] v_old = V[i_s] v_new = v_old + α*(rewards[l] - v_old) err = max(err, calc_error(v_old, v_new)) V[i_s] = v_new return err endmetadatashow_logsèdisabled®skip_as_script«code_folded$6e06bd39-486f-425a-bbca-bf363b58988ccell_id$6e06bd39-486f-425a-bbca-bf363b58988ccodemd""" ## 6.6 Expected Sarsa Consider the learning algorithm that is just like Q-learning except that intsead of the maximization over next state-action pairs it uses the expected value, taking into account how likely each action is under the current policy. 
That is, consider the algorithm with the update rule $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left [ R_{t+1} + \gamma \mathbb{E}_\pi [Q(S_{t+1}, A_{t+1}) \vert S_{t+1}] - Q(S_t, A_t) \right ]$ $= Q(S_t, A_t) + \alpha \left [ R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})Q(S_{t+1}, a) - Q(S_t, A_t) \right ]$ but that otherwise follows the scheme of Q-learning. Given the next state, $S_{t+1}$, this algorithm moves *deterministically* in the same direction as Sarsa moves *in expectation*, and accordingly it is called *Expected Sarsa*. Although more computationally complex than Sarsa, it eliminates the variance due to the random selection of $A_{t+1}$. In general, Expected Sarsa might use a policy different from the target policy π to generate behavior, in which case it becomes an off-policy algorithm. For example, suppose π is the greedy policy while behavior is more exploratory; then Expected Sarsa is exactly Q-learning. In this sense Expected Sarsa subsumes and generalizes Q-learning while reliably improving over Sarsa. 
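
The update rule can be sketched for a single transition as follows. This is a minimal illustration, not the notebook's implementation: the function name `expected_sarsa_update!` and its argument order are assumptions, and `Q` and `π` are assumed to be stored as (actions × states) matrices, which is the layout the policy checks in this notebook expect.

```julia
# Hypothetical helper: one Expected Sarsa update for the transition
# (S_t = i_s, A_t = i_a, R_{t+1} = r, S_{t+1} = i_s′).
# Q and π are assumed to be (actions × states) matrices.
function expected_sarsa_update!(Q::Matrix{T}, π::Matrix{T}, i_s::Int, i_a::Int,
                                r::T, i_s′::Int, α::T, γ::T) where T<:AbstractFloat
    # Expected value of Q(S_{t+1}, ·) under the target policy π(·|S_{t+1})
    expected_q = sum(π[a, i_s′] * Q[a, i_s′] for a in axes(Q, 1))
    # Move Q(S_t, A_t) toward the bootstrapped target
    Q[i_a, i_s] += α * (r + γ * expected_q - Q[i_a, i_s])
    return Q
end

# Usage with two actions and two states; column 2 of the policy is greedy,
# so the target for a transition into state 2 matches the Q-learning target.
Q_demo = [0.0 1.0; 0.0 3.0]
π_demo = [0.5 0.0; 0.5 1.0]
expected_sarsa_update!(Q_demo, π_demo, 1, 1, 1.0, 2, 0.5, 1.0)
```

With a greedy target policy the weighted sum collapses to the maximum over actions, which is why the same function covers Q-learning as a special case.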
"""metadatashow_logsèdisabled®skip_as_script«code_folded$e039a5be-4b59-4023-be97-2d1de970be27cell_id$e039a5be-4b59-4023-be97-2d1de970be27code,md""" ### Double Learning Implementation """metadatashow_logsèdisabled®skip_as_script«code_folded$2786101e-d365-4d6a-8de7-b9794499efb4cell_id$2786101e-d365-4d6a-8de7-b9794499efb4codefunction example_6_2(;l = 5, max_episodes = 100, nruns = 100, vinit = 0.5f0) mrp = make_mrp(l = l) π = make_random_policy(mrp) true_values = collect(1:l) ./ (l+1) get_rw_names(l) = string.(Iterators.take('A':'Z', l) |> collect) (_, td0_est) = tabular_TD0_pred_V(π, mrp, 0.1f0, 1.0f0; num_episodes = 100, vinit = 0.5f0, save_states = collect(1:l)) traces = [scatter(x = get_rw_names(l), y = td0_est[:, n], name = "$(n-1) episodes") for n in [1, 2, 11, 101]] tv_trace = scatter(x = get_rw_names(l), y = true_values, name = "True values", line_color="black") p1 = plot([tv_trace; traces], Layout(title = "Estimated Value with TD(0)", xaxis_title = "State")) calc_rms(v_saves) = [sqrt(mean((v .- true_values) .^2)) for v in eachcol(v_saves)] run_estimate(f, α, n) = f(π, mrp, α, 1.0f0; num_episodes = n, vinit = vinit, save_states = collect(1:l)) td_αs = [0.05f0, 0.1f0, 0.15f0] mc_αs = 0.01f0:0.01f0:0.04f0 |> collect td_est = [mean([calc_rms(last(run_estimate(tabular_TD0_pred_V, α, max_episodes))) for _ in 1:nruns]) for α in td_αs] mc_est = [mean([calc_rms(last(run_estimate(monte_carlo_pred_V, α, max_episodes))) for _ in 1:nruns]) for α in mc_αs] td_traces = [scatter(x = collect(1:max_episodes), y = td_est[i], name = "$(i == 1 ? "TD" : "") α = $(td_αs[i])", line_color = "rgba(0, 0, 255, $(i/3))") for i in eachindex(td_est)] mc_traces = [scatter(x = collect(1:max_episodes), y = mc_est[i], name = "$(i == 1 ? "MC" : "") α = $(mc_αs[i])", line_color = "rgba(255, 0, 0, $(i/5))") for i in eachindex(mc_est)] p2 = plot([td_traces; mc_traces], Layout(xaxis_title = "Walks / Episodes", title = "Empirical RMS error, averaged over states")) @htl("""
$p1 $p2
The right graph shows learning curves for the two methods for various values of α. The performance measure shown is the root mean square (RMS) error between the vlue function learned and the true value function, averaged over the $l states, then averaged over $nruns runs. In all cases the approximate value function was initialized to the intermediate value 0.5. The TD method was consistently better than the MC method on this task.""") end metadatashow_logsèdisabled®skip_as_script«code_folded$14b456f9-5fd1-4340-a3c7-ab9b91b4e3e0cell_id$14b456f9-5fd1-4340-a3c7-ab9b91b4e3e0codehtml""" """metadatashow_logsèdisabled®skip_as_script«code_folded$ec285c96-4a75-4af6-8898-ec3176fa34c6cell_id$ec285c96-4a75-4af6-8898-ec3176fa34c6codefunction make_windy_gridworld(;actions = rook_actions, apply_wind = apply_wind, sterm = GridworldState(8, 4), start = GridworldState(1, 4), xmax = 10, ymax = 7, winds = wind_vals, get_step_reward = () -> -1f0) states = [GridworldState(x, y) for x in 1:xmax for y in 1:ymax] boundstate(x::Int64, y::Int64) = (clamp(x, 1, xmax), clamp(y, 1, ymax)) function step(s::GridworldState, a::GridworldAction) w = winds[s.x] (x1, y1) = move(a, s.x, s.y) (x2, y2) = apply_wind(w, x1, y1) GridworldState(boundstate(x2, y2)...) end tr(s0::GridworldState, a0::GridworldAction) = (get_step_reward(), step(s0, a0)) isterm(s::GridworldState) = s == sterm MDP_TD(states, actions, () -> start, tr, isterm) end metadatashow_logsèdisabled®skip_as_script«code_folded$cafedde8-be94-4697-a511-510a5fea0155cell_id$cafedde8-be94-4697-a511-510a5fea0155code0figure_6_3(cliffworld; load_file = fig_6_3_load)metadatashow_logsèdisabled®skip_as_script«code_folded$d526a3a4-63cc-4f94-8f55-98c9a4a9d134cell_id$d526a3a4-63cc-4f94-8f55-98c9a4a9d134codefunction double_q_learning(mdp::MDP_TD{S, A, F, G, H}, α::T, γ::T; num_episodes = 1000, qinit = zero(T), ϵinit = one(T)/10, Qinit::Matrix{T} = initialize_state_action_value(mdp; qinit=qinit), decay_ϵ = false, target_policy_function! 
= (v, ϵ, s) -> make_greedy_policy!(v), behavior_policy_function! = (v, ϵ, s) -> make_ϵ_greedy_policy!(v, ϵ), πinit_target::Matrix{T} = create_greedy_policy(Qinit), πinit_behavior::Matrix{T} = create_ϵ_greedy_policy(Qinit, ϵinit), save_state::S = first(mdp.states), save_history = false) where {S, A, F, G, H, T<:AbstractFloat} double_expected_sarsa(mdp, α, γ; num_episodes = num_episodes, qinit = qinit, ϵinit = ϵinit, Qinit = Qinit, decay_ϵ = decay_ϵ, target_policy_function! = target_policy_function!, behavior_policy_function! = behavior_policy_function!, πinit_target = πinit_target, πinit_behavior = πinit_behavior, save_state = save_state, save_history = save_history) endmetadatashow_logsèdisabled®skip_as_script«code_folded$02f34da1-551f-4ce5-a588-7f3a14afd716cell_id$02f34da1-551f-4ce5-a588-7f3a14afd716codeconst wind_var = [-1, 0, 1]metadatashow_logsèdisabled®skip_as_script«code_folded$f11dca8f-5557-49fc-9720-35034eadba57cell_id$f11dca8f-5557-49fc-9720-35034eadba57codemd""" Consider a square gridworld in which the rewards for each step are -1.2 or 1.0 with equal probability. There is no wind and the allowed moves are just up, down, left, and right. The start is the lower left corner and the finish is the upper right corner. It is obvious that the expected reward for a step is -0.1, so the optimal policy is to move to the goal as quickly as possible which will take $(l-1) \times 2$ steps. For a 3x3 grid, this would be 4 steps, so $\mathbb{E} \{ G_0 \} = 4 \times -0.1 = -0.4$. Because the positive reward is so much larger than the expected value, we might expect a large maximization bias to confuse the training method and favor long episodes with expected values that are positive. Below are example solutions after thousands of episodes for each of the previously discussed methods. 
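
The arithmetic above, and the size of the bias it warns about, can be checked by exact enumeration. This is only a sketch: the two-action, one-sample-per-action setup is an assumption made for illustration, and the variable names are hypothetical.

```julia
# The two equally likely step rewards described in the text.
step_rewards = [-1.2, 1.0]

# True expected reward of a single step.
expected_step = sum(step_rewards) / length(step_rewards)

# Shortest corner-to-corner path on a 3x3 grid with up/down/left/right moves.
grid_size = 3
shortest_path_steps = 2 * (grid_size - 1)
expected_return = shortest_path_steps * expected_step

# Assumption for illustration: two actions with the same reward distribution,
# each estimated from a single sample. Enumerate all four equally likely
# sample pairs and average the larger estimate.
expected_max_estimate = sum(max(r1, r2) for r1 in step_rewards, r2 in step_rewards) /
                        length(step_rewards)^2
```

Under these assumptions the expected maximum of the two estimates is positive even though every action has a negative expected reward, which is the maximization bias the paragraph anticipates.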
The first solution shown is the correct optimal policy and value function using value iteration """metadatashow_logsèdisabled®skip_as_script«code_folded$4ddc7d99-0b79-4689-bd93-8798b105c0a2cell_id$4ddc7d99-0b79-4689-bd93-8798b105c0a2codegconst stochastic_gridworld = make_windy_gridworld(actions = king_actions, apply_wind = stochastic_wind)metadatashow_logsèdisabled®skip_as_script«code_folded$bd1029f9-d6a8-4c68-98cd-8af94297b521cell_id$bd1029f9-d6a8-4c68-98cd-8af94297b521codeوplot_path(mdp; title = "Random policy
path example", kwargs...) = plot_path(mdp, make_random_policy(mdp); title = title, kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$cb07a6a5-c50a-4900-9e5b-a17dc7ee5710cell_id$cb07a6a5-c50a-4900-9e5b-a17dc7ee5710codefunction make_greedy_policy!(v::AbstractVector{T}; c = 1000) where T<:Real (vmin, vmax) = extrema(v) if vmin == vmax v .= zero(T) v .= one(T) / length(v) else v .= (v .- vmax) ./ abs(vmin - vmax) v .= exp.(c .* v) v .= v ./ sum(v) end return v endmetadatashow_logsèdisabled®skip_as_script«code_folded$ddf3bb61-16c9-48c4-95d4-263260309762cell_id$ddf3bb61-16c9-48c4-95d4-263260309762codefunction exercise_6_5(;l = 5, max_episodes = 100, nruns = 100, α = 0.3f0, vinit = 0.5f0) mrp = make_mrp(l = l) π = make_random_policy(mrp) true_values = collect(1:l) ./ (l+1) get_rw_names(l) = string.(Iterators.take('A':'Z', l) |> collect) (_, td0_est) = tabular_TD0_pred_V(π, mrp, α, 1.0f0; num_episodes = 100, vinit = vinit, save_states = collect(1:l)) calc_rms(v_saves) = [sqrt(mean((v .- true_values) .^2)) for v in eachcol(v_saves)] run_estimate(f, α, n) = f(π, mrp, α, 1.0f0; num_episodes = n, vinit = vinit, save_states = collect(1:l)) rms = mean([calc_rms(last(run_estimate(tabular_TD0_pred_V, α, max_episodes))) for _ in 1:nruns]) traces = [scatter(x = get_rw_names(l), y = td0_est[:, n], name = "$(n-1) episodes") for n in [1, 2, 8, 16, 100]] tv_trace = scatter(x = get_rw_names(l), y = true_values, name = "True values", line_color="black") p1 = plot([tv_trace; traces], Layout(title = "Estimated Value with TD(0)
with α = $α", xaxis_title = "State")) rmstrace = scatter(x = 1:max_episodes, y = rms, showlegend=false, name = "RMS error") p2 = plot(rmstrace, Layout(xaxis_title = "Walks / Episodes", title = "Empirical RMS error, averaged over states")) [p1 p2] end metadatashow_logsèdisabled®skip_as_script«code_folded$d7566d1b-8938-4e2c-8c54-124f790e72aecell_id$d7566d1b-8938-4e2c-8c54-124f790e72aecode}begin abstract type CompleteMDP{T<:Real} end struct FiniteMDP{T<:Real, S, A} <: CompleteMDP{T} states::Vector{S} actions::Vector{A} rewards::Vector{T} # ptf::Dict{Tuple{S, A}, Matrix{T}} ptf::Array{T, 4} action_scratch::Vector{T} state_scratch::Vector{T} reward_scratch::Vector{T} state_index::Dict{S, Int64} action_index::Dict{A, Int64} function FiniteMDP{T, S, A}(states::Vector{S}, actions::Vector{A}, rewards::Vector{T}, ptf::Array{T, 4}) where {T <: Real, S, A} new(states, actions, rewards, ptf, Vector{T}(undef, length(actions)), Vector{T}(undef, length(states)+1), Vector{T}(undef, length(rewards)), Dict(zip(states, eachindex(states))), Dict(zip(actions, eachindex(actions)))) end end FiniteMDP(states::Vector{S}, actions::Vector{A}, rewards::Vector{T}, ptf::Array{T, 4}) where {T <: Real, S, A} = FiniteMDP{T, S, A}(states, actions, rewards, ptf) endmetadatashow_logsèdisabled®skip_as_script«code_folded$42799973-9884-4a0e-b29a-039890e92d21cell_id$42799973-9884-4a0e-b29a-039890e92d21codemd""" > ### *Exercise 6.13* > What are the update equations for Double Expected Sarsa with an ϵ-greedy target policy? 
For Q-learning the action-value update equation is: $Q(S_t, A_t) = Q(S_t, A_t) + \alpha[R_{t+1} + \gamma \text{max}_a Q(S_{t+1}, a) - Q(S_t, A_t)]$ For expected Sarsa the action-value update equation is: $Q(S_t, A_t) = Q(S_t, A_t) + \alpha [ R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})Q(S_{t+1}, a) - Q(S_t, A_t)]$ For double Q-learning, the twin action-value update equations are: $Q_1(S_t, A_t) = Q_1(S_t, A_t) + \alpha [ R_{t+1} + \gamma Q_2(S_{t+1}, \text{argmax}_a Q_1(S_{t+1}, a)) - Q_1(S_t, A_t)]$ $Q_2(S_t, A_t) = Q_2(S_t, A_t) + \alpha [ R_{t+1} + \gamma Q_1(S_{t+1}, \text{argmax}_a Q_2(S_{t+1}, a)) - Q_2(S_t, A_t)]$ For double expected sarsa, we have two action-value estimates like in Double Q-learining, but the bootstrap calculation is an expected value calculation using each value function's target policy. In this case that target is the $\epsilon$-greedy policy rather than the greedy policy in Q-learning. The expected value uses the probabilities from the matching value function but the values from the other one: With 50% probability: $Q_1(S_t, A_t) = Q_1(S_t, A_t) + \alpha [ R_{t+1} + \gamma \sum_a \pi_1(a|S_{t+1}) Q_2(S_{t+1}, a) - Q_1(S_t, A_t)]$ and make $\pi_1$ $\epsilon$-greedy with respect to $Q_1$ With 50% probability: $Q_2(S_t, A_t) = Q_2(S_t, A_t) + \alpha [ R_{t+1} + \gamma \sum_a \pi_2(a|S_{t+1}) Q_1(S_{t+1}, a) - Q_2(S_t, A_t)]$ and make $\pi_2$ $\epsilon$-greedy with respect to $Q_2$ """metadatashow_logsèdisabled®skip_as_script«code_folded$187fc682-2282-46ca-b988-c9de438f36fdcell_id$187fc682-2282-46ca-b988-c9de438f36fdcodeu@bind params_6_2 confirm(PlutoUI.combine() do Child md""" Batch Training of Random Walk Task ||| |:-:|:-:| |$\alpha$| $(Child(:α, Slider(0.001:0.001:0.1, default = 0.01, show_value=true)))| |Number of States | $(Child(:l, Slider(3:10, default = 5, show_value=true)))| |Maximum Episodes | $(Child(:ep, Slider(100:1000, default = 100, show_value=true)))| """ 
end)metadatashow_logsèdisabled®skip_as_script«code_folded$8fe856ec-5f0a-4483-bb7d-3f6fe270b6f3cell_id$8fe856ec-5f0a-4483-bb7d-3f6fe270b6f3code*md""" ### Example 6.8: Noisy Gridworld """metadatashow_logsèdisabled®skip_as_script«code_folded$8e15f4b5-0dc7-47a5-9477-9f4d8807b331cell_id$8e15f4b5-0dc7-47a5-9477-9f4d8807b331codeٗconst stochastic_gridworld_mdp_dp = create_stochastic_gridworld_mdp(10, 7, GridworldState(1, 4), GridworldState(8, 4), wind_vals, king_actions, -1.0f0)metadatashow_logsèdisabled®skip_as_script«code_folded$9d01c0ef-6313-4091-b444-3e9765aba90ccell_id$9d01c0ef-6313-4091-b444-3e9765aba90ccode7md""" ### Windy Gridworld Solutions with Q-Learning """metadatashow_logsèdisabled®skip_as_script«code_folded$62a9a36a-bedb-4f5a-80a4-2d4111a65c12cell_id$62a9a36a-bedb-4f5a-80a4-2d4111a65c12code@htl("""
$(md"""$\cdots \:$""")
$(md"""$S_t$""")
$(md"""$A_t$""")
$(md"""$R_{t+1}$""")
$(md"""$S_{t+1}$""")
$(md"""$A_{t+1}$""")
$(md"""$R_{t+2}$""")
$(md"""$S_{t+2}$""")
$(md"""$A_{t+2}$""")
$(md"""$R_{t+3}$""")
$(md"""$S_{t+3}$""")
$(md"""$\:\cdots$""")
""")metadatashow_logsèdisabled®skip_as_script«code_folded$2651af2d-56a8-4f7e-a56a-45cabd665c72cell_id$2651af2d-56a8-4f7e-a56a-45cabd665c72code4 max_bias_visualization_comp(;max_visual_params2...)metadatashow_logsèdisabled®skip_as_script«code_folded$620a6426-cb29-4010-997b-aa4f9d5f8fb0cell_id$620a6426-cb29-4010-997b-aa4f9d5f8fb0codeebegin abstract type BatchMethod end struct TD0 <: BatchMethod end struct MC <: BatchMethod end endmetadatashow_logsèdisabled®skip_as_script«code_folded$889611fb-7dac-4769-9251-9a90e3a1422fcell_id$889611fb-7dac-4769-9251-9a90e3a1422fcodeSfunction statestyle(s) """ .circlestate.$s::before { content: '$s'; } """ endmetadatashow_logsèdisabled®skip_as_script«code_folded$5455fc97-55cb-4b0e-a3be-9433ccc96fc0cell_id$5455fc97-55cb-4b0e-a3be-9433ccc96fc0codemd""" Number of States: $(@bind nstates Slider(3:10, default = 5, show_value=true)) Animation Interval (s): $(@bind delay Slider(0.1:0.1:1.0, default = 0.5, show_value=true)) $(@bind start_mrp Button("New Random Walk")) """metadatashow_logsèdisabled®skip_as_script«code_folded$24a441c8-7aaf-4642-b245-5e1201456d67cell_id$24a441c8-7aaf-4642-b245-5e1201456d67codefunction check_policy(π::Matrix{T}, mdp::MDP_TD) where {T <: AbstractFloat} #checks to make sure that a policy is defined over the same space as an MDP (n, m) = size(π) num_actions = length(mdp.actions) num_states = length(mdp.states) @assert n == num_actions "The policy distribution length $n does not match the number of actions in the mdp of $(num_actions)" @assert m == num_states "The policy is defined over $m states which does not match the mdp state count of $num_states" return nothing endmetadatashow_logsèdisabled®skip_as_script«code_folded$1e45a661-c2e1-40c2-b27b-5f80f95efdabcell_id$1e45a661-c2e1-40c2-b27b-5f80f95efdabcodeshow_gridworld_policy_value(stochastic_gridworld, q_learning(stochastic_gridworld, 0.1f0, 1.0f0; num_episodes = 2000); action_display = king_action_display, policy_display = 
display_king_policy)metadatashow_logsèdisabled®skip_as_script«code_folded$21fbdc3b-4444-4f56-9934-fb58e184d685cell_id$21fbdc3b-4444-4f56-9934-fb58e184d685codeNmd""" Load existing figure: $(@bind fig_6_3_load CheckBox(default = true)) """metadatashow_logsèdisabled®skip_as_script«code_folded$30e663da-282c-42ff-8171-dbe3c5c467c6cell_id$30e663da-282c-42ff-8171-dbe3c5c467c6codefunction makepolicyvalueplots(mdp::CompleteMDP, v::Vector{T}, π::Matrix{T}, iter::Integer; policycolorscale = "RdBu", valuecolorscale = "Bluered", kwargs...) where T <: Real (policymap, valuemap) = makepolicyvaluemaps(mdp, v, π) layout = Layout(autosize = false, height = 220, width = 230, paper_bgcolor = "rgba(30, 30, 30, 1)", margin = attr(l = 0, t = 0, r = 0, b = 0, padding = 0), xaxis = attr(title = attr(text = "# Cars at second location", font_size = 10, standoff = 1, automargin = true), tickvals = [0, 20], linecolor = "white", mirror = true, linewidth = 2, yanchor = "bottom"), yaxis = attr(title = attr(text = "# Cars at first location", standoff = 1, automargin = true, pad_l = 0), tickvals = [0, 20], linecolor = "white", mirror = true, linewidth = 2), font_color = "gray", font_size = 9) function makeplot(z, colorscale; kwargs...) 
tr = heatmap(;x = 0:20, y = 0:20, z = z, colorscale = colorscale, colorbar_thickness = 2) plot(tr, layout) end vtitle = L"v_{\pi_{%$(iter-1)}}" policyplot = relayout(makeplot(policymap, policycolorscale), (title = attr(text = latexify("π_$(iter-1)"), x = 0.5, xanchor = "center", font_size = 20, automargin = true, yref = "paper", yanchor = "bottom", pad_b = 10))) valueplot = relayout(makeplot(valuemap, valuecolorscale), (title = attr(text = vtitle, x = 0.5, xanchor = "center", font_size = 20, automargin = true, yref = "paper", yanchor = "bottom", pad_b = 10))) (π = relayout(policyplot, kwargs), v = relayout(valueplot, kwargs)) endmetadatashow_logsèdisabled®skip_as_script«code_folded$9651f823-e1cd-4e6e-9ce0-be9ea1c3f0a4cell_id$9651f823-e1cd-4e6e-9ce0-be9ea1c3f0a4codefunction display_king_policy(v::Vector{T}; scale = 1.0) where T<:AbstractFloat @htl("""
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$84a71bf8-0d66-42cd-ac7b-589d63a16edacell_id$84a71bf8-0d66-42cd-ac7b-589d63a16edacodefunction create_greedy_policy(Q::Matrix{T}; c = 1000, π = copy(Q)) where T<:Real vhold = zeros(T, size(Q, 1)) for j in 1:size(Q, 2) vhold .= Q[:, j] make_greedy_policy!(vhold; c = c) π[:, j] .= vhold end return π endmetadatashow_logsèdisabled®skip_as_script«code_folded$c9f7646a-ec01-4d90-9215-5027b7c1c885cell_id$c9f7646a-ec01-4d90-9215-5027b7c1c885code١md""" ### Q-learning Instability at Higher Learning Rate Learning Rate $\alpha$ $(@bind α_6_8 Slider(0.01f0:0.01f0:0.5f0, default = 0.3f0, show_value=true)) """metadatashow_logsèdisabled®skip_as_script«code_folded$8e34202a-f841-4464-9017-cd50194f7987cell_id$8e34202a-f841-4464-9017-cd50194f7987codeٟfunction make_random_policy(mdp::MDP_TD; init::T = 1.0f0) where T <: AbstractFloat ones(T, length(mdp.actions), length(mdp.states)) ./ length(mdp.actions) endmetadatashow_logsèdisabled®skip_as_script«code_folded$95245673-2c29-401e-bb4b-a39dc8172297cell_id$95245673-2c29-401e-bb4b-a39dc8172297codefunction create_gridworld_mdp(width, height, start, goal, wind, actions, step_reward) mdp = make_windy_gridworld(;actions = actions, apply_wind = apply_wind, sterm = goal, start = start, xmax = width, ymax = height, winds = wind_vals, get_step_reward = () -> step_reward) ptf = zeros(Float32, length(mdp.states), 2, length(mdp.actions), length(mdp.states)) for s in mdp.states i_s = mdp.statelookup[s] if mdp.isterm(s) ptf[i_s, 1, :, i_s] .= 1.0f0 else for a in mdp.actions w = wind[s.x] (r, s′) = mdp.step(s, a) i_a = mdp.actionlookup[a] i_s = mdp.statelookup[s] i_s′ = mdp.statelookup[s′] ptf[i_s′, 2, i_a, i_s] = 1.0f0 end end end FiniteMDP(mdp.states, mdp.actions, [0.0f0, step_reward], ptf) endmetadatashow_logsèdisabled®skip_as_script«code_folded$c34678f6-53bb-4f2a-96f0-a7b16f894dddcell_id$c34678f6-53bb-4f2a-96f0-a7b16f894dddcodefunction show_gridworld_policy_value(mdp, results; winds = 
wind_vals, action_display = rook_action_display, policy_display = display_rook_policy) Q, π = results policy_display = show_grid_policy(mdp, π, winds, policy_display, String(rand('A':'Z', 10)); action_display = action_display, scale = .8) value_display = show_grid_value(mdp, Q, winds, String(rand('A':'Z', 10)); action_display = action_display, scale = .8) path = plot_path(mdp, π) @htl("""
$policy_display
$value_display
$path
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$e4e80015-40ce-4f8a-aac7-4a9584da4baacell_id$e4e80015-40ce-4f8a-aac7-4a9584da4baacode$example_6_8(;loadfile = ex_6_8_load)metadatashow_logsèdisabled®skip_as_script«code_folded$64fe8336-d1c2-41fe-a522-1b6f63260fc9cell_id$64fe8336-d1c2-41fe-a522-1b6f63260fc9code*const π_mrp = make_random_policy(mrp_6_2)metadatashow_logsèdisabled®skip_as_script«code_folded$dea61907-d4fb-492d-b2bb-c037c7f785cbcell_id$dea61907-d4fb-492d-b2bb-c037c7f785cbcodeefunction bellman_optimal_value!(V::Vector{T}, mdp::FiniteMDP{T, S, A}, γ::T) where {T <: Real, S, A} delt = zero(T) @inbounds @fastmath @simd for i_s in eachindex(mdp.states) maxvalue = typemin(T) @inbounds @fastmath @simd for i_a in eachindex(mdp.actions) x = zero(T) for (i_r, r) in enumerate(mdp.rewards) @inbounds @fastmath @simd for i_s′ in eachindex(V) x += mdp.ptf[i_s′, i_r, i_a, i_s] * (r + γ * V[i_s′]) end end maxvalue = max(maxvalue, x) end delt = max(delt, abs(maxvalue - V[i_s]) / (eps(abs(V[i_s])) + abs(V[i_s]))) V[i_s] = maxvalue end return delt endmetadatashow_logsèdisabled®skip_as_script«code_folded$678cad7a-1abb-4fcc-91ba-b5abcbb914cbcell_id$678cad7a-1abb-4fcc-91ba-b5abcbb914cbcodefunction show_grid_value(mdp, V::Vector, wind::Vector, name; action_display = king_action_display, scale = 1.0) width = maximum(s.x for s in mdp.states) height = maximum(s.y for s in mdp.states) start = mdp.state_init() termind = findfirst(mdp.isterm, mdp.states) sterm = mdp.states[termind] ngrid = width*height @htl("""
$(HTML(mapreduce(i -> """
$(round(V[i], sigdigits = 2))
""", *, eachindex(mdp.states))))
$(HTML(mapreduce(i -> """
$(wind[i])
""", *, 1:width)))
$(action_display)
Wind Values
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$d299d800-a64e-4ba2-9603-efa833343405cell_id$d299d800-a64e-4ba2-9603-efa833343405code function example_6_5(;mdp = windy_gridworld, num_episodes = 170, action_display = rook_action_display, policy_display = display_rook_policy, use_stochastic_dp=false) (Qstar, πstar, steps, rewards) = sarsa(mdp, 0.5f0, 1.0f0; ϵinit = 0.1f0, num_episodes = num_episodes, decay_ϵ = false) # eg = runepisode(mdp, create_greedy_policy(Qstar)) eg = runepisode(mdp, πstar; max_steps = 100_000) mdp_dp = use_stochastic_dp ? stochastic_gridworld_mdp_dp : create_gridworld_mdp(mdp, -1.0f0) v_dp, π_dp = begin_value_iteration_v(mdp_dp, 1.0f0) path_dp = plot_path(mdp, π_dp; title = "Value Iteration Policy
Path Example") policy_display_dp = show_grid_policy(mdp, π_dp, wind_vals, policy_display, String(rand('A':'Z', 10)); action_display = action_display, scale = 1.0) value_display_dp = show_grid_value(mdp, v_dp[end], wind_vals, String(rand('A':'Z', 10)); action_display = action_display, scale = 1.0) start_trace = scatter(x = [1.5], y = [4.5], mode = "text", text = ["S"], textposition = "left", showlegend=false) finish_trace = scatter(x = [8.5], y = [4.5], mode = "text", text = ["G"], textposition = "left", showlegend=false) path_traces = [scatter(x = [eg[1][i].x + 0.5, eg[1][i+1].x + 0.5], y = [eg[1][i].y + 0.5, eg[1][i+1].y + 0.5], line_color = "blue", mode = "lines", showlegend=false, name = "Optimal Path") for i in 1:length(eg[1])-1] finalpath = scatter(x = [eg[1][end].x + 0.5, 8.5], y = [eg[1][end].y + 0.5, 4.5], line_color = "blue", mode = "lines", showlegend=false, name = "Optimal Path") p1 = plot(scatter(x = cumsum(steps), y = 1:num_episodes, line_color = "red"), Layout(xaxis_title = "Time steps", yaxis_title = "Episodes")) p2 = plot([start_trace; finish_trace; path_traces; finalpath], Layout(xaxis = attr(showgrid = true, showline = true, gridwith = 1, gridcolor = "black", zeroline = true, linecolor = "black", mirror=true, tickvals = 1:10, ticktext = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0], range = [1, 11], title = "Wind Values"), yaxis = attr(linecolor="black", mirror = true, gridcolor = "black", showgrid = true, gridwidth = 1, showline = true, tickvals = 1:7, ticktext = fill("", 7), range = [1, 8]), width = 300, height = 210, autosize = false, padding=0, paper_bgcolor = "rgba(0, 0, 0, 0)", title = attr(text = "Sarsa policy
Path Example", font_size = 14, x = 0.5))) p3 = plot(scatter(x = 1:num_episodes, y = steps), Layout(xaxis_title = "Time steps", yaxis_title = "Steps Per Episode", yaxis_type = "log")) policy_display = show_grid_policy(mdp, πstar, wind_vals, policy_display, String(rand('A':'Z', 10)); action_display = action_display, scale = 1.0) value_display = show_grid_value(mdp, Qstar, wind_vals, String(rand('A':'Z', 10)); action_display = action_display, scale = 1.0) return @htl("""
$p1
$p2
$path_dp
$p3 Sarsa Solution
$policy_display $value_display
Value Iteration Solution
$policy_display_dp $value_display_dp
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$c5718459-2323-4615-b2c4-f92a0fa189d9cell_id$c5718459-2323-4615-b2c4-f92a0fa189d9code ?md""" Let $\mathcal{M}$ be the set of labels of estimators that maximize the expcted values of $X$: $$\mathcal{M} \doteq \left \{ j \mid \mathbb{E} \{ X_j \} = \max_i \mathbb{E} \{ X_i \} \right \}$$ Let $Max(S)$ be the set of labels of estimators that yield the maximum estimate for some set of samples S: $$Max(S) \doteq \left \{ j \mid \mu_j(S) = \max_i \mu_i(S) \right \}$$ The claim is that for all $j \in \mathcal{M}$ $$\mathbb{E} \{ \max_i \mu_i \} \geq \mathbb{E} \{ \mu_j \} = \mathbb{E} \{ X_j \} \doteq \max_i \mathbb{E} \{ X_i \} \tag{d}$$ *Proof*. Assume $j \in \mathcal{M}$, i.e. $\mu_j$ is any estimator whose expected value is the maximal. Then $$\begin{flalign} \mathbb{E} \{ \max_i \mu_i \} &= P(j \in Max) \mathbb{E} \{ \max_i \mu_i \} + P(j \notin Max) \mathbb{E} \{ \max_i \mu_i \} \\ &= P(j \in Max) \mathbb{E} \{\mu_j \vert j \in Max \} + P(j \notin Max) \mathbb{E} \{ \max_i \mu_i \} \\ &\geq P(j \in Max) \mathbb{E} \{\mu_j \vert j \in Max \} + P(j \notin Max) \mathbb{E} \{ \mu_j \vert j \notin Max \} \\ &=\mathbb{E} \{ \mu_j \} = \mathbb{E} \{X_j\} \doteq \max_i \mathbb{E} \{ X_i \} \end{flalign}$$ The third line in the proof follows from the definition of $Max$ which implies $\mathbb{E} \{ \max_i \mu_i \} \gt \mathbb{E} \{ \mu_j \vert j \notin Max \}$, for any $j$. Therefore the inequality is strict if and only if $P(j \notin Max) \gt 0$, for some $j \in \mathcal{M}$. If we do not know whether this is the case, we do not know if the inequality in $(d)$ is strict and theremore in general we write $\mathbb{E} \{ \max_i \mu_i \} \geq \max_i \mathbb{E} \{ \mu_i \}$ so the claim has been proven. Recall that $j$ is assumed to be in the set $\mathcal{M}$ meaning it has a maximizing expected value while the set $Max(S)$ contains the variables that produce the maximum estimate over some sample $S$. 
So, intuitively, the proof says that the expected value of the maximum of the estimators is biased upward unless there is zero probability that the variables producing the highest estimates over a given sample differ from the true set of maximizing variables. This means that unless the underlying distributions of the variables have zero overlap (in which case the ranking of estimates always matches the ranking of true expected values), there is an expected positive bias. """metadatashow_logsèdisabled®skip_as_script«code_folded$c306867b-f137-44f2-97dd-3d10c226ca5ccell_id$c306867b-f137-44f2-97dd-3d10c226ca5ccode md""" Consider instead policy improvement with afterstate value estimates $W_\pi(y)$ where we seek to choose a policy that is greedy with respect to the afterstate values: $\pi^\prime(s) = \mathrm{argmax}_a (f_2(s, a) + W_\pi(f_1(s, a)))$ where $f_1$ and $f_2$ are the deterministic functions defined above that determine which afterstate is reached from $(s, a)$ and whether any intermediate reward is received. This looks much closer to the policy improvement that occurs with $Q(s, a)$, and that is because $Q_\pi(s, a) = f_2(s, a) + W_\pi(f_1(s, a))$. So, if we use afterstates, we can have the benefits of learning the state-action value function while only saving values for the afterstates. The functions $f_1$ and $f_2$ provide all the extra information needed to recover those values. Continuing the comparison to value iteration, recall that we adapted the Bellman optimality equation for the state value function to have a single update rule to estimate $V^*(s)$: $$V^*(s) = \max_a Q^*(s, a) = \max_a \sum_{r, s^\prime} p(r, s^\prime \vert s, a) (r + \gamma V^*(s^\prime))$$ We can only apply this update rule if we have $p(r, s^\prime \vert s, a)$ or if we instead estimate $Q^*$ and sample the transitions from the environment. 
To estimate $W^*(y)$, we need to represent the Bellman optimality equation for the afterstate value function instead of the state value function: $\begin{flalign} W^*(y) &= \sum_{r, s^\prime} p(r, s^\prime \vert y)(r + \gamma \max_a(f_2(s^\prime, a) + W^*(f_1(s^\prime, a)))) \\ &= \sum_{r, s^\prime} p(r, s^\prime \vert y)r + \gamma \sum_{s^\prime} p(s^\prime \vert y) \max_a(f_2(s^\prime, a) + W^*(f_1(s^\prime, a))) \end{flalign}$ where $p(s^\prime \vert y) = \sum_r p(r, s^\prime \vert y)$. The outer sum just represents an expected value over the transition out of $y$, so if we don't have access to $p(r, s^\prime \vert y)$, we could sample the transitions from the environment. The $\max_a$ term can now be calculated explicitly: for each transition state it amounts to finding the maximum entry of a vector, and it does not depend on the reward. Using state values, the maximization step involves evaluating a double sum for every action, so each update with afterstates is less costly. Also, the afterstates themselves might be more informative in the sense that they all have distinct values. If many of the actions from a given state lead to the same afterstate, this method will immediately treat them all as equal, whereas with the usual value iteration that equivalence would have to be calculated with the probability transition function. The benefits of using an afterstate value function depend entirely on how effectively the environment transitions can be separated into informative deterministic steps and limited stochastic dynamics. 
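
The afterstate optimality backup above can be sketched as a single function. This is an illustrative sketch under stated assumptions rather than the notebook's solver: the name `afterstate_backup` is hypothetical, and the array layouts are assumed to be `ptf[i_s′, i_r, i_y]` for $p(r, s^\prime \vert y)$, `afterstate_map[i_a, i_s]` for the index of $f_1(s, a)$, and `reward_interim[i_a, i_s]` for $f_2(s, a)$.

```julia
# Hypothetical helper: one Bellman optimality backup for afterstate index i_y.
# Assumed layouts:
#   ptf[i_s′, i_r, i_y]       = p(r, s′ | y)
#   afterstate_map[i_a, i_s]  = index of the afterstate f_1(s, a)
#   reward_interim[i_a, i_s]  = intermediate reward f_2(s, a)
function afterstate_backup(W::Vector{T}, i_y::Int, ptf::Array{T, 3}, rewards::Vector{T},
                           afterstate_map::Matrix{Int}, reward_interim::Matrix{T},
                           γ::T) where T<:AbstractFloat
    # max_a (f_2(s′, a) + W(f_1(s′, a))) for every s′; independent of the reward
    best = [maximum(reward_interim[i_a, i_s′] + W[afterstate_map[i_a, i_s′]]
                    for i_a in axes(afterstate_map, 1))
            for i_s′ in axes(afterstate_map, 2)]
    # Expectation over the stochastic transition (r, s′) out of afterstate y
    total = zero(T)
    for (i_r, r) in enumerate(rewards), i_s′ in eachindex(best)
        total += ptf[i_s′, i_r, i_y] * (r + γ * best[i_s′])
    end
    return total
end
```

Computing `best` once per backup reflects the point made above: the maximization is a lookup over a vector for each transition state, and only the outer expectation touches the transition probabilities.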
"""metadatashow_logsèdisabled®skip_as_script«code_folded$a4c4d5f2-d76d-425e-b8c9-9047fe53c4f0cell_id$a4c4d5f2-d76d-425e-b8c9-9047fe53c4f0code&gridworld_Q_vs_sarsa_solve(cliffworld)metadatashow_logsèdisabled®skip_as_script«code_folded$410abe1d-04a6-4434-9abf-0d29dd6498e6cell_id$410abe1d-04a6-4434-9abf-0d29dd6498e6code*md""" ### Tabular TD(0) Implementation """metadatashow_logsèdisabled®skip_as_script«code_folded$aa0791a5-8cf1-499b-9900-4d0c59be808ccell_id$aa0791a5-8cf1-499b-9900-4d0c59be808ccode`function stochastic_wind(w, x, y) w == 0 && return (x, y) v = rand(wind_var) (x, y+w+v) endmetadatashow_logsèdisabled®skip_as_script«code_folded$510761f6-66c7-4faf-937b-e1422ec829a6cell_id$510761f6-66c7-4faf-937b-e1422ec829a6codeHTML(""" """)metadatashow_logsèdisabled®skip_as_script«code_folded$0b9c6dbd-4eb3-4167-886e-64db9ec7ff04cell_id$0b9c6dbd-4eb3-4167-886e-64db9ec7ff04codemd""" > ### *Exercise 6.3* > From the results shown in the left graph of the random walk example it appears that the first episode results in a change only in $V(A)$. What does this tell you about what happened on the first episode? Why was only the estimate for this one state changed? By exactly how much was it changed? The update rule with TD(0) learning is given by $V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$ All states, A, B, C, D, E are initialized at 0.5 with the terminal state initialized at 0. During the first episode for all transitions before the end, the reward is 0 and the difference between adjacent states would be 0 resulting in no change to the value function. Since the value estimate for state A decreases from the initial value, this means that the first episode terminated to the left. For this final transition we have the following update. 
$V(A) \leftarrow V(A) + \alpha[0 + \gamma V(\text{Term}) - V(A)]$ We know that prior to the update $V(A) = 0.5$, $V(\text{Term}) = 0$ and $\gamma=1$ so the update is $V(A) \leftarrow 0.5 + \alpha[0 - 0.5]$ For this plot, $\alpha=0.1$, so the updated value for $V(A)$ is $0.5+0.1(-0.5)=0.5-0.05=0.45$ """metadatashow_logsèdisabled®skip_as_script«code_folded$a9dda9b5-f568-481c-9e8f-9bb887468775cell_id$a9dda9b5-f568-481c-9e8f-9bb887468775code$md""" #### Random Walk MDP Setup """metadatashow_logsèdisabled®skip_as_script«code_folded$ad03500a-bd42-4216-a9cb-3f923152af79cell_id$ad03500a-bd42-4216-a9cb-3f923152af79codefunction create_car_rental_afterstate_mdp(;nmax=20, λs::@NamedTuple{request_A::T, request_B::T, return_A::T, return_B::T} = (request_A = 3f0, request_B = 4f0, return_A = 3f0, return_B = 2f0), movecost::T = 2f0, rentcredit::T = 10f0, movemax::Integer=5, maxovernight::Integer = 20, overnightpenalty::T = 4f0, employeeshuttle = false) where T <: Real #enumerate all states and afterstates states = [(n_a, n_b) for n_a in 0:nmax for n_b in 0:nmax] afterstates = [(n_a, n_b) for n_a in 0:nmax for n_b in 0:nmax] actions = collect(-movemax:movemax) afterstate_lookup = makelookup(afterstates) #enumerate all rewards by simply incrementing by 1 dollar from the worst to best case scenario rewards = collect(-movecost*movemax - 2*overnightpenalty:rentcredit*nmax*2) reward_lookup = Dict(zip(rewards, eachindex(rewards))) #mapping from rewards to the proper index #create a lookup for the probability of starting with n cars at the start of the day and ending up with n′ at the end of the day function create_probability_lookup(λ_request, λ_return) #can only rent from 0 to n cars. 
if requests exceed n, all of those situations are equivalent and the probability is 1 - p(x < n-1) p_rent = Dict(n_request => poisson(n_request, λ_request) for n_request in 0:nmax-1) #car returns can be any number greater than or equal to 0, but all returns of nmax - (n - nrent) or more will result in the same state which is max cars p_return = Dict(n_return => poisson(n_return, λ_return) for n_return in 0:nmax-1) #initialize probabilities for each final value at 0 prob_lookup = Dict((t, nrent) => 0f0 for t in states for nrent in 0:t[1]) for n in 0:nmax for n_rent in 0:n-1 for n_return in 0:(nmax - n + n_rent - 1) n′ = n - n_rent + n_return p = p_rent[n_rent]*p_return[n_return] prob_lookup[((n, n′), n_rent)] += p end prob_lookup[((n, nmax), n_rent)] += p_rent[n_rent]*(1 - sum(p_return[n_return] for n_return in 0:nmax-n+n_rent-1; init = zero(T))) end for n_return in 0:(nmax - 1) n′ = n_return p = (1 - sum(p_rent[n_rent] for n_rent in 0:n-1; init = zero(T)))*p_return[n_return] prob_lookup[((n, n′), n)] += p end prob_lookup[((n, nmax), n)] += (1 - sum(p_rent[n_rent] for n_rent in 0:n-1; init = zero(T)))*(1 - sum(p_return[n_return] for n_return in 0:nmax-1, init = zero(T))) end return prob_lookup end probabilities = (location_A = create_probability_lookup(λs.request_A, λs.return_A), location_B = create_probability_lookup(λs.request_B, λs.return_B)) #calculate probability matrix for all the afterstate transitions given starting in state s and taking action a function get_afterstate_transition(s, a) (n_a, n_b) = s #calculate the number of cars moved with sign indicating direction + being A to B, normally this is simply a but if we try to move more cars than are available, it will be capped carsmoved = if a > 0 min(a, n_a) elseif a < 0 -min(abs(a), n_b) else 0 end #cars above nmax are returned to the company but we still incur the cost of transfering them aftercount_a = min(n_a - carsmoved, nmax) aftercount_b = min(n_b + carsmoved, nmax) cost = (abs(a) - (a > 
0)*employeeshuttle)*movecost + (overnightpenalty * ((aftercount_a > maxovernight) + (aftercount_b > maxovernight))) #one free transfer from A to B if employee shuttle is true in modified version, overnight penalty if too many cars are left at a lot afterstate = (aftercount_a, aftercount_b) return (afterstate, -cost) end #create functions that map a state action pair to an afterstate and intermediate reward afterstate_map = zeros(Int64, length(actions), length(states)) reward_interim_map = zeros(Float32, length(actions), length(states)) for (i_s, s) in enumerate(states) for (i_a, a) in enumerate(actions) (afterstate, r_int) = get_afterstate_transition(s, a) afterstate_map[i_a, i_s] = afterstate_lookup[afterstate] reward_interim_map[i_a, i_s] = r_int end end out = zeros(Float32, length(states), length(rewards)) #calculate probability matrix for all the s′, r transitions given starting in afterstate y function fillmatrix!(out, s) #initialize the matrix for s′, r transitions, each column runs over the transition states out .= 0f0 (aftercount_a, aftercount_b) = s for (i_s′, s′) in enumerate(states) (n_a′, n_b′) = s′ for n_rent_a in 0:aftercount_a for n_rent_b in 0:aftercount_b p_a = probabilities.location_A[((aftercount_a, n_a′), n_rent_a)] p_b = probabilities.location_B[((aftercount_b, n_b′), n_rent_b)] p_total = p_a*p_b r = rentcredit*(n_rent_a+n_rent_b) out[i_s′, reward_lookup[r]] += p_total end end end return out end #initialize probability functions with all zeros ptf = zeros(T, length(states), length(rewards), length(afterstates)) for (i_s, s) in enumerate(afterstates) ptf[:, :, i_s] .= fillmatrix!(out, s) end #find indices of the reward vector that never have non zero probability inds = reduce(intersect, [findall(0 .== [sum(ptf[:, i, j]) for i in 1:size(ptf, 2)]) for j in 1:size(ptf, 3)]) goodinds = setdiff(eachindex(rewards), inds) FiniteAfterstateMDP(states, afterstates, actions, rewards[goodinds], ptf[:, goodinds, :], afterstate_map, reward_interim_map) 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$de50f95f-984e-4387-958c-64e0265f5953cell_id$de50f95f-984e-4387-958c-64e0265f5953codewfunction render_walk(id; l = 5) l > 26 && error("Cannot render more than 26 states") names = Iterators.take('A':'Z', l) |> collect startstate = names[ceil(Int64, l/2)] makestate(s) = """
""" function combinestates(s1, s2) """ $s1
0
$s2 """ end @htl("""
0
$(HTML(mapreduce(makestate, combinestates, names)))
1
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$c8500b89-644d-407f-881a-bcbd7da23502cell_id$c8500b89-644d-407f-881a-bcbd7da23502codemd""" **Figure 6.3** Interim and aymptotic performance shown for TD control methods on cliff-walking task as a function of α. Dashed lines represent interim performance and solid lines are asymptotic. """metadatashow_logsèdisabled®skip_as_script«code_folded$84d81413-6334-4965-8632-8a763cd3f28acell_id$84d81413-6334-4965-8632-8a763cd3f28acodemd""" Comparison of all learning methods with their double estimator counterparts and the simple MDP described in 6.7. Q-learning initially learns to take the left action much more often than the right atcion, and always takes it significantly more often than the 5% minimum probability encorced by $\epsilon$-greedy action selection with $\epsilon$=0.1. In contrast, Double Q-learning is essentially unaffected by maximization bias as is Double Expected Sarsa. Sarsa and Expected Sarsa also exhibit maximization bias as well. All of the sarsa methods eventually take the left action more than Q-learning even though the behavior policy should be the same for both. Even Double Expected Sarsa without maximization bias shows the same tendancy. The only difference between this method and Double Q-learning is the use of the $\epsilon$-greedy policy in the value calculation. So the action value estimates are for the $\epsilon$-greedy policy rather than for the greedy policy under Double Q-learning. Under this policy, sometimes the right action selection goes left and visa versa. Even under the $\epsilon$-greedy policy, the optimal policy would be to select right, but due to the variance in value estimates introduced by $\epsilon$, it will take longer for the behavior policy based on the Q values to converge to the correct values. That slower convergence is apparent in the graph above. 
"""metadatashow_logsèdisabled®skip_as_script«code_folded$33d69db9-fa2b-40a3-bbed-21d5fd60f302cell_id$33d69db9-fa2b-40a3-bbed-21d5fd60f302codefunction example_6_8(;loadfile = true) methods = [sarsa, expected_sarsa, double_expected_sarsa, q_learning, double_q_learning] names = ["Sarsa", "Expected Sarsa", "Double Expected Sarsa", "Q-learning", "Double Q-learning"] results1 = [f(noisy_gridworld, 0.1f0, 1.0f0, num_episodes = 5_000) for f in methods] displays = [show_gridworld_policy_value(noisy_gridworld, a; winds = fill(0, gridsize)) for a in results1] value_iteration_solution = begin_value_iteration_v(noisy_gridworld_dp, 1.0f0) v_true = last(first(value_iteration_solution)) value_iteration_display = show_gridworld_policy_value(noisy_gridworld, (v_true, last(value_iteration_solution))) if loadfile && isfile("example_6_8.bin") step_plot = deserialize("example_6_8.bin") else max_episodes = 20 num_samples = 10_000 steps = [(1:num_samples |> Map(_ -> f(noisy_gridworld, 0.01f0, 1.0f0, num_episodes = max_episodes)[3]) |> foldxt(+)) / num_samples for f in methods] step_traces = [scatter(x = 1:max_episodes, y = v, name = names[i]) for (i, v) in enumerate(steps)] step_plot = plot(step_traces, Layout(title = "Episode Length for Noisy Gridworld", xaxis_title = "Episodes", yaxis_title = "Steps per Episode", yaxis_type = "log")) serialize("example_6_8.bin", step_plot) end out = @htl("""
Value Iteration Solution $value_iteration_display
$(HTML(mapreduce(*, eachindex(displays)) do i """
$(names[i]) Solution $(displays[i])
""" end))
$(step_plot) """) return out endmetadatashow_logsèdisabled®skip_as_script«code_folded$3f3ebc9b-b070-4d73-8be9-823b399c664ccell_id$3f3ebc9b-b070-4d73-8be9-823b399c664ccode#compute the value function for a policy π on an mdp with a constant step size parameter α and a discount rate of γ. Must provide a tolerance ϵ which is the maximum difference observed when updating the value function that can be tolerated to consider the value function to be converged. function batch_value_est(π::Matrix{T}, mdp::MDP_TD{S, A, F, G, H}, α::T, γ::T, ϵ::T; num_episodes::Integer = 1000, vinit::T = zero(T), save_states::Vector{S} = Vector{S}(), V::Vector{T} = initialize_state_value(mdp; vinit = vinit), estimation_method::BatchMethod = TD0(), maxcount = typemax(T)) where {T<:AbstractFloat, S, A, F, G, H} check_policy(π, mdp) terminds = findall(mdp.isterm(s) for s in mdp.states) V[terminds] .= zero(T) v_saves = zeros(T, length(save_states), num_episodes+1) errors = zeros(T, num_episodes) function update_saves!(v_saves, ep) for (i, s) in enumerate(save_states) i_s = mdp.statelookup[s] v_saves[i, ep] = V[i_s] end end update_saves!(v_saves, 1) #each tuple in this vector matches an output from the runepisode function saved_episodes = Vector{Tuple{Vector{S}, Vector{A}, Vector{T}}}() for n in 1:num_episodes push!(saved_episodes, runepisode(mdp, π)[1:end-1]) err = typemax(T) #wait until the error has converged count = zero(T) while (count < maxcount) && (err > ϵ) worst_error = zero(T) #update values for entire batch of episodes for ep in saved_episodes #update values for each episode in a batch and update the worst error worst_error = max(worst_error, update_value!(V, estimation_method, α, γ, mdp, ep...)) end err = worst_error count += 1 end errors[n] = err #only update saves after the value function has converged for this batch update_saves!(v_saves, n+1) end return V, v_saves, errors 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$d5b612d8-82a1-4586-b721-1baaea2101cfcell_id$d5b612d8-82a1-4586-b721-1baaea2101cfcodemd""" Value iteration with afterstates converged in 10 fewer steps than state value iteration, but the total runtime is less than 25%. So as expected the afterstate method converges in fewer steps each of which is more efficient to compute than using the state value function. """metadatashow_logsèdisabled®skip_as_script«code_folded$dee6b500-0ba1-4bbc-b217-cbb9ad47ad06cell_id$dee6b500-0ba1-4bbc-b217-cbb9ad47ad06code٦example_6_5(;mdp = make_windy_gridworld(actions = [king_actions; Stay()]), num_episodes = 400, action_display = action3_display, policy_display = display_king_policy)metadatashow_logsèdisabled®skip_as_script«code_folded$897fde24-9a4a-465e-96f2-dd9e8baab294cell_id$897fde24-9a4a-465e-96f2-dd9e8baab294codekshow_gridworld_policy_value(windy_gridworld, q_learning(windy_gridworld, 0.5f0, 1.0f0; num_episodes = 400))metadatashow_logsèdisabled®skip_as_script«code_folded$1e3d231a-4065-48ce-a74e-018066fb232acell_id$1e3d231a-4065-48ce-a74e-018066fb232acodefunction example_6_3(;l = 5, max_episodes = 100, nruns = 100, vinit = 0.5f0, α = 0.05f0, ϵ = α, kwargs...) #note that for this task the error tolerance is set to the step size because the only reward experienced is 1, so the smallest possible maximum value update is α anyway mrp = make_mrp(l = l) π = make_random_policy(mrp) true_values = collect(1:l) ./ (l+1) function get_errors(method) (v, v_saves, errors) = batch_value_est(π, mrp, α, 1.0f0, ϵ; num_episodes = max_episodes, vinit=vinit, save_states = collect(1:l), estimation_method = method, kwargs...) 
sqrt.(mean((v_saves .- true_values) .^2, dims = 1)) end mc_errors = mean([get_errors(MC()) for _ in 1:nruns])[:] td0_errors = mean([get_errors(TD0()) for _ in 1:nruns])[:] t1 = scatter(x = 0:max_episodes, y = mc_errors, name = "MC") t2 = scatter(x = 0:max_episodes, y = td0_errors, name = "TD") p = plot([t1, t2], Layout(xaxis_title = "Walks / Episodes", yaxis_title = "RMS error, averaged over states", title = "Batch Training")) md""" #### Figure 6.2 $p Performance of TD(0) and constant-α MC under batch training on the random walk task with $l states """ end metadatashow_logsèdisabled®skip_as_script«code_folded$0f22e85f-ed31-49df-a7c7-0579298f05fecell_id$0f22e85f-ed31-49df-a7c7-0579298f05fecodeJmd""" For Monte Carlo learning each state estimate is updated with the error shown by the red arrows only after the episode is finished. For TD(0) learning, as soon as the feedback from the subsequent state is received, the error can be calculated and it is only based on the new information from one state into the future. """metadatashow_logsèdisabled®skip_as_script«code_folded$9017093c-a9c3-40ea-a9c6-881ee62fc379cell_id$9017093c-a9c3-40ea-a9c6-881ee62fc379codemd""" > ### *Exercise 6.2* > This is an exercise to help develop your intuition about why TD methods are often more efficient than Monte Carlo methods. Consider the driving home example and how it is addressed by TD and Monte Carlo methods. Can you imagine a scenario in which a TD update would be better on average than a Monte Carlo update? Give an example scenario - a description of past experience and a current state - in which you would expect the TD update to be better. Here's a hint: Suppose you have lots of experience driving home from work. Then you move to a new building and a new parking lot (but you still enter the highway at the same place). Now you are starting to learn predictions for the new building. Can you see why TD updates are likely to be much better, at least initially, in this case? 
Might the same sort of thing happen in the original scenario? Originally, from the starting state, the expected total time to reach home is 30 minutes. Now suppose we change the route so that it takes on average 5 more minutes to reach the car, while the expected elapsed time for every other leg of the journey is unchanged. Our total time estimate from the starting state should then be 35 minutes on average. Let's say we reach the car and nothing out of the ordinary is happening. The predicted time to go will be 25 minutes and the predicted total time will be 35 minutes. If nothing further out of the ordinary occurs, then under TD(0) only the first state will be corrected, and that correction happens as soon as we reach the car. For the Monte Carlo method, the only state with an estimate error will also be the first state, but this update will not occur until after we've arrived at our destination. Either way, the next time we drive we will have a new, more accurate estimate reflecting the longer time required to reach the car. $(example_6_1(;elapsed = [0, 10, 20, 25, 32, 35], predicted_ttg = [30, 25, 15, 10, 3, 0])) In the example, several events occur during the journey that change the predicted and actual time from the average. For simplicity let's assume that when we enter our home street there is a garbage truck blocking our path. Normally it only takes 3 minutes to arrive at home, but with the truck present we estimate it will take 5 minutes (2 minutes longer). Now the total predicted time will be increased from 35 minutes to 37 minutes. In the case of Monte Carlo learning, this additional 2 minutes will propagate backwards to all of the previous states because we experienced a true travel time of 37 minutes rather than the 35 minutes predicted after the 2nd state and the 30 minutes predicted after the first state. For TD(0) learning, however, this delay will only impact the previous state after a single update. Effectively it will increase the predicted time spent on the final leg of the journey only. 
The prediction from the starting state will only be increased by the 5 minute increase from the walk to the car, not the delay from the garbage truck. Since we are actually starting from a new point, that feedback will be consistent and does reflect a true change in the expected time from the starting state. The garbage truck, however, may be a rare occurrence. By the time this change propagates backwards through the states to the starting state, a lot more experience will be accumulated at all the other states and if α is some reasonable value, this delay will not be counted nearly as much as the updates from the first leg of the journey. Since TD(0) only uses feedback from one step into the future, if changes are made to the environment, those changes will only affect the most closely related states immediately. In this example, all of the accurate predictions we still have about the later legs of the journey will be used to keep the predictions more stable. $(example_6_1(;elapsed = [0, 10, 20, 25, 32, 37], predicted_ttg = [30, 25, 15, 10, 5, 0])) The opposite extreme, though, could create a situation where the Monte Carlo updates are better. Imagine instead that you moved houses in the same neighborhood such that once you enter the home street, it takes 5 minutes to reach your home instead of 3 minutes. In this case, the Monte Carlo updates would move all of the state predictions up towards the 2 minute increase since all of the predictions would be too short. The TD(0) update, though, would initially only increase the prediction for the final leg of the journey and we would have to wait for this change to propagate backwards to all the other states. So the efficiency of updates for each method depends on where in the episode environmental changes occur. 
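
Returning to the garbage-truck scenario, the two kinds of error can be worked out directly from the numbers passed to that plot (elapsed times 0, 10, 20, 25, 32, 37 and predicted times to go 30, 25, 15, 10, 5, 0). This is only the arithmetic for a single episode, before any step size $\alpha$ is applied, with the states ordered from leaving the office to arriving home:

$\begin{flalign}
\text{predicted totals} &= (0+30,\ 10+25,\ 20+15,\ 25+10,\ 32+5,\ 37+0) = (30, 35, 35, 35, 37, 37) \\
\text{MC errors} &= 37 - \text{predicted totals} = (7, 2, 2, 2, 0, 0) \\
\text{TD(0) errors} &= (35-30,\ 35-35,\ 35-35,\ 37-35,\ 37-37,\ 37-37) = (5, 0, 0, 2, 0, 0)
\end{flalign}$

The Monte Carlo errors push every earlier prediction toward 37, while the TD(0) errors touch only the starting state (the longer walk to the car) and the state just before the home street (the truck).
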
Actual environment change at the end of the route $(example_6_1(;elapsed = [0, 5, 15, 20, 27, 32], predicted_ttg = [30, 25, 15, 10, 3, 0])) Now there is a randomly experienced shorter leg at the start of the journey which won't affect most of the Monte Carlo updates. $(example_6_1(;elapsed = [0, 3, 13, 18, 25, 30], predicted_ttg = [30, 25, 15, 10, 3, 0])) """metadatashow_logsèdisabled®skip_as_script«code_folded$4b0d96d0-25d1-4fed-b105-c65fa2883a61cell_id$4b0d96d0-25d1-4fed-b105-c65fa2883a61code%const mrp_6_2 = make_mrp(l = nstates)metadatashow_logsèdisabled®skip_as_script«code_folded$1115f3ec-f4b2-4fba-bd5e-321a63b10a6dcell_id$1115f3ec-f4b2-4fba-bd5e-321a63b10a6dcodeٶshow_gridworld_policy_value(king_gridworld, q_learning(king_gridworld, 0.1f0, 1.0f0; num_episodes = 2000); action_display = king_action_display, policy_display = display_king_policy)metadatashow_logsèdisabled®skip_as_script«code_folded$1e3b3234-3fe1-46c9-82b7-f729c656eb25cell_id$1e3b3234-3fe1-46c9-82b7-f729c656eb25codemd""" $\begin{flalign} G_t - V_t(S_t) &= \delta_t + \gamma \eta_{t} + \gamma \left [\delta_{t+1} + \gamma \eta_{t+1} + \gamma (G_{t+2} - V_{t+2}(S_{t+2}) ) \right ] \\ &= \delta_t + \gamma \eta_{t} + \gamma \delta_{t+1} + \gamma^2 \eta_{t+1} + \gamma^2 \left [G_{t+2} - V_{t+2}(S_{t+2}) \right ] \\ &= (\delta_t + \gamma \eta_t) + \gamma (\delta_{t+1} + \gamma \eta_{t+1}) + \cdots + \gamma^{T-t-1}(\delta_{T-1} + \gamma \eta_{T-1}) + \gamma^{T-t} \left [G_T - V_T(S_T) \right ]\\ &= (\delta_t + \gamma \eta_t) + \gamma (\delta_{t+1} + \gamma \eta_{t+1}) + \cdots + \gamma^{T-t-1}(\delta_{T-1} + \gamma \eta_{T-1})\\ &=\sum_{k=t}^{T-1} \gamma^{k-t} (\delta_k + \gamma \eta_k)\\ \end{flalign}$ """metadatashow_logsèdisabled®skip_as_script«code_folded$6029990b-eb31-45ae-a869-b789fba673a6cell_id$6029990b-eb31-45ae-a869-b789fba673a6codemd""" To use afterstates with generalized policy iteration, we need to modify our MDP framework by considering the following trajectory: $$(S, A) \longrightarrow (Y, P) 
\longrightarrow (S^\prime, R) \longrightarrow \cdots \longrightarrow (S_T, R_T)$$ where $(S, A, R)$ are the usual state, action, and reward. We introduce $(Y, P)$ to indicate the afterstate and any intermediate reward that is received from the afterstate transition. The probability transition function for a normal MDP is written as $p(s^\prime, r \vert s, a)$ and represents the probability of transitioning to state $s^\prime$ with reward $r$ under the condition that an agent takes action $a$ from state $s$. When using afterstates, transitions can be represented with two functions: $p(y, \rho \vert s, a) \tag{a}$ is the probability of transitioning to afterstate $y$ with intermediate reward $\rho$ given an agent takes action $a$ from state $s$ $p(s^\prime, r \vert y) \tag{b}$ is the probability of transitioning to state $s^\prime$ with reward $r$ given an agent starts in afterstate $y$. Moreover, when an environment is modified to use afterstates, usually the immediate effect of an action is known and deterministic, and any stochastic behavior comes only afterwards. A good example is tic-tac-toe where we fully know the dynamics after making a move, but there could be some unknown behavior from the opponent. In this situation, the afterstate probability transition (a) is deterministic, so it could instead be represented by a mapping function that returns an afterstate and an intermediate reward given a state action pair. $$f_1(s, a) = y \tag{a1′}$$ $$f_2(s, a) = \rho \tag{a2′}$$ where $y$ and $\rho$ are the afterstate and reward respectively after taking action $a$ in state $s$. Now all of the stochastic dynamics of the environment are captured in (b) and the function only has 3 arguments instead of the usual 4. We can now apply all of the previous techniques to the afterstate example and even combine dynamic programming and trajectory sampling. 
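
One way to write the resulting optimality relations is sketched below. It assumes the discount $\gamma$ is applied once per full step, between the environment's response and the next action, and it uses $v_*(y)$ for the optimal value of afterstate $y$, which is notation introduced here for illustration. If $s^\prime$ is terminal, the maximization term is taken to be 0.

$\begin{flalign}
v_*(y) &= \sum_{s^\prime, r} p(s^\prime, r \vert y) \left [ r + \gamma \max_{a^\prime} \left ( f_2(s^\prime, a^\prime) + v_*(f_1(s^\prime, a^\prime)) \right ) \right ] \\
\pi_*(s) &= \underset{a}{\mathrm{argmax}} \left [ f_2(s, a) + v_*(f_1(s, a)) \right ]
\end{flalign}$

The expectation runs only over the environment's response to an afterstate, while the choice of action enters through the deterministic maps $f_1$ and $f_2$. Different state-action pairs that lead to the same afterstate therefore share a single value estimate.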
"""metadatashow_logsèdisabled®skip_as_script«code_folded$61bbf9db-49a0-4709-83f4-44f228be09c0cell_id$61bbf9db-49a0-4709-83f4-44f228be09c0codefunction sarsa(mdp::MDP_TD{S, A, F, G, H}, α::T, γ::T; num_episodes = 1000, qinit = zero(T), ϵinit = one(T)/10, Qinit = initialize_state_action_value(mdp; qinit=qinit), πinit = create_ϵ_greedy_policy(Qinit, ϵinit), history_state::S = first(mdp.states), update_policy! = (v, ϵ, s) -> make_ϵ_greedy_policy!(v, ϵ), save_history = false, decay_ϵ = false) where {S, A, F, G, H, T<:AbstractFloat} terminds = findall(mdp.isterm(s) for s in mdp.states) Q = copy(Qinit) Q[:, terminds] .= zero(T) π = copy(πinit) vhold = zeros(T, length(mdp.actions)) #keep track of rewards and steps per episode as a proxy for training speed rewards = zeros(T, num_episodes) steps = zeros(Int64, num_episodes) if save_history action_history = Vector{A}(undef, num_episodes) end for ep in 1:num_episodes ϵ = decay_ϵ ? ϵinit/ep : ϵinit s = mdp.state_init() (i_s, i_a, a) = init_step(mdp, π, s) rtot = zero(T) l = 0 while !mdp.isterm(s) (s′, i_s′, r, a′, i_a′) = sarsa_step(mdp, π, s, a) if save_history && (s == history_state) action_history[ep] = a end Q[i_a, i_s] += α * (r + γ*Q[i_a′, i_s′] - Q[i_a, i_s]) #update terms for next step vhold .= Q[:, i_s] update_policy!(vhold, ϵ, s) π[:, i_s] .= vhold s = s′ a = a′ i_s = i_s′ i_a = i_a′ l+=1 rtot += r end steps[ep] = l rewards[ep] = rtot end default_return = Q, π, steps, rewards save_history && return (default_return..., action_history) return default_return endmetadatashow_logsèdisabled®skip_as_script«code_folded$814d89be-cfdf-11ec-3295-49a8f302bbcfcell_id$814d89be-cfdf-11ec-3295-49a8f302bbcfcodeOmd""" # Chapter 6 Temporal-Difference Learning TD methods combine the Monte Carlo concept of learning from experience with the self-consistency ideas from dynamic programming. Unlike the pure Monte Carlo methods of Chapter 5, TD methods do not require waiting for the final outcome of an episode to start learning. 
In other words they bootstrap learning by exploiting what is known about the properties of the value function. Eventually we will see that different degrees of bootstrapping can be used that bridge the gap between the techniques in Chapter 5 and 6. ## 6.1 TD Prediction """metadatashow_logsèdisabled®skip_as_script«code_folded$52aebb7b-c2a9-443f-bc03-24cd25793b32cell_id$52aebb7b-c2a9-443f-bc03-24cd25793b32codemd""" > ### *Exercise 6.4* > The specific results shown in the right graph of the random walk example are dependent on the value of the step-size parameter $\alpha$. Do you think the conclusions about which algorithm is better would be affected if a wider range of values were used? Is there a different, fixed value of $\alpha$ at which either algorithm would have performed significantly better than shown? Why or why not? Both algorithms should theoretically converge to the true values with a sufficiently small $\alpha$ and a large enough number of samples. Over this limited window of 100 episodes, an $\alpha$ that is too small might result in convergence so slow that it does not reach error as low as a larger $\alpha$. For the MC method, $\alpha=0.01$ is the smallest value and it has the slowest convergence over this range. $\alpha=0.04$ is the largest value tested, and it results in approximately the same error after 100 episodes. The intermediate values show better performance over this number of episodes indicating that the best possible performance is already captured in this interval. For the TD method, the best results shown are for $\alpha=0.05$ which is already the smallest value with the slowest convergence rate. An even smaller value might result in a better outcome over 100 episodes, but this performance is already better than anything observed for the MC method. 
"""metadatashow_logsèdisabled®skip_as_script«code_folded$3d8b1ccd-9bb3-42f2-a77a-6afdb72c1ff8cell_id$3d8b1ccd-9bb3-42f2-a77a-6afdb72c1ff8code&#calculate the percentage error for a value update handling cases of zero values function calc_error(v_old::T, v_new::T) where T<:AbstractFloat d = v_new - v_old return abs(d) f(x) = x <= eps(one(T)) f(d) && f(v_old) && return zero(T) f(v_old) && return typemax(T) abs(d) / abs(v_old) endmetadatashow_logsèdisabled®skip_as_script«code_folded$031e1106-7408-4c7e-b78e-b713c19123d1cell_id$031e1106-7408-4c7e-b78e-b713c19123d1codebegin struct UpRight <: GridworldAction end struct DownRight <: GridworldAction end struct UpLeft <: GridworldAction end struct DownLeft <: GridworldAction end const diagonal_actions = [UpRight(), UpLeft(), DownRight(), DownLeft()] const king_actions = [rook_actions; diagonal_actions] move(::UpRight, x, y) = (x+1, y+1) move(::UpLeft, x, y) = (x-1, y+1) move(::DownRight, x, y) = (x+1, y-1) move(::DownLeft, x, y) = (x-1, y-1) endmetadatashow_logsèdisabled®skip_as_script«code_folded$7035c082-6e50-4df5-919f-5f09d2011b4acell_id$7035c082-6e50-4df5-919f-5f09d2011b4acodeXrunepisode(mdp::MDP_TD; kwargs...) 
= runepisode(mdp, make_random_policy(mdp); kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$bfe71b40-3157-47df-8494-67f8eb8e4e93cell_id$bfe71b40-3157-47df-8494-67f8eb8e4e93codefunction runepisode(mdp::MDP_TD{S, A, F, G, H}, π::Matrix{T}; max_steps = Inf) where {S, A, F, G, H, T<:Real} states = Vector{S}() actions = Vector{A}() rewards = Vector{T}() s = mdp.state_init() step = 1 #note that the terminal state will not be added to the state list while !mdp.isterm(s) && (step <= max_steps) push!(states, s) (i_s, i_s′, r, s′, a, i_a) = takestep(mdp, π, s) push!(actions, a) push!(rewards, r) s = s′ step += 1 end return states, actions, rewards, s endmetadatashow_logsèdisabled®skip_as_script«code_folded$b35264b0-ac5b-40ce-95e4-9b2bc4cb106fcell_id$b35264b0-ac5b-40ce-95e4-9b2bc4cb106fcodemd""" TD(0) update rule for action values: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})-Q(S_t, A_t)]$ This update is done after every transition from a nonterminal state $S_t$. If $S_{t+1}$ is terminal, then $Q(S_{t+1}, A_{t+1})$ is defined as zero. This rule uses every element of the quintuple of events, $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, that make up a transition from one state-action pair to the next. This quintuple gives rise to the name *Sarsa* for the algorithm. Each update only uses the immediate reward and the value of the state-action pair in the subsequent state as illustrated in the backup diagram shown below. """metadatashow_logsèdisabled®skip_as_script«code_folded$d259ecca-0249-4b28-a4d7-6880d4d84495cell_id$d259ecca-0249-4b28-a4d7-6880d4d84495codeHconst action3_display = @htl("""
Actions
""")metadatashow_logsèdisabled®skip_as_script«code_folded$22c4ce8c-bd82-4eb3-8af5-55342018edffcell_id$22c4ce8c-bd82-4eb3-8af5-55342018edffcode$md""" # Dynamic Programming Code """metadatashow_logsèdisabled®skip_as_script«code_folded$6faa3015-3ac4-44af-a78c-10b175822441cell_id$6faa3015-3ac4-44af-a78c-10b175822441code$const cliffworld = make_cliffworld()metadatashow_logsèdisabled®skip_as_script«code_folded$fa04d20f-6e3f-46f8-b3f7-a543d1fa360acell_id$fa04d20f-6e3f-46f8-b3f7-a543d1fa360acodefunction max_bias_visualization(;nvars_min = 2, nvars_max = 10, nmax = 10, nruns = 10_000) varlist = collect(nvars_min:nvars_max) estimates = mapreduce(+, 1:nruns) do _ data = randn(nmax, nvars_max) means = reduce(hcat, [cum_mean(c) for c in eachcol(data)]) maxes = reduce(vcat, [cum_max(r)[2:end]' for r in eachrow(means)]) end ./ nruns traces = [scatter(x = 1:nmax, y = c, name = "$(varlist[i]) variables") for (i, c) in enumerate(eachcol(estimates))] true_trace = scatter(x = 1:nmax, y = fill(0.0, nmax), name = "True Value", line_dash = "dash", mode = "lines", line_color = "black") plot([true_trace; traces], Layout(xaxis_title = "Number of Samples Per Variable", yaxis_title = "Estimate of Maximum Mean", title = "Maximization Bias for IID Variables with Zero Mean")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$297f1606-4ec2-4075-9f81-926dc517b76fcell_id$297f1606-4ec2-4075-9f81-926dc517b76fcodeqconst noisy_gridworld_dp = create_noisy_gridworld_mdp(noisy_gridworld, first(noisy_rewards), last(noisy_rewards))metadatashow_logsèdisabled®skip_as_script«code_folded$f2776908-d06a-4073-b2ce-ecbf109c9cc7cell_id$f2776908-d06a-4073-b2ce-ecbf109c9cc7codemd""" #### King Actions """metadatashow_logsèdisabled®skip_as_script«code_folded$d83ff60f-8973-4dc1-9358-5ad109ea5490cell_id$d83ff60f-8973-4dc1-9358-5ad109ea5490codemd""" ### Solutions on Noisy Gridworld Load Existing Results if Present: $(@bind ex_6_8_load CheckBox(default=true)) If file does not load correctly, uncheck this box to 
produce new results. """metadatashow_logsèdisabled®skip_as_script«code_folded$105c5c23-270d-437e-89dd-12297814c6e0cell_id$105c5c23-270d-437e-89dd-12297814c6e0codemd""" > ### *Exercise 6.6* > In Example 6.2 we stated that the true values for the random walk example are 1/6 , 2/6 , 3/6 , 4/6 , and 5/6 , for states A through E. Describe at least two different ways that these could have been computed. Which would you guess we actually used? Why? ###### Method 1: Set up the following system of equations that represent the relationship between state values $\begin{flalign} V(A) &= \frac{0+V(B)}{2} \implies 2V(A)=V(B) \\ V(B) &= \frac{V(A)+V(C)}{2} \implies 2V(B) = V(A)+V(C)\\ V(C) &= \frac{V(B)+V(D)}{2} \implies 2V(C)=V(B)+V(D)\\ V(D) &= \frac{V(C)+V(E)}{2} \implies 2V(D)=V(C)+V(E)\\ V(E) &= \frac{V(D)+1}{2} \implies 2V(E)=V(D)+1\\ \end{flalign}$ We can work down from the top equation expressing everything in terms of A. For shorter expressions $V(A)$ will be written below as $A$ and likewise for other states: $\begin{flalign} B&=2A \\ 2B&=A+C \implies C = 3A \\ 2C&=B+D \implies D = 6A-2A=4A \\ 2D&=C+E \implies E = 8A-3A = 5A \\ 2E &= D + 1 \implies 10A = 4A + 1 \implies A = \frac{1}{6} \end{flalign}$ Now that we have the value for A, all the others are trivial multiplications of it from 2 to 5. ###### Method 2: Calculate each value from probability of each trajectory With this method to get $V(A)$ we would write down every possible trajectory to a terminal state with the associated probability of each. Since trajectories terminating to the left have a value of 0, we only need to add up the trajectories that terminate to the right. Below are some examples for state A. $V(A) = 0.5^5 + 4 \times 0.5^7 + \cdots$ This equation represents the single trajectory that takes 5 steps to the right each with probability one half and the 4 possible trajectories that turn around once on the way right resulting in 7 steps. 
This sum will end up being infinitely long to account for all of the trajectories that bounce back and forth an arbitrarily large number of times. This method is significantly harder to calculate for each state compared to the first method and is more in line with how estimates are calculated with MC sampling. The first method is more analogous to TD sampling using the bootstrapped form of the Bellman equation. """metadatashow_logsèdisabled®skip_as_script«code_folded$e8f94345-9ad5-48d4-8709-d796fb55db3fcell_id$e8f94345-9ad5-48d4-8709-d796fb55db3fcodeexercise_6_5(α = 0.2f0)metadatashow_logsèdisabled®skip_as_script«code_folded$64b210e8-223f-41f7-a6b7-8af6183ddf87cell_id$64b210e8-223f-41f7-a6b7-8af6183ddf87codeAfunction make_noisy_gridworld(;actions = rook_actions, l = 3) xmax = l ymax = l make_windy_gridworld(;actions = actions, apply_wind = (w, x, y) -> (x, y), xmax = xmax, ymax = ymax, sterm = GridworldState(xmax, ymax), start = GridworldState(1, 1), winds = fill(0, xmax), get_step_reward = () -> rand(noisy_rewards)) endmetadatashow_logsèdisabled®skip_as_script«code_folded$2f4e2da2-b1a1-41b1-8904-39b59f426da4cell_id$2f4e2da2-b1a1-41b1-8904-39b59f426da4codeنconst king_gridworld_mdp_dp = create_gridworld_mdp(10, 7, GridworldState(1, 4), GridworldState(8, 4), wind_vals, king_actions, -1.0f0)metadatashow_logsèdisabled®skip_as_script«code_folded$bc8bad61-a49a-47d6-8fa6-7dcf6c221910cell_id$bc8bad61-a49a-47d6-8fa6-7dcf6c221910codefunction example_6_1(;elapsed = [0, 5, 20, 30, 40, 43], predicted_ttg = [30, 35, 15, 10, 3, 0]) states = [:leaving, :reach_car, :exit_highway, :snd_rd, :home_st, :arrive] tt = last(elapsed) predicted_tt = predicted_ttg .+ elapsed actual_tt = fill(tt, 6) t1 = scatter(x = states, y = predicted_tt, line_color = "black", name = "actual outcome") t1′ = scatter(x = states, y = predicted_tt, line_color = "black", name = "actual outcome", showlegend=false) t2 = scatter(x = states, y = actual_tt, mode = "lines", line = attr(dash = "dash", color = "black"), 
name = "Monte Carlo Prediction") errortraces = [scatter(x = [s, s], y = [e, tt], line = attr(color = "red"), marker = attr(symbol = "arrow-bar-up", angleref = "previous"), showlegend = false, name = "Monte Carlo Error") for (s, e) in zip(states, predicted_tt)] p1 = plot([t1; t2; errortraces], Layout(xaxis_title = "State", yaxis_title = "Predicted total
travel time", xaxis_ticktext = ["leaving office", "reach car", "exiting highway", "2ndary road", "home street", "arrive home"], xaxis_tickvals = states, width = 600, legend_orientation = "h", legend_y = 1.1)) td_prediction = [predicted_tt[2:end]; tt] t3 = scatter(x = states, y = td_prediction, name = "TD(0) Prediction", mode = "lines", line = attr(dash = "dash", color = "black", shape = "hv")) tderrors = [scatter(x = [states[i], states[i]], y = [predicted_tt[i], td_prediction[i]], line = attr(color = "red"), marker = attr(symbol = "arrow-bar-up", angleref = "previous"), showlegend = false, name = "TD(0) Error") for i in eachindex(states)] p2 = plot([t1′; t3; tderrors], Layout(xaxis_title = "State", xaxis_ticktext = ["leaving office", "reach car", "exiting highway", "2ndary road", "home street", "arrive home"], xaxis_tickvals = states, width = 600, showlegend = false)) [p1 p2] # plot(predicted_tt, xticks = (1:6, String.(states)), ylabel = "Minutes", lab = "Preicted Outcome", size = (680, 400)) # plot!(fill(43, 6), line = :dot, lab = "actual outcome") endmetadatashow_logsèdisabled®skip_as_script«code_folded$2455742f-dc18-4d6b-9f58-5666adac6919cell_id$2455742f-dc18-4d6b-9f58-5666adac6919codefunction create_car_rental_mdp(;nmax=20, λs::@NamedTuple{request_A::T, request_B::T, return_A::T, return_B::T} = (request_A = 3f0, request_B = 4f0, return_A = 3f0, return_B = 2f0), movecost::T = 2f0, rentcredit::T = 10f0, movemax::Integer=5, maxovernight::Integer = 20, overnightpenalty::T = 4f0, employeeshuttle = false) where T <: Real #enumerate all states states = [(n_a, n_b) for n_a in 0:nmax for n_b in 0:nmax] actions = collect(-movemax:movemax) #enumerate all rewards by simply incrementing by 1 dollar from the worst to best case scenario rewards = collect(-movecost*movemax - 2*overnightpenalty:rentcredit*nmax*2) reward_lookup = Dict(zip(rewards, eachindex(rewards))) #mapping from rewards to the proper index #create a lookup for the probability of starting with n cars at the 
start of the day and ending up with n′ at the end of the day function create_probability_lookup(λ_request, λ_return) #can only rent from 0 to n cars. if requests exceed n, all of those situations are equivalent and the probability is 1 - p(x < n-1) p_rent = Dict(n_request => poisson(n_request, λ_request) for n_request in 0:nmax-1) #car returns can be any number greater than or equal to 0, but all returns of nmax - (n - nrent) or more will result in the same state which is max cars p_return = Dict(n_return => poisson(n_return, λ_return) for n_return in 0:nmax-1) #initialize probabilities for each final value at 0 prob_lookup = Dict((t, nrent) => 0f0 for t in states for nrent in 0:t[1]) for n in 0:nmax for n_rent in 0:n-1 for n_return in 0:(nmax - n + n_rent - 1) n′ = n - n_rent + n_return p = p_rent[n_rent]*p_return[n_return] prob_lookup[((n, n′), n_rent)] += p end prob_lookup[((n, nmax), n_rent)] += p_rent[n_rent]*(1 - sum(p_return[n_return] for n_return in 0:nmax-n+n_rent-1; init = zero(T))) end for n_return in 0:(nmax - 1) n′ = n_return p = (1 - sum(p_rent[n_rent] for n_rent in 0:n-1; init = zero(T)))*p_return[n_return] prob_lookup[((n, n′), n)] += p end prob_lookup[((n, nmax), n)] += (1 - sum(p_rent[n_rent] for n_rent in 0:n-1; init = zero(T)))*(1 - sum(p_return[n_return] for n_return in 0:nmax-1, init = zero(T))) end return prob_lookup end probabilities = (location_A = create_probability_lookup(λs.request_A, λs.return_A), location_B = create_probability_lookup(λs.request_B, λs.return_B)) #calculate probability matrix for all the s′, r transitions given starting in state s and taking action a function getmatrix(s, a) #initialize the matrix for s′, r transitions, each column runs over the transition states out = zeros(length(states), length(rewards)) (n_a, n_b) = s #calculate the number of cars moved with sign indicating direction + being A to B, normally this is simply a but if we try to move more cars than are available, it will be capped carsmoved = if a > 0 
min(a, n_a) elseif a < 0 -min(abs(a), n_b) else 0 end #cars above nmax are returned to the company but we still incur the cost of transfering them aftercount_a = min(n_a - carsmoved, nmax) aftercount_b = min(n_b + carsmoved, nmax) cost = (abs(a) - (a > 0)*employeeshuttle)*movecost + (overnightpenalty * ((aftercount_a > maxovernight) + (aftercount_b > maxovernight))) #one free transfer from A to B if employee shuttle is true in modified version, overnight penalty if too many cars are left at a lot for (i_s′, s′) in enumerate(states) (n_a′, n_b′) = s′ for n_rent_a in 0:aftercount_a for n_rent_b in 0:aftercount_b p_a = probabilities.location_A[((aftercount_a, n_a′), n_rent_a)] p_b = probabilities.location_B[((aftercount_b, n_b′), n_rent_b)] p_total = p_a*p_b r = rentcredit*(n_rent_a+n_rent_b) - cost out[i_s′, reward_lookup[r]] += p_total end end end return out end #initialize probability function with all zeros ptf = zeros(T, length(states), length(rewards), length(actions), length(states)) for (i_s, s) in enumerate(states) for (i_a, a) in enumerate(actions) ptf[:, :, i_a, i_s] .= getmatrix(s, a) end end #find indices of the reward vector that never have non zero probability inds = reduce(intersect, [findall(0 .== [sum(ptf[:, i, j, k]) for i in 1:size(ptf, 2)]) for j in 1:size(ptf, 3) for k in 1:size(ptf, 4)]) goodinds = setdiff(eachindex(rewards), inds) FiniteMDP(states, actions, rewards[goodinds], ptf[:, goodinds, :, :]) endmetadatashow_logsèdisabled®skip_as_script«code_folded$f474fcbd-e3c3-49fd-a6b7-6d6a8a7dda09cell_id$f474fcbd-e3c3-49fd-a6b7-6d6a8a7dda09code%md""" ### Informal Proof for Bias """metadatashow_logsèdisabled®skip_as_script«code_folded$69eedbfd-396f-4461-b7a1-c36abc094581cell_id$69eedbfd-396f-4461-b7a1-c36abc094581code function example_6_7_mdp(;num_actions::Integer = 10, num_episodes = 300, nruns = 10_000, α = 0.1f0, ϵ = 0.1f0, load_file = true, fname = "figure_6_5.bin") load_file && isfile(fname) && begin p = deserialize(fname) return p end states = 
[A(), B(), Term()] actions = collect(1:num_actions) function step(::A, a) a == 1 && return (0.0f0, B()) a == 2 && return (0.0f0, Term()) return (-100f0, Term()) end step(::B, a) = (randn(Float32) - 0.1f0, Term()) state_init() = A() isterm(::Term) = true isterm(s) = false mdp = MDP_TD(states, actions, state_init, step, isterm) function get_valid_inds(i_s) i_s == 1 && return 1:2 return 1:num_actions end #in state A don't include actions other than left and right as random choices update_behavior!(v, ϵ, ::A) = make_ϵ_greedy_policy!(v, ϵ; valid_inds = 1:2) update_behavior!(v, ϵ, s) = make_ϵ_greedy_policy!(v, ϵ) Qinit = [[[0.0f0, 0.0f0]; fill(-100f0, num_actions-2)] zeros(Float32, num_actions) zeros(Float32, num_actions)] πinit = create_ϵ_greedy_policy(Qinit, ϵ; get_valid_inds = get_valid_inds) sarsa_results = mean(last(sarsa(mdp, 0.1f0, 1.0f0; num_episodes = num_episodes, save_history = true, ϵinit = ϵ, Qinit = Qinit, πinit = πinit, update_policy! = update_behavior!)) .== 1 for _ in 1:nruns) q_learning_results = mean(last(q_learning(mdp, 0.1f0, 1.0f0; num_episodes = num_episodes, save_history = true, ϵinit = ϵ, Qinit = Qinit, πinit = πinit, update_policy! = update_behavior!)) .== 1 for _ in 1:nruns) double_q_learning_results = mean(last(double_q_learning(mdp, 0.1f0, 1.0f0; num_episodes = num_episodes, save_history = true, ϵinit = ϵ, Qinit = Qinit, πinit_behavior = πinit, behavior_policy_function! = update_behavior!)) .== 1 for _ in 1:nruns) expected_sarsa_results = mean(last(expected_sarsa(mdp, 0.1f0, 1.0f0; ϵinit = ϵ, num_episodes = num_episodes, save_history = true, Qinit = Qinit, πinit = πinit, update_policy! = update_behavior!)) .== 1 for _ in 1:nruns) double_expected_sarsa_results = mean(last(double_expected_sarsa(mdp, 0.1f0, 1.0f0; ϵinit = ϵ, num_episodes = num_episodes, save_history = true, Qinit = Qinit, πinit_behavior = πinit, behavior_policy_function! = update_behavior!, target_policy_function! 
= update_behavior!)) .== 1 for _ in 1:nruns) optimal_trace = scatter(x = 1:num_episodes, y = fill(ϵ / 2, num_episodes), name = "optimal", line_dash = "dash") t0 = scatter(x = 1:num_episodes, y = sarsa_results, name = "Sarsa") t1 = scatter(x = 1:num_episodes, y = q_learning_results, name = "Q-learning") t2 = scatter(x = 1:num_episodes, y = double_q_learning_results, name = "Double Q-learning") t4 = scatter(x = 1:num_episodes, y = double_expected_sarsa_results, name = "Double Expected Sarsa") t3 = scatter(x = 1:num_episodes, y = expected_sarsa_results, name = "Expected Sarsa") # plot([t0, t1, t2, t3]) traces = [t0, t1, t2, t3, t4, optimal_trace] p = plot(traces, Layout(xaxis_title = "Episodes", yaxis_title = "% left actions from A")) serialize(fname, p) return p endmetadatashow_logsèdisabled®skip_as_script«code_folded$7ac99619-5232-4db8-8553-d79ea5415d29cell_id$7ac99619-5232-4db8-8553-d79ea5415d29codekfunction create_gridworld_mdp(mdp::MDP_TD, step_reward) #this only works when the mdp is deterministic. 
add a version for the stochastic wind example ptf = zeros(Float32, length(mdp.states), 2, length(mdp.actions), length(mdp.states)) for s in mdp.states i_s = mdp.statelookup[s] if mdp.isterm(s) ptf[i_s, 1, :, i_s] .= 1.0f0 else for a in mdp.actions (r, s′) = mdp.step(s, a) i_a = mdp.actionlookup[a] i_s′ = mdp.statelookup[s′] i_s = mdp.statelookup[s] ptf[i_s′, 2, i_a, i_s] = 1.0f0 end end end FiniteMDP(mdp.states, mdp.actions, [0.0f0, step_reward], ptf) endmetadatashow_logsèdisabled®skip_as_script«code_folded$0163763b-a15f-447e-b3d2-32d4bf9d2605cell_id$0163763b-a15f-447e-b3d2-32d4bf9d2605codeٖ@bind max_visual_params2 PlutoUI.combine() do Child md""" Number of Variables: $(Child(:nvars, NumberField(2:100, default = 2))) """ end |> confirmmetadatashow_logsèdisabled®skip_as_script«code_folded$53145cc2-784c-468b-8e91-9bb7866db218cell_id$53145cc2-784c-468b-8e91-9bb7866db218coder@bind t PlutoUI.Clock(interval = delay, max_value = length(mrp_trajectory[1])+5, repeat=true, start_running=false)metadatashow_logsèdisabled®skip_as_script«code_folded$6b496582-cc0e-4195-87ef-94792b0fff54cell_id$6b496582-cc0e-4195-87ef-94792b0fff54code{function make_ϵ_greedy_policy!(v::AbstractVector{T}, ϵ::T; valid_inds = eachindex(v)) where T <: Real vmax = maximum(v[i] for i in valid_inds) v .= T.(isapprox.(v, vmax)) s = sum(v) c = s * ϵ / length(valid_inds) d = one(T)/s - ϵ #value to add to actions that are maximizing for i in valid_inds if v[i] == 1 v[i] = d + c else v[i] = c end end return v endmetadatashow_logsèdisabled®skip_as_script«code_folded$9db7a268-1e6d-4366-a0ec-ebf54916d3b0cell_id$9db7a268-1e6d-4366-a0ec-ebf54916d3b0codeexample_6_2(l = nstates)metadatashow_logsèdisabled®skip_as_script«code_folded$c2f56287-9a3e-454a-9ec1-53184b788db9cell_id$c2f56287-9a3e-454a-9ec1-53184b788db9code-const jacks_car_mdp = create_car_rental_mdp()metadatashow_logsèdisabled®skip_as_script«code_folded$18e60b1d-97ec-432c-a388-003e7fae415fcell_id$18e60b1d-97ec-432c-a388-003e7fae415fcodefunction 
bellman_optimal_value!(V::Vector{T}, mdp::FiniteAfterstateMDP{T, S1, S2, A}, γ::T) where {T <: Real, S1, S2, A} delt = zero(T) q_vec = zeros(T, length(mdp.actions)) @inbounds @fastmath @simd for i_y in eachindex(mdp.afterstates) q_total = zero(T) r_total = zero(T) @inbounds @fastmath @simd for i_s′ in eachindex(mdp.states) p_total = zero(T) q_vec .= mdp.reward_interim_map[:, i_s′] .+ V[mdp.afterstate_map[:, i_s′]] q_max = maximum(q_vec) @inbounds @fastmath for (i_r, r) in enumerate(mdp.rewards) p = mdp.ptf[i_s′, i_r, i_y] r_total += p*r p_total += p end q_total += q_max*p_total end v_new = r_total + γ*q_total delt = max(delt, abs(v_new - V[i_y]) / (eps(abs(V[i_y])) + abs(V[i_y]))) V[i_y] = v_new end return delt endmetadatashow_logsèdisabled®skip_as_script«code_folded$12c5efe4-d64d-4b82-877c-29b0e537fee6cell_id$12c5efe4-d64d-4b82-877c-29b0e537fee6codeBbegin start_mrp mrp_trajectory = runepisode(mrp_6_2, π_mrp) endmetadatashow_logsèdisabled®skip_as_script«code_folded$a72d07bf-e337-4bd4-af5c-44d74d163b6bcell_id$a72d07bf-e337-4bd4-af5c-44d74d163b6bcode'exercise_6_5(α = 0.2f0, vinit = 0.0f0)metadatashow_logsèdisabled®skip_as_script«code_folded$0201ae9f-4a31-497e-86ab-62b454ca85decell_id$0201ae9f-4a31-497e-86ab-62b454ca85decodemd""" Notice that above about $\alpha = 0.25$, Q-learning sometimes has diverging values and therefore episodes that avoid termination, whereas Double Q-learning avoids that problem even at large learning rates. """metadatashow_logsèdisabled®skip_as_script«code_folded$b37f2395-1480-4c7c-b6c0-eba391e969d7cell_id$b37f2395-1480-4c7c-b6c0-eba391e969d7code gmd""" Let's first consider the prediction problem for afterstates and see how to compute the afterstate value function and how it could be used for policy improvement. We will use the notation $W(y)$ to represent the value of afterstate $y$ while $V(s)$ still means the value of state $s$. 
From the earlier definitions, we can show the relationship between the state and afterstate value functions. Recall that: $\begin{flalign} G_t &\doteq R_t + \gamma R_{t+1} + \cdots \\ V_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \\ & = \mathbb{E}_\pi[R_t + \gamma V_\pi(S_{t+1}) \mid S_t = s] \\ &= \sum_a \pi(a \vert s) \sum_{r, s^\prime} p(r, s^\prime \vert s, a) \left ( r + \gamma V_\pi(s^\prime) \right ) \end{flalign}$ Representing the trajectory with afterstates and only considering the reward following an afterstate, we also know that: $\begin{flalign} G_t &\doteq R_t + \gamma(P_{t+1} + R_{t+1} + \gamma(P_{t+2} + R_{t+2} + \cdots))\\ W_\pi(y) &\doteq \mathbb{E}_\pi[G_t \mid Y_t = y] \\ & = \mathbb{E}_\pi[R_t + \gamma \left (P_{t+1} + W_\pi(Y_{t+1}) \right ) \mid Y_t = y] \\ &= \sum_{r, s^\prime} p(r, s^\prime \vert y) \left [r + \gamma \sum_{a^\prime} \left [ \pi(a^\prime \vert s^\prime) \left ( f_2(s^\prime, a^\prime) + W_\pi(f_1(s^\prime, a^\prime)) \right ) \right ] \right ] \end{flalign}$ Notice that compared to the value function, the policy only matters for this expected value when we consider the action taken from the transition state. The initial transition from the afterstate to $s^\prime$ depends only on our new transition function, which is conditioned only on the afterstate. Recall that to improve a policy $\pi$ for which we have a value function $V_\pi$, we must select the greedy policy with respect to $V_\pi$, meaning $\pi^{\prime} (s) = \mathrm{argmax}_a \sum_{r, s^\prime} p(r, s^\prime \vert s, a)(r + \gamma V_\pi(s^\prime))$. If we do not have access to the full probability transition function, we cannot compute this explicitly. Furthermore, we cannot estimate it from a single trajectory either, because from each state we would just have a single transition based on the behavior policy at the time. 
That's why for MDPs that do not provide the full transition function, we prefer to estimate the state-action value function $Q(s, a)$, because with that function policy improvement is much simpler: $\pi^{\prime} (s) = \mathrm{argmax}_a Q(s, a)$. """metadatashow_logsèdisabled®skip_as_script«code_folded$6edb550d-5c9f-4ea6-8746-6632806df11ecell_id$6edb550d-5c9f-4ea6-8746-6632806df11ecodeexample_6_1()metadatashow_logsèdisabled®skip_as_script«code_folded$01582b3b-c4d0-4691-9edf-f77e6d8be2c9cell_id$01582b3b-c4d0-4691-9edf-f77e6d8be2c9codeDmd""" ### Maximization Bias Visualization for a Single Estimator """metadatashow_logsèdisabled®skip_as_script«code_folded$7ed07ddc-1c63-4ce7-bfd3-6da54304d297cell_id$7ed07ddc-1c63-4ce7-bfd3-6da54304d297codefunction makepolicyvaluemaps(mdp::CompleteMDP, v::Vector{T}, π::Matrix{T}) where T <: Real function getaction(dist) #default action will be 0 sum(dist) == 0 && return 0 (p, ind) = findmax(dist) mdp.actions[ind] end policymap = zeros(Int64, 21, 21) valuemap = zeros(T, 21, 21) for i in 1:size(π, 2) action = getaction(view(π, :, i)) (n_a, n_b) = mdp.states[i] policymap[n_a+1, n_b+1] = action valuemap[n_a+1, n_b+1] = v[i] end (policymap, valuemap) endmetadatashow_logsèdisabled®skip_as_script«code_folded$4862942b-d1e2-4ac8-8e88-65205e91a070cell_id$4862942b-d1e2-4ac8-8e88-65205e91a070codec@bind max_visual_params PlutoUI.combine() do Child md""" ||| |---|---| |Maximum Number of Variables:|$(Child(:nvars, NumberField(2:100, default = 4)))| |Maximum Number of Samples Per Variable:| $(Child(:nmax, NumberField(10:1000, default = 100)))| |Number of Runs:| $(Child(:nruns, NumberField(100:1_000_000, default = 10_000)))| """ end |> confirmmetadatashow_logsèdisabled®skip_as_script«code_folded$a5009785-64b4-489b-a967-f7840b4a9463cell_id$a5009785-64b4-489b-a967-f7840b4a9463code-md""" #### Random Walk Visualization Code 
"""metadatashow_logsèdisabled®skip_as_script«code_folded$eb735ead-978b-409c-8990-b5fa7a027ebfcell_id$eb735ead-978b-409c-8990-b5fa7a027ebfcodefunction tabular_TD0_pred_V(π::Matrix{T}, mdp::MDP_TD{S, A, F, G, H}, α::T, γ::T; num_episodes::Integer = 1000, vinit::T = zero(T), V::Vector{T} = initialize_state_value(mdp; vinit = vinit), save_states::Vector{S} = Vector{S}()) where {T <: AbstractFloat, S, A, F, G, H} check_policy(π, mdp) terminds = findall(mdp.isterm(s) for s in mdp.states) #initialize counts = zeros(Integer, length(mdp.states)) V[terminds] .= zero(T) #terminal state must always have 0 value v_saves = zeros(T, length(save_states), num_episodes+1) function updatesaves!(ep) for (i, s) in enumerate(save_states) i_s = mdp.statelookup[s] v_saves[i, ep] = V[i_s] end end updatesaves!(1) #simulate and episode and update the value function every step function runepisode!(V, j) s = mdp.state_init() while !mdp.isterm(s) (i_s, i_s′, r, s′, a, i_a) = takestep(mdp, π, s) V[i_s] += α * (r + γ*V[i_s′] - V[i_s]) s = s′ end updatesaves!(j+1) return V end for i = 1:num_episodes; runepisode!(V, i); end return V, v_saves endmetadatashow_logsèdisabled®skip_as_script«code_folded$2034fd1e-5171-4eda-85d5-2de62d7a1e8bcell_id$2034fd1e-5171-4eda-85d5-2de62d7a1e8bcodefunction q_learning(mdp::MDP_TD{S, A, F, G, H}, α::T, γ::T; num_episodes = 1000, qinit = zero(T), ϵinit = one(T)/10, Qinit = initialize_state_action_value(mdp; qinit=qinit), πinit = create_ϵ_greedy_policy(Qinit, ϵinit), decay_ϵ = false, history_state::S = first(mdp.states), save_history = false, update_policy! 
= (v, ϵ, s) -> make_ϵ_greedy_policy!(v, ϵ)) where {S, A, F, G, H, T<:AbstractFloat} terminds = findall(mdp.isterm(s) for s in mdp.states) Q = copy(Qinit) Q[:, terminds] .= zero(T) π = copy(πinit) vhold = zeros(T, length(mdp.actions)) #keep track of rewards and steps per episode as a proxy for training speed rewards = zeros(T, num_episodes) steps = zeros(Int64, num_episodes) if save_history history_actions = Vector{A}(undef, num_episodes) end for ep in 1:num_episodes ϵ = decay_ϵ ? ϵinit/ep : ϵinit s = mdp.state_init() rtot = zero(T) l = 0 while !mdp.isterm(s) (i_s, i_s′, r, s′, a, i_a) = takestep(mdp, π, s) if save_history && (s == history_state) history_actions[ep] = a end qmax = maximum(Q[i, i_s′] for i in eachindex(mdp.actions)) Q[i_a, i_s] += α*(r + γ*qmax - Q[i_a, i_s]) #update terms for next step vhold .= Q[:, i_s] update_policy!(vhold, ϵ, s) π[:, i_s] .= vhold s = s′ l+=1 rtot += r end steps[ep] = l rewards[ep] = rtot end save_history && return Q, π, steps, rewards, history_actions return Q, π, steps, rewards endmetadatashow_logsèdisabled®skip_as_script«code_folded$4382928c-6325-4ecd-b7cf-282525a270abcell_id$4382928c-6325-4ecd-b7cf-282525a270abcodeيbegin abstract type MaxBiasStates end struct A <: MaxBiasStates end struct B <: MaxBiasStates end struct Term <: MaxBiasStates end endmetadatashow_logsèdisabled®skip_as_script«code_folded$8bc54c94-9c92-4904-b3a6-13ff3f0110bbcell_id$8bc54c94-9c92-4904-b3a6-13ff3f0110bbcodefunction show_grid_value(mdp, Q::Matrix, wind::Vector, name; action_display = king_action_display, scale = 1.0) width = maximum(s.x for s in mdp.states) height = maximum(s.y for s in mdp.states) start = mdp.state_init() termind = findfirst(mdp.isterm, mdp.states) sterm = mdp.states[termind] ngrid = width*height @htl("""
$(HTML(mapreduce(i -> """
$(round(maximum(Q[:, i]), sigdigits = 2))
""", *, eachindex(mdp.states))))
$(HTML(mapreduce(i -> """
$(wind[i])
""", *, 1:width)))
$(action_display)
Wind Values
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$4b1a4c14-3c2b-40c0-995c-cd0334ed8b3acell_id$4b1a4c14-3c2b-40c0-995c-cd0334ed8b3acodemd""" #### Normal Actions """metadatashow_logsèdisabled®skip_as_script«code_folded$f0f9d3d5-e76a-4472-bfb1-da29d73a7916cell_id$f0f9d3d5-e76a-4472-bfb1-da29d73a7916codeقexample_6_5(;mdp = king_gridworld, num_episodes = 400, action_display = king_action_display, policy_display = display_king_policy)metadatashow_logsèdisabled®skip_as_script«code_folded$4c1b286c-2ba9-4293-81e1-bf360baa75facell_id$4c1b286c-2ba9-4293-81e1-bf360baa75facode md""" The following argument is taken from ["Double Q-learning"](https://papers.nips.cc/paper_files/paper/2010/file/091d584fced301b442654dd8c23b3fc9-Paper.pdf) by Hado van Hasselt published in _Advances in Neural Information Processing Systems 23 (NIPS 2010)_: Consider a set of $M$ random variables $X=\{X_1, \dots, X_M\}$. We would like to calculate: $$\max_i \mathbb{E} \{X_i\} \tag{a}$$ Without any knowledge of the underlying distribution of each $X_i$ it is impossible to determine $(\star)$ exactly. Most often we would approximate it by first constructing approximations for $\mathbb{E} \{ X_i \} \: \forall \: i$. Let $S = \bigcup_{i=1}^M S_i$ denote the set of samples where $S_i$ is the subset containing samples for the variable $X_i$. We assume that the samples in $S_i$ are independent and identically distributed (iid). Unbiased estimates for the expected values can be obtained by computing hte sample average for each variable: $\mathbb{E} \{ X_i \} = \mathbb{E} \{ \mu_i \} \approx \mu_i(S) \doteq \frac{1}{\vert S_i \vert } \sum_{s \in S_i} s$ where $\mu_i$ is an estimator for the variable $X_i$. This approximation is unbiased since very sample $s in S_i$ is an unbiased estimat for the value of $\mathbb{E} \{ X_i \}$. The error in approximation thus consists soley of the variance in the estimator and decreases when we obtain more samples. 
We use the following notations: $f_i$ denotes the probability density function (PDF) of the $i^{th}$ variable $X_i$ and $F_i(x) = \int_{-\infty}^{x} f_i(x)dx$ is the cumulative distribution function (CDF) of this PDF. Similarly, the PDF and CDF of the $i^{th}$ estimator are denoted $f_i^\mu$ and $F_i^\mu$. The maximum expected value can be expressed in terms of the underlying PDFs as $\max_i \mathbb{E} \{ X_i \} = \max_i \int_{-\infty}^\infty x f_i(x)dx$. An obvious way to approximate the value of $(a)$ is to use the value of the maximal estimator: $$\max_i \mathbb{E} \{ X_i \} = \max_i \mathbb{E} \{ \mu_i \} \approx \max_i \mu_i(S) \tag{b}$$ and this is the estimator employed in ordinary Q-learning. This estimator is distributed according to some PDF $f_{\max}^\mu$ that is dependent on the PDFs of the estimators $f_i^\mu$. To determine this PDF, consider the CDF $F_{\max}^\mu(x)$, which gives the probability that the maximum estimate is lower than or equal to $x$. This probability is equal to the probability that all the estimates are lower than or equal to $x$: $F_{\max}^\mu(x) \doteq P(\max_i \mu_i \leq x) = \prod_{i=1}^M P(\mu_i\leq x) \doteq \prod_{i=1}^M F_i ^\mu (x)$. The value $\max_i \mu_i(S)$ is an unbiased estimate for $\mathbb{E} \{ \max_j \mu_j \} = \int_{-\infty}^{\infty} x f_{\max}^\mu(x)dx$ which can thus be given by: $$\mathbb{E} \{ \max_j \mu_j \} = \int_{-\infty}^{\infty} x \frac{d}{dx} \prod_{i=1}^M F_i ^ \mu (x) dx = \sum_{j=1}^M \int_{-\infty}^{\infty}x f_j ^ \mu (x) \prod_{i \neq j}^M F_i ^ \mu(x) dx \tag{c}$$ However, in $(a)$ the order of the max operator and the expectation operator is the other way around. The following illustrates why $(c)$ has a positive bias. """metadatashow_logsèdisabled®skip_as_script«code_folded$3134e913-1e86-495d-a558-c3ec4828bf7bcell_id$3134e913-1e86-495d-a558-c3ec4828bf7bcodeٺbegin_value_iteration_v(mdp::FiniteMDP{T,S,A}, γ::T; Vinit::T = zero(T), kwargs...) 
where {T<:Real,S,A} = begin_value_iteration_v(mdp, γ, Vinit .* ones(T, size(mdp.ptf, 1)); kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$db31579e-3e56-4271-8fc3-eb13bc95ac27cell_id$db31579e-3e56-4271-8fc3-eb13bc95ac27code[md""" Adding the no-movement action doesn't seem to change the shortest path of 7 steps """metadatashow_logsèdisabled®skip_as_script«code_folded$943b6d7e-14a4-4532-90c7-dd5080be0c6ecell_id$943b6d7e-14a4-4532-90c7-dd5080be0c6ecode%const noisy_rewards = [-1.2f0, 1.0f0]metadatashow_logsèdisabled®skip_as_script«code_folded$84584793-8274-4aa1-854f-b167c7434548cell_id$84584793-8274-4aa1-854f-b167c7434548code function gridworld_Q_vs_sarsa_vs_expected_sarsa_solve(mdp; α=0.5f0, ϵ=0.1f0, num_episodes = 500, nruns = 100) function addtuple(t1, t2) Tuple(t1[i] .+ t2[i] for i in eachindex(t1)) end sarsa_results = mapreduce(addtuple, 1:nruns) do _ sarsa(mdp, α, 1.0f0; num_episodes = num_episodes, ϵinit = ϵ) end qlearning_results = mapreduce(addtuple, 1:nruns) do _ q_learning(mdp, α, 1.0f0; num_episodes = num_episodes, ϵinit = ϵ) end expected_sarsa_results = mapreduce(addtuple, 1:nruns) do _ expected_sarsa(mdp, α, 1.0f0; num_episodes = num_episodes, ϵinit = ϵ) end # double_expected_sarsa_results = mapreduce(addtuple, 1:nruns) do _ # double_q_learning(mdp, α, 1.0f0; num_episodes = num_episodes, ϵinit = ϵ) # end # qlearning_results = [q_learning(mdp, α, 1.0f0; num_episodes = num_episodes, ϵinit = ϵ) for _ in 1:nruns] p1 = plot_path(mdp, create_greedy_policy(sarsa_results[1] ./ nruns); windtext = fill("", 12), xtitle = "", title = "Cliff Walking Sarsa Path") p2 = plot_path(mdp, qlearning_results[2] ./ nruns; windtext = fill("", 12), xtitle = "", title = "Cliff Walking Q Learning Path") expected_sarsa_path = plot_path(mdp, create_greedy_policy(expected_sarsa_results[1] ./ nruns); windtext = fill("", 12), xtitle = "", title = "Cliff Walking Expected Sarsa Path") # double_expected_sarsa_path = plot_path(mdp, 
create_greedy_policy(double_expected_sarsa_results[1] ./ nruns); windtext = fill("", 12), xtitle = "", title = "Cliff Walking Double Expected Sarsa Path") traces = [scatter(x = 1:num_episodes, y = results[4] ./ nruns, name = name) for (results, name) in zip([sarsa_results, qlearning_results, expected_sarsa_results], ["Sarsa", "Q-learning", "Expected Sarsa"])] p3 = plot(traces, Layout(xaxis_title = "Episodes", yaxis = attr(title = "Sum of rewards during episode", range = [-100, -15]))) p3 = plot(traces, Layout(xaxis_title = "Episodes", yaxis = attr(title = "Sum of rewards during episode", range = [-100, -15]))) steptraces = [scatter(x = 1:num_episodes, y = results[3] ./ nruns, name = name) for (results, name) in zip([sarsa_results, qlearning_results, expected_sarsa_results], ["Sarsa", "Q-learning", "Expected Sarsa"])] p4 = plot(steptraces, Layout(xaxis_title = "Episodes", yaxis = attr(title = "Average steps per episode
during training", range = [0, 100]))) @htl("""
$p1
$p2
$expected_sarsa_path
$p3 $p4 """ ) endmetadatashow_logsèdisabled®skip_as_script«code_folded$9f28772c-9afe-4253-ab3b-055b0f48be6ecell_id$9f28772c-9afe-4253-ab3b-055b0f48be6ecodefunction plot_path(mdp, π; title = "Optimal policy
path example", windtext = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0], xtitle = "Wind Values") eg = runepisode(mdp, π; max_steps = 100) xmax = maximum([s.x for s in mdp.states]) ymax = maximum([s.y for s in mdp.states]) start = mdp.state_init() goal = mdp.states[findfirst(mdp.isterm(s) for s in mdp.states)] start_trace = scatter(x = [start.x + 0.5], y = [start.y + 0.5], mode = "text", text = ["S"], textposition = "left", showlegend=false) finish_trace = scatter(x = [goal.x + .5], y = [goal.y + .5], mode = "text", text = ["G"], textposition = "left", showlegend=false) path_traces = [scatter(x = [eg[1][i].x + 0.5, eg[1][i+1].x + 0.5], y = [eg[1][i].y + 0.5, eg[1][i+1].y + 0.5], line_color = "blue", mode = "lines", showlegend=false, name = "Optimal Path") for i in 1:length(eg[1])-1] finalpath = scatter(x = [eg[1][end].x + 0.5, last(eg).x + .5], y = [eg[1][end].y + 0.5, last(eg).y + 0.5], line_color = "blue", mode = "lines", showlegend=false, name = "Optimal Path") h1 = 30*ymax plot([start_trace; finish_trace; path_traces; finalpath], Layout(xaxis = attr(showgrid = true, showline = true, gridwidth = 1, gridcolor = "black", zeroline = true, linecolor = "black", mirror=true, tickvals = 1:xmax, ticktext = windtext, range = [1, xmax+1], title = xtitle), yaxis = attr(linecolor="black", mirror = true, gridcolor = "black", showgrid = true, gridwidth = 1, showline = true, tickvals = 1:ymax, ticktext = fill("", ymax), range = [1, ymax+1]), width = max(30*xmax, 200), height = max(h1, 200), autosize = false, padding=0, paper_bgcolor = "rgba(0, 0, 0, 0)", title = attr(text = title, font_size = 14, x = 0.5))) endmetadatashow_logsèdisabled®skip_as_script«code_folded$1dd1ba55-548a-41f6-903e-70742fd60e3dcell_id$1dd1ba55-548a-41f6-903e-70742fd60e3dcode>show_mrp_state("eg1", mrp_trajectory[1], mrp_trajectory[3], t)metadatashow_logsèdisabled®skip_as_script«code_folded$2a3e4617-efbb-4bbc-9c61-8535628e439ccell_id$2a3e4617-efbb-4bbc-9c61-8535628e439ccodemd""" > ### *Exercise 6.12* > Suppose action 
selection is greedy. Is Q-learning then exactly the same algorithm as Sarsa? Will they make exactly the same action selections and weight updates? Consider both updates when the greedy policy is followed during training. Sarsa Update: $Q_\pi(S_t, A_t) = \text{E}_\pi [R_{t+1} + \gamma Q_\pi(S_{t+1}, A_{t+1})]$ with $A_{t+1}$ chosen by the greedy policy according to $\text{argmax}_a Q_\pi(S_{t+1}, a)$ for the estimates prior to this update. Q-Learning Update: $Q_\pi(S_t, A_t) = \text{E}_\pi [R_{t+1} + \gamma \text{max}_a Q_\pi(S_{t+1}, a)]$ The value updates are identical since the Q estimate used in both cases will be based on the maximizing action at state $S_{t+1}$. In the case of Sarsa, $A_{t+1}$ has already been selected prior to this update occurring, so this value update will properly reflect the next step in the trajectory. In Q-learning, the action selection at $S_{t+1}$ will occur after the update step. Notice that we only updated $Q_\pi(S_t, A_t)$ and did not touch $Q_\pi(S_{t+1}, A_{t+1})$, so our next action selection should be unaffected by this update. However, there is one exception for the case where the state is identical through the transition: $S_t = S_{t+1}$. In this case, the update could actually affect the next action selection. For example, let's say a very low reward was received during the update. That would lower the estimate for the action selected on step t, and it may no longer be maximizing on step t+1. Then Sarsa would have chosen the same action ahead of the update, but Q-learning would choose a different action on the next step even though the state is unchanged. Despite this difference, both methods are still computing the state-action value function for the optimal policy, but neither is guaranteed to converge to this function due to the violation of the assumption that all state-action pairs are visited during training. 
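
The self-transition exception above can be checked with a few lines of arithmetic. The following is only a sketch under assumed numbers: a hypothetical single state whose actions all lead back to itself, with made-up initial estimates `Qinit`, reward, and step size, none of which come from the notebook.

```julia
# Hypothetical setup: one state, two actions, every action returns to the same state.
α, γ = 0.5, 1.0
reward = -10.0          # assumed (very low) reward received on step t
Qinit = [1.0, 0.5]      # assumed action-value estimates before the update

# Sarsa: A_{t+1} is selected greedily BEFORE the update is applied.
Q_sarsa = copy(Qinit)
a_sarsa = argmax(Q_sarsa)
a_next_sarsa = argmax(Q_sarsa)   # S_{t+1} == S_t, so the same estimates are used
Q_sarsa[a_sarsa] += α * (reward + γ * Q_sarsa[a_next_sarsa] - Q_sarsa[a_sarsa])

# Q-learning: the update uses the max, and A_{t+1} is selected AFTER the update.
Q_q = copy(Qinit)
a_q = argmax(Q_q)
Q_q[a_q] += α * (reward + γ * maximum(Q_q) - Q_q[a_q])
a_next_q = argmax(Q_q)

println("Sarsa:      Q = ", Q_sarsa, ", next action = ", a_next_sarsa)
println("Q-learning: Q = ", Q_q, ", next action = ", a_next_q)
```

Under these assumptions both methods apply the same value update, but Sarsa is already committed to repeating action 1 while Q-learning re-selects and switches to action 2.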
"""metadatashow_logsèdisabled®skip_as_script«code_folded$5f32fed0-c921-4cbb-85fe-ade54d4c6c95cell_id$5f32fed0-c921-4cbb-85fe-ade54d4c6c95codeImd""" At each state or checkpoint you try to predict how much longer it will take to get home using any information that is relevant. Notice that regardless of how inaccurate we were on previous steps, we can still make an accurate prediction for the time to go. |State|Elapsed Time (minutes)|Predicted Time to Go|Predicted Total Time| |---|---|---|---| |leaving office, friday at 6|0|30|30| |reach car, raining|5|35|40| |exiting highway|20|15|35| |2ndary road, behind truck|30|10|40| |entering home street|40|3|43| |arriving home|43|0|43| The rewards in this example are the elapsed times on each leg of the journey and there is no discounting, thus the return for each state is the actual time to go from that state. The value of each state is the *expected* time to go. The second column of numbers gives the current estimated value for the state encountered. A simple way to view the operation of Mone Carlo methods is to plot hte predicted total time (the last column) over the sequence. For each state we would compare that value with the actual elapsed time which was 43 minutes. 
"""metadatashow_logsèdisabled®skip_as_script«code_folded$a3d10753-2ec3-4252-9629-834145678b6acell_id$a3d10753-2ec3-4252-9629-834145678b6acode'md""" ### Afterstate Implementation """metadatashow_logsèdisabled®skip_as_script«code_folded$12aac612-758b-4655-8ede-daddd4af6d3ecell_id$12aac612-758b-4655-8ede-daddd4af6d3ecode#take a step in the environment from state s using policy π and generate the subsequent action selection as well function sarsa_step(mdp::MDP_TD{S, A, F, G, H}, π::Matrix{T}, s::S, a::A) where {S, A, F<:Function, G<:Function, H<:Function, T<:Real} (r, s′) = mdp.step(s, a) i_s′ = mdp.statelookup[s′] i_a′ = sample_action(π, i_s′) a′ = mdp.actions[i_a′] return (s′, i_s′, r, a′, i_a′) endmetadatashow_logsèdisabled®skip_as_script«code_folded$2c49900b-3c57-4d9a-b3dc-ef9cc20c30c1cell_id$2c49900b-3c57-4d9a-b3dc-ef9cc20c30c1code@md""" To understand the origin of the bias, consider a case where we only have a single sample from each variable which follows a standard normal distribution. In this case our estimate of the maximum expected value is just $\max(x, y)$ where $x$ and $y$ are samples from $X$ and $Y$ respectively. The expected value of this estimator can be calculated using the distribution of the maximum of two standard normal random variables: $\mathbb{E}\left [ \text{max}(\mathcal{N}(0, 1), \mathcal{N}(0, 1)) \right ] = \frac{1}{\sqrt{\pi}} \approx 0.564$ Indeed, on the plot for 2 variables after 1 sample collected for each, this average observed value is 0.56 and the value increase the more variables in our list. So apparantly our estimate has a positive bias despite the fact that every underlying variables have exactly the same distribution. If we had more samples for each variable then we would use the distribution of the sample average rather than a single sample and that distribution has a variance proportional to the inverse of the number of samples. 
So the bias will converge to zero in the limit of infinite samples, and in the graph the bias does in fact converge to zero over more samples. There is a method of eliminating this positive bias using a so-called *double estimator*, and this method was first introduced by Hado van Hasselt in a paper published at NIPS 2010. Below is a more thorough overview of the paper, but first I will provide a conceptual sketch of the proof. First consider a set of $M$ random variables $X = \{X_1, \dots, X_M \}$ where our goal is to estimate $\max_i \mathbb{E} \{ X_i \}$. In the single estimator case, we will draw samples from each variable and construct some unbiased estimator for each mean: $\mu_i$. After we have collected some set of samples, using this method, we make the assumption that whichever estimator or set of estimators has the maximum value corresponds to the variables with the true maximum expected value. If there is zero overlap in the distribution of each random variable, then these estimators will always be ranked in the same order as the true expected values and our estimate will be unbiased. However, if there is any overlap in the underlying distributions (this also includes the case where all distributions are identical), then there is some non-zero probability that the true maximum index is NOT in the set of indices for the maximum estimators. Let's say the apparent maximizing index from the sample is $s^*$ while one of the true maximizing indices is $j \neq s^*$. So our final estimate for the maximum expected value will be $\mu_{s^*}$. We already know that $\mathbb{E} \{ X_j \} = \max_i \mathbb{E} \{X_i \}$ by assumption. We also know that $\mu_{s^*} \geq \mu_j$ in the sample and $\mathbb{E} \{ \mu_j\} = \max_i \mathbb{E} \{X_i \}$ which is the true value that we want. So we would always expect this estimator to be larger than the true answer or equal to it in the case where the selected index is correct. 
This is even true if all the variables share the same distribution, because every estimate has the same expected value which is the true answer, yet the one estimate we use to calculate the maximum is guaranteed to be at least as large as all of those unbiased alternatives. The underlying reason this tends to overestimate is that in any finite sample, we are not guaranteed to know the correct maximizing index and any variable that produces samples high enough to exceed the true maximum will always be selected to represent that maximum. In the double estimator case, we split the samples into two sets $\mathcal{A}$ and $\mathcal{B}$ such that $\mathcal{A} \cap \mathcal{B} = \emptyset$ and have a set of estimators for each set $\mu_i^\mathcal{A}$ and $\mu_i^\mathcal{B}$. Let $a^*$ be in the set of indices with the maximum estimated values in set $\mathcal{A}$. Again, if the underlying distributions overlap at all, then there is some probability that this index is not in the set of true maximizing indices. However, now if all the distributions are equal, then whichever index we pick is still guaranteed to be correct. To estimate the actual value of the maximum, we take $\mu_{a^*}^\mathcal{B}$ which is the estimate from set $\mathcal{B}$ at the maximizing index from set $\mathcal{A}$. Just like in the single estimator case, if this happens to be a correct index, then we have an unbiased estimate for the true value. However, if the index is wrong, we are estimating the expected value of a non-maximizing index from a new set of samples. By the definition of the maximizing indices, we know that in this case $\mathbb{E} \{ \mu_{a^*}^\mathcal{B} \} \lt \max_i \mathbb{E} \{ X_i \}$ resulting in a negative bias for our estimate. Just like in the single estimator case, this estimate will be unbiased if there is no overlap in the underlying probability distributions for each variable. 
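
The two estimators described above can be compared with a short simulation. This is a minimal sketch, assuming iid standard normal variables whose true maximum expected value is 0; the helper names `single_estimate`, `double_estimate`, and `estimator_bias`, the even split into the two sample sets, and the default sizes are all illustrative assumptions rather than the notebook's own implementation.

```julia
using Random, Statistics

# Single estimator: the maximum of the per-variable sample means (all samples used).
single_estimate(samples::Matrix{Float64}) = maximum(mean(samples, dims = 1))

# Double estimator: choose the index with the first half of the samples (set A),
# then report the mean of that index from the second half (set B).
function double_estimate(samples::Matrix{Float64})
    n = size(samples, 1)
    half = n ÷ 2
    μA = vec(mean(view(samples, 1:half, :), dims = 1))
    μB = vec(mean(view(samples, half+1:n, :), dims = 1))
    return μB[argmax(μA)]
end

# Average both estimators over many runs; each column of `samples` is one variable.
function estimator_bias(; nvars = 10, nsamples = 10, nruns = 20_000,
                          rng = MersenneTwister(1))
    single = zeros(nruns)
    double = zeros(nruns)
    for i in 1:nruns
        samples = randn(rng, nsamples, nvars)
        single[i] = single_estimate(samples)
        double[i] = double_estimate(samples)
    end
    return (single = mean(single), double = mean(double))
end

bias = estimator_bias()
```

Under these assumptions the averaged single estimate should sit clearly above 0 while the averaged double estimate should stay near 0.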
Unlike the single estimator case, this estimate will also be unbiased if all the underlying distributions are equal. See below for a visualization of the bias removal for the iid case as well as the more formal proof for both methods. """metadatashow_logsèdisabled®skip_as_script«code_folded$e26f788e-f602-403e-929e-6c98a6e6bf79cell_id$e26f788e-f602-403e-929e-6c98a6e6bf79codemd""" The double estimator methods are the only ones that don't show an initial increase in the number of episodes. After enough time though, every method starts to converge to the policy that takes a direct path. If $\alpha$ is not low enough, Q-learning fails to converge towards the optimal policy and has diverging value estimates. Both double methods are very stable and correctly estimate every state to have a negative value. """metadatashow_logsèdisabled®skip_as_script«code_folded$c09530bc-f37e-4d57-a267-14d4027147dacell_id$c09530bc-f37e-4d57-a267-14d4027147dacodemd""" Returning to the definition of $\eta_t$, we can simplify further: $\eta_{t} \doteq V_{t+1}(S_{t+1}) - V_t(S_{t+1})$ This quantity is the change in the value estimate at a state between two time steps. Note that at time $t+1$ we have only performed an update for the value at state $S_t$ using the equation: $V_{t+1}(S_t) = V_t(S_t) + \alpha \delta_t$ If $S_{t+1} \neq S_t$, then the value estimate at $S_{t+1}$ is not updated at this step, so $V_{t+1}(S_{t+1}) = V_t(S_{t+1}) \implies \eta_{t} = 0$ The only case in which $V_{t+1}(S_{t+1}) \neq V_t(S_{t+1})$ is when $S_t = S_{t+1} = S$. 
In this case, $V_{t+1}(S) = V_t(S) + \alpha \delta_t \implies V_{t+1}(S) - V_t(S) = \alpha \delta_t$ So we can rewrite $\eta_{t} = \alpha \delta_t \mathbb{1}_{t}$ where $\mathbb{1}_{t} = \begin{cases} 1 & \text{if } S_{t+1} = S_t \\ 0 & \text{otherwise} \end{cases}$ So the original equation can be written as: $\begin{flalign} G_t - V_t(S_t) &= \sum_{k=t}^{T-1} \gamma^{k-t} (\delta_k + \gamma \alpha \delta_k \mathbb{1}_k) \\ &= \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k (1 + \gamma \alpha \mathbb{1}_k) \\ \end{flalign}$ where the first term is the value from the original derivation and the second term is only non-zero when a state appears twice consecutively in an episode. """metadatashow_logsèdisabled®skip_as_script«code_folded$0c0b875e-69f8-46ed-ad06-df9c36088fbecell_id$0c0b875e-69f8-46ed-ad06-df9c36088fbecodeconst gridsize = 3metadatashow_logsèdisabled®skip_as_script«code_folded$8d05403a-adeb-40ac-a98a-87586d5a5170cell_id$8d05403a-adeb-40ac-a98a-87586d5a5170code*md""" ### Example 6.5: Windy Gridworld """metadatashow_logsèdisabled®skip_as_script«code_folded$44c49006-e210-4f97-916e-fe62f36c593fcell_id$44c49006-e210-4f97-916e-fe62f36c593fcodeCmd""" ## 6.5 Q-learning: Off-policy TD Control One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as *Q-learning* (Watkins, 1989), defined by $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \text{max}_a Q(S_{t+1}, a) - Q(S_t, A_t)]$ """metadatashow_logsèdisabled®skip_as_script«code_folded$0ad739c9-8aca-4b82-bf20-c73584d29535cell_id$0ad739c9-8aca-4b82-bf20-c73584d29535codejmd""" > ### *Exercise 6.9 Windy Gridworld with King's Moves (programming)* > Re-solve the windy gridworld assuming eight possible actions, including the diagonal moves, rather than four. How much better can you do with the extra actions? Can you do even better by including a ninth action that causes no movement at all other than that caused by the wind? 
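
One way to set up the action sets for this exercise is sketched below. It is a standalone illustration that represents moves as coordinate offsets rather than using the notebook's `GridworldAction` types; the names `king_moves`, `king_moves_with_stay`, and `windy_step` are hypothetical, and the grid height of 7 is an assumption taken from the standard windy gridworld layout.

```julia
# Eight king's moves as (dx, dy) offsets, excluding the zero move.
const king_moves = [(dx, dy) for dx in -1:1 for dy in -1:1 if (dx, dy) != (0, 0)]

# Ninth action: no movement other than that caused by the wind.
const king_moves_with_stay = vcat(king_moves, [(0, 0)])

# Apply a move and the upward wind of the column being left, clamping to the grid.
# ymax = 7 is an assumed grid height, not a value stated in this section.
function windy_step(x, y, move, winds; xmax = length(winds), ymax = 7)
    w = winds[x]
    x2 = clamp(x + move[1], 1, xmax)
    y2 = clamp(y + move[2] + w, 1, ymax)
    return (x2, y2)
end

example_winds = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
windy_step(7, 4, (0, 0), example_winds)
```

In this sketch the ninth action lets the agent drift with the wind alone, which is the case the exercise asks about.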
"""metadatashow_logsèdisabled®skip_as_script«code_folded$0748902c-ffc0-4634-9a1b-e642b3dfb77bcell_id$0748902c-ffc0-4634-9a1b-e642b3dfb77bcodeR#forms a random policy for a generic finite state mdp. The policy is a matrix where the rows represent actions and the columns represent states. Each column is a probability distribution of actions over that state. form_random_policy(mdp::CompleteMDP{T}) where T = ones(T, length(mdp.actions), length(mdp.states)) ./ length(mdp.actions)metadatashow_logsèdisabled®skip_as_script«code_folded$6a1503c6-c77b-4e3a-9f07-74b2af1a5ff7cell_id$6a1503c6-c77b-4e3a-9f07-74b2af1a5ff7code"md""" ### Sarsa Implementation """metadatashow_logsèdisabled®skip_as_script«code_folded$292d9018-b550-4278-a8e0-78dd6a6853f1cell_id$292d9018-b550-4278-a8e0-78dd6a6853f1codefunction expected_sarsa(mdp::MDP_TD{S, A, F, G, H}, α::T, γ::T; num_episodes = 1000, qinit = zero(T), ϵinit = one(T)/10, Qinit = initialize_state_action_value(mdp; qinit=qinit), πinit = create_ϵ_greedy_policy(Qinit, ϵinit), update_policy! = (v, ϵ, s) -> make_ϵ_greedy_policy!(v, ϵ), decay_ϵ = false, save_history = false, save_state = first(mdp.states)) where {S, A, F, G, H, T<:AbstractFloat} terminds = findall(mdp.isterm(s) for s in mdp.states) Q = copy(Qinit) Q[:, terminds] .= zero(T) π = copy(πinit) vhold = zeros(T, length(mdp.actions)) #keep track of rewards and steps per episode as a proxy for training speed rewards = zeros(T, num_episodes) steps = zeros(Int64, num_episodes) if save_history action_history = Vector{A}(undef, num_episodes) end for ep in 1:num_episodes ϵ = decay_ϵ ? 
ϵinit/ep : ϵinit s = mdp.state_init() rtot = zero(T) l = 0 while !mdp.isterm(s) (i_s, i_s′, r, s′, a, i_a) = takestep(mdp, π, s) if save_history && (s == save_state) action_history[ep] = a end q_expected = sum(π[i, i_s′]*Q[i, i_s′] for i in eachindex(mdp.actions)) Q[i_a, i_s] += α*(r + γ*q_expected - Q[i_a, i_s]) #update terms for next step vhold .= Q[:, i_s] update_policy!(vhold, ϵ, s) π[:, i_s] .= vhold s = s′ l+=1 rtot += r end steps[ep] = l rewards[ep] = rtot end base_return = (Q, π, steps, rewards) save_history && return (base_return..., action_history) return base_return endmetadatashow_logsèdisabled®skip_as_script«code_folded$07c57f37-22be-4c39-8279-d80addcea0c5cell_id$07c57f37-22be-4c39-8279-d80addcea0c5codefunction create_stochastic_gridworld_mdp(width, height, start, goal, wind, actions, step_reward) mdp = make_windy_gridworld(;actions = actions, apply_wind = apply_wind, sterm = goal, start = start, xmax = width, ymax = height, winds = wind_vals, get_step_reward = () -> step_reward) ptf = zeros(Float32, length(mdp.states), 2, length(mdp.actions), length(mdp.states)) for s in mdp.states i_s = mdp.statelookup[s] if mdp.isterm(s) ptf[i_s, 1, :, i_s] .= 1.0f0 else for a in mdp.actions w = wind[s.x] (r, s′) = mdp.step(s, a) i_a = mdp.actionlookup[a] i_s = mdp.statelookup[s] i_s′ = mdp.statelookup[s′] if w == 0 ptf[i_s′, 2, i_a, i_s] = 1.0f0 else #with stochastic wind split the probabilities between the possible outcomes ptf[i_s′, 2, i_a, i_s] += Float32(1/3) s′2 = GridworldState(s′.x, min(height, s′.y + 1)) i_s′2 = mdp.statelookup[s′2] ptf[i_s′2, 2, i_a, i_s] += Float32(1/3) s′3 = GridworldState(s′.x, max(1, s′.y - 1)) i_s′3 = mdp.statelookup[s′3] ptf[i_s′3, 2, i_a, i_s] += Float32(1/3) end end end end FiniteMDP(mdp.states, mdp.actions, [0.0f0, step_reward], ptf) endmetadatashow_logsèdisabled®skip_as_script«code_folded$b5187232-d808-49b6-9f7e-a4cbeb6c2b3ecell_id$b5187232-d808-49b6-9f7e-a4cbeb6c2b3ecode'md""" ### Example 6.1: Driving Home 
"""metadatashow_logsèdisabled®skip_as_script«code_folded$54d97122-2d01-46ec-aafe-00bfc9f2d6d1cell_id$54d97122-2d01-46ec-aafe-00bfc9f2d6d1code[md""" Step: $(min(length(first(mrp_trajectory)), t)) / $(length(first(mrp_trajectory))) """metadatashow_logsèdisabled®skip_as_script«code_folded$926ec37d-b969-4dc9-99b2-a6b29c6d880ccell_id$926ec37d-b969-4dc9-99b2-a6b29c6d880ccodemd""" #### Figure 6.5: """metadatashow_logsèdisabled®skip_as_script«code_folded$c360945e-f8b2-4c6f-a70c-6ab4ddcf5b54cell_id$c360945e-f8b2-4c6f-a70c-6ab4ddcf5b54codeپmd""" By changing the initialization to 0, the RMS error monotonically converges to the minimum since the state values never pass through the correct values on their way to overshooting. """metadatashow_logsèdisabled®skip_as_script«code_folded$573a9919-bd7e-4a56-b830-4e40e91288efcell_id$573a9919-bd7e-4a56-b830-4e40e91288efcode7md""" Let $X = \{ X_1, \dots, X_M \}$ be a set of random variables and let $\mu^A = \{\mu_1^A, \dots, \mu_M^A \}$ and $\mu^B = \{\mu_1^B, \dots, \mu_M^B\}$ be two sets of unbiased estimators such that $\mathbb{E} \{ \mu_i^A \} = \mathbb{E} \{ \mu_i^B \} = \mathbb{E} \{ X_i \}$ for all $i$. Let $$\mathcal{M} \doteq \left \{ j \mid \mathbb{E} \{ X_j \} = \max_i \mathbb{E} \{ X_i \} \right \}$$ be the set of labels of estimators that maximize the expcted values of $X$. Let $a^*$ be an element that maximizes $\mu^A:\mu_{a^*}^A = \max_i \mu_i^A$. The claim is that: $$\mathbb{E} \{ \mu_{a^*}^B \} = \mathbb{E} \{ X_{a^*} \} \leq \max_i \mathbb{E} \{ X_i \}$$. Furthermore, the inequality is strict if and only if $P(a^* \notin \mathcal{M}) \gt 0$. *Proof*. Assume $a^* \in \mathcal{M}$. Then $\mathbb{E} \{ \mu_{a^*}^B\} = \mathbb{E} \{ X_{a^*}\} \doteq \max_i \mathbb{E} \{ X_i \}$. Now assume $a^* \notin \mathcal{M}$ and choose $j \in \mathcal{M}$. Then $\mathbb{E} \{ \mu_{a^*} \} = \mathbb{E} \{ X_{a^*}\} \lt \mathbb{E} \{ X_j \} \doteq \max_i \mathbb{E} \{ X_i \}$. 
These two possibilities are mutually exclusive, so the combined expression can be written as: $$\begin{flalign} \mathbb{E} \{ \mu_{a^*}^B \} &= P(a^* \in \mathcal{M}) \mathbb{E} \{ \mu_{a^*}^B \vert a^* \in \mathcal{M} \} + P(a^* \notin \mathcal{M}) \mathbb{E} \{ \mu_{a^*}^B \vert a^* \notin \mathcal{M} \} \\ &= P(a^* \in \mathcal{M}) \max_i \mathbb{E} \{X_i \} + P(a^* \notin \mathcal{M}) \mathbb{E} \{ \mu_{a^*}^B \vert a^* \notin \mathcal{M} \} \\ &\leq P(a^* \in \mathcal{M}) \max_i \mathbb{E} \{X_i \} + P(a^* \notin \mathcal{M}) \max_i \mathbb{E} \{ X_i \} \\ &=\max_i \mathbb{E} \{ X_i \} \end{flalign}$$ The inequality is strict only if $P(a^* \notin \mathcal{M}) \gt 0$ where $\mathcal{M}$ is the true set of maximizing variables. This happens when variables have different expected values, but their distributions overlap. In contrast with the simple estimator, the double estimator is unbiased when the variables are iid, since then all expected values are equal and $P(a^* \in \mathcal{M}) = 1$. """metadatashow_logsèdisabled®skip_as_script«code_folded$4556cf44-4a1c-4ca4-bfb8-4841301a2ce6cell_id$4556cf44-4a1c-4ca4-bfb8-4841301a2ce6codeVfunction display_rook_policy(v::Vector{T}; scale = 1.0) where T<:AbstractFloat @htl("""
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$bb085f2e-83cb-45b2-adf6-c07da892d6e1cell_id$bb085f2e-83cb-45b2-adf6-c07da892d6e1codebegin car_results = begin_value_iteration_v(jacks_car_mdp, 0.9f0; θ = 0.0001f0) π_car, v_car = makepolicyvalueplots(jacks_car_mdp, car_results[1][end], car_results[2], length(car_results[1])) md""" ### Value Iteration Results for Jack's Car Rental $([π_car v_car]) """ endmetadatashow_logsèdisabled®skip_as_script«code_folded$e9359ca3-4d11-4365-bc6e-7babc6fcc7decell_id$e9359ca3-4d11-4365-bc6e-7babc6fcc7decodeJbegin struct Stay <: GridworldAction end move(::Stay, x, y) = (x, y) endmetadatashow_logsèdisabled®skip_as_script«code_folded$639840dc-976a-4e5c-987f-a92afb2d99d8cell_id$639840dc-976a-4e5c-987f-a92afb2d99d8codeٲbegin using StatsBase, Statistics, PlutoUI, HypertextLiteral, LaTeXStrings, PlutoPlotly, Base.Threads, LinearAlgebra, Serialization, Latexify, Transducers TableOfContents() endmetadatashow_logsèdisabled®skip_as_script«code_folded$dd167494-99d6-45c6-99e4-c36fde5e2d3fcell_id$dd167494-99d6-45c6-99e4-c36fde5e2d3fcode#md""" ## Jack's Car Rental Code """metadatashow_logsèdisabled®skip_as_script«code_folded$ab331778-f892-4690-8bb3-26464e3fc05fcell_id$ab331778-f892-4690-8bb3-26464e3fc05fcode.const windy_gridworld = make_windy_gridworld()metadatashow_logsèdisabled®skip_as_script«code_folded$0e59e813-3d48-4a24-b5b3-9a9de7c500c2cell_id$0e59e813-3d48-4a24-b5b3-9a9de7c500c2code }md""" > ### *Exercise 6.7* > Design an off-policy version of the TD(0) update that can be used with arbitrary target policy $\pi$ and covering behavior policy $b$, using at each step $t$ the importance sampling ratio $\rho_{t:t}$ (5.3). Recall that equation 5.3 defines: $\rho_{t:T-1} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$ with the property that: $\mathbb{E}[\rho_{t:T-1}G_t \mid S_t = s] = v_\pi(s)$ when $G_t$ is generated by the behavior policy. 
The TD(0) update rule is given by: $V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$ based on the following form of the Bellman equation: $v_\pi (s)=\text{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$ In the off-policy case, the reward $R_{t+1}$ and the subsequent state $S_{t+1}$ would be generated from the behavior policy, but the subsequent value would still be based on the target policy value function. Consider instead the quantity: $q_\pi(s, a) = \mathbb{E} [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = a]$ where we have removed the policy from the expectation since nothing in the bracket depends on sampling from the policy. Even if we choose actions $a$ based on a behavior policy that differs from the target policy, these estimates will be correct because we are directly calculating the value for choosing that action, regardless of what the probability is. Consider that we are following some behavior policy $b$ and recall that: $\begin{flalign} v_\pi(s) &= \sum_a \pi(a \vert s) q_\pi (s, a) \\ &= \sum_a \pi(a \vert s) \mathbb{E} [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = a]\\ &= \mathbb{E}_\pi [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s]\\ v_b(s) &= \sum_a b(a \vert s) q_\pi (s, a) \\ &= \sum_a b(a \vert s) \mathbb{E} [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = a] \\ &= \mathbb{E}_b [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s]\\ \end{flalign}$ In the TD(0) update we do not calculate this expected value directly but instead average samples together that are drawn from the target policy. This sampling will produce samples weighted by the target policy probabilities, thus mimicking the expected value sum. If instead our samples are drawn from the behavior policy, then the samples will mimic the behavior policy probability weights instead of those of the target policy. 
So in order to correctly calculate the expected value we must multiply each behavior policy sample by $\frac{\pi(a \vert s)}{b(a \vert s)} = \frac{\pi(A_t \vert S_t)}{b(A_t \vert S_t)} = \rho_{t:t}$ resulting in the following update rule: $V(S_t) \leftarrow V(S_t) + \alpha [\rho_{t:t} \left ( R_{t+1} + \gamma V(S_{t+1}) \right ) - V(S_t)]$ """metadatashow_logsèdisabled®skip_as_script«code_folded$e4c6456c-867d-4ade-a3c8-310c1e065f14cell_id$e4c6456c-867d-4ade-a3c8-310c1e065f14coderender_walk("eg1", l = nstates)metadatashow_logsèdisabled®skip_as_script«code_folded$3e767962-7339-4f35-a039-b5521a098ed5cell_id$3e767962-7339-4f35-a039-b5521a098ed5codestruct MDP_TD{S, A, F<:Function, G<:Function, H<:Function} states::Vector{S} statelookup::Dict{S, Int64} actions::Vector{A} actionlookup::Dict{A, Int64} state_init::G #function that produces an initial state for an episode step::F #function that produces reward and updated state given a state action pair isterm::H #function that returns true if the state input is terminal function MDP_TD(states::Vector{S}, actions::Vector{A}, state_init::G, step::F, isterm::H) where {S, A, F<:Function, G<:Function, H<:Function} statelookup = makelookup(states) actionlookup = makelookup(actions) new{S, A, F, G, H}(states, statelookup, actions, actionlookup, state_init, step, isterm) end endmetadatashow_logsèdisabled®skip_as_script«code_folded$834e5810-77ea-4dfd-9f37-9d9dbf6585a4cell_id$834e5810-77ea-4dfd-9f37-9d9dbf6585a4code?makelookup(v::Vector) = Dict(x => i for (i, x) in enumerate(v))metadatashow_logsèdisabled®skip_as_script«code_folded$667666b9-3ab6-4836-953d-9878208103c9cell_id$667666b9-3ab6-4836-953d-9878208103c9code8gridworld_Q_vs_sarsa_vs_expected_sarsa_solve(cliffworld)metadatashow_logsèdisabled®skip_as_script«code_folded$87fadfc0-2cdb-4be2-81ad-e8fdeffb690ccell_id$87fadfc0-2cdb-4be2-81ad-e8fdeffb690ccodefunction show_mrp_state(id, states, rewards, index) reward = rewards[min(index, length(states))] state = states[min(index, 
length(states))] dir = reward== 0 ? "left" : "right" termcolor = if index >= length(states) """ #$id .term.$dir::before { background-color: red; } """ else """""" end activestate = collect('A':'Z')[state] HTML(""" """ ) endmetadatashow_logsèdisabled®skip_as_script«code_folded$4019c974-dcaa-46c8-ac90-e6566a376ea1cell_id$4019c974-dcaa-46c8-ac90-e6566a376ea1code6function begin_value_iteration_v(mdp::M, γ::T, V::Vector{T}; θ = eps(zero(T)), nmax=typemax(Int64)) where {T<:Real, M <: CompleteMDP{T}} valuelist = [copy(V)] value_iteration_v!(V, θ, mdp, γ, nmax, valuelist) π = form_random_policy(mdp) make_greedy_policy!(π, mdp, V, γ) return (valuelist, π) endmetadatashow_logsèdisabled®skip_as_script«code_folded$4d4577b5-3753-450d-a247-ebd8c3e8f799cell_id$4d4577b5-3753-450d-a247-ebd8c3e8f799code)function create_ϵ_greedy_policy(Q::Matrix{T}, ϵ::T; π = copy(Q), get_valid_inds = j -> 1:size(Q, 1)) where T<:Real vhold = zeros(T, size(Q, 1)) for j in 1:size(Q, 2) vhold .= Q[:, j] make_ϵ_greedy_policy!(vhold, ϵ; valid_inds = get_valid_inds(j)) π[:, j] .= vhold end return π endmetadatashow_logsèdisabled®skip_as_script«code_folded$e19db54c-4b3c-42d1-b016-9620daf89bfbcell_id$e19db54c-4b3c-42d1-b016-9620daf89bfbcodebegin abstract type GridworldAction end struct Up <: GridworldAction end struct Down <: GridworldAction end struct Left <: GridworldAction end struct Right <: GridworldAction end struct GridworldState x::Int64 y::Int64 end rook_actions = [Up(), Down(), Left(), Right()] move(::Up, x, y) = (x, y+1) move(::Down, x, y) = (x, y-1) move(::Left, x, y) = (x-1, y) move(::Right, x, y) = (x+1, y) apply_wind(w, x, y) = (x, y+w) const wind_vals = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0] endmetadatashow_logsèdisabled®skip_as_script«code_folded$ed4e863b-22dd-4d2b-88d0-b3a56d6713b7cell_id$ed4e863b-22dd-4d2b-88d0-b3a56d6713b7code٠example_6_5(;mdp = stochastic_gridworld, num_episodes = 400, action_display = king_action_display, policy_display = display_king_policy, 
use_stochastic_dp=true)metadatashow_logsèdisabled®skip_as_script«code_folded$393cd9d2-dd97-496e-b260-ec6e8b1c13b5cell_id$393cd9d2-dd97-496e-b260-ec6e8b1c13b5codeqbegin struct FiniteAfterstateMDP{T<:Real, S1, S2, A} <: CompleteMDP{T} states::Vector{S1} afterstates::Vector{S2} actions::Vector{A} rewards::Vector{T} #probability transition function now has probabilities for each state/reward transition from each afterstate ptf::Array{T, 3} #each column contains the index of the afterstate reached from the state represented by the column index while taking the action represented by the row index afterstate_map::Matrix{Int64} #each column contains the reward value received from the state represented by the column index while taking the action represented by the row index reward_interim_map::Matrix{T} state_index::Dict{S1, Int64} afterstate_index::Dict{S2, Int64} action_index::Dict{A, Int64} function FiniteAfterstateMDP{T, S1, S2, A}(states::Vector{S1}, afterstates::Vector{S2}, actions::Vector{A}, rewards::Vector{T}, ptf::Array{T, 3}, afterstate_map::Matrix{Int64}, reward_interim_map::Matrix{T}) where {T <: Real, S1, S2, A} new(states, afterstates, actions, rewards, ptf, afterstate_map, reward_interim_map, makelookup(states), makelookup(afterstates), makelookup(actions)) end end FiniteAfterstateMDP(states::Vector{S1}, afterstates::Vector{S2}, actions::Vector{A}, rewards::Vector{T}, ptf::Array{T, 3}, afterstate_map::Matrix{Int64}, reward_interim_map::Matrix{T}) where {T <: Real, S1, S2, A} = FiniteAfterstateMDP{T, S1, S2, A}(states, afterstates, actions, rewards, ptf, afterstate_map, reward_interim_map) #if a reward map is not provided, assume that there are no intermediate rewards FiniteAfterstateMDP(states::Vector{S1}, afterstates::Vector{S2}, actions::Vector{A}, rewards::Vector{T}, ptf::Array{T, 3}, afterstate_map::Matrix{Int64}) where {T <: Real, S1, S2, A} = FiniteAfterstateMDP{T, S1, S2, A}(states, afterstates, actions, rewards, ptf, afterstate_map, zeros(T, 
length(actions), length(states))) endmetadatashow_logsèdisabled®skip_as_script«code_folded$401831c3-3925-465c-a093-28686f0dad2ecell_id$401831c3-3925-465c-a093-28686f0dad2ecodesinitialize_state_value(mdp::MDP_TD; vinit::T = 0.0f0) where T<:AbstractFloat = ones(T, length(mdp.states)) .* vinitmetadatashow_logsèdisabled®skip_as_script«code_folded$2d881aa9-1da3-4d1e-8d05-245956dbaf33cell_id$2d881aa9-1da3-4d1e-8d05-245956dbaf33codeHTML(""" """)metadatashow_logsèdisabled®skip_as_script«code_folded$047a8881-c2ec-4dd1-8778-e3acf9beba2ecell_id$047a8881-c2ec-4dd1-8778-e3acf9beba2ecodeYmd""" #### Sarsa vs Q-learning vs Expected Sarsa Performance on Cliff Walking Example """metadatashow_logsèdisabled®skip_as_script«code_folded$29b0a2d5-9629-46cd-b57c-6f3ef797de66cell_id$29b0a2d5-9629-46cd-b57c-6f3ef797de66codemd""" ## 6.7 Maximization Bias and Double Learning All the control algorithms that we have discussed so far involve maximization in the construction of the target policies. For example, in Q-learning the target policy is the greedy policy given the current action values, which is defined with a max, and in Sarsa the policy is often $\epsilon$-greedy, which also involves a maximization operation. In these algorithms, a maximum over estimated values is used implicitly as an estimate of the maximum value, which can lead to significant positive bias. To see why, consider a single state $s$ where there are many actions $a$ whose true values, $q(s, a)$, are all zero, but whose estimated values, $Q(s, a)$, are uncertain and thus distributed some above and some below zero. The maximum of the true values is zero, but the maximum of the estimates is positive, a positive bias. We call this *maximization bias*. To elaborate on the bias, consider just two random variables $X \sim \mathcal{N}(\theta_1, 1)$ and $Y \sim \mathcal{N}(\theta_2, 1)$. 
We would like to estimate $\text{max} \left ( \mathbb{E}[X], \mathbb{E}[Y] \right ) = \text{max}(\theta_1, \theta_2)$ and using the approach analogous to our learning algorithms we would calculate $\max(\overline{X}, \overline{Y}) = \text{max} \left ( \sum_{i=1}^N \frac{x_i}{N}, \sum_{i=1}^M \frac{y_i}{M} \right )$. The problem with this approach is that for small numbers of samples, the variance of each estimator is high and we are using this estimator both to select which random variable has the higher expected value and to determine what that value is. Empirically, this results in a positive bias which gets worse the more variables we are considering as illustrated in the plot below. """metadatashow_logsèdisabled®skip_as_script«code_folded$c1d6532c-38a4-488f-9789-07d63fe6f125cell_id$c1d6532c-38a4-488f-9789-07d63fe6f125codeTmd""" Load Existing File if Present: $(@bind load_file CheckBox(default = true)) """metadatashow_logsèdisabled®skip_as_script«code_folded$e6672866-c0a0-46f2-bb52-25fcc3352645cell_id$e6672866-c0a0-46f2-bb52-25fcc3352645code )md""" > ### *Exercise 6.5* > In the right graph of the random walk example, the RMS error of the TD method seems to go down and then up again, particularly at high $\alpha$’s. What could have caused this? Do you think this always occurs, or might it be a function of how the approximate value function was initialized? Since the value function was initialized at the correct value for the center state, all of the values to the right must be increased and the values to the left must be decreased to reach the true values. Episodes that terminate to the right will receive a reward of 1 and push up the rightmost estimate while episodes that terminate to the left will receive a reward of 0 and decrease the leftmost estimate. The correct values for the leftmost and rightmost estimates are $\frac{1}{6}$ and $\frac{5}{6}$ respectively. 
Since there is an equal probability of exiting the walk on the right or the left, both ends of the value estimates will be updated at roughly the same rate. That means that both ends of the chain will move towards the correct value at about the same time and if those updates stay somewhat synchronized, all of the states will move through correct values at a similar time. At the time when the values are roughly accurate, what happens if $\alpha=0.15$? In this case, consider an update for state E assuming the estimate is already the correct value. $V(E) \leftarrow \frac{5}{6} + 0.15[1 - \frac{5}{6}] \approx 0.858 \gt \frac{5}{6}$. A similar effect happens with state A, pushing it below the correct value. The larger $\alpha$ is, the more over-correction we have on future transitions and the feedback from the other neighboring states won't be enough to bring it back to the correct value. Since we pass through or very close to the correct value on the way, we pass through a minimum error value before over- or undershooting the value estimate. If we had instead initialized the state values at 0, then the estimate at A would already be too low and would not get corrected until information from the right side propagated through. State E, however, will receive large updates for each episode that exits to the right, but the values for the states to its left will be too low. Since the state value estimates are not moving symmetrically, we won't have the same synchronized pass through the minimum error, since at the time the E estimate is correct, A will still have high error. In this case, we are more likely to see error continue to fall as more updates occur. Below is a visualization of the state estimates at different stages in the training with the original initialization and a 0 initialization. 
In the 0 case, you can see the left-side estimates take a long time to reach the correct value, but in the original initialization, all the estimates approach the correct values roughly together. """metadatashow_logsèdisabled®skip_as_script«code_folded$223055df-7d5c-4d99-bc8d-fbc9702f906fcell_id$223055df-7d5c-4d99-bc8d-fbc9702f906fcodemd""" ### Example 6.7: Maximization Bias Example Consider an MDP with two non-terminal states A and B. Episodes always start in state A and there are two actions, left and right. Choosing right will always result in a reward of 0 and the episode terminating. Choosing left will transition into state B from which there are many actions, all of which result in a terminal transition with random rewards. The distribution of rewards for each of these actions is $\mathcal{N}(-0.1, 1)$. The estimated value of (A, right) will always be 0 since that is the only possible sample to be collected. The estimated value of (A, left), however, will have higher variance but an expected value of -0.1. The problem with Q-learning is that, due to the maximization bias, (A, left) will have a higher value estimate when few samples have been collected since it is very likely that one of the state-action pairs from B will produce a reward greater than 0. The more of these actions exist, the worse the bias and the more samples need to be collected to remove it. If we employ Double Q-learning instead, however, we can eliminate the bias completely. """metadatashow_logsèdisabled®skip_as_script«code_folded$35dc0d94-145a-4292-b0df-9e84a286c036cell_id$35dc0d94-145a-4292-b0df-9e84a286c036codeJmd""" ## 6.8 Games, Afterstates, and Other Special Cases In the tic-tac-toe example we considered learning a value function for a state after the player's move but before the opponent's response. This type of state is called an *afterstate*, and it is useful in situations when we know a portion of the dynamics in an environment, but then a portion of it is stochastic or unknown. 
For example, we typically know the immediate effect of our moves, but not necessarily what happens after that. It can be more efficient to learn based on afterstates because there are fewer values to represent than if we need to learn the full action value function. Any state-action pair that maps to the same afterstate would be represented by a single value. These afterstate value functions can also be learned with generalized policy iteration. """metadatashow_logsèdisabled®skip_as_script«code_folded$4d7619ee-933f-452a-9202-e95a8f3da20fcell_id$4d7619ee-933f-452a-9202-e95a8f3da20fcodej@htl(""" Sarsa backup diagram. Black circles represent actions and white circles represent states.
""")metadatashow_logsèdisabled®skip_as_script«code_folded$00d67a93-437c-4cda-899a-9daa1102e1f2cell_id$00d67a93-437c-4cda-899a-9daa1102e1f2code[example_6_7_mdp(;num_episodes = 300, nruns = 10_000, num_actions = 10, load_file=load_file)metadatashow_logsèdisabled®skip_as_script«code_folded$500d8dd4-fc53-4021-b797-114224ca4debcell_id$500d8dd4-fc53-4021-b797-114224ca4debcodeqconst rook_action_display = @htl("""
Actions
""")metadatashow_logsèdisabled®skip_as_script«code_folded$ff5d051e-5de1-48a9-9578-5dbafd71afd1cell_id$ff5d051e-5de1-48a9-9578-5dbafd71afd1code|max_bias_visualization(;nvars_max = max_visual_params.nvars, nmax = max_visual_params.nmax, nruns = max_visual_params.nruns)metadatashow_logsèdisabled®skip_as_script«code_folded$e947f86e-8dc3-4ce7-a9d4-0a7b675a9fa9cell_id$e947f86e-8dc3-4ce7-a9d4-0a7b675a9fa9codex#the value function in this case represents the value of each afterstate. the afterstates are listed in mdp.afterstates while the states are listed in mdp.states begin_value_iteration_v(mdp::FiniteAfterstateMDP{T,S1, S2, A}, γ::T; Vinit::T = zero(T), kwargs...) where {T<:Real,S1,S2,A} = begin_value_iteration_v(mdp, γ, Vinit .* ones(T, length(mdp.afterstates)); kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$a925534e-f9b8-471a-9d86-c9212129b630cell_id$a925534e-f9b8-471a-9d86-c9212129b630code7md""" The following represents a trajectory taken by a policy in an environment. We seek to estimate $q_\pi(s, a)$ for the current behavior policy $\pi$ using the same TD method we introduced above. The update rule now, however, estimates the value of state-action pairs rather than the states themselves. 
"""metadatashow_logsèdisabled®skip_as_script«code_folded$7a5ff8f7-70d4-46f1-a4a7-bbfcec4f6e3fcell_id$7a5ff8f7-70d4-46f1-a4a7-bbfcec4f6e3fcodeكfunction sample_action(π::Matrix{T}, i_s::Integer) where T<:AbstractFloat (n, m) = size(π) sample(1:n, weights(π[:, i_s])) endmetadatashow_logsèdisabled®skip_as_script«code_folded$b5e06f59-33b5-414e-9a81-43e8abd07aa3cell_id$b5e06f59-33b5-414e-9a81-43e8abd07aa3codeYmd""" Q-learning Solution $(show_gridworld_policy_value(noisy_gridworld, q_learning(noisy_gridworld, α_6_8, 1.0f0, num_episodes = 5_000); winds = fill(0, gridsize))) Double Q-learning Solution $(show_gridworld_policy_value(noisy_gridworld, double_q_learning(noisy_gridworld, α_6_8, 1.0f0, num_episodes = 1_000); winds = fill(0, gridsize))) """metadatashow_logsèdisabled®skip_as_script«code_folded$a0d2333f-e87b-4981-bb52-d436ec6481c1cell_id$a0d2333f-e87b-4981-bb52-d436ec6481c1code md""" Because TD(0) bases its update in part on an existing estimate, we say that it is a *bootstrapping* method, like DP. We know from Chapter 3 that $\begin{flalign} v_\pi(s) & \doteq \mathbb{E}_\pi[G_t \mid S_t = s] \tag{6.3}\\ &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \tag{from (3.9)}\\ &=\mathbb{E}_\pi[R_{t+1} + \gamma v_\pi (S_{t+1}) \mid S_t = s] \tag{6.4} \end{flalign}$ Roughly speaking, Monte Carlo methods use an estimate of (6.3) as a target whereas DP methods use an estimate of (6.4) as a target. The Monte Carlo target is an estimate because the expected value in (6.3) is not known; a sample return is used in place of the real expected return. The DP target is an estimate not because of the expected values, which are assumed to be completely provided by a model of the environment, but because $v_\pi(S_{t+1})$ is not known and the current estimate, $V(S_{t+1})$, is used instead. The TD target is an estimate for both reasons; it samples the expected values in (6.4) *and* it uses the current estimate $V$ instead of the true $v_\pi$. 
Thus, TD methods combine the sampling of Monte Carlo with the bootstrapping of DP. TD and Monte Carlo updates are both referred to as *sample updates* because they involve looking ahead to a sample successor state (or state-action pair). *Expected updates* used in DP methods use the complete distribution of all possible successor states rather than a single sample. Note that the quantity in the brackets in (6.2) is a sort of error, measuring the difference between the estimated value of $S_t$ and the better estimate $R_{t+1} + \gamma V(S_{t+1})$. This quantity is called the *TD error*: $\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \tag{6.5}$ The TD error depends on the subsequent state so it is not available until one step later. That is to say $\delta_t$ is not known until time $t+1$. Also note that if we do not update $V$ during an episode (as we do not in Monte Carlo methods), then the Monte Carlo error can be written as a discounted sum of TD errors: $\begin{flalign} G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) \tag{from (3.9)} \\ &=\delta_t + \gamma(G_{t+1} - V(S_{t+1})) \tag{a}\\ &=\delta_t + \gamma \left ( \delta_{t+1} + \gamma(G_{t+2} - V(S_{t+2})) \right ) \tag{using (a)}\\ &=\delta_t + \gamma \delta_{t+1} + \gamma^2 \left ( G_{t+2} - V(S_{t+2}) \right ) \\ &=\delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}(G_T - V(S_T)) \tag{applying (a) until termination}\\ &=\delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1}\delta_{T-1} + \gamma^{T-t}(0-0) \tag{definition of terminal state}\\ &=\sum_{k=t}^{T-1} \gamma^{k-t} \delta_k \tag{6.6} \end{flalign}$ This identity is not exact if $V$ is updated during the episode (as it is in TD(0)), but if the step size is small then it may still hold approximately. 
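
The identity (6.6) can also be checked numerically. The following is only a sketch under stated assumptions: the reward and value arrays are made-up numbers, `td_errors` and `discounted_returns` are hypothetical helper names that do not come from this notebook, and indices are 1-based so position `k` stands for time step $k-1$. With $V$ held fixed for the whole episode, the discounted sum of TD errors should match the Monte Carlo error at every step.

```julia
# Check G_t - V(S_t) == sum_k γ^(k-t) δ_k when V is not updated during the episode.
γ = 0.9
# V[k] is the fixed estimate for the k-th visited state; the last entry is the
# terminal state, whose value is 0 by definition.
V = [0.5, 0.2, 0.7, 0.1, 0.0]
# R[k] is the reward received on the transition out of the k-th visited state.
R = [1.0, 0.0, -1.0, 2.0]
nsteps = length(R)

# TD errors from (6.5)
td_errors(R, V, γ) = [R[k] + γ * V[k + 1] - V[k] for k in eachindex(R)]

# Returns computed backwards from the end of the episode
function discounted_returns(R, γ)
    G = zeros(length(R))
    g = 0.0
    for k in length(R):-1:1
        g = R[k] + γ * g
        G[k] = g
    end
    return G
end

δ = td_errors(R, V, γ)
G = discounted_returns(R, γ)
mc_error = [G[t] - V[t] for t in 1:nsteps]
td_sum = [sum(γ^(k - t) * δ[k] for k in t:nsteps) for t in 1:nsteps]
```

Comparing `mc_error` and `td_sum` elementwise should show agreement up to floating point rounding. Changing any entry of `V` between steps would break the equality, which is the situation Exercise 6.1 asks about.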
"""metadatashow_logsèdisabled®skip_as_script«code_folded$f841c4d8-5176-4007-b472-9e01a799d85ccell_id$f841c4d8-5176-4007-b472-9e01a799d85ccode4function addelements(e1, e2) """ $e1 $e2 """ endmetadatashow_logsèdisabled®skip_as_script«code_folded$685a7ba3-0f94-4663-a68a-73fa03bd9445cell_id$685a7ba3-0f94-4663-a68a-73fa03bd9445codefunction make_greedy_policy!(π::Matrix{T}, mdp::FiniteAfterstateMDP{T, S1, S2, A}, V::Vector{T}, γ::T) where {T<:Real,S1,S2,A} for i_s in eachindex(mdp.states) π[:, i_s] .= mdp.reward_interim_map[:, i_s] .+ V[mdp.afterstate_map[:, i_s]] maxv = -T(Inf) @inbounds @fastmath @simd for i_a in eachindex(mdp.actions) maxv = max(maxv, π[i_a, i_s]) end π[:, i_s] .= (π[:, i_s] .≈ maxv) x = zero(T) @fastmath @inbounds @simd for i_a in eachindex(mdp.actions) x += π[i_a, i_s] end π[:, i_s] ./= x end return π endmetadatashow_logsèdisabled®skip_as_script«code_folded$d5abd922-a8c2-4f5c-9a6e-d2490a8ad7dccell_id$d5abd922-a8c2-4f5c-9a6e-d2490a8ad7dccodeq#take a step in the environment from state s using policy π function takestep(mdp::MDP_TD{S, A, F, G, H}, π::Matrix{T}, s::S) where {S, A, F<:Function, G<:Function, H<:Function, T<:Real} i_s = mdp.statelookup[s] i_a = sample_action(π, i_s) a = mdp.actions[i_a] (r, s′) = mdp.step(s, a) i_s′ = mdp.statelookup[s′] return (i_s, i_s′, r, s′, a, i_a) endmetadatashow_logsèdisabled®skip_as_script«code_folded$bce6e4ab-58ec-4e00-be34-bc4caf51f57dcell_id$bce6e4ab-58ec-4e00-be34-bc4caf51f57dcode٥function cum_mean(v::AbstractVector{T}) where T<:Real out = zeros(length(v)) s = zero(T) for (i, x) in enumerate(v) s += x out[i] = s / i end return out endmetadatashow_logsèdisabled®skip_as_script«code_folded$4ddcd409-c31c-444c-8fcf-7cc45b68d93bcell_id$4ddcd409-c31c-444c-8fcf-7cc45b68d93bcodefunction make_mrp(;l = (5)) function step(s, a) x = s + rand(mrp_moves) r = Float32(floor(x / (l+1))) (r, mod(x, l+1)) #if a transition is terminal will return 0 end MDP_TD(collect(0:l), [1], () -> ceil(Int64, l/2), step, s -> s == 0) 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$c5d32889-634b-4b00-8ba7-0d1ecaf94f05cell_id$c5d32889-634b-4b00-8ba7-0d1ecaf94f05codeُinitialize_state_action_value(mdp::MDP_TD; qinit::T = 0.0f0) where T<:AbstractFloat = ones(T, length(mdp.actions), length(mdp.states)) .* qinitmetadatashow_logsèdisabled®skip_as_script«code_folded$3b16cbb7-f859-4871-9a63-8b40eb4191becell_id$3b16cbb7-f859-4871-9a63-8b40eb4191becodemd""" > ### *Exercise 6.1* > If $V$ changes during the episode, then (6.6) only holds approximately; what would the difference be between the two sides? Let $V_t$ denote the array of state values used at time $t$ in the TD error (6.5) and in the TD update (6.2). Redo the derivation above to determine the additional amount that must be added to the sum of TD errors in order to equal the Monte Carlo error. """metadatashow_logsèdisabled®skip_as_script«code_folded$902738c3-2f7b-49cb-8580-29359c857027cell_id$902738c3-2f7b-49cb-8580-29359c857027codeM@htl(""" """)metadatashow_logsèdisabled®skip_as_script«code_folded$c93ed1f2-3c38-4f68-8bf8-2cdf4e7bee34cell_id$c93ed1f2-3c38-4f68-8bf8-2cdf4e7bee34code{md""" Now we can rewrite the Monte Carlo error using (3.9) again and proceed with the derivation keeping track of the time index of the value estimates: $\begin{flalign} G_t - V_t(S_t) &= R_{t+1} + \gamma G_{t+1} - V_t(S_t) + \gamma V_{t}(S_{t+1}) - \gamma V_{t}(S_{t+1}) \tag{from (3.9)}\\ &= \delta_t + \gamma \left [ G_{t+1} - V_t(S_{t+1}) \right ] \\ &= \delta_t + \gamma \left [ G_{t+1} - V_{t+1}(S_{t+1}) + V_{t+1}(S_{t+1}) - V_t(S_{t+1}) \right ] \\ \end{flalign}$ Define the following $\eta_{t} \doteq V_{t+1}(S_{t+1}) - V_t(S_{t+1})$ which lets us rewrite the equation $G_t - V_t(S_t) = \delta_t + \gamma \eta_{t} + \gamma \left [ G_{t+1} - V_{t+1}(S_{t+1})\right ]$ Notice that the term in the brackets is equivalent to the left hand side but shifted forward one time step. 
That implies the equation can be expanded recursively as we did with the original derivation. """metadatashow_logsèdisabled®skip_as_script«code_folded$f36822d7-9ea8-4f5c-9925-dc2a466a68bacell_id$f36822d7-9ea8-4f5c-9925-dc2a466a68bacode%md""" # Dependencies and Settings """metadatashow_logsèdisabled®skip_as_script«code_folded$3e367811-247b-4bd6-b8fe-63f8996fb9e8cell_id$3e367811-247b-4bd6-b8fe-63f8996fb9e8code#md""" ### Formal Proof for Bias """metadatashow_logsèdisabled®skip_as_script«code_folded$7de9b6a4-49ce-4dc3-9d5b-cecfcb98bba1cell_id$7de9b6a4-49ce-4dc3-9d5b-cecfcb98bba1codeCconst jacks_car_afterstate_mdp = create_car_rental_afterstate_mdp()metadatashow_logsèdisabled®skip_as_script«code_folded$c4719c42-87aa-482a-95aa-a1492d42835dcell_id$c4719c42-87aa-482a-95aa-a1492d42835dcode#md""" #### Stochastic Gridworld """metadatashow_logsèdisabled®skip_as_script«code_folded$495f5606-0567-47ad-a266-d21320eecfc6cell_id$495f5606-0567-47ad-a266-d21320eecfc6codemd""" Monte Carlo nonstationary update rule for value function $V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)] \tag{6.1}$ where $G_t$ is the actual return following time $t$, and $\alpha$ is a constant step-size parameter. Call this method *constant-α MC*. The use of a constant step size α instead of the usual sample average is what makes this estimation method suitable for non-stationary problems. Because the value $G_t$ is required, this method requires waiting for the final results from the end of an episode. In contrast, TD methods need only wait for results from the following timestep to perform an update. The following is the simplest TD method update rule: $V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] \tag{6.2}$ where the update can be made immediately on transition to $S_{t+1}$ after receiving $R_{t+1}$. This TD method is called $TD(0)$, or *one-step TD*. See below for code implementing this. 
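
As a compact illustration of how (6.1) and (6.2) differ in *when* they can be applied, here is a minimal tabular sketch. It is not the notebook's `update_value!` implementation: the function names `td0_update!` and `mc_update!`, the integer state indexing, and the example episode are all assumptions made for this illustration.

```julia
# TD(0), rule (6.2): update V[s] right after observing reward r and next state s′.
function td0_update!(V::Vector{Float64}, s::Int, r::Float64, s′::Int, α::Float64, γ::Float64)
    V[s] += α * (r + γ * V[s′] - V[s])
    return V
end

# Constant-α MC, rule (6.1): wait for the episode to end, then move every
# visited state toward the return that followed it.
function mc_update!(V::Vector{Float64}, states::Vector{Int}, rewards::Vector{Float64}, α::Float64, γ::Float64)
    g = 0.0
    for k in length(states):-1:1
        g = rewards[k] + γ * g          # return following the k-th visited state
        V[states[k]] += α * (g - V[states[k]])
    end
    return V
end

# One short episode: state 2, then state 3, then the terminal state (index 4),
# with rewards 0 and then 1.
V_td = [0.5, 0.5, 0.5, 0.0]
td0_update!(V_td, 2, 0.0, 3, 0.1, 1.0)   # no change: the target equals V[2]
td0_update!(V_td, 3, 1.0, 4, 0.1, 1.0)   # V[3] moves toward the reward of 1

V_mc = [0.5, 0.5, 0.5, 0.0]
mc_update!(V_mc, [2, 3], [0.0, 1.0], 0.1, 1.0)   # both states move toward G = 1
```

In this episode TD(0) changes only the state adjacent to the terminal reward, while constant-α MC moves both visited states, since each sees the full return.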
"""metadatashow_logsèdisabled®skip_as_script«code_folded$0a4ed8c7-27ca-45cb-af15-70ddd86240fbcell_id$0a4ed8c7-27ca-45cb-af15-70ddd86240fbcode5md""" #### Batch Method Estimation Implementation """metadatashow_logsèdisabled®skip_as_script«code_folded$cdedd35e-52b8-40a5-938d-2d36f6f93217cell_id$cdedd35e-52b8-40a5-938d-2d36f6f93217codeconst king_action_display = @htl("""
Actions
""")metadatashow_logsèdisabled®skip_as_script«code_folded$3756a3f8-18e8-4d62-afa1-cfeb4183820ccell_id$3756a3f8-18e8-4d62-afa1-cfeb4183820ccode function double_expected_sarsa(mdp::MDP_TD{S, A, F, G, H}, α::T, γ::T; num_episodes = 1000, qinit = zero(T), ϵinit = one(T)/10, Qinit::Matrix{T} = initialize_state_action_value(mdp; qinit=qinit), decay_ϵ = false, target_policy_function! = (v, ϵ, s) -> make_ϵ_greedy_policy!(v, ϵ), behavior_policy_function! = (v, ϵ, s) -> make_ϵ_greedy_policy!(v, ϵ), πinit_target::Matrix{T} = create_ϵ_greedy_policy(Qinit, ϵinit), πinit_behavior::Matrix{T} = create_ϵ_greedy_policy(Qinit, ϵinit), save_state::S = first(mdp.states), save_history = false) where {S, A, F, G, H, T<:AbstractFloat} terminds = findall(mdp.isterm(s) for s in mdp.states) Q1 = copy(Qinit) Q2 = copy(Qinit) Q1[:, terminds] .= zero(T) Q2[:, terminds] .= zero(T) π_target1 = copy(πinit_target) π_target2 = copy(πinit_target) π_behavior = copy(πinit_behavior) vhold1 = zeros(T, length(mdp.actions)) vhold2 = zeros(T, length(mdp.actions)) vhold3 = zeros(T, length(mdp.actions)) #keep track of rewards and steps per episode as a proxy for training speed rewards = zeros(T, num_episodes) steps = zeros(Int64, num_episodes) if save_history action_history = Vector{A}(undef, num_episodes) end for ep in 1:num_episodes ϵ = decay_ϵ ? 
ϵinit/ep : ϵinit s = mdp.state_init() rtot = zero(T) l = 0 while !mdp.isterm(s) (i_s, i_s′, r, s′, a, i_a) = takestep(mdp, π_behavior, s) if save_history && (s == save_state) action_history[ep] = a end # q_expected = sum(π_target[i, i_s′]*(Q1[i, i_s′]*toggle + Q2[i, i_s′]*(1-toggle)) for i in eachindex(mdp.actions)) toggle = rand() < 0.5 q_expected = if toggle sum(π_target2[i, i_s′]*Q1[i, i_s′] for i in eachindex(mdp.actions)) else sum(π_target1[i, i_s′]*Q2[i, i_s′] for i in eachindex(mdp.actions)) end if toggle Q2[i_a, i_s] += α*(r + γ*q_expected - Q2[i_a, i_s]) else Q1[i_a, i_s] += α*(r + γ*q_expected - Q1[i_a, i_s]) end #update terms for next step if toggle vhold2 .= Q2[:, i_s] target_policy_function!(vhold2, ϵ, s) π_target2[:, i_s] .= vhold2 else vhold1 .= Q1[:, i_s] target_policy_function!(vhold1, ϵ, s) π_target1[:, i_s] .= vhold1 end vhold3 .= vhold1 .+ vhold2 behavior_policy_function!(vhold3, ϵ, s) π_behavior[:, i_s] .= vhold3 s = s′ l+=1 rtot += r end steps[ep] = l rewards[ep] = rtot end Q1 .+= Q2 Q1 ./= 2 plain_return = Q1, create_greedy_policy(Q1), steps, rewards save_history && return (plain_return..., action_history) return plain_return endmetadatashow_logsèdisabled®skip_as_script«code_folded$04a0be81-ee5f-4eeb-963a-ad930392d50bcell_id$04a0be81-ee5f-4eeb-963a-ad930392d50bcodeexample_6_5()metadatashow_logsèdisabled®skip_as_script«code_folded$136d1d96-b590-4f03-9e42-2337efc560cccell_id$136d1d96-b590-4f03-9e42-2337efc560cccodeHTML(""" """)metadatashow_logsèdisabled®skip_as_script«code_folded$6bffb08c-704a-4b7c-bfce-b3d099cf35c0cell_id$6bffb08c-704a-4b7c-bfce-b3d099cf35c0codefunction gridworld_Q_vs_sarsa_solve(mdp; α=0.5f0, ϵ=0.1f0, num_episodes = 500, nruns = 100) function addtuple(t1, t2) Tuple(t1[i] .+ t2[i] for i in eachindex(t1)) end sarsa_results = mapreduce(addtuple, 1:nruns) do _ sarsa(mdp, α, 1.0f0; num_episodes = num_episodes, ϵinit = ϵ) end qlearning_results = mapreduce(addtuple, 1:nruns) do _ q_learning(mdp, α, 1.0f0; num_episodes = 
num_episodes, ϵinit = ϵ) end # qlearning_results = [q_learning(mdp, α, 1.0f0; num_episodes = num_episodes, ϵinit = ϵ) for _ in 1:nruns] p1 = plot_path(mdp, create_greedy_policy(sarsa_results[1] ./ nruns); windtext = fill("", 12), xtitle = "", title = "Cliff Walking Sarsa Path") p2 = plot_path(mdp, qlearning_results[2] ./ nruns; windtext = fill("", 12), xtitle = "", title = "Cliff Walking Q Learning Path") traces = [scatter(x = 1:num_episodes, y = results[4] ./ nruns, name = name) for (results, name) in zip([sarsa_results, qlearning_results], ["Sarsa", "Q-learning"])] p3 = plot(traces, Layout(xaxis_title = "Episodes", yaxis = attr(title = "Sum of rewards during episode", range = [-100, -15]))) steptraces = [scatter(x = 1:num_episodes, y = results[3] ./ nruns, name = name) for (results, name) in zip([sarsa_results, qlearning_results], ["Sarsa", "Q-learning"])] p4 = plot(steptraces, Layout(xaxis_title = "Episodes", yaxis = attr(title = "Average steps per episode
during training", range = [0, 100]))) @htl("""
$p1 $p2
$p3 $p4 """) endmetadatashow_logsèdisabled®skip_as_script«code_folded$f95ceb98-f12e-4650-9ad3-0609b7ecd0f3cell_id$f95ceb98-f12e-4650-9ad3-0609b7ecd0f3codemd""" > ### *Exercise 6.14* > Describe how the task of Jack's Car Rental (Example 4.2) could be reformulated in terms of afterstates. Why, in terms of this specific task, would such a reformulation be likely to speed convergence? In the original problem the state is the number of cars at each location at the end of the day. The actions are the net numbers of cars moved between the two locations overnight. With an afterstate approach, the value function would only consider the number of cars after the movement is performed. This would be equivalent to valuing the state the following morning when customers begin to return and rent new cars. The random processes that occur the following day will have a good/bad outcome based on the cars available at each location at the start of the day. This approach would likely converge faster because we are only modeling the value of the state that is directly related to whether or not cars will be available. Similar to the tic-tac-toe example, many actions will result in the same afterstate, but equivalent afterstates should have the same value. See below for code that creates the car rental MDP and solves it using value iteration with afterstates. 
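
To make the "many actions, one afterstate" point concrete, the sketch below counts state-action pairs against distinct afterstates. It is independent of `create_car_rental_afterstate_mdp` and rests on assumptions: the limits of 20 cars per location and 5 cars moved per night are taken from the textbook description of Example 4.2, a move is treated as valid only if both lots stay within 0 to 20 cars, and the names `afterstate` and `isvalid_move` are hypothetical.

```julia
# Assumed task limits (textbook Example 4.2, not read from this notebook's code)
const MAX_CARS = 20   # maximum cars held at each location
const MAX_MOVE = 5    # maximum cars moved overnight

# A positive action a moves a cars from location 1 to location 2.
afterstate(c1::Int, c2::Int, a::Int) = (c1 - a, c2 + a)

# Simplifying assumption: a move is valid only if both lots stay within capacity.
function isvalid_move(c1::Int, c2::Int, a::Int)
    (n1, n2) = afterstate(c1, c2, a)
    return 0 <= n1 <= MAX_CARS && 0 <= n2 <= MAX_CARS
end

# Enumerate every valid (state, action) pair and the afterstates they reach.
pairs_sa = [(c1, c2, a) for c1 in 0:MAX_CARS for c2 in 0:MAX_CARS
            for a in -MAX_MOVE:MAX_MOVE if isvalid_move(c1, c2, a)]
afterstates = Set(afterstate(c1, c2, a) for (c1, c2, a) in pairs_sa)

n_pairs = length(pairs_sa)
n_afterstates = length(afterstates)
```

Under these assumptions there are 3701 valid state-action pairs but only 441 afterstates, so an afterstate value function has roughly an eighth as many entries to learn as an action-value table.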
"""metadatashow_logsèdisabled®skip_as_script«code_folded$8787a5fd-d0ab-46b5-a7df-e7bc103a7378cell_id$8787a5fd-d0ab-46b5-a7df-e7bc103a7378code|function value_iteration_v!(V, θ, mdp, γ, nmax, valuelist) nmax <= 0 && return valuelist #update value function delt = bellman_optimal_value!(V, mdp, γ) #add copy of value function to results list push!(valuelist, copy(V)) #halt when value function is no longer changing delt <= θ && return valuelist value_iteration_v!(V, θ, mdp, γ, nmax - 1, valuelist) endmetadatashow_logsèdisabled®skip_as_script«code_folded$03a06e10-f68a-403c-97bf-7a7627f2c5d6cell_id$03a06e10-f68a-403c-97bf-7a7627f2c5d6code md""" Hasselt, in his paper proposes an alternative **Double Estimator** to correct this bias in approximating $\max_i \mathbb{E} \{ X_i \}$ which uses two sets of estimators: $\mu^A = \{ \mu_1^A, \dots, \mu_M^A \}$ and $\mu^B = \{ \mu_1^B, \dots, \mu_M^B \}$. Both sets of estimators are updated with a subset of samples we draw, such that $S = S^A \cup S^B$ and $S^A \cap S^B = \emptyset$ and $\mu_i^A(S) = \frac{1}{\vert S_i^A \vert } \sum_{s \in S_i^A} s$ and $\mu_i^B(S) = \frac{1}{\vert S_i^B \vert } \sum_{s \in S_i^B} s$. Like the single estimator $\mu_i$, both $\mu_i^A$ and $\mu_i^B$ are unbiased if we assume that samples are split in a proper manner, for instance randomly over the two sets of estimators. Let $Max^A (S) \doteq \{ j \mid \mu_j^A (S) = \max_i \mu_i^A (S) \}$ be the set of maximal estimates in $\mu^A(S)$. Since $\mu^B$ is an independent, unbiased set of estimators, we have $\mathbb{E} \{ \mu_j^B \} = \mathbb{E} \{ X_j \}$ for all $j$, including all $j \in Max^A$. Let $a^*$ be an estimator that maximizes $\mu^A:\mu_{a^*}^A(S) \doteq \max_i \mu_i ^A (S)$. If there are multiple estimators that maximize $\mu^A$, we can for instance pick one at random. 
Then we can use $\mu_{a^*}^B$ as an estimate for $\max_i \mathbb{E} \{ \mu_i^B \}$ and therefore also for $\max_i \mathbb{E} \{ X_i \}$ and we obtain the approximation $$\max_i \mathbb{E} \{ X_i \} = \max_i \mathbb{E} \{ \mu_i^B \} \approx \mu_{a^*}^B \tag{e}$$ As we gain more samples the variance of the estimators decreases. In the limit, $\mu_i^A(S) = \mu_i^B(S) = \mathbb{E} \{ X_i \}$ for all $i$ and the approximation in $(e)$ converges to the correct result. Assume that the underlying PDFs are continuous. The probability $P(j = a^*)$ for any $j$ is then equal to the probability that all $i \neq j$ give lower estimates. Thus $\mu_j^A(S) = x$ is maximal for some value $x$ with probability $\prod_{i \neq j}^M P(\mu_i ^A \lt x)$. Integrating out $x$ gives $P(j = a^*) = \int_{-\infty}^\infty P(\mu_j^A = x) \prod_{i \neq j}^M P(\mu_i^A < x)dx \doteq \int_{-\infty}^\infty f_j^A(x) \prod_{i \neq j}^M F_i^A(x) dx$, where $f_i^A$ and $F_i^A$ are the PDF and CDF of $\mu_i^A$. The expected value of the approximation by the double estimator can thus be given by $$\sum_j^M P(j = a^*) \mathbb{E} \{ \mu_j^B \} = \sum_j^M \mathbb{E} \{ \mu_j ^B \} \int_{-\infty}^\infty f_j^A(x) \prod_{i \neq j} F_i^A(x)dx \tag{f}$$ For discrete PDFs the probability that two or more estimators are equal should be taken into account and the integrals should be replaced with sums. Comparing (f) to (c), we see the difference is that the double estimator uses $\mathbb{E} \{ \mu_j^B \}$ in place of $x$. The single estimator overestimates, because $x$ is within the integral and therefore correlates with the monotonically increasing product $\prod_{i \neq j} F_i^\mu(x)$. The double estimator underestimates because the probabilities $P(j = a^*)$ sum to one and therefore the approximation is a weighted average of unbiased expected values, which must be lower than or equal to the maximum expected value. 
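
The contrast between the two estimators can be seen in a small simulation. This is only a sketch: the function name `estimator_bias`, its default parameters, and the choice of standard normal variables are assumptions for illustration. Every $X_i$ has mean 0, so the true value of $\max_i \mathbb{E} \{ X_i \}$ is 0.

```julia
using Random, Statistics

# Average the single (max of means) and double estimators over many runs.
function estimator_bias(; M = 10, n = 20, nruns = 5_000, rng = MersenneTwister(1))
    single_total = 0.0
    double_total = 0.0
    half = n ÷ 2
    for _ in 1:nruns
        samples = randn(rng, n, M)                           # n samples of each of M variables
        μ = vec(mean(samples; dims = 1))                      # single estimator: all samples
        μA = vec(mean(samples[1:half, :]; dims = 1))          # first disjoint half
        μB = vec(mean(samples[half+1:end, :]; dims = 1))      # second disjoint half
        single_total += maximum(μ)
        double_total += μB[argmax(μA)]                        # select with A, evaluate with B
    end
    return (single = single_total / nruns, double = double_total / nruns)
end

bias = estimator_bias()
```

With these settings `bias.single` should come out clearly positive while `bias.double` stays near zero. Because all means are equal here, the weighted average in (f) coincides with the maximum; with unequal means the double estimator would generally fall below it.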
In the following lemma, which holds in both the discrete and the continuous case, we prove in general that the estimate $\mathbb{E} \{ \mu_{a^*}^B \}$ is not an unbiased estimate of $\max_i \mathbb{E} \{ X_i \}$. """metadatashow_logsèdisabled®skip_as_script«code_folded$0d6a11af-b146-4bbc-997e-a11b897269a7cell_id$0d6a11af-b146-4bbc-997e-a11b897269a7code,md""" ## 6.4 Sarsa: On-policy TD Control """metadatashow_logsèdisabled®skip_as_script«code_folded$72b4d8d5-464c-4561-8c69-28ef3f59630bcell_id$72b4d8d5-464c-4561-8c69-28ef3f59630bcode#update the value function with the MC method using a single episode function update_value!(V::Vector{T}, ::MC, α::T, γ::T, mdp::MDP_TD{S, A, F, G, H}, states::Vector{S}, actions::Vector{A}, rewards::Vector{T}) where {T<:AbstractFloat, S, A, F<:Function, G<:Function, H<:Function} l = length(states) g = zero(T) err = zero(T) for i in l:-1:1 g = γ*g + rewards[i] s = states[i] i_s = mdp.statelookup[s] v_old = V[i_s] v_new = v_old + α*(g-v_old) err = max(err, calc_error(v_old, v_new)) V[i_s] = v_new end return err endmetadatashow_logsèdisabled®skip_as_script«code_folded$47c2cbdd-f6db-4ce5-bae2-8141f30aacbccell_id$47c2cbdd-f6db-4ce5-bae2-8141f30aacbccodemd""" ### Example 6.2 Random Walk In this example we empirically compare the prediction abilities of TD(0) and constant-α MC when applied to the following Markov reward process: In this MRP the agent's actions are irrelevant as each step the state transition occurs either to the left or the right with equal probability. An episode ends when the transition terminates at the left or right side of the chain. If the agent exits to the right, it receives a reward of 1. All other transitions receive a reward of 0. Below is an animation of the agent randomly moving through an episode. Longer chains will have longer episode times on average, growing roughly quadratically with the length of the chain. Underneath the visualizations is the code. 
"""metadatashow_logsèdisabled®skip_as_script«code_folded$8224b808-5778-458b-b683-ea2603c82117cell_id$8224b808-5778-458b-b683-ea2603c82117code(md""" ### Example 6.6: Cliff Walking """metadatashow_logsèdisabled®skip_as_script«code_folded$c4919d14-8cba-43e6-9369-efc52bcb9b23cell_id$c4919d14-8cba-43e6-9369-efc52bcb9b23code#function make_greedy_policy!(π::Matrix{T}, mdp::FiniteMDP{T, S, A}, V::Vector{T}, γ::T) where {T<:Real,S,A} for i_s in eachindex(mdp.states) maxv = -Inf for i_a in eachindex(mdp.actions) x = zero(T) for i_r in eachindex(mdp.rewards) for i_s′ in eachindex(V) x += mdp.ptf[i_s′, i_r, i_a, i_s] * (mdp.rewards[i_r] + γ * V[i_s′]) end end maxv = max(maxv, x) π[i_a, i_s] = x end π[:, i_s] .= (π[:, i_s] .≈ maxv) π[:, i_s] ./= sum(π[i_a, i_s] for i_a in eachindex(mdp.actions)) end return π endmetadatashow_logsèdisabled®skip_as_script«code_folded$05664aaf-575b-4249-974c-d8a2e63f380acell_id$05664aaf-575b-4249-974c-d8a2e63f380acodemd""" > ### *Exercise 6.11* > Why is Q-learning considered an *off-policy* control method? If we compare to the on-policy update rule, the expected value being calculated at each state action pair should be: $Q_\pi(S_t, A_t) = \text{E}_\pi [R_{t+1} + \gamma Q_\pi(S_{t+1}, A_{t+1})]$ which we estimate with sampling. In Q-learning, the expected value being estimated is instead: $Q_\pi(S_t, A_t) = \text{E}_\pi [R_{t+1} + \gamma \text{max}_a Q_\pi(S_{t+1}, a)]$ Since the behavior policy being used to select the subsequent action taken from state $S_{t+1}$ is $\epsilon$-greedy, there is a probability that the next action will not match the maximizing action. So the Q-Learning update is computing the optimal greedy state-action value function rather than the optimal $\epsilon$-greedy value function of the behavior policy. Sarsa, in contrast follows the same policy and computes the value function which matches this policy, thus making it a true on-policy method. 
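
The difference between the two targets is easy to see for a single transition. The numbers below are made up for illustration and the variable names are hypothetical; they are not taken from the notebook's `sarsa` or `q_learning` functions.

```julia
# One-step targets for a single transition (illustrative numbers only).
γ = 1.0
r = -1.0
Q_next = [0.0, -2.0, 3.0, 1.0]   # Q(S′, a) for four actions in the next state
a_next = 2                        # exploratory action the ϵ-greedy behavior policy sampled

sarsa_target = r + γ * Q_next[a_next]        # follows the action actually taken
qlearning_target = r + γ * maximum(Q_next)   # follows the greedy action regardless
```

Whenever the sampled action is not the greedy one, the two targets disagree, so Q-learning is evaluating a policy other than the one generating the data.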
"""metadatashow_logsèdisabled®skip_as_script«code_folded$dda222ef-8178-40bb-bf20-d242924c4fabcell_id$dda222ef-8178-40bb-bf20-d242924c4fabcodeBconst king_gridworld = make_windy_gridworld(;actions=king_actions)metadatashow_logsèdisabled®skip_as_script«code_folded$48b557e3-e239-45e9-ab15-105bcca96492cell_id$48b557e3-e239-45e9-ab15-105bcca96492code md""" ## 6.3 Optimality of TD(0) Suppose there is available only a finite amount of experience, say 10 episodes or 100 time steps. In this case, a common approach with incremental learning methods is to present the experience repeatedly until the method converges upon an answer. Given an approximate value function $V$, the increments specified by (6.1) or (6.2) are computed for every time step $t$ at which a nonterminal state is visited, but the value function is changed only once, by the sum of all the increments. Then all the available experience is processed again with the new value function to produce a new overall increment, and so on, until the value function converges. We call this *batch updating* because updates are made only after processing each complete *batch* of training data. Under batch updating, TD(0) converges deterministically to a single answer independent of the step-size parameter, $\alpha$, as long as $\alpha$ is chosen to be sufficiently small. The constant-$\alpha$ MC method also converges deterministically under the same conditions, but to a different answer. Understanding these two answers will help us understand the difference between the two methods. Under normal updating the methods do not move all the way to their respective batch answers, but in some sense they take steps in these directions. Before trying to understand the two answers in general, for all possible tasks, we first look at a few examples. ### Example 6.3: Random walk under batch updating Batch-updating versions of TD(0) and constant-$\alpha$ MC were applied as follows to the random walk prediction example (Example 6.2). 
After each new episode, all episodes seen so far were treated as a batch. They were repeatedly presented to the algorithm, either TD(0) or constant-$\alpha$ MC, with $\alpha$ sufficiently small that the value function converged. The resulting value function was then compared with $v_\pi$, and the average root mean square error across the five states (and across 100 independent repetitions of the whole experiment) was plotted to obtain the learning curves shown in Figure 6.2. Note that the batch TD method was consistently better than the batch Monte Carlo method. Under batch training, constant-$\alpha$ MC converges to the values, $V(s)$, that are sample averages of the actual returns experienced after visiting each state $s$. These are optimal estimates in the sense that they minimize the mean square error from the actual returns in the training set. In this sense it is surprising that the batch TD method was able to perform better according to the root mean square error measure shown in Figure 6.2. How is it that batch TD was able to perform better than this optimal method? The answer is that the Monte Carlo method is optimal only in a limited way, and that TD is optimal in a way that is more relevant to predicting returns. Below is code implementing both batch methods in general for arbitrary MDPs. 
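
Before the general implementation, a tiny hand-made batch can make the two answers concrete. The sketch below is an assumption-laden illustration, not the notebook's batch code: the data set, the state numbering (1 = A, 2 = B, 3 = terminal), and the names `batch_estimate`, `td_increment`, and `mc_increment` are all invented here. Increments are summed over the whole batch, applied once, and the process repeats until the values stop changing.

```julia
# Each episode is a list of (state, reward, next_state) transitions.
episodes = vcat(
    [[(1, 0.0, 2), (2, 0.0, 3)]],      # A → B → end, all rewards 0
    [[(2, 1.0, 3)] for _ in 1:6],      # six episodes: B → end with reward 1
    [[(2, 0.0, 3)]],                   # one episode:  B → end with reward 0
)

# Batch updating: accumulate increments with V held fixed, then apply them together.
function batch_estimate(episodes, increment; α = 0.01, γ = 1.0, tol = 1e-10, maxiter = 100_000)
    V = zeros(3)
    for _ in 1:maxiter
        Δ = zeros(3)
        for ep in episodes
            g = 0.0
            for k in length(ep):-1:1
                (s, r, s′) = ep[k]
                g = r + γ * g                      # return following step k
                Δ[s] += α * increment(V, s, r, s′, g, γ)
            end
        end
        V .+= Δ
        maximum(abs, Δ) < tol && break
    end
    return V
end

td_increment(V, s, r, s′, g, γ) = r + γ * V[s′] - V[s]   # bracket of (6.2)
mc_increment(V, s, r, s′, g, γ) = g - V[s]               # bracket of (6.1)

V_td = batch_estimate(episodes, td_increment)
V_mc = batch_estimate(episodes, mc_increment)
```

Both methods agree that B is worth about 0.75, the average return seen from B. They differ on A: batch MC settles on 0, the only return observed from A, while batch TD settles on about 0.75 because A always led to B.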
"""metadatashow_logsèdisabled®skip_as_script«code_folded$846720cc-550a-4a3c-a80e-40b99671f4e2cell_id$846720cc-550a-4a3c-a80e-40b99671f4e2codeconst mrp_moves = [-1, 1]metadatashow_logsèdisabled®skip_as_script«code_folded$6556dafb-04fa-434c-868a-8d7bb7b5b196cell_id$6556dafb-04fa-434c-868a-8d7bb7b5b196code%function make_cliffworld(;actions = rook_actions, xmax = 12, ymax = 4, cliff_penalty::T = -100f0, step_reward::T = -1f0) where T<:AbstractFloat start = GridworldState(1, 1) sinit() = start isterm(s) = s == GridworldState(xmax, 1) states = [GridworldState(x, y) for x in 1:xmax for y in 1:ymax] boundstate(x::Int64, y::Int64) = (clamp(x, 1, xmax), clamp(y, 1, ymax)) function cliffcheck(s) safereturn = (step_reward, s) unsafereturn = (cliff_penalty, start) s.y > 1 && return safereturn (s.x == 1) && return safereturn (s.x == xmax) && return safereturn unsafereturn end function step(s::GridworldState, a::GridworldAction) (x1, y1) = move(a, s.x, s.y) (x2, y2) = boundstate(x1, y1) cliffcheck(GridworldState(x2, y2)) end MDP_TD(states, actions, sinit, step, isterm) end metadatashow_logsèdisabled®skip_as_script«code_folded$3f4f078a-9fc4-4b02-b499-a805fd5f1071cell_id$3f4f078a-9fc4-4b02-b499-a805fd5f1071codefunction max_bias_visualization_comp(;nvars = 2, nmax = 100, nruns = 10_000) nlist = collect(2:2:nmax) vars = [randn(nmax, nruns) for _ in 1:nvars] max_estimate = [begin mapreduce(j -> begin means1 = [mean(view(x, 1:2:n, j)) for x in vars] means2 = [mean(view(x, 2:2:n, j)) for x in vars] max1 = maximum(means1 .+ means2) / 2 max2 = (means2[argmax(means1)] + means1[argmax(means2)]) / 2 return (max1, max2) end, (a, b) -> (a[1]+b[1], a[2]+b[2]), 1:nruns) end for n in nlist] estimate1 = [a[1] for a in max_estimate] ./ (nruns .* nlist) estimate2 = [a[2] for a in max_estimate] ./ (nruns .* nlist) t1 = scatter(x = 2:2:nmax, y = estimate1, name = "Max of Means Estimate") t2 = scatter(x = 2:2:nmax, y = estimate2, name = "Double Max Estimate") plot([t1, t2], Layout(xaxis_title = "Number 
of Samples Per Variable", yaxis_title = "Estimate of Maximum Mean", title = "Maximization Bias for $nvars Variables with Zero Mean")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$75bfe913-8757-4789-b708-7d400c225218cell_id$75bfe913-8757-4789-b708-7d400c225218code@htl("""
$(plot_path(windy_gridworld))
$rook_action_display
""")metadatashow_logsèdisabled®skip_as_script«code_folded$fe2ebf39-4ab3-4aa8-abbd-23389eaf400ecell_id$fe2ebf39-4ab3-4aa8-abbd-23389eaf400ecodemd""" Sarsa converges with probability 1 to an optimal policy and action-value function, under the usual conditions on step sizes (2.7), as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy (which can be arranged, for example, with $\epsilon$-greedy policies by setting $\epsilon = 1/t$). Below is code that implements Sarsa using the $\epsilon$-greedy method for exploration. """metadatashow_logsèdisabled®skip_as_script«code_folded$98bec66e-d8f3-4d4d-b4ec-5838489164e5cell_id$98bec66e-d8f3-4d4d-b4ec-5838489164e5code:const noisy_gridworld = make_noisy_gridworld(l = gridsize)metadatashow_logsèdisabled®skip_as_script«code_folded$b59eacf8-7f78-4015-bf2c-66f89bf0e24ecell_id$b59eacf8-7f78-4015-bf2c-66f89bf0e24ecodemd""" > ### *Exercise 6.10: Stochastic Wind (programming)* > Re-solve the windy gridworld task with King's moves, assuming the effect of the wind, if there is any, is stochastic, sometimes varying by 1 from the mean values given for each column. That is, a third of the time you move exactly according to these values, as in the previous exercise, but also a third of the time you move one cell above that, and another third of the time you move one cell below that. For example, if you are one cell to the right of the goal and you move left, then one-third of the time you move one cell above the goal, one-third of the time you move two cells above the goal, and one-third of the time you move to the goal. 
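
One way to read the wind rule above is sketched here. It assumes that columns with zero mean wind stay calm (the "if there is any" clause) and that the shift is drawn uniformly from -1, 0, and +1; the function name `sample_wind` is hypothetical and is not part of the notebook's gridworld code.

```julia
using Random

# Sample the wind strength for one step in a column with the given mean wind.
function sample_wind(mean_wind::Int, rng::AbstractRNG = Random.default_rng())
    mean_wind == 0 && return 0            # assumed: calm columns are unaffected
    return mean_wind + rand(rng, -1:1)    # mean value shifted by -1, 0, or +1
end

rng = MersenneTwister(42)
calm = [sample_wind(0, rng) for _ in 1:100]
gusty = [sample_wind(1, rng) for _ in 1:1_000]
```

The sampled value would replace the fixed column wind when computing the next state, with the usual clamping to the grid boundaries applied afterwards.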
"""metadatashow_logsèdisabled®skip_as_script«code_folded$1ae30f5d-b25b-4dcb-800f-45c463641ec5cell_id$1ae30f5d-b25b-4dcb-800f-45c463641ec5codemd""" > ### *Exercise 6.8* > Show that an action-value version of (6.6) holds for the action-value form of the TD error $\delta_t=R_{t+1}+\gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$, again assuming that the values don't change from step to step. The derivation in (6.6) starts with the definition in (3.9): $G_t = R_{t+1} + \gamma G_{t+1}$ and derives the following: $\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ $G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k$ Now we have the action-value form of the TD error: $\delta_t \doteq R_{t+1}+\gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$ Let us transform (3.9) in a similar manner to derive the rule: $\begin{flalign} G_t - Q(S_t, A_t) &= R_{t+1} + \gamma G_{t+1} - Q(S_t, A_t) + \gamma Q(S_{t+1}, A_{t+1}) - \gamma Q(S_{t+1}, A_{t+1}) \\ &= \delta_t + \gamma (G_{t+1} - Q(S_{t+1}, A_{t+1})) \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 (G_{t+2} - Q(S_{t+2}, A_{t+2})) \tag{using recursion} \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+1} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}(G_T - Q(S_T, A_T)) \\ &= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+1} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}(0-0) \tag{terminal value} \\ &= \sum_{k=t}^{T-1}\gamma^{k-t}\delta_k \end{flalign}$ """metadatashow_logsèdisabled®skip_as_script«code_folded$7d3be915-9092-4261-8435-dd546a7db144cell_id$7d3be915-9092-4261-8435-dd546a7db144code٢function cum_max(v::AbstractVector{T}) where T<:Real out = similar(v) m = first(v) for (i, x) in enumerate(v) m = max(m, x) out[i] = m end return out endmetadatashow_logsèdisabled®skip_as_script«code_folded$71774d5f-7841-403f-bc6b-1a0cbbb72d6dcell_id$71774d5f-7841-403f-bc6b-1a0cbbb72d6dcodeهconst windy_gridworld_mdp_dp = create_gridworld_mdp(10, 7, GridworldState(1, 4), GridworldState(8, 4), wind_vals, rook_actions, 
-1.0f0)metadatashow_logsèdisabled®skip_as_script«code_folded$22c2213e-5b9b-410f-a0ef-8f1e3db3c532cell_id$22c2213e-5b9b-410f-a0ef-8f1e3db3c532codefexample_6_3(;l = params_6_2.l, max_episodes = params_6_2.ep, α = Float32(params_6_2.α), vinit=0.5f0)metadatashow_logsèdisabled®skip_as_script«code_folded$39470c74-e554-4f6c-919d-97bec1eec0f3cell_id$39470c74-e554-4f6c-919d-97bec1eec0f3codeٯmd""" Adding king's move actions, the optimal policy can finish in 7 steps vs 15 for the original actions. What happens after adding a 9th action that causes no movement? """metadatashow_logsèdisabled®skip_as_script«code_folded$9da5fd84-800d-4b3e-8627-e90ce8f20297cell_id$9da5fd84-800d-4b3e-8627-e90ce8f20297codefunction show_grid_policy(mdp, π, wind::Vector, display_function, name; action_display = king_action_display, scale = 1.0) width = maximum(s.x for s in mdp.states) height = maximum(s.y for s in mdp.states) start = mdp.state_init() termind = findfirst(mdp.isterm, mdp.states) sterm = mdp.states[termind] ngrid = width*height @htl("""
$(HTML(mapreduce(i -> """
$(display_function(π[:, i], scale =0.8))
""", *, eachindex(mdp.states))))
$(HTML(mapreduce(i -> """
$(wind[i])
""", *, 1:width)))
$(action_display)
Wind Values
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$415ea466-2038-48fe-9d24-39a90182f1ebcell_id$415ea466-2038-48fe-9d24-39a90182f1ebcodefunction monte_carlo_pred_V(π::Matrix{T}, mdp::MDP_TD{S, A, F, G, H}, α::T, γ::T; num_episodes::Integer = 1000, vinit::T = zero(T), V::Vector{T} = initialize_state_value(mdp; vinit=vinit), save_states = Vector{S}()) where {T <: AbstractFloat, S, A, F, G, H} check_policy(π, mdp) terminds = findall(mdp.isterm(s) for s in mdp.states) V[terminds] .= zero(T) #terminal state must always have 0 value v_saves = zeros(T, length(save_states), num_episodes+1) function updatesaves!(ep) for (i, s) in enumerate(save_states) i_s = mdp.statelookup[s] v_saves[i, ep] = V[i_s] end end updatesaves!(1) #there's no check here so this is equivalent to every-visit estimation function updateV!(states, actions, rewards; t = length(states), g = zero(T)) t = length(states) g = zero(T) for t = length(states):-1:1 #accumulate future discounted returns g = γ*g + rewards[t] i_s = mdp.statelookup[states[t]] i_a = mdp.actionlookup[actions[t]] V[i_s] += α*(g - V[i_s]) #update running average of V end end for j in 1:num_episodes (states, actions, rewards) = runepisode(mdp, π) #update value function for each trajectory updateV!(states, actions, rewards) updatesaves!(j+1) end return V, v_saves endmetadatashow_logsèdisabled®skip_as_script«code_folded$0e488135-49e5-4e71-83b1-05d8e61f0510cell_id$0e488135-49e5-4e71-83b1-05d8e61f0510codeٔconst kingplus_gridworld_mdp_dp = create_gridworld_mdp(10, 7, GridworldState(1, 4), GridworldState(8, 4), wind_vals, [king_actions; Stay()], -1.0f0)metadatashow_logsèdisabled®skip_as_script«code_folded$1f28280e-ba3b-4ca5-89e4-6ca4a90f5893cell_id$1f28280e-ba3b-4ca5-89e4-6ca4a90f5893codebegin car_afterstate_results = begin_value_iteration_v(jacks_car_afterstate_mdp, 0.9f0, θ = 0.0001f0) π_car_afterstate, v_car_afterstate = makepolicyvalueplots(jacks_car_afterstate_mdp, car_afterstate_results[1][end], car_afterstate_results[2], 
length(car_afterstate_results[1])) md""" ### Afterstate Value Iteration Results for Jack's Car Rental $([π_car_afterstate v_car_afterstate]) """ endmetadatashow_logsèdisabled®skip_as_script«code_folded$6d9ae541-cf8c-4687-9f0a-f008944657e3cell_id$6d9ae541-cf8c-4687-9f0a-f008944657e3codefunction figure_6_3(mdp; load_file=true) fname = "figure_6_3.bin" load_file && isfile(fname) && return deserialize(fname) αlist = 0.1f0:0.05f0:1.0f0 function generate_data(estimator, nep, nruns) out = zeros(length(αlist)) @threads for i in eachindex(αlist) rmean = mean(begin α = αlist[i] (Qstar, πstar, steps, rsum) = estimator(mdp, α, 1.0f0; num_episodes = nep, ϵinit = 0.1f0) mean(rsum) end for _ in 1:nruns) out[i] = rmean end return out end interim_data(estimator) = generate_data(estimator, 100, 50_000) asymp_data(estimator) = generate_data(estimator, 100_000, 10) estimators = [expected_sarsa, sarsa, q_learning] names = ["Expected Sarsa", "Sarsa", "Q-learning"] interim_traces = [scatter(x = αlist, y = interim_data(estimator), name = "Interim $name", mode = "lines+markers", line = attr(dash = "dash")) for (estimator, name) in zip(estimators, names)] asymp_traces = [scatter(x = αlist, y = asymp_data(estimator), name = "Asymptotic $name", mode = "lines+markers", line = attr(dash = "dot")) for (estimator, name) in zip(estimators, names)] p = plot([interim_traces; asymp_traces], Layout(xaxis_title = "α", yaxis_title = "Sum of rewards per episode", yaxis_range = [-150, 0])) serialize(fname, p) return p endmetadatashow_logsèdisabled®skip_as_script«code_folded$d4e39164-9833-4deb-84ca-22f49a1c33d8cell_id$d4e39164-9833-4deb-84ca-22f49a1c33d8codemd""" Reference equations: $\begin{flalign} V(S_t) &\leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] \tag{6.2} \\ \delta_t &\doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \tag{6.5} \end{flalign}$ Re-write equation (6.5) using the values known at time t. $V_t$ means the value function estimate at time $t$. 
$\delta_t \doteq R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)$ Now equation (6.2) becomes $V_{t+1}(S_t) = V_t(S_t) + \alpha \delta_t$ """metadatashow_logsèdisabled®skip_as_script«code_folded$f2115666-86ce-4c80-9eb7-490cc7a7715ccell_id$f2115666-86ce-4c80-9eb7-490cc7a7715ccode٤md""" With the original value initialization, the error passes through a minimum early on due to the symmetry of the value updates created by the initial value. """metadatashow_logsèdisabled®skip_as_script«code_folded$2155adfa-7a93-4960-950e-1b123da9eea4cell_id$2155adfa-7a93-4960-950e-1b123da9eea4codeking_actionsmetadatashow_logsèdisabled®skip_as_script«code_folded«notebook_id$9c6be96e-38f7-11f0-2d30-a71f02755abcin_temp_dir¨metadata