
One-step Actor-Critic Implementation


13.4 REINFORCE with Baseline

The policy gradient theorem (13.5) can be generalized to include a comparison of the action value to an arbitrary baseline b(s):

$$\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s)\sum_a\left( q_\pi(s,a)-b(s) \right ) \nabla\pi(a|s,\boldsymbol{\theta}) \tag{13.10}$$

The baseline can be any function, even a random variable, as long as it does not vary with $a$; the equation remains valid because the subtracted quantity is zero:

$$\sum_ab(s)\nabla\pi(a|s,\boldsymbol{\theta})=b(s)\nabla\sum_a\pi(a|s,\boldsymbol{\theta})=b(s)\nabla1=0$$
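
The identity above can be checked numerically. The following is a minimal sketch, assuming the soft-max policy with linear action preferences used elsewhere in these notes; the helper names (`softmax`, `policy_gradients`, `baseline_term`) and the feature values are hypothetical and chosen only for illustration.

```python
import math

def softmax(prefs):
    """Numerically stable soft-max over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradients(theta, x_sa):
    """Gradient of pi(a|s, theta) for every action a, using
    grad pi_a = pi_a * (x(s, a) - sum_k pi_k x(s, k))."""
    d = len(theta)
    prefs = [sum(t * xi for t, xi in zip(theta, x)) for x in x_sa]
    pi = softmax(prefs)
    mean_x = [sum(p * x[i] for p, x in zip(pi, x_sa)) for i in range(d)]
    return [[p * (x[i] - mean_x[i]) for i in range(d)] for p, x in zip(pi, x_sa)]

def baseline_term(theta, x_sa, b):
    """Compute sum_a b(s) * grad pi(a|s, theta) for a fixed state."""
    grads = policy_gradients(theta, x_sa)
    return [b * sum(g[i] for g in grads) for i in range(len(theta))]

# Hypothetical features x(s, a) for three actions in one state.
x_sa = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
term = baseline_term([0.3, -1.2], x_sa, b=5.0)
print(term)  # every component is zero up to floating-point rounding
```

Any value of `b` gives the same result, because the action probabilities always sum to one and so their gradients sum to zero.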

The policy gradient theorem with baseline (13.10) can be used to derive an update rule using similar steps as in the previous section. The update rule that we end up with is a new version of REINFORCE that includes a general baseline:

$$\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t+\alpha(G_t-b(S_t))\frac{\nabla\pi(A_t|S_t,\boldsymbol{\theta}_t)}{\pi(A_t|S_t,\boldsymbol{\theta}_t)} \tag{13.11}$$

Since the baseline could be uniformly zero, this is a strict generalization of REINFORCE. To have an effective baseline that depends on state we can use a state value estimate that is also updated with gradient steps: $\hat v(S_t, \mathbf{w})$. Using such an estimate we can revise the previous REINFORCE algorithm.
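
As an illustration of update (13.11) with a learned baseline, here is a minimal sketch of one episode of updates. It assumes a soft-max policy with linear action preferences and a linear state value estimate $\hat v(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$; the function name, the episode format, and the optional discount `gamma` (set to 1 to match 13.11 exactly) are assumptions made for this example and are not taken from the notebook's own implementation.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def reinforce_with_baseline_episode(theta, w, episode, alpha_theta, alpha_w, gamma=1.0):
    """Apply REINFORCE-with-baseline updates for one finished episode.

    episode is a list of steps (x_sa, x_s, action, reward) where
    x_sa[b] is the feature vector x(S_t, b) for each action b,
    x_s is the state feature vector used by the baseline, and
    reward is R_{t+1}.
    """
    theta, w = list(theta), list(w)
    # Returns G_t, accumulated backwards from the end of the episode.
    returns, G = [0.0] * len(episode), 0.0
    for t in reversed(range(len(episode))):
        G = episode[t][3] + gamma * G
        returns[t] = G
    for t, (x_sa, x_s, a, _) in enumerate(episode):
        delta = returns[t] - dot(w, x_s)  # G_t - v_hat(S_t, w)
        # Gradient step on the baseline weights.
        w = [wi + alpha_w * delta * xi for wi, xi in zip(w, x_s)]
        pi = softmax([dot(theta, x) for x in x_sa])
        # grad ln pi(A_t|S_t) = x(S_t, A_t) - sum_b pi(b|S_t) x(S_t, b)
        grad_ln_pi = [
            x_sa[a][i] - sum(p * x[i] for p, x in zip(pi, x_sa))
            for i in range(len(theta))
        ]
        theta = [
            th + alpha_theta * gamma**t * delta * g
            for th, g in zip(theta, grad_ln_pi)
        ]
    return theta, w

# One-step episode with two actions: action 0 was taken and earned reward 1.
episode = [([[1.0, 0.0], [0.0, 1.0]], [1.0], 0, 1.0)]
theta, w = reinforce_with_baseline_episode([0.0, 0.0], [0.0], episode, 0.1, 0.1)
print(theta, w)
```

In this small example the return exceeds the initial baseline of zero, so the preference for the chosen action rises and the baseline moves toward the observed return.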


Cart Pole Continuous Action MDP


$\lambda_\theta$: 0.05

$\lambda_\mathbf{w}$: 0.8


Actor-Critic Implementation for Continuous Action Spaces


Exercise 13.3

In Section 13.1 we considered policy parameterizations using the soft-max in action preferences (13.2) with linear action preferences (13.3). For this parameterization, prove that the eligibility vector is

$$\nabla \ln \pi(a|s, \boldsymbol{\theta}) = \mathbf{x}(s, a) - \sum_b \pi(b|s, \boldsymbol{\theta}) \mathbf{x}(s, b) \tag{13.9}$$

using the definitions and elementary calculus.


13.2 The Policy Gradient Theorem


α: 0.01

β: 0.01


Test Actor-Critic with Eligibility Traces


We previously derived an expression for the gradient of the policy itself in the case of linear action preferences:

$$\begin{flalign} h_a &= \boldsymbol{\theta}^\top \mathbf{x}(s, a) \\ \pi_a &= \frac{e^{h_a}}{\sum_k e^{h_k}} \\ \nabla(\pi_a)_i &= \pi_a \left ( \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \right) \end{flalign}$$

Applying the chain rule to the natural logarithm produces:

$$\nabla \left ( \ln f(\theta) \right) = \frac{\nabla f(\theta)}{f(\theta)} \implies \nabla \left ( \ln f(\theta) \right )_i = \frac{\nabla \left ( f(\theta) \right )_i}{f(\theta)}$$

Applying this to the above expression yields:

$$\begin{flalign} \nabla \left ( \ln \pi_a \right )_i &= \frac{\nabla \left ( \pi_a \right )_i}{\pi_a} \\ &= \frac{\pi_a \left ( \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \right)}{\pi_a} \\ &= \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \end{flalign}$$

which is the per-component version of the desired vector expression (13.9).
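
The result can be sanity-checked against a finite-difference gradient of $\ln \pi(a|s, \boldsymbol{\theta})$. The sketch below assumes nothing beyond the soft-max with linear preferences defined above; the function names, feature values, and step size `eps` are hypothetical and chosen only for illustration.

```python
import math

def softmax_policy(theta, x_sa):
    """pi(.|s, theta) for linear preferences h_a = theta . x(s, a)."""
    prefs = [sum(t * xi for t, xi in zip(theta, x)) for x in x_sa]
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def eligibility_vector(theta, x_sa, a):
    """Analytic form: x(s, a) - sum_b pi(b|s, theta) x(s, b)."""
    pi = softmax_policy(theta, x_sa)
    return [
        x_sa[a][i] - sum(p * x[i] for p, x in zip(pi, x_sa))
        for i in range(len(theta))
    ]

def numeric_grad_ln_pi(theta, x_sa, a, eps=1e-6):
    """Central finite-difference estimate of grad ln pi(a|s, theta)."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        diff = math.log(softmax_policy(up, x_sa)[a]) - math.log(softmax_policy(down, x_sa)[a])
        grad.append(diff / (2 * eps))
    return grad

# Hypothetical feature vectors x(s, a) for three actions.
theta = [0.2, -0.4, 0.1]
x_sa = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.5, 0.5, 0.0]]
analytic = eligibility_vector(theta, x_sa, a=1)
numeric = numeric_grad_ln_pi(theta, x_sa, a=1)
print(max(abs(u - v) for u, v in zip(analytic, numeric)))  # should be tiny
```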


The final expected-value expression (13.5) can be sampled on a step-by-step basis during an episode, since at each step we have access to both the step count and an unbiased sample of the state-action value.


Normal Distribution: $f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{(x - \mu)^2}{2 \sigma^2}}$

Consider a new random variable $Y = \tanh(X)$ where $X \sim N(\mu, \sigma^2)$. Using the change of variables theorem from probability theory we can compute the density function of $Y$:

$$f_Y(y) = f_X (g^{-1}(y)) \cdot \left \vert \frac{d}{dy} g^{-1}(y) \right \vert$$

where $g(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, so $g^{-1}(y) = \tanh^{-1}(y)$ and $\frac{d}{dy} \tanh^{-1}(y) = \frac{1}{1 - y^2}$. For $y \in (-1, 1)$ this gives

$$f_Y(y) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{\left (\tanh^{-1}(y) - \mu \right )^2}{2 \sigma^2}} \left \vert \frac{1}{1 - y^2} \right \vert$$
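
A quick numerical check of this density is sketched below, under the assumption that $X$ is normal with mean $\mu$ and standard deviation $\sigma$. It verifies that $f_Y$ integrates to one over $(-1, 1)$ and that $P(Y \le \tanh(c))$ matches the normal CDF at $c$. The function names and the simple midpoint integration rule are choices made for this example only.

```python
import math

def squashed_gaussian_pdf(y, mu=0.0, sigma=1.0):
    """Density of Y = tanh(X) with X ~ N(mu, sigma^2); zero outside (-1, 1)."""
    if not -1.0 < y < 1.0:
        return 0.0
    x = math.atanh(y)
    normal = math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)
    return normal / (1.0 - y * y)

def midpoint_integral(f, a, b, n):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

total_mass = midpoint_integral(squashed_gaussian_pdf, -1.0, 1.0, 100_000)
c = 0.5
mass_below = midpoint_integral(squashed_gaussian_pdf, -1.0, math.tanh(c), 100_000)
print(total_mass)                   # close to 1
print(mass_below, normal_cdf(c))    # the two values should nearly agree
```

The second comparison works because $\tanh$ is strictly increasing, so the event $Y \le \tanh(c)$ is the same as $X \le c$.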


Eligibility Vector for General Soft-Max and State Feature Vector

mimetext/htmlrootassigneelast_run_timestampA 􈦤persist_js_state·has_pluto_hook_features§cell_id$e3a2fb12-37ce-4c23-ad93-5fc89991aabbdepends_on_disabled_cells§runtimeqpublished_object_keysdepends_on_skipped_cells§errored$e5c1aca8-7575-4835-8273-e69ca0a55fe8queued¤logsrunning¦outputbody;corridor_parameter_studies (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA &persist_js_state·has_pluto_hook_features§cell_id$e5c1aca8-7575-4835-8273-e69ca0a55fe8depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cellsçerrored$44b32cc0-36a8-41fd-89bc-ce894536926cqueued¤logsrunning¦outputbodyelementsaction_probabilitiesprefixFloat32elements0.401392text/plain0.598608text/plaintypeArrayprefix_shortobjectidbf6dd01b0c41893f!application/vnd.pluto.tree+objectstate_value_estimate-9.63535text/plaintypeNamedTupleobjectidd3a6d0a62b65b7a5mime!application/vnd.pluto.tree+objectrootassigneelast_run_timestampA &Ypersist_js_state·has_pluto_hook_features§cell_id$44b32cc0-36a8-41fd-89bc-ce894536926cdepends_on_disabled_cells§runtimeFֵpublished_object_keysdepends_on_skipped_cellsçerrored$646bc853-b7fc-49fa-a201-ff98e8f952d4queued¤logsrunning¦outputbodyupdate_traces_with_gradient! 
(generic function with 6 methods)mimetext/plainrootassigneelast_run_timestampA B)persist_js_state·has_pluto_hook_features§cell_id$25be5dcf-be63-46c4-b6de-6cf79fa28fd0depends_on_disabled_cells§runtime=&published_object_keysdepends_on_skipped_cells§errored$38acd032-1d18-4760-9111-67c9cdd2e892queued¤logsrunning¦outputbodymimetext/plainrootassigneelast_run_timestampA Vpersist_js_state·has_pluto_hook_features§cell_id$38acd032-1d18-4760-9111-67c9cdd2e892depends_on_disabled_cells§runtime&dpublished_object_keysdepends_on_skipped_cells§errored$cecc2a35-3850-4f66-84e8-e29da4f3d4b0queued¤logsrunning¦outputbody // We start by putting all the variable interpolation here at the beginning // Publish the plot object to JS let plot_obj = {"layout": {"xaxis": {"title": {"text": "Time(s)"}}, "template": {"layout": {"coloraxis": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "xaxis": {"gridcolor": "white", "zerolinewidth": 2, "title": {"standoff": 15}, "ticks": "", "zerolinecolor": "white", "automargin": true, "linecolor": "white"}, "hovermode": "closest", "paper_bgcolor": "white", "geo": {"showlakes": true, "showland": true, "landcolor": "#E5ECF6", "bgcolor": "white", "subunitcolor": "white", "lakecolor": "white"}, "colorscale": {"sequential": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]], "diverging": [[0, "#8e0152"], [0.1, "#c51b7d"], [0.2, "#de77ae"], [0.3, "#f1b6da"], [0.4, "#fde0ef"], [0.5, "#f7f7f7"], [0.6, "#e6f5d0"], [0.7, "#b8e186"], [0.8, "#7fbc41"], [0.9, "#4d9221"], [1, "#276419"]], "sequentialminus": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], 
[0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}, "yaxis": {"gridcolor": "white", "zerolinewidth": 2, "title": {"standoff": 15}, "ticks": "", "zerolinecolor": "white", "automargin": true, "linecolor": "white"}, "shapedefaults": {"line": {"color": "#2a3f5f"}}, "hoverlabel": {"align": "left"}, "mapbox": {"style": "light"}, "polar": {"angularaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "bgcolor": "#E5ECF6", "radialaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}}, "autotypenumbers": "strict", "font": {"color": "#2a3f5f"}, "ternary": {"baxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "bgcolor": "#E5ECF6", "caxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "aaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}}, "annotationdefaults": {"arrowhead": 0, "arrowwidth": 1, "arrowcolor": "#2a3f5f"}, "plot_bgcolor": "#E5ECF6", "title": {"x": 0.05}, "scene": {"xaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}, "zaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}, "yaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}}, "colorway": ["#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52"]}, "data": {"barpolar": [{"type": "barpolar", "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}}], "carpet": [{"aaxis": {"gridcolor": "white", "endlinecolor": "#2a3f5f", "minorgridcolor": "white", "startlinecolor": "#2a3f5f", "linecolor": "white"}, "type": "carpet", "baxis": {"gridcolor": "white", "endlinecolor": "#2a3f5f", "minorgridcolor": "white", "startlinecolor": "#2a3f5f", "linecolor": 
"white"}}], "scatterpolar": [{"type": "scatterpolar", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "parcoords": [{"line": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "type": "parcoords"}], "scatter": [{"type": "scatter", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram2dcontour": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "histogram2dcontour", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "contour": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "contour", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scattercarpet": [{"type": "scattercarpet", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "mesh3d": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "mesh3d"}], "surface": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "surface", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scattermapbox": [{"type": "scattermapbox", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scattergeo": [{"type": "scattergeo", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram": [{"type": "histogram", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "pie": 
[{"type": "pie", "automargin": true}], "choropleth": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "choropleth"}], "heatmapgl": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "heatmapgl", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "bar": [{"type": "bar", "error_y": {"color": "#2a3f5f"}, "error_x": {"color": "#2a3f5f"}, "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}}], "heatmap": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "heatmap", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "contourcarpet": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "contourcarpet"}], "table": [{"type": "table", "header": {"line": {"color": "white"}, "fill": {"color": "#C8D4E3"}}, "cells": {"line": {"color": "white"}, "fill": {"color": "#EBF0F8"}}}], "scatter3d": [{"line": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "type": "scatter3d", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scattergl": [{"type": "scattergl", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram2d": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "histogram2d", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scatterternary": 
[{"type": "scatterternary", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scatterpolargl": [{"type": "scatterpolargl", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}]}}, "legend": {"orientation": "h"}, "margin": {"l": 50, "b": 50, "r": 50, "t": 60}, "yaxis": {"title": {"text": "Horizontal Position"}}, "yaxis2": {"overlaying": "y", "title": "Pole Angle (Radians)", "side": "right"}}, "config": {"showLink": false, "editable": false, "responsive": true, "staticPlot": false, "scrollZoom": true}, "frames": [], "data": [{"y": [0.0, -1.8645987e-5, -0.022929357, -0.045886353, -0.023194335, -0.0005510263, 0.022051038, 0.044617552, 0.06715512, 0.13538401, 0.20361239, 0.22614819, 0.20299889, 0.13415128, 0.06529444, -0.0035927966, -0.09537905, -0.16440776, -0.23353508, -0.3255812, -0.39498538, -0.41907012, -0.39791384, -0.35427064, -0.2881403, -0.1767683, -0.06560569, 0.045367204, 0.15615189, 0.26676503, 0.37720996, 0.44188887, 0.5064011, 0.61633414, 0.7489039, 0.90414107, 1.1049118, 1.3056073, 1.4605527, 1.6154549, 1.7703285, 1.9023333, 2.011477, 2.1206179, 2.275472, 2.4303405, 2.539529, 2.6258953, 2.735144, 2.867276, 2.97659, 3.0402324, 3.1038985, 3.2132893, 3.3226922, 3.4321096, 3.5415447, 3.6052852, 3.6690316, 3.755628, 3.8422112, 3.9516323, 4.106752, 4.2618747, 4.371306, 4.435049, 4.4988036, 4.562556, 4.6262918, 4.73571, 4.8679576, 5.0230465, 5.223851, 5.424685, 5.579887, 5.6894994, 5.799211, 5.9090276, 5.99612, 6.060486, 6.1021104, 6.1666603, 6.2312856, 6.2731314, 6.292176, 6.31125, 6.376049, 6.486568, 6.597103, 6.7076654, 6.8411207, 6.951783, 7.0625157, 7.219008, 7.375595, 7.48668, 7.5523143, 7.6181126, 7.684078, 7.7502127, 7.816523, 7.8374333, 7.8129177, 7.788544, 7.764289, 7.7401423, 7.716083, 7.6463995, 7.5767508, 7.5528245, 7.574607, 7.642098, 7.755311, 7.8685584, 7.9590197, 8.026736, 8.094562, 8.162517, 8.185008, 8.184859, 8.162067, 8.093802, 8.025659, 8.003284, 8.003853, 7.9817004, 7.936823, 7.8692017, 7.8016596, 7.779879, 7.80386, 
7.8279204, 7.806403, 7.78499, 7.7636843, 7.742489, 7.7214103, 7.654795, 7.588292, 7.5447288, 7.5241046, 7.503594, 7.4375525, 7.3716316, 7.305827, 7.240137, 7.174559, 7.1090927, 7.089399, 7.069826, 7.004761, 6.9398437, 6.875079, 6.7648544, 6.60914, 6.453538, 6.343704, 6.2339664, 6.1243205, 6.0147614, 5.9052863, 5.818736, 5.709429, 5.6002064, 5.491068, 5.382013, 5.3187256, 5.2555337, 5.1467957, 4.9925194, 4.838351, 4.6842756, 4.530287, 4.3763723, 4.176821, 3.9773064, 3.8235195, 3.669739, 3.5159564, 3.3850207, 3.2312167, 3.0773928, 2.9235408, 2.769651, 2.6614213, 2.598863, 2.5362892, 2.4508643, 2.3426044, 2.2343729, 2.1261773, 1.9723158, 1.7727747, 1.5275265, 1.2822561, 1.0826205, 0.928563, 0.8200561, 0.71143824, 0.6027142, 0.49388582, 0.38495576, 0.29876402, 0.18964016, 0.057584107, -0.051734634, -0.16115624, -0.270684, -0.33464965, -0.39871112, -0.46285516, -0.527074, -0.61420643, -0.67855245, -0.74295354, -0.8531072, -0.98616576, -1.0964469, -1.2068076, -1.3172601, -1.4278116, -1.5384762, -1.6036144, -1.6688783, -1.757092, -1.8226123, -1.8882642, -1.9540497, -2.0199702, -2.0860305, -2.1065993, -2.1273015, -2.1709533, -2.191903, -2.212972, -2.256985, -2.278282, -2.2540185, -2.2298415, -2.2514281, -2.318779, -2.4090514, -2.4765909, -2.5442452, -2.6120355, -2.634371, -2.611266, -2.5426855, -2.4513898, -2.383013, -2.3603988, -2.337854, -2.3153737, -2.3386526, -2.3620014, -2.385429, -2.4089506, -2.386911, -2.3649797, -2.3431559, -2.3214393, -2.299831, -2.2326608, -2.1655843, -2.0985863, -2.0316575, -2.0104876, -2.0350761, -2.0825791, -2.107349, -2.0866106, -2.066016, -2.0911744, -2.1392574, -2.1647384, -2.1449573, -2.0799887, -1.969851, -1.8145049, -1.613857, -1.4133584, -1.2585917, -1.1495901, -1.0407465, -0.90925497, -0.7551049, -0.5782754, -0.42439175, -0.27062166, -0.07128258, 0.12798035, 0.30433846, 0.45779777, 0.61121714, 0.7646016, 0.8722452, 0.97984815, 1.1102513, 1.2177473, 1.3251743, 1.4782076, 1.6768526, 1.8754377, 2.0739822, 2.318216, 2.6081653, 2.898141, 
3.1425116, 3.3413675, 3.4947717, 3.6027434, 3.710875, 3.819157, 3.881948, 3.8991861, 3.9164896, 3.9795377, 4.0654726, 4.1514297, 4.21454, 4.2776437, 4.3407235, 4.4037595, 4.5124316, 4.666733, 4.8209786, 4.975182, 5.175068, 5.3749514, 5.574855, 5.7748046, 5.9291406, 6.0378923, 6.1467385, 6.301356, 6.4560766, 6.5652604, 6.62891, 6.692666, 6.7565126, 6.774744, 6.793021, 6.8113174, 6.8296103, 6.8935847, 7.0032225, 7.158531, 7.313822, 7.4691133, 7.6244264, 7.7340727, 7.843767, 7.953515, 8.063321, 8.173192, 8.283133, 8.415995, 8.526107, 8.6134815, 8.723794, 8.834218, 8.921934, 8.98695, 9.05209, 9.11735, 9.182729, 9.248226, 9.313843, 9.425231, 9.536748, 9.625601, 9.691819, 9.712618, 9.710777, 9.686263, 9.66187, 9.683248, 9.704732, 9.680645, 9.656641, 9.6327, 9.608809, 9.630663, 9.652552, 9.674477, 9.719294, 9.741303, 9.717653, 9.694035, 9.716148, 9.738283, 9.760441, 9.805481, 9.827703, 9.849967, 9.872275, 9.894631, 9.939892, 10.008061, 10.076311, 10.098997, 10.1217985, 10.144724, 10.167779, 10.190976, 10.168705, 10.1465845, 10.124613, 10.102789, 10.126733, 10.173624, 10.197894, 10.17682, 10.110425, 9.998679, 9.841503, 9.684438, 9.52745, 9.370525, 9.259346, 9.193909, 9.174213, 9.154566, 9.089285, 9.024073, 9.004623, 9.030927, 9.057321, 9.08381, 9.11043, 9.091724, 9.027802, 8.918723, 8.764486, 8.565034, 8.365782, 8.189479, 8.013363, 7.860214, 7.7072477, 7.5088744, 7.265038, 7.021312, 6.777666, 6.5340853, 6.336252, 6.1841655, 6.077826, 5.971545, 5.8196597, 5.6221848, 5.4247947, 5.2274795, 5.0073824, 4.8101892, 4.613042, 4.3702226, 4.127419, 3.9074671, 3.710352, 3.5132103, 3.2703269, 3.0273921, 2.8072305, 2.6098146, 2.457968, 2.351708, 2.2453794, 2.1390057, 2.0326037, 1.9261947, 1.8197982, 1.6905794, 1.5842582, 1.4780027, 1.3718272, 1.2657576, 1.1141831, 0.96275413, 0.8114837, 0.6147864, 0.37265286, 0.08502662, -0.20250261, -0.48996568, -0.82308984, -1.1790636, -1.512204, -1.7997215, -2.0416884, -2.2837818, -2.5260215, -2.7228296, -2.8742185, -2.980151, -3.086207, -3.23803, 
-3.4356277, -3.6333435, -3.8311834, -4.029162, -4.1816707, -4.334332, -4.5327606, -4.731348, -4.930098, -5.1517916, -5.350927, -5.52755, -5.6817102, -5.7907405, -5.854638, -5.8733435, -5.8695016, -5.888594, -5.953426, -6.018442, -6.038055, -6.0578256, -6.077734, -6.0521173, -6.0265875, -6.046815, -6.1128035, -6.2245526, -6.3363833, -6.4483085, -6.560353, -6.672525, -6.8303995, -6.988398, -7.10112, -7.1687007, -7.2365303, -7.304627, -7.3277855, -7.3060665, -7.239475, -7.173194, -7.107218, -6.9962935, -6.840341, -6.6846237, -6.574542, -6.46471, -6.3551297, -6.2457957, -6.0912538, -5.891438, -5.691799, -5.5378947, -5.3841634, -5.1849947, -4.985962, -4.7870407, -4.5882206, -4.4351687, -4.327886, -4.2207017, -4.090797, -3.961014, -3.8085406, -3.6562016, -3.5040014, -3.3063056, -3.1087403, -2.934123, -2.782454, -2.6537333, -2.5023386, -2.3054826, -2.1087809, -1.9122297, -1.6701969, -1.4282925, -1.1864984, -0.944806, -0.72604287, -0.4845205, -0.24307097, -0.0473824, 0.12539136, 0.3209296, 0.51638067, 0.71173584, 0.95266193, 1.2163333, 1.5027755, 1.789164, 2.0298085, 2.270422, 2.5567236, 2.8430214, 3.0836217, 3.3242445, 3.564895, 3.8055782, 4.0463004, 4.2870665, 4.5278835, 4.7687573, 5.0096955, 5.2050095, 5.400387, 5.641522, 5.882722, 6.123994, 6.3653502, 6.606798, 6.8940086, 7.1813316, 7.4231896, 7.619634, 7.770668, 7.9218636, 8.073206, 8.22469, 8.421948, 8.619347, 8.771265, 8.900506, 9.007041, 9.0908375, 9.197549, 9.350019, 9.50255, 9.65514, 9.807787, 9.960491, 10.158953, 10.403167, 10.647463, 10.846238, 10.999553, 11.107423, 11.215446, 11.369241, 11.52319, 11.677296, 11.8315735, 11.94046, 12.003951, 12.067605, 12.177026, 12.332209, 12.487555, 12.6203, 12.753241, 12.863646, 12.951537, 13.062389, 13.218885, 13.375585, 13.487214, 13.5990925, 13.711238, 13.77856, 13.8011465, 13.7790365, 13.712215, 13.645724, 13.579557, 13.468581, 13.312708, 13.157091, 13.024375, 12.869243, 12.714358, 12.605137, 12.49617, 12.34203, 12.188131, 12.034465, 11.835533, 11.591249, 11.347103, 
11.103061, 10.836275, 10.59239, 10.394253, 10.24186, 10.135214, 10.028622, 9.899256, 9.747138, 9.549434, 9.351813, 9.19996, 9.048188, 8.850821, 8.607839, 8.364908, 8.167719, 7.970556, 7.727698, 7.46198, 7.1962347, 6.9532814, 6.733088, 6.5356255, 6.383711, 6.2773614, 6.1709313, 6.0644436, 5.9579144, 5.8513637, 5.790524, 5.7297096, 5.623258, 5.471214, 5.3192677, 5.167427, 4.9700317, 4.72706, 4.4613104, 4.218454, 3.998476, 3.8013659, 3.6271176, 3.4300184, 3.1872046, 2.9443703, 2.7472007, 2.5728319, 2.4212656, 2.269658, 2.0723076, 1.8292117, 1.5860579, 1.3656662, 1.1223476, 0.87891394, 0.6809477, 0.52840495, 0.42129087, 0.3596588, 0.34359628, 0.3274892, 0.2656523, 0.18095392, 0.0734135, -0.07981146, -0.27872625, -0.4776377, -0.6765644, -0.8755267, -1.0745409, -1.2736295, -1.427141, -1.5807598, -1.7344948, -1.8426976, -1.9510179, -2.0822816, -2.2364883, -2.4364555, -2.6365485, -2.7912173, -2.9005198, -3.010019, -3.1197193, -3.1840816, -3.24864, -3.3133855, -3.3327315, -3.352233, -3.4175007, -3.4829164, -3.5484798, -3.614192, -3.680054, -3.7916791, -3.9034624, -3.969863, -3.99091, -3.96658, -3.9424124, -3.918385, -3.8944912, -3.870711, -3.847036, -3.869132, -3.936998, -4.004971, -4.0502467, -4.1184688, -4.1868424, -4.2098255, -4.187454, -4.1652784, -4.1432943, -4.121502, -4.099901, -4.0329285, -3.9661343, -3.9223049, -3.9014416, -3.880755, -3.8146782, -3.748783, -3.6830654, -3.5719285, -3.4153166, -3.2588098, -3.1480637, -3.0374002, -2.9268103, -2.8391314, -2.774362, -2.7553518, -2.7364206, -2.6719248, -2.5618849, -2.4291193, -2.2964444, -2.1638472, -2.0541654, -1.9445462, -1.8349862, -1.7254812, -1.6160281, -1.5523285, -1.4886901, -1.3794353, -1.2245612, -1.0697435, -0.9606747, -0.85164815, -0.69695425, -0.49657103, -0.29618263, -0.14146921, -0.032405302, 0.07671446, 0.23159096, 0.38653728, 0.49588817, 0.5596485, 0.57778823, 0.59596956, 0.65988135, 0.76951563, 0.92487574, 1.0802667, 1.190006, 1.2769548, 1.3411089, 1.40531, 1.4695469, 1.488096, 1.5066452, 1.5708824, 
1.635084, 1.699239, 1.8090404, 1.9416435, 2.0513544, 2.1610327, 2.2706828, 2.3803096, 2.4899173, 2.5537958, 2.5947924, 2.6585956, 2.745188, 2.8545609, 3.0095692, 3.2102442, 3.4109101, 3.565883, 3.7208934, 3.8759608, 3.985413, 4.0492563, 4.113168, 4.2228374, 4.3325686, 4.419517, 4.4836783, 4.5478926, 4.6350036, 4.7450104, 4.9007683, 5.056592, 5.1668296, 5.2315035, 5.2505984, 5.2697673, 5.3346925, 5.4453745, 5.556123, 5.621259, 5.6864743, 5.7517657, 5.7714353, 5.7911596, 5.810921, 5.830705, 5.896211, 5.9617243, 5.981529, 6.0013256, 6.06681, 6.1779838, 6.2891493, 6.354608, 6.420076, 6.5312676, 6.6424847, 6.7080355, 6.773631, 6.8621254, 6.927823, 6.9935784, 7.0593944, 7.0795712, 7.0540867, 7.0286183, 7.0488553, 7.091926, 7.157825, 7.2236996, 7.289553, 7.355389, 7.4212103, 7.5098777, 7.6214027, 7.7329464, 7.821673, 7.887598, 7.953579, 8.01962, 8.085726, 8.151902, 8.218153, 8.284487, 8.305224, 8.326042, 8.392626, 8.459291, 8.480364, 8.478677, 8.454208, 8.406931, 8.382527, 8.403838, 8.47086, 8.583603, 8.696378, 8.763515, 8.785043, 8.760956, 8.736926, 8.758646, 8.826115, 8.939333, 9.052623, 9.120362, 9.142606, 9.164995, 9.187535, 9.164613, 9.1190195, 9.050722, 8.982526, 8.937251, 8.914897, 8.915462, 8.938946, 8.962517, 8.940514, 8.895774, 8.873967, 8.852254, 8.78495, 8.717723, 8.650556, 8.583438, 8.562066, 8.586441, 8.610873, 8.635377, 8.659976, 8.63904, 8.57259, 8.506265, 8.485714, 8.46529, 8.399364, 8.333575, 8.313559, 8.293684, 8.251159, 8.186004, 8.075423, 7.9193773, 7.7634487, 7.6076093, 7.451846, 7.3418393], "type": "scatter", "name": "x", "yaxis": "y", "x": [0.0, 0.04, 0.08, 0.12, 0.16, 0.19999999, 0.23999998, 0.27999997, 0.31999996, 0.35999995, 0.39999995, 0.43999994, 0.47999993, 0.5199999, 0.55999994, 0.59999996, 0.64, 0.68, 0.72, 0.76000005, 0.8000001, 0.8400001, 0.8800001, 0.92000014, 0.96000016, 1.0000001, 1.0400001, 1.08, 1.12, 1.16, 1.1999999, 1.2399999, 1.2799999, 1.3199998, 1.3599998, 1.3999997, 1.4399997, 1.4799997, 1.5199996, 1.5599996, 1.5999995, 
1.6399995, 1.6799995, 1.7199994, 1.7599994, 1.7999994, 1.8399993, 1.8799993, 1.9199992, 1.9599992, 1.9999992, 2.0399992, 2.0799992, 2.1199992, 2.1599991, 2.199999, 2.239999, 2.279999, 2.319999, 2.359999, 2.399999, 2.4399989, 2.4799988, 2.5199988, 2.5599988, 2.5999987, 2.6399987, 2.6799986, 2.7199986, 2.7599986, 2.7999985, 2.8399985, 2.8799984, 2.9199984, 2.9599984, 2.9999983, 3.0399983, 3.0799983, 3.1199982, 3.1599982, 3.1999981, 3.239998, 3.279998, 3.319998, 3.359998, 3.399998, 3.439998, 3.4799979, 3.5199978, 3.5599978, 3.5999978, 3.6399977, 3.6799977, 3.7199976, 3.7599976, 3.7999976, 3.8399975, 3.8799975, 3.9199975, 3.9599974, 3.9999974, 4.0399976, 4.0799975, 4.1199975, 4.1599975, 4.1999974, 4.2399974, 4.2799973, 4.3199973, 4.3599973, 4.399997, 4.439997, 4.479997, 4.519997, 4.559997, 4.599997, 4.639997, 4.679997, 4.719997, 4.759997, 4.799997, 4.839997, 4.879997, 4.9199967, 4.9599967, 4.9999967, 5.0399966, 5.0799966, 5.1199965, 5.1599965, 5.1999965, 5.2399964, 5.2799964, 5.3199964, 5.3599963, 5.3999963, 5.4399962, 5.479996, 5.519996, 5.559996, 5.599996, 5.639996, 5.679996, 5.719996, 5.759996, 5.799996, 5.839996, 5.879996, 5.919996, 5.9599957, 5.9999957, 6.0399957, 6.0799956, 6.1199956, 6.1599956, 6.1999955, 6.2399955, 6.2799954, 6.3199954, 6.3599954, 6.3999953, 6.4399953, 6.4799953, 6.519995, 6.559995, 6.599995, 6.639995, 6.679995, 6.719995, 6.759995, 6.799995, 6.839995, 6.879995, 6.919995, 6.959995, 6.9999948, 7.0399947, 7.0799947, 7.1199946, 7.1599946, 7.1999946, 7.2399945, 7.2799945, 7.3199944, 7.3599944, 7.3999944, 7.4399943, 7.4799943, 7.5199943, 7.559994, 7.599994, 7.639994, 7.679994, 7.719994, 7.759994, 7.799994, 7.839994, 7.879994, 7.919994, 7.959994, 7.999994, 8.039994, 8.079994, 8.119994, 8.159994, 8.199994, 8.239994, 8.279994, 8.319994, 8.359994, 8.399994, 8.439994, 8.479994, 8.519994, 8.559994, 8.599994, 8.639994, 8.679994, 8.719994, 8.759994, 8.7999935, 8.8399935, 8.879993, 8.919993, 8.959993, 8.999993, 9.039993, 9.079993, 9.119993, 9.159993, 
9.199993, 9.239993, 9.279993, 9.319993, 9.359993, 9.399993, 9.439993, 9.479993, 9.519993, 9.559993, 9.599993, 9.639993, 9.679993, 9.719993, 9.759993, 9.799993, 9.839993, 9.8799925, 9.919992, 9.959992, 9.999992, 10.039992, 10.079992, 10.119992, 10.159992, 10.199992, 10.239992, 10.279992, 10.319992, 10.359992, 10.399992, 10.439992, 10.479992, 10.519992, 10.559992, 10.599992, 10.639992, 10.679992, 10.719992, 10.759992, 10.799992, 10.839992, 10.879992, 10.9199915, 10.959991, 10.999991, 11.039991, 11.079991, 11.119991, 11.159991, 11.199991, 11.239991, 11.279991, 11.319991, 11.359991, 11.399991, 11.439991, 11.479991, 11.519991, 11.559991, 11.599991, 11.639991, 11.679991, 11.719991, 11.759991, 11.799991, 11.839991, 11.879991, 11.919991, 11.9599905, 11.99999, 12.03999, 12.07999, 12.11999, 12.15999, 12.19999, 12.23999, 12.27999, 12.31999, 12.35999, 12.39999, 12.43999, 12.47999, 12.51999, 12.55999, 12.59999, 12.63999, 12.67999, 12.71999, 12.75999, 12.79999, 12.83999, 12.87999, 12.91999, 12.95999, 12.9999895, 13.039989, 13.079989, 13.119989, 13.159989, 13.199989, 13.239989, 13.279989, 13.319989, 13.359989, 13.399989, 13.439989, 13.479989, 13.519989, 13.559989, 13.599989, 13.639989, 13.679989, 13.719989, 13.759989, 13.799989, 13.839989, 13.879989, 13.919989, 13.959989, 13.999989, 14.0399885, 14.0799885, 14.119988, 14.159988, 14.199988, 14.239988, 14.279988, 14.319988, 14.359988, 14.399988, 14.439988, 14.479988, 14.519988, 14.559988, 14.599988, 14.639988, 14.679988, 14.719988, 14.759988, 14.799988, 14.839988, 14.879988, 14.919988, 14.959988, 14.999988, 15.039988, 15.079988, 15.1199875, 15.159987, 15.199987, 15.239987, 15.279987, 15.319987, 15.359987, 15.399987, 15.439987, 15.479987, 15.519987, 15.559987, 15.599987, 15.639987, 15.679987, 15.719987, 15.759987, 15.799987, 15.839987, 15.879987, 15.919987, 15.959987, 15.999987, 16.039988, 16.079988, 16.11999, 16.15999, 16.199991, 16.239992, 16.279993, 16.319994, 16.359995, 16.399996, 16.439997, 16.479998, 16.519999, 16.56, 16.6, 
16.640001, 16.680002, 16.720003, 16.760004, 16.800005, 16.840006, 16.880007, 16.920008, 16.960009, 17.00001, 17.04001, 17.080011, 17.120012, 17.160013, 17.200014, 17.240015, 17.280016, 17.320017, 17.360018, 17.400019, 17.44002, 17.48002, 17.520021, 17.560022, 17.600023, 17.640024, 17.680025, 17.720026, 17.760027, 17.800028, 17.840029, 17.88003, 17.92003, 17.960032, 18.000032, 18.040033, 18.080034, 18.120035, 18.160036, 18.200037, 18.240038, 18.280039, 18.32004, 18.36004, 18.400042, 18.440042, 18.480043, 18.520044, 18.560045, 18.600046, 18.640047, 18.680048, 18.720049, 18.76005, 18.80005, 18.840052, 18.880053, 18.920053, 18.960054, 19.000055, 19.040056, 19.080057, 19.120058, 19.160059, 19.20006, 19.24006, 19.280062, 19.320063, 19.360064, 19.400064, 19.440065, 19.480066, 19.520067, 19.560068, 19.600069, 19.64007, 19.68007, 19.720072, 19.760073, 19.800074, 19.840075, 19.880075, 19.920076, 19.960077, 20.000078, 20.04008, 20.08008, 20.12008, 20.160082, 20.200083, 20.240084, 20.280085, 20.320086, 20.360086, 20.400087, 20.440088, 20.48009, 20.52009, 20.560091, 20.600092, 20.640093, 20.680094, 20.720095, 20.760096, 20.800097, 20.840097, 20.880098, 20.9201, 20.9601, 21.000101, 21.040102, 21.080103, 21.120104, 21.160105, 21.200106, 21.240107, 21.280107, 21.320108, 21.36011, 21.40011, 21.440111, 21.480112, 21.520113, 21.560114, 21.600115, 21.640116, 21.680117, 21.720118, 21.760118, 21.80012, 21.84012, 21.880121, 21.920122, 21.960123, 22.000124, 22.040125, 22.080126, 22.120127, 22.160128, 22.200129, 22.24013, 22.28013, 22.320131, 22.360132, 22.400133, 22.440134, 22.480135, 22.520136, 22.560137, 22.600138, 22.640139, 22.68014, 22.72014, 22.760141, 22.800142, 22.840143, 22.880144, 22.920145, 22.960146, 23.000147, 23.040148, 23.080149, 23.12015, 23.16015, 23.200151, 23.240152, 23.280153, 23.320154, 23.360155, 23.400156, 23.440157, 23.480158, 23.520159, 23.56016, 23.60016, 23.640162, 23.680162, 23.720163, 23.760164, 23.800165, 23.840166, 23.880167, 23.920168, 23.960169, 24.00017, 
24.04017, 24.080172, 24.120173, 24.160173, 24.200174, 24.240175, 24.280176, 24.320177, 24.360178, 24.400179, 24.44018, 24.48018, 24.520182, 24.560183, 24.600183, 24.640184, 24.680185, 24.720186, 24.760187, 24.800188, 24.840189, 24.88019, 24.92019, 24.960192, 25.000193, 25.040194, 25.080194, 25.120195, 25.160196, 25.200197, 25.240198, 25.2802, 25.3202, 25.3602, 25.400202, 25.440203, 25.480204, 25.520205, 25.560205, 25.600206, 25.640207, 25.680208, 25.72021, 25.76021, 25.80021, 25.840212, 25.880213, 25.920214, 25.960215, 26.000216, 26.040216, 26.080217, 26.120218, 26.16022, 26.20022, 26.240221, 26.280222, 26.320223, 26.360224, 26.400225, 26.440226, 26.480227, 26.520227, 26.560228, 26.60023, 26.64023, 26.680231, 26.720232, 26.760233, 26.800234, 26.840235, 26.880236, 26.920237, 26.960238, 27.000238, 27.04024, 27.08024, 27.120241, 27.160242, 27.200243, 27.240244, 27.280245, 27.320246, 27.360247, 27.400248, 27.440248, 27.48025, 27.52025, 27.560251, 27.600252, 27.640253, 27.680254, 27.720255, 27.760256, 27.800257, 27.840258, 27.880259, 27.92026, 27.96026, 28.000261, 28.040262, 28.080263, 28.120264, 28.160265, 28.200266, 28.240267, 28.280268, 28.320269, 28.36027, 28.40027, 28.440271, 28.480272, 28.520273, 28.560274, 28.600275, 28.640276, 28.680277, 28.720278, 28.760279, 28.80028, 28.84028, 28.880281, 28.920282, 28.960283, 29.000284, 29.040285, 29.080286, 29.120287, 29.160288, 29.200289, 29.24029, 29.28029, 29.320292, 29.360292, 29.400293, 29.440294, 29.480295, 29.520296, 29.560297, 29.600298, 29.640299, 29.6803, 29.7203, 29.760302, 29.800303, 29.840303, 29.880304, 29.920305, 29.960306, 30.000307, 30.040308, 30.080309, 30.12031, 30.16031, 30.200312, 30.240313, 30.280313, 30.320314, 30.360315, 30.400316, 30.440317, 30.480318, 30.520319, 30.56032, 30.60032, 30.640322, 30.680323, 30.720324, 30.760324, 30.800325, 30.840326, 30.880327, 30.920328, 30.96033, 31.00033, 31.04033, 31.080332, 31.120333, 31.160334, 31.200335, 31.240335, 31.280336, 31.320337, 31.360338, 31.40034, 
31.44034, 31.480341, 31.520342, 31.560343, 31.600344, 31.640345, 31.680346, 31.720346, 31.760347, 31.800348, 31.84035, 31.88035, 31.920351, 31.960352, 32.00035, 32.04035, 32.080353, 32.120354, 32.160355, 32.200356, 32.240356, 32.280357, 32.32036, 32.36036, 32.40036, 32.44036, 32.480362, 32.520363, 32.560364, 32.600365, 32.640366, 32.680367, 32.720367, 32.76037, 32.80037, 32.84037, 32.88037, 32.920372, 32.960373, 33.000374, 33.040375, 33.080376, 33.120377, 33.160378, 33.20038, 33.24038, 33.28038, 33.32038, 33.360382, 33.400383, 33.440384, 33.480385, 33.520386, 33.560387, 33.600388, 33.64039, 33.68039, 33.72039, 33.76039, 33.800392, 33.840393, 33.880394, 33.920395, 33.960396, 34.000397, 34.040398, 34.0804, 34.1204, 34.1604, 34.2004, 34.240402, 34.280403, 34.320404, 34.360405, 34.400406, 34.440407, 34.480408, 34.52041, 34.56041, 34.60041, 34.64041, 34.680412, 34.720413, 34.760414, 34.800415, 34.840416, 34.880417, 34.920418, 34.96042, 35.00042, 35.04042, 35.08042, 35.120422, 35.160423, 35.200424, 35.240425, 35.280426, 35.320427, 35.360428, 35.40043, 35.44043, 35.48043, 35.52043, 35.560432, 35.600433, 35.640434, 35.680435, 35.720436, 35.760437, 35.800438, 35.84044, 35.88044, 35.92044, 35.96044, 36.000443, 36.040443, 36.080444, 36.120445, 36.160446, 36.200447, 36.240448, 36.28045, 36.32045, 36.36045, 36.40045, 36.440453, 36.480453, 36.520454, 36.560455, 36.600456, 36.640457, 36.680458, 36.72046, 36.76046, 36.80046, 36.84046, 36.880463, 36.920464, 36.960464, 37.000465, 37.040466, 37.080467, 37.12047, 37.16047, 37.20047, 37.24047, 37.28047, 37.320473, 37.360474, 37.400475, 37.440475, 37.480476, 37.520477, 37.56048, 37.60048, 37.64048, 37.68048, 37.72048, 37.760483, 37.800484, 37.840485, 37.880486, 37.920486, 37.960487, 38.00049, 38.04049, 38.08049, 38.12049, 38.160492, 38.200493, 38.240494, 38.280495, 38.320496, 38.360497, 38.400497, 38.4405, 38.4805, 38.5205, 38.5605, 38.600502, 38.640503, 38.680504, 38.720505, 38.760506, 38.800507, 38.840508, 38.88051, 38.92051, 
38.96051, 39.00051, 39.040512, 39.080513, 39.120514, 39.160515, 39.200516, 39.240517, 39.280518, 39.32052, 39.36052, 39.40052, 39.44052, 39.480522, 39.520523, 39.560524, 39.600525, 39.640526, 39.680527, 39.720528, 39.76053, 39.80053, 39.84053, 39.88053, 39.920532, 39.960533]}, {"y": [0.05, 0.050205365, 0.06224224, 0.0747987, 0.065162905, 0.056069747, 0.04743547, 0.03919608, 0.031277858, 0.00075507164, -0.029761303, -0.037659604, -0.023010574, 0.01431556, 0.051763266, 0.08961947, 0.13960168, 0.17942575, 0.22065093, 0.27481857, 0.32026565, 0.34665233, 0.35440493, 0.35432476, 0.34641054, 0.3198583, 0.29594505, 0.27437213, 0.25506857, 0.23780292, 0.22250609, 0.23126327, 0.24192622, 0.23239662, 0.21366568, 0.18549709, 0.1363802, 0.08842297, 0.06395237, 0.039995678, 0.016375918, 0.0043235607, 0.003743016, 0.003193365, -0.02020235, -0.043768235, -0.044842925, -0.034863014, -0.036598757, -0.05006402, -0.05252134, -0.03256936, -0.012891032, -0.016188297, -0.019619746, -0.023212792, -0.026998306, -0.008140886, 0.010648154, 0.018088633, 0.025677722, 0.022045769, -0.004272338, -0.030626481, -0.034370102, -0.0155366175, 0.0031663869, 0.021895057, 0.040807545, 0.037201513, 0.022474758, -0.0035022777, -0.05237659, -0.10169783, -0.12910135, -0.13489065, -0.14178333, -0.14985067, -0.14784579, -0.1357461, -0.11344425, -0.10343327, -0.094266765, -0.07449753, -0.043929234, -0.013732521, -0.0065170866, -0.022226907, -0.03812291, -0.054328553, -0.08240104, -0.09976192, -0.11792992, -0.15974927, -0.2029215, -0.22537017, -0.2273989, -0.23128062, -0.23706017, -0.24476211, -0.25447655, -0.24417375, -0.21370578, -0.18502752, -0.15781939, -0.13193637, -0.107108995, -0.060422048, -0.014250103, 0.008939279, 0.009330352, -0.013073645, -0.05845124, -0.104326844, -0.13966656, -0.16484667, -0.19134222, -0.21944101, -0.22703901, -0.22536422, -0.21439536, -0.18285766, -0.1528565, -0.14667472, -0.15299445, -0.14927594, -0.13548607, -0.11147166, -0.08839392, -0.088810675, -0.11272425, -0.13758545, 
-0.1409294, -0.14542858, -0.15112981, -0.15806605, -0.16631058, -0.15337867, -0.14172173, -0.14253174, -0.15581273, -0.17039035, -0.16384132, -0.15864584, -0.15474844, -0.15212725, -0.15075456, -0.15062255, -0.17430225, -0.19944166, -0.20381862, -0.2098563, -0.21762273, -0.20486659, -0.17141603, -0.13940877, -0.13116103, -0.12398481, -0.11783616, -0.11265293, -0.108402275, -0.11640365, -0.11400856, -0.11255479, -0.11202775, -0.11242413, -0.13644929, -0.16162021, -0.1655568, -0.14830759, -0.13229582, -0.117355466, -0.103394404, -0.090273805, -0.055117715, -0.02042894, -0.008772474, 0.0028123865, 0.014420708, 0.01471217, 0.026559092, 0.038627803, 0.051011633, 0.06382151, 0.054330673, 0.022447668, -0.009245232, -0.029577887, -0.038724385, -0.048187807, -0.05805275, -0.04556115, -0.010595109, 0.04715126, 0.10530118, 0.14156973, 0.15636508, 0.14986406, 0.14460231, 0.14052469, 0.13760729, 0.13581985, 0.12382671, 0.124190696, 0.13691449, 0.13944367, 0.14311694, 0.14797243, 0.1314321, 0.11598901, 0.10148663, 0.08783152, 0.08628196, 0.074048094, 0.06243201, 0.0741491, 0.09788033, 0.11104393, 0.12510982, 0.14021935, 0.15646371, 0.17401308, 0.17048015, 0.16835105, 0.1788588, 0.17959349, 0.18179998, 0.18550234, 0.19071966, 0.19750945, 0.18351334, 0.17104049, 0.17121336, 0.16152781, 0.15318057, 0.15737404, 0.15157188, 0.12441519, 0.098305345, 0.0957467, 0.1167256, 0.15001938, 0.17325647, 0.19787866, 0.22415598, 0.22999613, 0.21548, 0.18040812, 0.13559216, 0.10321649, 0.09441781, 0.08639128, 0.079082616, 0.095213026, 0.11214, 0.12997414, 0.14889532, 0.14643139, 0.14517416, 0.14511046, 0.14623995, 0.14857438, 0.12951991, 0.11154886, 0.09448002, 0.078201786, 0.085357174, 0.11598749, 0.1589279, 0.19190356, 0.20401822, 0.21778141, 0.25557867, 0.3064961, 0.3490549, 0.37300712, 0.37875316, 0.36639085, 0.33571056, 0.28620595, 0.23910227, 0.21607253, 0.21707936, 0.21986513, 0.21330555, 0.19731848, 0.17175871, 0.15886302, 0.14725651, 0.11424364, 0.082195975, 0.062204696, 0.054138854, 
0.046516594, 0.039281238, 0.055219755, 0.07162072, 0.077202015, 0.09481165, 0.11321539, 0.109829105, 0.08461552, 0.060114373, 0.03609631, -0.010482322, -0.08000752, -0.15021858, -0.19904372, -0.22709252, -0.23474321, -0.22210242, -0.2112967, -0.20220281, -0.17238104, -0.121447764, -0.07154875, -0.04503311, -0.030305816, -0.015827239, 0.009957148, 0.03582553, 0.061981335, 0.08865987, 0.09328683, 0.075910695, 0.059170403, 0.042910177, 0.004151102, -0.03457267, -0.07356888, -0.113189876, -0.13101831, -0.12725508, -0.124543086, -0.14552255, -0.1677208, -0.16875906, -0.14865027, -0.12978439, -0.11196618, -0.0723421, -0.03333244, 0.005414594, 0.044207633, 0.060513772, 0.054487083, 0.02606972, -0.0021273363, -0.030341433, -0.05881188, -0.064932995, -0.071586855, -0.07883545, -0.086729765, -0.09534507, -0.10473963, -0.12636536, -0.13769653, -0.13883337, -0.15241757, -0.16727075, -0.1722216, -0.16732772, -0.16381375, -0.16164199, -0.16080047, -0.16128018, -0.16308384, -0.1887501, -0.2159974, -0.23383276, -0.24248177, -0.23094587, -0.21019414, -0.17996578, -0.15124881, -0.14634484, -0.14263956, -0.117475405, -0.0932988, -0.069871575, -0.047032267, -0.047421142, -0.048201937, -0.049380545, -0.06238696, -0.064495444, -0.044309735, -0.024496289, -0.027747707, -0.031229114, -0.03496767, -0.050423212, -0.05487436, -0.059776645, -0.06517492, -0.07110833, -0.08903274, -0.1190625, -0.15009671, -0.15975831, -0.1707191, -0.18309651, -0.1969557, -0.21244971, -0.20735243, -0.20395966, -0.20223314, -0.20216462, -0.22609405, -0.26299524, -0.29104096, -0.2995954, -0.2887901, -0.2584592, -0.20815179, -0.15960562, -0.112293154, -0.06593612, -0.04292828, -0.04311979, -0.06651196, -0.09046399, -0.09238327, -0.095062956, -0.12127538, -0.17115153, -0.22248077, -0.27550936, -0.33085704, -0.36729822, -0.38543227, -0.38556793, -0.36770755, -0.33154693, -0.29813325, -0.27796754, -0.26004437, -0.25525343, -0.25253472, -0.22976989, -0.18663676, -0.14508101, -0.10466, -0.065127075, -0.048940565, 
-0.05599601, -0.086339906, -0.11741333, -0.12674172, -0.1144303, -0.103071906, -0.09255381, -0.07141727, -0.062274523, -0.05364135, -0.022609469, 0.0082310345, 0.027702972, 0.0359725, 0.044536725, 0.07631034, 0.10872948, 0.13066289, 0.14234391, 0.13256352, 0.10120758, 0.070706435, 0.04076886, 0.011176037, -0.01832755, -0.047986105, -0.06661167, -0.09718338, -0.12857601, -0.16098864, -0.194758, -0.20770252, -0.22232318, -0.23878622, -0.235013, -0.21095037, -0.16626802, -0.12299409, -0.08068067, -0.016230367, 0.05951432, 0.12434223, 0.16749325, 0.18947642, 0.2129747, 0.23825075, 0.2432815, 0.22813822, 0.1926083, 0.15869918, 0.14863212, 0.16236949, 0.17745796, 0.19397877, 0.2121134, 0.20965204, 0.20890951, 0.23218437, 0.25739357, 0.28463855, 0.3251377, 0.3574937, 0.38200796, 0.39904872, 0.39832017, 0.3798096, 0.34319732, 0.2986557, 0.2674288, 0.26033255, 0.25534362, 0.23034754, 0.20727067, 0.18584965, 0.1434878, 0.10234202, 0.084765136, 0.09066177, 0.120064974, 0.15048066, 0.18209001, 0.21523091, 0.2500635, 0.30897608, 0.37048364, 0.4137079, 0.43939537, 0.46846092, 0.5013099, 0.51819956, 0.5194567, 0.5051058, 0.4947499, 0.4882403, 0.46553683, 0.42621234, 0.39037102, 0.37861013, 0.3698511, 0.36407942, 0.3612191, 0.33996105, 0.29995424, 0.26244846, 0.24906288, 0.23768215, 0.20604627, 0.1761384, 0.1476286, 0.120360255, 0.11676107, 0.13681401, 0.15801261, 0.16921076, 0.18179, 0.18462768, 0.18897463, 0.19487867, 0.17996296, 0.16654257, 0.16574045, 0.17755301, 0.20206372, 0.21705125, 0.2115085, 0.20770442, 0.20559596, 0.18280533, 0.16154218, 0.14157687, 0.12279576, 0.11635507, 0.09951022, 0.08349773, 0.09095123, 0.1105391, 0.11967862, 0.12979448, 0.14098935, 0.13070709, 0.11017242, 0.07916644, 0.048829578, 0.04173228, 0.034977335, 0.0056509897, -0.023627548, -0.030233867, -0.037088342, -0.0442512, -0.05177694, -0.059733357, -0.06817899, -0.07719198, -0.08683607, -0.097203106, -0.085608274, -0.074726984, -0.08725783, -0.100516856, -0.114592984, -0.12962551, -0.14570795, 
-0.18557462, -0.22700852, -0.24803777, -0.2489583, -0.22978333, -0.2125151, -0.19695586, -0.18303159, -0.19304578, -0.2046579, -0.19557674, -0.17690118, -0.14840822, -0.10984366, -0.08354738, -0.08071613, -0.07854937, -0.07703207, -0.076149814, -0.07589647, -0.09906178, -0.14577515, -0.19372675, -0.22083451, -0.22746035, -0.2136956, -0.20169951, -0.21370171, -0.22747259, -0.24307387, -0.26068693, -0.2583702, -0.23608983, -0.21577166, -0.21948993, -0.24725181, -0.27707642, -0.29814583, -0.3216128, -0.33688688, -0.34411067, -0.36477613, -0.4095548, -0.45770496, -0.48908275, -0.52415586, -0.56342846, -0.58791894, -0.59812385, -0.5942526, -0.5762256, -0.56269485, -0.5534592, -0.5292621, -0.48962438, -0.45393956, -0.43194133, -0.40300402, -0.37735358, -0.37581414, -0.37728703, -0.36062664, -0.34690246, -0.33593604, -0.30616412, -0.25711364, -0.21022697, -0.16496378, -0.10979815, -0.06689888, -0.04735631, -0.051044486, -0.07798806, -0.105589405, -0.122678585, -0.12943567, -0.11458328, -0.10068713, -0.110357046, -0.1209449, -0.10982812, -0.07688324, -0.044588957, -0.035504334, -0.026710369, 0.0047286414, 0.04764131, 0.09094106, 0.12361148, 0.14593668, 0.1581608, 0.14911012, 0.118675336, 0.08924145, 0.060519524, 0.032308724, 0.0043563936, -0.046429135, -0.09761117, -0.12683208, -0.13441299, -0.14309132, -0.15295687, -0.1414856, -0.10854083, -0.0651339, -0.033674937, -0.013916526, -0.005707996, -0.008982436, -0.0008947514, 0.030057281, 0.06126368, 0.07014433, 0.06819432, 0.05539483, 0.043057982, 0.053921998, 0.08805965, 0.12294291, 0.1474729, 0.18448356, 0.22305016, 0.24116033, 0.23906678, 0.21673986, 0.17387041, 0.10990057, 0.046870723, 0.0070558414, -0.021267852, -0.03833483, -0.032860655, -0.0047959927, 0.023228124, 0.051438116, 0.08008258, 0.10936628, 0.13957265, 0.14828391, 0.15820329, 0.16943541, 0.15952925, 0.15094529, 0.15488805, 0.17137618, 0.2117523, 0.25391328, 0.27603933, 0.2784556, 0.2831269, 0.29011014, 0.2775919, 0.26735282, 0.25926903, 0.23123267, 
0.20512655, 0.20302631, 0.20258693, 0.20380636, 0.20669824, 0.21127638, 0.23987654, 0.27048042, 0.28129113, 0.27247033, 0.24388589, 0.21733531, 0.19251184, 0.16929787, 0.14744008, 0.1268167, 0.12989609, 0.15669195, 0.18480429, 0.20318009, 0.23436315, 0.26750797, 0.28081575, 0.2744849, 0.2703966, 0.26850134, 0.26879236, 0.2712731, 0.2539759, 0.23877773, 0.23659036, 0.2474077, 0.26026407, 0.2531851, 0.24818122, 0.2451932, 0.22205637, 0.17844734, 0.13634777, 0.117989056, 0.100583926, 0.08402044, 0.079534315, 0.08709705, 0.11814495, 0.1501902, 0.16086374, 0.15029494, 0.12966806, 0.110100545, 0.09143365, 0.084904365, 0.07907132, 0.07389425, 0.06932417, 0.06532876, 0.08468515, 0.10475198, 0.10293707, 0.079220794, 0.056171007, 0.056412008, 0.057119094, 0.035460655, -0.008764204, -0.05306372, -0.074959025, -0.07466355, -0.0749844, -0.09871813, -0.12328213, -0.1261669, -0.10740516, -0.0667881, -0.02673867, -0.009768106, -0.015748646, -0.04472629, -0.074081935, -0.08123779, -0.07766474, -0.06332756, -0.049519803, -0.03611526, -0.00014982373, 0.03581432, 0.049214847, 0.06301615, 0.077344224, 0.069505595, 0.050834805, 0.044003617, 0.0375335, 0.03137532, 0.025474804, 0.019785952, 0.037126265, 0.066201024, 0.084415555, 0.09192626, 0.0888084, 0.06364058, 0.01616954, -0.031163506, -0.055891916, -0.081069164, -0.10692933, -0.11093081, -0.093117386, -0.07608364, -0.082470335, -0.089541815, -0.08596114, -0.071692415, -0.058022846, -0.05624292, -0.06634036, -0.09979378, -0.13409147, -0.1468305, -0.13816045, -0.10797903, -0.07870866, -0.07287324, -0.090438105, -0.108760536, -0.1052445, -0.10259872, -0.10079707, -0.07707125, -0.05399389, -0.03135126, -0.008972554, -0.009538375, -0.010183048, 0.011960189, 0.034204077, 0.03386942, 0.010953216, -0.011870727, -0.011921028, -0.01206986, -0.035187628, -0.0586017, -0.05966446, -0.06121934, -0.07469129, -0.077377155, -0.08069982, -0.08469075, -0.06658958, -0.026214156, 0.013938282, 0.031335916, 0.03755931, 0.032663845, 0.028039493, 0.02364578, 
0.019448273, 0.015410811, 6.57849e-5, -0.026714515, -0.053720325, -0.06974385, -0.07493675, -0.08074484, -0.087223075, -0.09441606, -0.10239325, -0.11120758, -0.12094623, -0.10898025, -0.097921915, -0.1104141, -0.12382683, -0.11556704, -0.096906945, -0.067652866, -0.02755348, 0.00088725425, 0.006463351, -0.010779781, -0.050979577, -0.0916128, -0.11022386, -0.10701231, -0.08194301, -0.057563663, -0.056484282, -0.0786979, -0.124341056, -0.17104116, -0.19661525, -0.20139043, -0.20780624, -0.21593478, -0.20351833, -0.18159287, -0.14989418, -0.11945823, -0.10133224, -0.095413044, -0.10165569, -0.1200975, -0.13954471, -0.13749777, -0.12526177, -0.12539245, -0.12655653, -0.10607838, -0.086489625, -0.0675997, -0.0492768, -0.05419806, -0.08239602, -0.11128938, -0.14106949, -0.17203814, -0.18189804, -0.17077395, -0.1610655, -0.17521349, -0.19081862, -0.18555215, -0.18181418, -0.20200977, -0.22388937, -0.23644717, -0.23984325, -0.22301759, -0.18573564, -0.15001905, -0.115485616, -0.08192864, -0.071825184], "type": "scatter", "name": "θ", "yaxis": "y2", "x": [0.0, 0.04, 0.08, 0.12, 0.16, 0.19999999, 0.23999998, 0.27999997, 0.31999996, 0.35999995, 0.39999995, 0.43999994, 0.47999993, 0.5199999, 0.55999994, 0.59999996, 0.64, 0.68, 0.72, 0.76000005, 0.8000001, 0.8400001, 0.8800001, 0.92000014, 0.96000016, 1.0000001, 1.0400001, 1.08, 1.12, 1.16, 1.1999999, 1.2399999, 1.2799999, 1.3199998, 1.3599998, 1.3999997, 1.4399997, 1.4799997, 1.5199996, 1.5599996, 1.5999995, 1.6399995, 1.6799995, 1.7199994, 1.7599994, 1.7999994, 1.8399993, 1.8799993, 1.9199992, 1.9599992, 1.9999992, 2.0399992, 2.0799992, 2.1199992, 2.1599991, 2.199999, 2.239999, 2.279999, 2.319999, 2.359999, 2.399999, 2.4399989, 2.4799988, 2.5199988, 2.5599988, 2.5999987, 2.6399987, 2.6799986, 2.7199986, 2.7599986, 2.7999985, 2.8399985, 2.8799984, 2.9199984, 2.9599984, 2.9999983, 3.0399983, 3.0799983, 3.1199982, 3.1599982, 3.1999981, 3.239998, 3.279998, 3.319998, 3.359998, 3.399998, 3.439998, 3.4799979, 3.5199978, 3.5599978, 
3.5999978, 3.6399977, 3.6799977, 3.7199976, 3.7599976, 3.7999976, 3.8399975, 3.8799975, 3.9199975, 3.9599974, 3.9999974, 4.0399976, 4.0799975, 4.1199975, 4.1599975, 4.1999974, 4.2399974, 4.2799973, 4.3199973, 4.3599973, 4.399997, 4.439997, 4.479997, 4.519997, 4.559997, 4.599997, 4.639997, 4.679997, 4.719997, 4.759997, 4.799997, 4.839997, 4.879997, 4.9199967, 4.9599967, 4.9999967, 5.0399966, 5.0799966, 5.1199965, 5.1599965, 5.1999965, 5.2399964, 5.2799964, 5.3199964, 5.3599963, 5.3999963, 5.4399962, 5.479996, 5.519996, 5.559996, 5.599996, 5.639996, 5.679996, 5.719996, 5.759996, 5.799996, 5.839996, 5.879996, 5.919996, 5.9599957, 5.9999957, 6.0399957, 6.0799956, 6.1199956, 6.1599956, 6.1999955, 6.2399955, 6.2799954, 6.3199954, 6.3599954, 6.3999953, 6.4399953, 6.4799953, 6.519995, 6.559995, 6.599995, 6.639995, 6.679995, 6.719995, 6.759995, 6.799995, 6.839995, 6.879995, 6.919995, 6.959995, 6.9999948, 7.0399947, 7.0799947, 7.1199946, 7.1599946, 7.1999946, 7.2399945, 7.2799945, 7.3199944, 7.3599944, 7.3999944, 7.4399943, 7.4799943, 7.5199943, 7.559994, 7.599994, 7.639994, 7.679994, 7.719994, 7.759994, 7.799994, 7.839994, 7.879994, 7.919994, 7.959994, 7.999994, 8.039994, 8.079994, 8.119994, 8.159994, 8.199994, 8.239994, 8.279994, 8.319994, 8.359994, 8.399994, 8.439994, 8.479994, 8.519994, 8.559994, 8.599994, 8.639994, 8.679994, 8.719994, 8.759994, 8.7999935, 8.8399935, 8.879993, 8.919993, 8.959993, 8.999993, 9.039993, 9.079993, 9.119993, 9.159993, 9.199993, 9.239993, 9.279993, 9.319993, 9.359993, 9.399993, 9.439993, 9.479993, 9.519993, 9.559993, 9.599993, 9.639993, 9.679993, 9.719993, 9.759993, 9.799993, 9.839993, 9.8799925, 9.919992, 9.959992, 9.999992, 10.039992, 10.079992, 10.119992, 10.159992, 10.199992, 10.239992, 10.279992, 10.319992, 10.359992, 10.399992, 10.439992, 10.479992, 10.519992, 10.559992, 10.599992, 10.639992, 10.679992, 10.719992, 10.759992, 10.799992, 10.839992, 10.879992, 10.9199915, 10.959991, 10.999991, 11.039991, 11.079991, 11.119991, 11.159991, 
11.199991, 11.239991, 11.279991, 11.319991, 11.359991, 11.399991, 11.439991, 11.479991, 11.519991, 11.559991, 11.599991, 11.639991, 11.679991, 11.719991, 11.759991, 11.799991, 11.839991, 11.879991, 11.919991, 11.9599905, 11.99999, 12.03999, 12.07999, 12.11999, 12.15999, 12.19999, 12.23999, 12.27999, 12.31999, 12.35999, 12.39999, 12.43999, 12.47999, 12.51999, 12.55999, 12.59999, 12.63999, 12.67999, 12.71999, 12.75999, 12.79999, 12.83999, 12.87999, 12.91999, 12.95999, 12.9999895, 13.039989, 13.079989, 13.119989, 13.159989, 13.199989, 13.239989, 13.279989, 13.319989, 13.359989, 13.399989, 13.439989, 13.479989, 13.519989, 13.559989, 13.599989, 13.639989, 13.679989, 13.719989, 13.759989, 13.799989, 13.839989, 13.879989, 13.919989, 13.959989, 13.999989, 14.0399885, 14.0799885, 14.119988, 14.159988, 14.199988, 14.239988, 14.279988, 14.319988, 14.359988, 14.399988, 14.439988, 14.479988, 14.519988, 14.559988, 14.599988, 14.639988, 14.679988, 14.719988, 14.759988, 14.799988, 14.839988, 14.879988, 14.919988, 14.959988, 14.999988, 15.039988, 15.079988, 15.1199875, 15.159987, 15.199987, 15.239987, 15.279987, 15.319987, 15.359987, 15.399987, 15.439987, 15.479987, 15.519987, 15.559987, 15.599987, 15.639987, 15.679987, 15.719987, 15.759987, 15.799987, 15.839987, 15.879987, 15.919987, 15.959987, 15.999987, 16.039988, 16.079988, 16.11999, 16.15999, 16.199991, 16.239992, 16.279993, 16.319994, 16.359995, 16.399996, 16.439997, 16.479998, 16.519999, 16.56, 16.6, 16.640001, 16.680002, 16.720003, 16.760004, 16.800005, 16.840006, 16.880007, 16.920008, 16.960009, 17.00001, 17.04001, 17.080011, 17.120012, 17.160013, 17.200014, 17.240015, 17.280016, 17.320017, 17.360018, 17.400019, 17.44002, 17.48002, 17.520021, 17.560022, 17.600023, 17.640024, 17.680025, 17.720026, 17.760027, 17.800028, 17.840029, 17.88003, 17.92003, 17.960032, 18.000032, 18.040033, 18.080034, 18.120035, 18.160036, 18.200037, 18.240038, 18.280039, 18.32004, 18.36004, 18.400042, 18.440042, 18.480043, 18.520044, 18.560045, 
18.600046, 18.640047, 18.680048, 18.720049, 18.76005, 18.80005, 18.840052, 18.880053, 18.920053, 18.960054, 19.000055, 19.040056, 19.080057, 19.120058, 19.160059, 19.20006, 19.24006, 19.280062, 19.320063, 19.360064, 19.400064, 19.440065, 19.480066, 19.520067, 19.560068, 19.600069, 19.64007, 19.68007, 19.720072, 19.760073, 19.800074, 19.840075, 19.880075, 19.920076, 19.960077, 20.000078, 20.04008, 20.08008, 20.12008, 20.160082, 20.200083, 20.240084, 20.280085, 20.320086, 20.360086, 20.400087, 20.440088, 20.48009, 20.52009, 20.560091, 20.600092, 20.640093, 20.680094, 20.720095, 20.760096, 20.800097, 20.840097, 20.880098, 20.9201, 20.9601, 21.000101, 21.040102, 21.080103, 21.120104, 21.160105, 21.200106, 21.240107, 21.280107, 21.320108, 21.36011, 21.40011, 21.440111, 21.480112, 21.520113, 21.560114, 21.600115, 21.640116, 21.680117, 21.720118, 21.760118, 21.80012, 21.84012, 21.880121, 21.920122, 21.960123, 22.000124, 22.040125, 22.080126, 22.120127, 22.160128, 22.200129, 22.24013, 22.28013, 22.320131, 22.360132, 22.400133, 22.440134, 22.480135, 22.520136, 22.560137, 22.600138, 22.640139, 22.68014, 22.72014, 22.760141, 22.800142, 22.840143, 22.880144, 22.920145, 22.960146, 23.000147, 23.040148, 23.080149, 23.12015, 23.16015, 23.200151, 23.240152, 23.280153, 23.320154, 23.360155, 23.400156, 23.440157, 23.480158, 23.520159, 23.56016, 23.60016, 23.640162, 23.680162, 23.720163, 23.760164, 23.800165, 23.840166, 23.880167, 23.920168, 23.960169, 24.00017, 24.04017, 24.080172, 24.120173, 24.160173, 24.200174, 24.240175, 24.280176, 24.320177, 24.360178, 24.400179, 24.44018, 24.48018, 24.520182, 24.560183, 24.600183, 24.640184, 24.680185, 24.720186, 24.760187, 24.800188, 24.840189, 24.88019, 24.92019, 24.960192, 25.000193, 25.040194, 25.080194, 25.120195, 25.160196, 25.200197, 25.240198, 25.2802, 25.3202, 25.3602, 25.400202, 25.440203, 25.480204, 25.520205, 25.560205, 25.600206, 25.640207, 25.680208, 25.72021, 25.76021, 25.80021, 25.840212, 25.880213, 25.920214, 25.960215, 
26.000216, 26.040216, 26.080217, 26.120218, 26.16022, 26.20022, 26.240221, 26.280222, 26.320223, 26.360224, 26.400225, 26.440226, 26.480227, 26.520227, 26.560228, 26.60023, 26.64023, 26.680231, 26.720232, 26.760233, 26.800234, 26.840235, 26.880236, 26.920237, 26.960238, 27.000238, 27.04024, 27.08024, 27.120241, 27.160242, 27.200243, 27.240244, 27.280245, 27.320246, 27.360247, 27.400248, 27.440248, 27.48025, 27.52025, 27.560251, 27.600252, 27.640253, 27.680254, 27.720255, 27.760256, 27.800257, 27.840258, 27.880259, 27.92026, 27.96026, 28.000261, 28.040262, 28.080263, 28.120264, 28.160265, 28.200266, 28.240267, 28.280268, 28.320269, 28.36027, 28.40027, 28.440271, 28.480272, 28.520273, 28.560274, 28.600275, 28.640276, 28.680277, 28.720278, 28.760279, 28.80028, 28.84028, 28.880281, 28.920282, 28.960283, 29.000284, 29.040285, 29.080286, 29.120287, 29.160288, 29.200289, 29.24029, 29.28029, 29.320292, 29.360292, 29.400293, 29.440294, 29.480295, 29.520296, 29.560297, 29.600298, 29.640299, 29.6803, 29.7203, 29.760302, 29.800303, 29.840303, 29.880304, 29.920305, 29.960306, 30.000307, 30.040308, 30.080309, 30.12031, 30.16031, 30.200312, 30.240313, 30.280313, 30.320314, 30.360315, 30.400316, 30.440317, 30.480318, 30.520319, 30.56032, 30.60032, 30.640322, 30.680323, 30.720324, 30.760324, 30.800325, 30.840326, 30.880327, 30.920328, 30.96033, 31.00033, 31.04033, 31.080332, 31.120333, 31.160334, 31.200335, 31.240335, 31.280336, 31.320337, 31.360338, 31.40034, 31.44034, 31.480341, 31.520342, 31.560343, 31.600344, 31.640345, 31.680346, 31.720346, 31.760347, 31.800348, 31.84035, 31.88035, 31.920351, 31.960352, 32.00035, 32.04035, 32.080353, 32.120354, 32.160355, 32.200356, 32.240356, 32.280357, 32.32036, 32.36036, 32.40036, 32.44036, 32.480362, 32.520363, 32.560364, 32.600365, 32.640366, 32.680367, 32.720367, 32.76037, 32.80037, 32.84037, 32.88037, 32.920372, 32.960373, 33.000374, 33.040375, 33.080376, 33.120377, 33.160378, 33.20038, 33.24038, 33.28038, 33.32038, 33.360382, 
33.400383, 33.440384, 33.480385, 33.520386, 33.560387, 33.600388, 33.64039, 33.68039, 33.72039, 33.76039, 33.800392, 33.840393, 33.880394, 33.920395, 33.960396, 34.000397, 34.040398, 34.0804, 34.1204, 34.1604, 34.2004, 34.240402, 34.280403, 34.320404, 34.360405, 34.400406, 34.440407, 34.480408, 34.52041, 34.56041, 34.60041, 34.64041, 34.680412, 34.720413, 34.760414, 34.800415, 34.840416, 34.880417, 34.920418, 34.96042, 35.00042, 35.04042, 35.08042, 35.120422, 35.160423, 35.200424, 35.240425, 35.280426, 35.320427, 35.360428, 35.40043, 35.44043, 35.48043, 35.52043, 35.560432, 35.600433, 35.640434, 35.680435, 35.720436, 35.760437, 35.800438, 35.84044, 35.88044, 35.92044, 35.96044, 36.000443, 36.040443, 36.080444, 36.120445, 36.160446, 36.200447, 36.240448, 36.28045, 36.32045, 36.36045, 36.40045, 36.440453, 36.480453, 36.520454, 36.560455, 36.600456, 36.640457, 36.680458, 36.72046, 36.76046, 36.80046, 36.84046, 36.880463, 36.920464, 36.960464, 37.000465, 37.040466, 37.080467, 37.12047, 37.16047, 37.20047, 37.24047, 37.28047, 37.320473, 37.360474, 37.400475, 37.440475, 37.480476, 37.520477, 37.56048, 37.60048, 37.64048, 37.68048, 37.72048, 37.760483, 37.800484, 37.840485, 37.880486, 37.920486, 37.960487, 38.00049, 38.04049, 38.08049, 38.12049, 38.160492, 38.200493, 38.240494, 38.280495, 38.320496, 38.360497, 38.400497, 38.4405, 38.4805, 38.5205, 38.5605, 38.600502, 38.640503, 38.680504, 38.720505, 38.760506, 38.800507, 38.840508, 38.88051, 38.92051, 38.96051, 39.00051, 39.040512, 39.080513, 39.120514, 39.160515, 39.200516, 39.240517, 39.280518, 39.32052, 39.36052, 39.40052, 39.44052, 39.480522, 39.520523, 39.560524, 39.600525, 39.640526, 39.680527, 39.720528, 39.76053, 39.80053, 39.84053, 39.88053, 39.920532, 39.960533]}, {"y": [0.0, -0.00093293114, -1.1446139, -0.0033041239, 1.1379578, -0.0057849884, 1.1359303, -0.007592559, 1.1345052, 2.277039, 1.1344777, -0.007654071, -1.1498702, -2.2926104, -1.1503701, -2.2939868, -2.2954237, -1.1563231, -2.29983, -2.3025405, 
-1.1681821, -0.03638637, 1.0940988, 1.0880625, 2.2185578, 3.3503754, 2.2075465, 3.3413486, 2.1977491, 3.3330925, 2.1890655, 1.044928, 2.180572, 3.3161805, 3.3123507, 4.4497705, 5.589135, 4.4455433, 3.301728, 4.4434953, 3.3002317, 3.3000422, 2.1571465, 3.2999017, 4.4428625, 3.3006592, 2.1587746, 2.1595078, 3.3029284, 3.303717, 2.1619983, 1.0200357, 2.1632304, 3.306312, 2.1638439, 3.3070407, 2.1647239, 1.0222374, 2.1650326, 2.1647666, 2.1643615, 3.3067048, 4.4493628, 3.306863, 2.1647077, 1.0223705, 2.165297, 1.0222633, 2.1644666, 3.3064752, 3.3059344, 4.4485817, 5.5917034, 4.450218, 3.3100615, 2.1706157, 3.3149478, 2.1759443, 2.178676, 1.0395123, 1.0416634, 2.1858385, 1.0453684, 1.0468559, -0.09479046, 1.0484549, 2.1914582, 3.3345265, 2.1923046, 3.335844, 3.3369865, 2.1962504, 3.3403666, 4.484124, 3.3456044, 2.208867, 1.0728667, 2.21702, 1.0813203, 2.2253563, 1.0902627, -0.04485023, -1.1812388, -0.037305593, -1.1756942, -0.031563997, -1.1715637, -2.3128567, -1.169621, -0.02675891, 1.1158862, 2.2587352, 3.4019625, 2.2606308, 2.26252, 1.1235325, 2.267614, 1.1304086, -0.0057462454, -0.0016803304, -1.1380011, -2.2755482, -1.1315224, 0.012840748, 0.015576998, -1.12323, -1.1206765, -2.2605472, -1.1165371, 0.02751112, 1.1715167, 0.0316844, -1.1075186, 0.036836624, -1.1020733, 0.04229331, -1.0961531, -2.2347178, -1.0903941, -1.0877812, 0.056530714, -1.0819386, -2.2202027, -1.0758225, -2.2144396, -1.0700591, -2.2088516, -1.0644729, 0.07969749, -1.0581074, -2.1950898, -1.0508118, -2.1873443, -3.3240247, -4.461997, -3.3179982, -2.1736827, -3.313275, -2.1690109, -3.30898, -2.1647696, -2.1627088, -3.3026733, -2.1584415, -3.2984936, -2.1542675, -1.0101506, -2.1492424, -3.28762, -4.426332, -3.2820654, -4.421794, -3.2776322, -4.4181795, -5.5595584, -4.4162264, -3.2731493, -4.4159136, -3.2732568, -3.2735283, -4.416705, -3.2745428, -4.4180765, -3.2764878, -2.1349497, -0.9928442, -2.1357617, -2.135419, -3.2775426, -2.1340203, -3.2757053, -4.4174285, -5.559751, -6.7027206, -5.561055, 
-4.420973, -3.2820215, -2.143272, -3.287634, -2.1485384, -3.292886, -2.1536024, -2.1559508, -3.3002415, -3.302597, -2.16336, -3.3077126, -2.168714, -1.0294409, -2.173658, -1.0334461, -2.1775022, -2.1791167, -1.0381136, -2.1819346, -3.3257556, -3.3272362, -2.1869082, -3.3311112, -2.191627, -3.335906, -2.197474, -1.0594043, -2.203794, -2.2069178, -1.0691053, -2.2134829, -1.0758271, -2.2201786, -1.0828984, 0.05458331, -1.0897348, -1.0928607, 0.045449495, -1.0989118, -1.1017523, 0.036947966, 1.176424, 0.032390118, -1.1117274, -2.2557993, -2.2579043, -1.1192616, -2.26336, -1.1264122, 0.009576082, 1.1458302, 2.2835102, 2.281368, 1.1374075, -0.006706476, 1.1339993, -0.009984016, -1.1539629, -0.013584971, -1.1577667, -0.018457651, 1.120464, -0.023904324, 1.1150998, -0.029268146, 1.1096619, 2.2489848, 1.1048069, 2.245205, 1.1012338, -0.04274082, -1.1866548, -1.1885823, -0.050189137, 1.086992, -0.057188153, -1.2004849, -1.2037287, -0.07082772, 1.0595709, 2.1887844, 3.3182652, 4.449437, 5.583518, 4.441048, 3.297163, 2.1529183, 3.2892375, 3.285353, 4.4223065, 4.419233, 3.2749076, 4.413692, 5.553497, 4.409618, 4.4083447, 3.2646286, 4.4063764, 3.2628577, 2.119304, 3.2607589, 3.2593818, 2.1154232, 3.2558038, 4.395882, 5.5365214, 4.392741, 5.5345964, 6.6772146, 7.8202066, 6.678935, 5.53999, 4.4030714, 3.2672257, 2.131225, 3.2754087, 2.1385932, 1.0006912, -0.13913572, 1.0043849, 2.148008, 2.1486907, 2.1491094, 1.0063454, 2.1487389, 1.005218, 2.1464458, 3.2871265, 4.428039, 3.2842505, 4.4260044, 5.5684247, 4.425861, 5.5693254, 4.428382, 3.2885427, 2.1490111, 3.2933087, 4.437501, 3.2987077, 2.1604996, 1.0218084, 2.1660385, 1.0261619, -0.11481178, 1.0286413, -0.11394358, 1.0284542, 2.170177, 3.3117473, 4.453796, 3.3107882, 4.453861, 3.3119044, 2.1704433, 3.31428, 2.173162, 3.317144, 2.1764536, 3.320574, 3.3225806, 2.183107, 2.1856506, 3.3299537, 2.191369, 2.1944609, 1.0562984, 2.2006855, 1.0622816, 2.2066722, 1.068211, 2.212601, 3.356707, 2.2194133, 2.2232904, 1.0877316, -0.047918558, 
-0.044169463, -1.1818066, -0.037745595, 1.1066209, -0.03240955, -1.1720976, -0.028068304, -1.1691073, -0.025455594, 1.1181288, -0.02367282, 1.1199324, 1.12095, -0.020466805, -1.162158, -0.01879239, 1.1244806, -0.017740965, 1.1256397, 1.1264107, -0.015276551, 1.1284353, -0.013009548, 1.1308274, 1.1322529, 2.2761931, 1.1365414, -0.0021648407, 1.1421791, 0.004219532, 1.1484903, 0.011521101, -1.1251447, 0.01915133, -1.1177672, 0.026549459, 1.170514, 1.1740997, 0.039738417, -1.0933486, -2.2265162, -3.3611054, -4.4981937, -3.3548665, -4.4948773, -3.3512998, -2.2076943, -1.064168, 0.0794394, -1.0616841, -2.2023487, -1.0582582, 0.085758805, 1.229243, 0.09088147, 1.2332077, 0.09838104, -1.033194, -2.1626387, -3.2913399, -4.420763, -5.5522966, -4.409995, -4.405208, -4.4006476, -3.2567506, -4.391613, -5.5272965, -6.6649156, -5.521218, -6.661351, -5.517667, -4.373993, -3.230346, -2.0866342, -3.2272224, -4.3669972, -5.5068364, -4.362667, -5.503173, -5.501738, -4.357924, -5.499488, -6.6415944, -5.4986663, -5.4989853, -4.3568115, -5.500291, -6.6438894, -5.5030346, -5.505108, -4.365775, -3.2264862, -2.08631, -3.2301354, -2.0884151, -3.2316241, -2.088752, -3.2309535, -3.2299345, -2.086139, -3.226429, -2.0824332, -3.2207656, -4.357833, -3.2136893, -4.3496637, -5.485241, -6.621676, -7.7600164, -6.61631, -7.757092, -8.899321, -8.899428, -7.757897, -6.61829, -5.4802504, -6.624304, -5.48794, -4.352516, -3.2167726, -2.079521, -3.2234192, -4.3677645, -5.5120792, -4.3738413, -5.5180883, -4.3810143, -3.2443967, -4.388681, -5.5326157, -4.3970127, -5.5402956, -5.544445, -4.412738, -4.41847, -3.289763, -2.1617308, -1.0329041, 0.098109245, 0.09403723, -1.0489174, -2.1927295, -1.0580246, 0.07765055, -1.0663018, 0.071074724, 1.2100941, 0.06631839, -1.0777009, -2.2217336, -3.36569, -2.2260714, -3.370061, -2.232467, -3.3759284, -4.517281, -3.383359, -2.2533135, -1.1260805, -2.2649891, -1.1403214, -0.017860174, 1.1037798, 2.2260132, 1.0878642, 2.2110398, 3.3355136, 4.4626555, 3.3227706, 2.1811907, 
3.310524, 2.1684418, 3.298291, 4.4290895, 5.562172, 4.419477, 3.2756557, 4.41103, 5.5477257, 4.403782, 5.5425186, 4.398429, 3.25417, 2.109991, 3.2490726, 3.2461271, 3.2429905, 4.38066, 3.2363014, 4.373659, 5.511266, 4.366953, 4.3639145, 3.2195823, 3.2164018, 4.3531775, 5.489674, 4.345399, 5.4821806, 6.6196713, 5.4754696, 6.614397, 5.470185, 5.467989, 6.6082425, 5.464227, 4.3202033, 4.3184295, 5.458418, 4.3141546, 5.4535193, 6.592863, 6.59076, 7.7315397, 6.5878925, 5.4443407, 6.5863667, 7.728818, 6.586159, 5.44388, 6.587266, 5.4452925, 6.5888805, 5.4472637, 6.5910473, 5.4498577, 6.5938263, 5.453157, 4.3124647, 5.4564104, 6.600356, 5.4597244, 6.603868, 5.4640546, 6.6083136, 7.7520676, 6.6144514, 5.478664, 4.3435774, 3.2079058, 4.351965, 3.2149959, 4.359266, 5.5035844, 4.366442, 3.2293725, 3.2326076, 2.0938966, 2.0958202, 3.2397714, 4.3837585, 3.2427673, 4.3867126, 3.24564, 4.389568, 5.533489, 6.677118, 5.5380583, 4.4009495, 3.2648382, 2.1285095, 3.2727299, 4.41695, 3.280622, 4.4246016, 3.2894766, 2.1548557, 1.0194312, 2.1633935, 3.3076253, 4.4513264, 3.3162813, 3.3210368, 3.3260317, 2.1943939, 2.2001936, 3.3422213, 4.482053, 3.3535428, 2.228382, 3.3650022, 2.2428656, 1.1235981, 0.0059211254, -1.1114637, -2.2298563, -1.0944554, -2.2140377, -3.3351445, -4.4590945, -3.3212461, -3.3145752, -4.442437, -3.3014696, -2.1595802, -3.2887335, -4.418499, -3.2762587, -4.4072313, -5.5397367, -6.674958, -5.5320582, -6.6704545, -6.6689196, -5.525277, -4.38162, -3.238011, -2.0943274, -3.2351403, -3.2331104, -4.3727493, -5.512564, -4.3684187, -3.2242608, -4.36424, -5.5042305, -6.645071, -5.5014744, -4.357999, -5.5001893, -6.6428146, -6.643188, -6.6441665, -5.5037007, -5.506023, -4.367199, -3.2284527, -2.0887918, -3.2327409, -2.091483, -3.234961, -2.092475, -0.9494363, -2.091056, -3.2313304, -4.3708277, -3.2265038, -4.3654447, -5.504424, -6.6443954, -6.6431737, -5.499686, -5.4992642, -4.356267, -4.35613, -5.4988575, -6.641898, -5.4999485, -4.3585787, -4.3598638, -3.2183833, -4.361975, 
-5.505562, -6.6492286, -5.5086784, -5.5109744, -6.6548157, -5.5172205, -4.381278, -3.245839, -2.1096385, -0.9715903, 0.16886532, -0.97430885, -2.1174793, -2.1173604, -3.2595925, -4.40168, -5.544157, -4.4015036, -5.5448704, -4.4033923, -5.5472918, -4.4073544, -3.2682843, -4.41263, -3.2742183, -2.1358404, -3.280198, -3.2829976, -4.427282, -5.5708833, -4.434186, -3.2995024, -2.1656525, -3.309277, -2.1758153, -1.0421592, -2.1858351, -1.0513406, 0.084329486, -1.0595492, -2.2038558, -1.0669276, -2.2112393, -1.0744015, -2.2186768, -3.362423, -2.227065, -1.0930841, 0.040828228, 1.1759932, 0.0322299, 1.1693698, 0.025226116, 1.1639556, 0.019753695, -1.124554, -2.2686658, -1.1302344, -1.1336018, -2.2773235, -1.1417001, -0.0076055527, 1.1262486, -0.017484665, 1.116706, -0.027083158, 1.1070837, 2.2417502, 1.0978645, 1.0936097, -0.050383568, 1.0845743, 2.2193418, 1.0753756, 2.2105432, 3.3465374, 4.484448, 3.3407373, 2.1965365, 3.3367553, 2.1927376, 2.1912234, 1.0472354, -0.09668207, 1.0430018, 2.181695, 3.3203855, 3.3179522, 3.3158479, 3.3140655, 2.1700304, 3.3109632, 2.167036, 3.3082442, 2.164413, 1.0205662, 2.1612234, 3.301529, 4.4423194, 3.298572, 2.1548696, 3.2964535, 4.4383388, 5.580944, 4.4386272, 3.2971568, 2.1560369, 3.299952, 4.443863, 3.303618, 2.163948, 1.0239388, -0.11717379, 1.0262108, 2.1693406, 3.3123908, 4.455661, 3.3140285, 2.1729758, 2.1744487, 1.0331821, 2.176859, 1.0349228, -0.10758448, 1.0349281, 2.176871, 1.0331981, 2.1744728, 3.3156397, 3.3145702, 2.170989, 3.3129597, 2.169561, 3.3118072, 2.1685915, 1.0253042, 1.0244465, 2.1656098, 2.163986, 3.3046815, 4.4458733, 5.588054, 4.4453573, 3.3033967, 4.4471345, 3.3064003, 2.1662428, 1.0258102, 2.1697612, 3.313714, 2.1728888, 2.174514, 1.0334655, 2.1772404, 2.1783032, 3.322051, 4.465817, 3.3255904, 2.186378, 1.0472609, -0.09271705, 1.051174, 2.195093, 3.339014, 2.1985083, 1.0582942, 2.2024632, 1.0620944, -0.07875371, 1.0649717, -0.07699859, 1.0661496, 2.2091396, 1.0665089, -0.07632327, 1.0660586, 2.208179, 
3.350576, 2.2077677, 1.0651776, 2.2082157, 3.3514144, 2.2095346, 1.0680096, 2.2117627, 2.2129967, 1.0719081, 2.2158642, 1.0749542, -0.0661999, -1.2082006, -0.065304756, 1.0770775, 1.0764387, 2.2185364, 1.0751984, 2.2174997, 1.0743011, 2.2167792, 2.2166393, 3.359671, 2.217628, 2.2187386, 1.0775486, 2.2214882, 1.0806087, 2.2246647, 1.0841864, 2.2283568, 1.088395, -0.05161345, 1.0925227, 2.2366552, 1.0967023, -0.043141127, -0.041263357, -1.1823418, -1.1816378, -0.038609505, 1.1041417, 2.2469988, 3.390213, 2.2487345, 1.1082637, -0.031901598, -1.1726143, -0.028857589, 1.1148539, 2.258607, 3.4022384, 2.262574, 1.1246513, -0.012433171, 1.131853, -0.004755497, -1.1414905, -1.1382118, -2.276894, -1.1328629, -1.1309078, 0.013213277, 0.015029476, 1.1591543, 0.019535303, -1.1196895, -1.1173155, 0.026981592, -1.1126468, -2.2526882, -1.1086665, -2.2497892, -1.10611, 0.03753102, 1.1812391, 0.040545225, 1.1845723, 0.045607686, -1.0923266, -2.2302945, -1.0859402, 0.058369994, -1.0794067, -2.2169008, -1.072533, 0.07163179, -1.0651387, -1.061091, -2.1966152, -3.332561, -4.4700885, -3.3262155, -4.4659925, -3.322135, -2.1782184], "type": "scatter", "name": "ẋ", "yaxis": "y", "x": [0.0, 0.04, 0.08, 0.12, 0.16, 0.19999999, 0.23999998, 0.27999997, 0.31999996, 0.35999995, 0.39999995, 0.43999994, 0.47999993, 0.5199999, 0.55999994, 0.59999996, 0.64, 0.68, 0.72, 0.76000005, 0.8000001, 0.8400001, 0.8800001, 0.92000014, 0.96000016, 1.0000001, 1.0400001, 1.08, 1.12, 1.16, 1.1999999, 1.2399999, 1.2799999, 1.3199998, 1.3599998, 1.3999997, 1.4399997, 1.4799997, 1.5199996, 1.5599996, 1.5999995, 1.6399995, 1.6799995, 1.7199994, 1.7599994, 1.7999994, 1.8399993, 1.8799993, 1.9199992, 1.9599992, 1.9999992, 2.0399992, 2.0799992, 2.1199992, 2.1599991, 2.199999, 2.239999, 2.279999, 2.319999, 2.359999, 2.399999, 2.4399989, 2.4799988, 2.5199988, 2.5599988, 2.5999987, 2.6399987, 2.6799986, 2.7199986, 2.7599986, 2.7999985, 2.8399985, 2.8799984, 2.9199984, 2.9599984, 2.9999983, 3.0399983, 3.0799983, 3.1199982, 
3.1599982, 3.1999981, 3.239998, 3.279998, 3.319998, 3.359998, 3.399998, 3.439998, 3.4799979, 3.5199978, 3.5599978, 3.5999978, 3.6399977, 3.6799977, 3.7199976, 3.7599976, 3.7999976, 3.8399975, 3.8799975, 3.9199975, 3.9599974, 3.9999974, 4.0399976, 4.0799975, 4.1199975, 4.1599975, 4.1999974, 4.2399974, 4.2799973, 4.3199973, 4.3599973, 4.399997, 4.439997, 4.479997, 4.519997, 4.559997, 4.599997, 4.639997, 4.679997, 4.719997, 4.759997, 4.799997, 4.839997, 4.879997, 4.9199967, 4.9599967, 4.9999967, 5.0399966, 5.0799966, 5.1199965, 5.1599965, 5.1999965, 5.2399964, 5.2799964, 5.3199964, 5.3599963, 5.3999963, 5.4399962, 5.479996, 5.519996, 5.559996, 5.599996, 5.639996, 5.679996, 5.719996, 5.759996, 5.799996, 5.839996, 5.879996, 5.919996, 5.9599957, 5.9999957, 6.0399957, 6.0799956, 6.1199956, 6.1599956, 6.1999955, 6.2399955, 6.2799954, 6.3199954, 6.3599954, 6.3999953, 6.4399953, 6.4799953, 6.519995, 6.559995, 6.599995, 6.639995, 6.679995, 6.719995, 6.759995, 6.799995, 6.839995, 6.879995, 6.919995, 6.959995, 6.9999948, 7.0399947, 7.0799947, 7.1199946, 7.1599946, 7.1999946, 7.2399945, 7.2799945, 7.3199944, 7.3599944, 7.3999944, 7.4399943, 7.4799943, 7.5199943, 7.559994, 7.599994, 7.639994, 7.679994, 7.719994, 7.759994, 7.799994, 7.839994, 7.879994, 7.919994, 7.959994, 7.999994, 8.039994, 8.079994, 8.119994, 8.159994, 8.199994, 8.239994, 8.279994, 8.319994, 8.359994, 8.399994, 8.439994, 8.479994, 8.519994, 8.559994, 8.599994, 8.639994, 8.679994, 8.719994, 8.759994, 8.7999935, 8.8399935, 8.879993, 8.919993, 8.959993, 8.999993, 9.039993, 9.079993, 9.119993, 9.159993, 9.199993, 9.239993, 9.279993, 9.319993, 9.359993, 9.399993, 9.439993, 9.479993, 9.519993, 9.559993, 9.599993, 9.639993, 9.679993, 9.719993, 9.759993, 9.799993, 9.839993, 9.8799925, 9.919992, 9.959992, 9.999992, 10.039992, 10.079992, 10.119992, 10.159992, 10.199992, 10.239992, 10.279992, 10.319992, 10.359992, 10.399992, 10.439992, 10.479992, 10.519992, 10.559992, 10.599992, 10.639992, 10.679992, 10.719992, 10.759992, 
10.799992, 10.839992, 10.879992, 10.9199915, 10.959991, 10.999991, 11.039991, 11.079991, 11.119991, 11.159991, 11.199991, 11.239991, 11.279991, 11.319991, 11.359991, 11.399991, 11.439991, 11.479991, 11.519991, 11.559991, 11.599991, 11.639991, 11.679991, 11.719991, 11.759991, 11.799991, 11.839991, 11.879991, 11.919991, 11.9599905, 11.99999, 12.03999, 12.07999, 12.11999, 12.15999, 12.19999, 12.23999, 12.27999, 12.31999, 12.35999, 12.39999, 12.43999, 12.47999, 12.51999, 12.55999, 12.59999, 12.63999, 12.67999, 12.71999, 12.75999, 12.79999, 12.83999, 12.87999, 12.91999, 12.95999, 12.9999895, 13.039989, 13.079989, 13.119989, 13.159989, 13.199989, 13.239989, 13.279989, 13.319989, 13.359989, 13.399989, 13.439989, 13.479989, 13.519989, 13.559989, 13.599989, 13.639989, 13.679989, 13.719989, 13.759989, 13.799989, 13.839989, 13.879989, 13.919989, 13.959989, 13.999989, 14.0399885, 14.0799885, 14.119988, 14.159988, 14.199988, 14.239988, 14.279988, 14.319988, 14.359988, 14.399988, 14.439988, 14.479988, 14.519988, 14.559988, 14.599988, 14.639988, 14.679988, 14.719988, 14.759988, 14.799988, 14.839988, 14.879988, 14.919988, 14.959988, 14.999988, 15.039988, 15.079988, 15.1199875, 15.159987, 15.199987, 15.239987, 15.279987, 15.319987, 15.359987, 15.399987, 15.439987, 15.479987, 15.519987, 15.559987, 15.599987, 15.639987, 15.679987, 15.719987, 15.759987, 15.799987, 15.839987, 15.879987, 15.919987, 15.959987, 15.999987, 16.039988, 16.079988, 16.11999, 16.15999, 16.199991, 16.239992, 16.279993, 16.319994, 16.359995, 16.399996, 16.439997, 16.479998, 16.519999, 16.56, 16.6, 16.640001, 16.680002, 16.720003, 16.760004, 16.800005, 16.840006, 16.880007, 16.920008, 16.960009, 17.00001, 17.04001, 17.080011, 17.120012, 17.160013, 17.200014, 17.240015, 17.280016, 17.320017, 17.360018, 17.400019, 17.44002, 17.48002, 17.520021, 17.560022, 17.600023, 17.640024, 17.680025, 17.720026, 17.760027, 17.800028, 17.840029, 17.88003, 17.92003, 17.960032, 18.000032, 18.040033, 18.080034, 18.120035, 18.160036, 
18.200037, 18.240038, 18.280039, 18.32004, 18.36004, 18.400042, 18.440042, 18.480043, 18.520044, 18.560045, 18.600046, 18.640047, 18.680048, 18.720049, 18.76005, 18.80005, 18.840052, 18.880053, 18.920053, 18.960054, 19.000055, 19.040056, 19.080057, 19.120058, 19.160059, 19.20006, 19.24006, 19.280062, 19.320063, 19.360064, 19.400064, 19.440065, 19.480066, 19.520067, 19.560068, 19.600069, 19.64007, 19.68007, 19.720072, 19.760073, 19.800074, 19.840075, 19.880075, 19.920076, 19.960077, 20.000078, 20.04008, 20.08008, 20.12008, 20.160082, 20.200083, 20.240084, 20.280085, 20.320086, 20.360086, 20.400087, 20.440088, 20.48009, 20.52009, 20.560091, 20.600092, 20.640093, 20.680094, 20.720095, 20.760096, 20.800097, 20.840097, 20.880098, 20.9201, 20.9601, 21.000101, 21.040102, 21.080103, 21.120104, 21.160105, 21.200106, 21.240107, 21.280107, 21.320108, 21.36011, 21.40011, 21.440111, 21.480112, 21.520113, 21.560114, 21.600115, 21.640116, 21.680117, 21.720118, 21.760118, 21.80012, 21.84012, 21.880121, 21.920122, 21.960123, 22.000124, 22.040125, 22.080126, 22.120127, 22.160128, 22.200129, 22.24013, 22.28013, 22.320131, 22.360132, 22.400133, 22.440134, 22.480135, 22.520136, 22.560137, 22.600138, 22.640139, 22.68014, 22.72014, 22.760141, 22.800142, 22.840143, 22.880144, 22.920145, 22.960146, 23.000147, 23.040148, 23.080149, 23.12015, 23.16015, 23.200151, 23.240152, 23.280153, 23.320154, 23.360155, 23.400156, 23.440157, 23.480158, 23.520159, 23.56016, 23.60016, 23.640162, 23.680162, 23.720163, 23.760164, 23.800165, 23.840166, 23.880167, 23.920168, 23.960169, 24.00017, 24.04017, 24.080172, 24.120173, 24.160173, 24.200174, 24.240175, 24.280176, 24.320177, 24.360178, 24.400179, 24.44018, 24.48018, 24.520182, 24.560183, 24.600183, 24.640184, 24.680185, 24.720186, 24.760187, 24.800188, 24.840189, 24.88019, 24.92019, 24.960192, 25.000193, 25.040194, 25.080194, 25.120195, 25.160196, 25.200197, 25.240198, 25.2802, 25.3202, 25.3602, 25.400202, 25.440203, 25.480204, 25.520205, 25.560205, 
[Plot residue: scatter trace "θ̇" on the secondary y-axis, plotted against x from 0 to about 40 in steps of 0.04.]

$\lambda_\theta$: 0.85

$\lambda_\mathbf{w}$: 0.95

$\alpha_{\overline{r}}$:

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:


Bonus Problems: Comparing Techniques

Consider applying the techniques in this chapter to problems where we choose feature vectors and parameters that effectively reproduce the tabular case. That is, we enumerate every state and every state/action pair, and the parameters of each function store a single value for each one. Let's consider the gradients for both the state-value estimate and the policy. We will use two sets of parameters: $\mathbf{w}$ and $\mathbf{\theta}$. $\mathbf{w}_s$ is the parameter for state $s$ and $\mathbf{\theta}_{s, a}$ is the parameter for state/action pair $(s, a)$. Using this notation, $\mathbf{w}$ is a vector and $\mathbf{\theta}$ is a matrix.

Starting with the state-value function:

$$\begin{align} \hat v(s, \mathbf{w}) &= \mathbf{w}_s \\ \nabla \hat v(s, \mathbf{w}) &= \nabla \mathbf{w}_s \\ &= \mathbf{e}_s \end{align}$$

where $\mathbf{e}_s$ is the one-hot vector with a 1 at index $s$ and length equal to the number of states.
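Because the gradient is one-hot, a gradient-based update touches only the entry for the visited state. The following is a minimal sketch of that equivalence, using a semi-gradient TD(0) step as the example update; the function names (`v_hat`, `one_hot`, `td0_update_gradient!`, `td0_update_tabular!`) and the numbers are hypothetical and are not taken from the notebook's own code.

```julia
# Hypothetical tabular value function: one parameter per state.
v_hat(s, w) = w[s]

# The gradient of v_hat(s, w) with respect to w is the one-hot vector e_s.
function one_hot(s, n_states)
    e = zeros(n_states)
    e[s] = 1.0
    return e
end

# Semi-gradient TD(0) step written with the full gradient vector.
function td0_update_gradient!(w, s, r, s_next, α, γ)
    δ = r + γ * v_hat(s_next, w) - v_hat(s, w)
    w .+= α * δ .* one_hot(s, length(w))
    return w
end

# Equivalent tabular form: only the entry for state s changes.
function td0_update_tabular!(w, s, r, s_next, α, γ)
    δ = r + γ * w[s_next] - w[s]
    w[s] += α * δ
    return w
end

w1 = [0.0, 1.0, 2.0]
w2 = copy(w1)
td0_update_gradient!(w1, 2, -1.0, 3, 0.5, 0.9)
td0_update_tabular!(w2, 2, -1.0, 3, 0.5, 0.9)
```

Both calls leave the same parameter vector, so in the tabular setting the one-hot gradient never needs to be materialized.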

Now moving on to the policy, we will use a soft-max function to convert action preferences into probabilities.

$$\begin{align} \pi(a|s, \theta) &= \frac{\exp{\theta_{s, a}}}{\sum_{i = 1}^{n_A}{\exp{\theta_{s, i}}}} \\ \nabla \pi(a|s, \theta) &= \nabla \frac{\exp{\theta_{s, a}}}{\sum_{i = 1}^{n_A}{\exp{\theta_{s, i}}}} \\ \end{align}$$

But we already calculated the gradient of the soft-max function of a vector $\mathbf{x}$.

$$\nabla\sigma(\mathbf{x})_{i, j} = \sigma(\mathbf{x})_i \left ( \delta_{i, j} - \sigma(\mathbf{x})_j \right )$$

Comparing this to what we need, $\mathbf{x} = \mathbf{\theta}_s$, the parameter vector for state $s$, and $\sigma = \pi$. So we can immediately write down the components of this gradient:

$$\begin{align} \nabla \pi(a|\theta_s)_i &= \pi(a|\theta_s) \left (\delta_{a, i} - \pi(i|\theta_s) \right ) \\ \frac{\nabla \pi(a|\theta_s)_i}{\pi(a|\theta_s)} = \nabla \ln \pi(a|\theta_s)_i &= \left (\delta_{a, i} - \pi(i|\theta_s) \right ) \\ \end{align}$$

$$\begin{equation} \nabla \ln{\pi(a|\theta_s)}_i = \begin{cases} -\pi(i|\theta_s) & i \neq a \\ 1 - \pi(i|\theta_s) & i = a \end{cases} \end{equation}$$

This gradient vector has one component for each entry of $\theta_s$, the vector of action preferences for that state. Each observed state/action pair produces its own update, but once the pair is fixed only as many components as there are actions need to be calculated; the gradient with respect to the parameters of every other state is zero.
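The closed form $\nabla \ln \pi(a|\theta_s) = \mathbf{e}_a - \pi(\cdot|\theta_s)$ can be checked numerically. The sketch below is an illustration only: `softmax`, `grad_log_policy`, and `finite_difference_grad` are hypothetical helper names, and the preference values are made up.

```julia
# Soft-max over one state's action preferences (one row of the θ matrix).
function softmax(x)
    z = exp.(x .- maximum(x))  # subtract the maximum for numerical stability
    return z ./ sum(z)
end

# Analytic gradient of ln π(a|θ_s) with respect to θ_s: e_a - π(⋅|θ_s).
function grad_log_policy(θ_s, a)
    g = -softmax(θ_s)
    g[a] += 1.0
    return g
end

# Central finite-difference approximation of the same gradient.
function finite_difference_grad(θ_s, a; h = 1e-6)
    map(eachindex(θ_s)) do i
        offset = zeros(length(θ_s))
        offset[i] = h
        (log(softmax(θ_s .+ offset)[a]) - log(softmax(θ_s .- offset)[a])) / (2h)
    end
end

θ_s = [0.5, -1.0, 2.0]
analytic = grad_log_policy(θ_s, 3)
numeric = finite_difference_grad(θ_s, 3)
```

The components of `analytic` sum to zero, which follows from the probabilities summing to one.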


Noticing this pattern, the $k$th term has the form $\gamma^k \sum_{x \in \mathcal{S}} \Pr(s \rightarrow x, k, \pi)f(x)$, and the total expression is the sum of these terms up to infinity or up to the maximum length of an episode under the policy. Looking more closely at the probability term, we can equate it to some other probabilities regarding episode length.
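For a finite state space, the sum over $k$ can be evaluated directly. The sketch below assumes the policy induces a fixed transition matrix `P`, so that $\Pr(s \rightarrow x, k, \pi)$ is entry $(s, x)$ of `P^k`. The three-state chain, the function `f`, and the helper `discounted_sum` are made up for illustration. For $\gamma < 1$ the infinite sum is a geometric series in $\gamma P$, so the truncated sum should approach $(I - \gamma P)^{-1} f$.

```julia
using LinearAlgebra

# Hypothetical chain induced by a fixed policy: P[s, x] is the one-step
# probability of moving from state s to state x.
P = [0.9 0.1 0.0;
     0.2 0.7 0.1;
     0.0 0.3 0.7]
f = [1.0, 0.0, -2.0]
γ = 0.9

# Sum the terms k = 0, ..., K of γ^k Σ_x Pr(s → x, k, π) f(x) for every start state s.
function discounted_sum(P, f, γ, K)
    total = zeros(length(f))
    Pk = Matrix{Float64}(I, size(P)...)  # P^0: zero steps leaves the state unchanged
    for k in 0:K
        total .+= γ^k .* (Pk * f)
        Pk = Pk * P                      # advance to the (k + 1)-step probabilities
    end
    return total
end

truncated = discounted_sum(P, f, γ, 500)
closed_form = (I - γ * P) \ f
```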

mimetext/htmlrootassigneelast_run_timestampA @tlpersist_js_state·has_pluto_hook_features§cell_id$d4e87ac4-6008-43b2-aa06-e232ec2b2b5bdepends_on_disabled_cells§runtimespublished_object_keysdepends_on_skipped_cellsçerrored$05f120be-9695-4824-82fd-142a0df13098queued¤logsrunning¦outputbodyoactor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA /Ipersist_js_state·has_pluto_hook_features§cell_id$05f120be-9695-4824-82fd-142a0df13098depends_on_disabled_cells§runtimeNS+published_object_keysdepends_on_skipped_cells§errored$b2539398-fdbc-42a2-a8f3-d327358f3643queued¤logsrunning¦outputbodyB

Waiting to run parameter study

mimetext/htmlrootassigneelast_run_timestampA /Wڰpersist_js_state·has_pluto_hook_features§cell_id$b2539398-fdbc-42a2-a8f3-d327358f3643depends_on_disabled_cells§runtimeMpublished_object_keysdepends_on_skipped_cellsçerrored$c5dd7e99-57e0-4bc7-97d2-2c780b23bcffqueued¤logsrunning¦outputbody

Discrete Action Space

As an initial test, consider the discrete action space originally used for the mountain car problem, in which there are three actions (-1, 0, 1) corresponding to full throttle reverse, idle, and full throttle forward. We can apply the same tile coding technique as before, but with a policy gradient method instead of Sarsa.
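One common way to build such a policy is a softmax over linear action preferences. The sketch below assumes binary tile features represented by the indices of the active tiles; `softmax_policy`, `grad_log_policy`, the 8-tile parameter matrix, and the active tile indices are all hypothetical and are not the notebook's own functions.

```julia
# θ has one row per tile feature and one column per action.
# active_tiles lists the indices of the tiles that are active in the current state.
function softmax_policy(θ::Matrix{Float64}, active_tiles::Vector{Int})
    # With binary features, each action preference is the sum of the active rows
    prefs = vec(sum(θ[active_tiles, :], dims = 1))
    prefs .-= maximum(prefs)    # subtract the maximum for numerical stability
    expprefs = exp.(prefs)
    return expprefs ./ sum(expprefs)
end

# Gradient of ln π(a|s) with respect to θ for a softmax over linear preferences:
# for each active tile i and action b it is 1[a == b] - π(b|s), and 0 elsewhere.
function grad_log_policy(θ::Matrix{Float64}, active_tiles::Vector{Int}, action_index::Int)
    probs = softmax_policy(θ, active_tiles)
    g = zeros(size(θ))
    for b in axes(θ, 2), i in active_tiles
        g[i, b] = (b == action_index ? 1.0 : 0.0) - probs[b]
    end
    return g
end

actions = (-1, 0, 1)               # reverse, idle, forward
θ = zeros(8, length(actions))      # hypothetical: 8 tiles in total
probs = softmax_policy(θ, [2, 5])  # uniform while θ is all zeros
g = grad_log_policy(θ, [2, 5], 3)  # gradient for choosing "forward"
```

Because only the active tiles contribute, the gradient is nonzero only in the rows of `θ` for those tiles, which keeps each update cheap.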


$\lambda_\theta$: 0.05

$\lambda_\mathbf{w}$: 0.8

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:

Total Reward: -147.0

Evaluation State for Policy Function

x position: 0.0

velocity: 0.0


Continuing Cartpole Example


13.7 Policy Parameterization for Continuous Actions

With a parameterized policy over continuous actions, we learn the statistics of the probability distribution from which actions are sampled. As a foundation, consider the normal distribution:

$$p(x) \doteq \frac{1}{\sigma \sqrt{2\pi}} \exp \left ( - \frac{(x-\mu)^2}{2\sigma^2} \right ) \tag{13.18}$$
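A minimal sketch of how (13.18) can be used as a policy: evaluate the density, sample an action, and compute the derivatives of $\ln p(x)$ with respect to $\mu$ and $\sigma$ that a policy gradient update needs. The function names are hypothetical and are not the notebook's own implementation.

```julia
# Density of the normal distribution, matching (13.18)
normal_pdf(x, μ, σ) = exp(-(x - μ)^2 / (2σ^2)) / (σ * sqrt(2π))

# Partial derivatives of ln p(x) with respect to the mean and standard deviation
dlogp_dμ(x, μ, σ) = (x - μ) / σ^2
dlogp_dσ(x, μ, σ) = ((x - μ)^2 - σ^2) / σ^3

# Draw an action from the policy using a standard normal sample
sample_action(μ, σ) = μ + σ * randn()

a = sample_action(0.1, 0.7)
```

If $\mu$ and $\sigma$ are themselves functions of policy parameters, for example a linear function of state features for $\mu$, these derivatives are combined with the chain rule to obtain the gradient with respect to those parameters.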


REINFORCE Implementation for Continuous Action Spaces


Mountain Car MDP


Policy Gradient Theorem Proof

In all cases below, when a sum over states is taken, it is assumed to be over the set of non-terminal states: $\sum_s \implies \sum_{s \in \mathcal{S}}$. Note that for the value function this is identical to the sum over $\mathcal{S}^+$ because the state-action values are always zero for terminal states.

$$\begin{flalign} \nabla v_\pi(s) &= \nabla \left [ \sum_a \pi(a \vert s) q_\pi(s, a) \right ] \text{, } \forall s \in \mathcal{S} \tag{definition of value functions and expected value} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla q_\pi(s, a) \right ] \tag{product rule} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla \sum_{s^\prime, r} p(s^\prime, r \vert s, a)(r + \gamma v_\pi(s^\prime)) \right ] \tag{relationship between action and state values} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \gamma \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] \tag{gradient independence}\\ \end{flalign}$$

Note that the final term in the sum is the original expression evaluated at $s^\prime$ instead of $s$, so we have derived a recursive expression that can be applied repeatedly:

$$\begin{flalign} \nabla v_\pi(s) &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \gamma \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \left [ \nabla \pi(a^\prime \vert s^\prime) q_\pi(s^\prime, a^\prime) + \pi(a^\prime \vert s^\prime) \gamma \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \nabla v_\pi(s^{\prime \prime}) \right ] \right ] \tag{recur once}\\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) \right ] + \gamma \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \left [ \nabla \pi(a^\prime \vert s^\prime) q_\pi(s^\prime, a^\prime) \right ] \right ] + \\ &\hspace{50px} \gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \nabla v_\pi(s^{\prime \prime}) \right ] \tag{grouping terms}\\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) \right ] + \gamma \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \left [ \nabla \pi(a^\prime \vert s^\prime) q_\pi(s^\prime, a^\prime) \right ] \right ] + \\ &\hspace{50px} \gamma^2 \sum_a \left [ \pi(a \vert s)\sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \sum_{a^{\prime \prime}} \left [ \nabla \pi(a^{\prime \prime} \vert s^{\prime \prime}) q_\pi(s^{\prime \prime}, a^{\prime \prime}) \right ] \right ] + \cdots \tag{extend recursion}\\ \end{flalign}$$


The probability transition function is normalized over all possible transition states: $\sum_{s^\prime \in \mathcal{S}^+} p(s^\prime \vert s, a) = 1$. If we only take the sum over $\mathcal{S}$, then we instead get the probability that after a single transition we have NOT reached a terminal state. Suppose we also have a policy function $\pi(a \vert s)$ which is normalized over actions: $\sum_a \pi(a \vert s) = 1$. Combining the two, we arrive at a new distribution over transition states: $p(s^\prime \vert s, \pi) = \sum_a \pi(a \vert s) p(s^\prime \vert s, a)$, which is the probability of transitioning from $s$ to $s^\prime$ under the policy. This distribution is also normalized over the transition states as long as we include the terminal state: $\sum_{s^\prime \in \mathcal{S}^+} p(s^\prime \vert s, \pi) = \sum_{s^\prime \in \mathcal{S}^+, a} \pi(a \vert s) p(s^\prime \vert s, a) = \sum_a \pi(a \vert s) \sum_{s^\prime \in \mathcal{S}^+} p(s^\prime \vert s, a) = 1 \times 1 = 1$. If instead we take the sum over $\mathcal{S}$, we again get the probability of NOT terminating in one step, now under the policy.

What if we consider two steps into the future? Now we have $\sum_{s^\prime}\sum_{a^\prime}\pi(a^\prime \vert s^\prime)p(s^{\prime \prime} \vert s^\prime, a^\prime)\sum_a \pi(a \vert s) p(s^\prime \vert s, a) = \sum_{s^\prime}p(s^{\prime \prime} \vert s^\prime, \pi) p(s^\prime \vert s, \pi)$. It would appear that we can just combine the two probabilities and consider a new distribution over $s^{\prime \prime}$, written $p(s^{\prime \prime} \vert s, \pi, 2)$, where the transition now occurs over two steps instead of one. But how is this distribution normalized? In the case of the one-step transition, we saw that its sum over all transition states is 1, as expected. If we sum both transition states over only $\mathcal{S}$ rather than $\mathcal{S}^+$, what is the result? We already know that $\sum_{s^{\prime \prime} \in \mathcal{S}} p(s^{\prime \prime} \vert s^\prime , \pi) = \Pr \{ S_1 \neq S_T \ \vert S_0 = s^\prime, \pi \}$, that is, the probability that after transitioning out of $s^\prime$ under the policy $\pi$ we have not reached a terminal state.

$$\sum_{s^{\prime \prime} \in \mathcal{S}} \sum_{s^\prime \in \mathcal{S}} p(s^{\prime \prime} \vert s^\prime, \pi) p(s^\prime \vert s, \pi) = \sum_{s^\prime \in \mathcal{S}} p(s^\prime \vert s, \pi) \sum_{s^{\prime \prime} \in \mathcal{S}} p(s^{\prime \prime} \vert s^\prime, \pi) = \Pr \{ S_2 \neq S_T \vert S_0 = s, \pi \}$$

which is to say the probability that after two transitions from $s$ we are not in a terminal state under the policy $\pi$.
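These survival probabilities can be checked numerically on a small example. The sketch below assumes a hypothetical episodic task with two non-terminal states and one terminal state; the transition array `p`, the `policy` matrix, and all other names are invented for illustration.

```julia
# Hypothetical task: non-terminal states 1:2, terminal state 3, two actions.
# p[s, a, s_next] is the transition probability; policy[s, a] is π(a|s).
p = zeros(2, 2, 3)
p[1, 1, :] = [0.0, 0.8, 0.2]
p[1, 2, :] = [0.5, 0.0, 0.5]
p[2, 1, :] = [0.3, 0.3, 0.4]
p[2, 2, :] = [0.0, 0.0, 1.0]
policy = [0.5 0.5;
          0.9 0.1]

# p(s_next | s, π) = Σ_a π(a|s) p(s_next | s, a), including the terminal column
P_full = [sum(policy[s, a] * p[s, a, s_next] for a in 1:2) for s in 1:2, s_next in 1:3]

# Restrict the next state to the non-terminal states
P_nt = P_full[:, 1:2]

# Row sums over the non-terminal states give the probability of not having terminated
one_step_survival = vec(sum(P_nt, dims = 2))
two_step_survival = vec(sum(P_nt * P_nt, dims = 2))
```

In this example each row of `P_full` sums to 1, while the row sums of `P_nt` and `P_nt * P_nt` are the one-step and two-step probabilities of not having terminated from each start state.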

For the derivations that follow, we always take sums of these distributions over $\mathcal{S}$. For episodic problems, the on-policy distribution $\mu_\pi(s)$, which is the probability of being in a state $s$ during an episode, always excludes the terminal state. That is because if there is a non-zero probability of reaching a terminal state under a policy, then considering all possible episodes we may have an infinite number of visits to the terminal state. Technically the episodes have infinite length, but we are only interested in the portion of the episode that precedes the terminal state for the purpose of calculating probabilities. The more careful statement about the on-policy distribution is that it measures the probability of being in a state during the non-terminal part of an episode. If we try to include the terminal states, then we cannot have a properly normalized definition of the on-policy distribution. Moreover, we have no need to measure the value of a terminal state accurately, since we always know it to be 0. The on-policy distribution is used to formulate the value error objective function, and it should only include states for which the value estimation is non-trivial.
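A minimal sketch of this definition, assuming the on-policy distribution is computed by normalizing the expected number of visits to each non-terminal state per episode. The transition matrix `P_nt` (restricted to non-terminal states, under the policy) and the start-state distribution `h` are hypothetical values chosen for illustration.

```julia
using LinearAlgebra

# Hypothetical transition probabilities between non-terminal states under the policy;
# each row sums to less than 1, with the remainder going to the terminal state.
P_nt = [0.25 0.40;
        0.27 0.27]
h = [1.0, 0.0]   # every episode starts in state 1

# Expected visits satisfy η(s) = h(s) + Σ_x η(x) P_nt[x, s],
# which is the linear system (I - P_nt') η = h.
η = (I - P_nt') \ h

# Normalizing over the non-terminal states only gives a proper distribution
μ = η ./ sum(η)
```

Because the terminal state is left out, `η` is finite and `μ` sums to 1. Including an absorbing terminal state would make its expected visit count unbounded, which is the normalization problem described above.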


Tile Coding Method


Actor-Critic with Eligibility Traces Implementation

Total Reward: -113.0

Policy Gradient Theorem Proof for Continuing Problems


Misc Utilities/Functions

mimetext/htmlrootassigneelast_run_timestampA Epersist_js_state·has_pluto_hook_features§cell_id$3ea08816-705e-4be7-a175-dbd3f3e4c17ddepends_on_disabled_cells§runtimedpublished_object_keysdepends_on_skipped_cells§errored$f3e2db06-9cb7-464a-96b8-938175efd26bqueued¤logsrunning¦outputbody
mimetext/htmlrootassigneelast_run_timestampA @icpersist_js_state·has_pluto_hook_features§cell_id$af144759-fe66-4ad0-b378-e9eb4e859db4depends_on_disabled_cells§runtime/published_object_keysdepends_on_skipped_cellsçerrored$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2queued¤logsrunning¦outputbodyprefix"ContinuousMDP{Float32, Tuple{Float32, Float32}, Float32, ContinuousMDPTransitionSampler{Float32, Tuple{Float32, Float32}, Float32, var"#step#1603"{Float32}}, typeof(Main.var"workspace#8".MountainCarTask.initialize_state), typeof(Main.var"workspace#8".MountainCarTask.isterm), Returns{Bool}}elementsptfprefixcContinuousMDPTransitionSampler{Float32, Tuple{Float32, Float32}, Float32, var"#step#1603"{Float32}}elementsstepS(::Main.var"workspace#8".var"#step#1603"{Float32}) (generic function with 1 method)text/plaintypestructprefix_shortContinuousMDPTransitionSamplerobjectid10d85fa8!application/vnd.pluto.tree+objectinitialize_state1initialize_state (generic function with 1 method)text/plainisterm'isterm (generic function with 1 method)text/plainis_valid_actionReturns{Bool}(true)text/plaintypestructprefix_shortContinuousMDPobjectid81156278aaa90f92mime!application/vnd.pluto.tree+objectrootassignee const mountaincar_continuous_mdplast_run_timestampA =sPpersist_js_state·has_pluto_hook_features§cell_id$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2depends_on_disabled_cells§runtime+published_object_keysdepends_on_skipped_cells§errored$fb8904a9-ae64-41cc-93b6-5a25855edad0queued¤logsrunning¦outputbody;get_corridor_episode_stats (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA 
!fpersist_js_state·has_pluto_hook_features§cell_id$fb8904a9-ae64-41cc-93b6-5a25855edad0depends_on_disabled_cells§runtime3vpublished_object_keysdepends_on_skipped_cellsçerrored$a5b002c9-5e11-462a-9da0-6e060c7963f8queued¤logsrunning¦outputbodyelementsprefix,Main.var"workspace#8".CartPoleState{Float32}elementsprefixCartPoleState{Float32}elementsx30.0text/plainθ0.8text/plainẋ0.0text/plainθ̇-0.0text/plaint0.0text/plaintypestructprefix_shortCartPoleStateobjectida843943c04b3a7e!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx30.0221text/plainθ0.795094text/plainẋ1.10655text/plainθ̇-0.245739text/plaint0.04text/plaintypestructprefix_shortCartPoleStateobjectidce8725f349c35742!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx30.0885text/plainθ0.780267text/plainẋ2.21393text/plainθ̇-0.497006text/plaint0.08text/plaintypestructprefix_shortCartPoleStateobjectid56323732a2cd6c6!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx30.1993text/plainθ0.755186text/plainẋ3.323text/plainθ̇-0.759398text/plaint0.12text/plaintypestructprefix_shortCartPoleStateobjectide9d1d7d09422679d!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx30.3544text/plainθ0.719291text/plainẋ4.43467text/plainθ̇-1.03864text/plaint0.16text/plaintypestructprefix_shortCartPoleStateobjectid587ca329eacee254!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx30.5541text/plainθ0.671792text/plainẋ5.54994text/plainθ̇-1.34061text/plaint0.2text/plaintypestructprefix_shortCartPoleStateobjectidf23d0775120eae84!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx30.7985text/plainθ0.61166text/plainẋ6.66984text/plainθ̇-1.67126text/plaint0.24text/plaintypestructprefix_shortCartPoleStateobjectid937ae5d5b0481894!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx31.0878text/plainθ0.53763text/plainẋ7.79538text/plainθ̇-2.03646text/plaint0.28text/plaintypestructprefix_shortCartPoleStateobjectidac7ee607274
c7714!application/vnd.pluto.tree+object prefixCartPoleState{Float32}elementsx31.4222text/plainθ0.448211text/plainẋ8.92731text/plainθ̇-2.44158text/plaint0.32text/plaintypestructprefix_shortCartPoleStateobjectidd7442e47daa49912!application/vnd.pluto.tree+objectmoreprefixCartPoleState{Float32}elementsx36.3249text/plainθ-1.21277text/plainẋ-1.05244text/plainθ̇-1.20747text/plaint1.0text/plaintypestructprefix_shortCartPoleStateobjectidf9c13b3b4eec42a5!application/vnd.pluto.tree+objecttypeArrayprefix_shortobjectid1e33aa8b879d9c18!application/vnd.pluto.tree+objectprefixInt64elements3text/plain3text/plain3text/plain3text/plain3text/plain3text/plain3text/plain3text/plain 3text/plainmore1text/plaintypeArrayprefix_shortobjectidaf7504f82a205193!application/vnd.pluto.tree+objectprefixFloat32elements1.0text/plain1.0text/plain1.0text/plain1.0text/plain1.0text/plain1.0text/plain1.0text/plain1.0text/plain 1.0text/plainmore1.0text/plaintypeArrayprefix_shortobjectid5288029df0465058!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx36.2609text/plainθ-1.26108text/plainẋ-2.1482text/plainθ̇-1.21309text/plaint1.04text/plaintypestructprefix_shortCartPoleStateobjectid54ad65ee320ed994!application/vnd.pluto.tree+object26text/plaintypeTupleobjectid77b4b6114cbab8a0mime!application/vnd.pluto.tree+objectrootassigneeconst ep2last_run_timestampA :?persist_js_state·has_pluto_hook_features§cell_id$a5b002c9-5e11-462a-9da0-6e060c7963f8depends_on_disabled_cells§runtimeY)published_object_keysdepends_on_skipped_cells§errored$83640f5b-fe13-4ec1-98a0-67a56c189ba1queued¤logsrunning¦outputbodyGactor_critic_with_eligibility_traces! 
(generic function with 2 methods)mimetext/plainrootassigneelast_run_timestampA + qapersist_js_state·has_pluto_hook_features§cell_id$83640f5b-fe13-4ec1-98a0-67a56c189ba1depends_on_disabled_cells§runtimeppublished_object_keysdepends_on_skipped_cells§errored$61650a97-b353-4a85-b50b-93fee296ac7bqueued¤logsrunning¦outputbodyelementsfeature_vectorprefixFloat32elements0.0text/plain0.0text/plain0.0text/plain0.0text/plaintypeArrayprefix_shortobjectidaa97f30b6a1c9855!application/vnd.pluto.tree+objectnum_features4text/plainupdate_feature_vector!update_feature_vector!text/plaintypeNamedTupleobjectid18316424d13c4613mime!application/vnd.pluto.tree+objectrootassignee"const cartpole_fcann_feature_setuplast_run_timestampA 2I0persist_js_state·has_pluto_hook_features§cell_id$61650a97-b353-4a85-b50b-93fee296ac7bdepends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$602a07dd-8928-4b44-97e5-01c5cbf38351queued¤logsrunning¦outputbody5plot_cartpole_policy (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA @6gpersist_js_state·has_pluto_hook_features§cell_id$602a07dd-8928-4b44-97e5-01c5cbf38351depends_on_disabled_cells§runtimeupublished_object_keysdepends_on_skipped_cellsçerrored$f7433324-acc3-49a5-b5b3-ada0c8f09d52queued¤logsrunning¦outputbodyelementsprefixInt64elements1text/plain2text/plain3text/plaintypeArrayprefix_shortobjectidb9d5c490997071d7!application/vnd.pluto.tree+objectprefixInt64elements2text/plain1text/plain2text/plaintypeArrayprefix_shortobjectid45f86e4ab1b5cc81!application/vnd.pluto.tree+objectprefixFloat32elements-1.0text/plain-1.0text/plain-1.0text/plaintypeArrayprefix_shortobjectide320a1e36c0632f8!application/vnd.pluto.tree+object4text/plain3text/plaintypeTupleobjectid2f266a154ddd7c23mime!application/vnd.pluto.tree+objectrootassigneelast_run_timestampA ! 
persist_js_state·has_pluto_hook_features§cell_id$f7433324-acc3-49a5-b5b3-ada0c8f09d52depends_on_disabled_cells§runtime2]published_object_keysdepends_on_skipped_cellsçerrored$0c9986bb-54c0-4b08-9c29-4bfb0b68b54equeued¤logsrunning¦outputbody

Exercise 13.2

Generalize the proof of the policy gradient theorem and the steps leading to the REINFORCE update equation (13.8), so that (13.8) ends up with a factor of $\gamma^t$ and thus aligns with the general algorithm given in the pseudocode.

See proof above in the section on proving the policy gradient theorem.

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$3bafd7df-9bc0-4d13-874d-739590cf3ad9depends_on_disabled_cells§runtimej$published_object_keysdepends_on_skipped_cells§errored$f27f2bcd-05b6-44fe-bf9e-a3e51556db7cqueued¤logsrunning¦outputbodyelementsstepsteptext/plainfailurefailuretext/plaininitialize_stateinitialize_statetext/plaindiscrete_actionsprefixFloat32elements-300.0text/plain0.0text/plain300.0text/plaintypeArrayprefix_shortobjectid7bda0c2c0e3ade17!application/vnd.pluto.tree+objectmin_valselements-50.0text/plain-1.22173text/plain-50.0text/plain-10.0text/plaintypeTupleobjectide43d54ac3f2f06ed!application/vnd.pluto.tree+objectmax_valselements50.0text/plain1.22173text/plain50.0text/plain10.0text/plaintypeTupleobjectid670383347ca2d0ca!application/vnd.pluto.tree+objecth0.04text/plaintypeNamedTupleobjectid9bb540c63f908a36mime!application/vnd.pluto.tree+objectrootassigneeconst cartpole_functionslast_run_timestampA mpersist_js_state·has_pluto_hook_features§cell_id$f27f2bcd-05b6-44fe-bf9e-a3e51556db7cdepends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cellsçerrored$41dc149d-c6f3-4b0d-a856-06f3aaae3049queued¤logsrunning¦outputbodymimetext/plainrootassigneelast_run_timestampA PLpersist_js_state·has_pluto_hook_features§cell_id$41dc149d-c6f3-4b0d-a856-06f3aaae3049depends_on_disabled_cells§runtime&kfpublished_object_keysdepends_on_skipped_cells§errored$38e5d800-4d43-40d2-87ea-f7d4b4283dabqueued¤logsrunning¦outputbody

To find the $p$ that maximizes the expected value of state 1, we differentiate $v_1$ with respect to $p$ and set the result to 0:

$$\frac{\partial v_1}{\partial p} = -\frac{2p(1-p) - 2(1+p)(1 - 2p)}{p^2(1-p)^2}$$

Setting this equal to 0 implies

$$\begin{flalign} p-p^2 &= 1 - 2p + p - 2p^2\\ p^2 + 2p - 1 &= 0 \\ \end{flalign}$$

The quadratic formula gives two solutions, but since $p$ is a probability and cannot be negative, we keep only the positive root.

$$p = \frac{-2 \pm \sqrt{4 + 4}}{2} = \frac{-2 \pm 2\sqrt{2}}{2} = -1 \pm \sqrt{2} \implies p = \sqrt{2} - 1 \approx 0.41421$$

So, to maximize the value at state 1, we need $p_{\text{left}} \approx 0.414$ and $p_{\text{right}} \approx 0.586$. This also implies that $v_1 = -2\frac{1+p}{p(1-p)} = -2\frac{\sqrt{2}}{(\sqrt{2}-1)(2 - \sqrt{2})}= \frac{-2\sqrt{2}}{2 \sqrt{2} - 2 - 2 + \sqrt{2}} = \frac{-2 \sqrt{2}}{3\sqrt{2} - 4} \approx -11.657$.
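
The result can be checked numerically. The following is a minimal sketch that assumes only the expression $v_1 = -2\frac{1+p}{p(1-p)}$ used above. The names `v1`, `p_star`, and `p_grid` are illustrative and are not part of the notebook.

```julia
# Value of state 1 as a function of p, using the expression from the text.
v1(p) = -2 * (1 + p) / (p * (1 - p))

# Closed-form maximizer: the positive root of p^2 + 2p - 1 = 0.
p_star = sqrt(2) - 1

# Independent check: coarse grid search over the open interval (0, 1).
grid = 0.001:0.001:0.999
p_grid = grid[argmax(v1.(grid))]

println("closed form: p = ", p_star, ", v1 = ", v1(p_star))
println("grid search: p = ", p_grid, ", v1 = ", v1(p_grid))
```

The grid search should land on the grid point nearest to $\sqrt{2} - 1$, which agrees with the closed-form root to within the grid spacing.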

mimetext/htmlrootassigneelast_run_timestampA -persist_js_state·has_pluto_hook_features§cell_id$38e5d800-4d43-40d2-87ea-f7d4b4283dabdepends_on_disabled_cells§runtime%published_object_keysdepends_on_skipped_cells§errored$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2queued¤logsrunning¦outputbodyFone_step_actor_critic_binary_features (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA 'U[persist_js_state·has_pluto_hook_features§cell_id$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2depends_on_disabled_cells§runtime7published_object_keysdepends_on_skipped_cells§errored$73b90260-d57a-449a-8db6-47f91e6a4e4fqueued¤logsrunning¦outputbodyM

Eligibility Vector with Binary Features

mimetext/htmlrootassigneelast_run_timestampA װpersist_js_state·has_pluto_hook_features§cell_id$73b90260-d57a-449a-8db6-47f91e6a4e4fdepends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$5aba4f96-e877-457e-8e95-18737348f99fqueued¤logsrunning¦outputbodyCactor_critic_fcann_parameter_study (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA l۰persist_js_state·has_pluto_hook_features§cell_id$5aba4f96-e877-457e-8e95-18737348f99fdepends_on_disabled_cells§runtimehzpublished_object_keysdepends_on_skipped_cells§errored$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486queued¤logsrunning¦outputbodyd<

$\lambda_\theta$: 0.1

$\lambda_\mathbf{w}$: 0.98

$\alpha_{\overline{r}}$:

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:

mimetext/htmlrootassigneelast_run_timestampA !7V.persist_js_state·has_pluto_hook_features§cell_id$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486depends_on_disabled_cells§runtime$rpublished_object_keysdepends_on_skipped_cellsçerrored$27487ad0-4779-42ce-8def-e660ef04bee0queued¤logsrunning¦outputbodyelementsaction_probabilitiesprefixFloat32elements0.149956text/plain0.000232849text/plain0.849811text/plaintypeArrayprefix_shortobjectid566b0efc4555caa4!application/vnd.pluto.tree+objectstate_value_estimate665.762text/plaintypeNamedTupleobjectid733e2d8aae6a0198mime!application/vnd.pluto.tree+objectrootassigneelast_run_timestampA 7persist_js_state·has_pluto_hook_features§cell_id$27487ad0-4779-42ce-8def-e660ef04bee0depends_on_disabled_cells§runtime͜published_object_keysdepends_on_skipped_cells§errored$0d93132d-5819-47dc-8cf2-462d480d9c3dqueued¤logsrunning¦outputbodyB

Waiting to run parameter study

mimetext/htmlrootassigneelast_run_timestampA @0opersist_js_state·has_pluto_hook_features§cell_id$0d93132d-5819-47dc-8cf2-462d480d9c3ddepends_on_disabled_cells§runtime킵published_object_keysdepends_on_skipped_cellsçerrored$9978d537-49ff-4014-a971-b42704c50a6bqueued¤logsrunning¦outputbodyd

$\lambda_\theta$: 0.95

$\lambda_\mathbf{w}$: 0.2

hidden layer size: , num layers:

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:

mimetext/htmlrootassigneelast_run_timestampA 3@persist_js_state·has_pluto_hook_features§cell_id$9978d537-49ff-4014-a971-b42704c50a6bdepends_on_disabled_cells§runtime published_object_keysdepends_on_skipped_cellsçerrored$f8215517-b18f-4a03-9421-8edab4ca8089queued¤logsrunning¦outputbodyڅ, mimetext/htmlrootassigneelast_run_timestampA >n$̰persist_js_state·has_pluto_hook_features§cell_id$f8215517-b18f-4a03-9421-8edab4ca8089depends_on_disabled_cells§runtime=յpublished_object_keysdepends_on_skipped_cellsçerrored$1ac9296f-047b-4051-ba5c-0c23d5f9cde9queued¤logsrunning¦outputbodyprefixٕStateMDP{Float32, Int64, Symbol, StateMDPTransitionSampler{Float32, Int64, var"#step#1204"}, var"#1203#1205", Returns{Bool}, TabularRL.var"#164#169"}elementsactionsprefixSymbolelements:lefttext/plain:righttext/plaintypeArrayprefix_shortobjectid56b8e3577fbdce3b!application/vnd.pluto.tree+objectptfprefix:StateMDPTransitionSampler{Float32, Int64, var"#step#1204"}elementsstepJ(::Main.var"workspace#8".var"#step#1204") (generic function with 1 method)text/plaintypestructprefix_shortStateMDPTransitionSamplerobjectidffffffff611818f4!application/vnd.pluto.tree+objectinitialize_stateҳ (generic function with 1 method)text/plainistermReturns{Bool}(false)text/plainis_valid_action%#164 (generic function with 1 method)text/plainaction_indexprefixDict{Symbol, Int64}elements:lefttext/plain1text/plain:righttext/plain2text/plaintypeDictprefix_shortDictobjectid11acce9bbde33ff7!application/vnd.pluto.tree+objecttypestructprefix_shortStateMDPobjectid76806692942ce065mime!application/vnd.pluto.tree+objectrootassigneeconst corridor_continuing_mdplast_run_timestampA Vzpersist_js_state·has_pluto_hook_features§cell_id$1ac9296f-047b-4051-ba5c-0c23d5f9cde9depends_on_disabled_cells§runtimeCCpublished_object_keysdepends_on_skipped_cells§errored$c87dba8c-9a96-41b3-9dc7-a6c088ec1eafqueued¤logsrunning¦outputbodyڞ?Total Reward: -160.0
mimetext/htmlrootassigneelast_run_timestampA Apersist_js_state·has_pluto_hook_features§cell_id$c87dba8c-9a96-41b3-9dc7-a6c088ec1eafdepends_on_disabled_cells§runtime! published_object_keysdepends_on_skipped_cellsçerrored$5cc4d12d-b537-47e2-8109-4e7a234fdf25queued¤logsrunning¦outputbody2make_corridor_mdp (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA Npersist_js_state·has_pluto_hook_features§cell_id$5cc4d12d-b537-47e2-8109-4e7a234fdf25depends_on_disabled_cells§runtime&epublished_object_keysdepends_on_skipped_cells§errored$5334064b-5a16-4135-afa0-86a48291725bqueued¤logsrunning¦outputbodyelementsvalue-11.0794text/plainaction2text/plainaction_valuesprefixFloat32elements-11.0825text/plain-11.0794text/plaintypeArrayprefix_shortobjectid293c0cefdbfaa2e0!application/vnd.pluto.tree+objecttypeNamedTupleobjectid54bdcd18010f08f6mime!application/vnd.pluto.tree+objectrootassigneelast_run_timestampA °persist_js_state·has_pluto_hook_features§cell_id$5334064b-5a16-4135-afa0-86a48291725bdepends_on_disabled_cells§runtimeAepublished_object_keysdepends_on_skipped_cellsçerrored$9c342958-1971-48ec-b919-5dfdcbc915a4queued¤logsrunning¦outputbody٣

Change Plot Background Color

mimetext/htmlrootassigneelast_run_timestampA )kpersist_js_state·has_pluto_hook_features§cell_id$9c342958-1971-48ec-b919-5dfdcbc915a4depends_on_disabled_cells§runtimeapublished_object_keysdepends_on_skipped_cellsçerrored$966ef17c-23be-49dc-bc37-4cb52b34c049queued¤logsrunning¦outputbody;

Neural Network Method

mimetext/htmlrootassigneelast_run_timestampA Wipersist_js_state·has_pluto_hook_features§cell_id$966ef17c-23be-49dc-bc37-4cb52b34c049depends_on_disabled_cells§runtimeȠpublished_object_keysdepends_on_skipped_cells§errored$e7e49ff8-32df-48a4-afb2-462859592e92queued¤logsrunning¦outputbodyGform_state_and_policy_function_outputs (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA 5?persist_js_state·has_pluto_hook_features§cell_id$e7e49ff8-32df-48a4-afb2-462859592e92depends_on_disabled_cells§runtimeܱpublished_object_keysdepends_on_skipped_cells§errored$78c83673-2117-4542-b4d8-1c243e8f610bqueued¤logsrunning¦outputbody

Eligibility Vector

Recall that for the Gaussian case with linear approximation we had:

$$\begin{flalign} \pi(a \vert s, \boldsymbol{\theta}) &= \frac{1}{\sqrt{2 \pi \sigma(s, \boldsymbol{\theta})^2}} \exp \left ( - \frac{(a - \mu(s, \boldsymbol{\theta}))^2}{2 \sigma(s, \boldsymbol{\theta})^2} \right )\\ \mu(s, \boldsymbol{\theta}) & \doteq \boldsymbol{\theta}_\mu ^ \top \mathbf{x}_\mu(s) \\ \sigma(s, \boldsymbol{\theta}) & \doteq \exp \left ( \boldsymbol{\theta}_\sigma ^ \top \mathbf{x}_\sigma(s) \right ) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\mu) &= \frac{1}{\sigma(s, \boldsymbol{\theta})^2} \left ( a - \mu(s, \boldsymbol{\theta}) \right ) \mathbf{x}_\mu(s) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\sigma) &= \left (\frac{(a - \mu(s, \boldsymbol{\theta}))^2}{\sigma(s, \boldsymbol{\theta})^2} - 1 \right )\mathbf{x}_\sigma(s) \\ \end{flalign}$$

For the squashed Gaussian, the change-of-variables factor $\left \vert \frac{1}{1 - a^2} \right \vert$ does not depend on $\boldsymbol{\theta}$, so the previous results apply to the new pdf with $a$ replaced by $\tanh^{-1}(a)$:

$$\begin{flalign} \pi(a \vert s, \boldsymbol{\theta}) &= \frac{1}{\sqrt{2 \pi \sigma(s, \boldsymbol{\theta})^2}} \exp \left ( - \frac{(\tanh^{-1}(a) - \mu(s, \boldsymbol{\theta}))^2}{2 \sigma(s, \boldsymbol{\theta})^2} \right ) \left \vert \frac{1}{1 - a^2} \right \vert\\ \mu(s, \boldsymbol{\theta}) & \doteq \boldsymbol{\theta}_\mu ^ \top \mathbf{x}_\mu(s) \\ \sigma(s, \boldsymbol{\theta}) & \doteq \exp \left ( \boldsymbol{\theta}_\sigma ^ \top \mathbf{x}_\sigma(s) \right ) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\mu) &= \frac{1}{\sigma(s, \boldsymbol{\theta})^2} \left ( \tanh^{-1}(a) - \mu(s, \boldsymbol{\theta}) \right ) \mathbf{x}_\mu(s) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\sigma) &= \left (\frac{(\tanh^{-1}(a) - \mu(s, \boldsymbol{\theta}))^2}{\sigma(s, \boldsymbol{\theta})^2} - 1 \right )\mathbf{x}_\sigma(s) \\ \end{flalign}$$
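
As a sanity check on the eligibility vectors for the squashed Gaussian, the sketch below compares analytic gradients against central finite differences of $\ln \pi$. It is an illustration under stated assumptions, not the notebook's implementation: the function names `log_policy`, `eligibility`, and `finite_difference`, and the parameter and feature values, are all hypothetical. The $\sigma$ gradient used here includes the $-1$ term that comes from differentiating $-\ln \sigma(s, \boldsymbol{\theta})$.

```julia
using LinearAlgebra: dot

# Log-density of the squashed Gaussian policy with linear mean and linear log-std.
function log_policy(a, θμ, θσ, xμ, xσ)
    μ = dot(θμ, xμ)
    σ = exp(dot(θσ, xσ))
    u = atanh(a)  # pre-squash action
    return -log(σ) - 0.5 * log(2π) - (u - μ)^2 / (2σ^2) - log(1 - a^2)
end

# Analytic eligibility vectors; the Jacobian term log(1 - a^2) has no θ dependence.
function eligibility(a, θμ, θσ, xμ, xσ)
    μ = dot(θμ, xμ)
    σ = exp(dot(θσ, xσ))
    u = atanh(a)
    grad_μ = ((u - μ) / σ^2) .* xμ
    grad_σ = ((u - μ)^2 / σ^2 - 1) .* xσ
    return grad_μ, grad_σ
end

# Central finite differences along each coordinate of a parameter vector.
function finite_difference(f, θ; h = 1e-6)
    map(eachindex(θ)) do i
        e = zeros(length(θ))
        e[i] = h
        (f(θ .+ e) - f(θ .- e)) / (2h)
    end
end

# Arbitrary illustrative parameters, features, and action in (-1, 1).
θμ, θσ = [0.3, -0.2], [0.1, 0.4]
xμ, xσ = [1.0, 0.5], [1.0, -1.0]
a = 0.6

grad_μ, grad_σ = eligibility(a, θμ, θσ, xμ, xσ)
fd_μ = finite_difference(t -> log_policy(a, t, θσ, xμ, xσ), θμ)
fd_σ = finite_difference(t -> log_policy(a, θμ, t, xμ, xσ), θσ)

err_μ = maximum(abs.(grad_μ .- fd_μ))
err_σ = maximum(abs.(grad_σ .- fd_σ))
println("max abs error (mean parameters): ", err_μ)
println("max abs error (std parameters):  ", err_σ)
```

Both errors should be at the level of finite-difference noise. Dropping the $-1$ from `grad_σ` makes the second error equal to the magnitude of the entries of `xσ`.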

mimetext/htmlrootassigneelast_run_timestampA 6persist_js_state·has_pluto_hook_features§cell_id$78c83673-2117-4542-b4d8-1c243e8f610bdepends_on_disabled_cells§runtime.published_object_keysdepends_on_skipped_cells§errored$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8fqueued¤logsrunning¦outputbodyTotal Reward: -626.0
mimetext/htmlrootassigneelast_run_timestampA A,persist_js_state·has_pluto_hook_features§cell_id$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8fdepends_on_disabled_cells§runtime$Spublished_object_keysdepends_on_skipped_cellsçerrored$396e0047-d848-462f-a769-0cc2829abc78queued¤logsrunning¦outputbodyelementsaction_probabilitiesprefixFloat32elements0.00209303text/plain0.997907text/plaintypeArrayprefix_shortobjectid2bca965e70e9689c!application/vnd.pluto.tree+objectstate_value_estimate-160.412text/plaintypeNamedTupleobjectid860e53ebb42149aemime!application/vnd.pluto.tree+objectrootassigneelast_run_timestampA +&`ٰpersist_js_state·has_pluto_hook_features§cell_id$396e0047-d848-462f-a769-0cc2829abc78depends_on_disabled_cells§runtime&0published_object_keysdepends_on_skipped_cellsçerrored$ff4f977e-48df-4c12-845c-c245b4d39d6dqueued¤logsrunning¦outputbodyEactor_critic_linear_parameter_study (generic function with 3 methods)mimetext/plainrootassigneelast_run_timestampA +d persist_js_state·has_pluto_hook_features§cell_id$ff4f977e-48df-4c12-845c-c245b4d39d6ddepends_on_disabled_cells§runtime|$published_object_keysdepends_on_skipped_cells§errored$aa450da4-fe84-4eea-b6c4-9820b7982437queued¤logsrunning¦outputbody'

With continuous policy parametrization, we can smoothly vary action selection probabilities by arbitrarily small amounts, something that was not possible with ϵ-greedy action selection. Therefore stronger convergence guarantees are possible for policy-gradient methods than for action-value methods.

In the episodic case, assuming some particular non-random starting state $s_0$, we define the performance of a policy parametrized by θ as:

$$\begin{align} J(\mathbf{\theta}) \doteq v_{\pi_\mathbf{\theta}}(s_0) \tag{13.4} \end{align}$$

where $v_{\pi_\mathbf{\theta}}$ is the true value function for $\pi_\mathbf{\theta}$, the policy determined by $\mathbf{\theta}$.

The policy gradient theorem provides an analytic expression for the gradient of performance with respect to the policy parameter that does not involve the derivative of the state distribution:

$$\begin{align} \nabla J(\mathbf{\theta}) \propto \sum_s \mu (s) \sum_a q_\pi (s, a) \nabla \pi (a|s,\mathbf{\theta}) \tag{13.5} \end{align}$$

where the gradients are column vectors of partial derivatives with respect to the components of $\mathbf{\theta}$. In the episodic case, the constant of proportionality is the average length of an episode, and in the continuing case it is 1. Here $\mu$ is the on-policy distribution under $\pi$.
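
As a sketch of how (13.5) leads to a sample-based update, the steps below are the standard ones, written without discounting and assuming states are encountered according to $\mu$ while following $\pi$. The sum over states becomes an expectation over $S_t$, and after multiplying and dividing by $\pi(a|S_t,\mathbf{\theta})$ the sum over actions becomes an expectation over the sampled action $A_t$:

$$\begin{align} \nabla J(\mathbf{\theta}) &\propto \sum_s \mu (s) \sum_a q_\pi (s, a) \nabla \pi (a|s,\mathbf{\theta}) \\ &= \mathbb{E}_\pi \left [ \sum_a q_\pi (S_t, a) \nabla \pi (a|S_t,\mathbf{\theta}) \right ] \\ &= \mathbb{E}_\pi \left [ \sum_a \pi (a|S_t,\mathbf{\theta}) q_\pi (S_t, a) \frac{\nabla \pi (a|S_t,\mathbf{\theta})}{\pi (a|S_t,\mathbf{\theta})} \right ] \\ &= \mathbb{E}_\pi \left [ q_\pi (S_t, A_t) \nabla \ln \pi (A_t|S_t,\mathbf{\theta}) \right ] \\ &= \mathbb{E}_\pi \left [ G_t \nabla \ln \pi (A_t|S_t,\mathbf{\theta}) \right ] \end{align}$$

The last step uses $\mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)$.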

mimetext/htmlrootassigneelast_run_timestampA apersist_js_state·has_pluto_hook_features§cell_id$aa450da4-fe84-4eea-b6c4-9820b7982437depends_on_disabled_cells§runtimeVpublished_object_keysdepends_on_skipped_cells§errored$bb1ef180-39ac-475f-beea-ef573e71a3bfqueued¤logsrunning¦outputbody3 mimetext/htmlrootassigneelast_run_timestampA :!Ѱpersist_js_state·has_pluto_hook_features§cell_id$bb1ef180-39ac-475f-beea-ef573e71a3bfdepends_on_disabled_cells§runtimeQpublished_object_keysdepends_on_skipped_cellsçerrored$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27queued¤logsrunning¦outputbodyelementsstep_rewardsprefixFloat32elements0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain 0.0text/plainmore0.0text/plaintypeArrayprefix_shortobjectid95970e74b5b370e0!application/vnd.pluto.tree+objecttotal_reward-919.0text/plaintotal_steps300000text/plainpolicy_parameterselementsprefixMatrix{Float32}elements4×4 Matrix{Float32}: -0.61371 -0.300138 -1.8085 -0.459389 1.14587 1.97024 -0.0989286 3.46696 0.37794 1.43625 -0.486325 1.01832 -1.23568 -0.360545 -0.007061 0.142772text/plain4×4 Matrix{Float32}: 1.49822 -1.71173 -0.432459 -0.420095 1.66924 -0.585423 -0.137278 0.266889 -0.305919 -0.21953 -0.214605 0.665994 -0.358615 -1.48606 -0.193572 -0.27517text/plainٚ3×4 Matrix{Float32}: 1.12812 -0.066766 1.00552 1.5663 0.149909 -0.769449 -0.710168 0.0421501 -1.57573 -0.817095 0.0582138 
-1.37842text/plaintypeArrayprefix_shortobjectidd9856d724a422ef9!application/vnd.pluto.tree+objectprefixVector{Float32}elementsprefixFloat32elementsmoretypeArrayprefix_shortobjectidfc5785410fe653ae!application/vnd.pluto.tree+objectprefixFloat32elementsmoretypeArrayprefix_shortobjectide9ead6b56b0827f7!application/vnd.pluto.tree+objectprefixFloat32elementsmoretypeArrayprefix_shortobjectide3ea0176df069512!application/vnd.pluto.tree+objecttypeArrayprefix_shortobjectid17e49c6ec7008c4d!application/vnd.pluto.tree+objecttypeTupleobjectid42fb8df5a41099a9!application/vnd.pluto.tree+objectvalue_parameterselementsprefixMatrix{Float32}elements4×4 Matrix{Float32}: -0.61371 -0.300138 -1.8085 -0.459389 1.14587 1.97024 -0.0989286 3.46696 0.37794 1.43625 -0.486325 1.01832 -1.23568 -0.360545 -0.007061 0.142772text/plain4×4 Matrix{Float32}: 1.49822 -1.71173 -0.432459 -0.420095 1.66924 -0.585423 -0.137278 0.266889 -0.305919 -0.21953 -0.214605 0.665994 -0.358615 -1.48606 -0.193572 -0.27517text/plain?1×4 Matrix{Float32}: 0.252601 0.459532 -0.215838 -0.295855text/plaintypeArrayprefix_shortobjectid22a18aa94fc7c923!application/vnd.pluto.tree+objectprefixVector{Float32}elementsprefixFloat32elementsmoretypeArrayprefix_shortobjectidfc5785410fe653ae!application/vnd.pluto.tree+objectprefixFloat32elementsmoretypeArrayprefix_shortobjectide9ead6b56b0827f7!application/vnd.pluto.tree+objectprefixFloat32elementsmoretypeArrayprefix_shortobjectidfea60baa07c69e8d!application/vnd.pluto.tree+objecttypeArrayprefix_shortobjectid216f1039cd628e70!application/vnd.pluto.tree+objecttypeTupleobjectid11d70edd308a4afb!application/vnd.pluto.tree+objectpolicy_functionπtext/plainpolicy_sample_actionπ_sampletext/plainestimate_state_valueestimate_state_valuetext/plainpolicy_and_valuepolicy_and_valuetext/plaintypeNamedTupleobjectida9ee11dbd1e5a8b1mime!application/vnd.pluto.tree+objectrootassignee$const cartpole_continuing_fcann_testlast_run_timestampA 3 
Rpersist_js_state·has_pluto_hook_features§cell_id$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27depends_on_disabled_cells§runtime_Q1apublished_object_keysdepends_on_skipped_cellsçerrored$5b868eba-c1af-49f6-8f93-79b78c319a6fqueued¤logsrunning¦outputbodyNreinforce_with_baseline_monte_carlo_control! (generic function with 2 methods)mimetext/plainrootassigneelast_run_timestampA #gV`persist_js_state·has_pluto_hook_features§cell_id$5b868eba-c1af-49f6-8f93-79b78c319a6fdepends_on_disabled_cells§runtimef rpublished_object_keysdepends_on_skipped_cells§errored$68469a40-7976-48b7-b7a1-eaa4c5f33a18queued¤logsrunning¦outputbodyCplot_mountaincar_continuous_values (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA @persist_js_state·has_pluto_hook_features§cell_id$68469a40-7976-48b7-b7a1-eaa4c5f33a18depends_on_disabled_cells§runtimeGɵpublished_object_keysdepends_on_skipped_cellsçerrored$2a586e46-66e4-461a-85c8-5817e4d1aa43queued¤logsrunning¦outputbody

$$\begin{flalign} \nabla J(\boldsymbol{\theta}) &= \nabla v_\pi(s_0) \\ &= \sum_s \sum_k \gamma^k \Pr \{ s_0 \rightarrow s, k, \pi \} f(s) \\ &= \sum_s \sum_k \gamma^k \frac{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}}{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}} \Pr \{ s_0 \rightarrow s, k, \pi \} f(s) \tag{multiply by 1}\\ &= \eta \sum_s \sum_k \gamma^k \frac{\Pr \{ s_0 \rightarrow s, k, \pi \}}{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}} f(s) \tag{average episode length}\\ &= \eta \sum_s \sum_k \gamma^k \mu_\pi(s, k) f(s) \tag{on policy distribution over states and steps}\\ &= \eta \mathbb{E}_\pi[ \gamma^k f(s) \mid S_0 = s_0, S_k = s] \tag{definition of expected value}\\ &\propto \mathbb{E}_\pi \left [ \gamma^k \sum_a \nabla \pi(a \vert s) q_\pi(s, a) \mid S_0 = s_0, S_k = s \right ] \tag{13.5}\\ \end{flalign}$$

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$2a586e46-66e4-461a-85c8-5817e4d1aa43depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$a206c759-3f6e-4003-8cba-5f6ce6742646queued¤logsrunning¦outputbody

Figure 13.1

REINFORCE on the short-corridor gridworld (Example 13.1). Performance varies with step size but can approach the ideal. The feature vector encodes every state identically.

mimetext/htmlrootassigneelast_run_timestampA d(persist_js_state·has_pluto_hook_features§cell_id$a206c759-3f6e-4003-8cba-5f6ce6742646depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1dqueued¤logsrunning¦outputbody

Consider the linear parameterization proposed with $h_a = \boldsymbol{\theta}^\top \mathbf{x}(s, a)$:

$$\frac{\partial{h_a}}{\partial{\theta_i}} = \mathbf{x}(s, a)_i \implies \nabla(\pi(a \vert s, \boldsymbol{\theta}))_i = \pi_a \left ( \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \right)$$

Now consider $\mathbf{h} = \boldsymbol{\theta} ^ \top \mathbf{x}$ with $h_a = \mathbf{h}_a$. Since the parameters are now represented as a matrix, we can index the partial derivatives of the gradient in the same way, so that $\nabla \left ( f(\boldsymbol{\theta}) \right )_{i, j} = \frac{\partial f(\boldsymbol{\theta})}{\partial \theta_{i, j}}$.

$$\frac{\partial{h_a}}{\partial{\theta_{i, j}}} = \begin{cases} \mathbf{x}(s)_i, & \text{ if } j = a \\ 0, & \text{ else } \end{cases} \implies \nabla(\pi(s, \boldsymbol{\theta})_a)_{i, j} = \pi_a \left ( \frac{\partial h_a}{\partial \theta_{i, j}} - \sum_k \pi_k \frac{\partial h_k}{\partial \theta_{i, j}} \right)=\pi_a \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases}$$
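
The matrix form of the softmax gradient can be checked against finite differences. This is a minimal sketch that assumes $\boldsymbol{\theta}$ is stored as a (features × actions) matrix so that $\mathbf{h} = \boldsymbol{\theta}^\top \mathbf{x}$. The function names `softmax`, `policy`, `softmax_gradient`, and `finite_difference`, and the test values, are illustrative and are not taken from the notebook.

```julia
# Numerically stable softmax over a vector of action preferences.
function softmax(h)
    e = exp.(h .- maximum(h))
    return e ./ sum(e)
end

# Action probabilities for a feature vector x under a parameter matrix θ.
policy(θ, x) = softmax(θ' * x)

# Analytic gradient of π_a with respect to θ: entry (i, j) is π_a * x_i * (1[j == a] - π_j).
function softmax_gradient(θ, x, a)
    probs = policy(θ, x)
    indicator = [j == a ? 1.0 : 0.0 for j in eachindex(probs)]
    return probs[a] .* (x * (indicator .- probs)')
end

# Central finite differences along each entry of the parameter matrix.
function finite_difference(f, θ; h = 1e-6)
    g = similar(θ)
    for idx in eachindex(θ)
        e = zeros(size(θ))
        e[idx] = h
        g[idx] = (f(θ .+ e) - f(θ .- e)) / (2h)
    end
    return g
end

θ = [0.2 -0.5 0.1; 0.7 0.3 -0.4]  # 2 features, 3 actions
x = [1.0, -2.0]
a = 2

analytic = softmax_gradient(θ, x, a)
numeric = finite_difference(t -> policy(t, x)[a], θ)
max_error = maximum(abs.(analytic .- numeric))
println("max abs error: ", max_error)
```

Because the probabilities sum to 1 for every $\boldsymbol{\theta}$, the gradients of all actions should also sum to the zero matrix, which gives a second check on the formula.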

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1ddepends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5queued¤logsrunning¦outputbodyC

Beta Distribution Alternative

mimetext/htmlrootassigneelast_run_timestampA jpersist_js_state·has_pluto_hook_features§cell_id$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5depends_on_disabled_cells§runtimeǵpublished_object_keysdepends_on_skipped_cells§errored$31db0f58-28e4-454f-9394-25565687266fqueued¤logsrunning¦outputbodyb mimetext/htmlrootassigneelast_run_timestampA 0persist_js_state·has_pluto_hook_features§cell_id$31db0f58-28e4-454f-9394-25565687266fdepends_on_disabled_cells§runtime aGpublished_object_keysdepends_on_skipped_cellsçerrored$822e4d69-2582-4956-858e-06ecb091e76aqueued¤logsrunning¦outputbody9display_cartpole_episode (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA 0`Ұpersist_js_state·has_pluto_hook_features§cell_id$822e4d69-2582-4956-858e-06ecb091e76adepends_on_disabled_cells§runtimeHFpublished_object_keysdepends_on_skipped_cellsçerrored$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580aqueued¤logsrunning¦outputbody
mimetext/htmlrootassigneelast_run_timestampA @Nװpersist_js_state·has_pluto_hook_features§cell_id$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580adepends_on_disabled_cells§runtimeL2published_object_keysdepends_on_skipped_cellsçerrored$05b0fcad-628b-48d2-aa24-f6f562dbb660queued¤logsrunning¦outputbodyQ

$$\begin{flalign} &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \sum_{a^{\prime \prime}} [ \nabla \pi(a^{\prime \prime} \vert s^{\prime \prime}) q_\pi(s^{\prime \prime}, a^{\prime \prime}) ] \right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) f(s^{\prime \prime}) \right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{s^{\prime \prime}} f(s^{\prime \prime}) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) p(s^{\prime \prime} \vert s^\prime, a^\prime) \right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \mathbb{E}_\pi[f(s^{\prime \prime}) \vert s^\prime] \right ] \\ &\gamma^2 \mathbb{E}_\pi[f(s^{\prime \prime}) \vert s] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) g(s^\prime) \right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \mathbb{E}[g(s^\prime) \vert s, a] \right ] \\ &\gamma^2 \mathbb{E}_\pi[g(s^\prime) \vert s]\\ &\gamma^2 \sum_{s^{\prime \prime}} \Pr(s \rightarrow s^{\prime \prime}, 2, \pi) f(s^{\prime \prime}) \end{flalign}$$

mimetext/htmlrootassigneelast_run_timestampA `persist_js_state·has_pluto_hook_features§cell_id$05b0fcad-628b-48d2-aa24-f6f562dbb660depends_on_disabled_cells§runtime{published_object_keysdepends_on_skipped_cells§errored$d2729657-d0bf-4d39-8ec7-f242a1ad48d6queued¤logsrunning¦outputbodyJcreate_continuous_action_mountaincar_beta (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA >tpersist_js_state·has_pluto_hook_features§cell_id$d2729657-d0bf-4d39-8ec7-f242a1ad48d6depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$5c11a92d-7496-4aba-af15-2537eac49dd7queued¤logsrunning¦outputbody!Array{Vector{T}, 1} where T<:Realmimetext/plainrootassigneelast_run_timestampA NQpersist_js_state·has_pluto_hook_features§cell_id$5c11a92d-7496-4aba-af15-2537eac49dd7depends_on_disabled_cells§runtime>cpublished_object_keysdepends_on_skipped_cells§errored$1753b5ed-c00b-4b60-b492-822180778e8cqueued¤logsrunning¦outputbody>update_linear_value_gradient! (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$1753b5ed-c00b-4b60-b492-822180778e8cdepends_on_disabled_cells§runtime qpublished_object_keysdepends_on_skipped_cells§errored$f7ede764-5ad8-426b-a805-cc21b622d977queued¤logsrunning¦outputbody5

Results Caching


$\lambda_\theta$: 0.75

$\lambda_\mathbf{w}$: 0.25

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:

update_fcann_eligibility_vector! (generic function with 1 method)

actor_critic_fcann_episodic_parameter_study (generic function with 1 method)

make_n_param_dist_params (generic function with 2 methods)

Squashed Gaussian Plot Parameters

$\mu$: 0.0

$\sigma$: 0.5

maximum value: 1.0

figure_13_1 (generic function with 1 method)

const reinforce_test

InexactError: Int64(NaN32)

Stacktrace:
[1] Int64 @ float.jl
[2] convert @ number.jl
[3] _round_convert @ rounding.jl
[4] round @ rounding.jl
[5] ceil @ rounding.jl
[6] #1114 @ Chapter-12/Chapter_12_Eligibility_Traces.jl
[7] (::var"#1114#1116"{…})(tiling::Int64) @ Main
[8] iterate @ generator.jl
[9] iterate @ iterators.jl
[10] iterate @ iterators.jl
[11] update_binary_feature_vector!(x::BinaryFeatureVector{…}, s::CartPoleState{…}, get_active_features::var"#1549#1552"{…}) @ Chapter-13/Chapter_13_Policy_Gradient_Methods.jl
[12] update_feature_vector! @ Chapter-13/Chapter_13_Policy_Gradient_Methods.jl
[13] π! @ Chapter-13/Chapter_13_Policy_Gradient_Methods.jl
[14] π @ Chapter-13/Chapter_13_Policy_Gradient_Methods.jl
[15] (::var"#π_sample#1309"{…})(s::CartPoleState{…}) @ Chapter-13/Chapter_13_Policy_Gradient_Methods.jl
[16] runepisode!(::Tuple{…}, mdp::ContinuousMDP{…}, π::var"#π_sample#1309"{…}; s0::CartPoleState{…}, a0::Float32, max_steps::Int64) @ Chapter-13/Chapter_13_Policy_Gradient_Methods.jl
[17] runepisode! @ Chapter-13/Chapter_13_Policy_Gradient_Methods.jl
[18] reinforce_with_baseline_monte_carlo_control!(policy_params::Matrix{…}, ∇lnπ::BinaryGaussianEligibilityVector{…}, value_params::Vector{…}, ∇v̂::BinaryFeatureVector{…}, mdp::ContinuousMDP{…}, update_action_distribution!::typeof(update_binary_action_preferences!), action_dist_params::Vector{…}, action_sampler::typeof(gaussian_action_sampler), update_eligibility_vector!::typeof(update_gaussian_eligibility_vector!), x::BinaryFeatureVector{…}, update_feature_vector!::var"#update_feature_vector!#1364"{…}, value_function::typeof(binary_value_function), update_value_gradient!::typeof(update_binary_value_gradient!), max_episodes::Int64; α_w::Float32, α_θ::Float32, γ::Float32, epkwargs::@Kwargs{}) @ Chapter-13/Chapter_13_Policy_Gradient_Methods.jl
[19] reinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions(mdp::ContinuousMDP{…}, get_active_features::Function, num_features::Int64, max_episodes::Int64; policy_params::Matrix{…}, value_params::Vector{…}, kwargs::@Kwargs{…}) @ Chapter-13/Chapter_13_Policy_Gradient_Methods.jl
[20] top-level scope @ Chapter-13/Chapter_13_Policy_Gradient_Methods.jl

const mountaincar_min_vals

(-1.2, -0.07)

Waiting to run parameter study

zero_params! (generic function with 2 methods)

actor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions (generic function with 2 methods)

Test One-step Actor-Critic

The following function calls run the one-step actor-critic algorithm on Example 13.1. The displayed output is the policy evaluated at the single state representation used for the problem: the two values are the probabilities of taking the left and right actions, respectively. If training has converged, the right-action probability should be the larger of the two, approaching about 60%.
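
As a rough check on the 60% figure, the optimum can be worked out in closed form. This sketch assumes the short-corridor dynamics of Example 13.1 as described in Sutton and Barto: three non-terminal states, a reward of $-1$ per step, the actions reversed in the second state, and no discounting. It writes $p$ for the probability of selecting the right action, which is the same in every state because all states share one feature representation. The symbols $v_1, v_2, v_3$ are labels introduced here for the state values under that policy and do not appear in the notebook code.

$$\begin{aligned} v_1 &= -1 + p\,v_2 + (1-p)\,v_1 \\ v_2 &= -1 + p\,v_1 + (1-p)\,v_3 \\ v_3 &= -1 + (1-p)\,v_2 \\ \Rightarrow\quad v_1 &= \frac{2p - 4}{p(1-p)} \\ \frac{d v_1}{d p} = 0 \quad &\Rightarrow\quad p^2 - 4p + 2 = 0 \quad\Rightarrow\quad p^\ast = 2 - \sqrt{2} \approx 0.586 \\ v_1(p^\ast) &\approx -11.66 \end{aligned}$$

Under these assumptions, a converged policy should put roughly 0.59 on the right action, and the start-state value it can reach is about $-11.7$.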

mimetext/htmlrootassigneelast_run_timestampA 8persist_js_state·has_pluto_hook_features§cell_id$1386ffdb-940d-4f1b-a872-4e38647b5335depends_on_disabled_cells§runtimekpublished_object_keysdepends_on_skipped_cells§errored$a893a87b-2d07-4db5-9d1a-9da8646216f4queued¤logsrunning¦outputbody>update_params_with_gradient! (generic function with 5 methods)mimetext/plainrootassigneelast_run_timestampA Npersist_js_state·has_pluto_hook_features§cell_id$a893a87b-2d07-4db5-9d1a-9da8646216f4depends_on_disabled_cells§runtime|published_object_keysdepends_on_skipped_cells§errored$2cbc972b-c685-4c1c-8a8d-9d58b197ad90queued¤logsrunning¦outputbody // We start by putting all the variable interpolation here at the beginning // Publish the plot object to JS let plot_obj = {"layout": {"xaxis": {"title": {"text": "Training Step"}}, "template": {"layout": {"coloraxis": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "xaxis": {"gridcolor": "white", "zerolinewidth": 2, "title": {"standoff": 15}, "ticks": "", "zerolinecolor": "white", "automargin": true, "linecolor": "white"}, "hovermode": "closest", "paper_bgcolor": "white", "geo": {"showlakes": true, "showland": true, "landcolor": "#E5ECF6", "bgcolor": "white", "subunitcolor": "white", "lakecolor": "white"}, "colorscale": {"sequential": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]], "diverging": [[0, "#8e0152"], [0.1, "#c51b7d"], [0.2, "#de77ae"], [0.3, "#f1b6da"], [0.4, "#fde0ef"], [0.5, "#f7f7f7"], [0.6, "#e6f5d0"], [0.7, "#b8e186"], [0.8, "#7fbc41"], [0.9, "#4d9221"], [1, "#276419"]], "sequentialminus": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], 
[0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}, "yaxis": {"gridcolor": "white", "zerolinewidth": 2, "title": {"standoff": 15}, "ticks": "", "zerolinecolor": "white", "automargin": true, "linecolor": "white"}, "shapedefaults": {"line": {"color": "#2a3f5f"}}, "hoverlabel": {"align": "left"}, "mapbox": {"style": "light"}, "polar": {"angularaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "bgcolor": "#E5ECF6", "radialaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}}, "autotypenumbers": "strict", "font": {"color": "#2a3f5f"}, "ternary": {"baxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "bgcolor": "#E5ECF6", "caxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "aaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}}, "annotationdefaults": {"arrowhead": 0, "arrowwidth": 1, "arrowcolor": "#2a3f5f"}, "plot_bgcolor": "#E5ECF6", "title": {"x": 0.05}, "scene": {"xaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}, "zaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}, "yaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}}, "colorway": ["#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52"]}, "data": {"barpolar": [{"type": "barpolar", "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}}], "carpet": [{"aaxis": {"gridcolor": "white", "endlinecolor": "#2a3f5f", "minorgridcolor": "white", "startlinecolor": "#2a3f5f", "linecolor": "white"}, "type": "carpet", "baxis": {"gridcolor": "white", "endlinecolor": "#2a3f5f", "minorgridcolor": "white", 
"startlinecolor": "#2a3f5f", "linecolor": "white"}}], "scatterpolar": [{"type": "scatterpolar", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "parcoords": [{"line": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "type": "parcoords"}], "scatter": [{"type": "scatter", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram2dcontour": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "histogram2dcontour", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "contour": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "contour", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scattercarpet": [{"type": "scattercarpet", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "mesh3d": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "mesh3d"}], "surface": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "surface", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scattermapbox": [{"type": "scattermapbox", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scattergeo": [{"type": "scattergeo", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram": [{"type": "histogram", "marker": {"colorbar": 
{"ticks": "", "outlinewidth": 0}}}], "pie": [{"type": "pie", "automargin": true}], "choropleth": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "choropleth"}], "heatmapgl": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "heatmapgl", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "bar": [{"type": "bar", "error_y": {"color": "#2a3f5f"}, "error_x": {"color": "#2a3f5f"}, "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}}], "heatmap": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "heatmap", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "contourcarpet": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "contourcarpet"}], "table": [{"type": "table", "header": {"line": {"color": "white"}, "fill": {"color": "#C8D4E3"}}, "cells": {"line": {"color": "white"}, "fill": {"color": "#EBF0F8"}}}], "scatter3d": [{"line": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "type": "scatter3d", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scattergl": [{"type": "scattergl", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram2d": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "histogram2d", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, 
"#fdca26"], [1.0, "#f0f921"]]}], "scatterternary": [{"type": "scatterternary", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scatterpolargl": [{"type": "scatterpolargl", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}]}}, "margin": {"l": 50, "b": 50, "r": 50, "t": 60}, "yaxis": {"title": {"text": "Reward Average"}}}, "config": {"showLink": false, "editable": false, "responsive": true, "staticPlot": false, "scrollZoom": true}, "frames": [], "data": [{"y": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.370569e-5, 5.313496e-5, 5.2576237e-5, 5.2029136e-5, 5.1493305e-5, 5.0965802e-5, 5.045154e-5, 4.9947554e-5, 4.9453538e-5, 4.89692e-5, 4.84919e-5, 4.8026126e-5, 4.7569214e-5, 4.7120913e-5, 9.336197e-5, 9.2494105e-5, 9.164643e-5, 9.081415e-5, 8.999685e-5, 8.919413e-5, 0.00013260255, 0.00013144058, 0.00017373177, 0.00017223562, 0.00017076502, 0.00016931217, 0.00020986359, 0.00020811654, 0.00024767802, 0.00028659162, 0.0003248599, 0.00036252316, 0.00039958445, 0.00043605804, 0.00047195784, 0.0004682744, 0.00050334923, 0.0005763246, 0.0006100583, 0.0006432815, 0.00067600555, 0.0007082153, 0.0007399734, 0.0007712649, 0.00080210005, 0.00083248876, 0.0008624097, 0.00089193333, 0.0009210386, 0.0009497344, 0.0010129589, 0.0010405828, 0.0010678609, 0.0010947656, 0.0011213048, 0.0011474857, 0.0011732761, 0.0012320603, 0.0012239092, 0.0012487266, 0.0013058666, 0.0013297872, 0.0013534416, 0.001376793, 0.001431662, 0.0014542235, 0.0014764552, 0.0015296725, 0.0015512053, 0.0016033052, 0.0016241228, 0.0016750928, 0.0016952232, 0.0017151111, 
0.001734761, 0.0017541773, 0.0018028668, 0.0018510357, 0.0018694318, 0.0018876144, 0.0019344593, 0.0019520037, 0.001997945, 0.0020149846, 0.002060052, 0.0020765518, 0.0020928092, 0.0021366929, 0.0021800923, 0.0021955704, 0.0022381744, 0.0022531082, 0.0022679411, 0.0023094688, 0.0023238421, 0.0023646315, 0.0023784984, 0.0023922815, 0.002432072, 0.002445433, 0.0024586557, 0.0024716787, 0.002510246, 0.0025229359, 0.002535497, 0.0025731584, 0.0025852765, 0.0025973378, 0.0026341293, 0.002670557, 0.0026820207, 0.0027177904, 0.0027289118, 0.0027641724, 0.0027749627, 0.0028096633, 0.0028200655, 0.0028542206, 0.0028880525, 0.0029215654, 0.0029313134, 0.0029642424, 0.0029969334, 0.003029322, 0.0030383943, 0.003070296, 0.003101836, 0.003110455, 0.0031189965, 0.0031499607, 0.0031806473, 0.0031886902, 0.0032189318, 0.003248906, 0.0032566122, 0.0032861587, 0.003315375, 0.0033226921, 0.0033515687, 0.0033801969, 0.0033871417, 0.0034153005, 0.0034432919, 0.0034710465, 0.0034774912, 0.003504869, 0.0035319442, 0.0035380549, 0.0035648406, 0.0035914055, 0.0036177516, 0.0036438075, 0.0036493375, 0.003675127, 0.0036804853, 0.0037059416, 0.003711059, 0.003736189, 0.0037412192, 0.0037660303, 0.0037906459, 0.0037953276, 0.003800047, 0.0038047296, 0.0038093757, 0.0038333463, 0.0038570575, 0.003861449, 0.0038849444, 0.0039082607, 0.0039124074, 0.003935369, 0.0039582313, 0.0039809216, 0.0039847344, 0.004007157, 0.0040293382, 0.0040329294, 0.0040364945, 0.0040767607, 0.004080139, 0.004101648, 0.0041230745, 0.004126249, 0.0041474323, 0.0041684634, 0.0041713663, 0.0041921614, 0.0041950336, 0.0042155976, 0.0042183665, 0.0042210417, 0.0042412984, 0.0042439485, 0.004263984, 0.0042665373, 0.004286282, 0.004288741, 0.004308347, 0.0043278197, 0.004330111, 0.0043493034, 0.004368439, 0.0043874453, 0.0043895054, 0.004408314, 0.004410217, 0.0044288305, 0.004447321, 0.0044656885, 0.00446745, 0.0044855573, 0.0045036194, 0.004521563, 0.00452312, 0.0045408844, 0.004542295, 0.004559883, 0.0045773573, 

Consider a feature vector $\mathbf{x}(s)$ and a function $\mathbf{h}(s, \boldsymbol{\theta})$ that produces a vector of action preferences. We would like to derive an expression for $\nabla \ln \pi (a \vert s, \boldsymbol{\theta})$ in the case of $\mathbf{\pi}(s, \boldsymbol{\theta}) = \sigma(\mathbf{h}(s, \boldsymbol{\theta}))$, where $\sigma(\mathbf{x})$ is the softmax function defined in section 13.1. Here I'm using the notation $\mathbf{\pi}(s, \boldsymbol{\theta})$ to refer to the vector of action probabilities at a given state. A subscript on a vector selects that element of the vector. To shorten expressions, I use the following abbreviations:

$$\begin{flalign} \mathbf{\pi} &\doteq \mathbf{\pi}(s, \boldsymbol{\theta}) \\ \mathbf{h} &\doteq \mathbf{h}(s, \boldsymbol{\theta}) \\ x_i &\doteq \mathbf{x}_i \text{ for all vectors} \\ \end{flalign}$$

Using these conventions, we previously derived an expression for the $i$th component of the gradient of the policy:

$$\nabla \left( \pi_a \right )_i = \pi_a \left ( \frac{\partial{h_a}}{\partial{\theta_i}} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \right )$$

We can use this expression to derive the components of the eligibility vector in general:

$$\begin{flalign} \nabla \left( \ln \mathbf{\pi}_a \right)_i &= \frac{\nabla \left( \pi_a \right )_i}{\pi_a}\\ &=\frac{\partial{h_a}}{\partial{\theta_i}} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \\ \end{flalign}$$
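As a sanity check on the expression above, here is a minimal sketch that specializes it to linear preferences and compares the result with finite differences. The linear form $\mathbf{h} = \boldsymbol{\theta}^\top \mathbf{x}$ (one weight column per action) and all function names are assumptions made for illustration; they are not taken from the notebook's own implementation.

```julia
# Softmax over a vector of action preferences, shifted by the maximum for numerical stability.
function softmax(h::AbstractVector)
    e = exp.(h .- maximum(h))
    return e ./ sum(e)
end

# Assumed linear preferences: column k of θ holds the weights for action k, so h = θ' * x.
preferences(θ::AbstractMatrix, x::AbstractVector) = θ' * x

# Eligibility ∇ ln π_a with respect to θ. For linear preferences,
# ∂h_k/∂θ[j, m] = x[j] when m == k and 0 otherwise, so the general expression
# reduces to x[j] * ((m == a) - π_m).
function eligibility(θ::AbstractMatrix, x::AbstractVector, a::Integer)
    probs = softmax(preferences(θ, x))
    onehot = [k == a ? 1.0 : 0.0 for k in eachindex(probs)]
    return x * (onehot .- probs)'   # outer product, same shape as θ
end

# Central finite differences of ln π_a, used only to check the analytic result.
function finite_difference_eligibility(θ, x, a; ϵ = 1e-6)
    logprob(w) = log(softmax(preferences(w, x))[a])
    g = zeros(size(θ))
    for i in eachindex(θ)
        θ_plus = copy(θ);  θ_plus[i] += ϵ
        θ_minus = copy(θ); θ_minus[i] -= ϵ
        g[i] = (logprob(θ_plus) - logprob(θ_minus)) / (2 * ϵ)
    end
    return g
end

θ = [0.1 -0.2 0.3; 0.4 0.0 -0.5]   # 2 features × 3 actions
x = [1.0, 2.0]
max_error = maximum(abs.(eligibility(θ, x, 2) .- finite_difference_eligibility(θ, x, 2)))
```

If the derivation is right, `max_error` should be on the order of the finite-difference error. Each row of the eligibility matrix also sums to zero, because the one-hot vector and the action probabilities both sum to one.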

Connection to Cross-Entropy Loss

Classification problems involve training a function to predict the class label of an input. The function returns a vector of class preferences which can be converted to a probability distribution by the soft-max function. The cross-entropy loss is a way of comparing this distribution with the desired output label to generate an error value.

Let's denote $\mathbf{p}(s)$ as the vector of true probabilities for an example $s$ and keep our output function as $\mathbf{\pi}(s, \boldsymbol{\theta}) = \sigma(\mathbf{h}(s, \boldsymbol{\theta}))$. The cross-entropy loss is defined as:

$$\mathcal{L}(\mathbf{p}, \mathbf{\pi}) = -\sum_i \mathbf{p}_i \ln \mathbf{\pi}_i$$

omitting $s$ and $\boldsymbol{\theta}$.

In a typical situation with a dataset, $\mathbf{p}(s)$ will be a one-hot vector representing the index of the label of the example in the dataset. Let's call that index $a$ such that $p_a = 1$ and $p_i = 0 \: \forall i \neq a$. The loss then simplifies to $\mathcal{L}(a, \mathbf{\pi}) = -\ln \mathbf{\pi}_a$. When we train with gradient descent on such a dataset, we must compute the gradient of this loss with respect to the parameters, $-\nabla \ln \pi_a$, which is just negative one times the eligibility vector for general parameterized approximation. So if we have a function that computes the gradient of the cross-entropy loss of the soft-max output for a vector function and a label index, we can replace the label index of the dataset with the desired action index $a$; that gradient then matches our desired gradient after multiplying by negative one.
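The sign relationship can be checked numerically at the level of the preference vector. The sketch below uses the standard result that the gradient of the soft-max cross-entropy loss with respect to the preferences is $\mathbf{\pi} - \mathbf{p}$ (valid when $\mathbf{p}$ sums to one); the gradient with respect to $\boldsymbol{\theta}$ then follows from the chain rule through $\mathbf{h}$. The function names are illustrative assumptions, not the notebook's API, and the soft-max helper is repeated so the block stands alone.

```julia
# Softmax over a preference vector, shifted by the maximum for numerical stability.
function softmax(h::AbstractVector)
    e = exp.(h .- maximum(h))
    return e ./ sum(e)
end

# Cross-entropy loss between target probabilities p and softmax(h).
cross_entropy(p, h) = -sum(p .* log.(softmax(h)))

# Gradient of the cross-entropy loss with respect to the preferences h.
cross_entropy_gradient(p, h) = softmax(h) .- p

# Gradient of ln π_a with respect to the preferences h.
function log_policy_gradient(a::Integer, h::AbstractVector)
    onehot = [k == a ? 1.0 : 0.0 for k in eachindex(h)]
    return onehot .- softmax(h)
end

h = [0.2, -1.0, 0.7]      # example action preferences
a = 3                     # desired action, used in place of a dataset label
p = [0.0, 0.0, 1.0]       # one-hot target for index a

# The two gradients should differ only by a factor of negative one.
gradients_match = log_policy_gradient(a, h) ≈ -cross_entropy_gradient(p, h)
```

In practice this means a cross-entropy gradient routine from a classification setup can be reused for the policy update by passing the selected action as the label and flipping the sign.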


In the episodic case, we provided a reward of -1 per step and then considered an episode finished when a failure state was reached. In the continuing case, the step function will provide a reward of 0 unless a failure occurs in which case it will provide a reward of -1 and then initialize a new state.
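The continuing reward scheme described above can be sketched as a wrapper around an episodic transition function. The names `make_continuing_step`, `step`, `isfailure`, and `initialize_state`, and their signatures, are assumptions for illustration; the notebook's actual step function may be structured differently.

```julia
# Hypothetical wrapper: `step(s, a)` returns the next state, `isfailure(s)` reports
# whether a state is a failure state, and `initialize_state()` samples a fresh start state.
function make_continuing_step(step, isfailure, initialize_state)
    function continuing_step(s, a)
        s_next = step(s, a)
        if isfailure(s_next)
            # Failure: reward of -1, then restart from a newly initialized state.
            return (-1.0f0, initialize_state())
        else
            # No failure: reward of 0 and the task simply continues.
            return (0.0f0, s_next)
        end
    end
    return continuing_step
end

# Toy usage: a walk on the integers that fails when it leaves [-2, 2].
toy_step(s, a) = s + a
toy_failure(s) = abs(s) > 2
toy_initialize() = 0

cstep = make_continuing_step(toy_step, toy_failure, toy_initialize)
reward_fail, state_fail = cstep(2, 1)   # steps outside the allowed range
reward_ok, state_ok = cstep(0, 1)       # stays inside the allowed range
```

Because the wrapper never signals termination, the learning algorithm sees a single unbroken stream of experience in which failures show up only as occasional -1 rewards.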


Note that for the corridor problem, the state-value learning rates have very little impact, and learning is most effective when $\lambda_{\boldsymbol{\theta}}$ is close to 1, which mimics REINFORCE with baseline.


Exercise 13.5

A Bernoulli-logistic unit is a stochastic neuron-like unit used in some ANNs. Its input at time $t$ is a feature vector $\mathbf{x}(S_t)$; its output, $A_t$, is a random variable having two values, 0 and 1, with $\Pr \{A_t=1 \}=P_t$ and $\Pr\{A_t=0\}=1-P_t$ (the Bernoulli distribution). Let $h(s, 0, \mathbf{\theta})$ and $h(s, 1, \mathbf{\theta})$ be the preferences in state $s$ for the unit's two actions given by policy parameter $\mathbf{\theta}$. Assume that the difference between the action preferences is given by a weighted sum of the unit's input vector, that is, assume that $h(s, 1, \mathbf{\theta})-h(s,0, \mathbf{\theta}) = \mathbf{\theta}^\top \mathbf{x}(s)$, where $\mathbf{\theta}$ is the unit's weight vector.

  1. Show that if the exponential soft-max distribution (13.2) is used to convert action preferences to policies, then ${P_t = \pi(1|S_t, \theta_t)=1/(1+\exp(-\theta_t^\top\mathbf{x}(S_t)))}$ (the logistic function).

  2. What is the Monte-Carlo REINFORCE update of $\theta_t$ to $\theta_{t+1}$ upon receipt of return $G_t$?

  3. Express the eligibility $\nabla \ln \pi(a|s, \theta)$ for a Bernoulli-logistic unit, in terms of $a$, $\mathbf{x}(s)$, and $\pi(a|s, \theta)$ by calculating the gradient.

Hint for part 3: Define $P=\pi(1|s,\theta)$ and compute the derivative of the logarithm, for each action, using the chain rule on $P$. Combine the two results into one expression that depends on $a$ and $P$, and then use the chain rule again, this time on $\theta^\top\mathbf{x}(s)$, noting that the derivative of the logistic function $f(x)=1/(1+e^{-x})$ is $f(x)(1-f(x))$.
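Following the hint, one candidate closed form for the eligibility is $\nabla \ln \pi(a|s,\theta) = (a - P)\,\mathbf{x}(s)$ with $P = \pi(1|s,\theta)$. The sketch below is only a numerical sanity check of that candidate against a finite-difference gradient. The function names `logistic`, `bernoulli_policy`, `bernoulli_eligibility`, and `numerical_eligibility` are hypothetical and are not taken from the notebook.

```julia
# Logistic function: P = π(1 | s, θ) for a Bernoulli-logistic unit.
logistic(z) = 1 / (1 + exp(-z))

# Probability of action a ∈ {0, 1} given weights θ and features x (hypothetical helper).
function bernoulli_policy(a::Integer, θ::AbstractVector, x::AbstractVector)
    p = logistic(sum(θ .* x))
    return a == 1 ? p : 1 - p
end

# Candidate closed form for the eligibility vector: (a - P) x(s).
function bernoulli_eligibility(a::Integer, θ::AbstractVector, x::AbstractVector)
    p = logistic(sum(θ .* x))
    return (a - p) .* x
end

# Central finite-difference gradient of ln π(a | s, θ), used only as a check.
function numerical_eligibility(a::Integer, θ::AbstractVector, x::AbstractVector; ϵ = 1e-6)
    g = zeros(length(θ))
    for i in eachindex(θ)
        θ_hi = copy(θ); θ_hi[i] += ϵ
        θ_lo = copy(θ); θ_lo[i] -= ϵ
        g[i] = (log(bernoulli_policy(a, θ_hi, x)) - log(bernoulli_policy(a, θ_lo, x))) / (2ϵ)
    end
    return g
end
```

Comparing `bernoulli_eligibility(a, θ, x)` with `numerical_eligibility(a, θ, x)` for both actions and a few arbitrary `θ` and `x` is a quick way to catch a sign error in the derivation.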


Continuing Mountain Car Example


Notes on Probability Distributions

In order to prove the policy gradient theorem, we must manipulate terms that are probability distributions over states and visit steps. To build intuition for these distributions, we can visualize how data is averaged using the short corridor example. The following function simulates many episodes in the environment with a stochastic policy that has some probability of moving left regardless of the state. The simulation keeps track of the visit count for a given state and the visit step. The result of the accumulation is a matrix in which each column holds, for one step of an episode, the number of times each state was visited on that step across all of the simulated episodes. If we divide each count by the number of episodes simulated, then we have an unbiased estimate of the probability of visiting a state on each step $k$ of an episode: $\Pr \{ S_k = s \mid \pi \}$ such that $\sum_{s \in \mathcal{S}^+} \Pr \{ S_k = s \mid \pi \} = 1$.

Note that this distribution is normalized only when summed over all states including the terminal states, a set denoted in episodic problems by $\mathcal{S}^+$. The set $\mathcal{S}$ excludes all terminal states, so if we sum the above probabilities over that set on a given step $k$, we obtain the probability that the episode has NOT yet terminated by step $k$: $\sum_{s \in \mathcal{S}} \Pr \{ S_k = s \mid \pi \} = \Pr \{ T \gt k \mid \pi \}$, where $T$ denotes the step of termination for a particular episode.
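A minimal sketch of such a simulation is given here for illustration. It assumes the short corridor layout of Example 13.1 (three nonterminal states, a terminal state to the right, and reversed actions in the second state). The names `corridor_step` and `visit_probabilities`, the episode count, and the step limit are all hypothetical choices rather than the notebook's own implementation.

```julia
# Short corridor: nonterminal states 1:3, terminal state 4.
# In state 2 the effect of the two actions is reversed.
function corridor_step(s::Int, go_left::Bool)
    move_left = (s == 2) ? !go_left : go_left
    return move_left ? max(s - 1, 1) : s + 1   # moving left from state 1 does nothing
end

# Estimate Pr{S_k = s | π} for a policy that picks "left" with probability p_left
# in every state. Row s is a state (row 4 is terminal); column j is step k = j - 1.
function visit_probabilities(p_left::Real; num_episodes = 10_000, max_steps = 50)
    terminal = 4
    counts = zeros(Int, terminal, max_steps)
    for _ in 1:num_episodes
        s = 1
        for k in 1:max_steps
            counts[s, k] += 1
            s == terminal && continue   # absorbed: keep counting the terminal state
            s = corridor_step(s, rand() < p_left)
        end
    end
    return counts ./ num_episodes
end
```

Because terminated episodes keep contributing to the terminal row, every column sums to 1, while summing only rows `1:3` of a column gives an estimate of $\Pr \{ T \gt k \mid \pi \}$.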


Policy Function Output


Example 13.1 Short corridor gridworld


$\lambda_\theta$: 0.95

$\lambda_\mathbf{w}$: 0.05

$\alpha_{\overline{r}}$:

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:

Non-linear Features

This version of REINFORCE uses non-linear features in a fully connected neural network. The number of parameters no longer matches the size of the input feature vector, but a mapping from state to feature vector is still required. One must specify the size of the feature vector, a function that updates the values in a feature vector given a state, and the size of each hidden layer in the neural network. Additional keyword arguments are available to change the construction of the neural network such as adding residual layers.


Waiting to run parameter study


Alternative Parameterization

If the action space is small enough, it may be convenient to create a function that simply outputs the preferences for all of the actions at a given state. Let $N_a$ be the number of available actions. We then consider the vector function $\mathbf{h}(s, \boldsymbol{\theta}) \in \mathbb{R}^{N_a}$, whose components $h_1, h_2, h_3, \dots, h_{N_a}$ are the action preferences at each state. With this style of parameterization, we need only compute state feature vectors $\mathbf{x}(s) \in \mathbb{R}^d$.

Similarly, the policy function would also be a vector function. In order to compute the soft-max, we must evaluate the denominator of (13.2), which requires knowing all of the action preferences. In practice, the soft-max is only defined as a function on vectors, so consider the following notation to simplify expressions, where we use the symbol $\boldsymbol{\sigma}$ to denote the soft-max vector function.

$$\sigma(\mathbf{x}) = \frac{e^{\mathbf{x}}}{\sum_j{e^{x_j}}} \text{ where we abuse the notation } e^{\mathbf{x}} = \begin{pmatrix} e^{x_1} \\ e^{x_2} \\ \vdots \\ e^{x_n} \end{pmatrix}$$

Using this notation, we can write down the policy function under this new parameterization: $\boldsymbol{\pi}(s, \boldsymbol{\theta}) = \boldsymbol{\sigma}(\mathbf{h}(s, \boldsymbol{\theta}))$. What do linear preferences look like with this parameterization? Instead of a parameter vector $\boldsymbol{\theta} \in \mathbb{R}^{d^\prime}$, we have a parameter matrix $\boldsymbol{\theta} \in \mathbb{R}^{d \times N_a}$, and the vector of preferences is the result of a matrix-vector multiplication: $\mathbf{h}(s, \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s) \in \mathbb{R}^{N_a}$. Subscript notation is used to refer to single preference values, so $\mathbf{h}_i$ is the $i$th element of $\mathbf{h}$, the preference for the $i$th action, equivalent to $h_i$.
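As a rough illustration of this parameterization, the sketch below computes the policy vector from a $d \times N_a$ parameter matrix and a state feature vector. The function names `softmax`, `linear_preferences`, and `linear_softmax_policy` are hypothetical, and subtracting the maximum preference before exponentiating is a common numerical-stability choice assumed here rather than something stated in the text.

```julia
# Soft-max over a vector of preferences.
function softmax(h::AbstractVector)
    z = exp.(h .- maximum(h))   # subtracting the max leaves the result unchanged
    return z ./ sum(z)
end

# Linear preferences: θ is d × N_a and x is the d-dimensional state feature vector.
linear_preferences(θ::AbstractMatrix, x::AbstractVector) = θ' * x

# Policy vector σ(h(s, θ)); element i is the probability of selecting action i.
linear_softmax_policy(θ::AbstractMatrix, x::AbstractVector) =
    softmax(linear_preferences(θ, x))
```

With `θ = zeros(d, N_a)` every preference is zero, so the policy is uniform over the $N_a$ actions regardless of the state features.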


13.6 Policy Gradient for Continuing Problems

In the continuing case, we need to define the average reward per time step as discussed in Section 10.3. In the update procedure, $\delta$ is calculated differently: the reward enters as its difference from a long-running estimate of this average, $\overline{R}$, and no discounting is applied. The value function in this case also learns reward differences from the average, which is assumed to have a well-defined expected value under the stationary state distribution for the policy. This shift in the value function will not affect performance, since shifting the value function up or down by a constant does not affect the learned policy. To implement this, we need a new learning rate $\alpha_{\overline{R}}$, which controls how quickly the reward average updates. This replaces $\gamma$ in a sense, since we no longer discount the rewards of future time steps.
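The core arithmetic of one such update can be sketched as follows, assuming the differential TD error $\delta = R - \overline{R} + \hat v(S^\prime) - \hat v(S)$ and the average update $\overline{R} \leftarrow \overline{R} + \alpha_{\overline{R}} \delta$. The function name `differential_td_step` and its argument names are hypothetical, and the value estimates are passed in as plain numbers rather than computed from a parameterized function.

```julia
# One differential TD step for the continuing (average-reward) setting.
# r: reward, avg_reward: running estimate of the average reward,
# v_s and v_next: current value estimates of S and S′, α_avg: average-reward learning rate.
function differential_td_step(r, avg_reward, v_s, v_next, α_avg)
    δ = r - avg_reward + v_next - v_s        # no discount factor γ appears
    new_avg_reward = avg_reward + α_avg * δ  # the average moves toward the observed rewards
    return (δ = δ, avg_reward = new_avg_reward)
end
```

The returned `δ` would then drive the value and policy parameter updates in the same way as in the episodic actor-critic case.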


x position: 0.0

pole angle: 0.0012229534

x velocity: 0.0

pole angular velocity: 0.0


We can define our policy as a normal (Gaussian) probability density over actions for a given state and parameter vector.

$$\pi(a|s, \mathbf{\theta}) \doteq \frac{1}{\sigma(s, \mathbf{\theta}) \sqrt{2\pi}} \exp \left ( - \frac{(a-\mu(s, \mathbf{\theta}))^2}{2\sigma(s, \mathbf{\theta})^2} \right ) \tag{13.19}$$

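This policy requires μ and σ to be functions of the state and the parameter vector. Splitting the parameter vector into two parts, $\mathbf{\theta}_\mu$ and $\mathbf{\theta}_\sigma$, we can model the mean as a linear function of the features and the standard deviation as the exponential of a linear function, which keeps it positive: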

$$\mu(s, \mathbf{\theta}) \doteq \mathbf{\theta}_\mu ^\top \mathbf{x}_\mu(s) \text{ and } \sigma(s, \mathbf{\theta}) \doteq \exp{( \mathbf{\theta}_\sigma ^ \top \mathbf{x}_\sigma (s))} \tag{13.20}$$

where $\mathbf{x}_\mu(s)$ and $\mathbf{x}_\sigma(s)$ are state feature vectors. With these formulas we can apply the previous algorithms to solve environments with real-valued actions.
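To make equations (13.19) and (13.20) concrete, here is a minimal sketch of the Gaussian policy with its log-density gradients. The function names are hypothetical and not the notebook's own. The gradient expressions follow from differentiating $\ln \pi$ with respect to $\mathbf{\theta}_\mu$ and $\mathbf{\theta}_\sigma$ under the parameterization above.

```julia
using LinearAlgebra

# Mean and standard deviation from equation (13.20)
gaussian_params(θ_μ, θ_σ, x_μ, x_σ) = (μ = dot(θ_μ, x_μ), σ = exp(dot(θ_σ, x_σ)))

# Density from equation (13.19)
gaussian_pdf(a, μ, σ) = exp(-(a - μ)^2 / (2 * σ^2)) / (σ * sqrt(2π))

# Draw a real-valued action from the policy
sample_action(μ, σ) = μ + σ * randn()

# Gradients of ln π(a|s, θ) with respect to each block of parameters
function log_policy_gradients(a, θ_μ, θ_σ, x_μ, x_σ)
    p = gaussian_params(θ_μ, θ_σ, x_μ, x_σ)
    grad_μ = ((a - p.μ) / p.σ^2) .* x_μ
    grad_σ = ((a - p.μ)^2 / p.σ^2 - 1) .* x_σ
    return grad_μ, grad_σ
end

x = [1.0, 1.0]                       # illustrative feature vector used for both μ and σ
θ_μ, θ_σ = [0.5, -0.5], [0.0, 0.0]   # gives μ = 0 and σ = 1
params = gaussian_params(θ_μ, θ_σ, x, x)
grad_μ, grad_σ = log_policy_gradients(2.0, θ_μ, θ_σ, x, x)
```

These gradients are what the earlier algorithms need in place of the soft-max gradient: the $\mu$ gradient pushes the mean toward actions with positive δ, and the $\sigma$ gradient widens or narrows the distribution depending on whether the action fell more or less than one standard deviation from the mean.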


Figure 13.2

Adding a baseline to REINFORCE can make it learn much faster, as illustrated here on the short-corridor gridworld (Example 13.1). Here the approximate state-value function used in the baseline is $\hat v(s, \mathbf{w}) = w$. That is, the feature vector and the state-value parameter vector each have a single component, so the baseline is one learned scalar shared by all states.
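The following sketch shows how one episode's update might look with this scalar baseline. It is an illustration under stated assumptions rather than the notebook's implementation: the function name, the step sizes, and the precomputed list of log-policy gradients `grad_logp` are all hypothetical.

```julia
# REINFORCE with a scalar baseline v̂(s, w) = w, applied to one finished episode.
# `rewards[t]` is the reward following step t and `grad_logp[t]` is ∇ln π(A_t|S_t, θ).
function reinforce_baseline_update!(θ, w, rewards, grad_logp; α_θ = 0.1, α_w = 0.5, γ = 1.0)
    # Compute the return from every step by accumulating backwards
    returns = similar(rewards, Float64)
    G = 0.0
    for t in reverse(eachindex(rewards))
        G = rewards[t] + γ * G
        returns[t] = G
    end
    for t in eachindex(rewards)
        δ = returns[t] - w[1]       # return relative to the baseline
        w[1] += α_w * δ             # move the baseline toward the observed return
        θ .+= (α_θ * γ^(t - 1) * δ) .* grad_logp[t]
    end
    return θ, w
end

θ = [0.0, 0.0]
w = [0.0]   # single-component weight vector
reinforce_baseline_update!(θ, w, [-1.0, -1.0], [[1.0, -1.0], [0.5, -0.5]])
```

In this toy episode the first step sees a return of -2 against a baseline of 0, so both θ and w move. By the second step the baseline already equals the return of -1, so δ is zero and nothing changes, which is the variance-reduction effect the baseline is meant to provide.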


$\lambda_\theta$: 0.1

$\lambda_\mathbf{w}$: 0.9

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:


Binary Features

This version of REINFORCE uses binary feature vectors, so one needs to specify the total number of features and a function that returns the indices of the active features for a given state.
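A minimal sketch of why binary features are convenient: action preferences reduce to sums over the active rows of the parameter matrix, and the policy-gradient step only touches those rows. The feature function `active_features` and every other name here are hypothetical stand-ins, and a soft-max policy over linear preferences is assumed.

```julia
# Hypothetical feature function: each state activates two adjacent features.
active_features(s) = (s, s + 1)

# With binary features, each preference is the sum of the active rows of θ
function action_preferences(θ, active)
    h = zeros(size(θ, 2))
    for i in active, a in axes(θ, 2)
        h[a] += θ[i, a]
    end
    return h
end

function softmax(h)
    e = exp.(h .- maximum(h))   # subtract the maximum for numerical stability
    return e ./ sum(e)
end

# Policy-gradient step scaled by δ; only rows of active features change
function policy_update!(θ, active, action, δ, α)
    probs = softmax(action_preferences(θ, active))
    for i in active, k in axes(θ, 2)
        θ[i, k] += α * δ * ((k == action) - probs[k])
    end
    return θ
end

num_features, num_actions = 4, 2
θ = zeros(num_features, num_actions)
policy_update!(θ, active_features(1), 1, 1.0, 0.5)
new_probs = softmax(action_preferences(θ, active_features(1)))
```

After one positive update for action 1 in state 1, the probability of that action rises above one half, while the rows for inactive features 3 and 4 remain zero.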


Probability distributions for the short-corridor gridworld example, with the probability of the left action selected below


Soft-max notation and gradients

To use policy gradient methods, we must be able to take the gradient of the policy function for every state-action pair. Using the above notation and treating the policy as a vector function, we need the gradient of the soft-max applied to a vector function at a particular index. Each gradient is a column vector of length $d$, where $d$ is the number of parameters. There is a separate gradient for every index of the vector output, one for each action, for a total of $N_a$ gradients. To simplify expressions, $\mathbf{h}(s, \boldsymbol{\theta})$ will be written as $\mathbf{h}$ and $\mathbf{\pi} = \mathbf{\sigma}(\mathbf{h})$. Our desired gradient is with respect to a particular component of $\mathbf{\sigma}(\mathbf{h})$, denoted $\mathbf{\sigma}(\mathbf{h})_a$, where $a$ represents the action index. The gradient itself is the vector of partial derivatives with respect to the parameters $\theta$; its $i$th component is $\nabla(f(\theta))_i = \frac{\partial f(\theta)}{\partial \theta_i}$. Computing the full gradient requires every component, and the expression for a single component is derived below.

$$\begin{align} \nabla \left ( \sigma(\mathbf{h})_a \right )_i &= \frac{\partial}{\partial \theta_i} \left ( \frac{e^{h_a}}{\sum_k{e^{h_k}}} \right ) \\ &=\left ( \frac{1}{{\sum_k{e^{h_k}}}} \right )^2 \left ( e^{h_a} \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{e^{h_k}} - e^{h_a} \sum_k{e^{h_k} \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \\ &=\left ( \frac{1}{{\sum_k{e^{h_k}}}} \right )^2 e^{h_a} \left ( \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{e^{h_k}} - \sum_k{e^{h_k} \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \tag{factoring out exponential term}\\ &=\left ( \frac{e^{h_a}}{{\sum_k{e^{h_k}}}} \right ) \left ( \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{\frac{e^{h_k}}{\sum_l e^{h_l}}} - \sum_k{\frac{e^{h_k}}{\sum_l e^{h_l}} \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \tag{distributing squared fraction}\\ &=\pi_a \left ( \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{\pi_k} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \tag{definition of policy function}\\ &=\pi_a \left ( \frac{\partial{h_a}}{\partial{\theta_i}} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \end{align}$$

The final step follows from the fact that the policy function is a probability distribution, so its sum over all actions is always 1.
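One way to sanity-check the final expression is to compare it against a finite-difference estimate. The sketch below assumes linear preferences $h_a = \boldsymbol{\theta}^\top \mathbf{x}_a$, so that $\frac{\partial h_a}{\partial \theta_i}$ is simply the $i$th entry of $\mathbf{x}_a$. The feature matrix `X` (one column per action) and all function names are illustrative assumptions, not part of the notebook.

```julia
function softmax(h)
    e = exp.(h .- maximum(h))   # subtract the maximum for numerical stability
    return e ./ sum(e)
end

# Policy for linear preferences h = X'θ, where column a of X holds the features of action a
policy(θ, X) = softmax(X' * θ)

# Analytic gradient from the derivation: π_a (∂h_a/∂θ - Σ_k π_k ∂h_k/∂θ)
function softmax_gradient(θ, X, a)
    p = policy(θ, X)
    return p[a] .* (X[:, a] .- X * p)
end

# Central finite-difference estimate of the same gradient
function finite_difference_gradient(θ, X, a; ϵ = 1e-6)
    map(eachindex(θ)) do i
        θ_plus = copy(θ)
        θ_plus[i] += ϵ
        θ_minus = copy(θ)
        θ_minus[i] -= ϵ
        (policy(θ_plus, X)[a] - policy(θ_minus, X)[a]) / (2 * ϵ)
    end
end

X = [1.0 0.0 1.0; 0.0 1.0 1.0]   # d = 2 parameters, N_a = 3 actions
θ = [0.3, -0.2]
analytic = softmax_gradient(θ, X, 2)
numeric = finite_difference_gradient(θ, X, 2)
gradient_sum = sum(softmax_gradient(θ, X, a) for a in 1:3)
```

The analytic and numeric gradients should agree to within the finite-difference error. Summing the gradients over all actions should give the zero vector, since the probabilities always sum to 1.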


Normal Distribution Plot with

$\mu$: 0.0

$\sigma$: 1.0


Soft-max Implementation

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$dcb306ae-a1b1-43d6-ba6e-e38668838689depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$54f559b6-8a62-4a42-894d-c56e41d5ebefqueued¤logsrunning¦outputbody152×3 Matrix{Float64}: 1.0 0.0 0.0 0.499665 0.500335 0.0 0.500081 0.250068 0.249851 0.374974 0.374815 0.125075 0.375395 0.250027 0.187132 0.312784 0.280977 0.124948 0.296775 0.219084 0.140399 ⋮ 0.0 0.0 1.0e-6 0.0 1.0e-6 0.0 1.0e-6 0.0 0.0 1.0e-6 0.0 0.0 0.0 1.0e-6 0.0 0.0 0.0 1.0e-6mimetext/plainrootassigneeconst corridor_state_countslast_run_timestampA # eIpersist_js_state·has_pluto_hook_features§cell_id$54f559b6-8a62-4a42-894d-c56e41d5ebefdepends_on_disabled_cells§runtimesypublished_object_keysdepends_on_skipped_cellsçerrored$f545c800-0bf3-491f-9d7d-42341cfdb573queued¤logsrunning¦outputbodyFform_state_continuous_policy_function (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA #e2 persist_js_state·has_pluto_hook_features§cell_id$f545c800-0bf3-491f-9d7d-42341cfdb573depends_on_disabled_cells§runtimeѵpublished_object_keysdepends_on_skipped_cells§errored$8b35661b-5075-4d63-bc31-044407f99acfqueued¤logsrunning¦outputbodyelementsaction_probabilitiesprefixFloat32elements0.345429text/plain0.654571text/plaintypeArrayprefix_shortobjectidd2e340ba29bbb4ca!application/vnd.pluto.tree+objectstate_value_estimate0.00728306text/plaintypeNamedTupleobjectiddb1ef684d440398mime!application/vnd.pluto.tree+objectrootassigneelast_run_timestampA +fTpersist_js_state·has_pluto_hook_features§cell_id$8b35661b-5075-4d63-bc31-044407f99acfdepends_on_disabled_cells§runtime // We start by putting all the variable interpolation here at the beginning // Publish the plot object to JS let plot_obj = {"layout": {"template": {"layout": {"coloraxis": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "xaxis": {"gridcolor": "white", "zerolinewidth": 2, "title": {"standoff": 15}, "ticks": "", 
"zerolinecolor": "white", "automargin": true, "linecolor": "white"}, "hovermode": "closest", "paper_bgcolor": "white", "geo": {"showlakes": true, "showland": true, "landcolor": "#E5ECF6", "bgcolor": "white", "subunitcolor": "white", "lakecolor": "white"}, "colorscale": {"sequential": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]], "diverging": [[0, "#8e0152"], [0.1, "#c51b7d"], [0.2, "#de77ae"], [0.3, "#f1b6da"], [0.4, "#fde0ef"], [0.5, "#f7f7f7"], [0.6, "#e6f5d0"], [0.7, "#b8e186"], [0.8, "#7fbc41"], [0.9, "#4d9221"], [1, "#276419"]], "sequentialminus": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}, "yaxis": {"gridcolor": "white", "zerolinewidth": 2, "title": {"standoff": 15}, "ticks": "", "zerolinecolor": "white", "automargin": true, "linecolor": "white"}, "shapedefaults": {"line": {"color": "#2a3f5f"}}, "hoverlabel": {"align": "left"}, "mapbox": {"style": "light"}, "polar": {"angularaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "bgcolor": "#E5ECF6", "radialaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}}, "autotypenumbers": "strict", "font": {"color": "#2a3f5f"}, "ternary": {"baxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "bgcolor": "#E5ECF6", "caxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "aaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}}, "annotationdefaults": {"arrowhead": 0, "arrowwidth": 1, "arrowcolor": "#2a3f5f"}, "plot_bgcolor": "#E5ECF6", "title": {"x": 0.05}, 
"scene": {"xaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}, "zaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}, "yaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}}, "colorway": ["#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52"]}, "data": {"barpolar": [{"type": "barpolar", "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}}], "carpet": [{"aaxis": {"gridcolor": "white", "endlinecolor": "#2a3f5f", "minorgridcolor": "white", "startlinecolor": "#2a3f5f", "linecolor": "white"}, "type": "carpet", "baxis": {"gridcolor": "white", "endlinecolor": "#2a3f5f", "minorgridcolor": "white", "startlinecolor": "#2a3f5f", "linecolor": "white"}}], "scatterpolar": [{"type": "scatterpolar", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "parcoords": [{"line": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "type": "parcoords"}], "scatter": [{"type": "scatter", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram2dcontour": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "histogram2dcontour", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "contour": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "contour", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], 
[0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scattercarpet": [{"type": "scattercarpet", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "mesh3d": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "mesh3d"}], "surface": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "surface", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scattermapbox": [{"type": "scattermapbox", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scattergeo": [{"type": "scattergeo", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram": [{"type": "histogram", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "pie": [{"type": "pie", "automargin": true}], "choropleth": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "choropleth"}], "heatmapgl": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "heatmapgl", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "bar": [{"type": "bar", "error_y": {"color": "#2a3f5f"}, "error_x": {"color": "#2a3f5f"}, "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}}], "heatmap": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "heatmap", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], 
[0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "contourcarpet": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "contourcarpet"}], "table": [{"type": "table", "header": {"line": {"color": "white"}, "fill": {"color": "#C8D4E3"}}, "cells": {"line": {"color": "white"}, "fill": {"color": "#EBF0F8"}}}], "scatter3d": [{"line": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "type": "scatter3d", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scattergl": [{"type": "scattergl", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram2d": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "histogram2d", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scatterternary": [{"type": "scatterternary", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scatterpolargl": [{"type": "scatterpolargl", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}]}}, "margin": {"l": 50, "b": 50, "r": 50, "t": 60}}, "config": {"showLink": false, "editable": false, "responsive": true, "staticPlot": false, "scrollZoom": true}, "frames": [], "data": [{"y": [1.4867195147342977e-6, 1.5629451759545644e-6, 1.6429143776209075e-6, 1.7268022256027468e-6, 1.8147915680031392e-6, 1.9070733153312072e-6, 2.0038467728640893e-6, 2.1053199856146367e-6, 2.211710096333239e-6, 2.3232437169845253e-6, 2.440157314152359e-6, 2.5626976088394264e-6, 2.6911219911409617e-6, 2.825698950285607e-6, 2.9667085205500707e-6, 3.1144427435683705e-6, 3.2692061475706308e-6, 3.43131624410104e-6, 3.6011040427793162e-6, 3.778914584685238e-6, 3.9651074949611e-6, 4.160057555242689e-6, 4.3641552965451725e-6, 4.577807613246672e-6, 4.801438398828609e-6, 
5.0354892040487905e-6, 5.280419918240033e-6, 5.536709474444599e-6, 5.804856579112089e-6, 6.085380467106363e-6, 6.378821682784907e-6, 6.685742887932588e-6, 7.00672969735016e-6, 7.342391542916562e-6, 7.693362566963276e-6, 8.060302545818064e-6, 8.44389784439499e-6, 8.844862402727114e-6, 9.263938755358373e-6, 9.701899084530991e-6, 1.0159546308125237e-5, 1.0637715203328491e-5, 1.113727356703151e-5, 1.1659123413970254e-5, 1.2204202213652822e-5, 1.2773484167131627e-5, 1.3367981524702597e-5, 1.398874594563419e-5, 1.4636869901050072e-5, 1.5313488121111463e-5, 1.6019779087665834e-5, 1.6756966573550372e-5, 1.7526321229760776e-5, 1.832916222171639e-5, 1.9166858915875072e-5, 2.0040832617972474e-5, 2.095255836418134e-5, 2.1903566766508593e-5, 2.289544591376843e-5, 2.3929843329491493e-5, 2.5008467988150235e-5, 2.6133092391102312e-5, 2.7305554703673704e-5, 2.852776095482414e-5, 2.980168730085674e-5, 3.1129382354654245e-5, 3.2512969581943714e-5, 3.3954649766109245e-5, 3.545670354309353e-5, 3.7021494007944656e-5, 3.8651469394583444e-5, 4.034916583038524e-5, 4.2117210167183925e-5, 4.395832289032471e-5, 4.587532110740675e-5, 4.787112161837096e-5, 4.99487440686039e-5, 5.211131418674075e-5, 5.43620671088645e-5, 5.670435079080807e-5, 5.9141629510279606e-5, 6.167748746053908e-5, 6.431563243736498e-5, 6.705989962105581e-5, 6.991425545522019e-5, 7.288280162411435e-5, 7.596977913028805e-5, 7.917957247430857e-5, 8.251671393832852e-5, 8.598588797526645e-5, 8.959193570537115e-5, 9.333985952193218e-5, 9.723482780790063e-5, 0.00010128217976517728, 0.00010548743035831562, 0.00010985627537438261, 0.00011439459660070622, 0.00011910846712222866, 0.00012400415674016654, 0.00012908813751366692, 0.000134367089426128, 0.000139847906177829, 0.00014553770110650033, 0.0001514438132374289, 0.00015757381346467387, 0.00016393551086493638, 0.00017053695914559335, 0.00017738646322837222, 0.00018449258597010427, 0.00019186415502195686, 0.0001995102698285009, 0.00020744030876792028, 0.0002156639364346309, 
0.00022419111106551485, 0.00023303209211092708, 0.00024219744795157412, 0.0002516980637622996, 0.00026154514952375196, 0.00027175024818283975, 0.0002823252439628062, 0.00029328237082369165, 0.00030463422107385705, 0.0003163937541331769, 0.00032857430544841655, 0.00034118959556122023, 0.00035425373932904586, 0.0003677812552992853, 0.0003817870752367099, 0.0003962865538042735, 0.0004112954783971987, 0.0004268300791301634, 0.00044290703897727934, 0.00045954350406444516, 0.0004767570941135188, 0.000494565913037639, 0.0005129885596868767, 0.0005320441387432778, 0.0005517522717641883, 0.0005721331083726564, 0.0005932073375934776, 0.0006149961993333812, 0.0006375214960036303, 0.000660805604283163, 0.0006848714870202652, 0.0007097427052705204, 0.0007354434304686859, 0.0007619984567318983, 0.0007894332132914548, 0.0008177737770502308, 0.0008470468852625585, 0.0008772799483332424, 0.0009085010627321479, 0.0009407390240205852, 0.0009740233399855433, 0.0010083842438775384, 0.0010438527077476724, 0.0010804604558792594, 0.0011182399783091301, 0.0011572245444335014, 0.0011974482166930624, 0.0012389458643316494, 0.0012817531772227068, 0.0013259066797573873, 0.001371443744787974, 0.0014184026076199882, 0.0014668223800460944, 0.0015167430644146987, 0.0015682055677257688, 0.0016212517157462606, 0.0016759242671371072, 0.0017322669275836012, 0.0017903243639205966, 0.0018501422182437373, 0.0019117671219976476, 0.001975246710031648, 0.002040629634613374, 0.0021079655793902826, 0.0021773052732888235, 0.002248700504340723, 0.002322204133425472, 0.0023978701079179566, 0.002475753475229666, 0.0025559103962318453, 0.002638398158548473, 0.002723275189706743, 0.0028106010701323537, 0.0029004365459768177, 0.0029928435417632764, 0.003087885172837573, 0.0031856257576105738, 0.003286130829577608, 0.003389467149100821, 0.003495702714939351, 0.003604906775512773, 0.0037171498398822275, 0.003832503688433777, 0.003951041383248348, 0.004072837278141716, 0.004197967028358746, 0.004326507599904808, 
0.0044585372784976774, 0.004594135678122931, 0.004733383749175122, 0.004876363786167487, 0.0050231594349922, 0.005173855699712969, 0.005328538948872136, 0.005487296921293292, 0.005650218731361132, 0.005817394873759533, 0.005988917227648743, 0.0061648790602628085, 0.006345375029907421, 0.006530501188339057, 0.00672035498250571, 0.006915035255629138, 0.007114642247609438, 0.007319277594731159, 0.007529044328651608, 0.007744046874650893, 0.007964391049123446, 0.008190184056291318, 0.008421534484118353, 0.008658552299405495, 0.008901348842046877, 0.009150036818426239, 0.009404730293934078, 0.009665544684584698, 0.009932596747713756, 0.01020600457173608, 0.010485887564943792, 0.010772366443325672, 0.01106556321738733, 0.011365601177953646, 0.011672604880933914, 0.011986700131030568, 0.012308013964373516, 0.012636674630060626, 0.012972811570586986, 0.0133165554011448, 0.01366803788777604, 0.014027391924361456, 0.014394751508428203, 0.014770251715760243, 0.015154028673795436, 0.01554621953379326, 0.01594696244175904, 0.016356396508108887, 0.01677466177606206, 0.01720189918874714, 0.017638250555008176, 0.018083858513899972, 0.018538866497858925, 0.01900341869453959, 0.019477660007305997, 0.0199617360143675, 0.020455792926551326, 0.02095997754370181, 0.021474437209700135, 0.021999319766097224, 0.022534773504353393, 0.023080947116680968, 0.023637989645483852, 0.024206050431392127, 0.024785279059888674, 0.025375825306525795, 0.02597783908073275, 0.026591470368212813, 0.027216869171932476, 0.027854185451705853, 0.028503569062375243, 0.029165169690596154, 0.0298391367902293, 0.030525619516347895, 0.03122476665786905, 0.03193672656881556, 0.0326616470982222, 0.03339967551869476, 0.034150958453636235, 0.034915641803154845, 0.03569387066866606, 0.0364857892762094, 0.03729154089849412, 0.03811126777569452, 0.03894511103501631, 0.03979321060905222, 0.04065570515295472, 0.0415327319604459, 0.04242442687869267, 0.04333092422207484, 0.04425235668487123, 0.04518885525289872, 
0.046140549114130854, 0.047107565568330984, 0.048090029935734885, 0.049088065464814395, 0.0501017932391642, 0.0511313320835455, 0.052176798469127886, 0.05323830641797235, 0.05431596740679254, 0.05540989027004515, 0.056520181102387726, 0.05764694316055475, 0.058790276764699216, 0.0599502791992459, 0.06112704461331196, 0.06232066392074143, 0.06353122469981012, 0.06475881109265512, 0.0660035037044813, 0.06726537950260728, 0.06854451171540328, 0.06984096973118359, 0.07115481899711455, 0.07248612091819524, 0.07383493275638064, 0.07520130752990388, 0.07658529391286716, 0.07798693613516693, 0.07940627388281588, 0.08084334219873673, 0.08229817138408957, 0.08377078690020641, 0.08526120927120383, 0.08676945398734022, 0.08829553140919695, 0.08983944667274896, 0.09140119959540235, 0.09298078458307368, 0.09457819053837964, 0.09619340077002111, 0.09782639290342827, 0.09947713879274857, 0.10114560443425238, 0.10283174988122956, 0.10453552916046047, 0.10625689019033109, 0.10799577470067427, 0.1097521181544137, 0.11152584967108323, 0.11331689195230712, 0.11512516120930903, 0.11695056709253245, 0.11879301262344885, 0.12065239412862433, 0.1225286011761295, 0.1244215165143589, 0.12633101601334082, 0.128256968608611, 0.13019923624771867, 0.13215767383944704, 0.13413212920581077, 0.1361224430369076, 0.13812844884869419, 0.14014997294374945, 0.14218683437510368, 0.14423884491319003, 0.146305809015991, 0.1483875238024433, 0.15048377902915921, 0.15259435707053545, 0.15471903290229952, 0.15685757408855966, 0.15900974077241184, 0.1611752856701598, 0.1633539540692017, 0.16554548382963405, 0.16774960538962422, 0.1699660417745983, 0.17219450861028934, 0.1744347141396911, 0.17668635924395604, 0.1789491374672795, 0.1812227350458057, 0.18350683094058914, 0.18580109687464474, 0.18810519737411466, 0.19041878981358082, 0.19274152446554624, 0.19507304455410698, 0.19741298631283444, 0.1997609790468825, 0.20211664519933653, 0.20447960042181126, 0.2068494536493083, 0.20922580717933636, 
0.21160825675529624, 0.21399639165413128, 0.21638979477823783, 0.21878804275162872, 0.22119070602034083, 0.22359734895707048, 0.22600752997002482, 0.22842080161596603, 0.2308367107174274, 0.23325479848407346, 0.23567460063817935, 0.23809564754419058, 0.24051746434233442, 0.24293957108624029, 0.2453614828845268, 0.2477827100463135, 0.2502027582306051, 0.25262112859949937, 0.25503731797516277, 0.2574508190005156, 0.25986112030356623, 0.2622677066653288, 0.26467005919125797, 0.2670676554861305, 0.2694599698322981, 0.2718464733712371, 0.27422663428831273, 0.27659991800067724, 0.27896578734821625, 0.28132370278745183, 0.2836731225883155, 0.28601350303369194, 0.2883442986216399, 0.2906649622701903, 0.2929749455246161, 0.29527369876707404, 0.2975606714285047, 0.29983531220268705, 0.30209706926233104, 0.3043453904770949, 0.3065797236334111, 0.3087995166559998, 0.3110042178309519, 0.3131932760302554, 0.3153661409376422, 0.31752226327562794, 0.31966109503361495, 0.3217820896969311, 0.32388470247667, 0.3259683905401998, 0.32803261324220845, 0.33007683235614477, 0.3321005123059243, 0.3341031203977566, 0.3360841270519577, 0.33804300603460824, 0.3399792346889136, 0.34189229416612926, 0.34378166965590573, 0.34564685061591266, 0.34748733100060025, 0.3493026094889525, 0.35109218971109374, 0.352855580473602, 0.35459229598338854, 0.35630185607000125, 0.3579837864062085, 0.35963761872672506, 0.36126289104493575, 0.3628591478674794, 0.3644259404065533, 0.3659628267897982, 0.3674693722676304, 0.3689451494178811, 0.37038973834761074, 0.3718027268919647, 0.3731837108099374, 0.37453229397691773, 0.3758480885738838, 0.3771307152731237, 0.37837980342035654, 0.37959499121313073, 0.38077592587538184, 0.38192226382802885, 0.38303367085549667, 0.38410982226804957, 0.38515040305982606, 0.3861551080624679, 0.3871236420942381, 0.3880557201045256, 0.3889510673136381, 0.3898094193477865, 0.39063052236916934, 0.3914141332010657, 0.3921600194478526, 0.39286795960986176, 0.39353774319299756, 
0.3941691708130405, 0.39476205429456246, 0.3953162167643864, 0.39583149273952484, 0.3963077282095361, 0.3967447807132408, 0.3971425194097453, 0.39750082514372254, 0.3978195905049041, 0.3980987198817431, 0.39833812950920905, 0.3985377475106819, 0.3986975139339164, 0.3988173807810503, 0.3988973120326366, 0.3989372836656826, 0.3989372836656826, 0.3988973120326366, 0.3988173807810503, 0.3986975139339164, 0.3985377475106819, 0.39833812950920905, 0.3980987198817431, 0.3978195905049041, 0.39750082514372254, 0.3971425194097453, 0.3967447807132408, 0.3963077282095361, 0.39583149273952484, 0.3953162167643864, 0.39476205429456246, 0.3941691708130405, 0.39353774319299756, 0.39286795960986176, 0.39216001944785267, 0.3914141332010657, 0.39063052236916934, 0.3898094193477865, 0.3889510673136381, 0.38805572010452566, 0.3871236420942381, 0.3861551080624679, 0.38515040305982606, 0.38410982226804957, 0.3830336708554967, 0.38192226382802885, 0.38077592587538184, 0.37959499121313073, 0.37837980342035654, 0.37713071527312375, 0.3758480885738837, 0.37453229397691773, 0.3731837108099374, 0.3718027268919647, 0.3703897383476108, 0.368945149417881, 0.3674693722676304, 0.3659628267897982, 0.3644259404065533, 0.3628591478674795, 0.36126289104493564, 0.35963761872672506, 0.3579837864062085, 0.35630185607000125, 0.3545922959833886, 0.3528555804736019, 0.35109218971109374, 0.3493026094889525, 0.34748733100060025, 0.3456468506159127, 0.34378166965590556, 0.34189229416612926, 0.3399792346889136, 0.33804300603460824, 0.33608412705195784, 0.3341031203977564, 0.3321005123059243, 0.33007683235614477, 0.32803261324220845, 0.32596839054019994, 0.32388470247666984, 0.3217820896969311, 0.31966109503361495, 0.31752226327562794, 0.3153661409376423, 0.3131932760302552, 0.3110042178309519, 0.3087995166559998, 0.3065797236334111, 0.304345390477095, 0.3020970692623309, 0.29983531220268705, 0.2975606714285047, 0.29527369876707404, 0.2929749455246162, 0.29066496227019006, 0.2883442986216399, 0.28601350303369194, 
0.2836731225883156, 0.28132370278745195, 0.27896578734821603, 0.27659991800067724, 0.27422663428831273, 0.2718464733712372, 0.2694599698322982, 0.26706765548613026, 0.26467005919125797, 0.2622677066653288, 0.25986112030356634, 0.2574508190005157, 0.25503731797516255, 0.25262112859949937, 0.2502027582306051, 0.24778271004631364, 0.24536148288452672, 0.24293957108624006, 0.24051746434233442, 0.23809564754419058, 0.2356746006381794, 0.2332547984840734, 0.2308367107174272, 0.22842080161596603, 0.22600752997002482, 0.2235973489570705, 0.22119070602034077, 0.21878804275162872, 0.21638979477823783, 0.21399639165413128, 0.21160825675529632, 0.2092258071793363, 0.2068494536493083, 0.20447960042181126, 0.20211664519933653, 0.1997609790468826, 0.19741298631283435, 0.19507304455410698, 0.19274152446554624, 0.19041878981358082, 0.1881051973741147, 0.18580109687464472, 0.18350683094058914, 0.1812227350458057, 0.1789491374672795, 0.17668635924395606, 0.17443471413969108, 0.17219450861028934, 0.1699660417745983, 0.16774960538962422, 0.1655454838296341, 0.1633539540692017, 0.1611752856701598, 0.15900974077241184, 0.15685757408855966, 0.15471903290229955, 0.1525943570705354, 0.15048377902915921, 0.1483875238024433, 0.14630580901599124, 0.14423884491319006, 0.1421868343751036, 0.14014997294374945, 0.13812844884869419, 0.13612244303690788, 0.1341321292058108, 0.13215767383944702, 0.13019923624771867, 0.128256968608611, 0.1263310160133411, 0.12442151651435894, 0.12252860117612943, 0.12065239412862433, 0.11879301262344885, 0.11695056709253268, 0.1151251612093089, 0.11331689195230711, 0.11152584967108323, 0.1097521181544137, 0.1079957747006745, 0.10625689019033098, 0.10453552916046042, 0.10283174988122956, 0.10114560443425238, 0.09947713879274879, 0.09782639290342818, 0.09619340077002107, 0.09457819053837964, 0.09298078458307368, 0.09140119959540258, 0.08983944667274886, 0.08829553140919695, 0.08676945398734022, 0.08526120927120383, 0.0837707869002066, 0.08229817138408949, 
0.08084334219873673, 0.07940627388281588, 0.07798693613516693, 0.07658529391286735, 0.07520130752990378, 0.07383493275638064, 0.07248612091819524, 0.07115481899711455, 0.06984096973118376, 0.06854451171540318, 0.06726537950260728, 0.0660035037044813, 0.06475881109265512, 0.06353122469981028, 0.062320663920741357, 0.06112704461331196, 0.0599502791992459, 0.058790276764699216, 0.05764694316055489, 0.05652018110238766, 0.05540989027004515, 0.05431596740679254, 0.05323830641797235, 0.05217679846912802, 0.05113133208354541, 0.0501017932391642, 0.049088065464814395, 0.048090029935734926, 0.047107565568331115, 0.04614054911413077, 0.04518885525289872, 0.04425235668487123, 0.04333092422207487, 0.04242442687869281, 0.04153273196044583, 0.04065570515295472, 0.03979321060905222, 0.03894511103501634, 0.038111267775694645, 0.037291540898494055, 0.0364857892762094, 0.03569387066866606, 0.034915641803154894, 0.03415095845363621, 0.033399675518694695, 0.0326616470982222, 0.03193672656881556, 0.031224766657869087, 0.03052561951634785, 0.029839136790229235, 0.029165169690596154, 0.028503569062375243, 0.027854185451705874, 0.02721686917193245, 0.02659147036821275, 0.02597783908073275, 0.025375825306525795, 0.024785279059888695, 0.024206050431392095, 0.023637989645483852, 0.023080947116680968, 0.022534773504353393, 0.021999319766097244, 0.021474437209700114, 0.02095997754370181, 0.020455792926551326, 0.0199617360143675, 0.01947766000730601, 0.019003418694539566, 0.018538866497858925, 0.018083858513899972, 0.017638250555008176, 0.017201899188747153, 0.016774661776062048, 0.016356396508108887, 0.01594696244175904, 0.01554621953379326, 0.01515402867379545, 0.014770251715760229, 0.014394751508428203, 0.014027391924361456, 0.01366803788777604, 0.013316555401144821, 0.012972811570586976, 0.012636674630060626, 0.012308013964373516, 0.011986700131030568, 0.011672604880933924, 0.011365601177953634, 0.01106556321738733, 0.010772366443325672, 0.010485887564943792, 0.010206004571736088, 
0.009932596747713743, 0.009665544684584698, 0.009404730293934078, 0.009150036818426239, 0.008901348842046887, 0.008658552299405481, 0.008421534484118353, 0.008190184056291318, 0.007964391049123446, 0.0077440468746508995, 0.007529044328651597, 0.007319277594731159, 0.007114642247609438, 0.006915035255629138, 0.006720354982505686, 0.006530501188339051, 0.006345375029907421, 0.0061648790602628085, 0.005988917227648743, 0.005817394873759507, 0.005650218731361122, 0.005487296921293292, 0.005328538948872136, 0.005173855699712969, 0.005023159434992182, 0.0048763637861674826, 0.004733383749175122, 0.004594135678122931, 0.0044585372784976774, 0.00432650759990479, 0.004197967028358743, 0.004072837278141716, 0.003951041383248348, 0.0038325036884337945, 0.0037171498398822144, 0.0036049067755127666, 0.003495702714939351, 0.003389467149100821, 0.0032861308295776223, 0.003185625757610563, 0.00308788517283757, 0.0029928435417632764, 0.0029004365459768177, 0.0028106010701323637, 0.0027232751897067337, 0.0026383981585484688, 0.0025559103962318453, 0.002475753475229666, 0.0023978701079179674, 0.002322204133425472, 0.002248700504340719, 0.0021773052732888235, 0.0021079655793902826, 0.002040629634613383, 0.001975246710031648, 0.0019117671219976442, 0.0018501422182437373, 0.0017903243639205966, 0.001732266927583609, 0.0016759242671371072, 0.0016212517157462576, 0.0015682055677257688, 0.0015167430644146987, 0.001466822380046101, 0.0014184026076199882, 0.001371443744787974, 0.0013259066797573873, 0.0012817531772227068, 0.0012389458643316548, 0.0011974482166930624, 0.0011572245444335014, 0.0011182399783091301, 0.0010804604558792594, 0.0010438527077476724, 0.0010083842438775384, 0.0009740233399855433, 0.0009407390240205852, 0.0009085010627321479, 0.0008772799483332424, 0.0008470468852625585, 0.0008177737770502308, 0.0007894332132914555, 0.0007619984567318983, 0.0007354434304686859, 0.0007097427052705204, 0.0006848714870202652, 0.0006608056042831641, 0.0006375214960036303, 
0.0006149961993333812, 0.0005932073375934776, 0.0005721331083726564, 0.0005517522717641892, 0.0005320441387432778, 0.0005129885596868767, 0.000494565913037639, 0.0004767570941135188, 0.00045954350406444603, 0.00044290703897728015, 0.0004268300791301634, 0.0004112954783971987, 0.0003962865538042735, 0.0003817870752367106, 0.00036778125529928596, 0.00035425373932904586, 0.00034118959556122023, 0.00032857430544841655, 0.00031639375413317745, 0.00030463422107385754, 0.00029328237082369165, 0.0002823252439628062, 0.00027175024818283975, 0.00026154514952375245, 0.00025169806376230003, 0.00024219744795157412, 0.00023303209211092708, 0.00022419111106551485, 0.0002156639364346313, 0.00020744030876792066, 0.0001995102698285009, 0.00019186415502195686, 0.00018449258597010427, 0.00017738646322837252, 0.00017053695914559367, 0.00016393551086493638, 0.00015757381346467387, 0.0001514438132374289, 0.00014553770110650057, 0.00013984790617782924, 0.000134367089426128, 0.00012908813751366692, 0.00012400415674016654, 0.00011910846712222866, 0.00011439459660070622, 0.00010985627537438261, 0.00010548743035831562, 0.00010128217976517728, 9.723482780790063e-5, 9.333985952193218e-5, 8.959193570537115e-5, 8.598588797526645e-5, 8.251671393832852e-5, 7.917957247430857e-5, 7.596977913028805e-5, 7.288280162411435e-5, 6.991425545522019e-5, 6.705989962105581e-5, 6.431563243736498e-5, 6.167748746053908e-5, 5.9141629510279606e-5, 5.670435079080807e-5, 5.43620671088645e-5, 5.211131418674075e-5, 4.99487440686039e-5, 4.787112161837096e-5, 4.587532110740675e-5, 4.395832289032471e-5, 4.2117210167183925e-5, 4.034916583038524e-5, 3.8651469394583444e-5, 3.7021494007944656e-5, 3.545670354309353e-5, 3.3954649766109245e-5, 3.2512969581943714e-5, 3.1129382354654245e-5, 2.980168730085674e-5, 2.852776095482414e-5, 2.7305554703673704e-5, 2.6133092391102312e-5, 2.5008467988150235e-5, 2.3929843329491493e-5, 2.289544591376843e-5, 2.1903566766508593e-5, 2.095255836418134e-5, 2.0040832617972474e-5, 
1.9166858915875072e-5, 1.832916222171639e-5, 1.7526321229760776e-5, 1.6756966573550372e-5, 1.6019779087665834e-5, 1.5313488121111463e-5, 1.4636869901050072e-5, 1.398874594563419e-5, 1.3367981524702597e-5, 1.2773484167131627e-5, 1.2204202213652822e-5, 1.1659123413970254e-5, 1.113727356703151e-5, 1.0637715203328491e-5, 1.0159546308125237e-5, 9.701899084530991e-6, 9.263938755358373e-6, 8.844862402727114e-6, 8.44389784439499e-6, 8.060302545818064e-6, 7.693362566963276e-6, 7.342391542916562e-6, 7.00672969735016e-6, 6.685742887932588e-6, 6.378821682784907e-6, 6.085380467106363e-6, 5.804856579112089e-6, 5.536709474444599e-6, 5.280419918240033e-6, 5.0354892040487905e-6, 4.801438398828609e-6, 4.577807613246672e-6, 4.3641552965451725e-6, 4.160057555242689e-6, 3.9651074949611e-6, 3.778914584685238e-6, 3.6011040427793162e-6, 3.43131624410104e-6, 3.2692061475706308e-6, 3.1144427435683705e-6, 2.9667085205500707e-6, 2.825698950285607e-6, 2.6911219911409617e-6, 2.5626976088394264e-6, 2.440157314152359e-6, 2.3232437169845253e-6, 2.211710096333239e-6, 2.1053199856146367e-6, 2.0038467728640893e-6, 1.9070733153312072e-6, 1.8147915680031392e-6, 1.7268022256027468e-6, 1.6429143776209075e-6, 1.5629451759545644e-6, 1.4867195147342977e-6], "type": "scatter", "x": [-5.0, -4.98998998998999, -4.97997997997998, -4.96996996996997, -4.95995995995996, -4.94994994994995, -4.93993993993994, -4.92992992992993, -4.91991991991992, -4.90990990990991, -4.8998998998999, -4.88988988988989, -4.87987987987988, -4.86986986986987, -4.85985985985986, -4.84984984984985, -4.83983983983984, -4.82982982982983, -4.81981981981982, -4.80980980980981, -4.7997997997998, -4.78978978978979, -4.77977977977978, -4.76976976976977, -4.75975975975976, -4.74974974974975, -4.73973973973974, -4.72972972972973, -4.71971971971972, -4.70970970970971, -4.6996996996997, -4.68968968968969, -4.67967967967968, -4.66966966966967, -4.65965965965966, -4.64964964964965, -4.63963963963964, -4.62962962962963, -4.61961961961962, 
-4.60960960960961, -4.5995995995996, -4.58958958958959, -4.57957957957958, -4.56956956956957, -4.55955955955956, -4.54954954954955, -4.53953953953954, -4.52952952952953, -4.51951951951952, -4.50950950950951, -4.4994994994995, -4.48948948948949, -4.47947947947948, -4.46946946946947, -4.45945945945946, -4.44944944944945, -4.43943943943944, -4.42942942942943, -4.41941941941942, -4.40940940940941, -4.3993993993994, -4.38938938938939, -4.37937937937938, -4.36936936936937, -4.35935935935936, -4.34934934934935, -4.33933933933934, -4.32932932932933, -4.31931931931932, -4.3093093093093096, -4.2992992992992995, -4.2892892892892895, -4.2792792792792795, -4.2692692692692695, -4.2592592592592595, -4.2492492492492495, -4.2392392392392395, -4.2292292292292295, -4.2192192192192195, -4.2092092092092095, -4.1991991991991995, -4.1891891891891895, -4.1791791791791795, -4.1691691691691695, -4.1591591591591595, -4.1491491491491495, -4.1391391391391394, -4.129129129129129, -4.119119119119119, -4.109109109109109, -4.099099099099099, -4.089089089089089, -4.079079079079079, -4.069069069069069, -4.059059059059059, -4.049049049049049, -4.039039039039039, -4.029029029029029, -4.019019019019019, -4.009009009009009, -3.9989989989989994, -3.9889889889889893, -3.9789789789789793, -3.9689689689689693, -3.9589589589589593, -3.9489489489489493, -3.9389389389389393, -3.9289289289289293, -3.9189189189189193, -3.9089089089089093, -3.8988988988988993, -3.8888888888888893, -3.8788788788788793, -3.8688688688688693, -3.8588588588588593, -3.8488488488488493, -3.8388388388388393, -3.8288288288288292, -3.8188188188188192, -3.8088088088088092, -3.7987987987987992, -3.788788788788789, -3.778778778778779, -3.768768768768769, -3.758758758758759, -3.748748748748749, -3.738738738738739, -3.728728728728729, -3.718718718718719, -3.708708708708709, -3.698698698698699, -3.688688688688689, -3.678678678678679, -3.668668668668669, -3.658658658658659, -3.648648648648649, -3.6386386386386387, -3.628628628628629, 
-3.618618618618619, -3.608608608608609, -3.598598598598599, -3.5885885885885886, -3.578578578578579, -3.568568568568569, -3.558558558558559, -3.548548548548549, -3.5385385385385386, -3.528528528528529, -3.518518518518519, -3.508508508508509, -3.498498498498499, -3.4884884884884886, -3.4784784784784786, -3.468468468468469, -3.458458458458459, -3.448448448448449, -3.4384384384384385, -3.4284284284284285, -3.418418418418419, -3.408408408408409, -3.398398398398399, -3.3883883883883885, -3.3783783783783785, -3.368368368368369, -3.358358358358359, -3.348348348348349, -3.3383383383383385, -3.3283283283283285, -3.3183183183183185, -3.308308308308309, -3.298298298298299, -3.2882882882882885, -3.2782782782782784, -3.2682682682682684, -3.258258258258259, -3.248248248248249, -3.2382382382382384, -3.2282282282282284, -3.2182182182182184, -3.208208208208209, -3.198198198198199, -3.1881881881881884, -3.1781781781781784, -3.1681681681681684, -3.1581581581581575, -3.148148148148149, -3.1381381381381384, -3.1281281281281283, -3.1181181181181183, -3.1081081081081074, -3.0980980980980988, -3.0880880880880883, -3.0780780780780783, -3.0680680680680683, -3.0580580580580574, -3.0480480480480487, -3.0380380380380383, -3.0280280280280283, -3.0180180180180183, -3.0080080080080074, -2.997997997997998, -2.987987987987988, -2.977977977977978, -2.9679679679679682, -2.9579579579579574, -2.947947947947948, -2.937937937937938, -2.9279279279279278, -2.917917917917918, -2.9079079079079073, -2.8978978978978978, -2.8878878878878878, -2.8778778778778777, -2.867867867867868, -2.8578578578578573, -2.8478478478478477, -2.8378378378378377, -2.8278278278278277, -2.817817817817818, -2.8078078078078073, -2.7977977977977977, -2.7877877877877877, -2.7777777777777777, -2.767767767767768, -2.7577577577577572, -2.7477477477477477, -2.7377377377377377, -2.7277277277277276, -2.717717717717718, -2.707707707707707, -2.6976976976976976, -2.6876876876876876, -2.6776776776776776, -2.667667667667668, -2.657657657657657, 
-2.6476476476476476, -2.6376376376376376, -2.6276276276276276, -2.617617617617618, -2.607607607607607, -2.5975975975975976, -2.5875875875875876, -2.5775775775775776, -2.567567567567568, -2.557557557557557, -2.5475475475475475, -2.5375375375375375, -2.5275275275275275, -2.517517517517518, -2.507507507507507, -2.4974974974974975, -2.4874874874874875, -2.4774774774774775, -2.467467467467468, -2.457457457457457, -2.4474474474474475, -2.4374374374374375, -2.4274274274274275, -2.417417417417418, -2.407407407407407, -2.3973973973973974, -2.3873873873873874, -2.3773773773773774, -2.367367367367368, -2.357357357357357, -2.3473473473473474, -2.3373373373373374, -2.3273273273273265, -2.317317317317318, -2.307307307307307, -2.2972972972972974, -2.2872872872872874, -2.2772772772772765, -2.267267267267268, -2.257257257257257, -2.2472472472472473, -2.2372372372372373, -2.2272272272272264, -2.2172172172172178, -2.207207207207207, -2.1971971971971973, -2.1871871871871873, -2.1771771771771764, -2.1671671671671677, -2.157157157157157, -2.1471471471471473, -2.1371371371371373, -2.1271271271271264, -2.1171171171171177, -2.107107107107107, -2.0970970970970972, -2.0870870870870872, -2.0770770770770763, -2.0670670670670677, -2.0570570570570568, -2.047047047047047, -2.037037037037037, -2.0270270270270263, -2.0170170170170176, -2.0070070070070063, -1.9969969969969972, -1.9869869869869872, -1.9769769769769765, -1.9669669669669676, -1.9569569569569565, -1.9469469469469471, -1.9369369369369371, -1.9269269269269265, -1.9169169169169176, -1.9069069069069065, -1.8968968968968971, -1.886886886886887, -1.8768768768768764, -1.8668668668668675, -1.8568568568568564, -1.846846846846847, -1.836836836836837, -1.8268268268268264, -1.8168168168168175, -1.8068068068068064, -1.796796796796797, -1.786786786786787, -1.7767767767767764, -1.7667667667667675, -1.7567567567567564, -1.746746746746747, -1.736736736736737, -1.7267267267267263, -1.7167167167167174, -1.7067067067067063, -1.696696696696697, 
-1.6866866866866868, -1.6766766766766763, -1.6666666666666674, -1.6566566566566563, -1.646646646646647, -1.6366366366366367, -1.6266266266266263, -1.6166166166166174, -1.6066066066066063, -1.596596596596597, -1.5865865865865867, -1.5765765765765762, -1.5665665665665673, -1.5565565565565562, -1.5465465465465469, -1.5365365365365367, -1.5265265265265262, -1.5165165165165173, -1.5065065065065062, -1.4964964964964969, -1.4864864864864866, -1.4764764764764762, -1.4664664664664673, -1.4564564564564562, -1.4464464464464468, -1.4364364364364366, -1.4264264264264261, -1.4164164164164172, -1.4064064064064061, -1.3963963963963968, -1.3863863863863866, -1.376376376376376, -1.3663663663663659, -1.356356356356356, -1.3463463463463468, -1.3363363363363365, -1.326326326326326, -1.3163163163163158, -1.306306306306306, -1.2962962962962967, -1.2862862862862865, -1.276276276276276, -1.2662662662662658, -1.256256256256256, -1.2462462462462467, -1.2362362362362365, -1.226226226226226, -1.2162162162162158, -1.206206206206206, -1.1961961961961967, -1.1861861861861864, -1.176176176176176, -1.1661661661661658, -1.156156156156156, -1.1461461461461466, -1.1361361361361364, -1.126126126126126, -1.1161161161161157, -1.106106106106106, -1.0960960960960966, -1.0860860860860864, -1.076076076076076, -1.0660660660660657, -1.056056056056056, -1.0460460460460457, -1.0360360360360363, -1.0260260260260259, -1.0160160160160157, -1.0060060060060059, -0.9959959959959956, -0.9859859859859861, -0.9759759759759761, -0.9659659659659656, -0.9559559559559556, -0.9459459459459456, -0.935935935935936, -0.925925925925926, -0.9159159159159156, -0.9059059059059056, -0.8958958958958956, -0.885885885885886, -0.875875875875876, -0.8658658658658656, -0.8558558558558556, -0.8458458458458455, -0.835835835835836, -0.825825825825826, -0.8158158158158155, -0.8058058058058055, -0.7957957957957955, -0.785785785785786, -0.7757757757757755, -0.7657657657657655, -0.7557557557557555, -0.7457457457457455, -0.7357357357357359, 
-0.7257257257257255, -0.7157157157157155, -0.7057057057057055, -0.6956956956956954, -0.6856856856856859, -0.6756756756756754, -0.6656656656656654, -0.6556556556556554, -0.6456456456456454, -0.6356356356356359, -0.6256256256256254, -0.6156156156156154, -0.6056056056056054, -0.5955955955955954, -0.5855855855855858, -0.5755755755755754, -0.5655655655655654, -0.5555555555555554, -0.5455455455455454, -0.5355355355355358, -0.5255255255255253, -0.5155155155155153, -0.5055055055055053, -0.4954954954954953, -0.48548548548548576, -0.4754754754754753, -0.4654654654654653, -0.4554554554554553, -0.4454454454454453, -0.4354354354354357, -0.4254254254254253, -0.41541541541541527, -0.40540540540540526, -0.39539539539539525, -0.3853853853853857, -0.37537537537537524, -0.36536536536536524, -0.35535535535535523, -0.3453453453453452, -0.33533533533533566, -0.3253253253253252, -0.3153153153153152, -0.3053053053053052, -0.2952952952952952, -0.28528528528528563, -0.2752752752752752, -0.26526526526526517, -0.25525525525525516, -0.24524524524524516, -0.2352352352352356, -0.22522522522522515, -0.21521521521521514, -0.20520520520520513, -0.19519519519519513, -0.18518518518518556, -0.1751751751751751, -0.1651651651651651, -0.1551551551551551, -0.1451451451451451, -0.1351351351351351, -0.12512512512512508, -0.11511511511511507, -0.10510510510510507, -0.09509509509509506, -0.08508508508508505, -0.07507507507507505, -0.06506506506506504, -0.055055055055055035, -0.04504504504504503, -0.03503503503503502, -0.025025025025025016, -0.01501501501501501, -0.005005005005005003, 0.005005005005005003, 0.01501501501501501, 0.025025025025025016, 0.03503503503503502, 0.04504504504504503, 0.055055055055055035, 0.06506506506506504, 0.07507507507507505, 0.08508508508508505, 0.09509509509509506, 0.10510510510510507, 0.11511511511511507, 0.12512512512512508, 0.1351351351351351, 0.1451451451451451, 0.1551551551551551, 0.1651651651651651, 0.1751751751751751, 0.18518518518518512, 0.19519519519519513, 
0.20520520520520513, 0.21521521521521514, 0.22522522522522515, 0.23523523523523515, 0.24524524524524516, 0.25525525525525516, 0.26526526526526517, 0.2752752752752752, 0.2852852852852852, 0.2952952952952952, 0.3053053053053052, 0.3153153153153152, 0.3253253253253252, 0.3353353353353352, 0.3453453453453461, 0.35535535535535523, 0.36536536536536524, 0.37537537537537524, 0.38538538538538525, 0.39539539539539614, 0.40540540540540526, 0.41541541541541527, 0.4254254254254253, 0.4354354354354353, 0.4454454454454462, 0.4554554554554553, 0.4654654654654653, 0.4754754754754753, 0.4854854854854853, 0.4954954954954962, 0.5055055055055053, 0.5155155155155153, 0.5255255255255253, 0.5355355355355353, 0.5455455455455462, 0.5555555555555554, 0.5655655655655654, 0.5755755755755754, 0.5855855855855854, 0.5955955955955963, 0.6056056056056054, 0.6156156156156154, 0.6256256256256254, 0.6356356356356354, 0.6456456456456463, 0.6556556556556554, 0.6656656656656654, 0.6756756756756754, 0.6856856856856854, 0.6956956956956963, 0.7057057057057055, 0.7157157157157155, 0.7257257257257255, 0.7357357357357355, 0.7457457457457464, 0.7557557557557555, 0.7657657657657655, 0.7757757757757755, 0.7857857857857855, 0.7957957957957964, 0.8058058058058055, 0.8158158158158155, 0.8258258258258255, 0.8358358358358355, 0.8458458458458464, 0.8558558558558556, 0.8658658658658656, 0.8758758758758756, 0.8858858858858856, 0.8958958958958965, 0.9059059059059056, 0.9159159159159156, 0.9259259259259256, 0.9359359359359356, 0.9459459459459465, 0.9559559559559556, 0.9659659659659656, 0.9759759759759756, 0.9859859859859865, 0.9959959959959965, 1.0060060060060059, 1.0160160160160157, 1.0260260260260257, 1.0360360360360366, 1.0460460460460466, 1.056056056056056, 1.0660660660660657, 1.0760760760760757, 1.0860860860860866, 1.0960960960960966, 1.106106106106106, 1.1161161161161157, 1.1261261261261257, 1.1361361361361366, 1.1461461461461466, 1.156156156156156, 1.1661661661661658, 1.1761761761761758, 1.1861861861861867, 
1.1961961961961967, 1.206206206206206, 1.2162162162162158, 1.2262262262262258, 1.2362362362362367, 1.2462462462462467, 1.256256256256256, 1.2662662662662658, 1.2762762762762758, 1.2862862862862867, 1.2962962962962967, 1.306306306306306, 1.3163163163163158, 1.3263263263263259, 1.3363363363363367, 1.3463463463463468, 1.356356356356356, 1.3663663663663659, 1.3763763763763759, 1.3863863863863868, 1.3963963963963968, 1.4064064064064061, 1.416416416416416, 1.426426426426426, 1.4364364364364368, 1.4464464464464468, 1.4564564564564562, 1.466466466466466, 1.476476476476476, 1.4864864864864868, 1.4964964964964969, 1.5065065065065062, 1.516516516516516, 1.526526526526526, 1.5365365365365369, 1.5465465465465469, 1.5565565565565562, 1.566566566566566, 1.576576576576577, 1.586586586586587, 1.596596596596597, 1.6066066066066063, 1.616616616616616, 1.626626626626627, 1.636636636636637, 1.646646646646647, 1.6566566566566563, 1.666666666666666, 1.676676676676677, 1.686686686686687, 1.696696696696697, 1.7067067067067063, 1.716716716716716, 1.726726726726727, 1.736736736736737, 1.746746746746747, 1.7567567567567564, 1.7667667667667661, 1.776776776776777, 1.786786786786787, 1.796796796796797, 1.8068068068068064, 1.8168168168168162, 1.826826826826827, 1.836836836836837, 1.846846846846847, 1.8568568568568564, 1.8668668668668662, 1.876876876876877, 1.886886886886887, 1.8968968968968971, 1.9069069069069065, 1.9169169169169162, 1.9269269269269271, 1.9369369369369371, 1.9469469469469471, 1.9569569569569565, 1.9669669669669663, 1.9769769769769772, 1.9869869869869872, 1.9969969969969972, 2.0070070070070063, 2.0170170170170163, 2.027027027027027, 2.037037037037037, 2.047047047047047, 2.0570570570570563, 2.0670670670670663, 2.0770770770770772, 2.0870870870870872, 2.0970970970970972, 2.1071071071071064, 2.1171171171171164, 2.1271271271271273, 2.1371371371371373, 2.1471471471471473, 2.1571571571571564, 2.1671671671671664, 2.1771771771771773, 2.1871871871871873, 2.1971971971971973, 
2.2072072072072064, 2.217217217217218, 2.2272272272272273, 2.2372372372372373, 2.2472472472472473, 2.2572572572572565, 2.2672672672672682, 2.2772772772772774, 2.2872872872872874, 2.2972972972972974, 2.3073073073073065, 2.3173173173173183, 2.3273273273273274, 2.3373373373373374, 2.3473473473473474, 2.3573573573573565, 2.3673673673673683, 2.3773773773773774, 2.3873873873873874, 2.3973973973973974, 2.4074074074074066, 2.4174174174174183, 2.4274274274274275, 2.4374374374374375, 2.4474474474474475, 2.4574574574574566, 2.4674674674674684, 2.4774774774774775, 2.4874874874874875, 2.4974974974974975, 2.5075075075075066, 2.5175175175175184, 2.5275275275275275, 2.5375375375375375, 2.5475475475475475, 2.5575575575575566, 2.5675675675675684, 2.5775775775775776, 2.5875875875875876, 2.5975975975975976, 2.6076076076076067, 2.6176176176176185, 2.6276276276276276, 2.6376376376376376, 2.6476476476476476, 2.6576576576576567, 2.6676676676676685, 2.6776776776776776, 2.6876876876876876, 2.6976976976976976, 2.7077077077077067, 2.7177177177177185, 2.7277277277277276, 2.7377377377377377, 2.7477477477477477, 2.757757757757757, 2.7677677677677686, 2.7777777777777777, 2.7877877877877877, 2.7977977977977977, 2.807807807807807, 2.8178178178178186, 2.8278278278278277, 2.8378378378378377, 2.8478478478478477, 2.8578578578578586, 2.8678678678678686, 2.8778778778778777, 2.8878878878878878, 2.8978978978978978, 2.9079079079079087, 2.9179179179179187, 2.9279279279279278, 2.937937937937938, 2.947947947947948, 2.9579579579579587, 2.9679679679679687, 2.977977977977978, 2.987987987987988, 2.997997997997998, 3.0080080080080087, 3.0180180180180187, 3.0280280280280283, 3.0380380380380383, 3.0480480480480474, 3.0580580580580587, 3.0680680680680688, 3.0780780780780783, 3.0880880880880883, 3.0980980980980974, 3.108108108108109, 3.118118118118119, 3.1281281281281283, 3.1381381381381384, 3.1481481481481475, 3.158158158158159, 3.168168168168169, 3.1781781781781784, 3.1881881881881884, 3.1981981981981975, 
3.208208208208209, 3.218218218218219, 3.2282282282282284, 3.2382382382382384, 3.2482482482482475, 3.258258258258259, 3.268268268268269, 3.2782782782782784, 3.2882882882882885, 3.2982982982982976, 3.308308308308309, 3.318318318318319, 3.3283283283283285, 3.3383383383383385, 3.3483483483483476, 3.358358358358359, 3.368368368368369, 3.3783783783783785, 3.3883883883883885, 3.3983983983983976, 3.408408408408409, 3.418418418418419, 3.4284284284284285, 3.4384384384384385, 3.448448448448449, 3.458458458458459, 3.468468468468469, 3.4784784784784786, 3.4884884884884886, 3.498498498498499, 3.508508508508509, 3.518518518518519, 3.5285285285285286, 3.5385385385385386, 3.548548548548549, 3.558558558558559, 3.568568568568569, 3.5785785785785786, 3.5885885885885886, 3.598598598598599, 3.608608608608609, 3.618618618618619, 3.6286286286286287, 3.6386386386386387, 3.648648648648649, 3.658658658658659, 3.668668668668669, 3.6786786786786787, 3.6886886886886887, 3.698698698698699, 3.708708708708709, 3.718718718718719, 3.7287287287287287, 3.7387387387387387, 3.748748748748749, 3.758758758758759, 3.768768768768769, 3.7787787787787788, 3.7887887887887888, 3.7987987987987992, 3.8088088088088092, 3.8188188188188192, 3.828828828828829, 3.838838838838839, 3.8488488488488493, 3.8588588588588593, 3.8688688688688693, 3.878878878878879, 3.888888888888889, 3.8988988988988993, 3.9089089089089093, 3.9189189189189193, 3.928928928928929, 3.938938938938939, 3.9489489489489493, 3.9589589589589593, 3.9689689689689693, 3.978978978978979, 3.988988988988989, 3.9989989989989994, 4.009009009009009, 4.019019019019019, 4.029029029029029, 4.039039039039039, 4.049049049049049, 4.059059059059059, 4.069069069069069, 4.079079079079079, 4.089089089089089, 4.099099099099099, 4.109109109109109, 4.119119119119119, 4.129129129129129, 4.1391391391391394, 4.1491491491491495, 4.1591591591591595, 4.1691691691691695, 4.1791791791791795, 4.1891891891891895, 4.1991991991991995, 4.2092092092092095, 4.2192192192192195, 
4.2292292292292295, 4.2392392392392395, 4.2492492492492495, 4.2592592592592595, 4.2692692692692695, 4.2792792792792795, 4.2892892892892895, 4.2992992992992995, 4.3093093093093096, 4.31931931931932, 4.32932932932933, 4.33933933933934, 4.34934934934935, 4.35935935935936, 4.36936936936937, 4.37937937937938, 4.38938938938939, 4.3993993993994, 4.40940940940941, 4.41941941941942, 4.42942942942943, 4.43943943943944, 4.44944944944945, 4.45945945945946, 4.46946946946947, 4.47947947947948, 4.48948948948949, 4.4994994994995, 4.50950950950951, 4.51951951951952, 4.52952952952953, 4.53953953953954, 4.54954954954955, 4.55955955955956, 4.56956956956957, 4.57957957957958, 4.58958958958959, 4.5995995995996, 4.60960960960961, 4.61961961961962, 4.62962962962963, 4.63963963963964, 4.64964964964965, 4.65965965965966, 4.66966966966967, 4.67967967967968, 4.68968968968969, 4.6996996996997, 4.70970970970971, 4.71971971971972, 4.72972972972973, 4.73973973973974, 4.74974974974975, 4.75975975975976, 4.76976976976977, 4.77977977977978, 4.78978978978979, 4.7997997997998, 4.80980980980981, 4.81981981981982, 4.82982982982983, 4.83983983983984, 4.84984984984985, 4.85985985985986, 4.86986986986987, 4.87987987987988, 4.88988988988989, 4.8998998998999, 4.90990990990991, 4.91991991991992, 4.92992992992993, 4.93993993993994, 4.94994994994995, 4.95995995995996, 4.96996996996997, 4.97997997997998, 4.98998998998999, 5.0]}]} // Get the plotly listeners const plotly_listeners = {} // Get the JS listeners const js_listeners = {} // Deal with eventual custom classes let custom_classlist = [] // Load the plotly library if (!window.Plotly) { const {plotly} = await import('https://cdn.plot.ly/plotly-2.16.1.min.js') } // Check if we have to force local mathjax font cache if (false && window?.MathJax?.config?.svg?.fontCache === 'global') { window.MathJax.config.svg.fontCache = 'local' } // Flag to check if this cell was manually ran or reactively ran const firstRun = this ? false : true const PLOT = this ?? 
document.createElement("div"); const parent = currentScript.parentElement const isPlutoWrapper = parent.classList.contains('raw-html-wrapper') if (firstRun) { // It seems plot divs would not autosize themselves inside flexbox containers without this parent.appendChild(PLOT) } // If width is not specified, set it to 100% PLOT.style.width = plot_obj.layout.width ? "" : "100%" // For the height we have to also put a fixed value in case the plot is put on a non-fixed-size container (like the default wrapper) PLOT.style.height = plot_obj.layout.height ? "" : (isPlutoWrapper || parent.clientHeight == 0) ? "400px" : "100%" PLOT.classList.forEach(cn => { if (cn !== 'js-plotly-plot' && !custom_classlist.includes(cn)) { PLOT.classList.toggle(cn, false) } }) for (const className of custom_classlist) { PLOT.classList.toggle(className, true) } // Create the resizeObserver to make the plot even more responsive! :magic: const resizeObserver = new ResizeObserver(entries => { PLOT.style.height = plot_obj.layout.height ? "" : (isPlutoWrapper || parent.clientHeight == 0) ? "400px" : "100%" /* The addition of the invalid argument `plutoresize` seems to fix the problem with calling `relayout` simply with `{autosize: true}` as update breaking mouse relayout events tracking. 
See https://github.com/plotly/plotly.js/issues/6156 for details */ Plotly.relayout(PLOT, {..._.pick(PLOT.layout, ['width','height']), autosize: true, plutoresize: true}) }) resizeObserver.observe(PLOT) Plotly.react(PLOT, plot_obj).then(() => { // Assign the Plotly event listeners for (const [key, listener_vec] of Object.entries(plotly_listeners)) { for (const listener of listener_vec) { PLOT.on(key, listener) } } // Assign the JS event listeners for (const [key, listener_vec] of Object.entries(js_listeners)) { for (const listener of listener_vec) { PLOT.addEventListener(key, listener) } } } ) invalidation.then(() => { // Remove all plotly listeners PLOT.removeAllListeners() // Remove all JS listeners for (const [key, listener_vec] of Object.entries(js_listeners)) { for (const listener of listener_vec) { PLOT.removeEventListener(key, listener) } } // Remove the resizeObserver resizeObserver.disconnect() }) return PLOT mimetext/htmlrootassigneelast_run_timestampA ![dpersist_js_state·has_pluto_hook_features§cell_id$09dd1440-5d09-421f-addc-b1ede43ff517depends_on_disabled_cells§runtimeсpublished_object_keysdepends_on_skipped_cellsçerrored$a0ca7a5e-0089-4a45-9278-c0f27cd096a0queued¤logsrunning¦outputbody
mimetext/htmlrootassigneelast_run_timestampA @Ͱpersist_js_state·has_pluto_hook_features§cell_id$a0ca7a5e-0089-4a45-9278-c0f27cd096a0depends_on_disabled_cells§runtimeJxpublished_object_keysdepends_on_skipped_cellsçerrored$64b38d1f-ecf9-4843-89a1-4c8953048265queued¤logsrunning¦outputbodyelementsprefix,Main.var"workspace#8".CartPoleState{Float32}elementsprefixCartPoleState{Float32}elementsx0.0text/plainθ0.05text/plainẋ0.0text/plainθ̇0.0text/plaint0.0text/plaintypestructprefix_shortCartPoleStateobjectid4addf63ed7bae523!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx-0.0228733text/plainθ0.0616247text/plainẋ-1.14368text/plainθ̇0.581558text/plaint0.04text/plaintypestructprefix_shortCartPoleStateobjectid1a6d5f3820b638b8!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx-0.0686444text/plainθ0.0851709text/plainẋ-1.14494text/plainθ̇0.596552text/plaint0.08text/plaintypestructprefix_shortCartPoleStateobjectidab75208723b5af11!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx-0.13732text/plainθ0.120792text/plainẋ-2.28879text/plainθ̇1.18529text/plaint0.12text/plaintypestructprefix_shortCartPoleStateobjectidcbd7b81704aa595!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx-0.206073text/plainθ0.157435text/plainẋ-1.14915text/plainθ̇0.648677text/plaint0.16text/plaintypestructprefix_shortCartPoleStateobjectide28e9577a3e9d405!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx-0.229269text/plainθ0.172792text/plainẋ-0.0107417text/plainθ̇0.119943text/plaint0.2text/plaintypestructprefix_shortCartPoleStateobjectida8499493c4a25ff7!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx-0.206936text/plainθ0.167051text/plainẋ1.12742text/plainθ̇-0.407285text/plaint0.24text/plaintypestructprefix_shortCartPoleStateobjectid2075d559f4fb1949!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx-0.184727text/plainθ0.162689text/plainẋ-0.0169652text/plainθ̇0.189115text/p
laint0.28text/plaintypestructprefix_shortCartPoleStateobjectid172373efb3067c18!application/vnd.pluto.tree+object prefixCartPoleState{Float32}elementsx-0.162637text/plainθ0.159659text/plainẋ1.12149text/plainθ̇-0.340778text/plaint0.32text/plaintypestructprefix_shortCartPoleStateobjectide7d63656f3863dbd!application/vnd.pluto.tree+objectmoréprefixCartPoleState{Float32}elementsx-49.8897text/plainθ0.197691text/plainẋ-13.1155text/plainθ̇0.275797text/plaint5.12text/plaintypestructprefix_shortCartPoleStateobjectid843cef00b85f04b3!application/vnd.pluto.tree+objecttypeArrayprefix_shortobjectid7e1165de0d244ff3!application/vnd.pluto.tree+objectprefixInt64elements1text/plain2text/plain1text/plain3text/plain3text/plain3text/plain1text/plain3text/plain 3text/plainmoré3text/plaintypeArrayprefix_shortobjectida5d9f6c7dcb5726a!application/vnd.pluto.tree+objectprefixFloat32elements1.0text/plain1.0text/plain1.0text/plain1.0text/plain1.0text/plain1.0text/plain1.0text/plain1.0text/plain 1.0text/plainmoré1.0text/plaintypeArrayprefix_shortobjectid132d26ffb75bd388!application/vnd.pluto.tree+objectprefixCartPoleState{Float32}elementsx-50.3916text/plainθ0.198355text/plainẋ-11.9784text/plainθ̇-0.24256text/plaint5.16text/plaintypestructprefix_shortCartPoleStateobjectid72ca98acf18b1cfb!application/vnd.pluto.tree+object129text/plaintypeTupleobjectid2d80665ecf07fdf6mime!application/vnd.pluto.tree+objectrootassignee,const cartpole_fcann_continuing_test_episodelast_run_timestampA 3 persist_js_state·has_pluto_hook_features§cell_id$64b38d1f-ecf9-4843-89a1-4c8953048265depends_on_disabled_cells§runtimeǁpublished_object_keysdepends_on_skipped_cellsçerrored$d963ff6d-f1b6-4799-aa0e-1ae100310d84queued¤logslinemsgB]Checking for cuda toolkit versions No cuda toolkit appears to be installed. If this sytem has an NVIDIA GPU, install the cuda toolkit and add nvcc to the system path to use the GPU backend. 
Available backends are: CPU Backend is set to CPU Num Grads Func Grads 0.008702 0.008294 -0.211358 -0.208870 -0.271201 -0.272823 -0.794768 -0.795432 1.321912 1.321019 0.010610 0.009960 0.003338 0.004296 -0.009298 -0.010741 0.107527 0.106389 -0.009775 -0.008679 0.024915 0.026168 0.010610 0.010030 0.008583 0.005909 0.058174 0.055782 -0.127435 -0.126989 0.002742 0.001004 -0.070930 -0.071241 -0.005364 -0.006470 0.013947 0.012872 0.053763 0.054418 0.311136 0.311052 0.015497 0.014644 -0.219107 -0.219193 -0.973701 -0.974077 0.538349 0.538771 0.105143 0.105787 0.056386 0.056097 -0.072718 -0.072026 -0.361681 -0.361433 0.279307 0.280814 0.145793 0.143951 0.048995 0.049274 -0.100732 -0.102160 -0.521660 -0.520240 0.269771 0.269774 0.070095 0.070005 0.011086 0.010243 -0.044942 -0.044660 -0.252724 -0.253604 0.093222 0.095847 -0.124216 -0.123114 0.012040 0.010086 0.074387 0.072455 0.361204 0.359321 -0.185847 -0.187581 0.121474 0.122355 0.006318 0.005624 -0.085115 -0.086171 -0.395298 -0.393476 0.196457 0.197631 0.226021 0.226000 0.903606 0.904000 -0.032306 -0.032251 -0.084519 -0.082810 0.115752 0.115757 0.007033 0.004606 -0.187397 -0.188008 -0.291586 -0.292546 -0.107527 -0.107447 -0.564456 -0.563941 -0.046372 -0.045384 0.419974 0.420656 Relative differences for method are 0.0018749096. 
Should be small (1e-9) Num Grads Func Grads 0.226974 0.228291 -0.147700 -0.150010 -0.397086 -0.396523 -1.448035 -1.449270 1.557469 1.558208 0.043392 0.041430 0.014067 0.013863 0.005603 0.007342 0.279069 0.281456 -0.018597 -0.020660 0.065207 0.068021 0.017762 0.019240 -0.002742 -0.001772 0.151992 0.152121 -0.370026 -0.369779 0.032902 0.033341 -0.059605 -0.061379 -0.015259 -0.016710 0.090718 0.089624 0.031114 0.032405 0.222206 0.224066 0.015378 0.014975 -0.239253 -0.239561 -0.717878 -0.717608 0.439525 0.440638 0.070691 0.070417 0.032902 0.032117 -0.102282 -0.104047 -0.301361 -0.303399 0.291467 0.290279 0.103831 0.102325 0.028253 0.028958 -0.135541 -0.133932 -0.377774 -0.378357 0.208259 0.210493 0.059247 0.058751 0.004411 0.005936 -0.066400 -0.066114 -0.147343 -0.145967 0.050664 0.048794 -0.101089 -0.103440 -0.002027 -0.000536 0.099182 0.098851 0.195622 0.195177 -0.158787 -0.159284 0.091553 0.091211 0.005245 0.005407 -0.094295 -0.097222 -0.277638 -0.277581 0.146151 0.146036 0.285268 0.286000 0.882745 0.884000 0.249863 0.249797 0.197411 0.198590 0.402927 0.403551 0.336289 0.335512 -0.161886 -0.161895 -0.155091 -0.155041 -0.186682 -0.186942 -0.842452 -0.842551 0.088453 0.088449 0.687838 0.688715 Relative differences for method are 0.0016782036. 
Should be small (1e-9) Num Grads Func Grads 0.028610 0.032211 -0.963449 -0.962050 -1.136065 -1.137463 -3.270149 -3.271130 5.587101 5.584826 0.077486 0.076260 -0.095367 -0.098550 -0.215054 -0.215321 0.079632 0.081413 0.739098 0.737280 0.122309 0.123499 0.018835 0.013728 -0.069857 -0.072780 0.049591 0.052830 -0.226498 -0.222941 0.063896 0.062422 -0.396490 -0.397996 -0.261545 -0.266012 -0.387192 -0.387975 1.085997 1.086820 1.346350 1.346262 -0.049829 -0.046599 -0.909328 -0.913211 -4.127741 -4.128638 2.273083 2.274081 0.267029 0.266290 0.149250 0.147556 -0.131607 -0.133167 -0.934601 -0.935876 1.001120 0.998651 0.482559 0.481657 0.104427 0.105235 -0.309229 -0.307927 -1.800060 -1.797327 0.991821 0.989244 0.315428 0.320147 0.010014 0.010303 -0.196934 -0.198439 -1.142979 -1.142353 0.427246 0.429229 -0.661611 -0.662484 0.063658 0.064201 0.410557 0.409356 1.935720 1.932071 -0.965834 -0.966196 0.538349 0.538219 -0.018835 -0.019689 -0.362873 -0.364258 -1.701593 -1.706390 0.842810 0.843614 0.613928 0.614286 3.943205 3.944720 -0.113726 -0.114396 -0.466585 -0.463806 0.290155 0.290062 -0.287771 -0.287623 -0.538588 -0.538925 -1.173973 -1.177111 -0.309706 -0.309572 -2.575636 -2.575907 -0.058651 -0.058277 2.411604 2.410151 Relative differences for method are 0.00076386787. 
Should be small (1e-9) Num Grads Func Grads 6.711959 6.693584 -68.183899 -68.190453 38.049698 38.039562 -54.199215 -54.186806 97.707741 97.718063 10.526656 10.566038 -70.594788 -70.581596 38.515091 38.487904 -30.736921 -30.749435 89.469902 89.493088 3.854751 3.846486 -25.175093 -25.155622 10.587691 10.596359 -19.964218 -19.955395 28.196333 28.179987 10.213851 10.200496 -82.588188 -82.617134 52.881237 52.895470 -65.713882 -65.718040 130.901337 130.928772 6.317138 6.333028 -86.460106 -86.465569 -36.693573 -36.688496 -36.502838 -36.511711 44.736858 44.745163 -4.116058 -4.114252 65.225601 65.216415 30.076979 30.074966 27.200697 27.205814 -20.799635 -20.794403 -2.260208 -2.282639 31.826017 31.828533 16.363144 16.344271 13.311385 13.307327 -16.103745 -16.111374 1.338959 1.361882 -24.999617 -24.977449 -10.320662 -10.327084 -10.065078 -10.083112 5.676269 5.666134 -5.649566 -5.641539 85.954659 85.937538 37.216187 37.207138 34.606934 34.607189 -39.308548 -39.303772 2.691269 2.685248 -38.249969 -38.242092 -16.252518 -16.261158 -16.225815 -16.215649 16.819000 16.811707 -1.462936 -1.462608 30.584333 30.600145 2.822876 2.822504 82.855217 82.856033 0.379562 0.378145 -5.743026 -5.740044 -0.720978 -0.720155 -17.814636 -17.773310 -0.083923 -0.083204 -20.568846 -20.545242 0.076294 0.076714 -57.123180 -57.119926 1.134872 1.135409 3.015518 3.019630 -2.380371 -2.382002 3.852844 3.869895 1.008987 1.007515 -25.325773 -25.326454 -2.027512 -2.027638 -69.564819 -69.550819 -0.858307 -0.857046 40.840145 40.822083 1.695633 1.695448 112.745277 112.731529 Relative differences for method are 0.00016244102. 
Should be small (1e-9) Num Grads Func Grads 0.370979 0.371239 -3.942966 -3.939380 1.880169 1.885892 -4.216194 -4.219157 6.735086 6.733737 0.416279 0.415627 -2.906561 -2.902907 1.408815 1.409868 -1.617193 -1.616733 4.136801 4.136244 0.260353 0.259042 -1.281023 -1.276769 0.546217 0.549732 -1.114845 -1.114833 1.324415 1.325205 0.393391 0.392242 -3.700971 -3.701753 2.170086 2.169993 -3.574133 -3.574246 6.063461 6.066147 0.670671 0.670281 -4.621267 -4.624799 -2.263308 -2.262765 -2.675056 -2.679310 3.272533 3.272666 -0.095606 -0.094364 2.565861 2.564243 1.099110 1.098414 0.891924 0.892591 -0.654697 -0.656995 0.049114 0.045307 0.544071 0.541945 0.272989 0.275064 -0.040054 -0.042496 -0.256300 -0.260528 0.154018 0.157435 -1.469135 -1.473261 -0.620604 -0.615052 -0.800848 -0.800142 0.550508 0.553296 -0.478268 -0.478678 4.086733 4.090587 1.882315 1.881419 2.112389 2.115821 -2.469778 -2.472500 0.278950 0.274027 -2.095699 -2.095469 -0.984669 -0.983863 -1.176834 -1.176937 1.266718 1.266336 -0.509024 -0.509570 2.299547 2.298040 0.456572 0.455321 3.930330 3.928681 0.109434 0.109184 -0.335455 -0.336301 -0.148058 -0.148257 -0.793695 -0.796763 -0.050306 -0.050607 -0.918627 -0.920102 0.015974 0.014915 -2.103090 -2.102135 0.313044 0.312434 -0.062227 -0.059504 -0.477552 -0.477148 -0.045061 -0.046998 0.325680 0.325307 -1.706362 -1.710228 -0.344992 -0.345803 -3.120184 -3.121110 -0.256777 -0.255635 2.459288 2.462493 0.304222 0.304774 5.084037 5.084333 Relative differences for method are 0.000524794. 
Should be small (1e-9) Num Grads Func Grads 0.312567 0.311857 0.060797 0.061075 -0.182033 -0.180999 -0.234008 -0.232954 -0.136733 -0.139338 -0.059962 -0.057574 -0.219822 -0.221368 0.201702 0.202568 0.025988 0.025125 0.081062 0.083038 -0.260592 -0.261443 0.021577 0.023480 -0.099421 -0.099594 -0.292659 -0.291759 -0.164032 -0.163031 -0.078917 -0.080009 -0.034451 -0.033984 0.128984 0.130029 0.008345 0.010039 0.174046 0.174043 -0.094056 -0.093796 -0.013828 -0.013256 0.057578 0.057035 0.092030 0.092475 -0.010848 -0.012292 0.020504 0.020589 0.103712 0.105378 -0.113130 -0.113505 0.017166 0.019778 -0.064254 -0.062586 0.159979 0.164053 -0.026703 -0.027487 0.051618 0.052212 0.198841 0.202832 0.144124 0.142701 0.209331 0.209704 -0.006795 -0.008453 0.004172 0.003124 -0.060678 -0.063686 -0.091195 -0.091876 -0.159264 -0.161411 -0.062108 -0.065715 -0.352025 -0.351371 -0.360966 -0.360692 0.148058 0.148704 0.161529 0.160150 -0.209570 -0.209143 -0.133514 -0.136240 0.100493 0.101331 0.019431 0.018995 0.087619 0.089369 0.033379 0.037169 0.201225 0.199820 0.211835 0.213000 -0.085711 -0.088189 -0.093460 -0.092003 0.123620 0.121501 0.061631 0.062936 -0.060201 -0.058970 0.001669 0.004213 0.108123 0.106151 -0.020027 -0.020401 0.338316 0.336111 0.303030 0.302518 -0.206590 -0.210287 -0.223994 -0.224138 0.186324 0.186575 0.137210 0.137436 -0.026226 -0.025840 -0.056744 -0.056368 0.064492 0.064329 0.001669 0.000783 0.207186 0.205043 0.207663 0.206123 -0.127316 -0.125652 -0.126839 -0.126457 0.117660 0.117819 0.068069 0.068519 -0.024915 -0.023798 -0.009179 -0.009422 -0.173569 -0.175902 -0.274539 -0.273241 -0.158906 -0.157542 -0.372171 -0.370569 -0.087976 -0.089809 -0.123262 -0.121120 -0.140309 -0.139166 0.042439 0.040605 0.222564 0.222288 -0.209928 -0.209915 0.201821 0.201518 0.084519 0.082732 0.426531 0.426795 0.420213 0.422846 -0.174165 -0.176922 -0.193238 -0.195676 0.250816 0.248087 0.196815 0.194071 -0.122786 -0.123045 -0.048637 -0.050270 0.039816 0.038292 -0.198364 -0.200369 0.256062 0.254771 
0.003695 0.001437 -0.253201 -0.251856 -0.325084 -0.325386 0.086784 0.085402 0.296593 0.296466 0.129580 0.127823 -0.335813 -0.337562 0.062466 0.063843 0.195265 0.196389 -0.074983 -0.075120 0.080705 0.082724 0.173926 0.172865 0.202298 0.202088 -0.005603 -0.004043 -0.108004 -0.108222 -0.144482 -0.143103 0.176907 0.174852 0.063300 0.065570 0.256062 0.261189 -0.117183 -0.118304 0.119209 0.120075 0.228763 0.228869 0.277638 0.279610 -0.003099 -0.006243 -0.204802 -0.203730 -0.191450 -0.194106 0.292301 0.293116 -0.031471 -0.030124 0.141859 0.142707 -0.208497 -0.209867 -0.026941 -0.026536 0.196695 0.196141 0.244260 0.246189 -0.070572 -0.072115 -0.244141 -0.243417 -0.098586 -0.096951 0.258088 0.258783 0.208735 0.208590 0.164986 0.164989 0.353813 0.352502 0.459433 0.460969 -0.103235 -0.103246 -0.096798 -0.096121 0.235319 0.234387 0.061750 0.064499 -0.175953 -0.176755 0.091076 0.092398 -0.173450 -0.172762 0.401378 0.405599 0.116706 0.115096 -0.196934 -0.199266 -0.090599 -0.094570 -0.127792 -0.126921 -0.134468 -0.134261 0.315547 0.316991 0.520587 0.518939 -0.836611 -0.836996 0.042558 0.041820 -0.083089 -0.081949 -0.023961 -0.023933 0.035644 0.035489 0.026226 0.026675 0.028372 0.029101 0.048518 0.047496 -0.060201 -0.061848 -0.109315 -0.110237 0.200987 0.201105 0.235200 0.233015 0.101566 0.100780 0.259161 0.259513 -0.268936 -0.271424 -0.100374 -0.100878 0.051141 0.051580 -0.361204 -0.359039 0.009775 0.008971 -0.037432 -0.038072 1.041055 1.039986 -0.142574 -0.143103 0.441074 0.442194 0.192046 0.191359 -0.230789 -0.231685 -0.102997 -0.102431 -0.130296 -0.130364 -0.191212 -0.192988 0.326395 0.328803 0.539780 0.539642 -0.720501 -0.721265 -0.059605 -0.061262 -0.048280 -0.047782 -0.104427 -0.104571 0.070691 0.070518 0.013947 0.014226 -0.003576 -0.003398 0.090122 0.089000 -0.016570 -0.017456 -0.011921 -0.013042 -0.263453 -0.266497 -0.208616 -0.207550 0.619888 0.619279 0.316024 0.315512 -0.364423 -0.364230 -0.186801 -0.185087 -0.189185 -0.189708 -0.403643 -0.404409 0.485420 0.485202 
0.818610 0.818659 -1.041293 -1.041815 -0.081539 -0.080142 0.436425 0.439344 0.216126 0.218905 -0.286102 -0.287349 -0.126243 -0.126274 -0.111699 -0.111244 -0.265479 -0.263417 0.304103 0.303301 0.483394 0.483243 -0.408053 -0.409209 0.062704 0.062238 -0.412106 -0.412213 -0.278354 -0.277767 0.311017 0.310491 0.159025 0.162666 0.103116 0.104658 0.409007 0.405954 -0.302196 -0.303278 -0.511885 -0.511293 0.366926 0.366982 -0.054240 -0.053122 0.166178 0.166554 0.028968 0.028728 -0.083208 -0.085248 -0.049710 -0.049947 -0.041842 -0.040493 -0.067949 -0.065471 0.110388 0.110542 0.176191 0.177371 -0.276923 -0.277232 -0.030756 -0.028911 0.551701 0.557508 0.312448 0.313525 -0.409842 -0.413396 -0.157475 -0.157533 -0.108600 -0.107965 -0.349045 -0.346642 0.376224 0.377657 0.558853 0.558513 -0.209928 -0.209954 0.193953 0.192894 0.428915 0.428110 0.445962 0.446616 -0.499606 -0.500053 -0.192881 -0.190013 -0.024080 -0.024177 -0.561595 -0.561750 0.243306 0.243692 0.326037 0.324034 0.806570 0.804877 1.450896 1.450685 -0.115156 -0.118111 1.001477 1.000420 0.400543 0.401283 1.772046 1.771781 0.511885 0.512357 -1.435280 -1.434563 -0.493884 -0.491516 0.315666 0.315025 -0.525355 -0.524058 0.869632 0.869365 0.763893 0.764892 -2.297044 -2.297577 -0.349283 -0.350460 -0.802994 -0.803257 -0.894308 -0.894548 -0.591516 -0.591840 0.586987 0.585220 -0.531197 -0.531900 0.256062 0.254967 0.478744 0.479863 0.458121 0.457866 Relative differences for method are 0.002172598. 
Should be small (1e-9) Beginning training with the following parameters: input size = 1, hidden layers = [1], output size = 1, batch size = 1024, num epochs = 150, training alpha = 0.002, decay rate = 0.1, L2 Reg Constant = 0.0, max norm reg constant = Inf, dropout rate = 0.0, residual layer size = 0 ------------------------------------------------------------------- Initial cost is 4.5569577 ------------------------------------------------------------------- Completed training on CPU with the following parameters:  input size = 1, hidden layers = [1], output size = 1, batch size = 1024, num epochs = 150, training alpha = 0.002, decay rate = 0.1, L2 Reg Constant = 0.0, max norm reg constant = Inf, dropout rate = 0.0, residual layer size = 0 Training Results: Cost reduced from 4.7715697to 1.9026493 after 1 seconds and 150 epochs Median time of 48.99338819086552 ns per example Total operations per example = 32.0 foward prop ops + 9.00390625 backprop ops + 0.03515625 update ops = 41.0390625 Approximate GFLOPS = 0.8376449157444136 ------------------------------------------------------------------- Completed benchmark with 1 input [1] hidden 1 output, and 1024 batchSize on a AMD EPYC 7763 64-Core Processor Time to train on CPU took 0.8029501438140869 seconds for 150 epochs Average time of 52.275399987896286 ns per example Total operations per example = 32.0 foward prop ops + 9.00390625 backprop ops + 0.03515625 update ops = 41.0390625 Approximate GFLOPS = 0.7850549686755544 Backend is set to CPU Num Grads Func Grads -0.167504 -0.167347 0.543714 0.543676 -0.144422 -0.144394 -0.008911 -0.008839 -0.272021 -0.272223 0.014335 0.014266 -0.214294 -0.214307 0.017107 0.017179 0.016794 0.016594 0.023171 0.023050 0.037402 0.037344 -0.182405 -0.182453 0.002950 0.002887 0.037596 0.037637 0.062510 0.062431 0.026956 0.027002 -0.002280 -0.002004 -0.009179 -0.009318 0.009418 0.009481 0.028744 0.028698 0.206247 0.206354 -0.404969 -0.404844 -0.597507 -0.597381 0.240430 0.240348 0.100955 
0.101023 -0.034809 -0.034889 0.057399 0.057441 0.016525 0.016654 -0.005126 -0.005139 -0.028357 -0.028296 0.161871 0.161984 -0.274017 -0.274126 -0.309244 -0.309130 0.141218 0.141186 0.066102 0.066191 0.082031 0.082081 -0.197053 -0.197146 -0.400871 -0.400995 0.162691 0.162752 0.036269 0.036158 0.120744 0.120639 -0.170439 -0.170605 -0.198990 -0.198996 0.092894 0.092803 0.036508 0.036529 0.027657 0.027803 -0.045031 -0.045067 0.024885 0.024758 -0.005856 -0.005797 0.021622 0.021706 0.999942 1.000000 0.000000 0.000000 -0.260040 -0.260063 0.000000 0.000000 -0.155240 -0.155227 0.000000 0.000000 -0.826180 -0.826232 0.000000 0.000000 -0.518218 -0.518207 0.000000 0.000000 0.204593 0.204622 0.000000 0.000000 Relative differences for method are 0.0001881188. Should be small (1e-9) Checking for cuda toolkit versions No cuda toolkit appears to be installed. If this sytem has an NVIDIA GPU, install the cuda toolkit and add nvcc to the system path to use the GPU backend. Available backends are: CPU text/plaincell_id$d963ff6d-f1b6-4799-aa0e-1ae100310d84kwargsidPlutoRunner_d1acb81efileP/home/runner/.julia/packages/Pluto/5ete1/src/runner/PlutoRunner/src/io/stdout.jlgroupstdoutlevelLogLevel(-555)running¦outputbody mimetext/htmlrootassigneelast_run_timestampA kYpersist_js_state·has_pluto_hook_features§cell_id$d963ff6d-f1b6-4799-aa0e-1ae100310d84depends_on_disabled_cells§runtime opublished_object_keysdepends_on_skipped_cells§errored$b16899b7-36bf-4a5e-8e2f-4496b8450687queued¤logsrunning¦outputbody6squashed_gaussian_pdf (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA !Mpersist_js_state·has_pluto_hook_features§cell_id$b16899b7-36bf-4a5e-8e2f-4496b8450687depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$10cdd16e-a337-4421-a7a0-6de4e4b60c0fqueued¤logsrunning¦outputbodyBinaryGaussianEligibilityVectormimetext/plainrootassigneelast_run_timestampA 
!persist_js_state·has_pluto_hook_features§cell_id$10cdd16e-a337-4421-a7a0-6de4e4b60c0fdepends_on_disabled_cells§runtimeM>ݵpublished_object_keysdepends_on_skipped_cells§errored$a8b40b8f-051a-4e6f-a079-ece4f32873dequeued¤logsrunning¦outputbody>create_actor_critic_params_UI (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA ?ͧ~persist_js_state·has_pluto_hook_features§cell_id$a8b40b8f-051a-4e6f-a079-ece4f32873dedepends_on_disabled_cells§runtime:2published_object_keysdepends_on_skipped_cellsçerrored$5eebf3da-bfe7-46eb-81a3-f87f334ee270queued¤logsrunning¦outputbodyDcreate_actor_critic_fcann_params_UI (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA 35Qpersist_js_state·has_pluto_hook_features§cell_id$5eebf3da-bfe7-46eb-81a3-f87f334ee270depends_on_disabled_cells§runtimeFUpublished_object_keysdepends_on_skipped_cellsçerrored$9bce6fdb-2cbc-4758-9a8b-794e490c973dqueued¤logsrunning¦outputbody/1mimetext/htmlrootassigneelast_run_timestampA :װpersist_js_state·has_pluto_hook_features§cell_id$9bce6fdb-2cbc-4758-9a8b-794e490c973ddepends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cellsçerrored$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbfqueued¤logsrunning¦outputbodyEcreate_continuous_action_mountaincar (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA =_persist_js_state·has_pluto_hook_features§cell_id$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbfdepends_on_disabled_cells§runtime%published_object_keysdepends_on_skipped_cells§errored$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6queued¤logsrunning¦outputbodyelementsprefixTuple{Float32, 
Float32}elementselements-0.531205text/plain0.0text/plaintypeTupleobjectid82cf4db9a775ec1d!application/vnd.pluto.tree+objectelements-0.530148text/plain0.00105704text/plaintypeTupleobjectid7ce0ca28f7354422!application/vnd.pluto.tree+objectelements-0.528042text/plain0.00210616text/plaintypeTupleobjectide1eed8cc1ecb8f34!application/vnd.pluto.tree+objectelements-0.524902text/plain0.00313948text/plaintypeTupleobjectidf1a26da07d248f21!application/vnd.pluto.tree+objectelements-0.520753text/plain0.00414925text/plaintypeTupleobjectid9493e20af903dd19!application/vnd.pluto.tree+objectelements-0.515625text/plain0.00512791text/plaintypeTupleobjectid104e7d3122527a14!application/vnd.pluto.tree+objectelements-0.509557text/plain0.00606812text/plaintypeTupleobjectid878f21ed8c91c756!application/vnd.pluto.tree+objectelements-0.502594text/plain0.00696283text/plaintypeTupleobjectid5b9ea3da43f891d3!application/vnd.pluto.tree+object elements-0.494789text/plain0.0078054text/plaintypeTupleobjectid82934ae51d4b1cc4!application/vnd.pluto.tree+objectmorěelements0.495647text/plain0.0136242text/plaintypeTupleobjectid85c3ada8880376ae!application/vnd.pluto.tree+objecttypeArrayprefix_shortobjectid7b308864cd041b41!application/vnd.pluto.tree+objectprefixInt64elements3text/plain3text/plain3text/plain3text/plain3text/plain3text/plain3text/plain3text/plain 3text/plainmorě2text/plaintypeArrayprefix_shortobjectid9a11c7e08ac6a3fc!application/vnd.pluto.tree+objectprefixFloat32elements-1.0text/plain-1.0text/plain-1.0text/plain-1.0text/plain-1.0text/plain-1.0text/plain-1.0text/plain-1.0text/plain -1.0text/plainmorě-1.0text/plaintypeArrayprefix_shortobjectid8d82f25e85371d1!application/vnd.pluto.tree+objectelements0.5text/plain0.0134148text/plaintypeTupleobjectid32d698e8ae81b727!application/vnd.pluto.tree+object140text/plaintypeTupleobjectid498dfdf2d9f4ea29mime!application/vnd.pluto.tree+objectrootassignee)const mountaincar_continuing_test_episodelast_run_timestampA 
=kpersist_js_state·has_pluto_hook_features§cell_id$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6depends_on_disabled_cells§runtimeL>published_object_keysdepends_on_skipped_cellsçerrored$7afb6fb0-248a-4518-b94f-9876f81eca64queued¤logsrunning¦outputbodyDcorridor_continuing_parameter_study (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA +~persist_js_state·has_pluto_hook_features§cell_id$7afb6fb0-248a-4518-b94f-9876f81eca64depends_on_disabled_cells§runtime8?published_object_keysdepends_on_skipped_cellsçerrored$37a273b6-b104-46f0-987a-401dc1c97327queued¤logsrunning¦outputbodymimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$37a273b6-b104-46f0-987a-401dc1c97327depends_on_disabled_cells§runtimeUpublished_object_keysdepends_on_skipped_cellsçerrored$7a6f3f79-ea06-4994-8b62-90b2056e4034queued¤logsrunning¦outputbody@make_squashed_gaussian_sampler (generic function with 2 methods)mimetext/plainrootassigneelast_run_timestampA #>persist_js_state·has_pluto_hook_features§cell_id$7a6f3f79-ea06-4994-8b62-90b2056e4034depends_on_disabled_cells§runtimedpublished_object_keysdepends_on_skipped_cells§errored$f2ed56c9-c2b7-42cb-a083-e12aeaa126efqueued¤logsrunning¦outputbodyprefixFloat32elements0.423691text/plain0.576308text/plaintypeArrayprefix_shortobjectidc8377fe263c620b2mime!application/vnd.pluto.tree+objectrootassigneelast_run_timestampA $persist_js_state·has_pluto_hook_features§cell_id$f2ed56c9-c2b7-42cb-a083-e12aeaa126efdepends_on_disabled_cells§runtimeMpublished_object_keysdepends_on_skipped_cellsçerrored$cbea5840-49d2-4e91-be9c-f5f15666d78aqueued¤logsrunning¦outputbodyprefixFloat32elements0.389351text/plain0.610649text/plaintypeArrayprefix_shortobjectidfe5c659bbdbdad22mime!application/vnd.pluto.tree+objectrootassigneelast_run_timestampA 
%Opersist_js_state·has_pluto_hook_features§cell_id$cbea5840-49d2-4e91-be9c-f5f15666d78adepends_on_disabled_cells§runtimezpublished_object_keysdepends_on_skipped_cellsçerrored$1f041cb3-618c-4380-a1ec-d7bbe4a80f62queued¤logsrunning¦outputbodyMactor_critic_binary_episodic_parameter_study (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA +/upersist_js_state·has_pluto_hook_features§cell_id$1f041cb3-618c-4380-a1ec-d7bbe4a80f62depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cellsçerrored$96506201-6b66-49e6-8179-06952e2394e1queued¤logsrunning¦outputbody>setup_binary_policy_arguments (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA ֹ9persist_js_state·has_pluto_hook_features§cell_id$96506201-6b66-49e6-8179-06952e2394e1depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$76b03e72-da04-4530-8534-6d6468268cbdqueued¤logsrunning¦outputbody

$$\sum_{s \in \mathcal{S}} \sum_{k = 0}^\infty \Pr \{ s_0 \rightarrow s, k, \pi \} = \sum_{k = 0}^\infty \left [ 1 - \Pr \{s_0 \rightarrow S_T, k, \pi \} \right ] = \eta$$

where $\eta$ is the average length of an episode. The quantity inside the brackets is the probability that an episode has not terminated by step $k$; it follows from the fact that the sum over states in $\mathcal{S}$ covers only the non-terminal states. If the sum were over $\mathcal{S}^+$ instead, it would be infinite, since the sum over states would equal 1 for every $k$. Normally, to calculate $\eta$ we would take the expected value using the probability of an episode lasting exactly $k$ steps, but the probability we have access to here is the cumulative distribution function, not the probability mass function. That is, $\Pr \{s_0 \rightarrow S_T, k, \pi \} = \sum_{t = 0}^k \Pr \{ T = t \} = \Pr \{ T \leq k \}$, where $T$ is the length of an episode. Using these probabilities, we can write $\eta = \mathbb{E}_\pi [T] = \sum_{k = 0}^\infty k \Pr \{ T = k \} = \Pr \{T = 1 \} + 2 \Pr \{T = 2 \} + \cdots$.

Earlier we had the expression $\eta = \sum_{k = 0}^\infty \left [ 1 - \Pr \{s_0 \rightarrow S_T, k, \pi \} \right ] = \sum_{k = 0}^\infty \Pr \{T \gt k \} = \sum_{k = 0}^\infty \sum_{t = k + 1}^\infty \Pr \{T = t \}$.

We can stack up the terms of this double sum to see that it is equivalent to the expected value calculation from before:

$$\begin{flalign} \Pr \{ T = 1 \} + \Pr \{ T = 2 \} + &\Pr \{ T = 3 \} +\cdots \\ \Pr \{ T = 2 \} + &\Pr \{ T = 3 \} + \cdots \\ &\Pr \{ T = 3 \} + \cdots \\ \vdots \end{flalign}$$

If we count terms down each column, we see that $\Pr \{ T = k \}$ appears exactly $k$ times, matching the expected value calculation.
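
As a quick numerical sanity check of this equivalence, the sketch below compares the direct expected value with the tail-sum form. The geometric episode-length distribution, the termination probability `p`, and the truncation point `kmax` are all assumptions chosen for illustration; they do not come from the notebook.

```julia
# Hypothetical episode-length distribution: T is geometric on {1, 2, ...}
# with per-step termination probability p, truncated at kmax for the check.
p = 0.2
kmax = 500
pmf(k) = k >= 1 ? p * (1 - p)^(k - 1) : 0.0   # Pr{T = k}

# Direct expected value: sum over k of k * Pr{T = k}
expected_direct = sum(k * pmf(k) for k in 0:kmax)

# Tail-sum form: sum over k of Pr{T > k}, where Pr{T > k} = sum over t > k of Pr{T = t}
tail(k) = sum(pmf(t) for t in (k + 1):kmax; init = 0.0)
expected_tail = sum(tail(k) for k in 0:kmax)
```

Both quantities should agree up to floating-point error, and for this geometric example both are close to $1/p$.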

What if we wanted to calculate the bivariate distribution over states and steps, ignoring the terminal states, $\mu_\pi(s, k)$ such that $\sum_{s \in \mathcal{S}} \sum_k \mu_\pi(s, k) = 1$? This probability represents the chance of sampling a particular step and state simultaneously from an unbiased sample of non-terminal states in an episode. Luckily, we can break this probability down into two components: 1) the probability of reaching step $k$ without terminating, and 2) the probability of being in state $s$ on step $k$ given that the episode has not terminated. We saw already that 1) is just $\sum_{s \in \mathcal{S}} \Pr \{ s_0 \rightarrow s, k, \pi \}$, and 2) we can calculate by normalizing those probabilities over only the non-terminal states: $\frac{\Pr \{ s_0 \rightarrow s, k, \pi \}}{\sum_{x \in \mathcal{S}} \Pr \{ s_0 \rightarrow x, k, \pi \} }$. Multiplying these two together, we see that the probability is just the original distribution restricted to the domain $s \in \mathcal{S}$ and all possible steps $k$. Therefore, we can transform it into a normalized bivariate distribution by dividing by its sum over those two sets:

$$\mu_\pi(s, k) = \frac{\Pr \{ s_0 \rightarrow s, k, \pi \}}{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}}$$

Now that we have established the relationship between the on-policy distribution function and the probability expression we have, we can use it to complete the proof below.
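
To make the normalization concrete, here is a minimal sketch on a made-up two-state episodic chain. The transition matrix `P`, the start distribution `d0`, the horizon of 400 steps, and the helper `visit_probabilities` are hypothetical and only serve to illustrate the formulas above.

```julia
using LinearAlgebra

# Hypothetical 2-state episodic chain under a fixed policy. P[i, j] is the
# probability of moving from non-terminal state i to non-terminal state j;
# each row sums to less than 1, and the deficit is the termination probability.
P = [0.5 0.3;
     0.2 0.6]
d0 = [1.0, 0.0]   # the episode always starts in state 1

# Row k + 1 holds Pr{s_0 → s, k, π} for every non-terminal state s
function visit_probabilities(P, d0, K)
    visit = zeros(K + 1, length(d0))
    d = copy(d0)
    for k in 0:K
        visit[k + 1, :] .= d
        d = P' * d          # advance the state distribution by one step
    end
    return visit
end

visit = visit_probabilities(P, d0, 400)
η = sum(visit)                  # average episode length, about 5.0 for this chain
η_exact = sum((I - P') \ d0)    # closed form: the sum over k of (P')^k d0 is (I - P')^{-1} d0
μ = visit ./ η                  # normalized bivariate distribution μ_π(s, k)
```

Summing `μ` over both states and steps gives 1, and summing it over steps alone gives the usual on-policy state distribution for this toy chain.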

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$76b03e72-da04-4530-8534-6d6468268cbddepends_on_disabled_cells§runtime µpublished_object_keysdepends_on_skipped_cells§errored$fd89433e-643c-474b-b3c4-a997678421a6queued¤logsrunning¦outputbody

Linear Features

This version of REINFORCE uses linear feature vectors for which one needs to specify the total number of features as well as a function that updates the values in a feature vector given a state.
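
As an illustration of what such a specification might look like, the sketch below defines a feature count and an in-place update function for a state given as a position/velocity tuple. The argument order, the state type, and the choice of features are assumptions made for this example and are not necessarily the signature used elsewhere in the notebook.

```julia
# Hypothetical feature setup, for illustration only
const num_features = 3

# Overwrite `x` in place with the features of state `s` (a position/velocity tuple)
function update_feature_vector!(x::Vector{Float32}, s::Tuple{Float32, Float32})
    x[1] = 1f0     # bias feature
    x[2] = s[1]    # position
    x[3] = s[2]    # velocity
    return x
end

# Allocate the vector once and reuse it for every state
feature_vector = zeros(Float32, num_features)
update_feature_vector!(feature_vector, (-0.5f0, 0.01f0))
```

Updating a preallocated vector in place avoids allocating a new feature vector on every step of an episode.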

mimetext/htmlrootassigneelast_run_timestampA Spersist_js_state·has_pluto_hook_features§cell_id$fd89433e-643c-474b-b3c4-a997678421a6depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$87feff3e-e510-4916-91a9-db3a2cd12225queued¤logsrunning¦outputbodye

$\lambda_\theta$: 0.75

$\lambda_\mathbf{w}$: 0.25

$\alpha_{\overline{r}}$:

hidden layer size: , num layers:

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:

mimetext/htmlrootassigneelast_run_timestampA 8 fpersist_js_state·has_pluto_hook_features§cell_id$87feff3e-e510-4916-91a9-db3a2cd12225depends_on_disabled_cells§runtime jpublished_object_keysdepends_on_skipped_cellsçerrored$5261651e-a51e-4e80-8e23-83a4c10e5259queued¤logsrunning¦outputbodyEupdate_gaussian_eligibility_vector! (generic function with 2 methods)mimetext/plainrootassigneelast_run_timestampA #Eհpersist_js_state·has_pluto_hook_features§cell_id$5261651e-a51e-4e80-8e23-83a4c10e5259depends_on_disabled_cells§runtimewpublished_object_keysdepends_on_skipped_cells§errored$dddc4a2f-34b2-41dc-85b3-55aba4880fa6queued¤logsrunning¦outputbodymsg٦UndefVarError: `reinforce_test` not defined in `Main.var"workspace#8"` Suggestion: add an appropriate import or assignment. This global was declared but not assigned.stacktracecall_shorttop-level scopeinlined£urlpath/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#dddc4a2f-34b2-41dc-85b3-55aba4880fa6source_packagecalltop-level scopelinfo_typeCore.CodeInfolinefileMChapter_13_Policy_Gradient_Methods.jl#==#dddc4a2f-34b2-41dc-85b3-55aba4880fa6functop-level scopeparent_modulefrom_c¤mime'application/vnd.pluto.stacktrace+objectrootassigneelast_run_timestampA 0Mpersist_js_state·has_pluto_hook_features§cell_id$dddc4a2f-34b2-41dc-85b3-55aba4880fa6depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cellsçerrored$54fff14b-cf53-47b0-9cfa-8b9ee33df54equeued¤logsrunning¦outputbodyBinaryBetaEligibilityVectormimetext/plainrootassigneelast_run_timestampA 
!persist_js_state·has_pluto_hook_features§cell_id$54fff14b-cf53-47b0-9cfa-8b9ee33df54edepends_on_disabled_cells§runtimeKCǵpublished_object_keysdepends_on_skipped_cells§errored$023f67b8-8f38-470a-9766-ac60a75678aaqueued¤logsrunning¦outputbodyelementsfeature_vectorprefixFloat32elements0.0text/plain0.0text/plaintypeArrayprefix_shortobjectid96151aba818dc130!application/vnd.pluto.tree+objectnum_features2text/plainupdate_feature_vector!update_feature_vector!text/plaintypeNamedTupleobjectid534e5799b20b5f5bmime!application/vnd.pluto.tree+objectrootassigneeconst mountaincar_fcann_setuplast_run_timestampA :jpersist_js_state·has_pluto_hook_features§cell_id$023f67b8-8f38-470a-9766-ac60a75678aadepends_on_disabled_cells§runtimeS~published_object_keysdepends_on_skipped_cells§errored$1558cec1-c4fd-4bc0-85ed-ae22c6067d41queued¤logsrunning¦outputbody:

We can also repeat this derivation for the alternative linear parameterization where we only have state feature vectors and a parameter matrix with components $\boldsymbol{\theta}_{i, j}$:

$$\begin{flalign} \mathbf{h} &= \boldsymbol{\theta}^\top \mathbf{x}(s) \\ h_a &= \mathbf{h}_a \\ \mathbf{\pi}(s) &= \sigma(\mathbf{h}) \\ \pi_a &= \sigma(\mathbf{h})_a \\ \nabla(\pi_a)_{i, j} &= \pi_a \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases} \end{flalign}$$

We already know how to apply the chain rule to the natural logarithm: $\nabla \ln \pi_a = \nabla \pi_a / \pi_a$. Applying this to the above expression yields:

$$\begin{flalign} \nabla \left ( \ln \pi_a \right )_{i, j} &= \frac{\nabla \left ( \pi_a \right )_{i, j}}{\pi_a} \\ &= \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases} \end{flalign}$$

which is the per-component version of the desired vector expression.
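As a sanity check, the per-component expression can be compared against a finite-difference gradient. The sketch below is only illustrative: the function names (`softmax_eligibility`, `finite_difference_eligibility`) and the numbers in `θ` and `x` are made up for this example and are not taken from the notebook's implementation.

```julia
# Numerically stable softmax over a preference vector.
softmax(h) = (e = exp.(h .- maximum(h)); e ./ sum(e))

# Policy vector π(s) = σ(θᵀ x(s)) for a d × N_a parameter matrix θ.
policy(θ, x) = softmax(θ' * x)

# Eligibility matrix ∇ ln π_a with entries x_i (1{j = a} - π_j).
function softmax_eligibility(θ, x, a)
    probs = policy(θ, x)
    onehot = [j == a ? 1.0 : 0.0 for j in eachindex(probs)]
    return x * (onehot .- probs)'
end

# Central finite-difference gradient of ln π_a, used only as a check.
function finite_difference_eligibility(θ, x, a; ϵ = 1e-6)
    g = zeros(size(θ))
    for idx in eachindex(θ)
        θp = copy(θ); θp[idx] += ϵ
        θm = copy(θ); θm[idx] -= ϵ
        g[idx] = (log(policy(θp, x)[a]) - log(policy(θm, x)[a])) / (2ϵ)
    end
    return g
end

θ = [0.2 -0.1 0.4; -0.3 0.5 0.1]   # hypothetical values: d = 2 features, N_a = 3 actions
x = [1.0, 2.0]                     # hypothetical state feature vector
analytic = softmax_eligibility(θ, x, 2)
numeric = finite_difference_eligibility(θ, x, 2)
```

The two matrices should agree up to finite-difference error, and each row of `analytic` sums to zero because the probabilities sum to 1.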

mimetext/htmlrootassigneelast_run_timestampA ְpersist_js_state·has_pluto_hook_features§cell_id$1558cec1-c4fd-4bc0-85ed-ae22c6067d41depends_on_disabled_cells§runtime{published_object_keysdepends_on_skipped_cells§errored$da8d0bca-105b-4d0b-a73d-ee5c9059aeafqueued¤logsrunning¦outputbody

Notice now that all of the parameters associated with the state-value estimate are irrelevant since they always cancel out in the parameter update. Even though we have added a parameter, this method effectively removes two from the analysis. Also, we seem to actually benefit from an intermediate value of $\lambda_{\boldsymbol{\theta}}$, unlike in the episodic case where using the Monte Carlo method was always the best.

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$da8d0bca-105b-4d0b-a73d-ee5c9059aeafdepends_on_disabled_cells§runtimeZpublished_object_keysdepends_on_skipped_cells§errored$3e7cecec-eb77-4862-8e3c-b510422e06dbqueued¤logsrunning¦outputbody mimetext/htmlrootassigneelast_run_timestampA !apersist_js_state·has_pluto_hook_features§cell_id$3e7cecec-eb77-4862-8e3c-b510422e06dbdepends_on_disabled_cells§runtimeYpublished_object_keysdepends_on_skipped_cellsçerrored$0284f0d7-b8a9-4ae6-add0-ac1078571d9bqueued¤logsrunning¦outputbody*

$$\begin{flalign} J(\boldsymbol{\theta}) \doteq r(\pi) &\doteq \lim_{h \rightarrow \infty} \frac{1}{h} \sum_{t=1}^h \mathbb{E} [R_t \mid S_0, A_{0:t-1} \sim \pi] \tag{13.15} \\ &= \lim_{t \rightarrow \infty} \mathbb{E}[R_t \vert S_0,A_{0:t-1} \sim \pi] \\ &= \sum_s \mu(s) \sum_a \pi(a \vert s) \sum_{s^\prime, r} p(s^\prime, r \vert s, a) r \end{flalign}$$

where $\mu$ is the steady-state distribution under $\pi$, $\mu(s) \doteq \lim_{t \rightarrow \infty} \Pr \{ S_t = s \vert A_{0:t} \sim \pi \}$, which is assumed to exist and to be independent of $S_0$ (an ergodicity assumption). Remember that this is the special distribution under which, if you select actions according to $\pi$, you remain in the same distribution:

$$\sum_s \mu(s) \sum_a \pi(a \vert s, \boldsymbol{\theta})p(s^\prime \vert s, a) = \mu(s^\prime), \: \forall s^\prime \in \mathcal{S}$$
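The fixed-point property can be illustrated numerically. The following sketch uses a made-up two-state chain whose matrix `P` already combines the policy and the dynamics, $P_{s, s^\prime} = \sum_a \pi(a \vert s) p(s^\prime \vert s, a)$; the chain and the function name `stationary_distribution` are assumptions for illustration only.

```julia
# Row-stochastic matrix for a hypothetical two-state chain under a fixed policy.
P = [0.9 0.1;
     0.5 0.5]

# Repeatedly push a distribution through the chain until it stops changing.
function stationary_distribution(P; iterations = 1_000)
    μ = fill(1 / size(P, 1), size(P, 1))
    for _ in 1:iterations
        μ = P' * μ      # μ(s′) ← Σ_s μ(s) P[s, s′]
    end
    return μ
end

μ = stationary_distribution(P)
```

For this chain the balance condition $0.1 \mu(1) = 0.5 \mu(2)$ gives $\mu = (5/6, 1/6)$, and applying `P'` once more leaves `μ` unchanged.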

Naturally, in the continuing case, we define values, $v_\pi(s) \doteq \mathbb{E}_\pi [G_t \vert S_t = s]$ and $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \vert S_t = s, A_t = a]$, with respect to the differential return:

$$G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots \tag{13.17}$$

With these alternate definitions, the policy gradient theorem as given for the episodic case (13.5) remains true for the continuing case. See proof below:

mimetext/htmlrootassigneelast_run_timestampA 3ΰpersist_js_state·has_pluto_hook_features§cell_id$0284f0d7-b8a9-4ae6-add0-ac1078571d9bdepends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$b94fc99c-f439-4df2-8da3-c01718a136c4queued¤logsrunning¦outputbody

Repeating this process for state 2 yields:

$$\begin{flalign} v_2 &= -\frac{2+p}{p(1-p)} \\ \frac{\partial v_2}{\partial p} &= -\frac{p(1-p) - (2+p)(1 - 2p)}{p^2(1-p)^2} \end{flalign}$$

Setting this equal to 0 implies

$$\begin{flalign} p - p^2 &= 2 - 4p + p - 2p^2 \\ p^2 + 4p - 2 &= 0 \\ \end{flalign}$$

Using the quadratic equation and taking only the positive solution yields:

$$p = \frac{-4 + \sqrt{16 + 8}}{2} = \frac{-4 + \sqrt{24}}{2} = -2 + \sqrt{6} \approx 0.4495$$

So, in order to maximize the value at state 2, we have $p_{\text{left}} \approx 0.4495$ and $p_{\text{right}} \approx 0.5505$, which is different from the value we got for state 1. So there is a different optimal policy depending on the starting state. It should be obvious, for example, that starting in the third state results in an optimal policy of choosing the right action every time. The value functions for each state are plotted below. The behavior of $v_3$ is not well defined at $p=0$ because for any finite $v_2$ it should be 0 but the limit approaching from the right side is -3. This is because for $p=0$ both $v_1$ and $v_2$ are not finite and the episode never terminates.

The value of the state at this probability is: $v_2 = - \frac{2+p}{p(1-p)} = -\frac{\sqrt{6}}{(\sqrt{6}-2)(3 - \sqrt{6})} = - \frac{\sqrt{6}}{3 \sqrt{6} - 6 - 6 + 2 \sqrt{6}} = - \frac{\sqrt{6}}{5 \sqrt{6} - 12} \approx -9.9$
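The closed form and the optimum can be cross-checked by solving the Bellman equations directly. This is a sketch under assumed dynamics that are consistent with the expression for $v_2$ above: $p$ is the probability of choosing left, every step has reward $-1$, left in state 1 stays in place, the actions in state 2 are switched, and right in state 3 terminates. The function names are hypothetical.

```julia
# Solve the three linear Bellman equations for v_1, v_2, v_3 given p = Pr(left).
function corridor_values(p)
    A = [(1 - p) (p - 1) 0.0;      # v1 = -1 + p v1 + (1 - p) v2
         (p - 1) 1.0     (-p);     # v2 = -1 + (1 - p) v1 + p v3
         0.0     (-p)    1.0]      # v3 = -1 + p v2
    return A \ fill(-1.0, 3)
end

# Closed form derived in the text.
v2_closed(p) = -(2 + p) / (p * (1 - p))

# Grid search for the p that maximizes v_2.
ps = 0.01:0.0001:0.99
best_p = ps[argmax([corridor_values(p)[2] for p in ps])]
best_v2 = corridor_values(best_p)[2]
```

The grid search should land close to $-2 + \sqrt{6}$ with a value near $-9.9$, matching the derivation.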

mimetext/htmlrootassigneelast_run_timestampA \Spersist_js_state·has_pluto_hook_features§cell_id$b94fc99c-f439-4df2-8da3-c01718a136c4depends_on_disabled_cells§runtime[Npublished_object_keysdepends_on_skipped_cells§errored$b8532822-179b-4cd5-a279-4b71dafb544aqueued¤logsrunning¦outputbodyelementsstep_rewardsprefixFloat32elementstypeArrayprefix_shortobjectid1719e83743ccf99b!application/vnd.pluto.tree+objectepisode_stepsprefixInt64elements255500text/plain258973text/plain271063text/plain282869text/plain292280text/plain295131text/plain298359text/plain302972text/plain 306408text/plainmoreƒ999862text/plaintypeArrayprefix_shortobjectid7dd28e625537ba9!application/vnd.pluto.tree+objectepisode_rewardsprefixFloat32elements-255499.0text/plain-3473.0text/plain-12090.0text/plain-11806.0text/plain-9411.0text/plain-2851.0text/plain-3228.0text/plain-4613.0text/plain -3436.0text/plainmoreƒ-161.0text/plaintypeArrayprefix_shortobjectid31815ca7f89be3d0!application/vnd.pluto.tree+objectpolicy_parameters?1452×2 Matrix{Float32}: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0766672 -0.267737 0.0786335 -0.219924 -0.0847249 -0.0164793 -4.25479f-5 0.000205706 0.0 0.0 0.0384359 -0.00628892text/plainvalue_parametersprefixFloat32elements0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain 0.0text/plainmore-0.0174626text/plaintypeArrayprefix_shortobjectid1587a1bc73b13b35!application/vnd.pluto.tree+objectpolicy_functionπtext/plainpolicy_sample_actionπ_sampletext/plainestimate_state_valueestimate_state_valuetext/plainpolicy_and_valuepolicy_and_valuetext/plaintypeNamedTupleobjectidc5001092224d61demime!application/vnd.pluto.tree+objectrootassignee'const mountaincar_continuous_test_trainlast_run_timestampA 
=8persist_js_state·has_pluto_hook_features§cell_id$b8532822-179b-4cd5-a279-4b71dafb544adepends_on_disabled_cells§runtimeDpublished_object_keysdepends_on_skipped_cells§errored$07ba9fe4-aaa7-4123-9865-cbfa79d0d44aqueued¤logsrunning¦outputbody mimetext/htmlrootassigneelast_run_timestampA 7Opersist_js_state·has_pluto_hook_features§cell_id$07ba9fe4-aaa7-4123-9865-cbfa79d0d44adepends_on_disabled_cells§runtime;published_object_keysdepends_on_skipped_cellsçerrored$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7dqueued¤logsrunning¦outputbodymimetext/htmlrootassigneelast_run_timestampA !7?persist_js_state·has_pluto_hook_features§cell_id$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7ddepends_on_disabled_cells§runtime x8published_object_keysdepends_on_skipped_cellsçerrored$5c4a383f-fcf2-4f2b-819f-6d84471dda00queued¤logsrunning¦outputbody=update_fcann_value_gradient! (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA ,persist_js_state·has_pluto_hook_features§cell_id$5c4a383f-fcf2-4f2b-819f-6d84471dda00depends_on_disabled_cells§runtime!Bpublished_object_keysdepends_on_skipped_cells§errored$135f205a-f87e-4691-8e87-d317d6312c84queued¤logsrunning¦outputbody [

The plots below visualize these distributions for the corridor problem starting with the normalized distributions per step which include the terminal states. If we continued to create these plots for larger values of $k$, then the distribution would collapse to a value of 1 for being in a terminal state. In order to calculate other distributions such as the stationary state distribution, it is necessary to renormalize these probabilities by excluding the terminal states:

On-policy Distributions

$$\begin{flalign} &\mu_{k, \pi}(s) = \Pr\{S_k = s \mid \pi \} \; \forall s \in \mathcal{S}^+ \tag{state visits per step}\\ &\Pr \{ T \leq k \vert \pi \} = 1 - \sum_{s \in \mathcal{S}} \Pr\{S_k = s \mid \pi \} \; \forall k \tag{Chance of terminating already (distribution function not density)}\\ &\mu_\pi(s) = \frac{\sum_k \Pr \{ S_k = s \mid \pi \}}{\sum_{k} \sum_{s \in \mathcal{S}} \Pr \{ S_k = s \mid \pi \}} \; \forall s \in \mathcal{S} \tag{non-terminal state visits}\\ &\mu_\pi(s, k) = \frac{\Pr \{ S_k = s \mid \pi \}}{\sum_{k} \sum_{s \in \mathcal{S}} \Pr \{ S_k = s \mid \pi \}} \; \forall s \in \mathcal{S} \tag{non-terminal state and step visits}\\ \end{flalign}$$

Note that the final two distributions are only defined for non-terminal states. If we tried to include terminal states we would be unable to normalize the distribution since $\lim_{k \rightarrow \infty} \Pr \{ S_k = S_T \mid \pi \} = 1$ and we would have a diverging sum in the denominator. The only reason these calculations are possible is that the probabilities reach zero quickly enough at higher $k$ for the non-terminal states.
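The four expressions can be approximated by propagating the start-state distribution through the non-terminal transitions. The sketch below assumes the corridor dynamics described for the value calculation ($p$ is the probability of left, left in state 1 stays, state 2 is switched, right in state 3 terminates), starts in state 1, and truncates the infinite sums at a finite horizon `K`; all names are hypothetical.

```julia
# Sub-stochastic transitions among non-terminal states; missing row mass is termination.
function corridor_transitions(p)
    return [p       (1 - p) 0.0;
            (1 - p) 0.0     p;
            0.0     p       0.0]
end

# Row k + 1 holds Pr{S_k = s | π} over the non-terminal states, for k = 0, ..., K.
function step_distributions(P, start, K)
    D = zeros(K + 1, size(P, 1))
    D[1, start] = 1.0
    for k in 1:K
        D[k + 1, :] = P' * D[k, :]
    end
    return D
end

D = step_distributions(corridor_transitions(0.5), 1, 200)
terminated = 1 .- vec(sum(D; dims = 2))   # Pr{T ≤ k}, a cumulative distribution
μ = vec(sum(D; dims = 1)) ./ sum(D)       # non-terminal state visits
μ_sk = D ./ sum(D)                        # non-terminal state and step visits
```

Under the random policy, `terminated` is zero for $k = 0, 1, 2$ and first becomes positive at $k = 3$, after which it grows toward 1.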

The plots below visualize the four expressions above. The second expression notably is not a probability density but a cumulative distribution function since it includes a sum of all probabilities that meet the condition.

mimetext/htmlrootassigneelast_run_timestampA Jpersist_js_state·has_pluto_hook_features§cell_id$135f205a-f87e-4691-8e87-d317d6312c84depends_on_disabled_cells§runtime@published_object_keysdepends_on_skipped_cells§errored$4a39f9a7-72d4-44ad-895a-742cd1291f92queued¤logsrunning¦outputbody 0.5 mimetext/htmlrootassigneelast_run_timestampA |persist_js_state·has_pluto_hook_features§cell_id$4a39f9a7-72d4-44ad-895a-742cd1291f92depends_on_disabled_cells§runtime tpublished_object_keysdepends_on_skipped_cellsçerrored$ee72af8d-3cb8-4314-82df-580f068e1252queued¤logsrunning¦outputbody _

One common form of linear feature vector is one that selects active features per state. Tile coding is an example of this, where a state is assigned a tile in each tiling used and the number of tilings controls how many active features a given state will have. Because the only possible feature vector values are 1 or 0, this style of encoding need not be as complex as other methods. The form of the gradients suggests an abbreviated algorithm that need not compute the eligibility vector explicitly.

We can define a binary feature encoding by the function $\mathcal{F}(s)$ which returns the indices of active features for a state $s$ as well as the knowledge of how many total features there are, $d$. All of the values of $\mathbf{x}(s)$ are zero except for the indices in $\mathcal{F}(s)$ whose values are 1. That simplifies the expression we have before for the linear feature eligibility vector:

$$\begin{flalign} \nabla \left ( \ln \pi_a \right )_{i, j} &= \frac{\nabla \left ( \pi_a \right )_{i, j}}{\pi_a} \\ &= \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases} \\ &= \begin{cases} (1 - \pi_j), & \text{ if } j = a \text{, } i \in \mathcal{F}(s) \\ -\pi_j, & \text{ if } j \neq a \text{, } i \in \mathcal{F}(s) \\ 0, & \text{ otherwise} \end{cases} \end{flalign}$$

We can see from this form of the eligibility vector that it need not be computed explicitly and we do not need to instantiate a feature vector either. Rather, we can simply go through the active feature indices and subtract the policy output for the column index at each row and then add 1 to the column corresponding to the selected action:

Loop for each step of the episode $t = 0, 1, \cdots, T-1$

$$G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1}R_k$$

$$c \leftarrow \alpha \times \gamma^t \times G$$

Loop for each action index $j$

Loop for each active feature $i \in \mathcal{F}(S_t)$

$$\theta_{i, j} \leftarrow \theta_{i, j} - c \times \pi(a_j \vert S_t, \boldsymbol{\theta})$$

Define $j_a$ as the column index corresponding to action $A_t$

Loop for each active feature $i \in \mathcal{F}(S_t)$

$$\theta_{i, j_a} \leftarrow \theta_{i, j_a} + c$$
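The inner loops of the procedure above might look like the following in Julia. This is a minimal sketch rather than the notebook's own implementation: the function name `binary_reinforce_update!` and its argument layout are assumptions, and the policy probabilities are expected to be computed once from the parameters before any entry is changed.

```julia
# In-place REINFORCE step for binary features.
# θ:      d × N_a parameter matrix
# active: indices of the active features for S_t
# probs:  policy probabilities at S_t, evaluated BEFORE this update
# j_a:    column index of the selected action A_t
# c:      α * γ^t * G
function binary_reinforce_update!(θ, active, probs, j_a, c)
    for j in axes(θ, 2), i in active
        θ[i, j] -= c * probs[j]      # subtract c π_j on every active row
    end
    for i in active
        θ[i, j_a] += c               # add c on the selected action's column
    end
    return θ
end

# Hypothetical example: 4 features, 2 actions, features 1 and 3 active.
θ = zeros(4, 2)
binary_reinforce_update!(θ, [1, 3], [0.25, 0.75], 2, 0.1)

# Dense reference using an explicit feature vector and eligibility matrix.
x = [1.0, 0.0, 1.0, 0.0]
dense = 0.1 .* (x * ([0.0, 1.0] .- [0.25, 0.75])')
```

Only the rows of the active features change, so the cost per step scales with the number of active features rather than with $d$.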

Specialized versions of REINFORCE that use binary features and linear features can be found below as well as the general case that works for any type of parameterized function approximation.

mimetext/htmlrootassigneelast_run_timestampA *persist_js_state·has_pluto_hook_features§cell_id$ee72af8d-3cb8-4314-82df-580f068e1252depends_on_disabled_cells§runtime Ňpublished_object_keysdepends_on_skipped_cells§errored$e524f8cc-ab69-4f8b-a59f-28156696a104queued¤logsrunning¦outputbodymimetext/htmlrootassigneelast_run_timestampA >Jpersist_js_state·has_pluto_hook_features§cell_id$e524f8cc-ab69-4f8b-a59f-28156696a104depends_on_disabled_cells§runtime published_object_keysdepends_on_skipped_cellsçerrored$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09queued¤logsrunning¦outputbodymimetext/plainrootassigneelast_run_timestamppersist_js_state·has_pluto_hook_features§cell_id$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09depends_on_disabled_cellsçruntimepublished_object_keysdepends_on_skipped_cells§errored$f3bc47b5-03fc-4bd9-a890-26f9608a730bqueued¤logsrunning¦outputbodyS

Continuing Corridor Gridworld Example

Note that if we try to apply this algorithm to the short corridor gridworld it fails because a terminal state is encountered. This condition is checked inside the algorithm because nothing in the definition of an MDP tells you in advance whether it is a continuing task or not. In the tabular case you can always check to see if a terminal state exists since every state is available, but for the non-tabular case, all we can do is note the problem if a terminal state is encountered.

mimetext/htmlrootassigneelast_run_timestampA 􌷷persist_js_state·has_pluto_hook_features§cell_id$f3bc47b5-03fc-4bd9-a890-26f9608a730bdepends_on_disabled_cells§runtimeRpublished_object_keysdepends_on_skipped_cells§errored$4915b1ed-ad53-4ece-9b00-bc136d47d8dcqueued¤logsrunning¦outputbody

It is implicit in all expressions below that $\pi$ is a function of $\boldsymbol{\theta}$ and that the gradients are with respect to $\boldsymbol{\theta}$. The performance measure for the continuing case is $J(\boldsymbol{\theta}) = r(\boldsymbol{\theta})$ (13.15) and all value functions use the definition of the differential return. We begin by expressing the gradient of the state value function in terms of the state-action value function, the policy, the average return and gradients thereof:

$$\begin{flalign} \nabla v_\pi(s) &= \nabla \left [ \sum_a \pi(a \vert s) q_\pi (s, a) \right ], \: \forall s \in \mathcal{S} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla q_\pi(s, a) \right ] \tag{product rule} \\ &=\sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla \sum_{s^\prime, r} p(s^\prime, r \vert s, a)\left (r - r(\boldsymbol{\theta}) + v_\pi(s^\prime) \right ) \right ] \tag{differential return definitions} \\ &=\sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \left [ -\nabla r(\boldsymbol{\theta}) + \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] \right ] \tag{distributing gradient}\\ \end{flalign}$$

The purpose of this expression is to isolate the term which is the gradient of the average return since this is the performance metric gradient we originally sought. Note that if we separate the terms inside the sum, the one with the gradient of $r$ is $\sum_a \pi(a\vert s) [- \nabla r(\boldsymbol{\theta})] = -\nabla r(\boldsymbol{\theta}) \sum_a \pi(a \vert s)$. But the policy function is a probability distribution so its sum over actions is just 1. Therefore, this term simplifies to just $-\nabla r(\boldsymbol{\theta})$ which we can simply move to the other side of the expression swapping its place with the state value function:

$$\begin{flalign} \nabla v_\pi(s)&=-\nabla r(\boldsymbol{\theta}) + \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] \\ \nabla r(\boldsymbol{\theta}) &=-\nabla v_\pi(s) + \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] \end{flalign}$$

Now the left hand side is $\nabla J(\boldsymbol{\theta})$ and does not depend on $s$. As such, the right hand side as a whole must be independent of $s$ as well, so we are free to take a weighted sum of it over any probability distribution on $s$ since the weights sum to 1. That is, if $f$ is independent of $s$, then $\sum_s \mu(s) f = f \sum_s \mu(s) = f \times 1 = f$:

$$\begin{flalign} \nabla J(\boldsymbol{\theta}) &= \sum_s \mu(s) \left ( \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] - \nabla v_\pi(s) \right ) \\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) + \sum_s \mu(s) \sum_a \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) - \sum_s \mu(s) \nabla v_\pi(s) \tag{separating sum terms}\\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) + \sum_{s^\prime} \sum_s \mu(s) \sum_a \pi(a \vert s) p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) - \sum_s \mu(s) \nabla v_\pi(s) \tag{swapping sum order in second term}\\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) + \sum_{s^\prime} \mu(s^\prime) \nabla v_\pi(s^\prime) - \sum_s \mu(s) \nabla v_\pi(s) \tag{stationary state distribution definition}\\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) \tag{cancelling equivalent sum terms}\\ &= \mathbb{E}_\pi \left [ \sum_a \nabla \pi(a \vert S_t) q_\pi(S_t, a) \right ] \tag{expected value definition}\\ &= \mathbb{E}_\pi \left [ \sum_a \pi(a \vert S_t) \frac{\nabla \pi(a \vert S_t)}{\pi(a \vert S_t)} q_\pi(S_t, a) \right ] \tag{multiplying and dividing by the policy}\\ &= \mathbb{E}_\pi \left [\frac{\nabla \pi(A_t \vert S_t)}{\pi(A_t \vert S_t)} q_\pi(S_t, A_t) \right ] \tag{expected value definition}\\ &= \mathbb{E}_\pi \left [\frac{\nabla \pi(A_t \vert S_t)}{\pi(A_t \vert S_t)} G_t \right ] \tag{differential return definition}\\ &= \mathbb{E}_\pi \left [G_t \nabla \ln \pi(A_t \vert S_t) \right ] \tag{chain rule}\\ \end{flalign}$$

The expression inside the expected value can be sampled on every time step and the gradient is only in terms of the policy function which we have selected as something differentiable with respect to the parameters. Since this method will only be used for continuing problems, we cannot rely on Monte Carlo sampling for the differential return. Instead, our only option is to use a bootstrap value estimate in combination with a running estimate of the average reward and the immediate sample reward: $R - \overline{R} + \hat v^\prime$ where $\hat v^\prime$ is the differential value function estimate at the transition state and $\overline{R}$ is an estimate of the average reward. We can apply the existing actor-critic algorithms to these continuing problems as long as we track that additional information and use an additional step size parameter to update the average reward estimate. This step size parameter replaces the discount rate. See a full implementation below:
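The bookkeeping for the differential return estimate can be sketched as a single function. The average-reward update shown here, which moves $\overline{R}$ by the step size times the TD error, follows the usual textbook form and is an assumption about how the estimate is maintained; the function and argument names are hypothetical.

```julia
# One step of average-reward bookkeeping for the continuing setting.
# r:          immediate sample reward
# avg_reward: current estimate of the average reward
# v, v_next:  differential value estimates at the current and transition states
# α_avg:      step size for the average-reward estimate (takes the place of a discount rate)
function differential_td_step(r, avg_reward, v, v_next, α_avg)
    δ = r - avg_reward + v_next - v           # differential TD error
    return δ, avg_reward + α_avg * δ          # error and updated average-reward estimate
end

δ, new_avg = differential_td_step(1.0, 0.5, 2.0, 3.0, 0.1)
```

With these numbers the error is $1 - 0.5 + 3 - 2 = 1.5$ and the average-reward estimate moves from $0.5$ to $0.65$. The same $\delta$ would then scale the actor and critic updates.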

mimetext/htmlrootassigneelast_run_timestampA wWpersist_js_state·has_pluto_hook_features§cell_id$4915b1ed-ad53-4ece-9b00-bc136d47d8dcdepends_on_disabled_cells§runtime 3published_object_keysdepends_on_skipped_cells§errored$f924eb30-d1cc-4941-8fb5-ff70ad425ab9queued¤logsrunning¦outputbody E

13.3 REINFORCE: Monte Carlo Policy Gradient

If we replace the true action-value function in (13.5) with a learned approximation $\hat q_\pi$, then we have a method called the all-actions method because the update involves the sum over all actions. For the REINFORCE algorithm, we instead sample this value using the actual return and the policy distribution.

We can re-write (13.5) using an expected value under the policy and continue from there:

$$\begin{flalign} \nabla J(\boldsymbol{\theta}) & \propto \mathbb{E}_\pi \left [ \gamma^t \sum_a q_\pi (S_t, a) \nabla \pi(a|S_t, \boldsymbol{\theta}) \right ] \tag{13.6}\\ &= \mathbb{E}_\pi \left [\gamma^t \sum_a \pi(a|S_t, \boldsymbol{\theta}) q_\pi (S_t, a) \frac{\nabla \pi(a|S_t, \boldsymbol{\theta})}{\pi(a|S_t, \boldsymbol{\theta})} \right ] \tag{multiply and divide by policy} \\ &= \mathbb{E}_\pi \left [ \gamma^t q_\pi (S_t, A_t) \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})} \right ] \tag{replace a with sample under policy} \\ &= \mathbb{E}_\pi \left [ \gamma^t G_t \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})} \right ] \tag{replace value with sample return} \\ \end{flalign}$$

Using the expression in the brackets we can write down an update rule for the parameters that can be sampled on each time step. This is the REINFORCE update:

$$\begin{align} \boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t + \alpha \gamma^t G_t \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta}_t)}{\pi(A_t|S_t, \boldsymbol{\theta}_t)} \tag{13.8} \end{align}$$

Because it uses all future rewards after step $t$, REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case. For implementation purposes we can replace $\frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta}_t)}{\pi(A_t|S_t, \boldsymbol{\theta}_t)}$ with $\nabla \ln \pi(A_t|S_t, \boldsymbol{\theta}_t)$, which is usually referred to as the eligibility vector.

With the alternative parameterization, the eligibility vector is $\nabla \ln \pi(S_t, \theta_t)_{A_t}$ where $\pi$ is a vector and the $A_t$ subscript takes the value of that vector at the index corresponding to the action $A_t$.
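Since the update (13.8) needs $G_t$ for every step of a finished episode, the returns are typically accumulated backwards in a single pass. The sketch below assumes `rewards[t]` stores the reward received after the `t`-th action of the episode (Julia indices start at 1); the function name is hypothetical.

```julia
# Discounted return following each step of a completed episode.
function discounted_returns(rewards, γ)
    G = zeros(length(rewards))
    g = 0.0
    for t in length(rewards):-1:1
        g = rewards[t] + γ * g     # G_t = R_{t+1} + γ G_{t+1}
        G[t] = g
    end
    return G
end

G = discounted_returns([1.0, 1.0, 1.0], 0.5)
```

For three unit rewards and $\gamma = 0.5$ the returns are $1.75$, $1.5$, and $1.0$. Each parameter update then multiplies the matching entry by $\alpha \gamma^t$ and the eligibility vector.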

mimetext/htmlrootassigneelast_run_timestampA +persist_js_state·has_pluto_hook_features§cell_id$f924eb30-d1cc-4941-8fb5-ff70ad425ab9depends_on_disabled_cells§runtime published_object_keysdepends_on_skipped_cells§errored$d83dc659-dce7-41dd-a8e7-2933ab39d15cqueued¤logsrunning¦outputbody

REINFORCE with Baseline Implementation

These functions use two sets of parameters, one to calculate the policy function and another to calculate the state value function. The state representation vector is shared between the two functions, but the policy function will return a distribution of preferences over actions while the value function will return a single value. If linear approximation is used to estimate both functions, then the policy parameters $\boldsymbol{\theta}$ will be a $d \times N_a$ matrix, where $d$ is the length of the state feature vector representation and $N_a$ is the number of actions, and the value function parameters $\mathbf{w}$ will be a length $d$ vector. It is also possible to mix linear and non-linear approximation with this method.

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$d83dc659-dce7-41dd-a8e7-2933ab39d15cdepends_on_disabled_cells§runtimeYrpublished_object_keysdepends_on_skipped_cells§errored$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fcequeued¤logsrunning¦outputbodyXh mimetext/htmlrootassigneelast_run_timestampA 3!M'persist_js_state·has_pluto_hook_features§cell_id$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fcedepends_on_disabled_cells§runtimeޘpublished_object_keysdepends_on_skipped_cellsçerrored$83ca0577-15d7-4448-b597-c77810b812bfqueued¤logsrunning¦outputbody1figure_13_2_test (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA %Wj8persist_js_state·has_pluto_hook_features§cell_id$83ca0577-15d7-4448-b597-c77810b812bfdepends_on_disabled_cells§runtime _published_object_keysdepends_on_skipped_cellsçerrored$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbqueued¤logsrunning¦outputbody\reinforce_with_baseline_monte_carlo_control_binary_features (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA %Hpersist_js_state·has_pluto_hook_features§cell_id$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbdepends_on_disabled_cells§runtime@_published_object_keysdepends_on_skipped_cells§errored$a7dcc8cd-04ec-48f2-a387-116330eaffb2queued¤logsrunning¦outputbody5 mimetext/htmlrootassigneelast_run_timestampA &ðpersist_js_state·has_pluto_hook_features§cell_id$a7dcc8cd-04ec-48f2-a387-116330eaffb2depends_on_disabled_cells§runtime Gpublished_object_keysdepends_on_skipped_cellsçerrored$0ab70fc3-6188-42eb-aba2-d808f319be9fqueued¤logsrunning¦outputbody2

Dependencies

mimetext/htmlrootassigneelast_run_timestampA Apersist_js_state·has_pluto_hook_features§cell_id$0ab70fc3-6188-42eb-aba2-d808f319be9fdepends_on_disabled_cells§runtime published_object_keysdepends_on_skipped_cells§errored$047656d1-2921-40f2-b75b-ce4a87098007queued¤logsrunning¦outputbodyI

Switched Corridor Parameter Studies

mimetext/htmlrootassigneelast_run_timestampA t6persist_js_state·has_pluto_hook_features§cell_id$047656d1-2921-40f2-b75b-ce4a87098007depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$5d434c83-c9ca-499f-8695-c7733031c2dequeued¤logsrunning¦outputbody9cartpole_continuing_step (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA Ѱpersist_js_state·has_pluto_hook_features§cell_id$5d434c83-c9ca-499f-8695-c7733031c2dedepends_on_disabled_cells§runtimejpublished_object_keysdepends_on_skipped_cellsçerrored$3a37b53d-9174-4faa-9404-74a40c385b0aqueued¤logsrunning¦outputbody*Total Reward: -1000.0
mimetext/htmlrootassigneelast_run_timestampA A+persist_js_state·has_pluto_hook_features§cell_id$3a37b53d-9174-4faa-9404-74a40c385b0adepends_on_disabled_cells§runtimeLpublished_object_keysdepends_on_skipped_cellsçerrored$820752af-8966-4ee8-82f7-a40934522de5queued¤logsrunning¦outputbodymimetext/plainrootassigneelast_run_timestamppersist_js_state·has_pluto_hook_features§cell_id$820752af-8966-4ee8-82f7-a40934522de5depends_on_disabled_cellsçruntimepublished_object_keysdepends_on_skipped_cells§errored$6acb549a-5d90-4457-a347-d22448ad8071queued¤logsrunning¦outputbody1mimetext/htmlrootassigneelast_run_timestampA 3(npersist_js_state·has_pluto_hook_features§cell_id$6acb549a-5d90-4457-a347-d22448ad8071depends_on_disabled_cells§runtime~Wpublished_object_keysdepends_on_skipped_cellsçerrored$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62queued¤logsrunning¦outputbodyJcartpole_fcann_continuing_parameter_study (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA 2persist_js_state·has_pluto_hook_features§cell_id$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62depends_on_disabled_cells§runtimeֵpublished_object_keysdepends_on_skipped_cellsçerrored$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728queued¤logsrunning¦outputbodyelementsaction_probabilitiesprefixFloat32elements0.000339307text/plain0.999661text/plaintypeArrayprefix_shortobjectid6ea9ffc26d27fa73!application/vnd.pluto.tree+objectstate_value_estimate-91.9871text/plaintypeNamedTupleobjectid6155308e34755079mime!application/vnd.pluto.tree+objectrootassigneelast_run_timestampA +$-persist_js_state·has_pluto_hook_features§cell_id$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728depends_on_disabled_cells§runtime ?published_object_keysdepends_on_skipped_cellsçerrored$ae0f5a96-7a4b-47f9-be1e-e803a238a071queued¤logsrunning¦outputbody_

MDP Types and Transitions for Continuous Actions

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$ae0f5a96-7a4b-47f9-be1e-e803a238a071depends_on_disabled_cells§runtimevpublished_object_keysdepends_on_skipped_cells§errored$41d62de1-2c92-41ee-9430-b9ca3007afd9queued¤logsrunning¦outputbody

The above matrix represents an estimate of $\Pr \{ S_k = s \mid \pi \}$; however, note that the terminal states are excluded from the rows. This corridor problem only has three non-terminal states. If we sum across each row, then we have the probability of reaching that step prior to terminating. The vector defined below measures the probability of an episode terminating prior to each step. Notably, this probability is 0 for the first three steps since no policy starting from the left can terminate that quickly. As expected, the probability of terminating under the random policy grows with time, approaching 1.

mimetext/htmlrootassigneelast_run_timestampA Xpersist_js_state·has_pluto_hook_features§cell_id$41d62de1-2c92-41ee-9430-b9ca3007afd9depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$8eb42403-1234-4e59-993e-057cc3a6d5c9queued¤logsrunning¦outputbodyB

Waiting to run parameter study

mimetext/htmlrootassigneelast_run_timestampA @/̰persist_js_state·has_pluto_hook_features§cell_id$8eb42403-1234-4e59-993e-057cc3a6d5c9depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cellsçerrored$bbc8864a-1545-433f-bc7c-0ddf6e907138queued¤logsrunning¦outputbody?plot_mountaincar_policy_values (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA @-0persist_js_state·has_pluto_hook_features§cell_id$bbc8864a-1545-433f-bc7c-0ddf6e907138depends_on_disabled_cells§runtimeipublished_object_keysdepends_on_skipped_cellsçerrored$a12b92d1-e045-4f92-b8cd-eee5d56fa67dqueued¤logsrunning¦outputbodyelementsepisode_rewardsprefixFloat32elements-6.0text/plain-5.0text/plain-9.0text/plain-7.0text/plain-4.0text/plain-22.0text/plain-34.0text/plain-6.0text/plain -27.0text/plainmored-12.0text/plaintypeArrayprefix_shortobjectid561caa86399f4303!application/vnd.pluto.tree+objectepisode_stepsprefixInt64elements6text/plain5text/plain9text/plain7text/plain4text/plain22text/plain34text/plain6text/plain 27text/plainmored12text/plaintypeArrayprefix_shortobjectid74a251baf6aa9110!application/vnd.pluto.tree+objectpolicy_functionπ2text/plainpolicy_sample_actionπ_sample2text/plainpolicy_parameters*1×2 Matrix{Float32}: -0.199834 0.199834text/plainestimate_state_valueestimate_state_valuetext/plainvalue_parametersprefixFloat32elements-9.63535text/plaintypeArrayprefix_shortobjectid35fd187379601766!application/vnd.pluto.tree+objectpolicy_and_valuepolicy_and_valuetext/plaintypeNamedTupleobjectid9c86eb7ef1a9a622mime!application/vnd.pluto.tree+objectrootassigneeconst best_mc_corridorlast_run_timestampA & persist_js_state·has_pluto_hook_features§cell_id$a12b92d1-e045-4f92-b8cd-eee5d56fa67ddepends_on_disabled_cells§runtimeGEpublished_object_keysdepends_on_skipped_cellsçerrored$ce33f710-fd9d-4dfa-acda-40204e54d518queued¤logsrunning¦outputbodyx

13.5 Actor-Critic Methods

Here we also use the value function estimator to calculate the return estimate using the one-step bootstrap return. When the state value function is used in this way we call it the critic. In general we can use this function with n-step returns and eligibility traces. Recall from the subject of TD learning of value functions that the one-step return is often superior to the actual return regarding variance and ease of computation, although it does introduce bias to the estimate. With the use of eligibility traces we can smoothly vary arbitrarily close to the Monte Carlo return. Note that the bias in the gradient estimate is not due to the bootstrapping as such; the actor would be biased even if the critic was learned by a Monte Carlo method.

The one-step actor-critic method is the analog of one-step methods such as TD$(0)$, Sarsa$(0)$, and Q-learning. It replaces the full return of REINFORCE with the one-step return as follows:

$$\begin{flalign} \boldsymbol{\theta}_{t+1} &\doteq \boldsymbol{\theta}_t + \alpha(G_{t:t+1} - \hat v(S_t, \mathbf{w}))\nabla\ln\pi(A_t|S_t, \boldsymbol{\theta}_t) \tag{13.12} \\ & = \boldsymbol{\theta}_t + \alpha(R_{t+1} + \gamma \hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w}))\nabla\ln\pi(A_t|S_t, \boldsymbol{\theta}_t) \tag{13.13} \\ & = \boldsymbol{\theta}_t + \alpha\delta_t\nabla\ln\pi(A_t|S_t, \boldsymbol{\theta}_t) \tag{13.14} \\ \end{flalign}$$

This can be implemented as a fully online algorithm because we do not have to wait until the end of an episode to calculate return estimates. The natural state-value-function learning method to pair with this is semi-gradient TD(0). See a full implementation below.
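As a rough illustration of a single online update, here is a minimal sketch that assumes a linear critic $\hat v(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ and a soft-max actor with linear action preferences. The names `softmax_policy` and `actor_critic_step!` are hypothetical and are not the functions defined elsewhere in this notebook. Like equations 13.12 to 13.14, the sketch omits the $\gamma^t$ factor that the episodic algorithm uses when $\gamma < 1$.

```julia
using LinearAlgebra: dot

# Soft-max policy over linear action preferences h(s, a) = θ[:, a]' * x (assumed form)
function softmax_policy(θ::AbstractMatrix, x::AbstractVector)
    h = θ' * x                      # one preference per action
    e = exp.(h .- maximum(h))       # subtract the max for numerical stability
    return e ./ sum(e)
end

# One online update for a single transition (x, a, r, x_next).
# Modifies θ and w in place and returns the TD error δ.
function actor_critic_step!(θ, w, x, a, r, x_next, terminal; γ = 1.0, α_θ = 0.1, α_w = 0.1)
    v_next = terminal ? 0.0 : dot(w, x_next)   # terminal states have value 0
    δ = r + γ * v_next - dot(w, x)             # one-step TD error
    probs = softmax_policy(θ, x)
    for b in axes(θ, 2)
        # ∇ ln π(a|s) with respect to θ[:, b] is (1[a == b] - π(b|s)) x
        θ[:, b] .+= α_θ * δ * ((b == a) - probs[b]) .* x
    end
    w .+= α_w * δ .* x                         # semi-gradient TD(0) critic update
    return δ
end
```

Because both updates depend only on the current transition, a function like this can be called after every environment step without storing the episode.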

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$ce33f710-fd9d-4dfa-acda-40204e54d518depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$339b4d2b-2237-46a3-9867-ecc3332856c1queued¤logsrunning¦outputbody

This expression repeats terms of the form $\nabla \pi(a \vert s) q_\pi(s, a)$ weighted by different probabilities. The first appearance of this term is just a sum over all actions at the state $s$, which is the state we are using for the gradient expression. The next appearance is a sum over actions at state $s^\prime$. Let's define a new expression:

$$\begin{flalign} f(s) &\doteq \sum_a \nabla \pi(a \vert s) q_\pi(s, a) \\ \end{flalign}$$

Then we can rewrite the second term as follows:

$$\gamma \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) f(s^\prime) \right ] = \gamma \sum_{s^\prime} f(s^\prime) \sum_a \left [ \pi(a \vert s) p(s^\prime \vert s, a) \right ] = \gamma \mathbb{E}_\pi [f(s^\prime) \vert s] = \gamma \sum_{s^\prime} f(s^\prime) \Pr \{ S_1 = s^\prime \mid S_0 = s, A_0 \sim \pi(\cdot \vert s) \}$$

Define a new term $g(s) = \sum_{s^\prime} f(s^\prime) \Pr \{ S_1 = s^\prime \mid S_0 = s, A_0 \sim \pi(\cdot \vert s) \} = \sum_{s^\prime} f(s^\prime) \sum_a [\pi(a \vert s) p(s^\prime \vert s, a)]$

So the second term can be written as $\gamma g(s)$, where $g(s)$ weights each $f(s^\prime)$ by the probability that the agent transitions from state $s$ to $s^\prime$ in one step under the policy $\pi$. Using this same logic, we can rewrite the third expression as well.

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$339b4d2b-2237-46a3-9867-ecc3332856c1depends_on_disabled_cells§runtime!published_object_keysdepends_on_skipped_cells§errored$a8349352-3242-46d5-b0d5-1b6eb5d77e90queued¤logsrunning¦outputbody0mimetext/htmlrootassigneelast_run_timestampA :ؾpersist_js_state·has_pluto_hook_features§cell_id$a8349352-3242-46d5-b0d5-1b6eb5d77e90depends_on_disabled_cells§runtimeUpublished_object_keysdepends_on_skipped_cellsçerrored$7d63b960-3998-4f7b-8cbb-ccd49db9aeacqueued¤logsrunning¦outputbodyelementsaction_probabilitiesprefixFloat32elements0.000138965text/plain0.999861text/plaintypeArrayprefix_shortobjectidd09c2324e4b17111!application/vnd.pluto.tree+objectstate_value_estimate-96.5535text/plaintypeNamedTupleobjectid26386ed69de54735mime!application/vnd.pluto.tree+objectrootassigneelast_run_timestampA 'b~persist_js_state·has_pluto_hook_features§cell_id$7d63b960-3998-4f7b-8cbb-ccd49db9aeacdepends_on_disabled_cells§runtime7published_object_keysdepends_on_skipped_cellsçerrored$65d2add6-fd6f-456c-92ed-3cd9d1862ef6queued¤logsrunning¦outputbody=update_binary_policy_params! (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA @persist_js_state·has_pluto_hook_features§cell_id$65d2add6-fd6f-456c-92ed-3cd9d1862ef6depends_on_disabled_cells§runtime!*published_object_keysdepends_on_skipped_cells§errored$f55afa58-962d-4551-8d95-a5b467d61adfqueued¤logsrunning¦outputbody?update_params_with_gradient! 
(generic function with 10 methods)mimetext/plainrootassigneelast_run_timestampA #Upersist_js_state·has_pluto_hook_features§cell_id$f55afa58-962d-4551-8d95-a5b467d61adfdepends_on_disabled_cells§runtimeLjpublished_object_keysdepends_on_skipped_cells§errored$d9d11d69-bc16-400a-8f46-f9a8ecb8516aqueued¤logsrunning¦outputbodyNactor_critic_binary_episodic_parameter_study (generic function with 2 methods)mimetext/plainrootassigneelast_run_timestampA =l persist_js_state·has_pluto_hook_features§cell_id$d9d11d69-bc16-400a-8f46-f9a8ecb8516adepends_on_disabled_cells§runtime2xpublished_object_keysdepends_on_skipped_cells§errored$ed93259c-7b8b-46d7-97fb-f194e0e04b3aqueued¤logsrunning¦outputbodyCsetup_binary_beta_policy_arguments (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA '-Ŭpersist_js_state·has_pluto_hook_features§cell_id$ed93259c-7b8b-46d7-97fb-f194e0e04b3adepends_on_disabled_cells§runtimeɵpublished_object_keysdepends_on_skipped_cells§errored$d1ed25e6-60c6-411f-a541-99986e5da2c5queued¤logsrunning¦outputbody\reinforce_with_baseline_monte_carlo_control_linear_features (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA &8persist_js_state·has_pluto_hook_features§cell_id$d1ed25e6-60c6-411f-a541-99986e5da2c5depends_on_disabled_cells§runtime9published_object_keysdepends_on_skipped_cells§errored$b966b248-fb4d-457d-90f6-114370846242queued¤logsrunning¦outputbody7bad_continuous_action (generic function with 3 methods)mimetext/plainrootassigneelast_run_timestampA 
!ǰpersist_js_state·has_pluto_hook_features§cell_id$b966b248-fb4d-457d-90f6-114370846242depends_on_disabled_cells§runtimebpublished_object_keysdepends_on_skipped_cells§errored$4156d955-9daf-4429-b152-e8332980fb9equeued¤logsrunning¦outputbodyelementsstep_rewardsprefixFloat32elementstypeArrayprefix_shortobjectid3c0520dc1d635d04!application/vnd.pluto.tree+objectepisode_stepsprefixInt64elements30971text/plain33158text/plain36744text/plain39697text/plain42025text/plain44282text/plain45403text/plain47954text/plain 49838text/plainmoreE99724text/plaintypeArrayprefix_shortobjectid402d7f32a5b41291!application/vnd.pluto.tree+objectepisode_rewardsprefixFloat32elements-30970.0text/plain-2187.0text/plain-3586.0text/plain-2953.0text/plain-2328.0text/plain-2257.0text/plain-1121.0text/plain-2551.0text/plain -1884.0text/plainmoreE-532.0text/plaintypeArrayprefix_shortobjectid68896aa3766bea26!application/vnd.pluto.tree+objectpolicy_parameters$1452×2 Matrix{Float32}: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.0437175 -0.0569757 0.0326978 -0.0276905 0.00138512 0.0006306 0.0 0.0 0.0 0.0 0.0 0.0text/plainvalue_parametersprefixFloat32elements0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain 0.0text/plainmore0.0text/plaintypeArrayprefix_shortobjectidf01afe6ea747a0b0!application/vnd.pluto.tree+objectpolicy_functionπtext/plainpolicy_sample_actionπ_sampletext/plainestimate_state_valueestimate_state_valuetext/plainpolicy_and_valuepolicy_and_valuetext/plaintypeNamedTupleobjectid7aefd39e1e4696e7mime!application/vnd.pluto.tree+objectrootassignee,const mountaincar_continuous_test_train_betalast_run_timestampA >Xpersist_js_state·has_pluto_hook_features§cell_id$4156d955-9daf-4429-b152-e8332980fb9edepends_on_disabled_cells§runtime*<ŵpublished_object_keysdepends_on_skipped_cells§errored$b09e1e48-494e-4967-826a-6e70199acad4queued¤logsrunning¦outputbodyC

Squashed Gaussian Alternative

mimetext/htmlrootassigneelast_run_timestampA Epersist_js_state·has_pluto_hook_features§cell_id$b09e1e48-494e-4967-826a-6e70199acad4depends_on_disabled_cells§runtime(published_object_keysdepends_on_skipped_cells§errored$734573e5-547b-4dcc-89bb-412aa6cc42d6queued¤logsrunning¦outputbodyEactor_critic_linear_parameter_study (generic function with 2 methods)mimetext/plainrootassigneelast_run_timestampA +{persist_js_state·has_pluto_hook_features§cell_id$734573e5-547b-4dcc-89bb-412aa6cc42d6depends_on_disabled_cells§runtimex",published_object_keysdepends_on_skipped_cellsçerrored$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54queued¤logsrunning¦outputbodyKactor_critic_with_eligibility_traces_fcann (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA /Bpersist_js_state·has_pluto_hook_features§cell_id$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54depends_on_disabled_cells§runtimeEm$published_object_keysdepends_on_skipped_cells§errored$692c1043-4eaf-491e-b8fe-368618867f99queued¤logsrunning¦outputbody
  1. The soft-max distribution is:

$$\sigma(a|s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}}$$

We only have two possible actions in each state so the policy for action 1 would be given by:

$$\pi(1|S_t, \theta_t) = \frac{e^{h(S_t, 1, \theta_t)}}{e^{h(S_t, 0, \theta_t)} + e^{h(S_t, 1, \theta_t)}}$$

Simplify this expression by dividing the numerator and denominator by $e^{h(S_t, 1, \theta_t)}$, which results in:

$$\pi(1|S_t, \theta_t) = \frac{1}{e^{h(S_t, 0, \theta_t) - h(S_t, 1, \theta_t)} + 1}$$

Given the assumption that $h(s, 1, \theta)-h(s, 0, \theta) = \theta^\top\mathbf{x}(s)$, we replace the expression in the exponent resulting in the final expression of:

$$\pi(1|S_t, \theta_t) = \frac{1}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1}$$

Using the notation $f(x) = 1/(1+e^{-x})$ we can write $\pi(1|S_t, \theta_t) = f(\theta_t^\top \mathbf{x}(S_t))$ where $f$ is the logistic function. Consider this notation for the rest of the exercises.

  1. The REINFORCE update is given by: $\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla\pi(A_t|S_t, \theta_t)}{\pi(A_t|S_t, \theta_t)}$, so we need to compute the gradient of the policy in terms of the parameters for this action selection: $\nabla \pi(1|S_t, \theta_t)$. Luckily, the derivative of the logistic function is simply given by: $f(x)(1-f(x))$ where $f(x)$ is the logistic function itself. In our case $x = \theta_t^\top \mathbf{x}_t$ so after applying the chain rule we have:

$$\nabla\pi(1|S_t, \theta_t) = f(x)(1-f(x))\nabla x = f(x)(1-f(x)) \mathbf{x}_t$$

since $x$ is just a linear function of the parameters. So for the parameter update step we have:

$$\frac{\nabla\pi(1|S_t, \theta_t)}{\pi(1|S_t, \theta_t)} = \frac{f(x)(1-f(x))\mathbf{x}_t}{f(x)} = (1 - f(x))\mathbf{x}_t$$

Also note that:

$$1 - f(x) = 1 - \frac{1}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1} = \frac{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1 - 1}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1} = \frac{e^{-\theta_t^\top\mathbf{x}(S_t)}}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1}$$

The REINFORCE update will then be:

$$\theta_{t+1} = \theta_t + \alpha G_t \left ( \frac{e^{-\theta_t^\top\mathbf{x}(S_t)}}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1} \right ) \mathbf{x}_t$$

  1. For the general case, we want to calculate $\frac{\nabla\pi(a|s, \theta)}{\pi(a|s, \theta)}$. We already know this expression for $a = 1$.

$$\nabla {\pi(1|s, \mathbf{\theta})} = f(x)(1 - f(x))\mathbf{x}(s) = \pi(1|s, \mathbf{\theta})(1 - \pi(1|s, \mathbf{\theta}))\mathbf{x}(s)$$

Since $\pi(a|s, \theta)$ is a probability distribution across actions, we also know that

$$\pi(0|s, \theta) = 1 - \pi(1|s, \theta)$$

which implies that

$$\nabla \pi(0|s, \theta) = -\nabla \pi(1|s, \theta) = -\pi(1|s, \mathbf{\theta})(1 - \pi(1|s, \mathbf{\theta}))\mathbf{x}(s)$$

We can express this in terms of $\pi(0|s, \theta)$ completely:

$$\nabla \pi(0|s, \theta) = (\pi(0|s, \mathbf{\theta}) - 1)\pi(0|s, \theta)\mathbf{x}(s) = -\pi(0|s, \theta)(1 - \pi(0|s, \mathbf{\theta}))\mathbf{x}(s)$$

Let's now compare the two expressions for the policy gradient at each action:

$$\begin{align} \nabla {\pi(1|s, \mathbf{\theta})} &= \pi(1|s, \mathbf{\theta})(1 - \pi(1|s, \mathbf{\theta}))\mathbf{x}(s) \\ \nabla \pi(0|s, \theta) &= -\pi(0|s, \theta)(1 - \pi(0|s, \mathbf{\theta}))\mathbf{x}(s) \\ \therefore \\ \nabla \pi(a|s, \theta) &= \chi (a) \pi(a|s, \theta)(1 - \pi(a|s, \mathbf{\theta}))\mathbf{x}(s) \\ \end{align}$$

where $\chi (a)$ is a function that returns 1 for $a=1$ and -1 for $a=0$. There are many ways to achieve this, but the following expression is simple and works: $\chi(a) = 2a - 1$. Dividing by the policy yields a unified expression for the eligibility vector:

$$\nabla \ln{\pi(a|s,\theta)} = (2a - 1) (1 - \pi(a|s, \mathbf{\theta}))\mathbf{x}(s)$$
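The unified expression can be checked numerically. The following is a minimal sketch, with hypothetical helper names (`logistic`, `policy_prob`, `eligibility`, `numeric_gradient`) that are not part of the notebook, comparing the closed-form eligibility vector against a central finite-difference gradient of $\ln \pi(a|s,\theta)$.

```julia
logistic(z) = 1 / (1 + exp(-z))

# π(a | s, θ) for a ∈ {0, 1}, with π(1 | s, θ) = logistic(θᵀx)
policy_prob(a, θ, x) = a == 1 ? logistic(sum(θ .* x)) : 1 - logistic(sum(θ .* x))

# Closed-form eligibility vector ∇ ln π(a | s, θ) = (2a - 1)(1 - π(a | s, θ)) x
eligibility(a, θ, x) = (2a - 1) * (1 - policy_prob(a, θ, x)) .* x

# Central finite-difference gradient of ln π, used only as a check
function numeric_gradient(a, θ, x; ϵ = 1e-6)
    map(eachindex(θ)) do i
        step = zeros(length(θ))
        step[i] = ϵ
        (log(policy_prob(a, θ .+ step, x)) - log(policy_prob(a, θ .- step, x))) / (2ϵ)
    end
end
```

For any parameter and feature vectors, `eligibility(a, θ, x)` and `numeric_gradient(a, θ, x)` should agree to within the finite-difference error for both actions.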

mimetext/htmlrootassigneelast_run_timestampA ݐpersist_js_state·has_pluto_hook_features§cell_id$692c1043-4eaf-491e-b8fe-368618867f99depends_on_disabled_cells§runtime+published_object_keysdepends_on_skipped_cells§errored$2c5d221a-2469-49e1-9249-dfdc2457f2faqueued¤logsrunning¦outputbodymimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$2c5d221a-2469-49e1-9249-dfdc2457f2fadepends_on_disabled_cells§runtime apublished_object_keysdepends_on_skipped_cellsçerrored$7c592385-e8d3-4efe-962c-d39debb64405queued¤logsrunning¦outputbodyelementsnum_features1452text/plainget_active_featuresftext/plaintypeNamedTupleobjectid349d5e4a8f483b41mime!application/vnd.pluto.tree+objectrootassignee"const mountaincar_tilecoding_setuplast_run_timestampA =,Qpersist_js_state·has_pluto_hook_features§cell_id$7c592385-e8d3-4efe-962c-d39debb64405depends_on_disabled_cells§runtimeZ^published_object_keysdepends_on_skipped_cells§errored$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebqueued¤logsrunning¦outputbodymimetext/plainrootassigneelast_run_timestampA >persist_js_state·has_pluto_hook_features§cell_id$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebdepends_on_disabled_cells§runtime)published_object_keysdepends_on_skipped_cellsçerrored$8eab55a5-41b7-4f5e-a02f-4c19388bc9eaqueued¤logsrunning¦outputbody>update_binary_feature_vector! (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$8eab55a5-41b7-4f5e-a02f-4c19388bc9eadepends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$0ac7ea44-14f6-4e80-80f9-d6df8059bb38queued¤logsrunning¦outputbody?reinforce_monte_carlo_control! 
(generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA #n$lpersist_js_state·has_pluto_hook_features§cell_id$0ac7ea44-14f6-4e80-80f9-d6df8059bb38depends_on_disabled_cells§runtime/Epublished_object_keysdepends_on_skipped_cells§errored$5ffc271f-c73f-494a-9727-8d7516af2191queued¤logsrunning¦outputbodyd=

$\lambda_\theta$: 0.8

$\lambda_\mathbf{w}$: 0.15

$\alpha_{\overline{r}}$:

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$5ffc271f-c73f-494a-9727-8d7516af2191depends_on_disabled_cells§runtimeȵpublished_object_keysdepends_on_skipped_cellsçerrored$c5a2879c-e89b-47f7-bbd6-48200d7e89e3queued¤logsrunning¦outputbody`actor_critic_binary_episodic_squashed_gaussian_parameter_study (generic function with 3 methods)mimetext/plainrootassigneelast_run_timestampA /persist_js_state·has_pluto_hook_features§cell_id$c5a2879c-e89b-47f7-bbd6-48200d7e89e3depends_on_disabled_cells§runtime%lpublished_object_keysdepends_on_skipped_cells§errored$537270ba-122b-4f2b-880b-31d086766295queued¤logsrunning¦outputbodyContinuousMDPmimetext/plainrootassigneelast_run_timestampA !^persist_js_state·has_pluto_hook_features§cell_id$537270ba-122b-4f2b-880b-31d086766295depends_on_disabled_cells§runtimergpublished_object_keysdepends_on_skipped_cells§errored$dc2efc6c-8da8-425b-aa5f-290949109565queued¤logsrunning¦outputbodyE>
mimetext/htmlrootassigneelast_run_timestampA @Ipersist_js_state·has_pluto_hook_features§cell_id$dc2efc6c-8da8-425b-aa5f-290949109565depends_on_disabled_cells§runtimeX1published_object_keysdepends_on_skipped_cellsçerrored$a019925a-460a-410e-a54b-50a4cfe0e90equeued¤logsrunning¦outputbody6# mimetext/htmlrootassigneelast_run_timestampA !persist_js_state·has_pluto_hook_features§cell_id$a019925a-460a-410e-a54b-50a4cfe0e90edepends_on_disabled_cells§runtimeJpublished_object_keysdepends_on_skipped_cellsçerrored$f92bb265-4b19-4f0e-a698-d7547bb6dd41queued¤logsrunning¦outputbodymimetext/plainrootassigneelast_run_timestampA .persist_js_state·has_pluto_hook_features§cell_id$f92bb265-4b19-4f0e-a698-d7547bb6dd41depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$ac9c8845-284d-4c21-b05d-d930f86598a3queued¤logsrunning¦outputbodymimetext/htmlrootassigneelast_run_timestampA =^̰persist_js_state·has_pluto_hook_features§cell_id$ac9c8845-284d-4c21-b05d-d930f86598a3depends_on_disabled_cells§runtime #published_object_keysdepends_on_skipped_cellsçerrored$192cc1cf-9ea1-492d-baa7-f2e197abecd4queued¤logsrunning¦outputbodymimetext/htmlrootassigneelast_run_timestampA =rҕpersist_js_state·has_pluto_hook_features§cell_id$192cc1cf-9ea1-492d-baa7-f2e197abecd4depends_on_disabled_cells§runtime ܵpublished_object_keysdepends_on_skipped_cellsçerrored$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547dqueued¤logsrunning¦outputbody1mimetext/htmlrootassigneelast_run_timestampA 7ʰpersist_js_state·has_pluto_hook_features§cell_id$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547ddepends_on_disabled_cells§runtimeApublished_object_keysdepends_on_skipped_cellsçerrored$c8b47eac-2d45-419a-bec6-2ae0cdc59393queued¤logsrunning¦outputbodymimetext/plainrootassigneelast_run_timestampA !persist_js_state·has_pluto_hook_features§cell_id$c8b47eac-2d45-419a-bec6-2ae0cdc59393depends_on_disabled_cells§runtime

Chapter 13 Policy Gradient Methods Introduction

Instead of selecting actions based on action-value estimates, we learn a parameterized policy with parameters $\boldsymbol{\theta}$. $\pi(a|s, \boldsymbol{\theta}) = \text{Pr}\{A_t=a|S_t=s, \boldsymbol{\theta}_t=\boldsymbol{\theta}\}$ denotes the probability that action $a$ is taken at time $t$ given that the environment is in state $s$ at time $t$ with parameter $\boldsymbol{\theta}$.

We consider methods that improve the policy parameter using the gradient of some scalar performance measure $J(\boldsymbol{\theta})$ with respect to the policy parameters. Since we are trying to maximize this value, we follow gradient ascent; methods that use this approach are called policy gradient methods. Methods that learn approximations to both policy and value functions are often called actor-critic methods, where 'actor' is a reference to the learned policy, and 'critic' refers to the learned value function, usually a state-value function.

13.1 Policy Approximation and its Advantages

mimetext/htmlrootassigneelast_run_timestampA 0persist_js_state·has_pluto_hook_features§cell_id$36a6e43f-6bcf-4c27-bfbb-047760e77adadepends_on_disabled_cells§runtime published_object_keysdepends_on_skipped_cells§errored$436c52d2-280b-4ca4-9360-d6587b8254c7queued¤logsrunning¦outputbody

In order to test this algorithm we need a continuing task, that is, one with no terminal state. We can modify the corridor MDP to be a continuing task by altering the reward structure so that a reward of 1 is received upon moving to the right from state 3, after which the state is reset to 1. See below for a version of this MDP updated to be a continuing problem.

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$436c52d2-280b-4ca4-9360-d6587b8254c7depends_on_disabled_cells§runtime2rpublished_object_keysdepends_on_skipped_cells§errored$e96d592d-1e54-486d-8ad9-b857f85476e8queued¤logsrunning¦outputbodyDactor_critic_linear_parameter_study (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA e|1persist_js_state·has_pluto_hook_features§cell_id$e96d592d-1e54-486d-8ad9-b857f85476e8depends_on_disabled_cells§runtime2published_object_keysdepends_on_skipped_cells§errored$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6cqueued¤logsrunning¦outputbody/b mimetext/htmlrootassigneelast_run_timestampA *1 persist_js_state·has_pluto_hook_features§cell_id$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6cdepends_on_disabled_cells§runtimeQpublished_object_keysdepends_on_skipped_cellsçerrored$4da20fd7-b897-4f26-bf2a-f08d66ddf90fqueued¤logsrunning¦outputbodyGactor_critic_with_eligibility_traces! (generic function with 4 methods)mimetext/plainrootassigneelast_run_timestampA + Npersist_js_state·has_pluto_hook_features§cell_id$4da20fd7-b897-4f26-bf2a-f08d66ddf90fdepends_on_disabled_cells§runtime}lpublished_object_keysdepends_on_skipped_cells§errored$11ea640c-3981-404d-87c6-4d3d0708a2b8queued¤logsrunning¦outputbodyMactor_critic_linear_episodic_parameter_study (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA +ǖpersist_js_state·has_pluto_hook_features§cell_id$11ea640c-3981-404d-87c6-4d3d0708a2b8depends_on_disabled_cells§runtimeE.published_object_keysdepends_on_skipped_cellsçerrored$281360af-46bf-4c73-bf11-3cb1153ad3e2queued¤logsrunning¦outputbodymimetext/plainrootassigneelast_run_timestamppersist_js_state·has_pluto_hook_features§cell_id$281360af-46bf-4c73-bf11-3cb1153ad3e2depends_on_disabled_cellsçruntimepublished_object_keysdepends_on_skipped_cells§errored$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690cqueued¤logsrunning¦outputbodyNupdate_squashed_gaussian_eligibility_vector! 
(generic function with 4 methods)mimetext/plainrootassigneelast_run_timestampA #^persist_js_state·has_pluto_hook_features§cell_id$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690cdepends_on_disabled_cells§runtime#published_object_keysdepends_on_skipped_cells§errored$da3cb392-78f2-48b2-b0dc-5f016664798cqueued¤logsrunning¦outputbodyڗ%Total Reward: -142.0
mimetext/htmlrootassigneelast_run_timestampA ANpersist_js_state·has_pluto_hook_features§cell_id$da3cb392-78f2-48b2-b0dc-5f016664798cdepends_on_disabled_cells§runtime Vpublished_object_keysdepends_on_skipped_cellsçerrored$dca2f8e2-76af-4679-bf81-3824c15fc76dqueued¤logsrunning¦outputbodyelementsstep_rewardsprefixFloat32elementstypeArrayprefix_shortobjectid9f06ece3690faedd!application/vnd.pluto.tree+objectepisode_stepsprefixInt64elements19text/plain34text/plain61text/plain75text/plain90text/plain117text/plain135text/plain178text/plain 204text/plainmore99994text/plaintypeArrayprefix_shortobjectid245ae93c73b7a831!application/vnd.pluto.tree+objectepisode_rewardsprefixFloat32elements18.0text/plain15.0text/plain27.0text/plain14.0text/plain15.0text/plain27.0text/plain18.0text/plain43.0text/plain 26.0text/plainmore11.0text/plaintypeArrayprefix_shortobjectid1b64855ae6e6478!application/vnd.pluto.tree+objectpolicy_parameters52488×3 Matrix{Float32}: NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ⋮ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaNtext/plainvalue_parametersprefixFloat32elementsNaNtext/plainNaNtext/plainNaNtext/plainNaNtext/plainNaNtext/plainNaNtext/plainNaNtext/plainNaNtext/plain NaNtext/plainmoreNaNtext/plaintypeArrayprefix_shortobjectidd26d759cd1b52e3f!application/vnd.pluto.tree+objectpolicy_functionπtext/plainpolicy_sample_actionπ_sampletext/plainestimate_state_valueestimate_state_valuetext/plainpolicy_and_valuepolicy_and_valuetext/plaintypeNamedTupleobjectide492fba9453c9690mime!application/vnd.pluto.tree+objectrootassigneeconst reinforce_test3last_run_timestampA 2k;Űpersist_js_state·has_pluto_hook_features§cell_id$dca2f8e2-76af-4679-bf81-3824c15fc76ddepends_on_disabled_cells§runtime G>ŵpublished_object_keysdepends_on_skipped_cells§errored$8019bec9-1228-407b-9199-2fe29f26a981queued¤logsrunning¦outputbody

Exercise 13.1

Use your knowledge of the gridworld and its dynamics to determine an exact symbolic expression for the optimal probability of selecting the right action in Example 13.1.

Example 13.1 is a gridworld with 3 non-terminal states and a terminal state at the far right. The reward is -1 per step. States 1 and 3 have left/right actions that move in the expected directions, but state 2 reverses the directions. We use a performance measure $J(\mathbf{\theta}) = v_{\pi_\theta}(S)$. Given our feature representations of $\mathbf{x}(s, \text{right}) = [1, 0]^{\top}$ and $\mathbf{x}(s, \text{left}) = [0, 1]^{\top}$, we can only learn policies that are stochastic in terms of left/right action selection but do not vary between states. Also observe that due to probability constraints $p_{\text{right}} = 1 - p_{\text{left}}$. For simplicity, we will use the notation $p \doteq p_{\text{left}}$ and write the three state values as $v_1, v_2, v_3$.

$$\begin{flalign} v_1 &= p \times v_1 + (1-p) \times v_2 - 1 \tag{1} \\ v_1 (1-p) &= v_2 (1-p) - 1 \\ v_1 &= v_2 - \frac{1}{1-p} \tag{1′}\\ v_2 &= p \times v_3 + (1-p) \times v_1 - 1 \tag{2} \\ v_3 &= p \times v_2 - 1 \tag{3}\\ v_2 &= p \times [p\times v_2 - 1] +(1-p) \times v_1 - 1 \tag{substituting 3 into 2} \\ v_2(1 - p^2) &= -p +(1-p) \times v_1 - 1 \\ v_2 &= \frac{(1-p) v_1 - (1+p)}{(1+p)(1-p)} \tag{collecting terms} \\ &= \frac{(1-p) v_2 - 1 - (1+p)}{(1+p)(1-p)} \tag{using 1′} \\ &= \frac{v_2}{1+p} - \frac{2 + p}{(1+p)(1-p)} \\ v_2 \left [1 - \frac{1}{1+p} \right ] &= - \frac{2 + p}{(1+p)(1-p)} \\ v_2 \frac{1+p-1}{1+p} &= - \frac{2 + p}{(1+p)(1-p)} \\ v_2 &= - \frac{2 + p}{(1-p)p} \\ v_1 &= - \frac{2 + p}{(1-p)p} - \frac{1}{1-p} \\ &= \frac{-2 - p - p}{(1-p)p} \\ &= -\frac{2 + 2p}{(1-p)p} \\ v_3 &= -\frac{2 + p}{1-p} - 1\\ &= \frac{-2 - p - 1 + p}{1-p}\\ &= -\frac{3}{1-p}\\ \end{flalign}$$

To summarize all the state values:

$$\begin{flalign} v_1 &= -\frac{2 + 2p}{(1-p)p} \\ v_2 &= - \frac{2 + p}{(1-p)p} \\ v_3 &= -\frac{3}{1-p} \end{flalign}$$
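As a sanity check on the algebra, the three Bellman equations (1), (2), and (3) can be solved numerically as a linear system and compared with the closed-form values. This is a minimal sketch; `corridor_values` and `closed_form` are hypothetical names introduced only for this check.

```julia
# Solve equations (1)-(3) for a given p = Pr(left), written as A v = -1
function corridor_values(p)
    q = 1 - p                       # probability of selecting right
    # Rows are states 1, 2, 3; columns are the coefficients on v₁, v₂, v₃
    A = [ q  -q   0
         -q   1  -p
          0  -p   1]
    return A \ fill(-1.0, 3)
end

# Closed-form state values from the summary above
closed_form(p) = [-(2 + 2p) / ((1 - p) * p), -(2 + p) / ((1 - p) * p), -3 / (1 - p)]
```

For example, at $p = 0.5$ both routes give $v_1 = -12$, $v_2 = -10$, and $v_3 = -6$.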

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$8019bec9-1228-407b-9199-2fe29f26a981depends_on_disabled_cells§runtime(published_object_keysdepends_on_skipped_cells§errored$fd964539-2baf-4ff1-b286-5a0bb1b222c4queued¤logsrunning¦outputbody

The beta distribution has two parameters, like the normal distribution, but its support is restricted to the interval from 0 to 1. The two parameters $\alpha$ and $\beta$ are positive real numbers and control the shape of the distribution. The density function is given below:

$$f(x; \alpha, \beta) = \frac{x^{\alpha-1} (1-x)^{\beta - 1}}{\text{B}(\alpha, \beta)}$$

where $\text{B}(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}$ and $\Gamma(z) = \int_0^\infty t^{z-1}e^{-t} \text{d} t$

We saw earlier from the treatment of the Gaussian distribution that we need to find the gradient of a function of each distribution parameter with respect to the parameters of the function approximation. Luckily, the derivation of the maximum likelihood estimator already provides the gradient we are interested in for this distribution. Note that the log-likelihood function for a single sample $X$ of a random variable that follows the beta distribution is given by $\mathcal{L}(\alpha, \beta \vert X) = \ln(f(X; \alpha, \beta))$, and the partial derivatives of this function with respect to $\alpha$ and $\beta$ are given by:

$$\frac{\partial \mathcal{L}(\alpha, \beta \vert X)}{\partial \alpha} = \ln X - \frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \alpha}$$

$$\frac{\partial \mathcal{L}(\alpha, \beta \vert X)}{\partial \beta} = \ln (1-X) - \frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \beta}$$

where $\frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \alpha} = -\psi(\alpha + \beta) + \psi(\alpha)$ and $\frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \beta} = -\psi(\alpha + \beta) + \psi(\beta)$, and $\psi$ is the digamma function, which is just the derivative of the logarithm of the gamma function.

Since both $\alpha$ and $\beta$ must be greater than zero, we can estimate each one by applying the exponential function to a dot product of a parameter vector with the feature vector: $\alpha(s, \boldsymbol{\theta}) \doteq \exp \left (\boldsymbol{\theta}_\alpha^\top \mathbf{x}(s) \right )$ and $\beta(s, \boldsymbol{\theta}) \doteq \exp \left (\boldsymbol{\theta}_\beta^\top \mathbf{x}(s) \right )$.

The eligibility vector for this distribution is then:

$$\nabla \ln f(a \vert \alpha(s, \boldsymbol{\theta}_\alpha), \beta(s, \boldsymbol{\theta}_\beta))$$

where $\alpha$ is a function of its own parameter vector and $\beta$ is a function of the other parameter vector. The gradient components corresponding to each parameter vector depend only on the partial derivative of the log density with respect to $\alpha$ or $\beta$, respectively. That is, since $\frac{\partial \alpha}{\partial \theta_{\beta_i}} = 0 \ \forall i$ and vice versa, we can treat each part of the gradient separately. Substituting the partial derivatives of the log-likelihood from above gives:

$$\begin{flalign} \nabla_{\boldsymbol{\theta}_\alpha} \ln f(a \vert \alpha, \beta) &= \frac{\partial \ln f(a \vert \alpha, \beta)}{\partial \alpha} \nabla_{\boldsymbol{\theta}_\alpha}\alpha \\ &= \left ( \ln a + \psi(\alpha + \beta) - \psi(\alpha) \right ) \nabla_{\boldsymbol{\theta}_\alpha} \alpha \\ &= \left ( \ln a + \psi(\alpha + \beta) - \psi(\alpha) \right ) \nabla_{\boldsymbol{\theta}_\alpha} \exp \left ( \boldsymbol{\theta}_\alpha^\top \mathbf{x}(s) \right ) \\ &= \left ( \ln a + \psi(\alpha + \beta) - \psi(\alpha) \right ) \alpha \mathbf{x}(s)\\ \end{flalign}$$

$$\begin{flalign} \nabla_{\boldsymbol{\theta}_\beta} \ln f(a \vert \alpha, \beta) &= \frac{\partial \ln f(a \vert \alpha, \beta)}{\partial \beta} \nabla_{\boldsymbol{\theta}_\beta}\beta \\ &= \left ( \ln (1-a) + \psi(\alpha + \beta) - \psi(\beta) \right ) \nabla_{\boldsymbol{\theta}_\beta} \beta \\ &= \left ( \ln (1-a) + \psi(\alpha + \beta) - \psi(\beta) \right ) \nabla_{\boldsymbol{\theta}_\beta} \exp \left ( \boldsymbol{\theta}_\beta^\top \mathbf{x}(s) \right ) \\ &= \left ( \ln (1-a) + \psi(\alpha + \beta) - \psi(\beta) \right ) \beta \mathbf{x}(s)\\ \end{flalign}$$
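The two gradient blocks can be sketched in code as follows. The partial derivatives used here come from the log-likelihood expressions above, $\partial \ln f / \partial \alpha = \ln a + \psi(\alpha + \beta) - \psi(\alpha)$ and $\partial \ln f / \partial \beta = \ln(1-a) + \psi(\alpha + \beta) - \psi(\beta)$. Both function names are hypothetical, and `digamma_approx` is a hand-rolled approximation included only to keep the sketch self-contained; a notebook would more likely call a library digamma function.

```julia
# Approximate digamma ψ(z) for z > 0: shift z upward with ψ(z) = ψ(z + 1) - 1/z,
# then apply the standard asymptotic series for large z.
function digamma_approx(z::Real)
    z > 0 || throw(DomainError(z, "this sketch only handles z > 0"))
    acc = 0.0
    while z < 6
        acc -= 1 / z
        z += 1
    end
    z2 = 1 / z^2
    return acc + log(z) - 1 / (2z) - z2 * (1 / 12 - z2 * (1 / 120 - z2 / 252))
end

# Eligibility vectors for a beta policy with α = exp(θ_αᵀx) and β = exp(θ_βᵀx);
# the action a must lie strictly between 0 and 1.
function beta_eligibility(a, θ_α, θ_β, x)
    α = exp(sum(θ_α .* x))
    β = exp(sum(θ_β .* x))
    ψ_sum = digamma_approx(α + β)
    grad_α = (log(a) + ψ_sum - digamma_approx(α)) * α .* x
    grad_β = (log(1 - a) + ψ_sum - digamma_approx(β)) * β .* x
    return grad_α, grad_β
end
```

With both parameter vectors at zero we have $\alpha = \beta = 1$ (the uniform distribution), and since $\psi(2) - \psi(1) = 1$, an action of $a = 0.5$ gives both eligibility vectors equal to $(\ln 0.5 + 1)\,\mathbf{x}(s)$.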

mimetext/htmlrootassigneelast_run_timestampA +Zpersist_js_state·has_pluto_hook_features§cell_id$fd964539-2baf-4ff1-b286-5a0bb1b222c4depends_on_disabled_cells§runtime Rpublished_object_keysdepends_on_skipped_cells§errored$5720e942-d3f8-4329-83a8-8bcedf078b6aqueued¤logsrunning¦outputbodyprefixFloat32elements0.47295text/plain0.52705text/plaintypeArrayprefix_shortobjectid81035272d8cf80fdmime!application/vnd.pluto.tree+objectrootassigneelast_run_timestampA $)persist_js_state·has_pluto_hook_features§cell_id$5720e942-d3f8-4329-83a8-8bcedf078b6adepends_on_disabled_cells§runtime!*published_object_keysdepends_on_skipped_cellsçerrored$62e677ac-2070-4f6b-9df2-90849d89fa9fqueued¤logsrunning¦outputbodyٯ152×1 Matrix{Float64}: 0.0 0.0 0.0 0.12513599999999991 0.1874460000000001 0.28129099999999996 0.343742 ⋮ 0.999999 0.999999 0.999999 0.999999 0.999999 0.999999mimetext/plainrootassignee%const corridor_terminal_probabilitieslast_run_timestampA #'Gpersist_js_state·has_pluto_hook_features§cell_id$62e677ac-2070-4f6b-9df2-90849d89fa9fdepends_on_disabled_cells§runtimePEpublished_object_keysdepends_on_skipped_cellsçerrored$11b9beea-b0cd-45eb-84c6-151728894df0queued¤logsrunning¦outputbodyHform_state_and_policy_function_outputs (generic function with 2 methods)mimetext/plainrootassigneelast_run_timestampA 'E￰persist_js_state·has_pluto_hook_features§cell_id$11b9beea-b0cd-45eb-84c6-151728894df0depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290queued¤logsrunning¦outputbodyNreinforce_monte_carlo_control_binary_features (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA #umPpersist_js_state·has_pluto_hook_features§cell_id$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290depends_on_disabled_cells§runtime/published_object_keysdepends_on_skipped_cells§errored$55ba8725-0ddf-4196-a41d-3f3c490a8d84queued¤logsrunning¦outputbodyVactor_critic_binary_episodic_gaussian_parameter_study (generic function with 1 
method)mimetext/plainrootassigneelast_run_timestampA /ɰpersist_js_state·has_pluto_hook_features§cell_id$55ba8725-0ddf-4196-a41d-3f3c490a8d84depends_on_disabled_cells§runtimeXɵpublished_object_keysdepends_on_skipped_cellsçerrored$a540814a-57a1-4b98-9443-59e401425444queued¤logsrunning¦outputbody6binary_value_function (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$a540814a-57a1-4b98-9443-59e401425444depends_on_disabled_cells§runtime published_object_keysdepends_on_skipped_cells§errored$1b102220-6d78-480d-a77f-0e57bad23dcaqueued¤logsrunning¦outputbodyKcartpole_binary_continuing_parameter_study (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA /ְpersist_js_state·has_pluto_hook_features§cell_id$1b102220-6d78-480d-a77f-0e57bad23dcadepends_on_disabled_cells§runtime"published_object_keysdepends_on_skipped_cellsçerrored$4d4ae57b-afc3-44f9-b6fc-892f59f82921queued¤logsrunning¦outputbody7one_step_actor_critic! (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA 'O) persist_js_state·has_pluto_hook_features§cell_id$4d4ae57b-afc3-44f9-b6fc-892f59f82921depends_on_disabled_cells§runtime[Bpublished_object_keysdepends_on_skipped_cells§errored$61949faa-8174-4b7b-8fbc-01d5f850b419queued¤logsrunning¦outputbodyXactor_critic_binary_continuing_gaussian_parameter_study (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA /*persist_js_state·has_pluto_hook_features§cell_id$61949faa-8174-4b7b-8fbc-01d5f850b419depends_on_disabled_cells§runtime%;published_object_keysdepends_on_skipped_cellsçerrored$5b15f5c9-80bf-47f0-898a-f8dead5b927cqueued¤logsrunning¦outputbody'

Continuing Case Actor-Critic Implementation

Note that this function has the same name as the episodic version. The only difference other than keyword arguments is that the max_episodes argument is missing. Since we already defined the versions of the algorithm for linear and non-linear cases in a generic manner, we only need to define the core version of this algorithm and the other functions will dispatch to it if they are called without the max_episodes argument.
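The dispatch pattern described above can be sketched in a few lines. This is only an illustration: `run_core` and `run_linear` are hypothetical stand-ins rather than the notebook's actual functions, and the argument lists are assumed for the example.

```julia
# Hypothetical stand-ins; only the dispatch pattern matters here.
# Episodic core: accepts a max_episodes positional argument.
run_core(mdp, max_episodes::Integer, max_steps::Integer) =
    "episodic: $max_episodes episodes"

# Continuing core: same function name, but no max_episodes argument.
run_core(mdp, max_steps::Integer) = "continuing: $max_steps steps"

# A generic wrapper written once. It forwards whatever positional arguments
# it receives, so Julia selects the episodic or continuing core by arity.
run_linear(mdp, args...) = run_core(mdp, args...)

episodic_result = run_linear(:some_mdp, 10, 100)
continuing_result = run_linear(:some_mdp, 100)
```

Because the wrapper never names `max_episodes` itself, adding the continuing method to the core function is enough for the wrapper to support both cases.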

mimetext/htmlrootassigneelast_run_timestampA ٰpersist_js_state·has_pluto_hook_features§cell_id$5b15f5c9-80bf-47f0-898a-f8dead5b927cdepends_on_disabled_cells§runtime͵published_object_keysdepends_on_skipped_cells§errored$266d2234-26c8-43f1-9e75-49440a230ed6queued¤logsrunning¦outputbodyFactor_critic_with_eligibility_traces! (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA + pPpersist_js_state·has_pluto_hook_features§cell_id$266d2234-26c8-43f1-9e75-49440a230ed6depends_on_disabled_cells§runtimen2published_object_keysdepends_on_skipped_cells§errored$aa69e4ea-91e0-496a-a7be-529e67f4dbecqueued¤logsrunning¦outputbodyprefixFloat32elements0.508736text/plain0.491264text/plaintypeArrayprefix_shortobjectid7c15309e849c4f78mime!application/vnd.pluto.tree+objectrootassigneelast_run_timestampA '&persist_js_state·has_pluto_hook_features§cell_id$aa69e4ea-91e0-496a-a7be-529e67f4dbecdepends_on_disabled_cells§runtime#X:published_object_keysdepends_on_skipped_cellsçerrored$10ee7709-0816-48d2-abe0-9be3dd04700fqueued¤logsrunning¦outputbody| mimetext/htmlrootassigneelast_run_timestampA =%persist_js_state·has_pluto_hook_features§cell_id$10ee7709-0816-48d2-abe0-9be3dd04700fdepends_on_disabled_cells§runtime%΂published_object_keysdepends_on_skipped_cellsçerrored$7d94922e-dc9f-4953-b539-24aaa2c85b12queued¤logsrunning¦outputbodyd1

$\lambda_\theta$: 0.75

$\lambda_\mathbf{w}$: 0.25

$\alpha_{\overline{r}}$:

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:

mimetext/htmlrootassigneelast_run_timestampA 9persist_js_state·has_pluto_hook_features§cell_id$7d94922e-dc9f-4953-b539-24aaa2c85b12depends_on_disabled_cells§runtimelGpublished_object_keysdepends_on_skipped_cellsçerrored$df7f84e8-b42a-4001-9dbf-6bc3ced94207queued¤logsrunning¦outputbodymimetext/plainrootassigneelast_run_timestampA (persist_js_state·has_pluto_hook_features§cell_id$df7f84e8-b42a-4001-9dbf-6bc3ced94207depends_on_disabled_cells§runtime?]Apublished_object_keysdepends_on_skipped_cells§errored$352d2952-cb83-47d3-9078-2b2ef9927443queued¤logsrunning¦outputbody:create_cartpole_functions (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA $persist_js_state·has_pluto_hook_features§cell_id$352d2952-cb83-47d3-9078-2b2ef9927443depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$0964133c-3a5b-433b-a8c4-a97813c37583queued¤logsrunning¦outputbody=plot_continuing_step_rewards (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA ! 
persist_js_state·has_pluto_hook_features§cell_id$0964133c-3a5b-433b-a8c4-a97813c37583depends_on_disabled_cells§runtime published_object_keysdepends_on_skipped_cellsçerrored$349631b2-4686-49a9-9f3a-1e4ad588b568queued¤logsrunning¦outputbodyprefix"ContinuousMDP{Float32, Tuple{Float32, Float32}, Float32, ContinuousMDPTransitionSampler{Float32, Tuple{Float32, Float32}, Float32, var"#step#1603"{Float32}}, typeof(Main.var"workspace#8".MountainCarTask.initialize_state), typeof(Main.var"workspace#8".MountainCarTask.isterm), Returns{Bool}}elementsptfprefixcContinuousMDPTransitionSampler{Float32, Tuple{Float32, Float32}, Float32, var"#step#1603"{Float32}}elementsstepS(::Main.var"workspace#8".var"#step#1603"{Float32}) (generic function with 1 method)text/plaintypestructprefix_shortContinuousMDPTransitionSamplerobjectid5f924a83!application/vnd.pluto.tree+objectinitialize_state1initialize_state (generic function with 1 method)text/plainisterm'isterm (generic function with 1 method)text/plainis_valid_actionReturns{Bool}(true)text/plaintypestructprefix_shortContinuousMDPobjectidd13aca4d611093b3mime!application/vnd.pluto.tree+objectrootassignee!const mountaincar_continuous_mdp2last_run_timestampA =#persist_js_state·has_pluto_hook_features§cell_id$349631b2-4686-49a9-9f3a-1e4ad588b568depends_on_disabled_cells§runtimec|Opublished_object_keysdepends_on_skipped_cells§errored$8544eddb-2095-4a3c-82e0-920123a88e6dqueued¤logsrunning¦outputbody

Test REINFORCE With and Without Baseline

The following function calls execute the REINFORCE algorithm on Example 13.1. The output displayed is the policy function evaluated on the single state representation for the problem. The two values are the probabilities of taking the left and right action respectively. If learning has converged properly, the right action probability should be the higher of the two, approaching a value of about 59% ($2 - \sqrt{2} \approx 0.586$).
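The target probability quoted above can be checked directly. The sketch below assumes the short-corridor dynamics of Example 13.1 (three non-terminal states, reward of -1 per step, left and right reversed in the second state, and the same right-action probability `p` in every state). The function name `corridor_start_value` is made up for this illustration and is not part of the notebook.

```julia
using LinearAlgebra

# Value of the start state when :right is chosen with probability p everywhere.
# Solves the three Bellman equations (assumed dynamics of Example 13.1):
#   v1 = -1 + (1-p) v1 + p v2        (left in state 1 stays put)
#   v2 = -1 + p v1 + (1-p) v3        (actions reversed in state 2)
#   v3 = -1 + (1-p) v2               (right in state 3 terminates)
function corridor_start_value(p)
    q = 1 - p
    A = [p -p 0; -p 1 -q; 0 -q 1]
    b = fill(-1.0, 3)
    v = A \ b
    return v[1]
end

# Grid search over the right-action probability.
ps = 0.01:0.001:0.99
values = corridor_start_value.(ps)
best_p = ps[argmax(values)]
best_value = maximum(values)
```

Solving the same equations by hand gives $v(s_0) = \frac{2p - 4}{p(1-p)}$, which is maximized at $p = 2 - \sqrt{2}$, so the grid search should land next to that value.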

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$8544eddb-2095-4a3c-82e0-920123a88e6ddepends_on_disabled_cells§runtimeϟpublished_object_keysdepends_on_skipped_cells§errored$31f7e903-30b6-4193-9174-88093e004de4queued¤logsrunning¦outputbody 4

In policy gradient methods, the policy can be parameterized in any way, as long as $\pi(a \vert s, \boldsymbol{\theta})$ is differentiable with respect to its parameters, that is, as long as $\nabla \pi(a \vert s, \boldsymbol{\theta})$ exists and is finite for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$, and $\boldsymbol{\theta} \in \mathbb{R}^{d^\prime}$ where $d^\prime$ is the number of parameters.

If the action space is discrete and not too large, then we can form a numerical preference $h(s, a, \boldsymbol{\theta})$ for each state/action pair, parameterized by $\boldsymbol{\theta}$, and select actions according to the soft-max distribution over those preferences: $\pi(a|s, \boldsymbol{\theta}) \doteq \frac{\exp{h(s, a, \boldsymbol{\theta})}}{\sum_b \exp{h(s, b, \boldsymbol{\theta})}}$. One advantage of the soft-max is that the learned policy can remain stochastic when the optimal policy is stochastic, yet it can also approach a deterministic policy as the preference for one action grows relative to the others. If we include a temperature parameter in the soft-max then we can vary the same policy to be more or less stochastic as needed.

If we calculate preferences with linear features, then we would have feature vectors $\mathbf{x}(s, a) \in \mathbb{R}^{d^\prime}$ to match with the parameter vector $\boldsymbol{\theta} \in \mathbb{R}^{d^\prime}$. Then the preferences would be calculated:

$$h(s, a, \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s, a)$$

Another advantage of parameterizing the policy directly is that for some problems the policy may be easier to approximate than the action-value function. We can also inject some prior knowledge of the environment into how the policy is parameterized.
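As a minimal sketch of the soft-max over linear preferences, the snippet below computes $\pi(\cdot \vert s, \boldsymbol{\theta})$ for one state. The helper name `softmax_policy`, the `temperature` keyword, and the feature values are assumptions made for illustration and do not come from the notebook's own code.

```julia
# Hypothetical helper: soft-max policy over linear action preferences.
# `features` holds one column x(s, a) per action; `θ` is the parameter vector.
function softmax_policy(θ::AbstractVector, features::AbstractMatrix; temperature = 1.0)
    h = (features' * θ) ./ temperature   # preferences h(s, a, θ) = θᵀx(s, a), scaled
    h = h .- maximum(h)                  # shift for numerical stability; probabilities unchanged
    exph = exp.(h)
    return exph ./ sum(exph)
end

θ = [1.0, -0.5, 0.25]
x = [1.0 0.0; 0.0 1.0; 1.0 1.0]          # 3 features × 2 actions (made-up values)

policy_default = softmax_policy(θ, x)
policy_sharp = softmax_policy(θ, x; temperature = 0.1)   # closer to deterministic
policy_flat = softmax_policy(θ, x; temperature = 10.0)   # closer to uniform
```

Lowering the temperature concentrates probability on the preferred action, while raising it pushes the distribution toward uniform, which is the sense in which the same parameters can yield a more or less stochastic policy.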

mimetext/htmlrootassigneelast_run_timestampA Ypersist_js_state·has_pluto_hook_features§cell_id$31f7e903-30b6-4193-9174-88093e004de4depends_on_disabled_cells§runtimeW3published_object_keysdepends_on_skipped_cells§errored$fee14dfe-c5ca-4126-a830-cc9d7eda5433queued¤logsrunning¦outputbodyelementsstep_rewardsprefixFloat32elementstypeArrayprefix_shortobjectid119a1546ef839870!application/vnd.pluto.tree+objectepisode_stepsprefixInt64elements8703text/plain9879text/plain10695text/plain11518text/plain12342text/plain13106text/plain13404text/plain14030text/plain 14749text/plainmorej99984text/plaintypeArrayprefix_shortobjectide32ae83612a524cb!application/vnd.pluto.tree+objectepisode_rewardsprefixFloat32elements-8702.0text/plain-1176.0text/plain-816.0text/plain-823.0text/plain-824.0text/plain-764.0text/plain-298.0text/plain-626.0text/plain -719.0text/plainmorej-66.0text/plaintypeArrayprefix_shortobjectid4801074e5c516386!application/vnd.pluto.tree+objectpolicy_parameters31452×2 Matrix{Float32}: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ⋮ 0.104785 0.0825784 0.389597 0.0243331 -0.269401 0.0689138 -0.134741 0.08453 -0.0867137 0.0451763 1.05009 -0.657261text/plainvalue_parametersprefixFloat32elements0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain0.0text/plain 0.0text/plainmore-12.9161text/plaintypeArrayprefix_shortobjectide67dc14f194ae6f1!application/vnd.pluto.tree+objectpolicy_functionπtext/plainpolicy_sample_actionπ_sampletext/plainestimate_state_valueestimate_state_valuetext/plainpolicy_and_valuepolicy_and_valuetext/plaintypeNamedTupleobjectid95334601429b2f05mime!application/vnd.pluto.tree+objectrootassignee(const mountaincar_continuous_test_train2last_run_timestampA > ðpersist_js_state·has_pluto_hook_features§cell_id$fee14dfe-c5ca-4126-a830-cc9d7eda5433depends_on_disabled_cells§runtime1published_object_keysdepends_on_skipped_cells§errored$b53dba81-a9e9-41da-8fc2-7736bf25f2dcqueued¤logsrunning¦outputbodyB

Waiting to run parameter study

mimetext/htmlrootassigneelast_run_timestampA @/persist_js_state·has_pluto_hook_features§cell_id$b53dba81-a9e9-41da-8fc2-7736bf25f2dcdepends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cellsçerrored$beb01fb8-c77d-4b5c-a66d-3812415e04a3queued¤logsrunning¦outputbody

Exercise 13.4

For the Gaussian policy parameterization, derive the formula for the eligibility vector $\nabla \ln{\pi(a|s, \mathbf{\theta})}$

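Starting with our expression for the Gaussian policy, with mean $\mu(s, \mathbf{\theta}) = \mathbf{\theta}_\mu^\top \mathbf{x}_\mu(s)$ and standard deviation $\sigma(s, \mathbf{\theta}) = \exp{(\mathbf{\theta}_\sigma^\top \mathbf{x}_\sigma(s))}$, we can calculate the gradient: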

$$\nabla \pi(a|s, \mathbf{\theta}) = \nabla \left ( \frac{1}{\sigma(s, \mathbf{\theta}) \sqrt{2\pi}} \exp \left ( - \frac{(a-\mu(s, \mathbf{\theta}))^2}{2\sigma(s, \mathbf{\theta})^2} \right ) \right )$$

We will eventually need $\nabla \mu$ and $\nabla \sigma$ so let's calculate them now.

$$\nabla (\sigma(s, \mathbf{\theta})) = \nabla \exp{( \mathbf{\theta}_\sigma ^ \top \mathbf{x}_\sigma (s))} = \sigma(s, \mathbf{\theta})\mathbf{x}_\sigma (s)$$

$$\nabla(\mu(s, \mathbf{\theta})) = \nabla ( \mathbf{\theta}_\mu ^\top \mathbf{x}_\mu(s)) = \mathbf{x}_\mu (s)$$

The first application of the quotient rule is straightforward. I will omit the input arguments to $\mu$ and $\sigma$, keeping in mind that both are functions of the parameters. Also let $f(\mu, \sigma) = - \frac{(a-\mu)^2}{2\sigma^2}$, so that $\pi(a|s, \mathbf{\theta}) = \frac{1}{\sigma \sqrt{2\pi}} \exp{(f(\mu, \sigma))}$. Therefore:

$$\begin{flalign} \nabla \pi(a|s, \mathbf{\theta}) \sqrt{2\pi} &= \frac{1}{\sigma ^2} \left (- \exp{(f(\mu, \sigma))} \nabla \sigma + \sigma \exp{(f(\mu, \sigma))}\nabla f(\mu, \sigma) \right ) \\ &= \frac{1}{\sigma ^2} \left ( -\exp{(f(\mu, \sigma))} \sigma\mathbf{x}_\sigma + \sigma \exp{(f(\mu, \sigma))}\nabla f(\mu, \sigma) \right ) \\ &=\frac{\exp{(f(\mu, \sigma))}}{\sigma} \left (-\mathbf{x}_\sigma + \nabla f(\mu, \sigma) \right ) \\ \end{flalign}$$

Now we need only calculate the gradient of $f$:

$$\begin{flalign} \nabla f(\mu, \sigma) &= \frac{-1}{2} \nabla \left [ \frac{(a-\mu)^2}{\sigma^2} \right ] \\ & = \frac{-1}{2\sigma^4} \left [-2 \sigma^2 (a - \mu) \nabla \mu - (a - \mu)^2 2\sigma \nabla \sigma \right ] \\ & = \frac{-1}{\sigma^3} \left [ -\sigma (a - \mu) \nabla \mu - (a - \mu)^2 \nabla \sigma \right ] \\ & = \frac{-1}{\sigma^3} \left [ -\sigma (a - \mu) \mathbf{x}_\mu (s) - (a - \mu)^2 \sigma \mathbf{x}_\sigma \right ] \tag{substituting gradients}\\ & = \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu) \tag{simplifying}\\ \end{flalign}$$

Now substitute this back into the policy gradient:

$$\nabla \pi(a|s, \mathbf{\theta}) \sqrt{2\pi} = \frac{\exp{(f(\mu, \sigma))}}{\sigma} \left (-\mathbf{x}_\sigma + \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu) \right )$$

Furthermore, observe that $\pi(a|s, \mathbf{\theta}) = \frac{1}{\sigma\sqrt{2\pi}} \exp(f(\mu, \sigma))$

So our expression for the policy gradient is:

$$\nabla \pi(a|s, \mathbf{\theta}) = \pi(a|s, \mathbf{\theta}) \left (-\mathbf{x}_\sigma + \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu) \right )$$

To get the eligibility vector we must divide this by the policy which is conveniently already in the expression:

$$\begin{flalign} \frac{\nabla \pi(a|s, \mathbf{\theta})}{\pi(a|s, \mathbf{\theta})} &= -\mathbf{x}_\sigma + \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu)\\ &= \mathbf{x}_\mu \left [ \frac{(a - \mu)}{\sigma^2} \right ] + \mathbf{x}_\sigma \left [\frac{(a-\mu)^2}{\sigma^2} -1 \right ] \\ \end{flalign}$$

There are two components to the sum, one for $\mu$ and one for $\sigma$. If we think of the parameters and feature vectors as concatenated, then this sum is an element-by-element sum in which $\mathbf{x}_\mu$ is zero at all the feature indices corresponding to $\sigma$ and vice versa. Summing in this way forms one complete vector that has gradient components for all the parameters $\mathbf{\theta}_\mu$ and $\mathbf{\theta}_\sigma$. Alternatively, the two terms can be kept separate throughout the calculation, with each gradient containing only the components for its own parameters.
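The closed-form result can be sanity-checked numerically. The sketch below keeps the $\mu$ and $\sigma$ parts separate and compares each against a central finite difference of $\ln \pi$. All function names and the numeric inputs are hypothetical and chosen only for this check.

```julia
# Log density of a Gaussian policy with mean μ and standard deviation σ.
gaussian_logpdf(a, μ, σ) = -log(σ) - log(2π) / 2 - (a - μ)^2 / (2σ^2)

# ln π(a|s, θ) with μ = θ_μᵀx_μ and σ = exp(θ_σᵀx_σ).
function log_policy(a, θ_μ, θ_σ, x_μ, x_σ)
    μ = sum(θ_μ .* x_μ)
    σ = exp(sum(θ_σ .* x_σ))
    return gaussian_logpdf(a, μ, σ)
end

# Closed-form eligibility vector from the derivation, as two separate parts.
function gaussian_eligibility(a, θ_μ, θ_σ, x_μ, x_σ)
    μ = sum(θ_μ .* x_μ)
    σ = exp(sum(θ_σ .* x_σ))
    grad_μ = ((a - μ) / σ^2) .* x_μ
    grad_σ = ((a - μ)^2 / σ^2 - 1) .* x_σ
    return (grad_μ, grad_σ)
end

# Central finite difference of f with respect to each element of θ.
function finite_difference(f, θ; ϵ = 1e-6)
    map(eachindex(θ)) do i
        shift = zeros(length(θ))
        shift[i] = ϵ
        (f(θ .+ shift) - f(θ .- shift)) / (2ϵ)
    end
end

θ_μ, θ_σ = [0.3, -0.2], [0.1, 0.4]       # made-up parameters
x_μ, x_σ = [1.0, 2.0], [0.5, -1.0]       # made-up feature vectors
a = 0.7

grad_μ, grad_σ = gaussian_eligibility(a, θ_μ, θ_σ, x_μ, x_σ)
fd_μ = finite_difference(θ -> log_policy(a, θ, θ_σ, x_μ, x_σ), θ_μ)
fd_σ = finite_difference(θ -> log_policy(a, θ_μ, θ, x_μ, x_σ), θ_σ)
```

If the derivation is right, `grad_μ` should agree with `fd_μ` and `grad_σ` with `fd_σ` up to finite-difference error. Concatenating the two parts gives the single eligibility vector described above.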

mimetext/htmlrootassigneelast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$beb01fb8-c77d-4b5c-a66d-3812415e04a3depends_on_disabled_cells§runtime4Qpublished_object_keysdepends_on_skipped_cells§errored$8bc280db-e57d-4e40-be46-1790f4f7d9e7queued¤logsrunning¦outputbodyDactor_critic_fcann_parameter_study (generic function with 2 methods)mimetext/plainrootassigneelast_run_timestampA /Rzpersist_js_state·has_pluto_hook_features§cell_id$8bc280db-e57d-4e40-be46-1790f4f7d9e7depends_on_disabled_cells§runtimez published_object_keysdepends_on_skipped_cellsçerrored$89901156-b874-416b-89c1-6dc434a4eb17queued¤logsrunning¦outputbodyG

REINFORCE Implementation

mimetext/htmlrootassigneelast_run_timestampA Fpersist_js_state·has_pluto_hook_features§cell_id$89901156-b874-416b-89c1-6dc434a4eb17depends_on_disabled_cells§runtimepublished_object_keysdepends_on_skipped_cells§errored$ff76ef94-fdf5-41f3-a31a-21c4629efabequeued¤logsrunning¦outputbodyprefixٗStateMDP{Float32, Int64, Symbol, StateMDPTransitionSampler{Float32, Int64, var"#step#1179"}, var"#1177#1180", var"#1178#1181", TabularRL.var"#164#169"}elementsactionsprefixSymbolelements:lefttext/plain:righttext/plaintypeArrayprefix_shortobjectid5c1f673d599d276!application/vnd.pluto.tree+objectptfprefix:StateMDPTransitionSampler{Float32, Int64, var"#step#1179"}elementsstepJ(::Main.var"workspace#8".var"#step#1179") (generic function with 1 method)text/plaintypestructprefix_shortStateMDPTransitionSamplerobjectidffffffff142bed64!application/vnd.pluto.tree+objectinitialize_stateҙ (generic function with 1 method)text/plainistermҚ (generic function with 1 method)text/plainis_valid_action%#164 (generic function with 1 method)text/plainaction_indexprefixDict{Symbol, Int64}elements:lefttext/plain1text/plain:righttext/plain2text/plaintypeDictprefix_shortDictobjectid9ee985827f4f600d!application/vnd.pluto.tree+objecttypestructprefix_shortStateMDPobjectid4a9335336ab44967mime!application/vnd.pluto.tree+objectrootassigneeconst corridor_mdplast_run_timestampA persist_js_state·has_pluto_hook_features§cell_id$ff76ef94-fdf5-41f3-a31a-21c4629efabedepends_on_disabled_cells§runtime published_object_keysdepends_on_skipped_cellsçerrored$581f7e9b-a5c2-4841-9605-85f9585b0274queued¤logsrunning¦outputbodyBupdate_linear_action_preferences! 
(generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA ]persist_js_state·has_pluto_hook_features§cell_id$581f7e9b-a5c2-4841-9605-85f9585b0274depends_on_disabled_cells§runtime GVpublished_object_keysdepends_on_skipped_cells§errored$8aa16866-bfda-48df-9cf1-cf3d2e203ccbqueued¤logsrunning¦outputbodyYcartpole_tilecoding_reinforce_continuous_parameter_study (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA 1Cpersist_js_state·has_pluto_hook_features§cell_id$8aa16866-bfda-48df-9cf1-cf3d2e203ccbdepends_on_disabled_cells§runtimeZcpublished_object_keysdepends_on_skipped_cellsçerrored$04b5929a-2058-49c9-963a-96c752a1d67dqueued¤logsrunning¦outputbody mimetext/htmlrootassigneelast_run_timestampA 3persist_js_state·has_pluto_hook_features§cell_id$04b5929a-2058-49c9-963a-96c752a1d67ddepends_on_disabled_cells§runtime published_object_keysdepends_on_skipped_cellsçerrored$f0104778-81a6-417b-8501-f916e5e7f3afqueued¤logsrunning¦outputbody=make_corridor_continuing_mdp (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA I_persist_js_state·has_pluto_hook_features§cell_id$f0104778-81a6-417b-8501-f916e5e7f3afdepends_on_disabled_cells§runtime!published_object_keysdepends_on_skipped_cells§errored$3e3c5897-809f-46e3-bb58-f115b082443equeued¤logsrunning¦outputbodybactor_critic_with_eligibility_traces_binary_features_beta_actions (generic function with 1 method)mimetext/plainrootassigneelast_run_timestampA /Ɖʰpersist_js_state·has_pluto_hook_features§cell_id$3e3c5897-809f-46e3-bb58-f115b082443edepends_on_disabled_cells§runtimeAJ4published_object_keysdepends_on_skipped_cells§errored$a9db3f85-ff56-4bbc-be87-47b893ef3b7bqueued¤logsrunning¦outputbody // We start by putting all the variable interpolation here at the beginning // Publish the plot object to JS let plot_obj = {"layout": {"template": {"layout": {"coloraxis": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "xaxis": {"gridcolor": "white", "zerolinewidth": 2, "title": 
{"standoff": 15}, "ticks": "", "zerolinecolor": "white", "automargin": true, "linecolor": "white"}, "hovermode": "closest", "paper_bgcolor": "white", "geo": {"showlakes": true, "showland": true, "landcolor": "#E5ECF6", "bgcolor": "white", "subunitcolor": "white", "lakecolor": "white"}, "colorscale": {"sequential": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]], "diverging": [[0, "#8e0152"], [0.1, "#c51b7d"], [0.2, "#de77ae"], [0.3, "#f1b6da"], [0.4, "#fde0ef"], [0.5, "#f7f7f7"], [0.6, "#e6f5d0"], [0.7, "#b8e186"], [0.8, "#7fbc41"], [0.9, "#4d9221"], [1, "#276419"]], "sequentialminus": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}, "yaxis": {"gridcolor": "white", "zerolinewidth": 2, "title": {"standoff": 15}, "ticks": "", "zerolinecolor": "white", "automargin": true, "linecolor": "white"}, "shapedefaults": {"line": {"color": "#2a3f5f"}}, "hoverlabel": {"align": "left"}, "mapbox": {"style": "light"}, "polar": {"angularaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "bgcolor": "#E5ECF6", "radialaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}}, "autotypenumbers": "strict", "font": {"color": "#2a3f5f"}, "ternary": {"baxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "bgcolor": "#E5ECF6", "caxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "aaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}}, "annotationdefaults": {"arrowhead": 0, "arrowwidth": 1, "arrowcolor": "#2a3f5f"}, "plot_bgcolor": 
"#E5ECF6", "title": {"x": 0.05}, "scene": {"xaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}, "zaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}, "yaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}}, "colorway": ["#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52"]}, "data": {"barpolar": [{"type": "barpolar", "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}}], "carpet": [{"aaxis": {"gridcolor": "white", "endlinecolor": "#2a3f5f", "minorgridcolor": "white", "startlinecolor": "#2a3f5f", "linecolor": "white"}, "type": "carpet", "baxis": {"gridcolor": "white", "endlinecolor": "#2a3f5f", "minorgridcolor": "white", "startlinecolor": "#2a3f5f", "linecolor": "white"}}], "scatterpolar": [{"type": "scatterpolar", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "parcoords": [{"line": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "type": "parcoords"}], "scatter": [{"type": "scatter", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram2dcontour": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "histogram2dcontour", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "contour": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "contour", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], 
[0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scattercarpet": [{"type": "scattercarpet", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "mesh3d": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "mesh3d"}], "surface": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "surface", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scattermapbox": [{"type": "scattermapbox", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scattergeo": [{"type": "scattergeo", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram": [{"type": "histogram", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "pie": [{"type": "pie", "automargin": true}], "choropleth": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "choropleth"}], "heatmapgl": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "heatmapgl", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "bar": [{"type": "bar", "error_y": {"color": "#2a3f5f"}, "error_x": {"color": "#2a3f5f"}, "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}}], "heatmap": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "heatmap", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], 
[0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "contourcarpet": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "contourcarpet"}], "table": [{"type": "table", "header": {"line": {"color": "white"}, "fill": {"color": "#C8D4E3"}}, "cells": {"line": {"color": "white"}, "fill": {"color": "#EBF0F8"}}}], "scatter3d": [{"line": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "type": "scatter3d", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scattergl": [{"type": "scattergl", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram2d": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "histogram2d", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scatterternary": [{"type": "scatterternary", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scatterpolargl": [{"type": "scatterpolargl", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}]}}, "margin": {"l": 50, "b": 50, "r": 50, "t": 60}}, "config": {"showLink": false, "editable": false, "responsive": true, "staticPlot": false, "scrollZoom": true}, "frames": [], "data": [{"y": [Infinity, 4.667032721489843, 2.352080488429527, 1.5759894085474497, 1.1865767661252582, 0.952329542347748, 0.7958488403711905, 0.6838903357678036, 0.5998022307539005, 0.5343196607298939, 0.481876506359228, 0.43892664782610735, 0.4031035909389137, 0.3727674794521093, 0.3467459986828037, 0.3241787831808724, 0.3044200943416014, 0.28697580091461466, 0.27146133736832384, 0.25757292023624623, 0.24506738751259943, 0.23374779010782812, 0.2234529073255359, 0.21404949351780772, 0.20542646028339206, 0.19749045291363473, 
0.19016244616993805, 0.18337509544622463, 0.17707065470347907, 0.1711993245392484, 0.16571793015256217, 0.16058885480517326, 0.15577917295968144, 0.1512599407923877, 0.14700561172131324, 0.1429935519781872, 0.13920363679625566, 0.13561791198181308, 0.13222030884055902, 0.1289964028946112, 0.12593320873670621, 0.12301900485981125, 0.12024318347272628, 0.11759612123947868, 0.11506906761806092, 0.11265404806440946, 0.11034377984248032, 0.10813159856537688, 0.10601139390463747, 0.10397755315966713, 0.10202491158834057, 0.10014870857199591, 0.09834454883045635, 0.09660836802097619, 0.0949364021535767, 0.09332516033769218, 0.09177140044426524, 0.09027210732572798, 0.08882447328556972, 0.08742588053094563, 0.08607388537727942, 0.0847662040040796, 0.08350069958706603, 0.08227537065388697, 0.08108834052977568, 0.07993784775592509, 0.07882223737755151, 0.07773995301090596, 0.07668952960915504, 0.07566958685632605, 0.07467882312659452, 0.0737160099532511, 0.07277998695786166, 0.07186965719555248, 0.07098398287710962, 0.0701219814327709, 0.06928272188628098, 0.06846532151104168, 0.06766894274307564, 0.06689279032807831, 0.06613610868210289, 0.06539817944744149, 0.06467831922706141, 0.06397587748255844, 0.06329023458201889, 0.06262079998546165, 0.06196701055667709, 0.06132832899130668, 0.060704242351929394, 0.06009426070175016, 0.05949791582923129, 0.0589147600566812, 0.05834436512642086, 0.05778632115869659, 0.05724023567600365, 0.056705732688934064, 0.05618245183906824, 0.0556700475947998, 0.05516818849631744, 0.05467655644627305, 0.0541948460429423, 0.053722763952936625, 0.05326002832075602, 0.05280636821268242, 0.05236152309270511, 0.05192524232834598, 0.05149728472441248, 0.051077418082853834, 0.05066541878703102, 0.050261071408834686, 0.04986416833719888, 0.04947450942666355, 0.04909190166473398, 0.048716158856874775, 0.04834710132805668, 0.04798455563985037, 0.047628354322129966, 0.04727833561851367, 0.04693434324472758, 0.046596226159133586, 0.046263838344712566, 
0.045937038601841465, 0.04561569035124531, 0.04529966144654626, 0.04498882399586848, 0.04468305419199243, 0.04438223215058426, 0.04408624175605587, 0.043794970514638834, 0.04350830941428131, 0.043226152791001074, 0.042948398201350095, 0.04267494630066698, 0.042405700726813295, 0.042140567989107335, 0.04187945736218653, 0.04162228078454498, 0.0413689527615075, 0.04111939027241576, 0.040873512681814356, 0.04063124165443721, 0.040392501073805896, 0.04015721696426216, 0.03992531741626653, 0.039696732514804794, 0.03947139427075216, 0.039249236555053725, 0.039030195035587335, 0.03881420711658188, 0.03860121188047153, 0.03839115003207204, 0.03818396384497177, 0.037979597110035596, 0.03777799508592522, 0.037579104451544026, 0.03738287326032021, 0.037189250896245384, 0.036998188031590606, 0.036809636586225755, 0.036623549688471604, 0.036439881637417695, 0.03625858786664232, 0.036079624909274216, 0.03590295036433832, 0.03572852286433089, 0.03555630204397195, 0.035386248510085434, 0.03521832381255977, 0.035052490416344145, 0.03488871167443736, 0.03472695180182868, 0.03456717585035154, 0.03440934968441326, 0.034253439957565204, 0.034099414089879584, 0.033947240246100865, 0.03379688731454087, 0.03364832488668817, 0.03350152323750392, 0.0333564533063771, 0.03321308667871366, 0.033071395568135147, 0.03293135279926322, 0.03279293179106793, 0.03265610654075808, 0.0325208516081934, 0.032387142100798794, 0.0322549536589619, 0.032124262441896145, 0.03199504511395181, 0.03186727883135887, 0.031740941229385686, 0.03161601040989839, 0.031492464929306616, 0.031370283786881344, 0.031249446413431824, 0.03112993266032874, 0.031011722788861003, 0.030894797459914896, 0.03077913772396381, 0.030664725011357967, 0.030551541122903637, 0.030439568220721764, 0.030328788819376507, 0.030219185777264318, 0.030110742288254746, 0.030003441873574455, 0.029897268373926113, 0.02979220594183445, 0.029688239034211677, 0.02958535240513517, 0.029483531098830215, 0.02938276044285117, 0.02928302604145444, 
0.029184313769157155, 0.029086609764475243, 0.028989900423835417, 0.02889417239565525, 0.028799412574585966, 0.028705608095912896, 0.02861274633010843, 0.028520814877532743, 0.02842980156327758, 0.028339694432148663, 0.028250481743782377, 0.02816215196789255, 0.028074693779643334, 0.027988096055144258, 0.027902347867063783, 0.02781743848035765, 0.027733357348108598, 0.0276500941074741, 0.02756763857573873, 0.02748598074646827, 0.027405110785762198, 0.027325019028601922, 0.02724569597529179, 0.027167132287990087, 0.02708931878732754, 0.027012246449110586, 0.026935906401107084, 0.026860289919912017, 0.026785388427890865, 0.02671119349019843, 0.026637696811870965, 0.02656489023498947, 0.026492765735912242, 0.026421315422574573, 0.026350531531853794, 0.026280406426997836, 0.02621093259511549, 0.026142102644726693, 0.026073909303371102, 0.02600634541527347, 0.02593940393906413, 0.025873077945553147, 0.025807360615556708, 0.025742245237774207, 0.025677725206714783, 0.02561379402067193, 0.025550445279744836, 0.025487672683905303, 0.025425470031108913, 0.02536383121544939, 0.025302750225354928, 0.02524222114182535, 0.025182238136709202, 0.0251227954710195, 0.025063887493287296, 0.02500550863795202, 0.02494765342378762, 0.024890316452363646, 0.02483349240654034, 0.024777176048996825, 0.02472136222079166, 0.024666045839954766, 0.02461122190011018, 0.024556885469128556, 0.024503031687808954, 0.024449655768588964, 0.024396752994282588, 0.02434431871684517, 0.02429234835616463, 0.02424083739887848, 0.024189781397215858, 0.02413917596786403, 0.02408901679085884, 0.02403929960849835, 0.023990020224279235, 0.023941174501855397, 0.023892758364018152, 0.023844767791697623, 0.023797198822984644, 0.023750047552172915, 0.023703310128820727, 0.02365698275683186, 0.023611061693555327, 0.023565543248903267, 0.023520423784486814, 0.023475699712769398, 0.02343136749623704, 0.023387423646585335, 0.023343864723922744, 0.0233006873359897, 0.023257888137393292, 0.023215463828857124, 
0.023173411156485996, 0.02313172691104504, 0.023090407927253108, 0.02304945108308994, 0.02300885329911689, 0.022968611537810872, 0.022928722802911243, 0.022889184138779408, 0.02284999262977068, 0.022811145399618316, 0.02277263961082937, 0.0227344724640921, 0.02269664119769473, 0.02265914308695525, 0.022621975443662064, 0.02258513561552524, 0.022548620985638142, 0.02251242897194914, 0.022476557026743386, 0.02244100263613417, 0.022405763319563853, 0.022370836629314127, 0.02233622015002532, 0.022301911498224733, 0.022267908321863584, 0.022234208299862578, 0.022200809141665882, 0.02216770858680317, 0.022134904404459876, 0.02210239439305514, 0.02207017637982758, 0.022038248220428557, 0.0220066077985228, 0.021975253025396355, 0.021944181839571537, 0.021913392206428895, 0.021882882117835973, 0.02185264959178267, 0.021822692672023296, 0.021793009427724875, 0.02176359795312182, 0.021734456367176736, 0.021705582813247262, 0.021676975458758855, 0.021648632494883305, 0.021620552136222996, 0.02159273262050078, 0.021565172208255167, 0.021537869182541077, 0.02151082184863568, 0.021484028533749495, 0.021457487586742523, 0.021431197377845285, 0.021405156298384874, 0.021379362760515584, 0.02135381519695438, 0.021328512060720946, 0.021303451824882122, 0.02127863298230095, 0.021254054045389922, 0.02122971354586855, 0.021205610034525122, 0.02118174208098257, 0.02115810827346834, 0.021134707218588265, 0.021111537541104276, 0.021088597883715993, 0.02106588690684599, 0.021043403288428814, 0.02102114572370358, 0.020999112925010086, 0.020977303621588516, 0.020955716559382432, 0.020934350500845265, 0.020913204224750022, 0.020892276526002253, 0.02087156621545629, 0.020851072119734474, 0.02083079308104963, 0.02081072795703043, 0.020790875620549855, 0.020771234959556482, 0.020751804876908687, 0.020732584290211725, 0.020713572131657518, 0.020694767347867193, 0.020676168899736367, 0.02065777576228298, 0.02063958692449783, 0.020621601389197598, 0.020603818172880383, 0.020586236305583806, 
0.020568854830745397, 0.02055167280506554, 0.02053468929837266, 0.02051790339349078, 0.02050131418610934, 0.02048492078465529, 0.020468722310167376, 0.020452717896172617, 0.020436906688564883, 0.02042128784548568, 0.020405860537206878, 0.02039062394601562, 0.020375577266101175, 0.02036071970344374, 0.02034605047570535, 0.020331568812122485, 0.020317273953400803, 0.02030316515161161, 0.02028924167009018, 0.02027550278333597, 0.02026194777691453, 0.020248575947361246, 0.02023538660208679, 0.02022237905928431, 0.020209552647838264, 0.020196906707234993, 0.020184440587474886, 0.02017215364898616, 0.02016004526254031, 0.020148114809169074, 0.02013636168008294, 0.020124785276591273, 0.02011338501002391, 0.020102160301654216, 0.020091110582623722, 0.02008023529386807, 0.020069533886044603, 0.020059005819461167, 0.020048650564006475, 0.020038467599081785, 0.020028456413533974, 0.02001861650558995, 0.020008947382792436, 0.01999944856193704, 0.019990119569010673, 0.019980959939131217, 0.019971969216488514, 0.019963146954286602, 0.019954492714687185, 0.019946006068754386, 0.01993768659640067, 0.019929533886334033, 0.019921547536006338, 0.01991372715156286, 0.019906072347793048, 0.01989858274808236, 0.019891257984365338, 0.01988409769707977, 0.01987710153512198, 0.0198702691558033, 0.01986360022480756, 0.01985709441614975, 0.01985075141213575, 0.0198445709033231, 0.019838552588482928, 0.019832696174562844, 0.01982700137665095, 0.019821467917940903, 0.019816095529697923, 0.01981088395122598, 0.01980583292983587, 0.01980094222081435, 0.019796211587394332, 0.019791640800725974, 0.019787229639848893, 0.01978297789166522, 0.019778885350913777, 0.019774951820145124, 0.01977117710969763, 0.019767561037674516, 0.019764103429921808, 0.019760804120007282, 0.01975766294920036, 0.019754679766452937, 0.019751854428381153, 0.019749186799248076, 0.019746676750947406, 0.019744324162987985, 0.0197421289224793, 0.019740090924117947, 0.019738210070174914, 0.019736486270483855, 
0.019734919442430287, 0.019733509510941594, 0.019732256408478102, 0.019731160075024907, 0.019730220458084716, 0.019729437512671526, 0.019728811201305235, 0.019728341494007158, 0.019728028368296416, 0.019727871809187256, 0.019727871809187256, 0.019728028368296416, 0.019728341494007158, 0.019728811201305235, 0.019729437512671526, 0.019730220458084716, 0.019731160075024907, 0.019732256408478102, 0.019733509510941594, 0.019734919442430283, 0.019736486270483862, 0.019738210070174914, 0.019740090924117947, 0.0197421289224793, 0.019744324162987985, 0.019746676750947406, 0.019749186799248076, 0.019751854428381153, 0.01975467976645294, 0.01975766294920036, 0.019760804120007282, 0.019764103429921808, 0.019767561037674516, 0.01977117710969763, 0.01977495182014512, 0.019778885350913777, 0.01978297789166522, 0.019787229639848893, 0.019791640800725978, 0.019796211587394325, 0.01980094222081435, 0.01980583292983587, 0.01981088395122598, 0.01981609552969792, 0.019821467917940896, 0.019827001376650954, 0.019832696174562844, 0.019838552588482928, 0.019844570903323103, 0.01985075141213575, 0.019857094416149752, 0.01986360022480756, 0.0198702691558033, 0.019877101535121986, 0.01988409769707977, 0.019891257984365338, 0.01989858274808236, 0.019906072347793048, 0.01991372715156286, 0.01992154753600633, 0.019929533886334033, 0.01993768659640067, 0.019946006068754386, 0.01995449271468718, 0.0199631469542866, 0.019971969216488514, 0.019980959939131217, 0.019990119569010673, 0.01999944856193704, 0.020008947382792436, 0.02001861650558995, 0.020028456413533974, 0.020038467599081785, 0.020048650564006475, 0.02005900581946117, 0.020069533886044603, 0.02008023529386807, 0.020091110582623722, 0.020102160301654223, 0.020113385010023913, 0.020124785276591273, 0.02013636168008294, 0.020148114809169074, 0.020160045262540314, 0.020172153648986158, 0.020184440587474886, 0.020196906707234993, 0.020209552647838264, 0.02022237905928431, 0.020235386602086798, 0.020248575947361246, 0.02026194777691453, 
0.02027550278333597, 0.02028924167009018, 0.020303165151611607, 0.020317273953400803, 0.020331568812122485, 0.020346050475705348, 0.020360719703443747, 0.020375577266101168, 0.02039062394601562, 0.020405860537206878, 0.02042128784548568, 0.020436906688564883, 0.020452717896172617, 0.020468722310167376, 0.02048492078465529, 0.02050131418610934, 0.02051790339349078, 0.02053468929837266, 0.02055167280506554, 0.020568854830745397, 0.020586236305583802, 0.020603818172880383, 0.020621601389197598, 0.02063958692449783, 0.02065777576228298, 0.020676168899736364, 0.020694767347867196, 0.020713572131657518, 0.020732584290211725, 0.020751804876908687, 0.020771234959556475, 0.020790875620549855, 0.02081072795703043, 0.02083079308104963, 0.020851072119734474, 0.02087156621545629, 0.020892276526002257, 0.020913204224750022, 0.020934350500845265, 0.020955716559382432, 0.020977303621588516, 0.02099911292501009, 0.02102114572370358, 0.021043403288428814, 0.02106588690684599, 0.02108859788371599, 0.021111537541104276, 0.021134707218588265, 0.02115810827346834, 0.02118174208098257, 0.021205610034525122, 0.021229713545868552, 0.021254054045389922, 0.02127863298230095, 0.021303451824882122, 0.021328512060720946, 0.021353815196954385, 0.021379362760515584, 0.021405156298384874, 0.021431197377845285, 0.021457487586742516, 0.021484028533749498, 0.02151082184863568, 0.021537869182541077, 0.021565172208255167, 0.021592732620500776, 0.021620552136223, 0.021648632494883305, 0.021676975458758855, 0.021705582813247266, 0.021734456367176736, 0.021763597953121824, 0.021793009427724875, 0.021822692672023296, 0.02185264959178267, 0.021882882117835973, 0.021913392206428906, 0.021944181839571537, 0.021975253025396355, 0.0220066077985228, 0.022038248220428557, 0.022070176379827583, 0.02210239439305514, 0.022134904404459876, 0.022167708586803173, 0.022200809141665882, 0.022234208299862585, 0.022267908321863584, 0.022301911498224733, 0.02233622015002532, 0.022370836629314127, 0.022405763319563853, 
0.02244100263613417, 0.022476557026743386, 0.02251242897194914, 0.022548620985638142, 0.02258513561552524, 0.022621975443662064, 0.02265914308695525, 0.02269664119769473, 0.0227344724640921, 0.02277263961082937, 0.022811145399618316, 0.02284999262977068, 0.022889184138779408, 0.022928722802911243, 0.022968611537810872, 0.02300885329911689, 0.02304945108308994, 0.023090407927253108, 0.02313172691104504, 0.023173411156485996, 0.023215463828857124, 0.023257888137393292, 0.0233006873359897, 0.023343864723922744, 0.023387423646585335, 0.02343136749623704, 0.023475699712769398, 0.023520423784486814, 0.023565543248903267, 0.023611061693555327, 0.02365698275683186, 0.02370331012882072, 0.023750047552172915, 0.023797198822984644, 0.023844767791697623, 0.023892758364018152, 0.023941174501855393, 0.02399002022427923, 0.02403929960849835, 0.02408901679085884, 0.02413917596786403, 0.02418978139721585, 0.024240837398878477, 0.02429234835616463, 0.02434431871684517, 0.024396752994282588, 0.024449655768588964, 0.024503031687808957, 0.02455688546912856, 0.02461122190011018, 0.024666045839954766, 0.024721362220791653, 0.02477717604899683, 0.024833492406540342, 0.024890316452363646, 0.02494765342378762, 0.02500550863795202, 0.025063887493287303, 0.0251227954710195, 0.025182238136709202, 0.02524222114182535, 0.02530275022535492, 0.025363831215449394, 0.025425470031108913, 0.025487672683905303, 0.025550445279744836, 0.025613794020671925, 0.025677725206714783, 0.025742245237774207, 0.025807360615556708, 0.025873077945553147, 0.025939403939064118, 0.026006345415273475, 0.026073909303371102, 0.026142102644726693, 0.02621093259511549, 0.026280406426997832, 0.026350531531853797, 0.026421315422574573, 0.026492765735912242, 0.02656489023498947, 0.02663769681187096, 0.026711193490198435, 0.026785388427890865, 0.026860289919912017, 0.026935906401107084, 0.02701224644911058, 0.027089318787327545, 0.027167132287990094, 0.02724569597529179, 0.02732501902860192, 0.027405110785762195, 
0.02748598074646827, 0.02756763857573873, 0.0276500941074741, 0.027733357348108598, 0.027817438480357646, 0.027902347867063783, 0.027988096055144258, 0.028074693779643334, 0.028162151967892547, 0.028250481743782373, 0.028339694432148663, 0.02842980156327758, 0.028520814877532743, 0.02861274633010843, 0.028705608095912893, 0.028799412574585972, 0.02889417239565525, 0.028989900423835417, 0.029086609764475236, 0.02918431376915716, 0.029283026041454448, 0.02938276044285117, 0.029483531098830215, 0.029585352405135164, 0.029688239034211684, 0.02979220594183445, 0.029897268373926113, 0.03000344187357445, 0.03011074228825474, 0.03021918577726432, 0.03032878881937651, 0.030439568220721764, 0.030551541122903633, 0.03066472501135796, 0.030779137723963818, 0.030894797459914903, 0.031011722788861003, 0.031129932660328735, 0.031249446413431824, 0.031370283786881344, 0.03149246492930662, 0.03161601040989839, 0.03174094122938568, 0.03186727883135887, 0.03199504511395182, 0.03212426244189615, 0.0322549536589619, 0.032387142100798794, 0.032520851608193395, 0.03265610654075809, 0.03279293179106793, 0.03293135279926322, 0.03307139556813514, 0.033213086678713664, 0.033356453306377105, 0.03350152323750393, 0.03364832488668817, 0.03379688731454087, 0.03394724024610086, 0.03409941408987959, 0.034253439957565204, 0.03440934968441326, 0.03456717585035153, 0.03472695180182867, 0.03488871167443738, 0.03505249041634415, 0.03521832381255977, 0.03538624851008543, 0.03555630204397195, 0.0357285228643309, 0.03590295036433832, 0.036079624909274216, 0.036258587866642315, 0.03643988163741768, 0.03662354968847162, 0.036809636586225755, 0.036998188031590606, 0.03718925089624538, 0.03738287326032022, 0.037579104451544026, 0.03777799508592522, 0.037979597110035596, 0.03818396384497176, 0.03839115003207205, 0.03860121188047153, 0.03881420711658188, 0.03903019503558733, 0.039249236555053725, 0.03947139427075218, 0.03969673251480481, 0.03992531741626653, 0.04015721696426215, 0.04039250107380589, 
0.04063124165443722, 0.04087351268181437, 0.04111939027241576, 0.04136895276150748, 0.04162228078454497, 0.04187945736218656, 0.04214056798910734, 0.042405700726813295, 0.042674946300666976, 0.04294839820135009, 0.0432261527910011, 0.04350830941428132, 0.043794970514638834, 0.04408624175605586, 0.04438223215058425, 0.04468305419199244, 0.04498882399586849, 0.04529966144654626, 0.0456156903512453, 0.04593703860184145, 0.046263838344712586, 0.04659622615913359, 0.04693434324472758, 0.04727833561851366, 0.047628354322129945, 0.0479845556398504, 0.04834710132805669, 0.048716158856874775, 0.04909190166473397, 0.04947450942666354, 0.0498641683371989, 0.05026107140883469, 0.05066541878703102, 0.05107741808285381, 0.05149728472441245, 0.05192524232834599, 0.052361523092705115, 0.05280636821268241, 0.053260028320756006, 0.0537227639529366, 0.054194846042942314, 0.05467655644627305, 0.05516818849631743, 0.055670047594799786, 0.056182451839068205, 0.05670573268893408, 0.057240235676003656, 0.05778632115869659, 0.05834436512642085, 0.05891476005668123, 0.059497915829231314, 0.060094260701750175, 0.06070424235192939, 0.06132832899130665, 0.06196701055667712, 0.06262079998546166, 0.0632902345820189, 0.06397587748255842, 0.06467831922706137, 0.06539817944744152, 0.06613610868210291, 0.06689279032807831, 0.06766894274307562, 0.06846532151104165, 0.06928272188628103, 0.07012198143277093, 0.07098398287710962, 0.07186965719555247, 0.07277998695786163, 0.07371600995325114, 0.07467882312659455, 0.07566958685632605, 0.07668952960915502, 0.0777399530109059, 0.07882223737755155, 0.07993784775592512, 0.08108834052977568, 0.08227537065388693, 0.08350069958706596, 0.08476620400407965, 0.08607388537727945, 0.08742588053094563, 0.0888244732855697, 0.09027210732572791, 0.0917714004442653, 0.09332516033769221, 0.0949364021535767, 0.09660836802097614, 0.09834454883045624, 0.100148708571996, 0.10202491158834061, 0.1039775531596671, 0.10601139390463742, 0.10813159856537677, 0.11034377984248044, 
0.11265404806440948, 0.1150690676180609, 0.11759612123947859, 0.12024318347272615, 0.12301900485981138, 0.12593320873670627, 0.12899640289461117, 0.13222030884055894, 0.1356179119818129, 0.1392036367962558, 0.14299355197818725, 0.1470056117213132, 0.15125994079238753, 0.15577917295968122, 0.16058885480517346, 0.16571793015256223, 0.17119932453924833, 0.17707065470347888, 0.18337509544622496, 0.19016244616993824, 0.19749045291363485, 0.20542646028339195, 0.21404949351780742, 0.2234529073255364, 0.23374779010782848, 0.24506738751259952, 0.25757292023624606, 0.2714613373683233, 0.28697580091461555, 0.3044200943416019, 0.32417878318087257, 0.34674599868280326, 0.37276747945210814, 0.40310359093891523, 0.43892664782610835, 0.4818765063592282, 0.5343196607298928, 0.5998022307538973, 0.6838903357678081, 0.7958488403711937, 0.9523295423477482, 1.186576766125252, 1.5759894085474269, 2.3520804884295794, 4.667032721489947, Infinity], "type": "scatter", "x": [0.0, 0.001001001001001001, 0.002002002002002002, 0.003003003003003003, 0.004004004004004004, 0.005005005005005005, 0.006006006006006006, 0.007007007007007007, 0.008008008008008008, 0.009009009009009009, 0.01001001001001001, 0.011011011011011011, 0.012012012012012012, 0.013013013013013013, 0.014014014014014014, 0.015015015015015015, 0.016016016016016016, 0.01701701701701702, 0.018018018018018018, 0.01901901901901902, 0.02002002002002002, 0.021021021021021023, 0.022022022022022022, 0.023023023023023025, 0.024024024024024024, 0.025025025025025027, 0.026026026026026026, 0.02702702702702703, 0.028028028028028028, 0.02902902902902903, 0.03003003003003003, 0.031031031031031032, 0.03203203203203203, 0.03303303303303303, 0.03403403403403404, 0.035035035035035036, 0.036036036036036036, 0.037037037037037035, 0.03803803803803804, 0.03903903903903904, 0.04004004004004004, 0.04104104104104104, 0.042042042042042045, 0.043043043043043044, 0.044044044044044044, 0.04504504504504504, 0.04604604604604605, 0.04704704704704705, 
0.04804804804804805, 0.04904904904904905, 0.05005005005005005, 0.05105105105105105, 0.05205205205205205, 0.05305305305305305, 0.05405405405405406, 0.055055055055055056, 0.056056056056056056, 0.057057057057057055, 0.05805805805805806, 0.05905905905905906, 0.06006006006006006, 0.06106106106106106, 0.062062062062062065, 0.06306306306306306, 0.06406406406406406, 0.06506506506506507, 0.06606606606606606, 0.06706706706706707, 0.06806806806806807, 0.06906906906906907, 0.07007007007007007, 0.07107107107107107, 0.07207207207207207, 0.07307307307307308, 0.07407407407407407, 0.07507507507507508, 0.07607607607607608, 0.07707707707707707, 0.07807807807807808, 0.07907907907907907, 0.08008008008008008, 0.08108108108108109, 0.08208208208208208, 0.08308308308308308, 0.08408408408408409, 0.08508508508508508, 0.08608608608608609, 0.08708708708708708, 0.08808808808808809, 0.0890890890890891, 0.09009009009009009, 0.09109109109109109, 0.0920920920920921, 0.09309309309309309, 0.0940940940940941, 0.09509509509509509, 0.0960960960960961, 0.0970970970970971, 0.0980980980980981, 0.0990990990990991, 0.1001001001001001, 0.1011011011011011, 0.1021021021021021, 0.1031031031031031, 0.1041041041041041, 0.10510510510510511, 0.1061061061061061, 0.10710710710710711, 0.10810810810810811, 0.1091091091091091, 0.11011011011011011, 0.1111111111111111, 0.11211211211211211, 0.11311311311311312, 0.11411411411411411, 0.11511511511511512, 0.11611611611611612, 0.11711711711711711, 0.11811811811811812, 0.11911911911911911, 0.12012012012012012, 0.12112112112112113, 0.12212212212212212, 0.12312312312312312, 0.12412412412412413, 0.12512512512512514, 0.12612612612612611, 0.12712712712712712, 0.12812812812812813, 0.12912912912912913, 0.13013013013013014, 0.13113113113113112, 0.13213213213213212, 0.13313313313313313, 0.13413413413413414, 0.13513513513513514, 0.13613613613613615, 0.13713713713713713, 0.13813813813813813, 0.13913913913913914, 0.14014014014014015, 0.14114114114114115, 0.14214214214214213, 
0.14314314314314314, 0.14414414414414414, 0.14514514514514515, 0.14614614614614616, 0.14714714714714713, 0.14814814814814814, 0.14914914914914915, 0.15015015015015015, 0.15115115115115116, 0.15215215215215216, 0.15315315315315314, 0.15415415415415415, 0.15515515515515516, 0.15615615615615616, 0.15715715715715717, 0.15815815815815815, 0.15915915915915915, 0.16016016016016016, 0.16116116116116116, 0.16216216216216217, 0.16316316316316315, 0.16416416416416416, 0.16516516516516516, 0.16616616616616617, 0.16716716716716717, 0.16816816816816818, 0.16916916916916916, 0.17017017017017017, 0.17117117117117117, 0.17217217217217218, 0.17317317317317318, 0.17417417417417416, 0.17517517517517517, 0.17617617617617617, 0.17717717717717718, 0.1781781781781782, 0.17917917917917917, 0.18018018018018017, 0.18118118118118118, 0.18218218218218218, 0.1831831831831832, 0.1841841841841842, 0.18518518518518517, 0.18618618618618618, 0.1871871871871872, 0.1881881881881882, 0.1891891891891892, 0.19019019019019018, 0.19119119119119118, 0.1921921921921922, 0.1931931931931932, 0.1941941941941942, 0.19519519519519518, 0.1961961961961962, 0.1971971971971972, 0.1981981981981982, 0.1991991991991992, 0.2002002002002002, 0.2012012012012012, 0.2022022022022022, 0.2032032032032032, 0.2042042042042042, 0.20520520520520522, 0.2062062062062062, 0.2072072072072072, 0.2082082082082082, 0.2092092092092092, 0.21021021021021022, 0.21121121121121122, 0.2122122122122122, 0.2132132132132132, 0.21421421421421422, 0.21521521521521522, 0.21621621621621623, 0.2172172172172172, 0.2182182182182182, 0.21921921921921922, 0.22022022022022023, 0.22122122122122123, 0.2222222222222222, 0.22322322322322322, 0.22422422422422422, 0.22522522522522523, 0.22622622622622623, 0.22722722722722724, 0.22822822822822822, 0.22922922922922923, 0.23023023023023023, 0.23123123123123124, 0.23223223223223224, 0.23323323323323322, 0.23423423423423423, 0.23523523523523523, 0.23623623623623624, 0.23723723723723725, 0.23823823823823823, 
0.23923923923923923, 0.24024024024024024, 0.24124124124124124, 0.24224224224224225, 0.24324324324324326, 0.24424424424424424, 0.24524524524524524, 0.24624624624624625, 0.24724724724724725, 0.24824824824824826, 0.24924924924924924, 0.2502502502502503, 0.25125125125125125, 0.25225225225225223, 0.25325325325325326, 0.25425425425425424, 0.2552552552552553, 0.25625625625625625, 0.25725725725725723, 0.25825825825825827, 0.25925925925925924, 0.2602602602602603, 0.26126126126126126, 0.26226226226226224, 0.26326326326326327, 0.26426426426426425, 0.2652652652652653, 0.26626626626626626, 0.2672672672672673, 0.2682682682682683, 0.26926926926926925, 0.2702702702702703, 0.27127127127127126, 0.2722722722722723, 0.2732732732732733, 0.27427427427427425, 0.2752752752752753, 0.27627627627627627, 0.2772772772772773, 0.2782782782782783, 0.27927927927927926, 0.2802802802802803, 0.28128128128128127, 0.2822822822822823, 0.2832832832832833, 0.28428428428428426, 0.2852852852852853, 0.2862862862862863, 0.2872872872872873, 0.2882882882882883, 0.28928928928928926, 0.2902902902902903, 0.2912912912912913, 0.2922922922922923, 0.2932932932932933, 0.29429429429429427, 0.2952952952952953, 0.2962962962962963, 0.2972972972972973, 0.2982982982982983, 0.2992992992992993, 0.3003003003003003, 0.3013013013013013, 0.3023023023023023, 0.3033033033033033, 0.30430430430430433, 0.3053053053053053, 0.3063063063063063, 0.3073073073073073, 0.3083083083083083, 0.30930930930930933, 0.3103103103103103, 0.3113113113113113, 0.3123123123123123, 0.3133133133133133, 0.31431431431431434, 0.3153153153153153, 0.3163163163163163, 0.3173173173173173, 0.3183183183183183, 0.31931931931931934, 0.3203203203203203, 0.3213213213213213, 0.32232232232232233, 0.3233233233233233, 0.32432432432432434, 0.3253253253253253, 0.3263263263263263, 0.32732732732732733, 0.3283283283283283, 0.32932932932932935, 0.3303303303303303, 0.33133133133133136, 0.33233233233233234, 0.3333333333333333, 0.33433433433433435, 0.3353353353353353, 
0.33633633633633636, 0.33733733733733734, 0.3383383383383383, 0.33933933933933935, 0.34034034034034033, 0.34134134134134136, 0.34234234234234234, 0.3433433433433433, 0.34434434434434436, 0.34534534534534533, 0.34634634634634637, 0.34734734734734735, 0.3483483483483483, 0.34934934934934936, 0.35035035035035034, 0.35135135135135137, 0.35235235235235235, 0.3533533533533533, 0.35435435435435436, 0.35535535535535534, 0.3563563563563564, 0.35735735735735735, 0.35835835835835833, 0.35935935935935936, 0.36036036036036034, 0.3613613613613614, 0.36236236236236236, 0.3633633633633634, 0.36436436436436437, 0.36536536536536535, 0.3663663663663664, 0.36736736736736736, 0.3683683683683684, 0.36936936936936937, 0.37037037037037035, 0.3713713713713714, 0.37237237237237236, 0.3733733733733734, 0.3743743743743744, 0.37537537537537535, 0.3763763763763764, 0.37737737737737737, 0.3783783783783784, 0.3793793793793794, 0.38038038038038036, 0.3813813813813814, 0.38238238238238237, 0.3833833833833834, 0.3843843843843844, 0.38538538538538536, 0.3863863863863864, 0.38738738738738737, 0.3883883883883884, 0.3893893893893894, 0.39039039039039036, 0.3913913913913914, 0.3923923923923924, 0.3933933933933934, 0.3943943943943944, 0.3953953953953954, 0.3963963963963964, 0.3973973973973974, 0.3983983983983984, 0.3993993993993994, 0.4004004004004004, 0.4014014014014014, 0.4024024024024024, 0.4034034034034034, 0.4044044044044044, 0.40540540540540543, 0.4064064064064064, 0.4074074074074074, 0.4084084084084084, 0.4094094094094094, 0.41041041041041043, 0.4114114114114114, 0.4124124124124124, 0.4134134134134134, 0.4144144144144144, 0.41541541541541543, 0.4164164164164164, 0.4174174174174174, 0.4184184184184184, 0.4194194194194194, 0.42042042042042044, 0.4214214214214214, 0.42242242242242245, 0.42342342342342343, 0.4244244244244244, 0.42542542542542544, 0.4264264264264264, 0.42742742742742745, 0.42842842842842843, 0.4294294294294294, 0.43043043043043044, 0.4314314314314314, 0.43243243243243246, 
0.43343343343343343, 0.4344344344344344, 0.43543543543543545, 0.4364364364364364, 0.43743743743743746, 0.43843843843843844, 0.4394394394394394, 0.44044044044044045, 0.44144144144144143, 0.44244244244244246, 0.44344344344344344, 0.4444444444444444, 0.44544544544544545, 0.44644644644644643, 0.44744744744744747, 0.44844844844844844, 0.4494494494494494, 0.45045045045045046, 0.45145145145145144, 0.45245245245245247, 0.45345345345345345, 0.4544544544544545, 0.45545545545545546, 0.45645645645645644, 0.4574574574574575, 0.45845845845845845, 0.4594594594594595, 0.46046046046046046, 0.46146146146146144, 0.4624624624624625, 0.46346346346346345, 0.4644644644644645, 0.46546546546546547, 0.46646646646646645, 0.4674674674674675, 0.46846846846846846, 0.4694694694694695, 0.47047047047047047, 0.47147147147147145, 0.4724724724724725, 0.47347347347347346, 0.4744744744744745, 0.4754754754754755, 0.47647647647647645, 0.4774774774774775, 0.47847847847847846, 0.4794794794794795, 0.4804804804804805, 0.48148148148148145, 0.4824824824824825, 0.48348348348348347, 0.4844844844844845, 0.4854854854854855, 0.4864864864864865, 0.4874874874874875, 0.48848848848848847, 0.4894894894894895, 0.4904904904904905, 0.4914914914914915, 0.4924924924924925, 0.4934934934934935, 0.4944944944944945, 0.4954954954954955, 0.4964964964964965, 0.4974974974974975, 0.4984984984984985, 0.4994994994994995, 0.5005005005005005, 0.5015015015015015, 0.5025025025025025, 0.5035035035035035, 0.5045045045045045, 0.5055055055055055, 0.5065065065065065, 0.5075075075075075, 0.5085085085085085, 0.5095095095095095, 0.5105105105105106, 0.5115115115115115, 0.5125125125125125, 0.5135135135135135, 0.5145145145145145, 0.5155155155155156, 0.5165165165165165, 0.5175175175175175, 0.5185185185185185, 0.5195195195195195, 0.5205205205205206, 0.5215215215215215, 0.5225225225225225, 0.5235235235235235, 0.5245245245245245, 0.5255255255255256, 0.5265265265265265, 0.5275275275275275, 0.5285285285285285, 0.5295295295295295, 0.5305305305305306, 
0.5315315315315315, 0.5325325325325325, 0.5335335335335335, 0.5345345345345346, 0.5355355355355356, 0.5365365365365365, 0.5375375375375375, 0.5385385385385385, 0.5395395395395396, 0.5405405405405406, 0.5415415415415415, 0.5425425425425425, 0.5435435435435435, 0.5445445445445446, 0.5455455455455456, 0.5465465465465466, 0.5475475475475475, 0.5485485485485485, 0.5495495495495496, 0.5505505505505506, 0.5515515515515516, 0.5525525525525525, 0.5535535535535535, 0.5545545545545546, 0.5555555555555556, 0.5565565565565566, 0.5575575575575575, 0.5585585585585585, 0.5595595595595596, 0.5605605605605606, 0.5615615615615616, 0.5625625625625625, 0.5635635635635635, 0.5645645645645646, 0.5655655655655656, 0.5665665665665666, 0.5675675675675675, 0.5685685685685685, 0.5695695695695696, 0.5705705705705706, 0.5715715715715716, 0.5725725725725725, 0.5735735735735735, 0.5745745745745746, 0.5755755755755756, 0.5765765765765766, 0.5775775775775776, 0.5785785785785785, 0.5795795795795796, 0.5805805805805806, 0.5815815815815816, 0.5825825825825826, 0.5835835835835835, 0.5845845845845846, 0.5855855855855856, 0.5865865865865866, 0.5875875875875876, 0.5885885885885885, 0.5895895895895896, 0.5905905905905906, 0.5915915915915916, 0.5925925925925926, 0.5935935935935935, 0.5945945945945946, 0.5955955955955956, 0.5965965965965966, 0.5975975975975976, 0.5985985985985987, 0.5995995995995996, 0.6006006006006006, 0.6016016016016016, 0.6026026026026026, 0.6036036036036037, 0.6046046046046046, 0.6056056056056056, 0.6066066066066066, 0.6076076076076076, 0.6086086086086087, 0.6096096096096096, 0.6106106106106106, 0.6116116116116116, 0.6126126126126126, 0.6136136136136137, 0.6146146146146146, 0.6156156156156156, 0.6166166166166166, 0.6176176176176176, 0.6186186186186187, 0.6196196196196196, 0.6206206206206206, 0.6216216216216216, 0.6226226226226226, 0.6236236236236237, 0.6246246246246246, 0.6256256256256256, 0.6266266266266266, 0.6276276276276276, 0.6286286286286287, 0.6296296296296297, 0.6306306306306306, 
0.6316316316316316, 0.6326326326326326, 0.6336336336336337, 0.6346346346346347, 0.6356356356356356, 0.6366366366366366, 0.6376376376376376, 0.6386386386386387, 0.6396396396396397, 0.6406406406406406, 0.6416416416416416, 0.6426426426426426, 0.6436436436436437, 0.6446446446446447, 0.6456456456456456, 0.6466466466466466, 0.6476476476476476, 0.6486486486486487, 0.6496496496496497, 0.6506506506506506, 0.6516516516516516, 0.6526526526526526, 0.6536536536536537, 0.6546546546546547, 0.6556556556556556, 0.6566566566566566, 0.6576576576576577, 0.6586586586586587, 0.6596596596596597, 0.6606606606606606, 0.6616616616616616, 0.6626626626626627, 0.6636636636636637, 0.6646646646646647, 0.6656656656656657, 0.6666666666666666, 0.6676676676676677, 0.6686686686686687, 0.6696696696696697, 0.6706706706706707, 0.6716716716716716, 0.6726726726726727, 0.6736736736736737, 0.6746746746746747, 0.6756756756756757, 0.6766766766766766, 0.6776776776776777, 0.6786786786786787, 0.6796796796796797, 0.6806806806806807, 0.6816816816816816, 0.6826826826826827, 0.6836836836836837, 0.6846846846846847, 0.6856856856856857, 0.6866866866866866, 0.6876876876876877, 0.6886886886886887, 0.6896896896896897, 0.6906906906906907, 0.6916916916916916, 0.6926926926926927, 0.6936936936936937, 0.6946946946946947, 0.6956956956956957, 0.6966966966966966, 0.6976976976976977, 0.6986986986986987, 0.6996996996996997, 0.7007007007007007, 0.7017017017017017, 0.7027027027027027, 0.7037037037037037, 0.7047047047047047, 0.7057057057057057, 0.7067067067067067, 0.7077077077077077, 0.7087087087087087, 0.7097097097097097, 0.7107107107107107, 0.7117117117117117, 0.7127127127127127, 0.7137137137137137, 0.7147147147147147, 0.7157157157157157, 0.7167167167167167, 0.7177177177177178, 0.7187187187187187, 0.7197197197197197, 0.7207207207207207, 0.7217217217217218, 0.7227227227227228, 0.7237237237237237, 0.7247247247247247, 0.7257257257257257, 0.7267267267267268, 0.7277277277277278, 0.7287287287287287, 0.7297297297297297, 0.7307307307307307, 
0.7317317317317318, 0.7327327327327328, 0.7337337337337337, 0.7347347347347347, 0.7357357357357357, 0.7367367367367368, 0.7377377377377378, 0.7387387387387387, 0.7397397397397397, 0.7407407407407407, 0.7417417417417418, 0.7427427427427428, 0.7437437437437437, 0.7447447447447447, 0.7457457457457457, 0.7467467467467468, 0.7477477477477478, 0.7487487487487487, 0.7497497497497497, 0.7507507507507507, 0.7517517517517518, 0.7527527527527528, 0.7537537537537538, 0.7547547547547547, 0.7557557557557557, 0.7567567567567568, 0.7577577577577578, 0.7587587587587588, 0.7597597597597597, 0.7607607607607607, 0.7617617617617618, 0.7627627627627628, 0.7637637637637638, 0.7647647647647647, 0.7657657657657657, 0.7667667667667668, 0.7677677677677678, 0.7687687687687688, 0.7697697697697697, 0.7707707707707707, 0.7717717717717718, 0.7727727727727728, 0.7737737737737738, 0.7747747747747747, 0.7757757757757757, 0.7767767767767768, 0.7777777777777778, 0.7787787787787788, 0.7797797797797797, 0.7807807807807807, 0.7817817817817818, 0.7827827827827828, 0.7837837837837838, 0.7847847847847848, 0.7857857857857858, 0.7867867867867868, 0.7877877877877878, 0.7887887887887888, 0.7897897897897898, 0.7907907907907908, 0.7917917917917918, 0.7927927927927928, 0.7937937937937938, 0.7947947947947948, 0.7957957957957958, 0.7967967967967968, 0.7977977977977978, 0.7987987987987988, 0.7997997997997998, 0.8008008008008008, 0.8018018018018018, 0.8028028028028028, 0.8038038038038038, 0.8048048048048048, 0.8058058058058059, 0.8068068068068068, 0.8078078078078078, 0.8088088088088088, 0.8098098098098098, 0.8108108108108109, 0.8118118118118118, 0.8128128128128128, 0.8138138138138138, 0.8148148148148148, 0.8158158158158159, 0.8168168168168168, 0.8178178178178178, 0.8188188188188188, 0.8198198198198198, 0.8208208208208209, 0.8218218218218218, 0.8228228228228228, 0.8238238238238238, 0.8248248248248248, 0.8258258258258259, 0.8268268268268268, 0.8278278278278278, 0.8288288288288288, 0.8298298298298298, 0.8308308308308309, 

Continuous Action Space

Now that we have verified that policy gradient methods succeed on this problem, we can consider a continuous action space in which the policy outputs a distribution over throttle values. In the original problem the throttle magnitude is at most 1, and the car's speed is separately capped at 0.07. If the throttle is left unbounded, we can check whether a learned policy chooses much larger throttle values to end the episode sooner, even though the resulting physics is unrealistic. Observing this would confirm that the continuous-action method is working. Because of the speed cap, the optimal policy would likely use the largest throttle available so that the car reaches its maximum speed in either direction as quickly as possible. To make the problem more realistic, we could add friction so that the car slips if it tries to accelerate too quickly.
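
As a rough illustration of why an unbounded throttle would be attractive, the sketch below assumes the standard textbook mountain car dynamics (velocity update `v + 0.001 * throttle - 0.0025 * cos(3x)`, position limited to the interval from -1.2 to 0.5, goal at 0.5). The notebook's own environment may differ, and `mountaincar_step` and `steps_to_goal` are hypothetical names used only for this example.

```julia
# Hypothetical helper, not taken from the notebook: one step of mountain car
# with an unbounded throttle, ASSUMING the standard textbook dynamics.
function mountaincar_step(x::Real, v::Real, throttle::Real)
    # The speed cap of 0.07 applies no matter how large the throttle is
    v_new = clamp(v + 0.001 * throttle - 0.0025 * cos(3 * x), -0.07, 0.07)
    x_new = clamp(x + v_new, -1.2, 0.5)
    # Assumed convention: hitting the left wall resets the velocity to zero
    if x_new == -1.2
        v_new = 0.0
    end
    return (x_new, v_new)
end

# Count the steps needed to reach the goal when always pushing right
# with a fixed throttle; returns max_steps if the goal is never reached.
function steps_to_goal(throttle::Real; x = -0.5, v = 0.0, max_steps = 10_000)
    for n in 1:max_steps
        x, v = mountaincar_step(x, v, throttle)
        x >= 0.5 && return n
    end
    return max_steps
end
```

Under these assumptions, no throttle can beat the limit set by the speed cap: covering the distance from -0.5 to 0.5 at 0.07 per step takes at least 15 steps, which is what a very large constant throttle such as `steps_to_goal(100.0)` achieves.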

Policy Probability for Left Action is 0.5 and Average Episode Length is 12.0009365

State Distribution Per Step Including Terminal State

mimetext/htmlrootassigneelast_run_timestampA ?Fpersist_js_state·has_pluto_hook_features§cell_id$9cf3dc5f-8a25-479f-93db-06e34f0d37a0depends_on_disabled_cells§runtimet_published_object_keysdepends_on_skipped_cellsçerrored$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70queued¤logsrunning¦outputbodyM mimetext/htmlrootassigneelast_run_timestampA 0Upersist_js_state·has_pluto_hook_features§cell_id$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70depends_on_disabled_cells§runtime 9[published_object_keysdepends_on_skipped_cellsçerrored$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19fqueued¤logsrunning¦outputbody`actor_critic_binary_episodic_squashed_gaussian_parameter_study (generic function with 2 methods)mimetext/plainrootassigneelast_run_timestampA /D,persist_js_state·has_pluto_hook_features§cell_id$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19fdepends_on_disabled_cells§runtime0 published_object_keysdepends_on_skipped_cells§errored$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72cqueued¤logsrunning¦outputbodyelementsvalue_functionvalue_functiontext/plaingreedy_policygreedy_policytext/plainhistoryelementsepisode_stepsprefixFloat32elementstypeArrayprefix_shortobjectidc648579b1e24c833!application/vnd.pluto.tree+objectstep_rewardsprefixFloat32elementstypeArrayprefix_shortobjectid3ce720620b625923!application/vnd.pluto.tree+objecttypeNamedTupleobjectidad66844189f9d9c6!application/vnd.pluto.tree+objecttypeNamedTupleobjectid5dcda75341102be0mime!application/vnd.pluto.tree+objectrootassigneecorridor_trainlast_run_timestampA lpersist_js_state·has_pluto_hook_features§cell_id$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72cdepends_on_disabled_cells§runtime@hpublished_object_keysdepends_on_skipped_cellsçerrored±cell_dependencies$4f96be72-ef3e-4e08-ac4c-be4271dcd14cprecedence_heuristic cell_id$4f96be72-ef3e-4e08-ac4c-be4271dcd14cdownstream_cells_mapupstream_cells_map$19dfabda-7049-4050-8662-0385529c0c5aprecedence_heuristic 
cell_id$19dfabda-7049-4050-8662-0385529c0c5adownstream_cells_mapsref_cartpole_binary$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2upstream_cells_map@md_strCore:PlutoUI$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70|>Base.get@bindSliderBasePlutoRunnerPlutoRunner.create_bondconfirmCore.applicablePlutoUI.combinegetindex$b71145a4-2614-4f62-bfd2-7d5d1fecec56precedence_heuristic cell_id$b71145a4-2614-4f62-bfd2-7d5d1fecec56downstream_cells_map%actor_critic_with_eligibility_traces!$05bfd818-bf4e-4bda-baa9-5ba647867097$68806899-9972-460a-9f11-daa708a9d610$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54$20776e09-7d9b-4db8-a060-7bceeec65b47$3e3c5897-809f-46e3-bb58-f115b082443e$05f120be-9695-4824-82fd-142a0df13098upstream_cells_mapzerotypeminzero_params!$e6cf9550-2e69-4b82-92cf-5e07a35490aaupdate_traces_with_gradient!$25be5dcf-be63-46c4-b6de-6cf79fa28fd0$056a8adc-92f4-4b33-90d9-4b3b4026bbbconeContinuousMDP$537270ba-122b-4f2b-880b-31d086766295Base.CoreLogging.!VectorReal'Base.CoreLogging.Base.fixup_stdlib_pathdeepcopy/@infoBase.invokelatestupdate_params_with_gradient!$b0a66a19-ee76-463b-a704-8fcee85444d0$a893a87b-2d07-4db5-9d1a-9da8646216f4$f55afa58-962d-4551-8d95-a5b467d61adfBase.CoreLogging.invokelatestBase.CoreLogging.===&form_state_and_policy_function_outputs$e7e49ff8-32df-48a4-afb2-462859592e92$11b9beea-b0cd-45eb-84c6-151728894df0#___this_pluto_module_nameIntegerFunctionBase<=Int64push!Base.CoreLogging.isa-bad_continuous_action$b966b248-fb4d-457d-90f6-114370846242+*Base.CoreLogging.>=$c0876a48-ea18-494d-8bfc-e2bceb73b417precedence_heuristic cell_id$c0876a48-ea18-494d-8bfc-e2bceb73b417downstream_cells_mapupstream_cells_mapplot_mountaincar_values$f9facbba-39d4-483e-9066-275603156db0!mountaincar_continuing_fcann_test$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091precedence_heuristic 
cell_id$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091downstream_cells_map#reinforce_monte_carlo_control_fcann$07ad517a-c2ac-4377-99fb-adb13d0f1d0cupstream_cells_mapFCANN$d963ff6d-f1b6-4799-aa0e-1ae100310d84FCANNParamsIntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84FCANN.initializeparams_saxeInt64VectorRealFunctionlengthreinforce_monte_carlo_control!$0ac7ea44-14f6-4e80-80f9-d6df8059bb38fillsetup_fcann_policy_arguments$0e9de19e-bcd4-40ac-9831-afb6cad38422$f4b6f10b-4cd0-4be6-98ec-4d4ffb696392precedence_heuristic cell_id$f4b6f10b-4cd0-4be6-98ec-4d4ffb696392downstream_cells_mapupstream_cells_map@md_strgetindex$9db9ff71-bee9-4bea-a45b-748f8517fed1precedence_heuristic cell_id$9db9ff71-bee9-4bea-a45b-748f8517fed1downstream_cells_mapupstream_cells_mapInt64corridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabe^%one_step_actor_critic_linear_features$57e5e12a-b722-4ea3-ab3b-e5711029e640update_corridor_features!$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811typemax$4634267b-5dea-4164-8bb2-1eb2fd4d7954precedence_heuristic cell_id$4634267b-5dea-4164-8bb2-1eb2fd4d7954downstream_cells_map!update_linear_eligibility_vector!$8e39bd15-862e-4941-88f9-2794b861a523$d1ed25e6-60c6-411f-a541-99986e5da2c5$57e5e12a-b722-4ea3-ab3b-e5711029e640$68806899-9972-460a-9f11-daa708a9d610upstream_cells_mapzeroBLASisless@inboundsonenothingVectorPlutoUI.combineconfirmSlidergetindex$ba41f521-4ee2-42a6-bf18-078bfa4b875eprecedence_heuristic cell_id$ba41f521-4ee2-42a6-bf18-078bfa4b875edownstream_cells_mapmake_n_param_dist_policy_params$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$d5020a8d-1dd7-403c-9d1f-665b95543943$20776e09-7d9b-4db8-a060-7bceeec65b47$3e3c5897-809f-46e3-bb58-f115b082443e$05f120be-9695-4824-82fd-142a0df13098$55ba8725-0ddf-4196-a41d-3f3c490a8d84$61949faa-8174-4b7b-8fbc-01d5f850b419$dd8e8cd2-7b41-46c4-8530-adefb7aea684$08505e88-9c23-4e95-91e3-d18bf5133dbc$87482ea5-5265-4e02-92c0-1a8bb44ff0f4upstream_cells_mapzerosNTuple*IntegerReal$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03precedence_heuristic 
cell_id$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03downstream_cells_map-cartpole_tilecoding_reinforce_parameter_studyupstream_cells_mapcartpole_setup$26880577-d267-4950-8725-7afe0d0402b6:|>setup_cartpole_problemmax_steps;reinforce_with_baseline_monte_carlo_control_binary_features$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbscatterplot/foldxt+isemptyMapmeanLayout$3c695d54-c30f-4f04-bd40-f5da53be2a95precedence_heuristic cell_id$3c695d54-c30f-4f04-bd40-f5da53be2a95downstream_cells_mapupstream_cells_map@md_strgetindex$0d45ae72-572f-4d17-83cf-9814f2854131precedence_heuristic cell_id$0d45ae72-572f-4d17-83cf-9814f2854131downstream_cells_map%mountaincar_binary_continuous_params2$0d93132d-5819-47dc-8cf2-462d480d9c3dupstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunnerCore.applicable@bindBase.getcreate_actor_critic_params_UI$a8b40b8f-051a-4e6f-a079-ece4f32873de$cd9c9eeb-c90d-4499-9503-7773d5250f47precedence_heuristic cell_id$cd9c9eeb-c90d-4499-9503-7773d5250f47downstream_cells_mapupstream_cells_mapmountaincar_continuous_mdp2$349631b2-4686-49a9-9f3a-1e4ad588b568&show_mountaincar_continuous_trajectory$b5319d8b-0420-4ebf-b603-ea0b93365ac1"mountaincar_continuous_test_train2$fee14dfe-c5ca-4126-a830-cc9d7eda5433$fd58402f-da65-44cf-b81a-e21192fd0e63precedence_heuristic cell_id$fd58402f-da65-44cf-b81a-e21192fd0e63downstream_cells_mapupstream_cells_map$cartpole_fcann_continuing_test_state$28ce6e60-59cf-408a-8081-b978507b3c72CartPoleStateplot_cartpole_policy$602a07dd-8928-4b44-97e5-01c5cbf38351cartpole_continuing_fcann_test$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27$8e39bd15-862e-4941-88f9-2794b861a523precedence_heuristic 
cell_id$8e39bd15-862e-4941-88f9-2794b861a523downstream_cells_map-reinforce_monte_carlo_control_linear_features$5720e942-d3f8-4329-83a8-8bcedf078b6aupstream_cells_mapzerosIntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84copyVectorReal!update_linear_eligibility_vector!$4634267b-5dea-4164-8bb2-1eb2fd4d7954Functionlength!update_linear_action_preferences!$581f7e9b-a5c2-4841-9605-85f9585b0274Matrixreinforce_monte_carlo_control!$0ac7ea44-14f6-4e80-80f9-d6df8059bb38$64900586-ef92-48e4-839e-ff952a46671bprecedence_heuristic cell_id$64900586-ef92-48e4-839e-ff952a46671bdownstream_cells_mapupstream_cells_map$fddef10c-7695-4596-9e16-987fd45a57e6precedence_heuristic cell_id$fddef10c-7695-4596-9e16-987fd45a57e6downstream_cells_map!setup_cartpole_continuous_problem$26880577-d267-4950-8725-7afe0d0402b6upstream_cells_maptile_coding_setup-deg2radTuple/create_cartpole_mdps$3c316495-bb6c-41e2-a38f-ba867a319fbbrand$e2b09af1-0f22-4f7f-b806-54fa522adb20precedence_heuristic cell_id$e2b09af1-0f22-4f7f-b806-54fa522adb20downstream_cells_mapupstream_cells_map$2be8a812-4f21-4fe8-a2de-50497db0345aprecedence_heuristic cell_id$2be8a812-4f21-4fe8-a2de-50497db0345adownstream_cells_mapupstream_cells_map@md_strgetindex$68806899-9972-460a-9f11-daa708a9d610precedence_heuristic 
cell_id$68806899-9972-460a-9f11-daa708a9d610downstream_cells_map4actor_critic_with_eligibility_traces_linear_features$11ea640c-3981-404d-87c6-4d3d0708a2b8$734573e5-547b-4dcc-89bb-412aa6cc42d6$ff4f977e-48df-4c12-845c-c245b4d39d6dupstream_cells_mapzerosIntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84update_linear_value_gradient!$1753b5ed-c00b-4b60-b492-822180778e8ccopyVector!update_linear_eligibility_vector!$4634267b-5dea-4164-8bb2-1eb2fd4d7954RealFunction!update_linear_action_preferences!$581f7e9b-a5c2-4841-9605-85f9585b0274Matrixlengthlinear_value_function$0bf3b988-b3fb-49d5-8dde-b25766596363%actor_critic_with_eligibility_traces!$266d2234-26c8-43f1-9e75-49440a230ed6$83640f5b-fe13-4ec1-98a0-67a56c189ba1$b71145a4-2614-4f62-bfd2-7d5d1fecec56$4da20fd7-b897-4f26-bf2a-f08d66ddf90f$189798b3-ec6b-48b9-918c-ee0f65935ab3precedence_heuristic cell_id$189798b3-ec6b-48b9-918c-ee0f65935ab3downstream_cells_mapupstream_cells_map@md_strgetindex$00152954-dc98-4120-b94b-2ea4d987832bprecedence_heuristic cell_id$00152954-dc98-4120-b94b-2ea4d987832bdownstream_cells_map!create_mountaincar_continuing_mdp$46fea69b-599e-46ab-8455-d2da865d9a8eupstream_cells_mapStateMDPTransitionSampler$d963ff6d-f1b6-4799-aa0e-1ae100310d84mountaincar_continuing_step$a9db3f85-ff56-4bbc-be87-47b893ef3b7bMountainCarTaskStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84$42d4600a-bf3c-45ac-b7f5-d23917713ff5precedence_heuristic cell_id$42d4600a-bf3c-45ac-b7f5-d23917713ff5downstream_cells_map(cartpole_continuing_fcann_network_params$50ae94c4-70f3-4215-82bd-eb2227c2badfupstream_cells_map@md_strCore:PlutoUI$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70|>Base.get@bindBasePlutoRunnerPlutoRunner.create_bondNumberFieldconfirmCore.applicablePlutoUI.combinegetindex$4e29c621-223e-4859-8e96-db04b967815aprecedence_heuristic 
cell_id$4e29c621-223e-4859-8e96-db04b967815adownstream_cells_map/setup_binary_squashed_gaussian_policy_arguments$05f120be-9695-4824-82fd-142a0df13098upstream_cells_map'BinarySquashedGaussianEligibilityVector$76fd79a2-2bc8-45f8-a243-48415118898aBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41randIntegerContinuousMDP$537270ba-122b-4f2b-880b-31d086766295RealFunctionupdate_binary_feature_vector!$8eab55a5-41b7-4f5e-a02f-4c19388bc9eamake_n_param_dist_params$76eb6743-cac0-4174-9ba3-a0691c200b54UnionNTuple$5981f52b-d829-4c7d-b47b-33310f7d64a2precedence_heuristic cell_id$5981f52b-d829-4c7d-b47b-33310f7d64a2downstream_cells_mapupstream_cells_mapcorridor_train.value_functioncorridor_train$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72cmake_ϵ_greedy_policy!$d963ff6d-f1b6-4799-aa0e-1ae100310d84$0e9de19e-bcd4-40ac-9831-afb6cad38422precedence_heuristic cell_id$0e9de19e-bcd4-40ac-9831-afb6cad38422downstream_cells_mapsetup_fcann_policy_arguments$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091$e1aec891-d95a-47d1-97d7-d2a4cfb16e64upstream_cells_mapFCANN$d963ff6d-f1b6-4799-aa0e-1ae100310d84zerosBoolsizeFCANNParamsoneIntegerVectorRealInt64FCANN.form_activationseachindexdeepcopy update_fcann_action_preferences!$cc3ac95e-a398-438a-ba3d-62b6733f6342length/+ update_fcann_eligibility_vector!$45f0a385-6465-4acc-8637-1b007a0fe215fill$ff3009eb-23f9-44fe-8e56-85dbc7b463d0precedence_heuristic cell_id$ff3009eb-23f9-44fe-8e56-85dbc7b463d0downstream_cells_mapshow_squashed_policy$f8215517-b18f-4a03-9421-8edab4ca8089upstream_cells_mapexpplot_squashed_gaussian$00bd2835-b006-4244-9877-bc7e031e3ef8Function$4fb83451-b6f8-4e6e-a131-1accc8e10b08precedence_heuristic 
cell_id$4fb83451-b6f8-4e6e-a131-1accc8e10b08downstream_cells_map,reinforce_with_baseline_monte_carlo_control!$0ac7ea44-14f6-4e80-80f9-d6df8059bb38$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb$d1ed25e6-60c6-411f-a541-99986e5da2c5$697b2310-9d96-4f7f-be62-c3bd6bf736f3$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$d5020a8d-1dd7-403c-9d1f-665b95543943upstream_cells_mapzerooneStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84lengthsample_action$d963ff6d-f1b6-4799-aa0e-1ae100310d84copyeachindexrunepisode!$d963ff6d-f1b6-4799-aa0e-1ae100310d84deepcopyReal/^form_state_policy_function$37ec6802-d4c2-4470-ad69-439d5a732f77update_params_with_gradient!$b0a66a19-ee76-463b-a704-8fcee85444d0$a893a87b-2d07-4db5-9d1a-9da8646216f4$f55afa58-962d-4551-8d95-a5b467d61adfrunepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84:zerosIntegerFunctionInt64-+*form_state_value_function$e7566274-5518-4e28-8738-d4b1747d0cfb$406638af-1e08-44d2-9ee4-97aa9294a94bprecedence_heuristic cell_id$406638af-1e08-44d2-9ee4-97aa9294a94bdownstream_cells_mapupstream_cells_map@md_strgetindex$57e5e12a-b722-4ea3-ab3b-e5711029e640precedence_heuristic cell_id$57e5e12a-b722-4ea3-ab3b-e5711029e640downstream_cells_map%one_step_actor_critic_linear_features$9db9ff71-bee9-4bea-a45b-748f8517fed1upstream_cells_mapzerosIntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84update_linear_value_gradient!$1753b5ed-c00b-4b60-b492-822180778e8ccopyVector!update_linear_eligibility_vector!$4634267b-5dea-4164-8bb2-1eb2fd4d7954Realone_step_actor_critic!$4d4ae57b-afc3-44f9-b6fc-892f59f82921FunctionMatrix!update_linear_action_preferences!$581f7e9b-a5c2-4841-9605-85f9585b0274lengthlinear_value_function$0bf3b988-b3fb-49d5-8dde-b25766596363$374af774-3a97-49b5-a3bb-bc3f7f63a3faprecedence_heuristic cell_id$374af774-3a97-49b5-a3bb-bc3f7f63a3fadownstream_cells_mapupstream_cells_mapep$e1274f57-75cb-4659-a82f-e5870c5367e2ep_step$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547dplot_cart$63fbf8f4-e4e2-4893-be09-67450e92dbd7$7bf209c8-ef0a-46d1-937e-b1a6e45dc62eprecedence_heuristic 
cell_id$7bf209c8-ef0a-46d1-937e-b1a6e45dc62edownstream_cells_mapbeta_params$ad0009af-2cfc-4820-bd4a-698ad391f459upstream_cells_map@md_strCore:PlutoUI$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70Base.get@bindSliderBasePlutoRunnerPlutoRunner.create_bondCore.applicablePlutoUI.combinegetindex$dd8e8cd2-7b41-46c4-8530-adefb7aea684precedence_heuristic cell_id$dd8e8cd2-7b41-46c4-8530-adefb7aea684downstream_cells_map1actor_critic_binary_episodic_beta_parameter_studyupstream_cells_mapRandom$df7f84e8-b42a-4001-9dbf-6bc3ced94207ContinuousMDP$537270ba-122b-4f2b-880b-31d086766295copyAactor_critic_with_eligibility_traces_binary_features_beta_actions$3e3c5897-809f-46e3-bb58-f115b082443eVectorRealscatter/Matrixisemptymean:AbstractVector|>Infmake_n_param_dist_policy_params$ba41f521-4ee2-42a6-bf18-078bfa4b875ezerosrandIntegerFunctionUInt64-plotfoldxt+MapLayoutRandom.seed!$4fea7232-f286-4a8b-93f8-a0702818ab31precedence_heuristic cell_id$4fea7232-f286-4a8b-93f8-a0702818ab31downstream_cells_mapupstream_cells_map@md_strgetindex$26880577-d267-4950-8725-7afe0d0402b6precedence_heuristic 
cell_id$26880577-d267-4950-8725-7afe0d0402b6downstream_cells_mapcartpole_setup$0cd96c44-cae6-421f-9fae-26141600bef4$64b38d1f-ecf9-4843-89a1-4c8953048265$24fa139c-ad4b-49db-ac8f-23c476ed8608$dddc4a2f-34b2-41dc-85b3-55aba4880fa6$d3b56fca-5b79-4465-8987-8d0005f854d8$5859ca11-90f8-4fd6-88ed-c56efe796fe8$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03$8aa16866-bfda-48df-9cf1-cf3d2e203ccb$dca2f8e2-76af-4679-bf81-3824c15fc76d$11a55af7-5301-4507-bb26-88e1e11236db$61650a97-b353-4a85-b50b-93fee296ac7b$d34d22ad-89c2-423e-91dd-bfb895dc6540$407a0724-4bb6-4c83-ab2d-17a0e19c4072$27487ad0-4779-42ce-8def-e660ef04bee0$07ba9fe4-aaa7-4123-9865-cbfa79d0d44a$e1274f57-75cb-4659-a82f-e5870c5367e2$5ee4ce72-7740-4297-8d84-619e0708e4ac$82e0e9a0-9662-429a-87e3-e6bdae02709a$daf35bfe-8f9c-4f55-971d-4d443be8f8bf$a5b002c9-5e11-462a-9da0-6e060c7963f8$d21617aa-6f38-4a90-8586-4b32022497adupstream_cells_map!setup_cartpole_continuous_problem$fddef10c-7695-4596-9e16-987fd45a57e6$a7891c63-18d6-4c1f-ba67-adf7c547d334precedence_heuristic cell_id$a7891c63-18d6-4c1f-ba67-adf7c547d334downstream_cells_mapupstream_cells_map$44f14d4f-7414-4c6f-883a-042ca261a403precedence_heuristic cell_id$44f14d4f-7414-4c6f-883a-042ca261a403downstream_cells_mapupstream_cells_map$94354552-9920-4b90-98d9-f75286d1f53eprecedence_heuristic cell_id$94354552-9920-4b90-98d9-f75286d1f53edownstream_cells_mapupstream_cells_map:corridor_parameter_studies$e5c1aca8-7575-4835-8273-e69ca0a55fe8$646bc853-b7fc-49fa-a201-ff98e8f952d4^$e5faaa1b-88cb-43e2-8d04-8972b58b4bdaprecedence_heuristic cell_id$e5faaa1b-88cb-43e2-8d04-8972b58b4bdadownstream_cells_mapplistv2v1v3tracesupstream_cells_map:-scatterplot/+*zipLayoutbgcolor$9c342958-1971-48ec-b919-5dfdcbc915a4$70096b14-beab-4f71-9886-6355c749bb8aprecedence_heuristic cell_id$70096b14-beab-4f71-9886-6355c749bb8adownstream_cells_mapupstream_cells_map@md_strgetindex$90d3b96b-ad2b-405c-951b-f48ec7ccf24aprecedence_heuristic 
cell_id$90d3b96b-ad2b-405c-951b-f48ec7ccf24adownstream_cells_mapupstream_cells_map@md_strgetindex$700dcbc4-c94c-4287-8cf0-0b2c7a320a3aprecedence_heuristic cell_id$700dcbc4-c94c-4287-8cf0-0b2c7a320a3adownstream_cells_mapupstream_cells_map reinforce_test5.policy_and_valueCartPoleStatereinforce_test5$82e0e9a0-9662-429a-87e3-e6bdae02709a$f59a5dcd-9f4a-4336-a391-e64af35ef799precedence_heuristic cell_id$f59a5dcd-9f4a-4336-a391-e64af35ef799downstream_cells_mapupstream_cells_mapBaseBase.Docs.HTML@html_str$5864a5a3-a5a5-43c2-9cb4-7d13b2d20bedprecedence_heuristic cell_id$5864a5a3-a5a5-43c2-9cb4-7d13b2d20beddownstream_cells_mapupstream_cells_map@md_strgetindex$e3a2fb12-37ce-4c23-ad93-5fc89991aabbprecedence_heuristic cell_id$e3a2fb12-37ce-4c23-ad93-5fc89991aabbdownstream_cells_mapupstream_cells_map@md_strgetindex$e5c1aca8-7575-4835-8273-e69ca0a55fe8precedence_heuristic cell_id$e5c1aca8-7575-4835-8273-e69ca0a55fe8downstream_cells_mapcorridor_parameter_studies$94354552-9920-4b90-98d9-f75286d1f53e$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6cupstream_cells_mapLayout:sumRandom$df7f84e8-b42a-4001-9dbf-6bc3ced94207|>Int64corridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabeget_corridor_features$6bb0263e-368e-462a-948c-baf9cfa82512scatterplot/log2+foldxt-reinforce_monte_carlo_control_binary_features$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290Mapround;reinforce_with_baseline_monte_carlo_control_binary_features$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbRandom.seed!$44b32cc0-36a8-41fd-89bc-ce894536926cprecedence_heuristic cell_id$44b32cc0-36a8-41fd-89bc-ce894536926cdownstream_cells_mapupstream_cells_mapbest_mc_corridor$a12b92d1-e045-4f92-b8cd-eee5d56fa67d!best_mc_corridor.policy_and_value$646bc853-b7fc-49fa-a201-ff98e8f952d4precedence_heuristic 
cell_id$646bc853-b7fc-49fa-a201-ff98e8f952d4downstream_cells_mapcorridor_parameter_studies$94354552-9920-4b90-98d9-f75286d1f53e$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6cupstream_cells_mapLayout%one_step_actor_critic_binary_features$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2:sum|>lengthInt64corridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabeget_corridor_features$6bb0263e-368e-462a-948c-baf9cfa82512scatter-/plot+Inf32log2Maproundfoldxtisempty$25be5dcf-be63-46c4-b6de-6cf79fa28fd0precedence_heuristic cell_id$25be5dcf-be63-46c4-b6de-6cf79fa28fd0downstream_cells_mapupdate_traces_with_gradient!$266d2234-26c8-43f1-9e75-49440a230ed6$83640f5b-fe13-4ec1-98a0-67a56c189ba1$b71145a4-2614-4f62-bfd2-7d5d1fecec56$4da20fd7-b897-4f26-bf2a-f08d66ddf90fupstream_cells_mapzeroisless@inboundsFCANNParamsoneBinaryEligibilityVector$41dc149d-c6f3-4b0d-a856-06f3aaae3049nothingVectorFunctionlengthcorridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabe/foldxt+Maprunepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84$4c4e643b-d4b9-44f0-8d30-dc521bcc55acprecedence_heuristic cell_id$4c4e643b-d4b9-44f0-8d30-dc521bcc55acdownstream_cells_mapcartpole_continuing_mdp$1b102220-6d78-480d-a77f-0e57bad23dca$3c89209c-9202-4d5d-841c-ea34be369616$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27upstream_cells_map#cartpole_functions.initialize_stateStateMDPTransitionSampler$d963ff6d-f1b6-4799-aa0e-1ae100310d84cartpole_functions$f27f2bcd-05b6-44fe-bf9e-a3e51556db7ccartpole_continuing_step$5d434c83-c9ca-499f-8695-c7733031c2deStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84$738ada7f-edc7-4ed3-a15e-e92113468738precedence_heuristic cell_id$738ada7f-edc7-4ed3-a15e-e92113468738downstream_cells_mapupstream_cells_map$cacaaca6-6e01-464f-a2ee-cbf62737a426precedence_heuristic 
cell_id$cacaaca6-6e01-464f-a2ee-cbf62737a426downstream_cells_mapupstream_cells_mapcorridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabe;reinforce_with_baseline_monte_carlo_control_linear_features$d1ed25e6-60c6-411f-a541-99986e5da2c5^update_corridor_features!$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811$daf35bfe-8f9c-4f55-971d-4d443be8f8bfprecedence_heuristic cell_id$daf35bfe-8f9c-4f55-971d-4d443be8f8bfdownstream_cells_mapupstream_cells_mapcartpole_setup$26880577-d267-4950-8725-7afe0d0402b6display_cartpole_episode$822e4d69-2582-4956-858e-06ecb091e76a|>reinforce_test5$82e0e9a0-9662-429a-87e3-e6bdae02709arunepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84$8e096fae-9941-49d8-ae87-c68b02f68da5precedence_heuristic cell_id$8e096fae-9941-49d8-ae87-c68b02f68da5downstream_cells_mapmountaincar_continuous_beta_mdp$4156d955-9daf-4429-b152-e8332980fb9e$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8fupstream_cells_map)create_continuous_action_mountaincar_beta$d2729657-d0bf-4d39-8ec7-f242a1ad48d6$666a4e89-306b-4fb2-bdc4-3dda2c63153fprecedence_heuristiccell_id$666a4e89-306b-4fb2-bdc4-3dda2c63153fdownstream_cells_mapSpecialFunctionsupstream_cells_map$5d35e515-e2d3-443e-becf-eb28c25db346precedence_heuristic cell_id$5d35e515-e2d3-443e-becf-eb28c25db346downstream_cells_map#mountaincar_continuing_fcann_params$cb70d400-3e9c-441c-b17c-e727e8c928f3upstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunner(create_actor_critic_continuing_params_UI$5b15d91e-7119-4f85-a54a-7d4f1fdaf097Core.applicable@bindBase.get$4c34640f-efa2-4e1d-8a70-0acd2ce45428precedence_heuristic cell_id$4c34640f-efa2-4e1d-8a70-0acd2ce45428downstream_cells_mapupstream_cells_map@md_strgetindex$e7566274-5518-4e28-8738-d4b1747d0cfbprecedence_heuristic 
cell_id$e7566274-5518-4e28-8738-d4b1747d0cfbdownstream_cells_mapform_state_value_function$4fb83451-b6f8-4e6e-a131-1accc8e10b08$e7e49ff8-32df-48a4-afb2-462859592e92$5b868eba-c1af-49f6-8f93-79b78c319a6f$11b9beea-b0cd-45eb-84c6-151728894df0upstream_cells_mapFunction$6bf5ad39-1400-4e1f-a843-a1934b8aaa48precedence_heuristic cell_id$6bf5ad39-1400-4e1f-a843-a1934b8aaa48downstream_cells_map,update_squashed_gaussian_eligibility_vector!$05f120be-9695-4824-82fd-142a0df13098upstream_cells_mapzeroisless@inboundsonenothingVector*cartpole_binary_continuing_parameter_study$1b102220-6d78-480d-a77f-0e57bad23dcaislessgetindex$c5dd7e99-57e0-4bc7-97d2-2c780b23bcffprecedence_heuristic cell_id$c5dd7e99-57e0-4bc7-97d2-2c780b23bcffdownstream_cells_mapupstream_cells_map@md_strgetindex$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9precedence_heuristic cell_id$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9downstream_cells_map!mountaincar_continuing_fcann_test$10ee7709-0816-48d2-abe0-9be3dd04700f$c0876a48-ea18-494d-8bfc-e2bceb73b417$3a37b53d-9174-4faa-9404-74a40c385b0aupstream_cells_map*actor_critic_with_eligibility_traces_fcann$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54mountaincar_fcann_setup$023f67b8-8f38-470a-9766-ac60a75678aamountaincar_continuing_mdp$46fea69b-599e-46ab-8455-d2da865d9a8e$042fbafe-2401-4fb7-ac13-4531e0782c79precedence_heuristic cell_id$042fbafe-2401-4fb7-ac13-4531e0782c79downstream_cells_map!update_binary_eligibility_vector!$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2$05bfd818-bf4e-4bda-baa9-5ba647867097upstream_cells_mapRealMatrix!update_binary_action_preferences!$a361f4c9-47ce-42ad-899c-87b611c0d471BinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41soft_max!$33c99850-67cd-4754-94b9-6df97b238e27BinaryEligibilityVector$41dc149d-c6f3-4b0d-a856-06f3aaae3049IntegerVector$d57375a5-b9e0-4742-b5f7-6a7da891604aprecedence_heuristic 
cell_id$d57375a5-b9e0-4742-b5f7-6a7da891604adownstream_cells_map-mountaincar_binary_continuing_parameter_study$04f42c09-8ab5-4233-b196-51c4aa2dcedbupstream_cells_mapmountaincar_tilecoding_setup$7c592385-e8d3-4efe-962c-d39debb64405mountaincar_continuing_mdp$46fea69b-599e-46ab-8455-d2da865d9a8e#actor_critic_linear_parameter_study$734573e5-547b-4dcc-89bb-412aa6cc42d6$e96d592d-1e54-486d-8ad9-b857f85476e8$ff4f977e-48df-4c12-845c-c245b4d39d6d$07ad517a-c2ac-4377-99fb-adb13d0f1d0cprecedence_heuristic cell_id$07ad517a-c2ac-4377-99fb-adb13d0f1d0cdownstream_cells_mapupstream_cells_map#reinforce_monte_carlo_control_fcann$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091corridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabe^update_corridor_features!$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811$71a5fce8-6d9a-4625-bad1-a951d61bff28precedence_heuristic cell_id$71a5fce8-6d9a-4625-bad1-a951d61bff28downstream_cells_map$mountaincar_binary_continuous_params$b53dba81-a9e9-41da-8fc2-7736bf25f2dcupstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunnerCore.applicable@bindBase.getcreate_actor_critic_params_UI$a8b40b8f-051a-4e6f-a079-ece4f32873de$77906355-08f8-4b08-b051-84697199b519precedence_heuristic cell_id$77906355-08f8-4b08-b051-84697199b519downstream_cells_mapmountaincar_max_vals$023f67b8-8f38-470a-9766-ac60a75678aa$7c592385-e8d3-4efe-962c-d39debb64405upstream_cells_map$5207308e-f636-4d47-b135-036a6e7b8ecdprecedence_heuristic cell_id$5207308e-f636-4d47-b135-036a6e7b8ecddownstream_cells_mapupstream_cells_map&show_mountaincar_continuous_trajectory$b5319d8b-0420-4ebf-b603-ea0b93365ac1"mountaincar_continuous_test_train3$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7d$16113560-e911-47b4-abc4-641bbd246454precedence_heuristic cell_id$16113560-e911-47b4-abc4-641bbd246454downstream_cells_mapupstream_cells_map&mountaincar_continuous_test_train_beta$4156d955-9daf-4429-b152-e8332980fb9eplotLayout$b7f77935-bcab-4ef1-8e1b-a7d059784ff3precedence_heuristic 
cell_id$b7f77935-bcab-4ef1-8e1b-a7d059784ff3downstream_cells_maptest_mountaincar_state$f8215517-b18f-4a03-9421-8edab4ca8089upstream_cells_map@md_strCore:PlutoUI$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70Base.get@bindSliderBasePlutoRunnerPlutoRunner.create_bondCore.applicablePlutoUI.combinegetindex$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5precedence_heuristic cell_id$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5downstream_cells_mapupstream_cells_mapreinforce_test.policy_and_valuecartpole_mdps$024dcd1a-8eaa-4a95-8037-2f578828309creinforce_test$24fa139c-ad4b-49db-ac8f-23c476ed86082cartpole_mdps.episodic.continuous.initialize_state$00bd2835-b006-4244-9877-bc7e031e3ef8precedence_heuristic cell_id$00bd2835-b006-4244-9877-bc7e031e3ef8downstream_cells_mapplot_squashed_gaussian$3e7cecec-eb77-4862-8e3c-b510422e06db$ff3009eb-23f9-44fe-8e56-85dbc7b463d0upstream_cells_mapsquashed_gaussian_pdf$b16899b7-36bf-4a5e-8e2f-4496b8450687-plotLinRange*oneLayoutReal$50ae94c4-70f3-4215-82bd-eb2227c2badfprecedence_heuristic cell_id$50ae94c4-70f3-4215-82bd-eb2227c2badfdownstream_cells_mapupstream_cells_map@md_str<(cartpole_continuing_fcann_network_params$42d4600a-bf3c-45ac-b7f5-d23917713ff5)cartpole_fcann_continuing_parameter_study$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62>+start_cartpole_continuing_fcann_param_study$2c5d221a-2469-49e1-9249-dfdc2457f2faisless&cartpole_continuing_fcann_study_params$5ffc271f-c73f-494a-9727-8d7516af2191getindex$cc3ac95e-a398-438a-ba3d-62b6733f6342precedence_heuristic cell_id$cc3ac95e-a398-438a-ba3d-62b6733f6342downstream_cells_map update_fcann_action_preferences!$0e9de19e-bcd4-40ac-9831-afb6cad38422upstream_cells_mapFCANN.forwardNOGRAD_base!endFCANNActivations$5c11a92d-7496-4aba-af15-2537eac49dd7Float32FCANN$d963ff6d-f1b6-4799-aa0e-1ae100310d84FCANNParamsIntegerVector$c926b6df-c40b-4c4c-8a95-ce9e41feb100precedence_heuristic cell_id$c926b6df-c40b-4c4c-8a95-ce9e41feb100downstream_cells_mapupstream_cells_map$740a3f41-9302-481d-b373-762c0dea8effprecedence_heuristic 
cell_id$740a3f41-9302-481d-b373-762c0dea8effdownstream_cells_map#update_gaussian_eligibility_vector!$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$d5020a8d-1dd7-403c-9d1f-665b95543943$20776e09-7d9b-4db8-a060-7bceeec65b47upstream_cells_mapexp:kfirstBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41VectorRealBinaryGaussianEligibilityVector$10cdd16e-a337-4421-a7a0-6de4e4b60c0fMatrix+NTuplelast$ba642a22-6623-482a-ab4a-81585b83e457precedence_heuristic cell_id$ba642a22-6623-482a-ab4a-81585b83e457downstream_cells_map$##average_continuing_runs_unmemoizedaverage_continuing_runs$734573e5-547b-4dcc-89bb-412aa6cc42d6$ff4f977e-48df-4c12-845c-c245b4d39d6d$8bc280db-e57d-4e40-be46-1790f4f7d9e7$11063fff-4d36-46d5-828f-dbed0f46b9cfupstream_cells_map:@memoizeRandom$df7f84e8-b42a-4001-9dbf-6bc3ced94207|>empty!IntegerRealdeepcopy/foldxt+Mapget!Random.seed!$d17a4bd0-5992-4247-912d-73d51758d2f3precedence_heuristic cell_id$d17a4bd0-5992-4247-912d-73d51758d2f3downstream_cells_mapupstream_cells_map@md_strgetindex$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7efprecedence_heuristic cell_id$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7efdownstream_cells_mapupstream_cells_map&cartpole_fcann_continuing_test_episode$64b38d1f-ecf9-4843-89a1-4c8953048265plot_cartpole_policy$602a07dd-8928-4b44-97e5-01c5cbf38351-cartpole_fcann_continuing_episode_step_select$6acb549a-5d90-4457-a347-d22448ad8071cartpole_continuing_fcann_test$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27$5ee4ce72-7740-4297-8d84-619e0708e4acprecedence_heuristic cell_id$5ee4ce72-7740-4297-8d84-619e0708e4acdownstream_cells_map)cartpole_continuing_fcann_parameter_studyupstream_cells_mapcartpole_setup$26880577-d267-4950-8725-7afe0d0402b6:*actor_critic_with_eligibility_traces_fcann$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54|>setup_cartpole_problemscatterplot/foldxt+cartpole_fcann_feature_setup$61650a97-b353-4a85-b50b-93fee296ac7bMap3cartpole_fcann_feature_setup.update_feature_vector!Layout$645e93e7-e92e-49c4-9757-8294fabf4e9bprecedence_heuristic 
cell_id$645e93e7-e92e-49c4-9757-8294fabf4e9bdownstream_cells_mapupstream_cells_mapcartpole_continuing_test$3c89209c-9202-4d5d-841c-ea34be369616plot_continuing_step_rewards$0964133c-3a5b-433b-a8c4-a97813c37583$0c56b341-24eb-4c78-844e-182f44a7221aprecedence_heuristic cell_id$0c56b341-24eb-4c78-844e-182f44a7221adownstream_cells_mapupstream_cells_map^figure_13_1$d037ea92-915c-4dc7-97c6-d006d92e088a$d34d22ad-89c2-423e-91dd-bfb895dc6540precedence_heuristic cell_id$d34d22ad-89c2-423e-91dd-bfb895dc6540downstream_cells_mapcartpole_fcann_parameter_studyupstream_cells_map+actor_critic_fcann_episodic_parameter_study$f8614042-7c94-4d47-a1b6-4e96676b4e8bcartpole_setup$26880577-d267-4950-8725-7afe0d0402b6cartpole_vector_update!$192b9f82-8d3a-408f-91c2-829cfcd32572cartpole_fcann_feature_setup$61650a97-b353-4a85-b50b-93fee296ac7b$20776e09-7d9b-4db8-a060-7bceeec65b47precedence_heuristic cell_id$20776e09-7d9b-4db8-a060-7bceeec65b47downstream_cells_mapEactor_critic_with_eligibility_traces_binary_features_gaussian_actions$55ba8725-0ddf-4196-a41d-3f3c490a8d84$61949faa-8174-4b7b-8fbc-01d5f850b419$b8532822-179b-4cd5-a279-4b71dafb544a$fee14dfe-c5ca-4126-a830-cc9d7eda5433upstream_cells_mapbinary_value_function$a540814a-57a1-4b98-9443-59e401425444make_n_param_dist_policy_params$ba41f521-4ee2-42a6-bf18-078bfa4b875ezerosBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41randIntegermake_gaussian_sampler$bba13634-ff0e-47f7-a23b-8d56098f4ac6VectorContinuousMDP$537270ba-122b-4f2b-880b-31d086766295#update_gaussian_eligibility_vector!$5261651e-a51e-4e80-8e23-83a4c10e5259$740a3f41-9302-481d-b373-762c0dea8effRealFunction&setup_binary_gaussian_policy_arguments$ba5d6311-daee-4abc-b2fb-fae2184ef3ebupdate_binary_value_gradient!$03a218cb-aa83-4000-85b5-c6f247087053MatrixNTupleUnion!update_binary_action_preferences!$a361f4c9-47ce-42ad-899c-87b611c0d471%actor_critic_with_eligibility_traces!$266d2234-26c8-43f1-9e75-49440a230ed6$83640f5b-fe13-4ec1-98a0-67a56c189ba1$b71145a4-2614-4f62-bfd2-7d5d1fecec56$4da2
0fd7-b897-4f26-bf2a-f08d66ddf90f$7856b8a0-565d-4c86-9b3c-4424ff9b86ddprecedence_heuristic cell_id$7856b8a0-565d-4c86-9b3c-4424ff9b86dddownstream_cells_mapupstream_cells_map$735b548a-88f5-4a30-ab8f-dfb3d6401b2bprecedence_heuristic cell_id$735b548a-88f5-4a30-ab8f-dfb3d6401b2bdownstream_cells_mapupstream_cells_map@md_strgetindex$7cf26604-9c2b-4a77-9674-7d4dac2f99f0precedence_heuristiccell_id$7cf26604-9c2b-4a77-9674-7d4dac2f99f0downstream_cells_mapupstream_cells_mapjoinpath@__DIR__include$87ee21f3-16ca-4c8c-a0b9-f9e2fd258a91precedence_heuristic cell_id$87ee21f3-16ca-4c8c-a0b9-f9e2fd258a91downstream_cells_mapupstream_cells_map@md_strgetindex$54f1546d-87ae-49d2-92ed-6fcc9b66e027precedence_heuristic cell_id$54f1546d-87ae-49d2-92ed-6fcc9b66e027downstream_cells_mapupstream_cells_map@md_strgetindex$63fbf8f4-e4e2-4893-be09-67450e92dbd7precedence_heuristic cell_id$63fbf8f4-e4e2-4893-be09-67450e92dbd7downstream_cells_mapplot_cart$fad02876-efba-46a7-9cb7-43820528779f$374af774-3a97-49b5-a3bb-bc3f7f63a3fa$1ce4bc6c-7cde-48e9-8ff1-7281697fd121upstream_cells_mapCartPoleStateHypertextLiteral.BypassindicatorHypertextLiteral.content@htl$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebInt64-scatterplotHypertextLiteral.ResultHypertextLiteral$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70attrcosLayoutsin$d5020a8d-1dd7-403c-9d1f-665b95543943precedence_heuristic 
cell_id$d5020a8d-1dd7-403c-9d1f-665b95543943downstream_cells_mapLreinforce_with_baseline_monte_carlo_control_linear_features_gaussian_actionsupstream_cells_map,reinforce_with_baseline_monte_carlo_control!$4fb83451-b6f8-4e6e-a131-1accc8e10b08$5b868eba-c1af-49f6-8f93-79b78c319a6fmake_n_param_dist_policy_params$ba41f521-4ee2-42a6-bf18-078bfa4b875ezerosrandIntegermake_gaussian_sampler$bba13634-ff0e-47f7-a23b-8d56098f4ac6update_linear_value_gradient!$1753b5ed-c00b-4b60-b492-822180778e8ccopy#update_gaussian_eligibility_vector!$5261651e-a51e-4e80-8e23-83a4c10e5259$740a3f41-9302-481d-b373-762c0dea8effVectorContinuousMDP$537270ba-122b-4f2b-880b-31d086766295RealFunctionMatrix!update_linear_action_preferences!$581f7e9b-a5c2-4841-9605-85f9585b0274NTupleUnionlinear_value_function$0bf3b988-b3fb-49d5-8dde-b25766596363make_gaussian_params$37a8ef7e-e859-4ef0-81e2-76c02a324031precedence_heuristic cell_id$37a8ef7e-e859-4ef0-81e2-76c02a324031downstream_cells_mapupstream_cells_map@md_strgetindex$98229733-a71e-44ca-a52a-b7229cf8b422precedence_heuristic cell_id$98229733-a71e-44ca-a52a-b7229cf8b422downstream_cells_mapupstream_cells_map@md_strgetindex$42775fd1-5b27-48e0-abf1-9b22bb775e6dprecedence_heuristic cell_id$42775fd1-5b27-48e0-abf1-9b22bb775e6ddownstream_cells_mapupstream_cells_map#corridor_continuing_parameter_study$7afb6fb0-248a-4518-b94f-9876f81eca64continuing_study_params$7d94922e-dc9f-4953-b539-24aaa2c85b12$7dbb42a3-aa8c-47e5-b668-18e6325d4038precedence_heuristic cell_id$7dbb42a3-aa8c-47e5-b668-18e6325d4038downstream_cells_mapupstream_cells_map@md_strgetindex$192b9f82-8d3a-408f-91c2-829cfcd32572precedence_heuristic 
cell_id$192b9f82-8d3a-408f-91c2-829cfcd32572downstream_cells_mapcartpole_vector_update!$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27$d34d22ad-89c2-423e-91dd-bfb895dc6540upstream_cells_mapRealCartPoleStatecartpole_fcann_feature_setup$61650a97-b353-4a85-b50b-93fee296ac7b3cartpole_fcann_feature_setup.update_feature_vector!Vector$b5319d8b-0420-4ebf-b603-ea0b93365ac1precedence_heuristic cell_id$b5319d8b-0420-4ebf-b603-ea0b93365ac1downstream_cells_map&show_mountaincar_continuous_trajectory$c87dba8c-9a96-41b3-9dc7-a6c088ec1eaf$cd9c9eeb-c90d-4499-9503-7773d5250f47$5207308e-f636-4d47-b135-036a6e7b8ecd$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8fupstream_cells_mapsumHypertextLiteral.BypassHypertextLiteral.content@htl$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebIntegerFunctionmountaincar_continuous_mdp$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2scatterplotHypertextLiteral.ResultHypertextLiteral$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70Layoutrunepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84$4cbdb082-22ba-49e9-a6ed-4380917625acprecedence_heuristic cell_id$4cbdb082-22ba-49e9-a6ed-4380917625acdownstream_cells_mapupstream_cells_map@md_strgetindex$cc80848a-6834-4272-9152-e17b45448814precedence_heuristic cell_id$cc80848a-6834-4272-9152-e17b45448814downstream_cells_mapwind_speedsupstream_cells_map:HypertextLiteral.BypassHypertextLiteral.ResultHypertextLiteral$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70PlutoUI.combinePlutoUI$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70HypertextLiteral.content@htl$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebSlider$05bfd818-bf4e-4bda-baa9-5ba647867097precedence_heuristic 
cell_id$05bfd818-bf4e-4bda-baa9-5ba647867097downstream_cells_map4actor_critic_with_eligibility_traces_binary_features$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728$396e0047-d848-462f-a769-0cc2829abc78$1f041cb3-618c-4380-a1ec-d7bbe4a80f62$72273f27-d0b9-4645-a609-cb65cc9332ee$734573e5-547b-4dcc-89bb-412aa6cc42d6$ff4f977e-48df-4c12-845c-c245b4d39d6d$8b35661b-5075-4d63-bc31-044407f99acf$3c89209c-9202-4d5d-841c-ea34be369616$b02ba928-5b9f-4695-b980-07988c788bb9$dca2f8e2-76af-4679-bf81-3824c15fc76d$6d0925d3-af96-4b94-8e2e-4941cce39e51upstream_cells_mapbinary_value_function$a540814a-57a1-4b98-9443-59e401425444setup_binary_policy_arguments$96506201-6b66-49e6-8179-06952e2394e1zerosBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41IntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84VectorRealFunction!update_binary_eligibility_vector!$042fbafe-2401-4fb7-ac13-4531e0782c79lengthupdate_binary_value_gradient!$03a218cb-aa83-4000-85b5-c6f247087053Matrix!update_binary_action_preferences!$a361f4c9-47ce-42ad-899c-87b611c0d471%actor_critic_with_eligibility_traces!$266d2234-26c8-43f1-9e75-49440a230ed6$83640f5b-fe13-4ec1-98a0-67a56c189ba1$b71145a4-2614-4f62-bfd2-7d5d1fecec56$4da20fd7-b897-4f26-bf2a-f08d66ddf90f$f0962801-0dfa-421f-8ffc-e64068e49913precedence_heuristic cell_id$f0962801-0dfa-421f-8ffc-e64068e49913downstream_cells_mapmountaincar_fcann_feature_setup$c251a630-7114-4188-9323-8d8feb5c32e0upstream_cells_mapfcann_feature_vector_setup$9acdbf38-2e10-45ec-85a0-d0db8453a599$11a55af7-5301-4507-bb26-88e1e11236dbprecedence_heuristic cell_id$11a55af7-5301-4507-bb26-88e1e11236dbdownstream_cells_mapupstream_cells_mapcartpole_setup$26880577-d267-4950-8725-7afe0d0402b6display_cartpole_episode$822e4d69-2582-4956-858e-06ecb091e76a|>reinforce_test3$dca2f8e2-76af-4679-bf81-3824c15fc76drunepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84$ddbca73f-c692-46f2-95f3-a7dd849d33f7precedence_heuristic 
cell_id$ddbca73f-c692-46f2-95f3-a7dd849d33f7downstream_cells_mapupstream_cells_mapshow_mountaincar_trajectory$ba645f6b-143f-4e83-9003-707770ae308dmountaincar_test_train$6d0925d3-af96-4b94-8e2e-4941cce39e51$b4875f2b-5487-429f-80a3-d1032bbccfc1precedence_heuristic cell_id$b4875f2b-5487-429f-80a3-d1032bbccfc1downstream_cells_mapupstream_cells_map@md_strgetindex$0cd96c44-cae6-421f-9fae-26141600bef4precedence_heuristic cell_id$0cd96c44-cae6-421f-9fae-26141600bef4downstream_cells_mapupstream_cells_mapcartpole_setup$26880577-d267-4950-8725-7afe0d0402b6display_cartpole_episode$822e4d69-2582-4956-858e-06ecb091e76a|>cartpole_continuing_test$3c89209c-9202-4d5d-841c-ea34be369616runepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84$51d6337d-c0bd-40a9-9129-7d88e41e4093precedence_heuristic cell_id$51d6337d-c0bd-40a9-9129-7d88e41e4093downstream_cells_mapupstream_cells_map$5859ca11-90f8-4fd6-88ed-c56efe796fe8precedence_heuristic cell_id$5859ca11-90f8-4fd6-88ed-c56efe796fe8downstream_cells_mapupstream_cells_mapcartpole_setup$26880577-d267-4950-8725-7afe0d0402b6display_cartpole_episode$822e4d69-2582-4956-858e-06ecb091e76a|>runepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84reinforce_test2$d3b56fca-5b79-4465-8987-8d0005f854d8$3ea08816-705e-4be7-a175-dbd3f3e4c17dprecedence_heuristic cell_id$3ea08816-705e-4be7-a175-dbd3f3e4c17ddownstream_cells_mapupstream_cells_map@md_strgetindex$f3e2db06-9cb7-464a-96b8-938175efd26bprecedence_heuristic cell_id$f3e2db06-9cb7-464a-96b8-938175efd26bdownstream_cells_mapsetup_fcann_value_arguments$e1aec891-d95a-47d1-97d7-d2a4cfb16e64upstream_cells_mapfcann_value_function$635abb34-2c97-4f04-a74c-22fbec32f408onescale_fcann_params!$77cf3a74-899f-4ade-99f2-5aaf7a98c02dFCANN.makeorthonormalrandVectorlengthNamedTupleRealeachindexdeepcopy/^last==:FCANN$d963ff6d-f1b6-4799-aa0e-1ae100310d84zerosBoolIntegerendInt64FCANN.form_activations+*update_fcann_value_gradient!$5c4a383f-fcf2-4f2b-819f-6d84471dda00$b2082ab0-73a4-45a6-8772-a2e6e22b519aprecedence_heuristic 
cell_id$b2082ab0-73a4-45a6-8772-a2e6e22b519adownstream_cells_mapmake_beta_n_samplermake_beta_sampler$3e3c5897-809f-46e3-bb58-f115b082443ebeta_action_samplerupstream_cells_mapzeroexpmaxisnanislessrandIntegerVectorepsRealBetaVal+NTuplentuple$a361f4c9-47ce-42ad-899c-87b611c0d471precedence_heuristic cell_id$a361f4c9-47ce-42ad-899c-87b611c0d471downstream_cells_map!update_binary_action_preferences!$042fbafe-2401-4fb7-ac13-4531e0782c79$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2$05bfd818-bf4e-4bda-baa9-5ba647867097$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$20776e09-7d9b-4db8-a060-7bceeec65b47$3e3c5897-809f-46e3-bb58-f115b082443e$05f120be-9695-4824-82fd-142a0df13098upstream_cells_mapzero:islessBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41julia.simdloop@inboundsnothingVectorrandReallengthisless@inboundsoneviewnothingcopyzerosjulia.simdloopsizerandIntegerBase<=Base.simd_outer_rangeBase.simd_inner_lengthfoldxt+ArrayMap$6d0925d3-af96-4b94-8e2e-4941cce39e51precedence_heuristic cell_id$6d0925d3-af96-4b94-8e2e-4941cce39e51downstream_cells_mapmountaincar_test_train$dc2efc6c-8da8-425b-aa5f-290949109565$ddbca73f-c692-46f2-95f3-a7dd849d33f7upstream_cells_mapInt644actor_critic_with_eligibility_traces_binary_features$05bfd818-bf4e-4bda-baa9-5ba647867097mountaincar_tilecoding_setup$7c592385-e8d3-4efe-962c-d39debb64405MountainCarTasktypemax$6bb0263e-368e-462a-948c-baf9cfa82512precedence_heuristic 
cell_id$6bb0263e-368e-462a-948c-baf9cfa82512downstream_cells_mapget_corridor_features$f2f2dd1d-180c-4d36-b515-5079d129f93a$e1493cea-19c4-475d-98a0-86d27fb04af1$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72c$d037ea92-915c-4dc7-97c6-d006d92e088a$f2ed56c9-c2b7-42cb-a083-e12aeaa126ef$cbea5840-49d2-4e91-be9c-f5f15666d78a$83ca0577-15d7-4448-b597-c77810b812bf$e5c1aca8-7575-4835-8273-e69ca0a55fe8$7d63b960-3998-4f7b-8cbb-ccd49db9aeac$646bc853-b7fc-49fa-a201-ff98e8f952d4$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728$396e0047-d848-462f-a769-0cc2829abc78$bc8a399b-8864-4473-89d2-e3b0a03d15b5$72273f27-d0b9-4645-a609-cb65cc9332ee$7afb6fb0-248a-4518-b94f-9876f81eca64$8b35661b-5075-4d63-bc31-044407f99acfupstream_cells_map:$72273f27-d0b9-4645-a609-cb65cc9332eeprecedence_heuristic cell_id$72273f27-d0b9-4645-a609-cb65cc9332eedownstream_cells_mapupstream_cells_mapcorridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabeget_corridor_features$6bb0263e-368e-462a-948c-baf9cfa82512^4actor_critic_with_eligibility_traces_binary_features$05bfd818-bf4e-4bda-baa9-5ba647867097$87482ea5-5265-4e02-92c0-1a8bb44ff0f4precedence_heuristic cell_id$87482ea5-5265-4e02-92c0-1a8bb44ff0f4downstream_cells_map@actor_critic_binary_continuing_squashed_gaussian_parameter_studyupstream_cells_mapRandom$df7f84e8-b42a-4001-9dbf-6bc3ced94207ContinuousMDP$537270ba-122b-4f2b-880b-31d086766295copyVectorRealscatter/MatrixNactor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions$05f120be-9695-4824-82fd-142a0df13098$717e4c69-59d5-4929-923f-dd35a97fb160:AbstractVector|>make_n_param_dist_policy_params$ba41f521-4ee2-42a6-bf18-078bfa4b875ezerosrandIntegerFunctionUInt64plotfoldxt+MapLayoutRandom.seed!$3bafd7df-9bc0-4d13-874d-739590cf3ad9precedence_heuristic cell_id$3bafd7df-9bc0-4d13-874d-739590cf3ad9downstream_cells_mapupstream_cells_map@md_strgetindex$f27f2bcd-05b6-44fe-bf9e-a3e51556db7cprecedence_heuristic 
cell_id$f27f2bcd-05b6-44fe-bf9e-a3e51556db7cdownstream_cells_mapcartpole_functions$5d434c83-c9ca-499f-8695-c7733031c2de$4c4e643b-d4b9-44f0-8d30-dc521bcc55ac$de3cba34-9842-44d1-9b79-47126c0a0751upstream_cells_mapcreate_cartpole_functions$352d2952-cb83-47d3-9078-2b2ef9927443$41dc149d-c6f3-4b0d-a856-06f3aaae3049precedence_heuristic cell_id$41dc149d-c6f3-4b0d-a856-06f3aaae3049downstream_cells_mapBinaryEligibilityVector$b0a66a19-ee76-463b-a704-8fcee85444d0$042fbafe-2401-4fb7-ac13-4531e0782c79$96506201-6b66-49e6-8179-06952e2394e1$25be5dcf-be63-46c4-b6de-6cf79fa28fd0upstream_cells_mapInt64BinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41Vector$38e5d800-4d43-40d2-87ea-f7d4b4283dabprecedence_heuristic cell_id$38e5d800-4d43-40d2-87ea-f7d4b4283dabdownstream_cells_mapupstream_cells_map@md_strgetindex$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2precedence_heuristic cell_id$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2downstream_cells_map%one_step_actor_critic_binary_features$7d63b960-3998-4f7b-8cbb-ccd49db9aeac$646bc853-b7fc-49fa-a201-ff98e8f952d4upstream_cells_mapbinary_value_function$a540814a-57a1-4b98-9443-59e401425444setup_binary_policy_arguments$96506201-6b66-49e6-8179-06952e2394e1zerosBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41IntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84VectorRealone_step_actor_critic!$4d4ae57b-afc3-44f9-b6fc-892f59f82921Function!update_binary_eligibility_vector!$042fbafe-2401-4fb7-ac13-4531e0782c79lengthupdate_binary_value_gradient!$03a218cb-aa83-4000-85b5-c6f247087053Matrix!update_binary_action_preferences!$a361f4c9-47ce-42ad-899c-87b611c0d471$73b90260-d57a-449a-8db6-47f91e6a4e4fprecedence_heuristic cell_id$73b90260-d57a-449a-8db6-47f91e6a4e4fdownstream_cells_mapupstream_cells_map@md_strgetindex$5aba4f96-e877-457e-8e95-18737348f99fprecedence_heuristic 
cell_id$5aba4f96-e877-457e-8e95-18737348f99fdownstream_cells_map"actor_critic_fcann_parameter_study$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62$c251a630-7114-4188-9323-8d8feb5c32e0upstream_cells_map@NamedTuple:IntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84VectorInt64RealFunction-Base^+$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486precedence_heuristic cell_id$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486downstream_cells_map$mountaincar_continuing_binary_params$04f42c09-8ab5-4233-b196-51c4aa2dcedbupstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunner(create_actor_critic_continuing_params_UI$5b15d91e-7119-4f85-a54a-7d4f1fdaf097Core.applicable@bindBase.get$27487ad0-4779-42ce-8def-e660ef04bee0precedence_heuristic cell_id$27487ad0-4779-42ce-8def-e660ef04bee0downstream_cells_mapupstream_cells_mapreinforce_test4$407a0724-4bb6-4c83-ab2d-17a0e19c4072cartpole_setup$26880577-d267-4950-8725-7afe0d0402b6 reinforce_test4.policy_and_value6cartpole_setup.mdps.episodic.discrete.initialize_state$0d93132d-5819-47dc-8cf2-462d480d9c3dprecedence_heuristic cell_id$0d93132d-5819-47dc-8cf2-462d480d9c3ddownstream_cells_mapupstream_cells_mapmountaincar_continuous_mdp$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2@md_str<>islessmountaincar_tilecoding_setup$7c592385-e8d3-4efe-962c-d39debb64405%mountaincar_binary_continuous_params2$0d45ae72-572f-4d17-83cf-9814f2854131>actor_critic_binary_episodic_squashed_gaussian_parameter_study$08505e88-9c23-4e95-91e3-d18bf5133dbc$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19f$c5a2879c-e89b-47f7-bbd6-48200d7e89e38run_mountaincar_binary_episodic_countinuous_param_study2$e524f8cc-ab69-4f8b-a59f-28156696a104getindex$9978d537-49ff-4014-a971-b42704c50a6bprecedence_heuristic cell_id$9978d537-49ff-4014-a971-b42704c50a6bdownstream_cells_mapfcann_cartpole_study_paramsupstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunnerCore.applicable@bindBase.get#create_actor_critic_fcann_params_UI$5eebf3da-bfe7-46eb-81a3-f87f334ee270$f8215517-b18f-4a03-9421-8edab4ca8089precedence_heuristic 
cell_id$f8215517-b18f-4a03-9421-8edab4ca8089downstream_cells_mapupstream_cells_mapshow_squashed_policy$ff3009eb-23f9-44fe-8e56-85dbc7b463d0test_mountaincar_state$b7f77935-bcab-4ef1-8e1b-a7d059784ff3"mountaincar_continuous_test_train3$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7d$1ac9296f-047b-4051-ba5c-0c23d5f9cde9precedence_heuristic cell_id$1ac9296f-047b-4051-ba5c-0c23d5f9cde9downstream_cells_mapcorridor_continuing_mdp$7afb6fb0-248a-4518-b94f-9876f81eca64$8b35661b-5075-4d63-bc31-044407f99acfupstream_cells_mapmake_corridor_continuing_mdp$f0104778-81a6-417b-8501-f916e5e7f3af$c87dba8c-9a96-41b3-9dc7-a6c088ec1eafprecedence_heuristic cell_id$c87dba8c-9a96-41b3-9dc7-a6c088ec1eafdownstream_cells_mapupstream_cells_map!mountaincar_continuous_test_train$b8532822-179b-4cd5-a279-4b71dafb544a&show_mountaincar_continuous_trajectory$b5319d8b-0420-4ebf-b603-ea0b93365ac1$5cc4d12d-b537-47e2-8109-4e7a234fdf25precedence_heuristic cell_id$5cc4d12d-b537-47e2-8109-4e7a234fdf25downstream_cells_mapmake_corridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabeupstream_cells_mapmaxStateMDPTransitionSampler$d963ff6d-f1b6-4799-aa0e-1ae100310d84-isless+*iseven==IntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84$5334064b-5a16-4135-afa0-86a48291725bprecedence_heuristic cell_id$5334064b-5a16-4135-afa0-86a48291725bdownstream_cells_mapupstream_cells_mapcorridor_train.value_functioncorridor_train$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72c$9c342958-1971-48ec-b919-5dfdcbc915a4precedence_heuristic cell_id$9c342958-1971-48ec-b919-5dfdcbc915a4downstream_cells_mapbgcolor$e5faaa1b-88cb-43e2-8d04-8972b58b4bdaupstream_cells_map@md_strBaseColorStringPickerPlutoRunner.create_bondPlutoRunnerCoreCore.applicableBase.get@bindgetindex$966ef17c-23be-49dc-bc37-4cb52b34c049precedence_heuristic cell_id$966ef17c-23be-49dc-bc37-4cb52b34c049downstream_cells_mapupstream_cells_map@md_strgetindex$e7e49ff8-32df-48a4-afb2-462859592e92precedence_heuristic 
cell_id$e7e49ff8-32df-48a4-afb2-462859592e92downstream_cells_map&form_state_and_policy_function_outputs$4d4ae57b-afc3-44f9-b6fc-892f59f82921$266d2234-26c8-43f1-9e75-49440a230ed6$83640f5b-fe13-4ec1-98a0-67a56c189ba1$b71145a4-2614-4f62-bfd2-7d5d1fecec56$4da20fd7-b897-4f26-bf2a-f08d66ddf90fupstream_cells_mapVectorsample_action$d963ff6d-f1b6-4799-aa0e-1ae100310d84deepcopyform_state_policy_function$37ec6802-d4c2-4470-ad69-439d5a732f77form_state_value_function$e7566274-5518-4e28-8738-d4b1747d0cfbFunctioncopy$78c83673-2117-4542-b4d8-1c243e8f610bprecedence_heuristic cell_id$78c83673-2117-4542-b4d8-1c243e8f610bdownstream_cells_mapupstream_cells_map@md_strgetindex$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8fprecedence_heuristic cell_id$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8fdownstream_cells_mapupstream_cells_map&mountaincar_continuous_test_train_beta$4156d955-9daf-4429-b152-e8332980fb9e&show_mountaincar_continuous_trajectory$b5319d8b-0420-4ebf-b603-ea0b93365ac1mountaincar_continuous_beta_mdp$8e096fae-9941-49d8-ae87-c68b02f68da5$396e0047-d848-462f-a769-0cc2829abc78precedence_heuristic cell_id$396e0047-d848-462f-a769-0cc2829abc78downstream_cells_mapupstream_cells_mapInt64corridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabeget_corridor_features$6bb0263e-368e-462a-948c-baf9cfa82512^4actor_critic_with_eligibility_traces_binary_features$05bfd818-bf4e-4bda-baa9-5ba647867097typemax$ff4f977e-48df-4c12-845c-c245b4d39d6dprecedence_heuristic 
cell_id$ff4f977e-48df-4c12-845c-c245b4d39d6ddownstream_cells_map#actor_critic_linear_parameter_study$7afb6fb0-248a-4518-b94f-9876f81eca64$1b102220-6d78-480d-a77f-0e57bad23dca$d57375a5-b9e0-4742-b5f7-6a7da891604aupstream_cells_map:AbstractVectorzerosrandIntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84UInt64RealFunctionlength4actor_critic_with_eligibility_traces_binary_features$05bfd818-bf4e-4bda-baa9-5ba647867097Matrixaverage_continuing_runs$ba642a22-6623-482a-ab4a-81585b83e457DataFrame4actor_critic_with_eligibility_traces_linear_features$68806899-9972-460a-9f11-daa708a9d610$aa450da4-fe84-4eea-b6c4-9820b7982437precedence_heuristic cell_id$aa450da4-fe84-4eea-b6c4-9820b7982437downstream_cells_mapupstream_cells_map@md_strgetindex$bb1ef180-39ac-475f-beea-ef573e71a3bfprecedence_heuristic cell_id$bb1ef180-39ac-475f-beea-ef573e71a3bfdownstream_cells_mapupstream_cells_mapdisplay_cartpole_episode$822e4d69-2582-4956-858e-06ecb091e76a|>ep2$a5b002c9-5e11-462a-9da0-6e060c7963f8$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27precedence_heuristic cell_id$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27downstream_cells_mapcartpole_continuing_fcann_test$04b5929a-2058-49c9-963a-96c752a1d67d$64b38d1f-ecf9-4843-89a1-4c8953048265$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7ef$fd58402f-da65-44cf-b81a-e21192fd0e63upstream_cells_map*actor_critic_with_eligibility_traces_fcann$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54cartpole_vector_update!$192b9f82-8d3a-408f-91c2-829cfcd32572cartpole_continuing_mdp$4c4e643b-d4b9-44f0-8d30-dc521bcc55accartpole_fcann_feature_setup$61650a97-b353-4a85-b50b-93fee296ac7b$5b868eba-c1af-49f6-8f93-79b78c319a6fprecedence_heuristic 
cell_id$5b868eba-c1af-49f6-8f93-79b78c319a6fdownstream_cells_map,reinforce_with_baseline_monte_carlo_control!$0ac7ea44-14f6-4e80-80f9-d6df8059bb38$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb$d1ed25e6-60c6-411f-a541-99986e5da2c5$697b2310-9d96-4f7f-be62-c3bd6bf736f3$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$d5020a8d-1dd7-403c-9d1f-665b95543943upstream_cells_mapzerooneContinuousMDP$537270ba-122b-4f2b-880b-31d086766295copyVector%form_state_continuous_policy_function$f545c800-0bf3-491f-9d7d-42341cfdb573Realeachindexrunepisode!$d963ff6d-f1b6-4799-aa0e-1ae100310d84deepcopy/^update_params_with_gradient!$b0a66a19-ee76-463b-a704-8fcee85444d0$a893a87b-2d07-4db5-9d1a-9da8646216f4$f55afa58-962d-4551-8d95-a5b467d61adfrunepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84:zerosIntegerFunctionInt64-+*form_state_value_function$e7566274-5518-4e28-8738-d4b1747d0cfb$68469a40-7976-48b7-b7a1-eaa4c5f33a18precedence_heuristic cell_id$68469a40-7976-48b7-b7a1-eaa4c5f33a18downstream_cells_map"plot_mountaincar_continuous_values$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580a$b695ef21-a1ac-4d1f-a0e1-71cd81cede18$a0ca7a5e-0089-4a45-9278-c0f27cd096a0upstream_cells_mapHypertextLiteral.BypassLinRangezerosHypertextLiteral.content@htl$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebFunctionenumerateFloat32plotHypertextLiteral$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70heatmapHypertextLiteral.ResultLayout$2a586e46-66e4-461a-85c8-5817e4d1aa43precedence_heuristic cell_id$2a586e46-66e4-461a-85c8-5817e4d1aa43downstream_cells_mapupstream_cells_map@md_strgetindex$a206c759-3f6e-4003-8cba-5f6ce6742646precedence_heuristic cell_id$a206c759-3f6e-4003-8cba-5f6ce6742646downstream_cells_mapupstream_cells_map@md_strgetindex$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1dprecedence_heuristic cell_id$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1ddownstream_cells_mapupstream_cells_map@md_strgetindex$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5precedence_heuristic 
cell_id$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5downstream_cells_mapupstream_cells_map@md_strgetindex$31db0f58-28e4-454f-9394-25565687266fprecedence_heuristic cell_id$31db0f58-28e4-454f-9394-25565687266fdownstream_cells_mapupstream_cells_maprandndisplay_cartpole_episode$822e4d69-2582-4956-858e-06ecb091e76aFloat32|>cartpole_mdps$024dcd1a-8eaa-4a95-8037-2f578828309crunepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84$822e4d69-2582-4956-858e-06ecb091e76aprecedence_heuristic cell_id$822e4d69-2582-4956-858e-06ecb091e76adownstream_cells_mapdisplay_cartpole_episode$0cd96c44-cae6-421f-9fae-26141600bef4$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fce$31db0f58-28e4-454f-9394-25565687266f$dddc4a2f-34b2-41dc-85b3-55aba4880fa6$5859ca11-90f8-4fd6-88ed-c56efe796fe8$11a55af7-5301-4507-bb26-88e1e11236db$07ba9fe4-aaa7-4123-9865-cbfa79d0d44a$daf35bfe-8f9c-4f55-971d-4d443be8f8bf$bb1ef180-39ac-475f-beea-ef573e71a3bfupstream_cells_mapgetfieldCartPoleStateenumeratescatterplotattrLayoutVector$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580aprecedence_heuristic cell_id$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580adownstream_cells_mapupstream_cells_map"plot_mountaincar_continuous_values$68469a40-7976-48b7-b7a1-eaa4c5f33a18!mountaincar_continuous_test_train$b8532822-179b-4cd5-a279-4b71dafb544a$05b0fcad-628b-48d2-aa24-f6f562dbb660precedence_heuristic cell_id$05b0fcad-628b-48d2-aa24-f6f562dbb660downstream_cells_mapupstream_cells_map@md_strgetindex$d2729657-d0bf-4d39-8ec7-f242a1ad48d6precedence_heuristic cell_id$d2729657-d0bf-4d39-8ec7-f242a1ad48d6downstream_cells_map)create_continuous_action_mountaincar_beta$8e096fae-9941-49d8-ae87-c68b02f68da5upstream_cells_mapMountainCarTask.step-*ContinuousMDP$537270ba-122b-4f2b-880b-31d086766295MountainCarTask$5c11a92d-7496-4aba-af15-2537eac49dd7precedence_heuristic 
cell_id$5c11a92d-7496-4aba-af15-2537eac49dd7downstream_cells_mapFCANNActivations$cc3ac95e-a398-438a-ba3d-62b6733f6342$5c4a383f-fcf2-4f2b-819f-6d84471dda00$635abb34-2c97-4f04-a74c-22fbec32f408upstream_cells_mapRealVector$1753b5ed-c00b-4b60-b492-822180778e8cprecedence_heuristic cell_id$1753b5ed-c00b-4b60-b492-822180778e8cdownstream_cells_mapupdate_linear_value_gradient!$d1ed25e6-60c6-411f-a541-99986e5da2c5$57e5e12a-b722-4ea3-ab3b-e5711029e640$68806899-9972-460a-9f11-daa708a9d610$d5020a8d-1dd7-403c-9d1f-665b95543943upstream_cells_mapRealVector$f7ede764-5ad8-426b-a805-cc21b622d977precedence_heuristic cell_id$f7ede764-5ad8-426b-a805-cc21b622d977downstream_cells_mapupstream_cells_map@md_strgetindex$36d514fa-b27a-4c6b-8399-9d108377b9b5precedence_heuristic cell_id$36d514fa-b27a-4c6b-8399-9d108377b9b5downstream_cells_mapstudy_params$c52c4cec-0ea8-4af3-831a-d284f0e086eeupstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunnerCore.applicable@bindBase.getcreate_actor_critic_params_UI$a8b40b8f-051a-4e6f-a079-ece4f32873de$6b1acb57-159a-4b7f-99fe-5f996522243bprecedence_heuristic cell_id$6b1acb57-159a-4b7f-99fe-5f996522243bdownstream_cells_mapupstream_cells_map$45f0a385-6465-4acc-8637-1b007a0fe215precedence_heuristic cell_id$45f0a385-6465-4acc-8637-1b007a0fe215downstream_cells_map update_fcann_eligibility_vector!$0e9de19e-bcd4-40ac-9831-afb6cad38422upstream_cells_map:CrossEntropyLoss$d963ff6d-f1b6-4799-aa0e-1ae100310d84FCANN$d963ff6d-f1b6-4799-aa0e-1ae100310d84@inboundsFCANNParamsIntegerVectoreachindexFloat32*FCANN.nnCostFunction$c52c4cec-0ea8-4af3-831a-d284f0e086eeprecedence_heuristic cell_id$c52c4cec-0ea8-4af3-831a-d284f0e086eedownstream_cells_mapupstream_cells_map:^+study_params$36d514fa-b27a-4c6b-8399-9d108377b9b5corridor_parameter_study$bc8a399b-8864-4473-89d2-e3b0a03d15b5$f8614042-7c94-4d47-a1b6-4e96676b4e8bprecedence_heuristic 
cell_id$f8614042-7c94-4d47-a1b6-4e96676b4e8bdownstream_cells_map+actor_critic_fcann_episodic_parameter_study$d34d22ad-89c2-423e-91dd-bfb895dc6540upstream_cells_map!Random$df7f84e8-b42a-4001-9dbf-6bc3ced94207FilterStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84VectorRealscatterisemptymeanmissing:AbstractVector*actor_critic_with_eligibility_traces_fcann$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54|>ismissingrandIntegerFunctionUInt64tcollectInt64plotMapLayoutRandom.seed!$76eb6743-cac0-4174-9ba3-a0691c200b54precedence_heuristic cell_id$76eb6743-cac0-4174-9ba3-a0691c200b54downstream_cells_mapmake_n_param_dist_params$ba5d6311-daee-4abc-b2fb-fae2184ef3eb$ed93259c-7b8b-46d7-97fb-f194e0e04b3a$4e29c621-223e-4859-8e96-db04b967815aupstream_cells_mapzerosNTuple*IntegerReal$94517664-6988-44dc-a297-e9d5873ee540precedence_heuristic cell_id$94517664-6988-44dc-a297-e9d5873ee540downstream_cells_mapsquashed_gaussian_plot_params$3e7cecec-eb77-4862-8e3c-b510422e06dbupstream_cells_map@md_strCore:PlutoUI$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70Base.get@bindSliderBasePlutoRunnerPlutoRunner.create_bondCore.applicablePlutoUI.combinegetindex$d037ea92-915c-4dc7-97c6-d006d92e088aprecedence_heuristic cell_id$d037ea92-915c-4dc7-97c6-d006d92e088adownstream_cells_mapfigure_13_1$0c56b341-24eb-4c78-844e-182f44a7221aupstream_cells_mapfoldxtLayout:*sqrtRandom$df7f84e8-b42a-4001-9dbf-6bc3ced94207|>Int64corridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabeget_corridor_features$6bb0263e-368e-462a-948c-baf9cfa82512scatter-/plot+log2-reinforce_monte_carlo_control_binary_features$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290fillMaproundRandom.seed!$24fa139c-ad4b-49db-ac8f-23c476ed8608precedence_heuristic 
cell_id$24fa139c-ad4b-49db-ac8f-23c476ed8608downstream_cells_mapreinforce_test$dddc4a2f-34b2-41dc-85b3-55aba4880fa6$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5upstream_cells_mapcartpole_setup$26880577-d267-4950-8725-7afe0d0402b6^Lreinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$2025ff38-f2ec-4224-b771-ff72ffe1af28precedence_heuristic cell_id$2025ff38-f2ec-4224-b771-ff72ffe1af28downstream_cells_mapmountaincar_min_vals$023f67b8-8f38-470a-9766-ac60a75678aa$7c592385-e8d3-4efe-962c-d39debb64405upstream_cells_map$cb70d400-3e9c-441c-b17c-e727e8c928f3precedence_heuristic cell_id$cb70d400-3e9c-441c-b17c-e727e8c928f3downstream_cells_mapupstream_cells_map@md_str<.start_mountaincar_continuing_fcann_param_study$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7d>#mountaincar_continuing_fcann_params$5d35e515-e2d3-443e-becf-eb28c25db346,mountaincar_fcann_continuing_parameter_study$c251a630-7114-4188-9323-8d8feb5c32e0islessgetindex$e034b9cb-f4ee-46f4-bea6-72c93c75d966precedence_heuristiccell_id$e034b9cb-f4ee-46f4-bea6-72c93c75d966downstream_cells_mapDataFramesupstream_cells_map$e6cf9550-2e69-4b82-92cf-5e07a35490aaprecedence_heuristic cell_id$e6cf9550-2e69-4b82-92cf-5e07a35490aadownstream_cells_mapzero_params!$266d2234-26c8-43f1-9e75-49440a230ed6$83640f5b-fe13-4ec1-98a0-67a56c189ba1$b71145a4-2614-4f62-bfd2-7d5d1fecec56$4da20fd7-b897-4f26-bf2a-f08d66ddf90fupstream_cells_mapzeroeachindex:ArrayFCANNParamsReal$717e4c69-59d5-4929-923f-dd35a97fb160precedence_heuristic cell_id$717e4c69-59d5-4929-923f-dd35a97fb160downstream_cells_mapNactor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions$08505e88-9c23-4e95-91e3-d18bf5133dbc$87482ea5-5265-4e02-92c0-1a8bb44ff0f4$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7dupstream_cells_mapRealFunctionNTupleUniononeIntegerContinuousMDP$537270ba-122b-4f2b-880b-31d086766295$1386ffdb-940d-4f1b-a872-4e38647b5335precedence_heuristic 
cell_id$1386ffdb-940d-4f1b-a872-4e38647b5335downstream_cells_mapupstream_cells_map@md_strgetindex$a893a87b-2d07-4db5-9d1a-9da8646216f4precedence_heuristic cell_id$a893a87b-2d07-4db5-9d1a-9da8646216f4downstream_cells_mapupdate_params_with_gradient!$4fb83451-b6f8-4e6e-a131-1accc8e10b08$4d4ae57b-afc3-44f9-b6fc-892f59f82921$266d2234-26c8-43f1-9e75-49440a230ed6$83640f5b-fe13-4ec1-98a0-67a56c189ba1$5b868eba-c1af-49f6-8f93-79b78c319a6f$b71145a4-2614-4f62-bfd2-7d5d1fecec56$4da20fd7-b897-4f26-bf2a-f08d66ddf90fupstream_cells_mapzero:islessBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41julia.simdloop@inboundsnothingVectorreinforce_test5$82e0e9a0-9662-429a-87e3-e6bdae02709aplotcumsumround$fac138d9-3c5d-44b0-a87c-b13872f19450precedence_heuristiccell_id$fac138d9-3c5d-44b0-a87c-b13872f19450downstream_cells_mapMemoizeupstream_cells_map$82e0e9a0-9662-429a-87e3-e6bdae02709aprecedence_heuristic cell_id$82e0e9a0-9662-429a-87e3-e6bdae02709adownstream_cells_mapreinforce_test5$27441783-d3c6-40be-9c36-4941613e6ae9$daf35bfe-8f9c-4f55-971d-4d443be8f8bf$a5b002c9-5e11-462a-9da0-6e060c7963f8$d4e87ac4-6008-43b2-aa06-e232ec2b2b5b$700dcbc4-c94c-4287-8cf0-0b2c7a320a3aupstream_cells_mapcartpole_setup$26880577-d267-4950-8725-7afe0d0402b6*actor_critic_with_eligibility_traces_fcann$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54cartpole_fcann_feature_setup$61650a97-b353-4a85-b50b-93fee296ac7b3cartpole_fcann_feature_setup.update_feature_vector!$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62precedence_heuristic cell_id$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62downstream_cells_map(start_mountaincar_continuing_param_study$04f42c09-8ab5-4233-b196-51c4aa2dcedbupstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunnerCounterButtonCore.applicable@bindBase.get$fad02876-efba-46a7-9cb7-43820528779fprecedence_heuristic 
cell_id$fad02876-efba-46a7-9cb7-43820528779fdownstream_cells_mapupstream_cells_map&cartpole_fcann_continuing_test_episode$64b38d1f-ecf9-4843-89a1-4c8953048265plot_cart$63fbf8f4-e4e2-4893-be09-67450e92dbd7-cartpole_fcann_continuing_episode_step_select$6acb549a-5d90-4457-a347-d22448ad8071$1ce4bc6c-7cde-48e9-8ff1-7281697fd121precedence_heuristic cell_id$1ce4bc6c-7cde-48e9-8ff1-7281697fd121downstream_cells_mapupstream_cells_mapep2_step$9bce6fdb-2cbc-4758-9a8b-794e490c973dep2$a5b002c9-5e11-462a-9da0-6e060c7963f8plot_cart$63fbf8f4-e4e2-4893-be09-67450e92dbd7$024dcd1a-8eaa-4a95-8037-2f578828309cprecedence_heuristic cell_id$024dcd1a-8eaa-4a95-8037-2f578828309cdownstream_cells_mapcartpole_mdps$cf1859d6-f889-4923-8c87-2d7c039f26c3$31db0f58-28e4-454f-9394-25565687266f$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5upstream_cells_mapcreate_cartpole_mdps$3c316495-bb6c-41e2-a38f-ba867a319fbb$e1274f57-75cb-4659-a82f-e5870c5367e2precedence_heuristic cell_id$e1274f57-75cb-4659-a82f-e5870c5367e2downstream_cells_mapep$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547d$374af774-3a97-49b5-a3bb-bc3f7f63a3fa$af144759-fe66-4ad0-b378-e9eb4e859db4upstream_cells_mapreinforce_test4$407a0724-4bb6-4c83-ab2d-17a0e19c4072cartpole_setup$26880577-d267-4950-8725-7afe0d0402b6runepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84$fdd3f4fd-4706-4d6b-b150-6ee6b4b370cbprecedence_heuristic cell_id$fdd3f4fd-4706-4d6b-b150-6ee6b4b370cbdownstream_cells_mapupstream_cells_map@md_strgetindex$b02ba928-5b9f-4695-b980-07988c788bb9precedence_heuristic cell_id$b02ba928-5b9f-4695-b980-07988c788bb9downstream_cells_map 
mountaincar_continuing_tile_test$98222fcd-b456-477c-90dd-844df36877e5$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6$e89bdc84-dbb5-4c73-a39c-6392e5f79704$da3cb392-78f2-48b2-b0dc-5f016664798cupstream_cells_map4actor_critic_with_eligibility_traces_binary_features$05bfd818-bf4e-4bda-baa9-5ba647867097mountaincar_tilecoding_setup$7c592385-e8d3-4efe-962c-d39debb64405mountaincar_continuing_mdp$46fea69b-599e-46ab-8455-d2da865d9a8e$f946c886-6246-4f98-a96f-f06984691ad8precedence_heuristic cell_id$f946c886-6246-4f98-a96f-f06984691ad8downstream_cells_mapApproximationUtils.runepisode!ApproximationUtils.runepisodeupstream_cells_map$@assertTuple>isless!ContinuousMDP$537270ba-122b-4f2b-880b-31d086766295Base.CoreLogging.!length=$3c316495-bb6c-41e2-a38f-ba867a319fbbprecedence_heuristic cell_id$3c316495-bb6c-41e2-a38f-ba867a319fbbdownstream_cells_mapcreate_cartpole_mdps$024dcd1a-8eaa-4a95-8037-2f578828309c$fddef10c-7695-4596-9e16-987fd45a57e6upstream_cells_mapzeroTabularRL$d963ff6d-f1b6-4799-aa0e-1ae100310d84>islessoneContinuousMDP$537270ba-122b-4f2b-880b-31d086766295Realtypemaxsarsa_λ$553b0ceb-f2ca-41ee-99bc-9f53a4487b49precedence_heuristic cell_id$553b0ceb-f2ca-41ee-99bc-9f53a4487b49downstream_cells_mapupstream_cells_mapbest_mc_corridor$a12b92d1-e045-4f92-b8cd-eee5d56fa67dget_corridor_episode_stats$fb8904a9-ae64-41cc-93b6-5a25855edad0$cecc2a35-3850-4f66-84e8-e29da4f3d4b0$f9facbba-39d4-483e-9066-275603156db0precedence_heuristic cell_id$f9facbba-39d4-483e-9066-275603156db0downstream_cells_mapplot_mountaincar_values$e89bdc84-dbb5-4c73-a39c-6392e5f79704$c0876a48-ea18-494d-8bfc-e2bceb73b417$d82e7ab8-c372-4462-afb5-1617560cdb56upstream_cells_mapHypertextLiteral.BypassLinRangezerosHypertextLiteral.content@htl$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebenumerateFloat32plotHypertextLiteral$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70heatmapHypertextLiteral.ResultLayout$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0precedence_heuristic 
cell_id$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0downstream_cells_mapupstream_cells_mapone_step_actor_critic_fcann$57bbdb10-bed8-459d-8f67-9ea637cf12baInt64corridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabe^update_corridor_features!$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811typemax$d41f1dd1-45fe-4456-9a01-ed47fd6704a7precedence_heuristic cell_id$d41f1dd1-45fe-4456-9a01-ed47fd6704a7downstream_cells_mapupdate_beta_eligibility_vector!$3e3c5897-809f-46e3-bb58-f115b082443eupstream_cells_mapexp:kfirstBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41VectorRealBinaryBetaEligibilityVector$54fff14b-cf53-47b0-9cfa-8b9ee33df54eMatrix+NTuplelast$ba5d6311-daee-4abc-b2fb-fae2184ef3ebprecedence_heuristic cell_id$ba5d6311-daee-4abc-b2fb-fae2184ef3ebdownstream_cells_map&setup_binary_gaussian_policy_arguments$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$20776e09-7d9b-4db8-a060-7bceeec65b47upstream_cells_mapBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41randIntegerContinuousMDP$537270ba-122b-4f2b-880b-31d086766295RealFunctionBinaryGaussianEligibilityVector$10cdd16e-a337-4421-a7a0-6de4e4b60c0fupdate_binary_feature_vector!$8eab55a5-41b7-4f5e-a02f-4c19388bc9eamake_n_param_dist_params$76eb6743-cac0-4174-9ba3-a0691c200b54UnionNTuple$8e742d32-c074-4981-b35b-b596b64c869bprecedence_heuristic cell_id$8e742d32-c074-4981-b35b-b596b64c869bdownstream_cells_map'cartpole_continuing_binary_study_params$b2539398-fdbc-42a2-a8f3-d327358f3643upstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunner(create_actor_critic_continuing_params_UI$5b15d91e-7119-4f85-a54a-7d4f1fdaf097Core.applicable@bindBase.get$03a218cb-aa83-4000-85b5-c6f247087053precedence_heuristic 
cell_id$03a218cb-aa83-4000-85b5-c6f247087053downstream_cells_mapupdate_binary_value_gradient!$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2$05bfd818-bf4e-4bda-baa9-5ba647867097$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$20776e09-7d9b-4db8-a060-7bceeec65b47$3e3c5897-809f-46e3-bb58-f115b082443e$05f120be-9695-4824-82fd-142a0df13098upstream_cells_mapRealBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41Vector$1ec1acf1-f833-4478-9b3c-88029340a629precedence_heuristic cell_id$1ec1acf1-f833-4478-9b3c-88029340a629downstream_cells_mapupstream_cells_map@md_strgetindex$de3cba34-9842-44d1-9b79-47126c0a0751precedence_heuristic cell_id$de3cba34-9842-44d1-9b79-47126c0a0751downstream_cells_mapcartpole_tilecoding_setup$1b102220-6d78-480d-a77f-0e57bad23dca$3c89209c-9202-4d5d-841c-ea34be369616upstream_cells_maptile_coding_setup/cartpole_functions$f27f2bcd-05b6-44fe-bf9e-a3e51556db7c$04f42c09-8ab5-4233-b196-51c4aa2dcedbprecedence_heuristic cell_id$04f42c09-8ab5-4233-b196-51c4aa2dcedbdownstream_cells_mapupstream_cells_map@md_str<$mountaincar_continuing_binary_params$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486>isless-mountaincar_binary_continuing_parameter_study$d57375a5-b9e0-4742-b5f7-6a7da891604a(start_mountaincar_continuing_param_study$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62getindex$54ff46a2-489a-4dd2-bc30-df70c780cc42precedence_heuristic cell_id$54ff46a2-489a-4dd2-bc30-df70c780cc42downstream_cells_mapupstream_cells_map$7126aefd-b847-497a-9545-514e9b9afa71precedence_heuristic cell_id$7126aefd-b847-497a-9545-514e9b9afa71downstream_cells_mapupstream_cells_map$48dcd2d0-a940-41da-a097-90c780f2ec4dprecedence_heuristic cell_id$48dcd2d0-a940-41da-a097-90c780f2ec4ddownstream_cells_mapupstream_cells_map@md_strgetindex$e1493cea-19c4-475d-98a0-86d27fb04af1precedence_heuristic 
cell_id$e1493cea-19c4-475d-98a0-86d27fb04af1downstream_cells_mapupstream_cells_mapsarsa_λInt64corridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabeget_corridor_features$6bb0263e-368e-462a-948c-baf9cfa82512|>typemaxget_corridor_episode_stats$fb8904a9-ae64-41cc-93b6-5a25855edad0$cecc2a35-3850-4f66-84e8-e29da4f3d4b0$511a847f-234c-465e-8f4a-688e79d9b975precedence_heuristic cell_id$511a847f-234c-465e-8f4a-688e79d9b975downstream_cells_mapupstream_cells_map@md_strgetindex$697b2310-9d96-4f7f-be62-c3bd6bf736f3precedence_heuristic cell_id$697b2310-9d96-4f7f-be62-c3bd6bf736f3downstream_cells_map1reinforce_with_baseline_monte_carlo_control_fcann$aa69e4ea-91e0-496a-a7be-529e67f4dbecupstream_cells_map,reinforce_with_baseline_monte_carlo_control!$4fb83451-b6f8-4e6e-a131-1accc8e10b08$5b868eba-c1af-49f6-8f93-79b78c319a6fFCANN$d963ff6d-f1b6-4799-aa0e-1ae100310d84FCANNParamsIntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84FCANN.initializeparams_saxeInt64VectorRealFunction&setup_fcann_policy_and_value_arguments$e1aec891-d95a-47d1-97d7-d2a4cfb16e64lengthfill$056a8adc-92f4-4b33-90d9-4b3b4026bbbcprecedence_heuristic cell_id$056a8adc-92f4-4b33-90d9-4b3b4026bbbcdownstream_cells_mapupdate_traces_with_gradient!$266d2234-26c8-43f1-9e75-49440a230ed6$83640f5b-fe13-4ec1-98a0-67a56c189ba1$b71145a4-2614-4f62-bfd2-7d5d1fecec56$4da20fd7-b897-4f26-bf2a-f08d66ddf90fupstream_cells_map*zero'BinarySquashedGaussianEligibilityVector$76fd79a2-2bc8-45f8-a243-48415118898aisnanislessdigamma@inboundsone'Base.CoreLogging.Base.fixup_stdlib_pathBase.CoreLogging.!nothing=Base.simd_inner_lengthBinaryGaussianEligibilityVector$10cdd16e-a337-4421-a7a0-6de4e4b60c0fθ$bc8a399b-8864-4473-89d2-e3b0a03d15b5precedence_heuristic 
cell_id$bc8a399b-8864-4473-89d2-e3b0a03d15b5downstream_cells_mapcorridor_parameter_study$c52c4cec-0ea8-4af3-831a-d284f0e086eeupstream_cells_mapcorridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabeget_corridor_features$6bb0263e-368e-462a-948c-baf9cfa82512,actor_critic_binary_episodic_parameter_study$1f041cb3-618c-4380-a1ec-d7bbe4a80f62$d9d11d69-bc16-400a-8f46-f9a8ecb8516a$bba13634-ff0e-47f7-a23b-8d56098f4ac6precedence_heuristic cell_id$bba13634-ff0e-47f7-a23b-8d56098f4ac6downstream_cells_mapgaussian_action_samplermake_gaussian_n_samplermake_gaussian_sampler$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$d5020a8d-1dd7-403c-9d1f-665b95543943$20776e09-7d9b-4db8-a060-7bceeec65b47upstream_cells_mapisapproxzeroexpisnanrandIntegerVectorRealisinfVal+NTuplentupleNormal$407a0724-4bb6-4c83-ab2d-17a0e19c4072precedence_heuristic cell_id$407a0724-4bb6-4c83-ab2d-17a0e19c4072downstream_cells_mapreinforce_test4$27487ad0-4779-42ce-8def-e660ef04bee0$9d264543-33ab-498a-90f5-5f913c252484$07ba9fe4-aaa7-4123-9865-cbfa79d0d44a$af144759-fe66-4ad0-b378-e9eb4e859db4$e1274f57-75cb-4659-a82f-e5870c5367e2upstream_cells_mapInt64cartpole_setup$26880577-d267-4950-8725-7afe0d0402b6*actor_critic_with_eligibility_traces_fcann$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54cartpole_fcann_feature_setup$61650a97-b353-4a85-b50b-93fee296ac7b3cartpole_fcann_feature_setup.update_feature_vector!typemax$77cf3a74-899f-4ade-99f2-5aaf7a98c02dprecedence_heuristic cell_id$77cf3a74-899f-4ade-99f2-5aaf7a98c02ddownstream_cells_mapscale_fcann_params!$f3e2db06-9cb7-464a-96b8-938175efd26bupstream_cells_mapReal:eachindex/@inboundsFCANNParamsVector$28ce6e60-59cf-408a-8081-b978507b3c72precedence_heuristic 
cell_id$28ce6e60-59cf-408a-8081-b978507b3c72downstream_cells_map$cartpole_fcann_continuing_test_state$fd58402f-da65-44cf-b81a-e21192fd0e63upstream_cells_map@md_strCore:deg2radLinRangePlutoUI$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70|>Base.get@bindSliderBasePlutoRunner-PlutoRunner.create_bondconfirmCore.applicablePlutoUI.combinegetindex$7ccadf01-fbba-4dfd-a5ad-770dab9946f9precedence_heuristic cell_id$7ccadf01-fbba-4dfd-a5ad-770dab9946f9downstream_cells_mapupstream_cells_map@md_strgetindex$b72e030f-7d52-481f-b4f7-2b16b227e547precedence_heuristic cell_id$b72e030f-7d52-481f-b4f7-2b16b227e547downstream_cells_mapupstream_cells_map@md_strgetindex$4c5cb75e-79b5-4502-b1eb-6246e002feafprecedence_heuristic cell_id$4c5cb75e-79b5-4502-b1eb-6246e002feafdownstream_cells_mapmountaincar_binary_params$8eb42403-1234-4e59-993e-057cc3a6d5c9upstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunnerCore.applicable@bindBase.getcreate_actor_critic_params_UI$a8b40b8f-051a-4e6f-a079-ece4f32873de$48b342f2-e48f-457a-9bd3-b3504a79f3a6precedence_heuristic cell_id$48b342f2-e48f-457a-9bd3-b3504a79f3a6downstream_cells_mapupstream_cells_map@md_strgetindex$5d50a5d0-8fe2-4c6e-b76c-d5614e4fd884precedence_heuristic cell_id$5d50a5d0-8fe2-4c6e-b76c-d5614e4fd884downstream_cells_mapshow_or_lookup_plotupstream_cells_mapDictNamedTuple@md_strFunctionAbstractStringTuplehaskey==Integergetindex$ba645f6b-143f-4e83-9003-707770ae308dprecedence_heuristic cell_id$ba645f6b-143f-4e83-9003-707770ae308ddownstream_cells_mapshow_mountaincar_trajectory$da3cb392-78f2-48b2-b0dc-5f016664798c$3a37b53d-9174-4faa-9404-74a40c385b0a$ddbca73f-c692-46f2-95f3-a7dd849d33f7upstream_cells_mapsumHypertextLiteral.BypassHypertextLiteral.content@htl$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebIntegerFunctionscatterplotHypertextLiteral.ResultHypertextLiteral$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70Layoutrunepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84MountainCarTask$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811precedence_heuristic 
cell_id$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811downstream_cells_mapupdate_corridor_features!$5720e942-d3f8-4329-83a8-8bcedf078b6a$cacaaca6-6e01-464f-a2ee-cbf62737a426$07ad517a-c2ac-4377-99fb-adb13d0f1d0c$aa69e4ea-91e0-496a-a7be-529e67f4dbec$a12b92d1-e045-4f92-b8cd-eee5d56fa67d$9db9ff71-bee9-4bea-a45b-748f8517fed1$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0upstream_cells_mapRealoneVector$8f1b2db4-ed35-44fc-a3d5-e06deae16d48precedence_heuristic cell_id$8f1b2db4-ed35-44fc-a3d5-e06deae16d48downstream_cells_mapupstream_cells_map$57bbdb10-bed8-459d-8f67-9ea637cf12baprecedence_heuristic cell_id$57bbdb10-bed8-459d-8f67-9ea637cf12badownstream_cells_mapone_step_actor_critic_fcann$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0upstream_cells_mapFCANN$d963ff6d-f1b6-4799-aa0e-1ae100310d84FCANNParamsIntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84FCANN.initializeparams_saxeInt64VectorRealone_step_actor_critic!$4d4ae57b-afc3-44f9-b6fc-892f59f82921Function&setup_fcann_policy_and_value_arguments$e1aec891-d95a-47d1-97d7-d2a4cfb16e64lengthfill$ca360680-afc9-4dd9-9351-493643f91575precedence_heuristic cell_id$ca360680-afc9-4dd9-9351-493643f91575downstream_cells_mapupstream_cells_map@md_strgetindex$d95f75b5-21d8-4862-baa7-50b58d9725b8precedence_heuristic cell_id$d95f75b5-21d8-4862-baa7-50b58d9725b8downstream_cells_mapupstream_cells_map@md_strgetindex$65be0e58-24be-4932-92a9-9e4825b14144precedence_heuristic cell_id$65be0e58-24be-4932-92a9-9e4825b14144downstream_cells_map@actor_critic_binary_continuing_squashed_gaussian_parameter_studyupstream_cells_mapRealFunctiononeIntegerContinuousMDP$537270ba-122b-4f2b-880b-31d086766295$60c21e9c-e42d-4f0b-a910-3b318440fbc8precedence_heuristic 
cell_id$60c21e9c-e42d-4f0b-a910-3b318440fbc8downstream_cells_mapgaussian_plot_params$09dd1440-5d09-421f-addc-b1ede43ff517upstream_cells_map@md_strCore:PlutoUI$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70Base.get@bindSliderBasePlutoRunnerPlutoRunner.create_bondCore.applicablePlutoUI.combinegetindex$da2d3186-a778-41cc-9b49-759bf1e9b8faprecedence_heuristic cell_id$da2d3186-a778-41cc-9b49-759bf1e9b8fadownstream_cells_mapBinaryFeatures$65d2add6-fd6f-456c-92ed-3cd9d1862ef6$2cbc972b-c685-4c1c-8a8d-9d58b197ad90upstream_cells_mapC2C1C3TAbstractVectorNBaseNTupleUnionInteger$b695ef21-a1ac-4d1f-a0e1-71cd81cede18precedence_heuristic cell_id$b695ef21-a1ac-4d1f-a0e1-71cd81cede18downstream_cells_mapupstream_cells_map"plot_mountaincar_continuous_values$68469a40-7976-48b7-b7a1-eaa4c5f33a18"mountaincar_continuous_test_train2$fee14dfe-c5ca-4126-a830-cc9d7eda5433$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00precedence_heuristic cell_id$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00downstream_cells_mapLreinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions$24fa139c-ad4b-49db-ac8f-23c476ed8608$8aa16866-bfda-48df-9cf1-cf3d2e203ccbupstream_cells_map,reinforce_with_baseline_monte_carlo_control!$4fb83451-b6f8-4e6e-a131-1accc8e10b08$5b868eba-c1af-49f6-8f93-79b78c319a6fbinary_value_function$a540814a-57a1-4b98-9443-59e401425444make_n_param_dist_policy_params$ba41f521-4ee2-42a6-bf18-078bfa4b875ezerosBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41randIntegermake_gaussian_sampler$bba13634-ff0e-47f7-a23b-8d56098f4ac6VectorContinuousMDP$537270ba-122b-4f2b-880b-31d086766295#update_gaussian_eligibility_vector!$5261651e-a51e-4e80-8e23-83a4c10e5259$740a3f41-9302-481d-b373-762c0dea8effRealFunction&setup_binary_gaussian_policy_arguments$ba5d6311-daee-4abc-b2fb-fae2184ef3ebupdate_binary_value_gradient!$03a218cb-aa83-4000-85b5-c6f247087053MatrixNTupleUnion!update_binary_action_preferences!$a361f4c9-47ce-42ad-899c-87b611c0d471$dcb306ae-a1b1-43d6-ba6e-e38668838689precedence_heuristic 
cell_id$dcb306ae-a1b1-43d6-ba6e-e38668838689downstream_cells_mapupstream_cells_map@md_strgetindex$54f559b6-8a62-4a42-894d-c56e41d5ebefprecedence_heuristic cell_id$54f559b6-8a62-4a42-894d-c56e41d5ebefdownstream_cells_mapcorridor_state_counts$62e677ac-2070-4f6b-9df2-90849d89fa9fupstream_cells_mapcollect_state_distributions$0c9986bb-54c0-4b08-9c29-4bfb0b68b54e$f545c800-0bf3-491f-9d7d-42341cfdb573precedence_heuristic cell_id$f545c800-0bf3-491f-9d7d-42341cfdb573downstream_cells_map%form_state_continuous_policy_function$5b868eba-c1af-49f6-8f93-79b78c319a6f$11b9beea-b0cd-45eb-84c6-151728894df0upstream_cells_mapFunction$8b35661b-5075-4d63-bc31-044407f99acfprecedence_heuristic cell_id$8b35661b-5075-4d63-bc31-044407f99acfdownstream_cells_mapupstream_cells_mapget_corridor_features$6bb0263e-368e-462a-948c-baf9cfa825124actor_critic_with_eligibility_traces_binary_features$05bfd818-bf4e-4bda-baa9-5ba647867097corridor_continuing_mdp$1ac9296f-047b-4051-ba5c-0c23d5f9cde9$09dd1440-5d09-421f-addc-b1ede43ff517precedence_heuristic cell_id$09dd1440-5d09-421f-addc-b1ede43ff517downstream_cells_mapupstream_cells_mapNormalpdfscatterplotLinRangeLayoutgaussian_plot_params$60c21e9c-e42d-4f0b-a910-3b318440fbc8$a0ca7a5e-0089-4a45-9278-c0f27cd096a0precedence_heuristic cell_id$a0ca7a5e-0089-4a45-9278-c0f27cd096a0downstream_cells_mapupstream_cells_map"plot_mountaincar_continuous_values$68469a40-7976-48b7-b7a1-eaa4c5f33a18"mountaincar_continuous_test_train3$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7d$64b38d1f-ecf9-4843-89a1-4c8953048265precedence_heuristic 
cell_id$64b38d1f-ecf9-4843-89a1-4c8953048265downstream_cells_map&cartpole_fcann_continuing_test_episode$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fce$6acb549a-5d90-4457-a347-d22448ad8071$fad02876-efba-46a7-9cb7-43820528779f$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7efupstream_cells_mapcartpole_setup$26880577-d267-4950-8725-7afe0d0402b6runepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84cartpole_continuing_fcann_test$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27$d963ff6d-f1b6-4799-aa0e-1ae100310d84precedence_heuristiccell_id$d963ff6d-f1b6-4799-aa0e-1ae100310d84downstream_cells_mapspzerosdroptol!requestCostFunctions#monte_carlo_off_policy_prediction_qdouble_expected_sarsaAbstractAveragingMethodpolicy_iteration_vTailRecsample_rolloutautoTuneParamsq_learninggetBackendfullTrainsparse_vcatmonte_carlo_control_ϵ_softreadBinParamsmake_greedy_policy!policy_iteration!smartTuneR$monte_carlo_control_exploring_startssparseinitialize_state_action_valuemultiTrainStateMRPTransitionDistribution!monte_carlo_off_policy_predictionrook_actionstd0_policy_prediction_vtd0_policy_prediction_qFCANN$cc3ac95e-a398-438a-ba3d-62b6733f6342$45f0a385-6465-4acc-8637-1b007a0fe215$0e9de19e-bcd4-40ac-9831-afb6cad38422$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091$5c4a383f-fcf2-4f2b-819f-6d84471dda00$635abb34-2c97-4f04-a74c-22fbec32f408$f3e2db06-9cb7-464a-96b8-938175efd26b$697b2310-9d96-4f7f-be62-c3bd6bf736f3$57bbdb10-bed8-459d-8f67-9ea637cf12ba$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54$8bc280db-e57d-4e40-be46-1790f4f7d9e7$11063fff-4d36-46d5-828f-dbed0f46b9cf@tailrecsimulate!TabularMRPTransitionSamplermultiTrainAutoRegget_cuda_toolkit_versionsStateMRPTransitionSamplerautoTuneRwriteArrayAbstractSparseMatrix@using_nvidialib_settingsbellman_state_valuedropoutRegsmartEvalLayersdistribution_rolloutcalcfeatureimpactSparseMatrixCSCADAMAXTrainNNCPUAbstractSparseVectorAbstractMDP$537270ba-122b-4f2b-880b-31d086766295policy_evaluation_qmonte_carlo_policy_predictionsample_action$4fb83451-b6f8-4e6e-a131-1accc8e10b08$e7e49ff8-32df-48a4-afb2-462859592e92$4d4ae57b
-afc3-44f9-b6fc-892f59f82921$266d2234-26c8-43f1-9e75-49440a230ed6$83640f5b-fe13-4ec1-98a0-67a56c189ba1traintrialsrunepisode!$4fb83451-b6f8-4e6e-a131-1accc8e10b08$5b868eba-c1af-49f6-8f93-79b78c319a6fdouble_q_learningmaxNormRegmonte_carlo_policy_prediction_vNVIDIALibrariesAbstractAfterstateMDPsparse_hvcatAbstractMPpolicy_evaluation!ApproximationUtils$f946c886-6246-4f98-a96f-f06984691ad8sparse_hcatvalue_iteration_qinitializeParamsissparsemonte_carlo_tree_search_Previous_Controller_archEvalmonte_carlo_predictionAbstractSparseArraybenchmarkCPUThreadsnonzerosgeneralized_sarsa!policy_evaluation_vevalLayersftranspose!StateMRPnzrangeAbstractTabularTransitionTabularRL$3c316495-bb6c-41e2-a38f-ba867a319fbbConstantStepAveragingarchEvalSampletuneAlphamake_stochastic_gridworldwriteParamsbellman_afterstate_valueStateMDP $5cc4d12d-b537-47e2-8109-4e7a234fdf25$0ac7ea44-14f6-4e80-80f9-d6df8059bb38$96506201-6b66-49e6-8179-06952e2394e1$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290$8e39bd15-862e-4941-88f9-2794b861a523$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091$4fb83451-b6f8-4e6e-a131-1accc8e10b08$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb$d1ed25e6-60c6-411f-a541-99986e5da2c5$697b2310-9d96-4f7f-be62-c3bd6bf736f3$4d4ae57b-afc3-44f9-b6fc-892f59f82921$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2$57e5e12a-b722-4ea3-ab3b-e5711029e640$57bbdb10-bed8-459d-8f67-9ea637cf12ba$266d2234-26c8-43f1-9e75-49440a230ed6$05bfd818-bf4e-4bda-baa9-5ba647867097$68806899-9972-460a-9f11-daa708a9d610$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54$1f041cb3-618c-4380-a1ec-d7bbe4a80f62$11ea640c-3981-404d-87c6-4d3d0708a2b8$f8614042-7c94-4d47-a1b6-4e96676b4e8b$83640f5b-fe13-4ec1-98a0-67a56c189ba1$f0104778-81a6-417b-8501-f916e5e7f3af$734573e5-547b-4dcc-89bb-412aa6cc42d6$e96d592d-1e54-486d-8ad9-b857f85476e8$ff4f977e-48df-4c12-845c-c245b4d39d6d$8bc280db-e57d-4e40-be46-1790f4f7d9e7$5aba4f96-e877-457e-8e95-18737348f99f$11063fff-4d36-46d5-828f-dbed0f46b9cf$4c4e643b-d4b9-44f0-8d30-dc521bcc55ac$00152954-dc98-4120-b94b-2ea4d987832b$d9d11d69-bc16-400a-8f46-f9a8ecb8516a
make_random_policyspdiagmdropzerospreptrainingbackendListsprandnLossTypemrp_evaluationAbstractStateTransitionexpected_sarsaevalMultiSampleAveragingmonte_carlo_off_policy_controltuneRsetBackendTabularMRPuctmakelookupapply_uct!L2RegGridworldActiontd0_predictionsparsevecTabularDeterministicTransitionvalue_iteration!blockdiagTabularAfterstateMDPtd0_policy_predictionGridworldStatemake_ϵ_greedy_policy!$5981f52b-d829-4c7d-b47b-33310f7d64a2value_iteration_vTabularMDPAbstractTransition$c8b47eac-2d45-419a-bec6-2ae0cdc59393mrp_evaluation!AbstractMRPsprandfind_terminal_statessarsaTabularMDPTransitionSamplerswitch_devicedropzeros!TabularTransitionDistributionOutputIndex$5c4a383f-fcf2-4f2b-819f-6d84471dda00monte_carlo_controlStateMDPTransitionDistributionpolicy_evaluationreadBinInputfkeep!bellman_policy_update!SparseVectorvalue_iterationtestTrainrunepisode$f7433324-acc3-49a5-b5b3-ada0c8f09d52$fb8904a9-ae64-41cc-93b6-5a25855edad0$cecc2a35-3850-4f66-84e8-e29da4f3d4b0$0c9986bb-54c0-4b08-9c29-4bfb0b68b54e$4fb83451-b6f8-4e6e-a131-1accc8e10b08$0cd96c44-cae6-421f-9fae-26141600bef4$64b38d1f-ecf9-4843-89a1-4c8953048265$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6$5b868eba-c1af-49f6-8f93-79b78c319a6f$cf1859d6-f889-4923-8c87-2d7c039f26c3$31db0f58-28e4-454f-9394-25565687266f$dddc4a2f-34b2-41dc-85b3-55aba4880fa6$5859ca11-90f8-4fd6-88ed-c56efe796fe8$11a55af7-5301-4507-bb26-88e1e11236db$07ba9fe4-aaa7-4123-9865-cbfa79d0d44a$e1274f57-75cb-4659-a82f-e5870c5367e2$daf35bfe-8f9c-4f55-971d-4d443be8f8bf$a5b002c9-5e11-462a-9da0-6e060c7963f8$ba645f6b-143f-4e83-9003-707770ae308d$b5319d8b-0420-4ebf-b603-ea0b93365ac1make_deterministic_gridworldCrossEntropyLoss$45f0a385-6465-4acc-8637-1b007a0fe215findnzSparseArrayscheckNumGradTabularStochasticTransitioninitialize_afterstate_valuennzbellman_state_action_valueStateMDPTransitionSampler$5cc4d12d-b537-47e2-8109-4e7a234fdf25$f0104778-81a6-417b-8501-f916e5e7f3af$4c4e643b-d4b9-44f0-8d30-dc521bcc55ac$00152954-dc98-4120-b94b-2ea4d987832b$3c316495-bb6c-41e2-a38f-ba867a319fbbbe
nchmarkDevicefind_available_actionsinitialize_state_valuemonte_carlo_policy_prediction_qpolicy_iterationpermuterowvalsupstream_cells_mapBase.CoreLogging.invokelatestBase.CoreLogging.===@raw_str#___this_pluto_module_namerethrowBase.CoreLogging.!'Base.CoreLogging.Base.fixup_stdlib_pathBaseBase.CoreLogging.isaBase.invokelatestBase.CoreLogging.>=@__DIR__PlutoDevMacros.@frompackage$b16899b7-36bf-4a5e-8e2f-4496b8450687precedence_heuristic cell_id$b16899b7-36bf-4a5e-8e2f-4496b8450687downstream_cells_mapsquashed_gaussian_pdf$00bd2835-b006-4244-9877-bc7e031e3ef8upstream_cells_mapexpsqrtAbstractArrayReal-atanh/^πUnion*absinv$10cdd16e-a337-4421-a7a0-6de4e4b60c0fprecedence_heuristic cell_id$10cdd16e-a337-4421-a7a0-6de4e4b60c0fdownstream_cells_mapBinaryGaussianEligibilityVector$f55afa58-962d-4551-8d95-a5b467d61adf$740a3f41-9302-481d-b373-762c0dea8eff$ba5d6311-daee-4abc-b2fb-fae2184ef3eb$056a8adc-92f4-4b33-90d9-4b3b4026bbbcupstream_cells_mapRealzeroNzerosBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41UnionNTupleonesoneVector$a8b40b8f-051a-4e6f-a079-ece4f32873deprecedence_heuristic cell_id$a8b40b8f-051a-4e6f-a079-ece4f32873dedownstream_cells_mapcreate_actor_critic_params_UI$36d514fa-b27a-4c6b-8399-9d108377b9b5$4c5cb75e-79b5-4502-b1eb-6246e002feaf$71a5fce8-6d9a-4625-bad1-a951d61bff28$0d45ae72-572f-4d17-83cf-9814f2854131upstream_cells_map@md_strBase.getindex:PlutoUI$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70|>HypertextLiteral.contentSlider@htl$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebBaseNumberFieldHypertextLiteral.ResultHypertextLiteral$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70PlutoUI.combineconfirm$5eebf3da-bfe7-46eb-81a3-f87f334ee270precedence_heuristic cell_id$5eebf3da-bfe7-46eb-81a3-f87f334ee270downstream_cells_map#create_actor_critic_fcann_params_UI$9978d537-49ff-4014-a971-b42704c50a6bupstream_cells_map@md_str:PlutoUI$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70NumberField|>PlutoUI.combineconfirmSlidergetindex$9bce6fdb-2cbc-4758-9a8b-794e490c973dprecedence_heuristic 
cell_id$9bce6fdb-2cbc-4758-9a8b-794e490c973ddownstream_cells_mapep2_step$1ce4bc6c-7cde-48e9-8ff1-7281697fd121upstream_cells_mapCore:Base.get@bindep2$a5b002c9-5e11-462a-9da0-6e060c7963f8SliderlengthBasePlutoRunnerPlutoRunner.create_bondCore.applicable$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbfprecedence_heuristic cell_id$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbfdownstream_cells_map$create_continuous_action_mountaincar$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2$349631b2-4686-49a9-9f3a-1e4ad588b568upstream_cells_mapabsisless*ContinuousMDP$537270ba-122b-4f2b-880b-31d086766295MountainCarTask$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6precedence_heuristic cell_id$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6downstream_cells_map#mountaincar_continuing_test_episodeupstream_cells_map mountaincar_continuing_tile_test$b02ba928-5b9f-4695-b980-07988c788bb9runepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84MountainCarTask$7afb6fb0-248a-4518-b94f-9876f81eca64precedence_heuristic cell_id$7afb6fb0-248a-4518-b94f-9876f81eca64downstream_cells_map#corridor_continuing_parameter_study$42775fd1-5b27-48e0-abf1-9b22bb775e6dupstream_cells_mapget_corridor_features$6bb0263e-368e-462a-948c-baf9cfa82512corridor_continuing_mdp$1ac9296f-047b-4051-ba5c-0c23d5f9cde9#actor_critic_linear_parameter_study$734573e5-547b-4dcc-89bb-412aa6cc42d6$e96d592d-1e54-486d-8ad9-b857f85476e8$ff4f977e-48df-4c12-845c-c245b4d39d6d$37a273b6-b104-46f0-987a-401dc1c97327precedence_heuristic cell_id$37a273b6-b104-46f0-987a-401dc1c97327downstream_cells_map,start_cartpole_continuing_binary_param_study$b2539398-fdbc-42a2-a8f3-d327358f3643upstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunnerCounterButtonCore.applicable@bindBase.get$7a6f3f79-ea06-4994-8b62-90b2056e4034precedence_heuristic cell_id$7a6f3f79-ea06-4994-8b62-90b2056e4034downstream_cells_map squashed_gaussian_action_sampler 
make_squashed_gaussian_n_samplermake_squashed_gaussian_sampler$05f120be-9695-4824-82fd-142a0df13098upstream_cells_mapisapproxzeroexpisnanrandIntegerVectorRealisinfsignVal+NTuplentupletanh*Normal$f2ed56c9-c2b7-42cb-a083-e12aeaa126efprecedence_heuristic cell_id$f2ed56c9-c2b7-42cb-a083-e12aeaa126efdownstream_cells_mapupstream_cells_mapcorridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabeget_corridor_features$6bb0263e-368e-462a-948c-baf9cfa82512^-reinforce_monte_carlo_control_binary_features$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290$cbea5840-49d2-4e91-be9c-f5f15666d78aprecedence_heuristic cell_id$cbea5840-49d2-4e91-be9c-f5f15666d78adownstream_cells_mapupstream_cells_mapcorridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabeget_corridor_features$6bb0263e-368e-462a-948c-baf9cfa82512;reinforce_with_baseline_monte_carlo_control_binary_features$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb^$1f041cb3-618c-4380-a1ec-d7bbe4a80f62precedence_heuristic cell_id$1f041cb3-618c-4380-a1ec-d7bbe4a80f62downstream_cells_map,actor_critic_binary_episodic_parameter_study$bc8a399b-8864-4473-89d2-e3b0a03d15b5$8eb42403-1234-4e59-993e-057cc3a6d5c9upstream_cells_mapRandom$df7f84e8-b42a-4001-9dbf-6bc3ced94207StateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84lengthcopyVectorRealscatter/Matrix4actor_critic_with_eligibility_traces_binary_features$05bfd818-bf4e-4bda-baa9-5ba647867097isemptymean:AbstractVector|>InfzerosrandIntegerFunctionUInt64-plotfoldxt+MapLayoutRandom.seed!$96506201-6b66-49e6-8179-06952e2394e1precedence_heuristic 
cell_id$96506201-6b66-49e6-8179-06952e2394e1downstream_cells_mapsetup_binary_policy_arguments$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2$05bfd818-bf4e-4bda-baa9-5ba647867097upstream_cells_mapcopyRealFunctionlengthupdate_binary_feature_vector!$8eab55a5-41b7-4f5e-a02f-4c19388bc9eazerosBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41BinaryEligibilityVector$41dc149d-c6f3-4b0d-a856-06f3aaae3049IntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84$76b03e72-da04-4530-8534-6d6468268cbdprecedence_heuristic cell_id$76b03e72-da04-4530-8534-6d6468268cbddownstream_cells_mapupstream_cells_map@md_strgetindex$fd89433e-643c-474b-b3c4-a997678421a6precedence_heuristic cell_id$fd89433e-643c-474b-b3c4-a997678421a6downstream_cells_mapupstream_cells_map@md_strgetindex$87feff3e-e510-4916-91a9-db3a2cd12225precedence_heuristic cell_id$87feff3e-e510-4916-91a9-db3a2cd12225downstream_cells_map&fcann_continuing_cartpole_study_paramsupstream_cells_map@md_strCore:PlutoUI$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70|>Base.get@bindSliderBasePlutoRunnerPlutoRunner.create_bondNumberFieldconfirmCore.applicablePlutoUI.combinegetindex$5261651e-a51e-4e80-8e23-83a4c10e5259precedence_heuristic cell_id$5261651e-a51e-4e80-8e23-83a4c10e5259downstream_cells_map#update_gaussian_eligibility_vector!$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$d5020a8d-1dd7-403c-9d1f-665b95543943$20776e09-7d9b-4db8-a060-7bceeec65b47upstream_cells_mapzeroisless@inboundsonenothingVectorreinforce_test$24fa139c-ad4b-49db-ac8f-23c476ed8608runepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84$54fff14b-cf53-47b0-9cfa-8b9ee33df54eprecedence_heuristic 
cell_id$54fff14b-cf53-47b0-9cfa-8b9ee33df54edownstream_cells_mapBinaryBetaEligibilityVector$f55afa58-962d-4551-8d95-a5b467d61adf$d41f1dd1-45fe-4456-9a01-ed47fd6704a7$ed93259c-7b8b-46d7-97fb-f194e0e04b3a$056a8adc-92f4-4b33-90d9-4b3b4026bbbcupstream_cells_mapRealNBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41UnionNTupleonesoneVector$023f67b8-8f38-470a-9766-ac60a75678aaprecedence_heuristic cell_id$023f67b8-8f38-470a-9766-ac60a75678aadownstream_cells_mapmountaincar_fcann_setup$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9upstream_cells_mapfcann_feature_vector_setup$9acdbf38-2e10-45ec-85a0-d0db8453a599mountaincar_min_vals$2025ff38-f2ec-4224-b771-ff72ffe1af28mountaincar_max_vals$77906355-08f8-4b08-b051-84697199b519$1558cec1-c4fd-4bc0-85ed-ae22c6067d41precedence_heuristic cell_id$1558cec1-c4fd-4bc0-85ed-ae22c6067d41downstream_cells_mapupstream_cells_map@md_strgetindex$da8d0bca-105b-4d0b-a73d-ee5c9059aeafprecedence_heuristic cell_id$da8d0bca-105b-4d0b-a73d-ee5c9059aeafdownstream_cells_mapupstream_cells_map@md_strgetindex$3e7cecec-eb77-4862-8e3c-b510422e06dbprecedence_heuristic cell_id$3e7cecec-eb77-4862-8e3c-b510422e06dbdownstream_cells_mapupstream_cells_mapsquashed_gaussian_plot_params$94517664-6988-44dc-a297-e9d5873ee540plot_squashed_gaussian$00bd2835-b006-4244-9877-bc7e031e3ef8$0284f0d7-b8a9-4ae6-add0-ac1078571d9bprecedence_heuristic cell_id$0284f0d7-b8a9-4ae6-add0-ac1078571d9bdownstream_cells_mapupstream_cells_map@md_strgetindex$b94fc99c-f439-4df2-8da3-c01718a136c4precedence_heuristic cell_id$b94fc99c-f439-4df2-8da3-c01718a136c4downstream_cells_mapupstream_cells_map@md_strgetindex$b8532822-179b-4cd5-a279-4b71dafb544aprecedence_heuristic 
cell_id$b8532822-179b-4cd5-a279-4b71dafb544adownstream_cells_map!mountaincar_continuous_test_train$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580a$c87dba8c-9a96-41b3-9dc7-a6c088ec1eafupstream_cells_mapmountaincar_continuous_mdp$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2Int64mountaincar_tilecoding_setup$7c592385-e8d3-4efe-962c-d39debb64405Eactor_critic_with_eligibility_traces_binary_features_gaussian_actions$20776e09-7d9b-4db8-a060-7bceeec65b47typemax$07ba9fe4-aaa7-4123-9865-cbfa79d0d44aprecedence_heuristic cell_id$07ba9fe4-aaa7-4123-9865-cbfa79d0d44adownstream_cells_mapupstream_cells_mapreinforce_test4$407a0724-4bb6-4c83-ab2d-17a0e19c4072cartpole_setup$26880577-d267-4950-8725-7afe0d0402b6display_cartpole_episode$822e4d69-2582-4956-858e-06ecb091e76a|>runepisode$d963ff6d-f1b6-4799-aa0e-1ae100310d84$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7dprecedence_heuristic cell_id$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7ddownstream_cells_map.start_mountaincar_continuing_fcann_param_study$cb70d400-3e9c-441c-b17c-e727e8c928f3upstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunnerCounterButtonCore.applicable@bindBase.get$5c4a383f-fcf2-4f2b-819f-6d84471dda00precedence_heuristic cell_id$5c4a383f-fcf2-4f2b-819f-6d84471dda00downstream_cells_mapupdate_fcann_value_gradient!$f3e2db06-9cb7-464a-96b8-938175efd26bupstream_cells_map:AbstractVectorFCANN$d963ff6d-f1b6-4799-aa0e-1ae100310d84BoolFCANNParams@inboundsOutputIndex$d963ff6d-f1b6-4799-aa0e-1ae100310d84IntegerVectorInt64FCANNActivations$5c11a92d-7496-4aba-af15-2537eac49dd7eachindexFloat32*FCANN.nnCostFunction$135f205a-f87e-4691-8e87-d317d6312c84precedence_heuristic cell_id$135f205a-f87e-4691-8e87-d317d6312c84downstream_cells_mapupstream_cells_map@md_strgetindex$4a39f9a7-72d4-44ad-895a-742cd1291f92precedence_heuristic 
cell_id$4a39f9a7-72d4-44ad-895a-742cd1291f92downstream_cells_mapdist_plot_p$9cf3dc5f-8a25-479f-93db-06e34f0d37a0upstream_cells_mapCore:|>Base.get@bindSliderBasePlutoRunnerPlutoRunner.create_bondconfirmCore.applicable$ee72af8d-3cb8-4314-82df-580f068e1252precedence_heuristic cell_id$ee72af8d-3cb8-4314-82df-580f068e1252downstream_cells_mapupstream_cells_map@md_strgetindex$e524f8cc-ab69-4f8b-a59f-28156696a104precedence_heuristic cell_id$e524f8cc-ab69-4f8b-a59f-28156696a104downstream_cells_map8run_mountaincar_binary_episodic_countinuous_param_study2$0d93132d-5819-47dc-8cf2-462d480d9c3dupstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunnerCounterButtonCore.applicable@bindBase.get$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09precedence_heuristic cell_id$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09downstream_cells_mapupstream_cells_map$f3bc47b5-03fc-4bd9-a890-26f9608a730bprecedence_heuristic cell_id$f3bc47b5-03fc-4bd9-a890-26f9608a730bdownstream_cells_mapupstream_cells_map@md_strgetindex$4915b1ed-ad53-4ece-9b00-bc136d47d8dcprecedence_heuristic cell_id$4915b1ed-ad53-4ece-9b00-bc136d47d8dcdownstream_cells_mapupstream_cells_map@md_strgetindex$f924eb30-d1cc-4941-8fb5-ff70ad425ab9precedence_heuristic cell_id$f924eb30-d1cc-4941-8fb5-ff70ad425ab9downstream_cells_mapupstream_cells_map@md_strgetindex$d83dc659-dce7-41dd-a8e7-2933ab39d15cprecedence_heuristic cell_id$d83dc659-dce7-41dd-a8e7-2933ab39d15cdownstream_cells_mapupstream_cells_map@md_strgetindex$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fceprecedence_heuristic cell_id$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fcedownstream_cells_mapupstream_cells_mapdisplay_cartpole_episode$822e4d69-2582-4956-858e-06ecb091e76a&cartpole_fcann_continuing_test_episode$64b38d1f-ecf9-4843-89a1-4c8953048265$83ca0577-15d7-4448-b597-c77810b812bfprecedence_heuristic 
cell_id$83ca0577-15d7-4448-b597-c77810b812bfdownstream_cells_mapfigure_13_2_test$a7dcc8cd-04ec-48f2-a387-116330eaffb2upstream_cells_mapsqrtRandom$df7f84e8-b42a-4001-9dbf-6bc3ced94207corridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabescatter/fillround:|>Int64get_corridor_features$6bb0263e-368e-462a-948c-baf9cfa82512-log2foldxtplot+*-reinforce_monte_carlo_control_binary_features$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290Map;reinforce_with_baseline_monte_carlo_control_binary_features$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbLayoutRandom.seed!$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbprecedence_heuristic cell_id$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbdownstream_cells_map;reinforce_with_baseline_monte_carlo_control_binary_features$cbea5840-49d2-4e91-be9c-f5f15666d78a$83ca0577-15d7-4448-b597-c77810b812bf$e5c1aca8-7575-4835-8273-e69ca0a55fe8$d3b56fca-5b79-4465-8987-8d0005f854d8$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03upstream_cells_map,reinforce_with_baseline_monte_carlo_control!$4fb83451-b6f8-4e6e-a131-1accc8e10b08$5b868eba-c1af-49f6-8f93-79b78c319a6fbinary_value_function$a540814a-57a1-4b98-9443-59e401425444setup_binary_policy_arguments$96506201-6b66-49e6-8179-06952e2394e1zerosBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41IntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84VectorRealFunction!update_binary_eligibility_vector!$042fbafe-2401-4fb7-ac13-4531e0782c79lengthupdate_binary_value_gradient!$03a218cb-aa83-4000-85b5-c6f247087053Matrix!update_binary_action_preferences!$a361f4c9-47ce-42ad-899c-87b611c0d471$a7dcc8cd-04ec-48f2-a387-116330eaffb2precedence_heuristic cell_id$a7dcc8cd-04ec-48f2-a387-116330eaffb2downstream_cells_mapupstream_cells_map:figure_13_2_test$83ca0577-15d7-4448-b597-c77810b812bfvcat^$0ab70fc3-6188-42eb-aba2-d808f319be9fprecedence_heuristic cell_id$0ab70fc3-6188-42eb-aba2-d808f319be9fdownstream_cells_mapupstream_cells_map@md_strgetindex$047656d1-2921-40f2-b75b-ce4a87098007precedence_heuristic 
cell_id$047656d1-2921-40f2-b75b-ce4a87098007downstream_cells_mapupstream_cells_map@md_strgetindex$5d434c83-c9ca-499f-8695-c7733031c2deprecedence_heuristic cell_id$5d434c83-c9ca-499f-8695-c7733031c2dedownstream_cells_mapcartpole_continuing_step$4c4e643b-d4b9-44f0-8d30-dc521bcc55acupstream_cells_mapCartPoleStatecartpole_functions.step#cartpole_functions.initialize_statecartpole_functions.failure+cartpole_functions$f27f2bcd-05b6-44fe-bf9e-a3e51556db7cInteger$3a37b53d-9174-4faa-9404-74a40c385b0aprecedence_heuristic cell_id$3a37b53d-9174-4faa-9404-74a40c385b0adownstream_cells_mapupstream_cells_mapshow_mountaincar_trajectory$ba645f6b-143f-4e83-9003-707770ae308d!mountaincar_continuing_fcann_test$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9$820752af-8966-4ee8-82f7-a40934522de5precedence_heuristic cell_id$820752af-8966-4ee8-82f7-a40934522de5downstream_cells_mapupstream_cells_map$6acb549a-5d90-4457-a347-d22448ad8071precedence_heuristic cell_id$6acb549a-5d90-4457-a347-d22448ad8071downstream_cells_map-cartpole_fcann_continuing_episode_step_select$fad02876-efba-46a7-9cb7-43820528779f$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7efupstream_cells_mapCore:Base.get@bindSliderlengthBasePlutoRunnerPlutoRunner.create_bond&cartpole_fcann_continuing_test_episode$64b38d1f-ecf9-4843-89a1-4c8953048265Core.applicable$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62precedence_heuristic cell_id$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62downstream_cells_map)cartpole_fcann_continuing_parameter_study$50ae94c4-70f3-4215-82bd-eb2227c2badfupstream_cells_mapcartpole_vector_update!$192b9f82-8d3a-408f-91c2-829cfcd32572cartpole_continuing_mdp$4c4e643b-d4b9-44f0-8d30-dc521bcc55accartpole_fcann_feature_setup$61650a97-b353-4a85-b50b-93fee296ac7bfill"actor_critic_fcann_parameter_study$8bc280db-e57d-4e40-be46-1790f4f7d9e7$5aba4f96-e877-457e-8e95-18737348f99f$11063fff-4d36-46d5-828f-dbed0f46b9cfInteger$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728precedence_heuristic 
cell_id$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728downstream_cells_mapupstream_cells_mapInt64corridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabeget_corridor_features$6bb0263e-368e-462a-948c-baf9cfa82512^4actor_critic_with_eligibility_traces_binary_features$05bfd818-bf4e-4bda-baa9-5ba647867097typemax$ae0f5a96-7a4b-47f9-be1e-e803a238a071precedence_heuristic cell_id$ae0f5a96-7a4b-47f9-be1e-e803a238a071downstream_cells_mapupstream_cells_map@md_strgetindex$41d62de1-2c92-41ee-9430-b9ca3007afd9precedence_heuristic cell_id$41d62de1-2c92-41ee-9430-b9ca3007afd9downstream_cells_mapupstream_cells_map@md_strgetindex$8eb42403-1234-4e59-993e-057cc3a6d5c9precedence_heuristic cell_id$8eb42403-1234-4e59-993e-057cc3a6d5c9downstream_cells_mapupstream_cells_map@md_str<,actor_critic_binary_episodic_parameter_study$1f041cb3-618c-4380-a1ec-d7bbe4a80f62$d9d11d69-bc16-400a-8f46-f9a8ecb8516a>mountaincar_binary_params$4c5cb75e-79b5-4502-b1eb-6246e002feafislessmountaincar_tilecoding_setup$7c592385-e8d3-4efe-962c-d39debb64405+run_mountaincar_binary_episodic_param_study$192cc1cf-9ea1-492d-baa7-f2e197abecd4MountainCarTaskgetindex$bbc8864a-1545-433f-bc7c-0ddf6e907138precedence_heuristic cell_id$bbc8864a-1545-433f-bc7c-0ddf6e907138downstream_cells_mapplot_mountaincar_policy_values$dc2efc6c-8da8-425b-aa5f-290949109565upstream_cells_map:HypertextLiteral.BypassLinRangezerosHypertextLiteral.content@htl$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebFunctionenumerateFloat32plotHypertextLiteral$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70heatmapHypertextLiteral.ResultLayout$a12b92d1-e045-4f92-b8cd-eee5d56fa67dprecedence_heuristic 
cell_id$a12b92d1-e045-4f92-b8cd-eee5d56fa67ddownstream_cells_mapbest_mc_corridor$44b32cc0-36a8-41fd-89bc-ce894536926c$553b0ceb-f2ca-41ee-99bc-9f53a4487b49upstream_cells_mapcorridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabe;reinforce_with_baseline_monte_carlo_control_linear_features$d1ed25e6-60c6-411f-a541-99986e5da2c5^update_corridor_features!$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811$ce33f710-fd9d-4dfa-acda-40204e54d518precedence_heuristic cell_id$ce33f710-fd9d-4dfa-acda-40204e54d518downstream_cells_mapupstream_cells_map@md_strgetindex$339b4d2b-2237-46a3-9867-ecc3332856c1precedence_heuristic cell_id$339b4d2b-2237-46a3-9867-ecc3332856c1downstream_cells_mapupstream_cells_map@md_strgetindex$a8349352-3242-46d5-b0d5-1b6eb5d77e90precedence_heuristic cell_id$a8349352-3242-46d5-b0d5-1b6eb5d77e90downstream_cells_mapx$d4e87ac4-6008-43b2-aa06-e232ec2b2b5bupstream_cells_mapCoreBase:PlutoRunner.create_bondPlutoRunnerCore.applicable@bindBase.getSlider$7d63b960-3998-4f7b-8cbb-ccd49db9aeacprecedence_heuristic cell_id$7d63b960-3998-4f7b-8cbb-ccd49db9aeacdownstream_cells_mapupstream_cells_mapInt64%one_step_actor_critic_binary_features$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2corridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabeget_corridor_features$6bb0263e-368e-462a-948c-baf9cfa82512^typemax$65d2add6-fd6f-456c-92ed-3cd9d1862ef6precedence_heuristic cell_id$65d2add6-fd6f-456c-92ed-3cd9d1862ef6downstream_cells_mapupdate_binary_policy_params!upstream_cells_mapRealeachindex-BinaryFeatures$da2d3186-a778-41cc-9b49-759bf1e9b8faMatrix+@inbounds*IntegerVector$f55afa58-962d-4551-8d95-a5b467d61adfprecedence_heuristic 
cell_id$f55afa58-962d-4551-8d95-a5b467d61adfdownstream_cells_mapupdate_params_with_gradient!$4fb83451-b6f8-4e6e-a131-1accc8e10b08$4d4ae57b-afc3-44f9-b6fc-892f59f82921$266d2234-26c8-43f1-9e75-49440a230ed6$83640f5b-fe13-4ec1-98a0-67a56c189ba1$5b868eba-c1af-49f6-8f93-79b78c319a6f$b71145a4-2614-4f62-bfd2-7d5d1fecec56$4da20fd7-b897-4f26-bf2a-f08d66ddf90fupstream_cells_mapzero'BinarySquashedGaussianEligibilityVector$76fd79a2-2bc8-45f8-a243-48415118898aislessdigamma@inboundsonenothingVectorisless+BinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41Function$0ac7ea44-14f6-4e80-80f9-d6df8059bb38precedence_heuristic cell_id$0ac7ea44-14f6-4e80-80f9-d6df8059bb38downstream_cells_mapreinforce_monte_carlo_control!$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290$8e39bd15-862e-4941-88f9-2794b861a523$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091upstream_cells_mapRealnothingFunction,reinforce_with_baseline_monte_carlo_control!$4fb83451-b6f8-4e6e-a131-1accc8e10b08$5b868eba-c1af-49f6-8f93-79b78c319a6fReturnszero/oneIntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84$5ffc271f-c73f-494a-9727-8d7516af2191precedence_heuristic cell_id$5ffc271f-c73f-494a-9727-8d7516af2191downstream_cells_map&cartpole_continuing_fcann_study_params$50ae94c4-70f3-4215-82bd-eb2227c2badfupstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunner(create_actor_critic_continuing_params_UI$5b15d91e-7119-4f85-a54a-7d4f1fdaf097Core.applicable@bindBase.get$c5a2879c-e89b-47f7-bbd6-48200d7e89e3precedence_heuristic cell_id$c5a2879c-e89b-47f7-bbd6-48200d7e89e3downstream_cells_map>actor_critic_binary_episodic_squashed_gaussian_parameter_study$0d93132d-5819-47dc-8cf2-462d480d9c3dupstream_cells_mapRealFunctiononeIntegerContinuousMDP$537270ba-122b-4f2b-880b-31d086766295$537270ba-122b-4f2b-880b-31d086766295precedence_heuristic 
cell_id$537270ba-122b-4f2b-880b-31d086766295downstream_cells_mapContinuousMDP$f946c886-6246-4f98-a96f-f06984691ad8$5b868eba-c1af-49f6-8f93-79b78c319a6f$ba5d6311-daee-4abc-b2fb-fae2184ef3eb$ed93259c-7b8b-46d7-97fb-f194e0e04b3a$4e29c621-223e-4859-8e96-db04b967815a$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$d5020a8d-1dd7-403c-9d1f-665b95543943$b71145a4-2614-4f62-bfd2-7d5d1fecec56$4da20fd7-b897-4f26-bf2a-f08d66ddf90f$20776e09-7d9b-4db8-a060-7bceeec65b47$3e3c5897-809f-46e3-bb58-f115b082443e$05f120be-9695-4824-82fd-142a0df13098$717e4c69-59d5-4929-923f-dd35a97fb160$55ba8725-0ddf-4196-a41d-3f3c490a8d84$61949faa-8174-4b7b-8fbc-01d5f850b419$dd8e8cd2-7b41-46c4-8530-adefb7aea684$08505e88-9c23-4e95-91e3-d18bf5133dbc$87482ea5-5265-4e02-92c0-1a8bb44ff0f4$13ebc12f-ff6f-4266-88d3-28d6df5fcf59$3d065608-eef2-4caa-b17d-ec60714e3d58$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19f$c5a2879c-e89b-47f7-bbd6-48200d7e89e3$65be0e58-24be-4932-92a9-9e4825b14144$3c316495-bb6c-41e2-a38f-ba867a319fbb$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbf$d2729657-d0bf-4d39-8ec7-f242a1ad48d6upstream_cells_mapFunctionnewContinuousMDPTransitionSampler$c8b47eac-2d45-419a-bec6-2ae0cdc59393ReturnsAbstractContinuousTransition$c8b47eac-2d45-419a-bec6-2ae0cdc59393AbstractMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84FReal$dc2efc6c-8da8-425b-aa5f-290949109565precedence_heuristic cell_id$dc2efc6c-8da8-425b-aa5f-290949109565downstream_cells_mapupstream_cells_mapplot_mountaincar_policy_values$bbc8864a-1545-433f-bc7c-0ddf6e907138mountaincar_test_train$6d0925d3-af96-4b94-8e2e-4941cce39e51$a019925a-460a-410e-a54b-50a4cfe0e90eprecedence_heuristic cell_id$a019925a-460a-410e-a54b-50a4cfe0e90edownstream_cells_mapupstream_cells_map-scatterplotLinRangeLayoutget_corridor_episode_stats$fb8904a9-ae64-41cc-93b6-5a25855edad0$cecc2a35-3850-4f66-84e8-e29da4f3d4b0$f92bb265-4b19-4f0e-a698-d7547bb6dd41precedence_heuristic 
cell_id$f92bb265-4b19-4f0e-a698-d7547bb6dd41downstream_cells_mapBinaryFeatureVector$b0a66a19-ee76-463b-a704-8fcee85444d0$8eab55a5-41b7-4f5e-a02f-4c19388bc9ea$a361f4c9-47ce-42ad-899c-87b611c0d471$41dc149d-c6f3-4b0d-a856-06f3aaae3049$042fbafe-2401-4fb7-ac13-4531e0782c79$96506201-6b66-49e6-8179-06952e2394e1$03a218cb-aa83-4000-85b5-c6f247087053$a893a87b-2d07-4db5-9d1a-9da8646216f4$a540814a-57a1-4b98-9443-59e401425444$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2$25be5dcf-be63-46c4-b6de-6cf79fa28fd0$05bfd818-bf4e-4bda-baa9-5ba647867097$10cdd16e-a337-4421-a7a0-6de4e4b60c0f$54fff14b-cf53-47b0-9cfa-8b9ee33df54e$76fd79a2-2bc8-45f8-a243-48415118898a$f55afa58-962d-4551-8d95-a5b467d61adf$740a3f41-9302-481d-b373-762c0dea8eff$d41f1dd1-45fe-4456-9a01-ed47fd6704a7$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690c$ba5d6311-daee-4abc-b2fb-fae2184ef3eb$ed93259c-7b8b-46d7-97fb-f194e0e04b3a$4e29c621-223e-4859-8e96-db04b967815a$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$056a8adc-92f4-4b33-90d9-4b3b4026bbbc$20776e09-7d9b-4db8-a060-7bceeec65b47$3e3c5897-809f-46e3-bb58-f115b082443e$05f120be-9695-4824-82fd-142a0df13098upstream_cells_mapInt64newIntegerVector$ac9c8845-284d-4c21-b05d-d930f86598a3precedence_heuristic cell_id$ac9c8845-284d-4c21-b05d-d930f86598a3downstream_cells_map7run_mountaincar_binary_episodic_countinuous_param_study$b53dba81-a9e9-41da-8fc2-7736bf25f2dcupstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunnerCounterButtonCore.applicable@bindBase.get$192cc1cf-9ea1-492d-baa7-f2e197abecd4precedence_heuristic cell_id$192cc1cf-9ea1-492d-baa7-f2e197abecd4downstream_cells_map+run_mountaincar_binary_episodic_param_study$8eb42403-1234-4e59-993e-057cc3a6d5c9upstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunnerCounterButtonCore.applicable@bindBase.get$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547dprecedence_heuristic 
cell_id$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547ddownstream_cells_mapep_step$374af774-3a97-49b5-a3bb-bc3f7f63a3fa$af144759-fe66-4ad0-b378-e9eb4e859db4upstream_cells_mapCore:Base.get@bindSliderlengthep$e1274f57-75cb-4659-a82f-e5870c5367e2BasePlutoRunnerPlutoRunner.create_bondCore.applicable$c8b47eac-2d45-419a-bec6-2ae0cdc59393precedence_heuristic cell_id$c8b47eac-2d45-419a-bec6-2ae0cdc59393downstream_cells_mapContinuousMDPTransitionSampler$c8b47eac-2d45-419a-bec6-2ae0cdc59393$537270ba-122b-4f2b-880b-31d086766295$3c316495-bb6c-41e2-a38f-ba867a319fbbAbstractContinuousTransition$c8b47eac-2d45-419a-bec6-2ae0cdc59393$537270ba-122b-4f2b-880b-31d086766295upstream_cells_mapAbstractTransition$d963ff6d-f1b6-4799-aa0e-1ae100310d84Main.Base.inferencebarrier@assertAbstractContinuousTransition$c8b47eac-2d45-419a-bec6-2ae0cdc59393typeofpromote_typeAnyFunctionRealMainnewContinuousMDPTransitionSampler$c8b47eac-2d45-419a-bec6-2ae0cdc59393throwAssertionError!===$36a6e43f-6bcf-4c27-bfbb-047760e77adaprecedence_heuristic cell_id$36a6e43f-6bcf-4c27-bfbb-047760e77adadownstream_cells_mapupstream_cells_map@md_strgetindex$436c52d2-280b-4ca4-9360-d6587b8254c7precedence_heuristic cell_id$436c52d2-280b-4ca4-9360-d6587b8254c7downstream_cells_mapupstream_cells_map@md_strgetindex$e96d592d-1e54-486d-8ad9-b857f85476e8precedence_heuristic cell_id$e96d592d-1e54-486d-8ad9-b857f85476e8downstream_cells_map#actor_critic_linear_parameter_study$7afb6fb0-248a-4518-b94f-9876f81eca64$1b102220-6d78-480d-a77f-0e57bad23dca$d57375a5-b9e0-4742-b5f7-6a7da891604aupstream_cells_map@NamedTuple:IntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84RealInt64FunctionBase-^+$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6cprecedence_heuristic cell_id$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6cdownstream_cells_mapupstream_cells_map:corridor_parameter_studies$e5c1aca8-7575-4835-8273-e69ca0a55fe8$646bc853-b7fc-49fa-a201-ff98e8f952d4^$4da20fd7-b897-4f26-bf2a-f08d66ddf90fprecedence_heuristic 
cell_id$4da20fd7-b897-4f26-bf2a-f08d66ddf90fdownstream_cells_map%actor_critic_with_eligibility_traces!$05bfd818-bf4e-4bda-baa9-5ba647867097$68806899-9972-460a-9f11-daa708a9d610$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54$20776e09-7d9b-4db8-a060-7bceeec65b47$3e3c5897-809f-46e3-bb58-f115b082443e$05f120be-9695-4824-82fd-142a0df13098upstream_cells_map"zerotypeminzero_params!$e6cf9550-2e69-4b82-92cf-5e07a35490aaupdate_traces_with_gradient!$25be5dcf-be63-46c4-b6de-6cf79fa28fd0$056a8adc-92f4-4b33-90d9-4b3b4026bbbconeContinuousMDP$537270ba-122b-4f2b-880b-31d086766295Base.CoreLogging.!VectorReal'Base.CoreLogging.Base.fixup_stdlib_pathepisode_stepsdeepcopy/@infoBase.invokelatestupdate_params_with_gradient!$b0a66a19-ee76-463b-a704-8fcee85444d0$a893a87b-2d07-4db5-9d1a-9da8646216f4$f55afa58-962d-4551-8d95-a5b467d61adfBase.CoreLogging.invokelatestBase.CoreLogging.===error&form_state_and_policy_function_outputs$e7e49ff8-32df-48a4-afb2-462859592e92$11b9beea-b0cd-45eb-84c6-151728894df0#___this_pluto_module_nameIntegerFunctionBase<=γpush!Base.CoreLogging.isa-bad_continuous_action$b966b248-fb4d-457d-90f6-114370846242+*Base.CoreLogging.>=episode_rewards$11ea640c-3981-404d-87c6-4d3d0708a2b8precedence_heuristic cell_id$11ea640c-3981-404d-87c6-4d3d0708a2b8downstream_cells_map,actor_critic_linear_episodic_parameter_studyupstream_cells_mapsumRandom$df7f84e8-b42a-4001-9dbf-6bc3ced94207StateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84lengthRealscatter/isempty4actor_critic_with_eligibility_traces_linear_features$68806899-9972-460a-9f11-daa708a9d610:AbstractVector|>InfrandIntegerFunctionUInt64-plotfoldxt+MapLayoutRandom.seed!$281360af-46bf-4c73-bf11-3cb1153ad3e2precedence_heuristic cell_id$281360af-46bf-4c73-bf11-3cb1153ad3e2downstream_cells_mapupstream_cells_map$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690cprecedence_heuristic 
cell_id$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690cdownstream_cells_map,update_squashed_gaussian_eligibility_vector!$05f120be-9695-4824-82fd-142a0df13098upstream_cells_mapexp:'BinarySquashedGaussianEligibilityVector$76fd79a2-2bc8-45f8-a243-48415118898akfirstBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41VectorRealMatrix+NTuplelast$da3cb392-78f2-48b2-b0dc-5f016664798cprecedence_heuristic cell_id$da3cb392-78f2-48b2-b0dc-5f016664798cdownstream_cells_mapupstream_cells_map mountaincar_continuing_tile_test$b02ba928-5b9f-4695-b980-07988c788bb9show_mountaincar_trajectory$ba645f6b-143f-4e83-9003-707770ae308d$dca2f8e2-76af-4679-bf81-3824c15fc76dprecedence_heuristic cell_id$dca2f8e2-76af-4679-bf81-3824c15fc76ddownstream_cells_mapreinforce_test3$11a55af7-5301-4507-bb26-88e1e11236dbupstream_cells_mapInt64cartpole_setup$26880577-d267-4950-8725-7afe0d0402b6^4actor_critic_with_eligibility_traces_binary_features$05bfd818-bf4e-4bda-baa9-5ba647867097typemax$8019bec9-1228-407b-9199-2fe29f26a981precedence_heuristic cell_id$8019bec9-1228-407b-9199-2fe29f26a981downstream_cells_mapupstream_cells_map@md_strgetindex$fd964539-2baf-4ff1-b286-5a0bb1b222c4precedence_heuristic cell_id$fd964539-2baf-4ff1-b286-5a0bb1b222c4downstream_cells_mapupstream_cells_map@md_strgetindex$5720e942-d3f8-4329-83a8-8bcedf078b6aprecedence_heuristic cell_id$5720e942-d3f8-4329-83a8-8bcedf078b6adownstream_cells_mapupstream_cells_mapcorridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabe-reinforce_monte_carlo_control_linear_features$8e39bd15-862e-4941-88f9-2794b861a523^update_corridor_features!$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811$62e677ac-2070-4f6b-9df2-90849d89fa9fprecedence_heuristic cell_id$62e677ac-2070-4f6b-9df2-90849d89fa9fdownstream_cells_mapcorridor_terminal_probabilitiesupstream_cells_mapcorridor_state_counts$54f559b6-8a62-4a42-894d-c56e41d5ebefsum-$11b9beea-b0cd-45eb-84c6-151728894df0precedence_heuristic 
cell_id$11b9beea-b0cd-45eb-84c6-151728894df0downstream_cells_map&form_state_and_policy_function_outputs$4d4ae57b-afc3-44f9-b6fc-892f59f82921$266d2234-26c8-43f1-9e75-49440a230ed6$83640f5b-fe13-4ec1-98a0-67a56c189ba1$b71145a4-2614-4f62-bfd2-7d5d1fecec56$4da20fd7-b897-4f26-bf2a-f08d66ddf90fupstream_cells_mapVector%form_state_continuous_policy_function$f545c800-0bf3-491f-9d7d-42341cfdb573Realdeepcopyform_state_value_function$e7566274-5518-4e28-8738-d4b1747d0cfbFunctioncopy$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290precedence_heuristic cell_id$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290downstream_cells_map-reinforce_monte_carlo_control_binary_features$d037ea92-915c-4dc7-97c6-d006d92e088a$f2ed56c9-c2b7-42cb-a083-e12aeaa126ef$83ca0577-15d7-4448-b597-c77810b812bf$e5c1aca8-7575-4835-8273-e69ca0a55fe8upstream_cells_mapsetup_binary_policy_arguments$96506201-6b66-49e6-8179-06952e2394e1zerosIntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84RealFunctionlength!update_binary_eligibility_vector!$042fbafe-2401-4fb7-ac13-4531e0782c79Matrix!update_binary_action_preferences!$a361f4c9-47ce-42ad-899c-87b611c0d471reinforce_monte_carlo_control!$0ac7ea44-14f6-4e80-80f9-d6df8059bb38$55ba8725-0ddf-4196-a41d-3f3c490a8d84precedence_heuristic cell_id$55ba8725-0ddf-4196-a41d-3f3c490a8d84downstream_cells_map5actor_critic_binary_episodic_gaussian_parameter_study$b53dba81-a9e9-41da-8fc2-7736bf25f2dcupstream_cells_mapRandom$df7f84e8-b42a-4001-9dbf-6bc3ced94207ContinuousMDP$537270ba-122b-4f2b-880b-31d086766295copyVectorRealscatter/Matrixisemptymean:AbstractVector|>Infmake_n_param_dist_policy_params$ba41f521-4ee2-42a6-bf18-078bfa4b875ezerosrandIntegerFunctionUInt64-plotfoldxt+MapEactor_critic_with_eligibility_traces_binary_features_gaussian_actions$20776e09-7d9b-4db8-a060-7bceeec65b47LayoutRandom.seed!$a540814a-57a1-4b98-9443-59e401425444precedence_heuristic 
cell_id$a540814a-57a1-4b98-9443-59e401425444downstream_cells_mapbinary_value_function$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2$05bfd818-bf4e-4bda-baa9-5ba647867097$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$20776e09-7d9b-4db8-a060-7bceeec65b47$3e3c5897-809f-46e3-bb58-f115b082443e$05f120be-9695-4824-82fd-142a0df13098upstream_cells_mapzero:islessBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41julia.simdloop@inboundsnothingVectormake_n_param_dist_policy_params$ba41f521-4ee2-42a6-bf18-078bfa4b875ezerosrandIntegerFunctionUInt64plotfoldxt+MapEactor_critic_with_eligibility_traces_binary_features_gaussian_actions$20776e09-7d9b-4db8-a060-7bceeec65b47LayoutRandom.seed!$5b15f5c9-80bf-47f0-898a-f8dead5b927cprecedence_heuristic cell_id$5b15f5c9-80bf-47f0-898a-f8dead5b927cdownstream_cells_mapupstream_cells_map@md_strgetindex$266d2234-26c8-43f1-9e75-49440a230ed6precedence_heuristic cell_id$266d2234-26c8-43f1-9e75-49440a230ed6downstream_cells_map%actor_critic_with_eligibility_traces!$05bfd818-bf4e-4bda-baa9-5ba647867097$68806899-9972-460a-9f11-daa708a9d610$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54$20776e09-7d9b-4db8-a060-7bceeec65b47$3e3c5897-809f-46e3-bb58-f115b082443e$05f120be-9695-4824-82fd-142a0df13098upstream_cells_mapzerozero_params!$e6cf9550-2e69-4b82-92cf-5e07a35490aaupdate_traces_with_gradient!$25be5dcf-be63-46c4-b6de-6cf79fa28fd0$056a8adc-92f4-4b33-90d9-4b3b4026bbbconeStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84lengthsample_action$d963ff6d-f1b6-4799-aa0e-1ae100310d84VectorRealdeepcopy/update_params_with_gradient!$b0a66a19-ee76-463b-a704-8fcee85444d0$a893a87b-2d07-4db5-9d1a-9da8646216f4$f55afa58-962d-4551-8d95-a5b467d61adfzerossoft_max!$33c99850-67cd-4754-94b9-6df97b238e27&form_state_and_policy_function_outputs$e7e49ff8-32df-48a4-afb2-462859592e92$11b9beea-b0cd-45eb-84c6-151728894df0IntegerFunction<=Int64push!-+*$aa69e4ea-91e0-496a-a7be-529e67f4dbecprecedence_heuristic 
cell_id$aa69e4ea-91e0-496a-a7be-529e67f4dbecdownstream_cells_mapupstream_cells_map1reinforce_with_baseline_monte_carlo_control_fcann$697b2310-9d96-4f7f-be62-c3bd6bf736f3corridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabe^update_corridor_features!$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811$10ee7709-0816-48d2-abe0-9be3dd04700fprecedence_heuristic cell_id$10ee7709-0816-48d2-abe0-9be3dd04700fdownstream_cells_mapupstream_cells_mapplot_continuing_step_rewards$0964133c-3a5b-433b-a8c4-a97813c37583!mountaincar_continuing_fcann_test$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9$7d94922e-dc9f-4953-b539-24aaa2c85b12precedence_heuristic cell_id$7d94922e-dc9f-4953-b539-24aaa2c85b12downstream_cells_mapcontinuing_study_params$42775fd1-5b27-48e0-abf1-9b22bb775e6dupstream_cells_mapCoreBasePlutoRunner.create_bondPlutoRunner(create_actor_critic_continuing_params_UI$5b15d91e-7119-4f85-a54a-7d4f1fdaf097Core.applicable@bindBase.get$df7f84e8-b42a-4001-9dbf-6bc3ced94207precedence_heuristiccell_id$df7f84e8-b42a-4001-9dbf-6bc3ced94207downstream_cells_mapStatisticsStaticArraysStatsBaseTransducersLinearAlgebraDistributionsPlutoDevMacrosRandom$d037ea92-915c-4dc7-97c6-d006d92e088a$83ca0577-15d7-4448-b597-c77810b812bf$e5c1aca8-7575-4835-8273-e69ca0a55fe8$1f041cb3-618c-4380-a1ec-d7bbe4a80f62$11ea640c-3981-404d-87c6-4d3d0708a2b8$f8614042-7c94-4d47-a1b6-4e96676b4e8b$ba642a22-6623-482a-ab4a-81585b83e457$8bc280db-e57d-4e40-be46-1790f4f7d9e7$11063fff-4d36-46d5-828f-dbed0f46b9cf$55ba8725-0ddf-4196-a41d-3f3c490a8d84$61949faa-8174-4b7b-8fbc-01d5f850b419$dd8e8cd2-7b41-46c4-8530-adefb7aea684$08505e88-9c23-4e95-91e3-d18bf5133dbc$87482ea5-5265-4e02-92c0-1a8bb44ff0f4Threadsupstream_cells_map$352d2952-cb83-47d3-9078-2b2ef9927443precedence_heuristic 
cell_id$352d2952-cb83-47d3-9078-2b2ef9927443downstream_cells_mapcreate_cartpole_functions$f27f2bcd-05b6-44fe-bf9e-a3e51556db7cupstream_cells_mapCartPoleStatezerodeg2rad>islessrandcartpole_runge_kutta_stepRealFunction<-Float32clampCartPoleVehicleabs$0964133c-3a5b-433b-a8c4-a97813c37583precedence_heuristic cell_id$0964133c-3a5b-433b-a8c4-a97813c37583downstream_cells_mapplot_continuing_step_rewards$645e93e7-e92e-49c4-9757-8294fabf4e9b$04b5929a-2058-49c9-963a-96c752a1d67d$98222fcd-b456-477c-90dd-844df36877e5$10ee7709-0816-48d2-abe0-9be3dd04700fupstream_cells_map:LinRangecumsumVectorRealInt64lengthscatterplot/Layoutround$349631b2-4686-49a9-9f3a-1e4ad588b568precedence_heuristic cell_id$349631b2-4686-49a9-9f3a-1e4ad588b568downstream_cells_mapmountaincar_continuous_mdp2$fee14dfe-c5ca-4126-a830-cc9d7eda5433$cd9c9eeb-c90d-4499-9503-7773d5250f47upstream_cells_map$create_continuous_action_mountaincar$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbf$8544eddb-2095-4a3c-82e0-920123a88e6dprecedence_heuristic cell_id$8544eddb-2095-4a3c-82e0-920123a88e6ddownstream_cells_mapupstream_cells_map@md_strgetindex$31f7e903-30b6-4193-9174-88093e004de4precedence_heuristic cell_id$31f7e903-30b6-4193-9174-88093e004de4downstream_cells_mapupstream_cells_map@md_strgetindex$fee14dfe-c5ca-4126-a830-cc9d7eda5433precedence_heuristic cell_id$fee14dfe-c5ca-4126-a830-cc9d7eda5433downstream_cells_map"mountaincar_continuous_test_train2$cd9c9eeb-c90d-4499-9503-7773d5250f47$b695ef21-a1ac-4d1f-a0e1-71cd81cede18upstream_cells_mapInt64mountaincar_continuous_mdp2$349631b2-4686-49a9-9f3a-1e4ad588b568mountaincar_tilecoding_setup$7c592385-e8d3-4efe-962c-d39debb64405Eactor_critic_with_eligibility_traces_binary_features_gaussian_actions$20776e09-7d9b-4db8-a060-7bceeec65b47typemax$b53dba81-a9e9-41da-8fc2-7736bf25f2dcprecedence_heuristic 
cell_id$b53dba81-a9e9-41da-8fc2-7736bf25f2dcdownstream_cells_mapupstream_cells_mapmountaincar_continuous_mdp$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2@md_str$mountaincar_binary_continuous_params$71a5fce8-6d9a-4625-bad1-a951d61bff28<>isless7run_mountaincar_binary_episodic_countinuous_param_study$ac9c8845-284d-4c21-b05d-d930f86598a3mountaincar_tilecoding_setup$7c592385-e8d3-4efe-962c-d39debb644055actor_critic_binary_episodic_gaussian_parameter_study$55ba8725-0ddf-4196-a41d-3f3c490a8d84$13ebc12f-ff6f-4266-88d3-28d6df5fcf59getindex$beb01fb8-c77d-4b5c-a66d-3812415e04a3precedence_heuristic cell_id$beb01fb8-c77d-4b5c-a66d-3812415e04a3downstream_cells_mapupstream_cells_map@md_strgetindex$8bc280db-e57d-4e40-be46-1790f4f7d9e7precedence_heuristic cell_id$8bc280db-e57d-4e40-be46-1790f4f7d9e7downstream_cells_map"actor_critic_fcann_parameter_study$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62$c251a630-7114-4188-9323-8d8feb5c32e0upstream_cells_mapAbstractVector*actor_critic_with_eligibility_traces_fcann$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54FCANN$d963ff6d-f1b6-4799-aa0e-1ae100310d84Random$df7f84e8-b42a-4001-9dbf-6bc3ced94207randIntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84FCANN.initializeparams_saxeInt64VectorRealFunctionscatterplotUInt64lengthaverage_continuing_runs$ba642a22-6623-482a-ab4a-81585b83e457LayoutRandom.seed!$89901156-b874-416b-89c1-6dc434a4eb17precedence_heuristic cell_id$89901156-b874-416b-89c1-6dc434a4eb17downstream_cells_mapupstream_cells_map@md_strgetindex$ff76ef94-fdf5-41f3-a31a-21c4629efabeprecedence_heuristic 
cell_id$ff76ef94-fdf5-41f3-a31a-21c4629efabedownstream_cells_mapcorridor_mdp$f7433324-acc3-49a5-b5b3-ada0c8f09d52$fb8904a9-ae64-41cc-93b6-5a25855edad0$cecc2a35-3850-4f66-84e8-e29da4f3d4b0$f2f2dd1d-180c-4d36-b515-5079d129f93a$e1493cea-19c4-475d-98a0-86d27fb04af1$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72c$0c9986bb-54c0-4b08-9c29-4bfb0b68b54e$d037ea92-915c-4dc7-97c6-d006d92e088a$f2ed56c9-c2b7-42cb-a083-e12aeaa126ef$cbea5840-49d2-4e91-be9c-f5f15666d78a$5720e942-d3f8-4329-83a8-8bcedf078b6a$cacaaca6-6e01-464f-a2ee-cbf62737a426$07ad517a-c2ac-4377-99fb-adb13d0f1d0c$aa69e4ea-91e0-496a-a7be-529e67f4dbec$83ca0577-15d7-4448-b597-c77810b812bf$a12b92d1-e045-4f92-b8cd-eee5d56fa67d$e5c1aca8-7575-4835-8273-e69ca0a55fe8$7d63b960-3998-4f7b-8cbb-ccd49db9aeac$9db9ff71-bee9-4bea-a45b-748f8517fed1$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0$646bc853-b7fc-49fa-a201-ff98e8f952d4$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728$396e0047-d848-462f-a769-0cc2829abc78$bc8a399b-8864-4473-89d2-e3b0a03d15b5$72273f27-d0b9-4645-a609-cb65cc9332eeupstream_cells_mapmake_corridor_mdp$5cc4d12d-b537-47e2-8109-4e7a234fdf25$581f7e9b-a5c2-4841-9605-85f9585b0274precedence_heuristic cell_id$581f7e9b-a5c2-4841-9605-85f9585b0274downstream_cells_map!update_linear_action_preferences!$4634267b-5dea-4164-8bb2-1eb2fd4d7954$8e39bd15-862e-4941-88f9-2794b861a523$d1ed25e6-60c6-411f-a541-99986e5da2c5$57e5e12a-b722-4ea3-ab3b-e5711029e640$68806899-9972-460a-9f11-daa708a9d610$d5020a8d-1dd7-403c-9d1f-665b95543943upstream_cells_mapzeroBLASBLAS.gemv!MatrixoneAbstractFloatVector$8aa16866-bfda-48df-9cf1-cf3d2e203ccbprecedence_heuristic cell_id$8aa16866-bfda-48df-9cf1-cf3d2e203ccbdownstream_cells_map8cartpole_tilecoding_reinforce_continuous_parameter_studyupstream_cells_mapcartpole_setup$26880577-d267-4950-8725-7afe0d0402b6:|>setup_cartpole_problemscatterplot/Lreinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00+foldxtisemptyMapmeanLayout$04b5929a-2058-49c9-963a-96c752a1d67dprecedence_heuristic 
cell_id$04b5929a-2058-49c9-963a-96c752a1d67ddownstream_cells_mapupstream_cells_mapplot_continuing_step_rewards$0964133c-3a5b-433b-a8c4-a97813c37583cartpole_continuing_fcann_test$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27$f0104778-81a6-417b-8501-f916e5e7f3afprecedence_heuristic cell_id$f0104778-81a6-417b-8501-f916e5e7f3afdownstream_cells_mapmake_corridor_continuing_mdp$1ac9296f-047b-4051-ba5c-0c23d5f9cde9upstream_cells_mapifelseIntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84Returns-StateMDPTransitionSampler$d963ff6d-f1b6-4799-aa0e-1ae100310d84Float32+*iseven==$3e3c5897-809f-46e3-bb58-f115b082443eprecedence_heuristic cell_id$3e3c5897-809f-46e3-bb58-f115b082443edownstream_cells_mapAactor_critic_with_eligibility_traces_binary_features_beta_actions$dd8e8cd2-7b41-46c4-8530-adefb7aea684$4156d955-9daf-4429-b152-e8332980fb9eupstream_cells_mapupdate_beta_eligibility_vector!$bfe7e41d-6318-4bd4-b892-287831876abc$d41f1dd1-45fe-4456-9a01-ed47fd6704a7binary_value_function$a540814a-57a1-4b98-9443-59e401425444make_beta_sampler$b2082ab0-73a4-45a6-8772-a2e6e22b519amake_n_param_dist_policy_params$ba41f521-4ee2-42a6-bf18-078bfa4b875ezerosBinaryFeatureVector$f92bb265-4b19-4f0e-a698-d7547bb6dd41randIntegerVectorContinuousMDP$537270ba-122b-4f2b-880b-31d086766295RealFunctionupdate_binary_value_gradient!$03a218cb-aa83-4000-85b5-c6f247087053"setup_binary_beta_policy_arguments$ed93259c-7b8b-46d7-97fb-f194e0e04b3aNTupleUnionMatrix!update_binary_action_preferences!$a361f4c9-47ce-42ad-899c-87b611c0d471%actor_critic_with_eligibility_traces!$266d2234-26c8-43f1-9e75-49440a230ed6$83640f5b-fe13-4ec1-98a0-67a56c189ba1$b71145a4-2614-4f62-bfd2-7d5d1fecec56$4da20fd7-b897-4f26-bf2a-f08d66ddf90f$a9db3f85-ff56-4bbc-be87-47b893ef3b7bprecedence_heuristic cell_id$a9db3f85-ff56-4bbc-be87-47b893ef3b7bdownstream_cells_mapmountaincar_continuing_step$00152954-dc98-4120-b94b-2ea4d987832bupstream_cells_mapMountainCarTask.step 
MountainCarTask.initialize_state==IntegerMountainCarTask$08505e88-9c23-4e95-91e3-d18bf5133dbcprecedence_heuristic cell_id$08505e88-9c23-4e95-91e3-d18bf5133dbcdownstream_cells_map>actor_critic_binary_episodic_squashed_gaussian_parameter_study$0d93132d-5819-47dc-8cf2-462d480d9c3dupstream_cells_mapRandom$df7f84e8-b42a-4001-9dbf-6bc3ced94207ContinuousMDP$537270ba-122b-4f2b-880b-31d086766295copyVectorRealscatter/MatrixisemptymeanNactor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions$05f120be-9695-4824-82fd-142a0df13098$717e4c69-59d5-4929-923f-dd35a97fb160:AbstractVector|>Infmake_n_param_dist_policy_params$ba41f521-4ee2-42a6-bf18-078bfa4b875ezerosrandIntegerFunctionUInt64-plotfoldxt+MapLayoutRandom.seed!$ad0009af-2cfc-4820-bd4a-698ad391f459precedence_heuristic cell_id$ad0009af-2cfc-4820-bd4a-698ad391f459downstream_cells_mapupstream_cells_mapbeta_params$7bf209c8-ef0a-46d1-937e-b1a6e45dc62escatterplotLinRangemake_beta_dist$0b01ba67-3921-4f3f-a7e8-235190bc84eb$16fcc2d0-9f2f-4226-9dcc-6d86248cab26precedence_heuristic cell_id$16fcc2d0-9f2f-4226-9dcc-6d86248cab26downstream_cells_mapplot_state_distributions$9cf3dc5f-8a25-479f-93db-06e34f0d37a0upstream_cells_map:sumvcatHypertextLiteral.BypassHypertextLiteral.contentsizeadjoint@htl$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebbar-plot/HypertextLiteral$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70attrheatmapHypertextLiteral.ResultLayoutcollect_state_distributions$0c9986bb-54c0-4b08-9c29-4bfb0b68b54econj$11063fff-4d36-46d5-828f-dbed0f46b9cfprecedence_heuristic 
cell_id$11063fff-4d36-46d5-828f-dbed0f46b9cfdownstream_cells_map"actor_critic_fcann_parameter_study$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62$c251a630-7114-4188-9323-8d8feb5c32e0upstream_cells_map:AbstractVector*actor_critic_with_eligibility_traces_fcann$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54FCANN$d963ff6d-f1b6-4799-aa0e-1ae100310d84Random$df7f84e8-b42a-4001-9dbf-6bc3ced94207randIntegerStateMDP$d963ff6d-f1b6-4799-aa0e-1ae100310d84FCANN.initializeparams_saxeInt64VectorRealFunctionUInt64lengthaverage_continuing_runs$ba642a22-6623-482a-ab4a-81585b83e457DataFrameRandom.seed!$8fcdca63-01a0-4d4b-933c-06a7621d980aprecedence_heuristic cell_id$8fcdca63-01a0-4d4b-933c-06a7621d980adownstream_cells_mapupstream_cells_map$33c99850-67cd-4754-94b9-6df97b238e27precedence_heuristic cell_id$33c99850-67cd-4754-94b9-6df97b238e27downstream_cells_mapsoft_max!$4634267b-5dea-4164-8bb2-1eb2fd4d7954$042fbafe-2401-4fb7-ac13-4531e0782c79$37ec6802-d4c2-4470-ad69-439d5a732f77$4d4ae57b-afc3-44f9-b6fc-892f59f82921$266d2234-26c8-43f1-9e75-49440a230ed6$83640f5b-fe13-4ec1-98a0-67a56c189ba1upstream_cells_mapzeroisless@inboundsonenothinglengthactor_critic_binary_episodic_squashed_gaussian_parameter_study$0d93132d-5819-47dc-8cf2-462d480d9c3dupstream_cells_map@NamedTuple:IntegerContinuousMDP$537270ba-122b-4f2b-880b-31d086766295RealInt64FunctionBase-^+$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72cprecedence_heuristic 
cell_id$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72cdownstream_cells_mapcorridor_train$5334064b-5a16-4135-afa0-86a48291725b$5981f52b-d829-4c7d-b47b-33310f7d64a2$573878bb-020d-40f6-9329-3d5f91843010upstream_cells_mapInt64corridor_mdp$ff76ef94-fdf5-41f3-a31a-21c4629efabeget_corridor_features$6bb0263e-368e-462a-948c-baf9cfa82512typemaxsarsa_λcell_execution_order$fac138d9-3c5d-44b0-a87c-b13872f19450$e034b9cb-f4ee-46f4-bea6-72c93c75d966$666a4e89-306b-4fb2-bdc4-3dda2c63153f$df7f84e8-b42a-4001-9dbf-6bc3ced94207$d963ff6d-f1b6-4799-aa0e-1ae100310d84$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70$7cf26604-9c2b-4a77-9674-7d4dac2f99f0$36a6e43f-6bcf-4c27-bfbb-047760e77ada$31f7e903-30b6-4193-9174-88093e004de4$48dcd2d0-a940-41da-a097-90c780f2ec4d$d95f75b5-21d8-4862-baa7-50b58d9725b8$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1d$dcb306ae-a1b1-43d6-ba6e-e38668838689$33c99850-67cd-4754-94b9-6df97b238e27$7a6fb1f0-fc3c-4c29-a6d9-769d32ca98a9$5cc4d12d-b537-47e2-8109-4e7a234fdf25$ff76ef94-fdf5-41f3-a31a-21c4629efabe$6bb0263e-368e-462a-948c-baf9cfa82512$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811$f2f2dd1d-180c-4d36-b515-5079d129f93a$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72c$5334064b-5a16-4135-afa0-86a48291725b$5981f52b-d829-4c7d-b47b-33310f7d64a2$8019bec9-1228-407b-9199-2fe29f26a981$38e5d800-4d43-40d2-87ea-f7d4b4283dab$b94fc99c-f439-4df2-8da3-c01718a136c4$9c342958-1971-48ec-b919-5dfdcbc915a4$e5faaa1b-88cb-43e2-8d04-8972b58b4bda$406638af-1e08-44d2-9ee4-97aa9294a94b$aa450da4-fe84-4eea-b6c4-9820b7982437$fdd3f4fd-4706-4d6b-b150-6ee6b4b370cb$41d62de1-2c92-41ee-9430-b9ca3007afd9$135f205a-f87e-4691-8e87-d317d6312c84$ca360680-afc9-4dd9-9351-493643f91575$4a39f9a7-72d4-44ad-895a-742cd1291f92$98229733-a71e-44ca-a52a-b7229cf8b422$37a8ef7e-e859-4ef0-81e2-76c02a324031$339b4d2b-2237-46a3-9867-ecc3332856c1$05b0fcad-628b-48d2-aa24-f6f562dbb660$17d07ef4-7c0a-47cc-a701-32c60336571b$76b03e72-da04-4530-8534-6d6468268cbd$2a586e46-66e4-461a-85c8-5817e4d1aa43$90d3b96b-ad2b-405c-951b-f48ec7ccf24a$f924eb30-d1cc-4941-8fb5-ff70ad425ab9$189798b3-ec6b-48b9
-918c-ee0f65935ab3$70096b14-beab-4f71-9886-6355c749bb8a$1558cec1-c4fd-4bc0-85ed-ae22c6067d41$e3a2fb12-37ce-4c23-ad93-5fc89991aabb$58403c8e-0ee4-4466-ba25-ee0c86fb0b47$73b90260-d57a-449a-8db6-47f91e6a4e4f$ee72af8d-3cb8-4314-82df-580f068e1252$89901156-b874-416b-89c1-6dc434a4eb17$5c11a92d-7496-4aba-af15-2537eac49dd7$581f7e9b-a5c2-4841-9605-85f9585b0274$da2d3186-a778-41cc-9b49-759bf1e9b8fa$f92bb265-4b19-4f0e-a698-d7547bb6dd41$8eab55a5-41b7-4f5e-a02f-4c19388bc9ea$a361f4c9-47ce-42ad-899c-87b611c0d471$cc3ac95e-a398-438a-ba3d-62b6733f6342$4634267b-5dea-4164-8bb2-1eb2fd4d7954$45f0a385-6465-4acc-8637-1b007a0fe215$41dc149d-c6f3-4b0d-a856-06f3aaae3049$b0a66a19-ee76-463b-a704-8fcee85444d0$042fbafe-2401-4fb7-ac13-4531e0782c79$65d2add6-fd6f-456c-92ed-3cd9d1862ef6$96506201-6b66-49e6-8179-06952e2394e1$0e9de19e-bcd4-40ac-9831-afb6cad38422$a206c759-3f6e-4003-8cba-5f6ce6742646$3bafd7df-9bc0-4d13-874d-739590cf3ad9$cc45091e-b889-4d5a-9eef-84d80f792046$d83dc659-dce7-41dd-a8e7-2933ab39d15c$1753b5ed-c00b-4b60-b492-822180778e8c$03a218cb-aa83-4000-85b5-c6f247087053$a893a87b-2d07-4db5-9d1a-9da8646216f4$5c4a383f-fcf2-4f2b-819f-6d84471dda00$2cbc972b-c685-4c1c-8a8d-9d58b197ad90$77cf3a74-899f-4ade-99f2-5aaf7a98c02d$0bf3b988-b3fb-49d5-8dde-b25766596363$a540814a-57a1-4b98-9443-59e401425444$635abb34-2c97-4f04-a74c-22fbec32f408$37ec6802-d4c2-4470-ad69-439d5a732f77$e7566274-5518-4e28-8738-d4b1747d0cfb$f3e2db06-9cb7-464a-96b8-938175efd26b$e1aec891-d95a-47d1-97d7-d2a4cfb16e64$8544eddb-2095-4a3c-82e0-920123a88e6d$48b342f2-e48f-457a-9bd3-b3504a79f3a6$fd89433e-643c-474b-b3c4-a997678421a6$1ec1acf1-f833-4478-9b3c-88029340a629$b72e030f-7d52-481f-b4f7-2b16b227e547$047656d1-2921-40f2-b75b-ce4a87098007$738ada7f-edc7-4ed3-a15e-e92113468738$ce33f710-fd9d-4dfa-acda-40204e54d518$f4b6f10b-4cd0-4be6-98ec-4d4ffb696392$e7e49ff8-32df-48a4-afb2-462859592e92$1386ffdb-940d-4f1b-a872-4e38647b5335$e2b09af1-0f22-4f7f-b806-54fa522adb20$4cbdb082-22ba-49e9-a6ed-4380917625ac$e6cf9550-2e69-4b82-92cf-5e07a35490aa$25be5dcf-be63-46c4-b
6de-6cf79fa28fd0$4fea7232-f286-4a8b-93f8-a0702818ab31$d8222abf-139c-4220-8e92-cc987ec6900c$511a847f-234c-465e-8f4a-688e79d9b975$0284f0d7-b8a9-4ae6-add0-ac1078571d9b$b4875f2b-5487-429f-80a3-d1032bbccfc1$4915b1ed-ad53-4ece-9b00-bc136d47d8dc$5b15f5c9-80bf-47f0-898a-f8dead5b927c$f3bc47b5-03fc-4bd9-a890-26f9608a730b$436c52d2-280b-4ca4-9360-d6587b8254c7$f0104778-81a6-417b-8501-f916e5e7f3af$1ac9296f-047b-4051-ba5c-0c23d5f9cde9$ba642a22-6623-482a-ab4a-81585b83e457$e96d592d-1e54-486d-8ad9-b857f85476e8$5aba4f96-e877-457e-8e95-18737348f99f$5b15d91e-7119-4f85-a54a-7d4f1fdaf097$7d94922e-dc9f-4953-b539-24aaa2c85b12$da8d0bca-105b-4d0b-a73d-ee5c9059aeaf$d17a4bd0-5992-4247-912d-73d51758d2f3$352d2952-cb83-47d3-9078-2b2ef9927443$f27f2bcd-05b6-44fe-bf9e-a3e51556db7c$b87ff1a9-abff-40f7-a1d8-f751a1c8b060$5d434c83-c9ca-499f-8695-c7733031c2de$4c4e643b-d4b9-44f0-8d30-dc521bcc55ac$7dbb42a3-aa8c-47e5-b668-18e6325d4038$de3cba34-9842-44d1-9b79-47126c0a0751$37a273b6-b104-46f0-987a-401dc1c97327$8e742d32-c074-4981-b35b-b596b64c869b$64900586-ef92-48e4-839e-ff952a46671b$19dfabda-7049-4050-8662-0385529c0c5a$966ef17c-23be-49dc-bc37-4cb52b34c049$2c5d221a-2469-49e1-9249-dfdc2457f2fa$5ffc271f-c73f-494a-9727-8d7516af2191$42d4600a-bf3c-45ac-b7f5-d23917713ff5$820752af-8966-4ee8-82f7-a40934522de5$0964133c-3a5b-433b-a8c4-a97813c37583$28ce6e60-59cf-408a-8081-b978507b3c72$5500fd8e-64cb-4af7-808d-230440746319$a9db3f85-ff56-4bbc-be87-47b893ef3b7b$00152954-dc98-4120-b94b-2ea4d987832b$46fea69b-599e-46ab-8455-d2da865d9a8e$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486$c926b6df-c40b-4c4c-8a95-ce9e41feb100$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7d$5d35e515-e2d3-443e-becf-eb28c25db346$735b548a-88f5-4a30-ab8f-dfb3d6401b2b$60c21e9c-e42d-4f0b-a910-3b318440fbc8$09dd1440-5d09-421f-addc-b1ede43ff517$7ccadf01-fbba-4dfd-a5ad-770dab9946f9$beb01fb8-c77d-4b5c-a66d-3812415e04a3$68e6f17e-8c87-40f0-a673-1115ecd1b71d$692c1043-4eaf-491e-b8fe-368618867f99$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5$fd964539-2baf-4ff1-b28
6-5a0bb1b222c4$0b01ba67-3921-4f3f-a7e8-235190bc84eb$7bf209c8-ef0a-46d1-937e-b1a6e45dc62e$ad0009af-2cfc-4820-bd4a-698ad391f459$b09e1e48-494e-4967-826a-6e70199acad4$5864a5a3-a5a5-43c2-9cb4-7d13b2d20bed$94517664-6988-44dc-a297-e9d5873ee540$b16899b7-36bf-4a5e-8e2f-4496b8450687$00bd2835-b006-4244-9877-bc7e031e3ef8$3e7cecec-eb77-4862-8e3c-b510422e06db$78c83673-2117-4542-b4d8-1c243e8f610b$ae0f5a96-7a4b-47f9-be1e-e803a238a071$c8b47eac-2d45-419a-bec6-2ae0cdc59393$537270ba-122b-4f2b-880b-31d086766295$10cdd16e-a337-4421-a7a0-6de4e4b60c0f$54fff14b-cf53-47b0-9cfa-8b9ee33df54e$76fd79a2-2bc8-45f8-a243-48415118898a$87ee21f3-16ca-4c8c-a0b9-f9e2fd258a91$b966b248-fb4d-457d-90f6-114370846242$f946c886-6246-4f98-a96f-f06984691ad8$f7433324-acc3-49a5-b5b3-ada0c8f09d52$fb8904a9-ae64-41cc-93b6-5a25855edad0$cecc2a35-3850-4f66-84e8-e29da4f3d4b0$a019925a-460a-410e-a54b-50a4cfe0e90e$e1493cea-19c4-475d-98a0-86d27fb04af1$573878bb-020d-40f6-9329-3d5f91843010$0c9986bb-54c0-4b08-9c29-4bfb0b68b54e$54f559b6-8a62-4a42-894d-c56e41d5ebef$62e677ac-2070-4f6b-9df2-90849d89fa9f$bba13634-ff0e-47f7-a23b-8d56098f4ac6$b2082ab0-73a4-45a6-8772-a2e6e22b519a$7a6f3f79-ea06-4994-8b62-90b2056e4034$5261651e-a51e-4e80-8e23-83a4c10e5259$bfe7e41d-6318-4bd4-b892-287831876abc$6bf5ad39-1400-4e1f-a843-a1934b8aaa48$f55afa58-962d-4551-8d95-a5b467d61adf$4fb83451-b6f8-4e6e-a131-1accc8e10b08$740a3f41-9302-481d-b373-762c0dea8eff$d41f1dd1-45fe-4456-9a01-ed47fd6704a7$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690c$f545c800-0bf3-491f-9d7d-42341cfdb573$5b868eba-c1af-49f6-8f93-79b78c319a6f$0ac7ea44-14f6-4e80-80f9-d6df8059bb38$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290$d037ea92-915c-4dc7-97c6-d006d92e088a$0c56b341-24eb-4c78-844e-182f44a7221a$f2ed56c9-c2b7-42cb-a083-e12aeaa126ef$8e39bd15-862e-4941-88f9-2794b861a523$5720e942-d3f8-4329-83a8-8bcedf078b6a$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091$07ad517a-c2ac-4377-99fb-adb13d0f1d0c$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb$cbea5840-49d2-4e91-be9c-f5f15666d78a$83ca0577-15d7-4448-b597-c77810b812bf$a7dcc8cd-04ec-48f2-a387-
116330eaffb2$e5c1aca8-7575-4835-8273-e69ca0a55fe8$d1ed25e6-60c6-411f-a541-99986e5da2c5$cacaaca6-6e01-464f-a2ee-cbf62737a426$a12b92d1-e045-4f92-b8cd-eee5d56fa67d$44b32cc0-36a8-41fd-89bc-ce894536926c$553b0ceb-f2ca-41ee-99bc-9f53a4487b49$697b2310-9d96-4f7f-be62-c3bd6bf736f3$aa69e4ea-91e0-496a-a7be-529e67f4dbec$76eb6743-cac0-4174-9ba3-a0691c200b54$ba41f521-4ee2-42a6-bf18-078bfa4b875e$ba5d6311-daee-4abc-b2fb-fae2184ef3eb$ed93259c-7b8b-46d7-97fb-f194e0e04b3a$4e29c621-223e-4859-8e96-db04b967815a$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$d5020a8d-1dd7-403c-9d1f-665b95543943$2be8a812-4f21-4fe8-a2de-50497db0345a$056a8adc-92f4-4b33-90d9-4b3b4026bbbc$11b9beea-b0cd-45eb-84c6-151728894df0$4d4ae57b-afc3-44f9-b6fc-892f59f82921$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2$7d63b960-3998-4f7b-8cbb-ccd49db9aeac$646bc853-b7fc-49fa-a201-ff98e8f952d4$94354552-9920-4b90-98d9-f75286d1f53e$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6c$57e5e12a-b722-4ea3-ab3b-e5711029e640$9db9ff71-bee9-4bea-a45b-748f8517fed1$57bbdb10-bed8-459d-8f67-9ea637cf12ba$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0$266d2234-26c8-43f1-9e75-49440a230ed6$83640f5b-fe13-4ec1-98a0-67a56c189ba1$b71145a4-2614-4f62-bfd2-7d5d1fecec56$4da20fd7-b897-4f26-bf2a-f08d66ddf90f$05bfd818-bf4e-4bda-baa9-5ba647867097$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728$396e0047-d848-462f-a769-0cc2829abc78$1f041cb3-618c-4380-a1ec-d7bbe4a80f62$72273f27-d0b9-4645-a609-cb65cc9332ee$8b35661b-5075-4d63-bc31-044407f99acf$3c89209c-9202-4d5d-841c-ea34be369616$645e93e7-e92e-49c4-9757-8294fabf4e9b$68806899-9972-460a-9f11-daa708a9d610$11ea640c-3981-404d-87c6-4d3d0708a2b8$734573e5-547b-4dcc-89bb-412aa6cc42d6$ff4f977e-48df-4c12-845c-c245b4d39d6d$7afb6fb0-248a-4518-b94f-9876f81eca64$42775fd1-5b27-48e0-abf1-9b22bb775e6d$1b102220-6d78-480d-a77f-0e57bad23dca$b2539398-fdbc-42a2-a8f3-d327358f3643$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54$f8614042-7c94-4d47-a1b6-4e96676b4e8b$8bc280db-e57d-4e40-be46-1790f4f7d9e7$11063fff-4d36-46d5-828f-dbed0f46b9cf$20776e09-7d9b-4db8-a060-7bceeec65b47$3e3c5897-809f-46e3-bb58-f1
15b082443e$05f120be-9695-4824-82fd-142a0df13098$717e4c69-59d5-4929-923f-dd35a97fb160$55ba8725-0ddf-4196-a41d-3f3c490a8d84$61949faa-8174-4b7b-8fbc-01d5f850b419$dd8e8cd2-7b41-46c4-8530-adefb7aea684$08505e88-9c23-4e95-91e3-d18bf5133dbc$87482ea5-5265-4e02-92c0-1a8bb44ff0f4$13ebc12f-ff6f-4266-88d3-28d6df5fcf59$3d065608-eef2-4caa-b17d-ec60714e3d58$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19f$c5a2879c-e89b-47f7-bbd6-48200d7e89e3$65be0e58-24be-4932-92a9-9e4825b14144$3c695d54-c30f-4f04-bd40-f5da53be2a95$3c316495-bb6c-41e2-a38f-ba867a319fbb$024dcd1a-8eaa-4a95-8037-2f578828309c$822e4d69-2582-4956-858e-06ecb091e76a$cf1859d6-f889-4923-8c87-2d7c039f26c3$31db0f58-28e4-454f-9394-25565687266f$fddef10c-7695-4596-9e16-987fd45a57e6$26880577-d267-4950-8725-7afe0d0402b6$0cd96c44-cae6-421f-9fae-26141600bef4$24fa139c-ad4b-49db-ac8f-23c476ed8608$dddc4a2f-34b2-41dc-85b3-55aba4880fa6$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5$d3b56fca-5b79-4465-8987-8d0005f854d8$5859ca11-90f8-4fd6-88ed-c56efe796fe8$281360af-46bf-4c73-bf11-3cb1153ad3e2$8f1b2db4-ed35-44fc-a3d5-e06deae16d48$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03$8aa16866-bfda-48df-9cf1-cf3d2e203ccb$dca2f8e2-76af-4679-bf81-3824c15fc76d$11a55af7-5301-4507-bb26-88e1e11236db$7856b8a0-565d-4c86-9b3c-4424ff9b86dd$8fcdca63-01a0-4d4b-933c-06a7621d980a$76d54520-baa3-44bf-b303-4cdcb8b87080$9acdbf38-2e10-45ec-85a0-d0db8453a599$f0962801-0dfa-421f-8ffc-e64068e49913$c251a630-7114-4188-9323-8d8feb5c32e0$cb70d400-3e9c-441c-b17c-e727e8c928f3$61650a97-b353-4a85-b50b-93fee296ac7b$192b9f82-8d3a-408f-91c2-829cfcd32572$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62$50ae94c4-70f3-4215-82bd-eb2227c2badf$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27$04b5929a-2058-49c9-963a-96c752a1d67d$64b38d1f-ecf9-4843-89a1-4c8953048265$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fce$6acb549a-5d90-4457-a347-d22448ad8071$d34d22ad-89c2-423e-91dd-bfb895dc6540$5eebf3da-bfe7-46eb-81a3-f87f334ee270$9978d537-49ff-4014-a971-b42704c50a6b$54ff46a2-489a-4dd2-bc30-df70c780cc42$407a0724-4bb6-4c83-ab2d-17a0e19c4072$27487ad0-4779-42ce-8def-e660
ef04bee0$9d264543-33ab-498a-90f5-5f913c252484$07ba9fe4-aaa7-4123-9865-cbfa79d0d44a$e1274f57-75cb-4659-a82f-e5870c5367e2$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547d$5ee4ce72-7740-4297-8d84-619e0708e4ac$87feff3e-e510-4916-91a9-db3a2cd12225$6b1acb57-159a-4b7f-99fe-5f996522243b$82e0e9a0-9662-429a-87e3-e6bdae02709a$27441783-d3c6-40be-9c36-4941613e6ae9$daf35bfe-8f9c-4f55-971d-4d443be8f8bf$51d6337d-c0bd-40a9-9129-7d88e41e4093$a5b002c9-5e11-462a-9da0-6e060c7963f8$9bce6fdb-2cbc-4758-9a8b-794e490c973d$bb1ef180-39ac-475f-beea-ef573e71a3bf$a8349352-3242-46d5-b0d5-1b6eb5d77e90$2e7c737c-c798-4442-a7e1-d74ccfd73119$f7f58fd2-facc-4b87-9172-5e911677c8f4$d21617aa-6f38-4a90-8586-4b32022497ad$700dcbc4-c94c-4287-8cf0-0b2c7a320a3a$4f96be72-ef3e-4e08-ac4c-be4271dcd14c$54f1546d-87ae-49d2-92ed-6fcc9b66e027$c5dd7e99-57e0-4bc7-97d2-2c780b23bcff$2025ff38-f2ec-4224-b771-ff72ffe1af28$77906355-08f8-4b08-b051-84697199b519$023f67b8-8f38-470a-9766-ac60a75678aa$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9$10ee7709-0816-48d2-abe0-9be3dd04700f$7c592385-e8d3-4efe-962c-d39debb64405$d57375a5-b9e0-4742-b5f7-6a7da891604a$04f42c09-8ab5-4233-b196-51c4aa2dcedb$b02ba928-5b9f-4695-b980-07988c788bb9$98222fcd-b456-477c-90dd-844df36877e5$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6$d9d11d69-bc16-400a-8f46-f9a8ecb8516a$bc8a399b-8864-4473-89d2-e3b0a03d15b5$192cc1cf-9ea1-492d-baa7-f2e197abecd4$6d0925d3-af96-4b94-8e2e-4941cce39e51$786a5385-b648-4fc3-8e19-bf6582828136$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbf$38acd032-1d18-4760-9111-67c9cdd2e892$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2$349631b2-4686-49a9-9f3a-1e4ad588b568$ac9c8845-284d-4c21-b05d-d930f86598a3$b8532822-179b-4cd5-a279-4b71dafb544a$fee14dfe-c5ca-4126-a830-cc9d7eda5433$e524f8cc-ab69-4f8b-a59f-28156696a104$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7d$ff3009eb-23f9-44fe-8e56-85dbc7b463d0$b7f77935-bcab-4ef1-8e1b-a7d059784ff3$6c5e9bb2-4c38-4613-9652-dec99e97b512$f8215517-b18f-4a03-9421-8edab4ca8089$d2729657-d0bf-4d39-8ec7-f242a1ad48d6$8e096fae-9941-49d8-ae87-c68b02f68da5$44f14d4f-7414-4c6f-883a-042ca2
61a403$6c5f51bb-a6be-447e-b73d-4f9c2885e809$4156d955-9daf-4429-b152-e8332980fb9e$16113560-e911-47b4-abc4-641bbd246454$a7891c63-18d6-4c1f-ba67-adf7c547d334$7126aefd-b847-497a-9545-514e9b9afa71$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09$4c34640f-efa2-4e1d-8a70-0acd2ce45428$f7ede764-5ad8-426b-a805-cc21b622d977$3ea08816-705e-4be7-a175-dbd3f3e4c17d$5d50a5d0-8fe2-4c6e-b76c-d5614e4fd884$0ab70fc3-6188-42eb-aba2-d808f319be9f$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb$16fcc2d0-9f2f-4226-9dcc-6d86248cab26$9cf3dc5f-8a25-479f-93db-06e34f0d37a0$cc80848a-6834-4272-9152-e17b45448814$a8b40b8f-051a-4e6f-a079-ece4f32873de$36d514fa-b27a-4c6b-8399-9d108377b9b5$c52c4cec-0ea8-4af3-831a-d284f0e086ee$4c5cb75e-79b5-4502-b1eb-6246e002feaf$8eb42403-1234-4e59-993e-057cc3a6d5c9$71a5fce8-6d9a-4625-bad1-a951d61bff28$b53dba81-a9e9-41da-8fc2-7736bf25f2dc$0d45ae72-572f-4d17-83cf-9814f2854131$0d93132d-5819-47dc-8cf2-462d480d9c3d$602a07dd-8928-4b44-97e5-01c5cbf38351$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7ef$fd58402f-da65-44cf-b81a-e21192fd0e63$af144759-fe66-4ad0-b378-e9eb4e859db4$d4e87ac4-6008-43b2-aa06-e232ec2b2b5b$63fbf8f4-e4e2-4893-be09-67450e92dbd7$fad02876-efba-46a7-9cb7-43820528779f$374af774-3a97-49b5-a3bb-bc3f7f63a3fa$1ce4bc6c-7cde-48e9-8ff1-7281697fd121$f9facbba-39d4-483e-9066-275603156db0$e89bdc84-dbb5-4c73-a39c-6392e5f79704$c0876a48-ea18-494d-8bfc-e2bceb73b417$d82e7ab8-c372-4462-afb5-1617560cdb56$bbc8864a-1545-433f-bc7c-0ddf6e907138$dc2efc6c-8da8-425b-aa5f-290949109565$68469a40-7976-48b7-b7a1-eaa4c5f33a18$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580a$b695ef21-a1ac-4d1f-a0e1-71cd81cede18$a0ca7a5e-0089-4a45-9278-c0f27cd096a0$ba645f6b-143f-4e83-9003-707770ae308d$da3cb392-78f2-48b2-b0dc-5f016664798c$3a37b53d-9174-4faa-9404-74a40c385b0a$ddbca73f-c692-46f2-95f3-a7dd849d33f7$b5319d8b-0420-4ebf-b603-ea0b93365ac1$c87dba8c-9a96-41b3-9dc7-a6c088ec1eaf$cd9c9eeb-c90d-4499-9503-7773d5250f47$5207308e-f636-4d47-b135-036a6e7b8ecd$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8f$f59a5dcd-9f4a-4336-a391-e64af35e
f799last_hot_reload_timeshortpath%Chapter_13_Policy_Gradient_Methods.jlprocess_statusreadypathٰ/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-13/Chapter_13_Policy_Gradient_Methods.jlpluto_versionv0.20.8last_save_timeA  ycell_order$36a6e43f-6bcf-4c27-bfbb-047760e77ada$31f7e903-30b6-4193-9174-88093e004de4$48dcd2d0-a940-41da-a097-90c780f2ec4d$d95f75b5-21d8-4862-baa7-50b58d9725b8$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1d$dcb306ae-a1b1-43d6-ba6e-e38668838689$33c99850-67cd-4754-94b9-6df97b238e27$7a6fb1f0-fc3c-4c29-a6d9-769d32ca98a9$5cc4d12d-b537-47e2-8109-4e7a234fdf25$ff76ef94-fdf5-41f3-a31a-21c4629efabe$f7433324-acc3-49a5-b5b3-ada0c8f09d52$fb8904a9-ae64-41cc-93b6-5a25855edad0$cecc2a35-3850-4f66-84e8-e29da4f3d4b0$6bb0263e-368e-462a-948c-baf9cfa82512$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811$a019925a-460a-410e-a54b-50a4cfe0e90e$f2f2dd1d-180c-4d36-b515-5079d129f93a$e1493cea-19c4-475d-98a0-86d27fb04af1$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72c$5334064b-5a16-4135-afa0-86a48291725b$5981f52b-d829-4c7d-b47b-33310f7d64a2$573878bb-020d-40f6-9329-3d5f91843010$8019bec9-1228-407b-9199-2fe29f26a981$38e5d800-4d43-40d2-87ea-f7d4b4283dab$b94fc99c-f439-4df2-8da3-c01718a136c4$9c342958-1971-48ec-b919-5dfdcbc915a4$e5faaa1b-88cb-43e2-8d04-8972b58b4bda$406638af-1e08-44d2-9ee4-97aa9294a94b$aa450da4-fe84-4eea-b6c4-9820b7982437$fdd3f4fd-4706-4d6b-b150-6ee6b4b370cb$0c9986bb-54c0-4b08-9c29-4bfb0b68b54e$54f559b6-8a62-4a42-894d-c56e41d5ebef$41d62de1-2c92-41ee-9430-b9ca3007afd9$62e677ac-2070-4f6b-9df2-90849d89fa9f$135f205a-f87e-4691-8e87-d317d6312c84$ca360680-afc9-4dd9-9351-493643f91575$4a39f9a7-72d4-44ad-895a-742cd1291f92$9cf3dc5f-8a25-479f-93db-06e34f0d37a0$16fcc2d0-9f2f-4226-9dcc-6d86248cab26$98229733-a71e-44ca-a52a-b7229cf8b422$37a8ef7e-e859-4ef0-81e2-76c02a324031$339b4d2b-2237-46a3-9867-ecc3332856c1$05b0fcad-628b-48d2-aa24-f6f562dbb660$17d07ef4-7c0a-47cc-a701-32c60336571b$76b03e72-da04-4530-8534-6d6468268cbd$2a586e46-66e4-4
61a-85c8-5817e4d1aa43$90d3b96b-ad2b-405c-951b-f48ec7ccf24a$f924eb30-d1cc-4941-8fb5-ff70ad425ab9$189798b3-ec6b-48b9-918c-ee0f65935ab3$70096b14-beab-4f71-9886-6355c749bb8a$1558cec1-c4fd-4bc0-85ed-ae22c6067d41$e3a2fb12-37ce-4c23-ad93-5fc89991aabb$58403c8e-0ee4-4466-ba25-ee0c86fb0b47$73b90260-d57a-449a-8db6-47f91e6a4e4f$ee72af8d-3cb8-4314-82df-580f068e1252$89901156-b874-416b-89c1-6dc434a4eb17$5c11a92d-7496-4aba-af15-2537eac49dd7$b0a66a19-ee76-463b-a704-8fcee85444d0$581f7e9b-a5c2-4841-9605-85f9585b0274$da2d3186-a778-41cc-9b49-759bf1e9b8fa$f92bb265-4b19-4f0e-a698-d7547bb6dd41$8eab55a5-41b7-4f5e-a02f-4c19388bc9ea$a361f4c9-47ce-42ad-899c-87b611c0d471$cc3ac95e-a398-438a-ba3d-62b6733f6342$4634267b-5dea-4164-8bb2-1eb2fd4d7954$45f0a385-6465-4acc-8637-1b007a0fe215$41dc149d-c6f3-4b0d-a856-06f3aaae3049$042fbafe-2401-4fb7-ac13-4531e0782c79$65d2add6-fd6f-456c-92ed-3cd9d1862ef6$0ac7ea44-14f6-4e80-80f9-d6df8059bb38$96506201-6b66-49e6-8179-06952e2394e1$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290$8e39bd15-862e-4941-88f9-2794b861a523$0e9de19e-bcd4-40ac-9831-afb6cad38422$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091$d037ea92-915c-4dc7-97c6-d006d92e088a$a206c759-3f6e-4003-8cba-5f6ce6742646$0c56b341-24eb-4c78-844e-182f44a7221a$3bafd7df-9bc0-4d13-874d-739590cf3ad9$cc45091e-b889-4d5a-9eef-84d80f792046$d83dc659-dce7-41dd-a8e7-2933ab39d15c$1753b5ed-c00b-4b60-b492-822180778e8c$03a218cb-aa83-4000-85b5-c6f247087053$a893a87b-2d07-4db5-9d1a-9da8646216f4$5c4a383f-fcf2-4f2b-819f-6d84471dda00$2cbc972b-c685-4c1c-8a8d-9d58b197ad90$77cf3a74-899f-4ade-99f2-5aaf7a98c02d$0bf3b988-b3fb-49d5-8dde-b25766596363$a540814a-57a1-4b98-9443-59e401425444$635abb34-2c97-4f04-a74c-22fbec32f408$37ec6802-d4c2-4470-ad69-439d5a732f77$e7566274-5518-4e28-8738-d4b1747d0cfb$4fb83451-b6f8-4e6e-a131-1accc8e10b08$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb$d1ed25e6-60c6-411f-a541-99986e5da2c5$f3e2db06-9cb7-464a-96b8-938175efd26b$e1aec891-d95a-47d1-97d7-d2a4cfb16e64$697b2310-9d96-4f7f-be62-c3bd6bf736f3$8544eddb-2095-4a3c-82e0-920123a88e6d$48b342f2-e48f-457
a-9bd3-b3504a79f3a6$f2ed56c9-c2b7-42cb-a083-e12aeaa126ef$cbea5840-49d2-4e91-be9c-f5f15666d78a$fd89433e-643c-474b-b3c4-a997678421a6$5720e942-d3f8-4329-83a8-8bcedf078b6a$cacaaca6-6e01-464f-a2ee-cbf62737a426$1ec1acf1-f833-4478-9b3c-88029340a629$07ad517a-c2ac-4377-99fb-adb13d0f1d0c$aa69e4ea-91e0-496a-a7be-529e67f4dbec$83ca0577-15d7-4448-b597-c77810b812bf$b72e030f-7d52-481f-b4f7-2b16b227e547$a7dcc8cd-04ec-48f2-a387-116330eaffb2$047656d1-2921-40f2-b75b-ce4a87098007$94354552-9920-4b90-98d9-f75286d1f53e$a12b92d1-e045-4f92-b8cd-eee5d56fa67d$44b32cc0-36a8-41fd-89bc-ce894536926c$553b0ceb-f2ca-41ee-99bc-9f53a4487b49$738ada7f-edc7-4ed3-a15e-e92113468738$e5c1aca8-7575-4835-8273-e69ca0a55fe8$ce33f710-fd9d-4dfa-acda-40204e54d518$f4b6f10b-4cd0-4be6-98ec-4d4ffb696392$e7e49ff8-32df-48a4-afb2-462859592e92$4d4ae57b-afc3-44f9-b6fc-892f59f82921$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2$57e5e12a-b722-4ea3-ab3b-e5711029e640$57bbdb10-bed8-459d-8f67-9ea637cf12ba$1386ffdb-940d-4f1b-a872-4e38647b5335$7d63b960-3998-4f7b-8cbb-ccd49db9aeac$9db9ff71-bee9-4bea-a45b-748f8517fed1$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0$646bc853-b7fc-49fa-a201-ff98e8f952d4$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6c$e2b09af1-0f22-4f7f-b806-54fa522adb20$4cbdb082-22ba-49e9-a6ed-4380917625ac$e6cf9550-2e69-4b82-92cf-5e07a35490aa$25be5dcf-be63-46c4-b6de-6cf79fa28fd0$266d2234-26c8-43f1-9e75-49440a230ed6$05bfd818-bf4e-4bda-baa9-5ba647867097$68806899-9972-460a-9f11-daa708a9d610$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54$4fea7232-f286-4a8b-93f8-a0702818ab31$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728$396e0047-d848-462f-a769-0cc2829abc78$1f041cb3-618c-4380-a1ec-d7bbe4a80f62$11ea640c-3981-404d-87c6-4d3d0708a2b8$f8614042-7c94-4d47-a1b6-4e96676b4e8b$bc8a399b-8864-4473-89d2-e3b0a03d15b5$cc80848a-6834-4272-9152-e17b45448814$a8b40b8f-051a-4e6f-a079-ece4f32873de$36d514fa-b27a-4c6b-8399-9d108377b9b5$c52c4cec-0ea8-4af3-831a-d284f0e086ee$d8222abf-139c-4220-8e92-cc987ec6900c$511a847f-234c-465e-8f4a-688e79d9b975$0284f0d7-b8a9-4ae6-add0-ac1078571d9b$b4875f2b-5487-429f-
80a3-d1032bbccfc1$4915b1ed-ad53-4ece-9b00-bc136d47d8dc$5b15f5c9-80bf-47f0-898a-f8dead5b927c$83640f5b-fe13-4ec1-98a0-67a56c189ba1$f3bc47b5-03fc-4bd9-a890-26f9608a730b$72273f27-d0b9-4645-a609-cb65cc9332ee$436c52d2-280b-4ca4-9360-d6587b8254c7$f0104778-81a6-417b-8501-f916e5e7f3af$1ac9296f-047b-4051-ba5c-0c23d5f9cde9$fac138d9-3c5d-44b0-a87c-b13872f19450$ba642a22-6623-482a-ab4a-81585b83e457$734573e5-547b-4dcc-89bb-412aa6cc42d6$e96d592d-1e54-486d-8ad9-b857f85476e8$ff4f977e-48df-4c12-845c-c245b4d39d6d$8bc280db-e57d-4e40-be46-1790f4f7d9e7$5aba4f96-e877-457e-8e95-18737348f99f$11063fff-4d36-46d5-828f-dbed0f46b9cf$7afb6fb0-248a-4518-b94f-9876f81eca64$5b15d91e-7119-4f85-a54a-7d4f1fdaf097$7d94922e-dc9f-4953-b539-24aaa2c85b12$42775fd1-5b27-48e0-abf1-9b22bb775e6d$da8d0bca-105b-4d0b-a73d-ee5c9059aeaf$8b35661b-5075-4d63-bc31-044407f99acf$d17a4bd0-5992-4247-912d-73d51758d2f3$352d2952-cb83-47d3-9078-2b2ef9927443$f27f2bcd-05b6-44fe-bf9e-a3e51556db7c$b87ff1a9-abff-40f7-a1d8-f751a1c8b060$5d434c83-c9ca-499f-8695-c7733031c2de$4c4e643b-d4b9-44f0-8d30-dc521bcc55ac$602a07dd-8928-4b44-97e5-01c5cbf38351$7dbb42a3-aa8c-47e5-b668-18e6325d4038$de3cba34-9842-44d1-9b79-47126c0a0751$1b102220-6d78-480d-a77f-0e57bad23dca$37a273b6-b104-46f0-987a-401dc1c97327$8e742d32-c074-4981-b35b-b596b64c869b$b2539398-fdbc-42a2-a8f3-d327358f3643$e034b9cb-f4ee-46f4-bea6-72c93c75d966$64900586-ef92-48e4-839e-ff952a46671b$3c89209c-9202-4d5d-841c-ea34be369616$645e93e7-e92e-49c4-9757-8294fabf4e9b$0cd96c44-cae6-421f-9fae-26141600bef4$19dfabda-7049-4050-8662-0385529c0c5a$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2$966ef17c-23be-49dc-bc37-4cb52b34c049$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62$2c5d221a-2469-49e1-9249-dfdc2457f2fa$5ffc271f-c73f-494a-9727-8d7516af2191$42d4600a-bf3c-45ac-b7f5-d23917713ff5$50ae94c4-70f3-4215-82bd-eb2227c2badf$820752af-8966-4ee8-82f7-a40934522de5$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27$0964133c-3a5b-433b-a8c4-a97813c37583$04b5929a-2058-49c9-963a-96c752a1d67d$64b38d1f-ecf9-4843-89a1-4c8953048265$7f77d574-8f65-4e1e-8f
5f-6f1bcccc3fce$6acb549a-5d90-4457-a347-d22448ad8071$fad02876-efba-46a7-9cb7-43820528779f$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7ef$28ce6e60-59cf-408a-8081-b978507b3c72$fd58402f-da65-44cf-b81a-e21192fd0e63$5500fd8e-64cb-4af7-808d-230440746319$a9db3f85-ff56-4bbc-be87-47b893ef3b7b$00152954-dc98-4120-b94b-2ea4d987832b$46fea69b-599e-46ab-8455-d2da865d9a8e$d57375a5-b9e0-4742-b5f7-6a7da891604a$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486$04f42c09-8ab5-4233-b196-51c4aa2dcedb$b02ba928-5b9f-4695-b980-07988c788bb9$98222fcd-b456-477c-90dd-844df36877e5$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6$e89bdc84-dbb5-4c73-a39c-6392e5f79704$da3cb392-78f2-48b2-b0dc-5f016664798c$f0962801-0dfa-421f-8ffc-e64068e49913$c251a630-7114-4188-9323-8d8feb5c32e0$c926b6df-c40b-4c4c-8a95-ce9e41feb100$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7d$5d35e515-e2d3-443e-becf-eb28c25db346$cb70d400-3e9c-441c-b17c-e727e8c928f3$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9$10ee7709-0816-48d2-abe0-9be3dd04700f$c0876a48-ea18-494d-8bfc-e2bceb73b417$3a37b53d-9174-4faa-9404-74a40c385b0a$735b548a-88f5-4a30-ab8f-dfb3d6401b2b$60c21e9c-e42d-4f0b-a910-3b318440fbc8$09dd1440-5d09-421f-addc-b1ede43ff517$7ccadf01-fbba-4dfd-a5ad-770dab9946f9$beb01fb8-c77d-4b5c-a66d-3812415e04a3$68e6f17e-8c87-40f0-a673-1115ecd1b71d$692c1043-4eaf-491e-b8fe-368618867f99$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5$fd964539-2baf-4ff1-b286-5a0bb1b222c4$666a4e89-306b-4fb2-bdc4-3dda2c63153f$0b01ba67-3921-4f3f-a7e8-235190bc84eb$7bf209c8-ef0a-46d1-937e-b1a6e45dc62e$ad0009af-2cfc-4820-bd4a-698ad391f459$b09e1e48-494e-4967-826a-6e70199acad4$5864a5a3-a5a5-43c2-9cb4-7d13b2d20bed$94517664-6988-44dc-a297-e9d5873ee540$b16899b7-36bf-4a5e-8e2f-4496b8450687$00bd2835-b006-4244-9877-bc7e031e3ef8$3e7cecec-eb77-4862-8e3c-b510422e06db$78c83673-2117-4542-b4d8-1c243e8f610b$ae0f5a96-7a4b-47f9-be1e-e803a238a071$c8b47eac-2d45-419a-bec6-2ae0cdc59393$537270ba-122b-4f2b-880b-31d086766295$10cdd16e-a337-4421-a7a0-6de4e4b60c0f$54fff14b-cf53-47b0-9cfa-8b9ee33df54e$76fd79a2-2bc8-45f8-a243
-48415118898a$87ee21f3-16ca-4c8c-a0b9-f9e2fd258a91$b966b248-fb4d-457d-90f6-114370846242$f946c886-6246-4f98-a96f-f06984691ad8$bba13634-ff0e-47f7-a23b-8d56098f4ac6$b2082ab0-73a4-45a6-8772-a2e6e22b519a$7a6f3f79-ea06-4994-8b62-90b2056e4034$5261651e-a51e-4e80-8e23-83a4c10e5259$bfe7e41d-6318-4bd4-b892-287831876abc$6bf5ad39-1400-4e1f-a843-a1934b8aaa48$f55afa58-962d-4551-8d95-a5b467d61adf$740a3f41-9302-481d-b373-762c0dea8eff$d41f1dd1-45fe-4456-9a01-ed47fd6704a7$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690c$f545c800-0bf3-491f-9d7d-42341cfdb573$5b868eba-c1af-49f6-8f93-79b78c319a6f$76eb6743-cac0-4174-9ba3-a0691c200b54$ba41f521-4ee2-42a6-bf18-078bfa4b875e$ba5d6311-daee-4abc-b2fb-fae2184ef3eb$ed93259c-7b8b-46d7-97fb-f194e0e04b3a$4e29c621-223e-4859-8e96-db04b967815a$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00$d5020a8d-1dd7-403c-9d1f-665b95543943$2be8a812-4f21-4fe8-a2de-50497db0345a$056a8adc-92f4-4b33-90d9-4b3b4026bbbc$11b9beea-b0cd-45eb-84c6-151728894df0$b71145a4-2614-4f62-bfd2-7d5d1fecec56$4da20fd7-b897-4f26-bf2a-f08d66ddf90f$20776e09-7d9b-4db8-a060-7bceeec65b47$3e3c5897-809f-46e3-bb58-f115b082443e$05f120be-9695-4824-82fd-142a0df13098$717e4c69-59d5-4929-923f-dd35a97fb160$55ba8725-0ddf-4196-a41d-3f3c490a8d84$61949faa-8174-4b7b-8fbc-01d5f850b419$dd8e8cd2-7b41-46c4-8530-adefb7aea684$08505e88-9c23-4e95-91e3-d18bf5133dbc$87482ea5-5265-4e02-92c0-1a8bb44ff0f4$13ebc12f-ff6f-4266-88d3-28d6df5fcf59$3d065608-eef2-4caa-b17d-ec60714e3d58$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19f$c5a2879c-e89b-47f7-bbd6-48200d7e89e3$65be0e58-24be-4932-92a9-9e4825b14144$3c695d54-c30f-4f04-bd40-f5da53be2a95$3c316495-bb6c-41e2-a38f-ba867a319fbb$024dcd1a-8eaa-4a95-8037-2f578828309c$822e4d69-2582-4956-858e-06ecb091e76a$cf1859d6-f889-4923-8c87-2d7c039f26c3$31db0f58-28e4-454f-9394-25565687266f$fddef10c-7695-4596-9e16-987fd45a57e6$26880577-d267-4950-8725-7afe0d0402b6$24fa139c-ad4b-49db-ac8f-23c476ed8608$dddc4a2f-34b2-41dc-85b3-55aba4880fa6$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5$d3b56fca-5b79-4465-8987-8d0005f854d8$5859ca11-90f8-4fd6-88ed-c
56efe796fe8$281360af-46bf-4c73-bf11-3cb1153ad3e2$8f1b2db4-ed35-44fc-a3d5-e06deae16d48$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03$8aa16866-bfda-48df-9cf1-cf3d2e203ccb$dca2f8e2-76af-4679-bf81-3824c15fc76d$11a55af7-5301-4507-bb26-88e1e11236db$7856b8a0-565d-4c86-9b3c-4424ff9b86dd$8fcdca63-01a0-4d4b-933c-06a7621d980a$76d54520-baa3-44bf-b303-4cdcb8b87080$9acdbf38-2e10-45ec-85a0-d0db8453a599$61650a97-b353-4a85-b50b-93fee296ac7b$192b9f82-8d3a-408f-91c2-829cfcd32572$d34d22ad-89c2-423e-91dd-bfb895dc6540$5eebf3da-bfe7-46eb-81a3-f87f334ee270$9978d537-49ff-4014-a971-b42704c50a6b$54ff46a2-489a-4dd2-bc30-df70c780cc42$407a0724-4bb6-4c83-ab2d-17a0e19c4072$27487ad0-4779-42ce-8def-e660ef04bee0$9d264543-33ab-498a-90f5-5f913c252484$07ba9fe4-aaa7-4123-9865-cbfa79d0d44a$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547d$374af774-3a97-49b5-a3bb-bc3f7f63a3fa$af144759-fe66-4ad0-b378-e9eb4e859db4$e1274f57-75cb-4659-a82f-e5870c5367e2$63fbf8f4-e4e2-4893-be09-67450e92dbd7$5ee4ce72-7740-4297-8d84-619e0708e4ac$87feff3e-e510-4916-91a9-db3a2cd12225$6b1acb57-159a-4b7f-99fe-5f996522243b$82e0e9a0-9662-429a-87e3-e6bdae02709a$27441783-d3c6-40be-9c36-4941613e6ae9$daf35bfe-8f9c-4f55-971d-4d443be8f8bf$51d6337d-c0bd-40a9-9129-7d88e41e4093$a5b002c9-5e11-462a-9da0-6e060c7963f8$9bce6fdb-2cbc-4758-9a8b-794e490c973d$1ce4bc6c-7cde-48e9-8ff1-7281697fd121$bb1ef180-39ac-475f-beea-ef573e71a3bf$a8349352-3242-46d5-b0d5-1b6eb5d77e90$2e7c737c-c798-4442-a7e1-d74ccfd73119$d4e87ac4-6008-43b2-aa06-e232ec2b2b5b$f7f58fd2-facc-4b87-9172-5e911677c8f4$d21617aa-6f38-4a90-8586-4b32022497ad$700dcbc4-c94c-4287-8cf0-0b2c7a320a3a$4f96be72-ef3e-4e08-ac4c-be4271dcd14c$54f1546d-87ae-49d2-92ed-6fcc9b66e027$c5dd7e99-57e0-4bc7-97d2-2c780b23bcff$2025ff38-f2ec-4224-b771-ff72ffe1af28$77906355-08f8-4b08-b051-84697199b519$023f67b8-8f38-470a-9766-ac60a75678aa$7c592385-e8d3-4efe-962c-d39debb64405$d9d11d69-bc16-400a-8f46-f9a8ecb8516a$192cc1cf-9ea1-492d-baa7-f2e197abecd4$4c5cb75e-79b5-4502-b1eb-6246e002feaf$8eb42403-1234-4e59-993e-057cc3a6d5c9$6d0925d3-af96-4b94-8e2e-494
1cce39e51$dc2efc6c-8da8-425b-aa5f-290949109565$ddbca73f-c692-46f2-95f3-a7dd849d33f7$786a5385-b648-4fc3-8e19-bf6582828136$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbf$38acd032-1d18-4760-9111-67c9cdd2e892$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2$349631b2-4686-49a9-9f3a-1e4ad588b568$ac9c8845-284d-4c21-b05d-d930f86598a3$71a5fce8-6d9a-4625-bad1-a951d61bff28$b53dba81-a9e9-41da-8fc2-7736bf25f2dc$b8532822-179b-4cd5-a279-4b71dafb544a$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580a$c87dba8c-9a96-41b3-9dc7-a6c088ec1eaf$fee14dfe-c5ca-4126-a830-cc9d7eda5433$cd9c9eeb-c90d-4499-9503-7773d5250f47$b695ef21-a1ac-4d1f-a0e1-71cd81cede18$e524f8cc-ab69-4f8b-a59f-28156696a104$0d45ae72-572f-4d17-83cf-9814f2854131$0d93132d-5819-47dc-8cf2-462d480d9c3d$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7d$a0ca7a5e-0089-4a45-9278-c0f27cd096a0$5207308e-f636-4d47-b135-036a6e7b8ecd$ff3009eb-23f9-44fe-8e56-85dbc7b463d0$b7f77935-bcab-4ef1-8e1b-a7d059784ff3$6c5e9bb2-4c38-4613-9652-dec99e97b512$f8215517-b18f-4a03-9421-8edab4ca8089$d2729657-d0bf-4d39-8ec7-f242a1ad48d6$8e096fae-9941-49d8-ae87-c68b02f68da5$44f14d4f-7414-4c6f-883a-042ca261a403$6c5f51bb-a6be-447e-b73d-4f9c2885e809$4156d955-9daf-4429-b152-e8332980fb9e$d82e7ab8-c372-4462-afb5-1617560cdb56$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8f$16113560-e911-47b4-abc4-641bbd246454$a7891c63-18d6-4c1f-ba67-adf7c547d334$7126aefd-b847-497a-9545-514e9b9afa71$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09$f9facbba-39d4-483e-9066-275603156db0$bbc8864a-1545-433f-bc7c-0ddf6e907138$68469a40-7976-48b7-b7a1-eaa4c5f33a18$ba645f6b-143f-4e83-9003-707770ae308d$b5319d8b-0420-4ebf-b603-ea0b93365ac1$4c34640f-efa2-4e1d-8a70-0acd2ce45428$f7ede764-5ad8-426b-a805-cc21b622d977$3ea08816-705e-4be7-a175-dbd3f3e4c17d$5d50a5d0-8fe2-4c6e-b76c-d5614e4fd884$0ab70fc3-6188-42eb-aba2-d808f319be9f$df7f84e8-b42a-4001-9dbf-6bc3ced94207$d963ff6d-f1b6-4799-aa0e-1ae100310d84$7cf26604-9c2b-4a77-9674-7d4dac2f99f0$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb$f59a5dcd-9f4a-4336-a391-e64af35ef799published_objectsnbpkginstall_
time_nsinstantiatedòinstalled_versionsStatisticsstdlibTransducers0.4.84StatsBase0.33.21Memoize0.4.4Distributions0.25.117PlutoUI0.7.61ProgressLogging0.1.4RandomstdlibBenchmarkTools1.3.2StaticArrays1.5.26LinearAlgebrastdlibPlutoDevMacros0.9.0DataFrames1.7.0PlutoProfile0.4.0HypertextLiteral0.9.5SpecialFunctions2.5.0LaTeXStrings1.3.1PlutoPlotly0.3.9terminal_outputsStatistics Resolving... ===  Installed PDMats ────────────────── v0.11.32 Installed Crayons ───────────────── v4.1.1  Installed HypergeometricFunctions ─ v0.3.27  Installed Accessors ─────────────── v0.1.41  Installed PlotlyBase ────────────── v0.8.20  Installed SpecialFunctions ──────── v2.5.0  Installed PrettyTables ──────────── v2.4.0  Installed MIMEs ─────────────────── v1.0.0  Installed JuliaInterpreter ──────── v0.9.41  Installed InvertedIndices ───────── v1.3.1  Installed PlutoUI ───────────────── v0.7.61  Installed StaticArrays ──────────── v1.5.26  Installed DataFrames ────────────── v1.7.0  Installed Memoize ───────────────── v0.4.4  Installed Distributions ─────────── v0.25.117  Installed StringManipulation ────── v0.4.1  Installed MacroTools ────────────── v0.5.15  No Changes to `/tmp/jl_E3c5X0/Project.toml`   No Changes to `/tmp/jl_E3c5X0/Manifest.toml` Instantiating... === Precompiling... ===  Activating project at `/tmp/jl_E3c5X0` Precompiling project... 47 dependencies successfully precompiled in 90 seconds. 108 already precompiled. 1 dependency had output during precompilation: ┌ PlutoPlotly │ ┌ Warning: You are trying to show a PlutoPlot outside of Pluto, this is not the intended behavior and you should use either PlotlyBase or PlotlyJS directly. │ │ NOTE: If you receive this warning during pre-compilation or sysimage creation, you can ignore this warning. │ └ @ PlutoPlotly ~/.julia/packages/PlutoPlotly/5DpMg/notebooks/wrapper.jl:43 └ Transducers Resolving... 
===  Installed PDMats ────────────────── v0.11.32 Installed Crayons ───────────────── v4.1.1  Installed HypergeometricFunctions ─ v0.3.27  Installed Accessors ─────────────── v0.1.41  Installed PlotlyBase ────────────── v0.8.20  Installed SpecialFunctions ──────── v2.5.0  Installed PrettyTables ──────────── v2.4.0  Installed MIMEs ─────────────────── v1.0.0  Installed JuliaInterpreter ──────── v0.9.41  Installed InvertedIndices ───────── v1.3.1  Installed PlutoUI ───────────────── v0.7.61  Installed StaticArrays ──────────── v1.5.26  Installed DataFrames ────────────── v1.7.0  Installed Memoize ───────────────── v0.4.4  Installed Distributions ─────────── v0.25.117  Installed StringManipulation ────── v0.4.1  Installed MacroTools ────────────── v0.5.15  No Changes to `/tmp/jl_E3c5X0/Project.toml`   No Changes to `/tmp/jl_E3c5X0/Manifest.toml` Instantiating... === Precompiling... ===  Activating project at `/tmp/jl_E3c5X0` Precompiling project... 47 dependencies successfully precompiled in 90 seconds. 108 already precompiled. 1 dependency had output during precompilation: ┌ PlutoPlotly │ ┌ Warning: You are trying to show a PlutoPlot outside of Pluto, this is not the intended behavior and you should use either PlotlyBase or PlotlyJS directly. │ │ NOTE: If you receive this warning during pre-compilation or sysimage creation, you can ignore this warning. 
│ └ @ PlutoPlotly ~/.julia/packages/PlutoPlotly/5DpMg/notebooks/wrapper.jl:43 └ enabled÷restart_recommended_msgrestart_required_msgbusy_packageswaiting_for_permission,waiting_for_permission_but_probably_disabled«cell_inputs
$4f96be72-ef3e-4e08-ac4c-be4271dcd14ccell_id$4f96be72-ef3e-4e08-ac4c-be4271dcd14ccodemetadatashow_logsèdisabled®skip_as_script«code_folded
$19dfabda-7049-4050-8662-0385529c0c5acell_id$19dfabda-7049-4050-8662-0385529c0c5acode
@bind sref_cartpole_binary PlutoUI.combine() do Child
    md"""
    x position: $(Child(:x, Slider(-50f0:50f0, default = 0f0, show_value=true)))
    x velocity: $(Child(:ẋ, Slider(-50f0:50f0, default = 0f0, show_value=true)))
    """
end |> confirm
metadatashow_logsèdisabled®skip_as_script«code_folded
$b71145a4-2614-4f62-bfd2-7d5d1fecec56cell_id$b71145a4-2614-4f62-bfd2-7d5d1fecec56code
#version of reinforce for general function approximation
function actor_critic_with_eligibility_traces!(policy_params::P1, ∇lnπ, value_params::P2, ∇v̂, mdp::ContinuousMDP{T, S, A, PTF, F1, F2, F3}, λ_θ::T, λ_w::T, update_action_distribution!::Function, action_dist_params::Vector{T}, action_sampler::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, value_function::Function, update_value_gradient!::Function, max_episodes::Integer, max_steps::Integer;
        α_w::T = one(T)/10, α_θ::T = one(T)/10, γ::T = one(T), z_θ::P1 = deepcopy(policy_params), z_w::P2 = deepcopy(value_params), save_step_rewards = false) where {P1, P2, T<:Real, S, A, PTF, F1, F2, F3}
    step_rewards = Vector{T}()
    episode_steps = Vector{Int64}()
    episode_rewards = Vector{T}()

    #initialize variables
    ep = 1
    step = 1
    rtot = zero(T)
    c = one(T)
    zero_params!(z_θ)
    zero_params!(z_w)
    s = mdp.initialize_state()
    update_feature_vector!(x, s)
    while (ep <= max_episodes) && (step <= max_steps)
        update_value_gradient!(∇v̂, x, value_params)
        v̂ = value_function(x, value_params)
        update_action_distribution!(action_dist_params, x, policy_params)
        a = action_sampler(action_dist_params)
        if bad_continuous_action(a)
            @info "terminating after $step steps and episode $ep due to invalid continuous action $a taken in state $s with action distribution parameters $action_dist_params"
            push!(episode_steps, max_steps)
            push!(episode_rewards, typemin(T))
            break
        end
        update_eligibility_vector!(∇lnπ, action_dist_params, x, a, policy_params)
        (r, s′) = mdp.ptf(s, a)
        rtot += r
        save_step_rewards && push!(step_rewards, r)
        step += 1
        if mdp.isterm(s′)
            push!(episode_steps, step)
            push!(episode_rewards, rtot)
            v̂′ = zero(T)
            rtot = zero(T)
            zero_params!(z_θ)
            zero_params!(z_w)
            ep += 1
            c = one(T)
            s = mdp.initialize_state()
            update_feature_vector!(x, s)
        else
            update_feature_vector!(x, s′)
            v̂′ = value_function(x, value_params)
            s = s′
            c *= γ
        end
        δ = r + γ*v̂′ - v̂
        update_traces_with_gradient!(γ*λ_w, z_w, ∇v̂)
        update_traces_with_gradient!(γ*λ_θ, z_θ, c, ∇lnπ)
        update_params_with_gradient!(value_params, α_w*δ, z_w)
        update_params_with_gradient!(policy_params, α_θ*c*δ, z_θ)
    end
    function_outputs = form_state_and_policy_function_outputs(update_feature_vector!, update_action_distribution!, action_dist_params, action_sampler, value_function, x, policy_params, value_params)
    return (;step_rewards = step_rewards, episode_steps = episode_steps, episode_rewards = episode_rewards, policy_parameters = policy_params, value_parameters = value_params, function_outputs...)
end
metadatashow_logsèdisabled®skip_as_script«code_folded
$c0876a48-ea18-494d-8bfc-e2bceb73b417cell_id$c0876a48-ea18-494d-8bfc-e2bceb73b417codeه
plot_mountaincar_values(mountaincar_continuing_fcann_test.estimate_state_value, mountaincar_continuing_fcann_test.policy_sample_action)
metadatashow_logsèdisabled®skip_as_script«code_folded
$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091cell_id$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091code
function reinforce_monte_carlo_control_fcann(mdp::StateMDP{T, S, A, P, F1, F2, F3}, input_length::Integer, hidden_layers::Vector{Int64}, update_feature_vector!::Function, max_episodes::Integer;
        params::FCANNParams = FCANN.initializeparams_saxe(input_length, hidden_layers, length(mdp.actions)), reslayers = 0, l2 = 0f0, dropout = 0f0, use_μP = true, activation_list = fill(true, length(hidden_layers)), kwargs...) where {T<:Real, S, A, P, F1, F2, F3}
    setup = setup_fcann_policy_arguments(params, input_length, hidden_layers, reslayers, l2, dropout, use_μP, activation_list)
    reinforce_monte_carlo_control!(params, setup.eligibility_vector, mdp, setup.update_action_preferences!, setup.update_eligibility_vector!, setup.feature_vector, update_feature_vector!, max_episodes; kwargs...)
endmetadatashow_logsèdisabled®skip_as_script«code_folded$f4b6f10b-4cd0-4be6-98ec-4d4ffb696392cell_id$f4b6f10b-4cd0-4be6-98ec-4d4ffb696392code4md""" ### *One-step Actor-Critic Implementation* """metadatashow_logsèdisabled®skip_as_script«code_folded$9db9ff71-bee9-4bea-a45b-748f8517fed1cell_id$9db9ff71-bee9-4bea-a45b-748f8517fed1codeٺone_step_actor_critic_linear_features(corridor_mdp, update_corridor_features!, 1, typemax(Int64), 100_000, α_θ = 2f0^-8, α_w = 2f0^-8, policy_params = [0f0 3.7f0]).policy_and_value(1)metadatashow_logsèdisabled®skip_as_script«code_folded$4634267b-5dea-4164-8bb2-1eb2fd4d7954cell_id$4634267b-5dea-4164-8bb2-1eb2fd4d7954codefunction update_linear_eligibility_vector!(∇lnπ::Matrix{T}, action_preferences::Vector{T}, x::Vector{T}, i_a::Integer, params::Matrix{T}) where T<:AbstractFloat update_linear_action_preferences!(action_preferences, x, params) soft_max!(action_preferences) BLAS.gemm!('N', 'T', -one(T), x, action_preferences, zero(T), ∇lnπ) @inbounds @simd for i in eachindex(x) ∇lnπ[i, i_a] += x[i] end endmetadatashow_logsèdisabled®skip_as_script«code_folded$6c5f51bb-a6be-447e-b73d-4f9c2885e809cell_id$6c5f51bb-a6be-447e-b73d-4f9c2885e809codeactor_critic_binary_episodic_beta_parameter_study(mountaincar_continuous_mdp, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, mountaincar_binary_continuous_params2, 5, 3, 10000; max_steps = 100_000)metadatashow_logsèdisabledîskip_as_script«code_folded$cc45091e-b889-4d5a-9eef-84d80f792046cell_id$cc45091e-b889-4d5a-9eef-84d80f792046codemd""" ## 13.4 REINFORCE with Baseline The policy gradient theorem (13.5) can be generalized to include a comparison of the action value to an arbitrary *baseline* b(s): $\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s)\sum_a\left( q_\pi(s,a)-b(s) \right ) \nabla\pi(a|s,\boldsymbol{\theta}) \tag{13.10}$ The baseline can be any function, even a random variable, as long as it does not vary with $a$; the equation remains valid because 
the subtracted quantity is zero: $\sum_ab(s)\nabla\pi(a|s,\boldsymbol{\theta})=b(s)\nabla\sum_a\pi(a|s,\boldsymbol{\theta})=b(s)\nabla1=0$ The policy gradient theorem with baseline (13.10) can be used to derive an update rule using similar steps as in the previous section. The update rule that we end up with is a new version of REINFORCE that includes a general baseline: $\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t+\alpha(G_t-b(S_t))\frac{\nabla\pi(A_t|S_t,\boldsymbol{\theta}_t)}{\pi(A_t|S_t,\boldsymbol{\theta}_t)} \tag{13.11}$ Since the baseline could be uniformly zero, this is a strict generalization of REINFORCE. To have an effective baseline that depends on state we can use a state value estimate that is also updated with gradient steps: $\hat v(S_t, \mathbf{w})$. Using such an estimate we can revise the previous REINFORCE algorithm. """metadatashow_logsèdisabled®skip_as_script«code_folded$5b15d91e-7119-4f85-a54a-7d4f1fdaf097cell_id$5b15d91e-7119-4f85-a54a-7d4f1fdaf097codefunction create_actor_critic_continuing_params_UI(;λ_θ = 0.5f0, λ_w = 0.5f0, log2α_θ = -10, log2α_w = -10, α_r̄ = 0.005f0) PlutoUI.combine() do Child md""" $$\lambda_\theta$$: $(Child(:λ_θ, Slider(0.00f0:0.001f0:.999f0, default = λ_θ, show_value=true))) $$\lambda_\mathbf{w}$$: $(Child(:λ_w, Slider(0.00f0:0.001f0:.999f0, default = λ_w, show_value=true))) $$\alpha_{\overline{r}}$$: $(Child(:α_r̄, NumberField(0.00f0:0.001f0:1f0, default = α_r̄))) $$\log_2 \alpha_\theta$$ min: $(Child(:α_θ_min, NumberField(-100:0, default = log2α_θ))) $$\log_2 \alpha_{\mathbf{w}}$$ min: $(Child(:α_w_min, NumberField(-100:0, default = log2α_w))) """ end |> confirm endmetadatashow_logsèdisabled®skip_as_script«code_folded$ba41f521-4ee2-42a6-bf18-078bfa4b875ecell_id$ba41f521-4ee2-42a6-bf18-078bfa4b875ecodebegin make_n_param_dist_policy_params(n::Integer, num_features::Integer, ::T) where T<:Real = zeros(T, num_features, n) make_n_param_dist_policy_params(n::Integer, num_features::Integer, ::NTuple{N, T}) 
where {N, T<:Real} = zeros(T, num_features, n*N) endmetadatashow_logsèdisabled®skip_as_script«code_folded$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03cell_id$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03codeYfunction cartpole_tilecoding_reinforce_parameter_study(α1_list, α2_list, max_episodes; num_trials = 100, kwargs...) setup = setup_cartpole_problem(;kwargs...) traces = [begin steps = [begin 1:num_trials |> Map() do i solution = reinforce_with_baseline_monte_carlo_control_binary_features(cartpole_setup.mdps.episodic.discrete, cartpole_setup.get_active_features, cartpole_setup.num_features, max_episodes; α_θ = α1, α_w = α2) steps = solution.episode_steps isempty(steps) && return max_steps mean(steps) end |> foldxt(+) |> x -> x / num_trials end for α1 in α1_list] scatter(x = α1_list, y = steps, name = "α_w = $α2") end for α2 in α2_list] plot(traces, Layout(xaxis_title = "Policy Learning Rate α_θ", yaxis_title = "Average Episode Duration Over First $max_episodes Episodes", xaxis_type = "log")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$3c695d54-c30f-4f04-bd40-f5da53be2a95cell_id$3c695d54-c30f-4f04-bd40-f5da53be2a95code/md""" ### *Cart Pole Continuous Action MDP* """metadatashow_logsèdisabled®skip_as_script«code_folded$0d45ae72-572f-4d17-83cf-9814f2854131cell_id$0d45ae72-572f-4d17-83cf-9814f2854131codeg@bind mountaincar_binary_continuous_params2 create_actor_critic_params_UI(λ_θ = 0.05f0, λ_w = 0.8f0)metadatashow_logsèdisabled®skip_as_script«code_folded$cd9c9eeb-c90d-4499-9503-7773d5250f47cell_id$cd9c9eeb-c90d-4499-9503-7773d5250f47codeىshow_mountaincar_continuous_trajectory(mountaincar_continuous_test_train2.policy_sample_action, 1_000; mdp = mountaincar_continuous_mdp2)metadatashow_logsèdisabled®skip_as_script«code_folded$fd58402f-da65-44cf-b81a-e21192fd0e63cell_id$fd58402f-da65-44cf-b81a-e21192fd0e63codeمplot_cartpole_policy(cartpole_continuing_fcann_test.policy_and_value; s_ref = 
CartPoleState(cartpole_fcann_continuing_test_state...))metadatashow_logsèdisabled®skip_as_script«code_folded$8e39bd15-862e-4941-88f9-2794b861a523cell_id$8e39bd15-862e-4941-88f9-2794b861a523codereinforce_monte_carlo_control_linear_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, max_episodes::Integer; params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), x::Vector{T} = zeros(T, num_features), ∇lnπ::Matrix{T} = copy(params), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} = reinforce_monte_carlo_control!(params, ∇lnπ, mdp, update_linear_action_preferences!, update_linear_eligibility_vector!, x, update_feature_vector!, max_episodes; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$64900586-ef92-48e4-839e-ff952a46671bcell_id$64900586-ef92-48e4-839e-ff952a46671bcodetest_study = actor_critic_linear_parameter_study(cartpole_continuing_mdp, s -> cartpole_tilecoding_setup.get_active_features((s.x, s.θ, s.ẋ, s.θ̇)), cartpole_tilecoding_setup.num_features, LinRange(.5f0, .95f0, 10), LinRange(0.5f0, .95f0, 10), [0.005f0, 0.01f0, 0.05f0], 2f0 .^ (-5:-1), 2f0 .^ (-10:-5), 100, 10_000; nruns = 40, seed = 45, binary_features = true) |> df -> sort(df, :output; rev=true)metadatashow_logsèdisabledîskip_as_script«code_folded$fddef10c-7695-4596-9e16-987fd45a57e6cell_id$fddef10c-7695-4596-9e16-987fd45a57e6codefunction setup_cartpole_continuous_problem(;h = 4f-2, f = 300f0, x_max = 50f0, θ_max = deg2rad(70f0), ẋ_max = 50f0, θ̇_max = 10f0, num_tiles = (8, 8, 8, 8), num_tilings = 8, kwargs...) tile_size = Tuple(1f0 / n for n in num_tiles) min_vals = (-x_max, -θ_max, -ẋ_max, -θ̇_max) max_vals = (x_max, θ_max, ẋ_max, θ̇_max) setup = tile_coding_setup(min_vals, max_vals, tile_size, num_tilings, (1, 3, 5, 7)) init_θ() = rand([-0.05f0, 0.05f0]) mdps = create_cartpole_mdps(h = h, fmax = f, x_max = x_max, θ_max = θ_max, ẋ_max = ẋ_max, θ̇_max = θ̇_max, init_θ = init_θ, kwargs...) 
(mdps = mdps, get_active_features = s -> setup.get_active_features((s.x, s.θ, s.ẋ, s.θ̇)), num_features = setup.num_features, min_vals = min_vals, max_vals = max_vals) endmetadatashow_logsèdisabled®skip_as_script«code_folded$e2b09af1-0f22-4f7f-b806-54fa522adb20cell_id$e2b09af1-0f22-4f7f-b806-54fa522adb20code#note that due to the feature construction in this problem, the bootstrapping estimate is worthless so we'd expect this to do poorly at this task. Imagine that the value function starts out perfectly accurate for the initial policy with bad parameterization which we know finishes episodes in 90 to 100 steps. Then the δ on each step is just -1. Given the policy initialization, this is very poor progress since the improvement is only towards something still with worse performance than the completely random policymetadatashow_logsèdisabled®skip_as_script«code_folded$2be8a812-4f21-4fe8-a2de-50497db0345acell_id$2be8a812-4f21-4fe8-a2de-50497db0345acodeHmd""" ### *Actor-Critic Implementation for Continuous Action Spaces* """metadatashow_logsèdisabled®skip_as_script«code_folded$68806899-9972-460a-9f11-daa708a9d610cell_id$68806899-9972-460a-9f11-daa708a9d610codeactor_critic_with_eligibility_traces_linear_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, λ_θ::T, λ_w::T, update_feature_vector!::Function, num_features::Integer, args...; policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), value_params::Vector{T} = zeros(T, num_features), x = zeros(T, num_features), action_preferences = zeros(T, length(mdp.actions)), kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_with_eligibility_traces!(policy_params, copy(policy_params), value_params, copy(value_params), mdp, λ_θ, λ_w, update_linear_action_preferences!, update_linear_eligibility_vector!, x, update_feature_vector!, linear_value_function, update_linear_value_gradient!, args...; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$189798b3-ec6b-48b9-918c-ee0f65935ab3cell_id$189798b3-ec6b-48b9-918c-ee0f65935ab3codemd""" > ### *Exercise 13.3* > In Section 13.1 we considered policy parameterizations using the soft-max in action preferences (13.2) with linear action preferences (13.3). For this parameterization, prove that the eligibility vector is > $\begin{flalign} > \nabla \ln \pi(a|s, \boldsymbol{\theta}) = \mathbf{x}(s, a) - \sum_b \pi(b|s, \boldsymbol{\theta}) \mathbf{x}(s, b) \tag{13.9} > \end{flalign}$ > using the definitions and elementary calculus. """metadatashow_logsèdisabled®skip_as_script«code_folded$00152954-dc98-4120-b94b-2ea4d987832bcell_id$00152954-dc98-4120-b94b-2ea4d987832bcodefunction create_mountaincar_continuing_mdp() ptf = StateMDPTransitionSampler(mountaincar_continuing_step, (0f0, 0f0)) StateMDP(MountainCarTask.actions, ptf, MountainCarTask.initialize_state) endmetadatashow_logsèdisabled®skip_as_script«code_folded$42d4600a-bf3c-45ac-b7f5-d23917713ff5cell_id$42d4600a-bf3c-45ac-b7f5-d23917713ff5code@bind cartpole_continuing_fcann_network_params PlutoUI.combine() do Child md""" Layer Size: $(Child(NumberField(1:128, default = 4))) Num Layers: $(Child(NumberField(1:10, default = 2))) """ end |> confirmmetadatashow_logsèdisabled®skip_as_script«code_folded$4e29c621-223e-4859-8e96-db04b967815acell_id$4e29c621-223e-4859-8e96-db04b967815acodefunction setup_binary_squashed_gaussian_policy_arguments(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, amax::A, get_active_features::Function, num_features::Integer) where {T<:Real, S, N, A<:Union{T, NTuple{N, T}}, P, F1, F2, F3} x = BinaryFeatureVector() 
update_feature_vector!(x::BinaryFeatureVector, s) = update_binary_feature_vector!(x, s, get_active_features) sample_action = rand(A) action_dist_params = make_n_param_dist_params(2, sample_action) ∇lnπ = BinarySquashedGaussianEligibilityVector(sample_action, amax) return (feature_vector = x, update_feature_vector! = update_feature_vector!, action_distribution_parameters = action_dist_params, eligibility_vector = ∇lnπ) endmetadatashow_logsèdisabled®skip_as_script«code_folded$5981f52b-d829-4c7d-b47b-33310f7d64a2cell_id$5981f52b-d829-4c7d-b47b-33310f7d64a2codeRmake_ϵ_greedy_policy!(corridor_train.value_function(1).action_values; ϵ = 0.0f0)metadatashow_logsèdisabled®skip_as_script«code_folded$0e9de19e-bcd4-40ac-9831-afb6cad38422cell_id$0e9de19e-bcd4-40ac-9831-afb6cad38422codefunction setup_fcann_policy_arguments(params::FCANNParams{T}, input_length::Integer, hidden_layers::Vector{Int64}, reslayers::Integer, l2::T, dropout::T, use_μP::Bool, activation_list) where {T<:Real} x = zeros(T, input_length) activations = FCANN.form_activations(params[1]) tanh_grad_z = deepcopy(activations) deltas = deepcopy(activations) scales = fill(one(T), length(params[1])) if use_μP for i in eachindex(hidden_layers) i′ = i + 1 scales[i′] /= size(params[1][i′], 2) end end ∇lnπ = deepcopy(params) update_eligibility_vector!(∇lnπ::FCANNParams, action_preferences::Vector{T}, x, i_a, params::FCANNParams) = update_fcann_eligibility_vector!(∇lnπ, action_preferences, x, i_a, params, hidden_layers, l2, tanh_grad_z, activations, deltas, dropout, reslayers, activation_list, scales) update_action_preferences!(action_preferences::Vector{T}, x::Vector{T}, params::FCANNParams) = update_fcann_action_preferences!(action_preferences, x, params, activations, reslayers) return (feature_vector = x, params = params, eligibility_vector = ∇lnπ, update_eligibility_vector! = update_eligibility_vector!, update_action_preferences! 
= update_action_preferences!, scales = scales) endmetadatashow_logsèdisabled®skip_as_script«code_folded$ff3009eb-23f9-44fe-8e56-85dbc7b463d0cell_id$ff3009eb-23f9-44fe-8e56-85dbc7b463d0codeufunction show_squashed_policy(π::Function, s) pdist=π(s) plot_squashed_gaussian(pdist[1], exp(pdist[2]), 1f0) endmetadatashow_logsèdisabled®skip_as_script«code_folded$4fb83451-b6f8-4e6e-a131-1accc8e10b08cell_id$4fb83451-b6f8-4e6e-a131-1accc8e10b08code B#version of reinforce for general function approximation function reinforce_with_baseline_monte_carlo_control!(policy_params, ∇lnπ, value_params, ∇v̂, mdp::StateMDP{T, S, A, PTF, F1, F2, F3}, update_action_preferences!::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, value_function::Function, update_value_gradient!::Function, max_episodes::Integer; α_w::T = one(T)/10, α_θ::T = one(T)/10, γ::T = one(T), action_preferences = zeros(T, length(mdp.actions)), epkwargs...) where {T<:Real, S, A, PTF, F1, F2, F3} rewards = zeros(T, max_episodes) steps = zeros(Int64, max_episodes) π! = form_state_policy_function(update_feature_vector!, update_action_preferences!) π(s) = π!(x, action_preferences, s, policy_params) π_sample(s) = sample_action(π(s)) v! = form_state_value_function(update_feature_vector!, value_function) estimate_state_value(s) = v!(x, s, value_params) state_history, action_history, reward_history, _, _ = runepisode(mdp; π = π_sample, max_steps = 0) #initialize variables to update episodes for i in eachindex(rewards) # @info "On episode $i of $max_episodes" state_history, action_history, reward_history, sterm, nsteps = runepisode!((state_history, action_history, reward_history), mdp; π = π_sample, epkwargs...) 
g = zero(T) rtotal = zero(T) #iterate through episode beginning at the end for i in nsteps:-1:1 g = (γ * g) + reward_history[i] update_feature_vector!(x, state_history[i]) v̂ = value_function(x, value_params) δ = g - v̂ update_value_gradient!(∇v̂, x, value_params) c = α_w*δ update_params_with_gradient!(value_params, c, ∇v̂) update_eligibility_vector!(∇lnπ, action_preferences, x, action_history[i], policy_params) c = α_θ * γ^(i-1) * δ update_params_with_gradient!(policy_params, c, ∇lnπ) rtotal += reward_history[i] end rewards[i] = rtotal steps[i] = nsteps end π2(s; feature_vector = deepcopy(x), action_preferences = copy(action_preferences)) = π!(feature_vector, action_preferences, s, policy_params) π_sample2(s; kwargs...) = sample_action(π2(s; kwargs...)) function policy_and_value(s::S) π!(x, action_preferences, s, policy_params) v̂ = value_function(x, value_params) return (action_probabilities = action_preferences, state_value_estimate = v̂) end return (episode_rewards = rewards, episode_steps = steps, policy_function = π2, policy_sample_action = π_sample2, policy_parameters = policy_params, estimate_state_value = estimate_state_value, value_parameters = value_params, policy_and_value = policy_and_value) endmetadatashow_logsèdisabled®skip_as_script«code_folded$406638af-1e08-44d2-9ee4-97aa9294a94bcell_id$406638af-1e08-44d2-9ee4-97aa9294a94bcode-md""" ## 13.2 The Policy Gradient Theorem """metadatashow_logsèdisabled®skip_as_script«code_folded$57e5e12a-b722-4ea3-ab3b-e5711029e640cell_id$57e5e12a-b722-4ea3-ab3b-e5711029e640codeone_step_actor_critic_linear_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, max_episodes::Integer, max_steps::Integer; policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), value_params::Vector{T} = zeros(T, num_features), x = zeros(T, num_features), action_preferences = zeros(T, length(mdp.actions)), kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = one_step_actor_critic!(policy_params, copy(policy_params), value_params, copy(value_params), mdp, update_linear_action_preferences!, update_linear_eligibility_vector!, x, update_feature_vector!, linear_value_function, update_linear_value_gradient!, max_episodes, max_steps; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$374af774-3a97-49b5-a3bb-bc3f7f63a3facell_id$374af774-3a97-49b5-a3bb-bc3f7f63a3facode)plot_cart(ep[1][ep_step], ep[2][ep_step])metadatashow_logsèdisabled®skip_as_script«code_folded$7bf209c8-ef0a-46d1-937e-b1a6e45dc62ecell_id$7bf209c8-ef0a-46d1-937e-b1a6e45dc62ecode٨@bind beta_params PlutoUI.combine() do Child md""" α: $(Child(Slider(0.01:0.1:100; show_value=true))) β: $(Child(Slider(0.01:0.1:100, show_value=true))) """ endmetadatashow_logsèdisabled®skip_as_script«code_folded$dd8e8cd2-7b41-46c4-8530-adefb7aea684cell_id$dd8e8cd2-7b41-46c4-8530-adefb7aea684codegfunction actor_critic_binary_episodic_beta_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, λ_θ::T, λ_w::T, α_θ_list::AbstractVector{T}, α_w_list::AbstractVector{T}, max_episodes::Integer; nruns::Integer = 100, max_steps::Integer = 10_000, seed = rand(UInt64), init_policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), init_value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) function average_runs(α_θ, α_w) 1:nruns |> Map(_ -> actor_critic_with_eligibility_traces_binary_features_beta_actions(mdp, λ_θ, λ_w, get_active_features, num_features, max_episodes, max_steps; α_θ = α_θ, α_w = α_w, policy_params = copy(init_policy_params), value_params = copy(init_value_params), kwargs...) |> x -> isempty(x.episode_rewards) ? 
-T(Inf) : mean(x.episode_rewards)) |> foldxt(+) |> x -> x / nruns end traces = [begin scatter(x = α_θ_list, y = average_runs.(α_θ_list, α_w), name = "α_w = $α_w") end for α_w in α_w_list] plot(traces, Layout(xaxis_title = "α_θ", yaxis_title = "Average Reward Per Episode in the First
$max_episodes Episodes Averaged Over $nruns Runs", xaxis_type = "log", title = "Binary Feature Encoding with $num_features Features, λ_θ = $λ_θ, λ_w = $λ_w")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$4fea7232-f286-4a8b-93f8-a0702818ab31cell_id$4fea7232-f286-4a8b-93f8-a0702818ab31code8md""" #### Test Actor-Critic with Eligibility Traces """metadatashow_logsèdisabled®skip_as_script«code_folded$26880577-d267-4950-8725-7afe0d0402b6cell_id$26880577-d267-4950-8725-7afe0d0402b6code:const cartpole_setup = setup_cartpole_continuous_problem()metadatashow_logsèdisabled®skip_as_script«code_folded$a7891c63-18d6-4c1f-ba67-adf7c547d334cell_id$a7891c63-18d6-4c1f-ba67-adf7c547d334codeُ@bind fcann_mountaincar_study_params create_actor_critic_fcann_params_UI(;λ_θ = 0.5f0, λ_w = 0.5f0, h = 16, log2α_θ = -10, log2α_w = -11)metadatashow_logsèdisabledîskip_as_script«code_folded$44f14d4f-7414-4c6f-883a-042ca261a403cell_id$44f14d4f-7414-4c6f-883a-042ca261a403codeK@bind mountaincar_binary_continuous_params2 create_actor_critic_params_UI()metadatashow_logsèdisabledîskip_as_script«code_folded$94354552-9920-4b90-98d9-f75286d1f53ecell_id$94354552-9920-4b90-98d9-f75286d1f53ecode`corridor_parameter_studies(1.5f0 .^(-24:-20), 1.25f0 .^ (-27:-20), 2f0 .^(-3:-1); nruns = 1_000)metadatashow_logsèdisabled®skip_as_script«code_folded$e5faaa1b-88cb-43e2-8d04-8972b58b4bdacell_id$e5faaa1b-88cb-43e2-8d04-8972b58b4bdacodebegin v1(p) = -2*(1+p)/((1-p)*p) v2(p) = -(p+2)/((1-p)*p) v3(p) = -3/(1-p) plist = 0.:0.001:1. 
traces = [scatter(x = plist, y = f.(1 .- plist), name = n) for (f, n) in zip([v1, v2, v3], ["V(S1)", "V(S2)", "V(S3)"])] plot(traces, Layout(font_color = "LightGray", plot_bgcolor = bgcolor, paper_bgcolor = "rgb(40, 40, 40)", yaxis_range = [-100, 0], xaxis_title = "probability of right action", yaxis_title = "State Value", width = 900, height = 600)) endmetadatashow_logsèdisabled®skip_as_script«code_folded$70096b14-beab-4f71-9886-6355c749bb8acell_id$70096b14-beab-4f71-9886-6355c749bb8acodemd""" We previously derived an expression for the gradient of the policy itself in the case of linear action preferences: $\begin{flalign} h_a &= \boldsymbol{\theta}^\top \mathbf{x}(s, a) \\ \pi_a &= \frac{e^{h_a}}{\sum_k e^{h_k}} \\ \nabla(\pi_a)_i &= \pi_a \left ( \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \right) \end{flalign}$ Applying the chain rule to the natural logarithm produces: $\nabla \left ( \ln f(\theta) \right) = \frac{\nabla f(\theta)}{f(\theta)} \implies \nabla \left ( \ln f(\theta) \right )_i = \frac{\nabla \left ( f(\theta) \right )_i}{f(\theta)}$ Applying this to the above expression yields: $\begin{flalign} \nabla \left ( \ln \pi_a \right )_i &= \frac{\nabla \left ( \pi_a \right )_i}{\pi_a} \\ &= \frac{\pi_a \left ( \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \right)}{\pi_a} \\ &= \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \end{flalign}$ which is the per component version of the desired vector expression. """metadatashow_logsèdisabled®skip_as_script«code_folded$90d3b96b-ad2b-405c-951b-f48ec7ccf24acell_id$90d3b96b-ad2b-405c-951b-f48ec7ccf24acodemd""" The final expected value expression (13.5) can be sampled on a step by step basis during an episode since we would have access to both the step count and some unbiased sample of the state-action value. 
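As a rough illustration of that step-by-step sampling, the sketch below computes, for one recorded episode, the return $G_t$ following each step and the scalar $\gamma^t G_t$ that would multiply the eligibility vector at that step. The helper names `discounted_returns` and `sampled_gradient_weights` are hypothetical and are not defined elsewhere in this notebook; the sketch assumes the full reward sequence is available once the episode ends.

```julia
# Hypothetical helper (not part of this notebook): discounted return G_t for every step of one episode
function discounted_returns(rewards::Vector{T}, γ::T) where T<:Real
    returns = similar(rewards)
    g = zero(T)
    # iterate backward so each return reuses the return of the step after it
    for t in length(rewards):-1:1
        g = rewards[t] + γ * g
        returns[t] = g
    end
    return returns
end

# Hypothetical helper: scalar γ^t * G_t that multiplies ∇ln π(A_t|S_t, θ) at each step,
# where t counts steps from 0 (hence the exponent t - 1 for 1-based indexing)
function sampled_gradient_weights(rewards::Vector{T}, γ::T) where T<:Real
    returns = discounted_returns(rewards, γ)
    return [γ^(t - 1) * returns[t] for t in eachindex(returns)]
end

rewards = [-1f0, -1f0, -1f0]
weights = sampled_gradient_weights(rewards, 0.5f0)
```

With three rewards of -1 and a discount of 0.5, the returns are -1.75, -1.5 and -1.0, so the weights come out as -1.75, -0.75 and -0.25. The step count enters only through the power of the discount, and the return plays the role of the unbiased sample of the state-action value.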
"""metadatashow_logsèdisabled®skip_as_script«code_folded$700dcbc4-c94c-4287-8cf0-0b2c7a320a3acell_id$700dcbc4-c94c-4287-8cf0-0b2c7a320a3acode1reinforce_test5.policy_and_value(CartPoleState())metadatashow_logsèdisabled®skip_as_script«code_folded$f59a5dcd-9f4a-4336-a391-e64af35ef799cell_id$f59a5dcd-9f4a-4336-a391-e64af35ef799codehtml""" """metadatashow_logsèdisabled®skip_as_scriptëcode_folded$5864a5a3-a5a5-43c2-9cb4-7d13b2d20bedcell_id$5864a5a3-a5a5-43c2-9cb4-7d13b2d20bedcodeHmd""" Normal Distribution: $f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{(x - \mu)^2}{2 \sigma^2}}$ Consider a new random variable $Y \sim \tanh(X)$ where $X \sim N(0, 1)$. Using the change of variables theorem from probability theory we can compute the density function of $Y$: $f_Y(y) = f_X (g^{-1}(y)) \cdot \left \vert \frac{d}{dy} g^{-1}(y) \right \vert$ where $g(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ so $f_Y(y) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{\left (\tanh^{-1}(y) - \mu \right )^2}{2 \sigma^2}} \left \vert \frac{1}{1 - y^2} \right \vert$ """metadatashow_logsèdisabled®skip_as_script«code_folded$e3a2fb12-37ce-4c23-ad93-5fc89991aabbcell_id$e3a2fb12-37ce-4c23-ad93-5fc89991aabbcodeNmd""" ### Eligibility Vector for General Soft-Max and State Feature Vector """metadatashow_logsèdisabled®skip_as_script«code_folded$e5c1aca8-7575-4835-8273-e69ca0a55fe8cell_id$e5c1aca8-7575-4835-8273-e69ca0a55fe8codefunction corridor_parameter_studies(α_list, α_θ_list, α_w_list; nruns = 100, num_episodes = 100, max_steps = 1_000) Random.seed!(45) function average_runs(α) 1:nruns |> Map(_ -> reinforce_monte_carlo_control_binary_features(corridor_mdp, get_corridor_features, 1, num_episodes, params = [0f0 3.7f0], α = α, max_steps = max_steps).episode_rewards |> sum) |> foldxt(+) |> x -> x / nruns / num_episodes end function average_runs(α_θ, α_w) 1:nruns |> Map(_ -> reinforce_with_baseline_monte_carlo_control_binary_features(corridor_mdp, get_corridor_features, 1, num_episodes, policy_params = 
[0f0 3.7f0], α_θ = α_θ, α_w = α_w, max_steps = max_steps).episode_rewards |> sum) |> foldxt(+) |> x -> x / nruns / num_episodes end trace1 = scatter(x = α_list, y = average_runs.(α_list), name = "REINFORCE") with_baseline_traces = [begin scatter(x = α_θ_list, y = average_runs.(α_θ_list, α_w), name = "REINFORCE with Baseline: α_w = 2^$(round(Int64, log2(α_w)))") end for α_w in α_w_list] plot([trace1; with_baseline_traces], Layout(xaxis_title = "Policy Parameters Learning Rate", yaxis_title = "Average Reward Per Episode
Over First $num_episodes Episodes", xaxis_type = "log", yaxis_range = [-70, -10])) endmetadatashow_logsèdisabled®skip_as_script«code_folded$44b32cc0-36a8-41fd-89bc-ce894536926ccell_id$44b32cc0-36a8-41fd-89bc-ce894536926ccode$best_mc_corridor.policy_and_value(1)metadatashow_logsèdisabled®skip_as_script«code_folded$646bc853-b7fc-49fa-a201-ff98e8f952d4cell_id$646bc853-b7fc-49fa-a201-ff98e8f952d4codeQfunction corridor_parameter_studies(α_θ_list, α_w_list; nruns = 100, max_episodes = 100, max_steps = 1_000_000) # Random.seed!(45) function average_runs(α_θ, α_w) 1:nruns |> Map(_ -> one_step_actor_critic_binary_features(corridor_mdp, get_corridor_features, 1, max_episodes, max_steps, policy_params = [0f0 3.7f0], α_θ = α_θ, α_w = α_w) |> x -> isempty(x.episode_rewards) ? -Inf32 : (sum(x.episode_rewards) / length(x.episode_rewards))) |> foldxt(+) |> x -> x / nruns end traces = [begin scatter(x = α_θ_list, y = average_runs.(α_θ_list, α_w), name = "α_w = 2^$(round(Int64, log2(α_w)))") end for α_w in α_w_list] plot(traces, Layout(xaxis_title = "Policy Parameters Learning Rate", yaxis_title = "Average Reward Per Episode In First
$max_episodes Episodes Averaged over $nruns Runs", xaxis_type = "log")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$25be5dcf-be63-46c4-b6de-6cf79fa28fd0cell_id$25be5dcf-be63-46c4-b6de-6cf79fa28fd0codebegin function update_traces_with_gradient!(c1::T, z_θ::Matrix{T}, c2::T, ∇θ::BinaryEligibilityVector{T, B}) where {T<:Real, B<:BinaryFeatureVector} z_θ .*= c1 @inbounds for i in eachindex(∇θ.π_dist) @simd for j in 1:∇θ.binary_features.num_features k = ∇θ.binary_features.active_features[j] z_θ[k, i] -= c2*∇θ.π_dist[i] end end @inbounds @simd for i in 1:∇θ.binary_features.num_features j = ∇θ.binary_features.active_features[i] z_θ[j, ∇θ.i_a] += c2 end return z_θ end function update_traces_with_gradient!(c1::T, z_w::Vector{T}, ∇w::BinaryFeatureVector) where {T<:Real} z_w .*= c1 @inbounds @simd for i in 1:∇w.num_features j = ∇w.active_features[i] z_w[j] += one(T) end return z_w end function update_traces_with_gradient!(c1::T, z_θ::Array{T, N}, ∇θ::Array{T, N}) where {T<:Real, N} z_θ .= c1 .* z_θ .+ ∇θ end function update_traces_with_gradient!(c1::T, z_θ::Array{T, N}, c2::T, ∇θ::Array{T, N}) where {T<:Real, N} z_θ .= c1 .* z_θ .+ c2 .* ∇θ end function update_traces_with_gradient!(c1::Float32, z_θ::FCANNParams, ∇θ::FCANNParams) for i in eachindex(first(z_θ)) for j in 1:2 update_traces_with_gradient!(c1, z_θ[j][i], ∇θ[j][i]) end end end function update_traces_with_gradient!(c1::Float32, z_θ::FCANNParams, c2::Float32, ∇θ::FCANNParams) for i in eachindex(first(z_θ)) for j in 1:2 update_traces_with_gradient!(c1, z_θ[j][i], c2, ∇θ[j][i]) end end end endmetadatashow_logsèdisabled®skip_as_script«code_folded$38acd032-1d18-4760-9111-67c9cdd2e892cell_id$38acd032-1d18-4760-9111-67c9cdd2e892codeq#without limiting the force in this way, the learned policy just applies so much force to go up the hill directlymetadatashow_logsèdisabled®skip_as_script«code_folded$cecc2a35-3850-4f66-84e8-e29da4f3d4b0cell_id$cecc2a35-3850-4f66-84e8-e29da4f3d4b0codefunction 
get_corridor_episode_stats(π::Function; ntrials=10_000, kwargs...) 1:ntrials |> Map(_ -> runepisode(corridor_mdp; π = π, kwargs...) |> first |> length) |> foldxt(+) |> a -> a / ntrials endmetadatashow_logsèdisabled®skip_as_script«code_folded$4c4e643b-d4b9-44f0-8d30-dc521bcc55accell_id$4c4e643b-d4b9-44f0-8d30-dc521bcc55accodeconst cartpole_continuing_mdp = StateMDP(cartpole_functions.discrete_actions, StateMDPTransitionSampler(cartpole_continuing_step, cartpole_functions.initialize_state()), cartpole_functions.initialize_state)metadatashow_logsèdisabled®skip_as_script«code_folded$738ada7f-edc7-4ed3-a15e-e92113468738cell_id$738ada7f-edc7-4ed3-a15e-e92113468738codet#note that the random policy i.e. p = 0.5 has an expected episode length of 12 which is very close to ideal already.metadatashow_logsèdisabled®skip_as_script«code_folded$cacaaca6-6e01-464f-a2ee-cbf62737a426cell_id$cacaaca6-6e01-464f-a2ee-cbf62737a426codeٵreinforce_with_baseline_monte_carlo_control_linear_features(corridor_mdp, update_corridor_features!, 1, 1_000; α_θ = 2f0^-12, α_w = 2f0^-6, max_steps = 1_000).policy_function(1)metadatashow_logsèdisabled®skip_as_script«code_folded$daf35bfe-8f9c-4f55-971d-4d443be8f8bfcell_id$daf35bfe-8f9c-4f55-971d-4d443be8f8bfcode٣display_cartpole_episode((runepisode(cartpole_setup.mdps.episodic.discrete; π = reinforce_test5.policy_sample_action, max_steps = 1_000) |> x -> (x[1], x[2]))...)metadatashow_logsèdisabled®skip_as_script«code_folded$8e096fae-9941-49d8-ae87-c68b02f68da5cell_id$8e096fae-9941-49d8-ae87-c68b02f68da5codeSconst mountaincar_continuous_beta_mdp = create_continuous_action_mountaincar_beta()metadatashow_logsèdisabled®skip_as_script«code_folded$666a4e89-306b-4fb2-bdc4-3dda2c63153fcell_id$666a4e89-306b-4fb2-bdc4-3dda2c63153fcodeusing SpecialFunctionsmetadatashow_logsèdisabled®skip_as_script«code_folded$5d35e515-e2d3-443e-becf-eb28c25db346cell_id$5d35e515-e2d3-443e-becf-eb28c25db346codes@bind mountaincar_continuing_fcann_params 
create_actor_critic_continuing_params_UI(; λ_θ = 0.85f0, λ_w = 0.95f0)metadatashow_logsèdisabled®skip_as_script«code_folded$4c34640f-efa2-4e1d-8a70-0acd2ce45428cell_id$4c34640f-efa2-4e1d-8a70-0acd2ce45428code md""" # Bonus Problems: Comparing Techniques Consider the case of applying the techniques in this chapter to problems where we choose feature vectors and parameters to effectively compute the tabular case. That is we enumerate every state and state/action pair. Our parameters for each function will store a single value for each case. Let's consider the gradients for both the state-value estimate and the policy. We will use two sets of parameters: $\mathbf{w}$ and $\mathbf{\theta}$. $\mathbf{w}_s$ is the parameter for state s and $\mathbf{\theta}_{s, a}$ is the parameter for state/action pair $(s, a)$. Using this notation $\mathbf{w}$ is a vector and $\theta$ is a matrix. Starting with the state-value function: $\begin{align} \hat v(s, \mathbf{w}) &= \mathbf{w}_s \\ \nabla v(s, \mathbf{w}) &= \nabla \mathbf{w}_s \\ &= \mathbf{e}_s \end{align}$ where $\mathbf{e}_s$ is the one-hot vector for index s and length equal to the number of states. Now moving on to the policy, we will use a soft-max function to convert action preferences into probabilities. $\begin{align} \pi(a|s, \theta) &= \frac{\exp{\theta_{s, a}}}{\sum_{i = 1}^{n_A}{\exp{\theta_{s, i}}}} \\ \nabla \pi(a|s, \theta) &= \nabla \frac{\exp{\theta_{s, a}}}{\sum_{i = 1}^{n_A}{\exp{\theta_{s, i}}}} \\ \end{align}$ But we already calculated the gradient of the soft-max function of a vector $\mathbf{x}$. $\nabla\sigma(\mathbf{x})_{i, j} = \sigma(\mathbf{x})_i \left ( \delta_{i, j} - \sigma(\mathbf{x})_j \right )$ Comparing to what we desire, $\mathbf{x} = \mathbf{\theta}_s$ which is the parameter vector for the state s and $\sigma = \pi$. 
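
As a quick numerical check of the soft-max gradient formula quoted above, the following sketch compares the analytic Jacobian with a central finite-difference estimate. The function names and the preference values in `θ_s` are illustrative assumptions and are not part of the notebook's own code.

```julia
# Soft-max of a preference vector (shifted by the maximum for numerical stability).
softmax(x) = (e = exp.(x .- maximum(x)); e ./ sum(e))

# Analytic Jacobian: J[i, j] = σ(x)_i * (δ_ij - σ(x)_j)
function softmax_jacobian(x)
    p = softmax(x)
    return [p[i] * ((i == j) - p[j]) for i in eachindex(p), j in eachindex(p)]
end

# Central finite-difference estimate of the same Jacobian, one column per input component.
function softmax_jacobian_fd(x; h = 1e-6)
    n = length(x)
    J = zeros(n, n)
    for j in 1:n
        dx = zeros(n)
        dx[j] = h
        J[:, j] .= (softmax(x .+ dx) .- softmax(x .- dx)) ./ (2h)
    end
    return J
end

θ_s = [0.5, -1.0, 2.0]   # hypothetical preferences for one state with three actions
J = softmax_jacobian(θ_s)
max_error = maximum(abs.(J .- softmax_jacobian_fd(θ_s)))   # expected to be close to zero
```

Each row of `J` sums to zero, which reflects the fact that the action probabilities always sum to one.
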
So we can immediately write down the components of this gradient: $\begin{align} \nabla \pi(a|\theta_s)_i &= \pi(a|\theta_s) \left (\delta_{a, i} - \pi(i|\theta_s) \right ) \\ \frac{\nabla \pi(a|\theta_s)_i}{\pi(a|\theta_s)} = \nabla \ln \pi(a|\theta_s)_i &= \left (\delta_{a, i} - \pi(i|\theta_s) \right ) \\ \end{align}$ $\begin{equation} \nabla \ln{\pi(a|\theta_s)}_i = \begin{cases} -\pi(i|\theta_s) & i \neq a \\ 1 - \pi(i|\theta_s) & i = a \end{cases} \end{equation}$ This is a gradient vector whose components correspond to those of $\theta_s$, the vector of action preferences at that state. Each unique state/action pair observed gives a different update, but once the pair is fixed the gradient has only as many components to calculate as there are actions. """metadatashow_logsèdisabled®skip_as_script«code_folded$e7566274-5518-4e28-8738-d4b1747d0cfbcell_id$e7566274-5518-4e28-8738-d4b1747d0cfbcodefunction form_state_value_function(update_feature_vector!::Function, value_function::Function) function v!(x, s, value_params) update_feature_vector!(x, s) value_function(x, value_params) end endmetadatashow_logsèdisabled®skip_as_script«code_folded$6bf5ad39-1400-4e1f-a843-a1934b8aaa48cell_id$6bf5ad39-1400-4e1f-a843-a1934b8aaa48codebegin function update_squashed_gaussian_eligibility_vector!(∇lnπ::Matrix{T}, action_dist_params::Vector{T}, x::Vector{T}, action::T, policy_params::Matrix{T}, amax::T) where T<:Real c1 = atanh(action/amax) - first(action_dist_params) σ = exp(last(action_dist_params)) c2 = σ^-2 c3 = c2*c1 c4 = c3*c1 - one(T) @inbounds @simd for i in eachindex(x) ∇lnπ[i, 1] = x[i]*c3 end @inbounds @simd for i in eachindex(x) ∇lnπ[i, 2] = x[i]*c4 end end function update_squashed_gaussian_eligibility_vector!(∇lnπ::Matrix{T}, action_dist_params::Vector{T}, x::Vector{T}, action::NTuple{N, T}, policy_params::Matrix{T}, amax::NTuple{N, T}) where {N, T <: Real} for k = 1:N c1 = atanh(action[k]/amax[k]) - 
action_dist_params[k] σ = exp(action_dist_params[k+N]) c2 = σ^-2 c3 = c2*c1 c4 = c3*c1 - one(T) @inbounds @simd for i in eachindex(x) ∇lnπ[i, k] = x[i]*c3 end @inbounds @simd for i in eachindex(x) ∇lnπ[i, k+N] = x[i]*c4 end end end endmetadatashow_logsèdisabled®skip_as_script«code_folded$17d07ef4-7c0a-47cc-a701-32c60336571bcell_id$17d07ef4-7c0a-47cc-a701-32c60336571bcodemd""" Noticing this pattern, the kth term will be of the form $\gamma^k \sum_{x \in \mathcal{S}} \Pr(s \rightarrow x, k, \pi)f(x)$ and the total expression will just be a sum of all of these terms to infinity or the maximum length of an episode under the policy. Looking more closely at the probability term, we can equate it to some other probabilities regarding episode length. """metadatashow_logsèdisabled®skip_as_script«code_folded$76fd79a2-2bc8-45f8-a243-48415118898acell_id$76fd79a2-2bc8-45f8-a243-48415118898acode+begin mutable struct BinarySquashedGaussianEligibilityVector{T<:Real, A<:Union{T, NTuple{N, T} where N}, P<:Union{T, Vector{T}}, B <: BinaryFeatureVector} binary_features::B a::A μ::P σ::P amax::A end BinarySquashedGaussianEligibilityVector(a::T, amax::T) where T<:Real = BinarySquashedGaussianEligibilityVector(BinaryFeatureVector(), a, zero(T), one(T), amax) BinarySquashedGaussianEligibilityVector(a::NTuple{N, T}, amax::NTuple{N, T}) where {T<:Real, N} = BinarySquashedGaussianEligibilityVector(BinaryFeatureVector(), a, zeros(T, N), ones(T, N), amax) endmetadatashow_logsèdisabled®skip_as_script«code_folded$0b01ba67-3921-4f3f-a7e8-235190bc84ebcell_id$0b01ba67-3921-4f3f-a7e8-235190bc84ebcodeRfunction make_beta_dist(α, β) f(x) = x^(α-1) * (1-x)^(β-1) / beta(α, β) endmetadatashow_logsèdisabled®skip_as_script«code_folded$9acdbf38-2e10-45ec-85a0-d0db8453a599cell_id$9acdbf38-2e10-45ec-85a0-d0db8453a599code#this feature setup writes each state component, rescaled with scale_state, into a dense feature vector for use as network input function fcann_feature_vector_setup(min_value::S, 
max_value::S) where {T<:Real, N, S <: Union{T, NTuple{N, T}}} #states must be tuples with k elements or some number value k = S == T ? 1 : N s_range = if k == 1 max_value - min_value else Tuple(max_value[i] - min_value[i] for i in 1:k) end sample_vector = make_sample_vector(min_value) function update_feature_vector!(x::Vector{T}, s::Real) x[1] = scale_state(s, min_value, s_range) return x end function update_feature_vector!(x::Vector{T}, s::NTuple{N, T}) for i in 1:N x[i] = scale_state(s[i], min_value[i], s_range[i]) end return x end (feature_vector = sample_vector, num_features = length(sample_vector), update_feature_vector! = update_feature_vector!) endmetadatashow_logsèdisabled®skip_as_script«code_folded$d4e87ac4-6008-43b2-aa06-e232ec2b2b5bcell_id$d4e87ac4-6008-43b2-aa06-e232ec2b2b5bcodeqplot_cartpole_policy(reinforce_test5.policy_and_value; s_ref = CartPoleState(Float32(x), 0f0, Float32(ẋ), 0f0))metadatashow_logsèdisabled®skip_as_script«code_folded$05f120be-9695-4824-82fd-142a0df13098cell_id$05f120be-9695-4824-82fd-142a0df13098codefunction actor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, amax::A, λ_θ::T, λ_w::T, get_active_features::Function, num_features::Integer, args...; policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), value_params::Vector{T} = zeros(T, num_features), kwargs...) 
where {T<:Real, S, N, A <: Union{T, NTuple{N, T}}, P, F1, F2, F3} setup = setup_binary_squashed_gaussian_policy_arguments(mdp, amax, get_active_features, num_features) actor_critic_with_eligibility_traces!(policy_params, setup.eligibility_vector, value_params, BinaryFeatureVector(), mdp, λ_θ, λ_w, update_binary_action_preferences!, setup.action_distribution_parameters, make_squashed_gaussian_sampler(rand(A), amax), update_squashed_gaussian_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, binary_value_function, update_binary_value_gradient!, args...; kwargs...) endmetadatashow_logsèdisabled®skip_as_script«code_folded$b2539398-fdbc-42a2-a8f3-d327358f3643cell_id$b2539398-fdbc-42a2-a8f3-d327358f3643codeif start_cartpole_continuing_binary_param_study > 0 cartpole_binary_continuing_parameter_study(cartpole_continuing_binary_study_params, 5, 3, 10_000; seed = 45) else md""" Waiting to run parameter study """ endmetadatashow_logsèdisabled®skip_as_script«code_folded$c5dd7e99-57e0-4bc7-97d2-2c780b23bcffcell_id$c5dd7e99-57e0-4bc7-97d2-2c780b23bcffcodeqmd""" #### Discrete Action Space As an initial test, consider the discrete action space originally used for the mountain car problem where there are three actions (-1, 0, 1) corresponding to full throttle reverse, idle, and full throttle forward. We can apply the same tile coding solution technique from before but with a policy gradient method instead of Sarsa. 
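
A minimal sketch of the idea, assuming a soft-max policy over linear action preferences built from the active tiles. All names below (`action_preferences`, `policy_probabilities`, `grad_log_policy`) and the sizes used are hypothetical and are not the notebook's own functions.

```julia
# Action preference h(s, a): sum of the weights of the active tiles, one column of θ per action.
function action_preferences(θ::Matrix{Float32}, active_features::Vector{Int})
    prefs = zeros(Float32, size(θ, 2))
    for a in axes(θ, 2), k in active_features
        prefs[a] += θ[k, a]
    end
    return prefs
end

# Soft-max turns the preferences into a probability for each action.
function policy_probabilities(θ, active_features)
    prefs = action_preferences(θ, active_features)
    e = exp.(prefs .- maximum(prefs))
    return e ./ sum(e)
end

# Gradient of ln π(a|s) with respect to θ: non-zero only in the rows of the active tiles.
function grad_log_policy(θ, active_features, a)
    p = policy_probabilities(θ, active_features)
    g = zeros(Float32, size(θ))
    for k in active_features
        g[k, :] .-= p      # -π(i|s) for every action i
        g[k, a] += 1f0     # +1 for the action actually taken
    end
    return g
end

actions = (-1, 0, 1)                       # full throttle reverse, idle, full throttle forward
θ = zeros(Float32, 16, length(actions))    # 16 features is an arbitrary illustrative size
active = [2, 7, 11]                        # hypothetical active tile indices for some state
p = policy_probabilities(θ, active)        # uniform while all weights are zero
g = grad_log_policy(θ, active, 3)          # gradient for choosing full throttle forward
```

With binary features only the rows of the active tiles change, so an update costs a number of operations proportional to the number of active tiles times the number of actions.
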
"""metadatashow_logsèdisabled®skip_as_script«code_folded$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9cell_id$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9code9const mountaincar_continuing_fcann_test = actor_critic_with_eligibility_traces_fcann(mountaincar_continuing_mdp, 0.85f0, 0.95f0, mountaincar_fcann_setup.num_features, [32, 32, 32], mountaincar_fcann_setup.update_feature_vector!, 1_000_000, α_θ = 0.002f0, α_w = 0.002f0, α_r̄ = 0.01f0; save_step_rewards=true)metadatashow_logsèdisabled®skip_as_script«code_folded$042fbafe-2401-4fb7-ac13-4531e0782c79cell_id$042fbafe-2401-4fb7-ac13-4531e0782c79codefunction update_binary_eligibility_vector!(∇lnπ::BinaryEligibilityVector{T, B}, action_preferences::Vector{T}, binary_features::B, i_a::Integer, params::Matrix{T}) where {T<:Real, B<:BinaryFeatureVector} update_binary_action_preferences!(action_preferences, binary_features, params) soft_max!(action_preferences) ∇lnπ.binary_features = binary_features ∇lnπ.i_a = i_a ∇lnπ.π_dist .= action_preferences return ∇lnπ endmetadatashow_logsèdisabled®skip_as_script«code_folded$d57375a5-b9e0-4742-b5f7-6a7da891604acell_id$d57375a5-b9e0-4742-b5f7-6a7da891604acode mountaincar_binary_continuing_parameter_study(args...; kwargs...) 
= actor_critic_linear_parameter_study(mountaincar_continuing_mdp, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, args...; binary_features=true, kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$07ad517a-c2ac-4377-99fb-adb13d0f1d0ccell_id$07ad517a-c2ac-4377-99fb-adb13d0f1d0ccodeٓreinforce_monte_carlo_control_fcann(corridor_mdp, 1, [10, 10], update_corridor_features!, 100; α = 2f0^-14, max_steps = 10_000).policy_function(1)metadatashow_logsèdisabled®skip_as_script«code_folded$71a5fce8-6d9a-4625-bad1-a951d61bff28cell_id$71a5fce8-6d9a-4625-bad1-a951d61bff28codef@bind mountaincar_binary_continuous_params create_actor_critic_params_UI(λ_θ = 0.05f0, λ_w = 0.8f0)metadatashow_logsèdisabled®skip_as_script«code_folded$77906355-08f8-4b08-b051-84697199b519cell_id$77906355-08f8-4b08-b051-84697199b519code,const mountaincar_max_vals = (0.5f0, 0.07f0)metadatashow_logsèdisabled®skip_as_script«code_folded$5207308e-f636-4d47-b135-036a6e7b8ecdcell_id$5207308e-f636-4d47-b135-036a6e7b8ecdcodefshow_mountaincar_continuous_trajectory(mountaincar_continuous_test_train3.policy_sample_action, 1_000)metadatashow_logsèdisabled®skip_as_script«code_folded$16113560-e911-47b4-abc4-641bbd246454cell_id$16113560-e911-47b4-abc4-641bbd246454code_plot(mountaincar_continuous_test_train_beta.episode_rewards, Layout(yaxis_range = [-10000, 0]))metadatashow_logsèdisabled®skip_as_script«code_folded$b7f77935-bcab-4ef1-8e1b-a7d059784ff3cell_id$b7f77935-bcab-4ef1-8e1b-a7d059784ff3code@bind test_mountaincar_state PlutoUI.combine() do Child md""" #### Evaluation State for Policy Function x position: $(Child(Slider(-1.2f0:0.1f0:0.5f0, default = 0f0, show_value=true))) velocity: $(Child(Slider(-0.07f0:0.01f0:0.07f0, default = 0f0, show_value=true))) """ 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5cell_id$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5codeUreinforce_test.policy_and_value(cartpole_mdps.episodic.continuous.initialize_state())metadatashow_logsèdisabled®skip_as_script«code_folded$00bd2835-b006-4244-9877-bc7e031e3ef8cell_id$00bd2835-b006-4244-9877-bc7e031e3ef8codefunction plot_squashed_gaussian(μ::T, σ::T, xmax::T; npoints = 1000) where T<:Real x = LinRange(-one(T)*xmax, one(T)*xmax, npoints) y = squashed_gaussian_pdf.(x, μ, σ, xmax) plot(x, y, Layout(xaxis_range = [-2, 2])) endmetadatashow_logsèdisabled®skip_as_script«code_folded$50ae94c4-70f3-4215-82bd-eb2227c2badfcell_id$50ae94c4-70f3-4215-82bd-eb2227c2badfcodeif start_cartpole_continuing_fcann_param_study > 0 cartpole_fcann_continuing_parameter_study(cartpole_continuing_fcann_network_params..., cartpole_continuing_fcann_study_params, 4, 3, 100_000; seed = 45) else md""" Waiting to run parameter study """ endmetadatashow_logsèdisabled®skip_as_script«code_folded$cc3ac95e-a398-438a-ba3d-62b6733f6342cell_id$cc3ac95e-a398-438a-ba3d-62b6733f6342codefunction update_fcann_action_preferences!(action_preferences::Vector{T}, x::Vector{T}, params::FCANNParams, activations::FCANNActivations{T}, reslayers::Integer) where T<:Float32 FCANN.forwardNOGRAD_base!(activations, params..., x, reslayers) action_preferences .= activations[end] endmetadatashow_logsèdisabled®skip_as_script«code_folded$c926b6df-c40b-4c4c-8a95-ce9e41feb100cell_id$c926b6df-c40b-4c4c-8a95-ce9e41feb100codeMactor_critic_fcann_parameter_study(mountaincar_continuing_mdp, mountaincar_fcann_feature_setup.update_feature_vector!, mountaincar_fcann_feature_setup.num_features, [4, 4], 0.0f0:0.05f0:0.95f0, 0.0f0:0.05f0:0.95f0, [0.01f0, 0.005f0], 2f0 .^ (-20:-1), 2f0 .^ (-20:-1), 1_000, 1_000_000; seed = 45) |> df -> sort(df, :output; 
rev=true)metadatashow_logsèdisabledîskip_as_script«code_folded$740a3f41-9302-481d-b373-762c0dea8effcell_id$740a3f41-9302-481d-b373-762c0dea8effcodeVbegin function update_gaussian_eligibility_vector!(∇lnπ::BinaryGaussianEligibilityVector{T, T, T, B}, dist_params::Vector{T}, x::B, action::T, policy_params::Matrix{T}) where {T<:Real, B<:BinaryFeatureVector} ∇lnπ.binary_features = x ∇lnπ.a = action ∇lnπ.μ = first(dist_params) ∇lnπ.σ = exp(last(dist_params)) # isapprox(∇lnπ.σ, 0f0) && @info "with distribution params $dist_params having 0 result for σ of $∇lnπ.σ" # isinf(∇lnπ.σ) && @info "with distribution params $dist_params having inf result for σ of $∇lnπ.σ" # isnan(∇lnπ.σ) && @info "with distribution params $dist_params having nan result for σ of $∇lnπ.σ" return ∇lnπ end function update_gaussian_eligibility_vector!(∇lnπ::BinaryGaussianEligibilityVector{T, NTuple{N, T}, Vector{T}, B}, dist_params::Vector{T}, x::B, action::NTuple{N, T}, policy_params::Matrix{T}) where {T<:Real, N, B<:BinaryFeatureVector} ∇lnπ.binary_features = x ∇lnπ.a = action for k in 1:N ∇lnπ.μ[k] = dist_params[k] ∇lnπ.σ[k] = exp(dist_params[k+N]) end return ∇lnπ end endmetadatashow_logsèdisabled®skip_as_script«code_folded$ba642a22-6623-482a-ab4a-81585b83e457cell_id$ba642a22-6623-482a-ab4a-81585b83e457code-@memoize Dict function average_continuing_runs(nruns::Integer, seed::Integer, α_θ::T, α_w::T, α_r̄::T, policy_params, algo, args...; kwargs...) where T<:Real # @info "Running trials for continuing actor critic with binary encoding: $nruns $seed $α_θ $α_w $α_r̄ $mdp $λ_θ $λ_w $get_active_features $num_features" Random.seed!(seed) 1:nruns |> Map() do _ x = algo(args...; α_θ = α_θ, α_w = α_w, α_r̄ = α_r̄, policy_params = deepcopy(policy_params), kwargs...) 
x.total_reward / x.total_steps end |> foldxt(+) |> a -> a / nruns endmetadatashow_logsèdisabled®skip_as_script«code_folded$d17a4bd0-5992-4247-912d-73d51758d2f3cell_id$d17a4bd0-5992-4247-912d-73d51758d2f3code+md""" ### *Continuing Cartpole Example* """metadatashow_logsèdisabled®skip_as_script«code_folded$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7efcell_id$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7efcode٧plot_cartpole_policy(cartpole_continuing_fcann_test.policy_and_value; s_ref = cartpole_fcann_continuing_test_episode[1][cartpole_fcann_continuing_episode_step_select])metadatashow_logsèdisabled®skip_as_script«code_folded$5ee4ce72-7740-4297-8d84-619e0708e4accell_id$5ee4ce72-7740-4297-8d84-619e0708e4accodefunction cartpole_continuing_fcann_parameter_study(α1_list, α2_list, α_r̄, λ_θ, λ_w, hidden_layers, max_steps; num_trials = 100, kwargs...) setup = setup_cartpole_problem(;kwargs...) traces = [begin steps = [begin 1:num_trials |> Map() do i solution = actor_critic_with_eligibility_traces_fcann(cartpole_setup.mdps.continuing.discrete, λ_θ, λ_w, cartpole_fcann_feature_setup.num_features, hidden_layers, (x, s) -> cartpole_fcann_feature_setup.update_feature_vector!(x, (s.x, s.θ, s.ẋ, s.θ̇)), max_steps; α_θ = α1, α_w = α2, α_r̄ = α_r̄) solution.total_reward / max_steps end |> foldxt(+) |> x -> x / num_trials end for α1 in α1_list] scatter(x = α1_list, y = steps, name = "α_w = $α2") end for α2 in α2_list] plot(traces, Layout(xaxis_title = "Policy Learning Rate α_θ", yaxis_title = "Average Failure Rate Over First $max_steps Steps", xaxis_type = "log", title = "Hidden Layers = $hidden_layers, λ_θ = $λ_θ, λ_w = $λ_w")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$645e93e7-e92e-49c4-9757-8294fabf4e9bcell_id$645e93e7-e92e-49c4-9757-8294fabf4e9bcodeCplot_continuing_step_rewards(cartpole_continuing_test.step_rewards)metadatashow_logsèdisabled®skip_as_script«code_folded$0c56b341-24eb-4c78-844e-182f44a7221acell_id$0c56b341-24eb-4c78-844e-182f44a7221acode#in the source code used to 
generate this for the book found here: http://incompleteideas.net/book/code/figure_13_1.py - graphs look as they do because of poor parameter initialization since the random policy is fairly close to ideal already figure_13_1(2f0 .^ [-12, -13, -14])metadatashow_logsèdisabled®skip_as_script«code_folded$d34d22ad-89c2-423e-91dd-bfb895dc6540cell_id$d34d22ad-89c2-423e-91dd-bfb895dc6540codecartpole_fcann_parameter_study(args...; kwargs...) = actor_critic_fcann_episodic_parameter_study(cartpole_setup.mdps.episodic.discrete, cartpole_vector_update!, cartpole_fcann_feature_setup.num_features, args...; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$20776e09-7d9b-4db8-a060-7bceeec65b47cell_id$20776e09-7d9b-4db8-a060-7bceeec65b47codefunction actor_critic_with_eligibility_traces_binary_features_gaussian_actions(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, λ_θ::T, λ_w::T, get_active_features::Function, num_features::Integer, args...; policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, N, A <: Union{T, NTuple{N, T}}, P, F1, F2, F3} setup = setup_binary_gaussian_policy_arguments(mdp, get_active_features, num_features) actor_critic_with_eligibility_traces!(policy_params, setup.eligibility_vector, value_params, BinaryFeatureVector(), mdp, λ_θ, λ_w, update_binary_action_preferences!, setup.action_distribution_parameters, make_gaussian_sampler(rand(A)), update_gaussian_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, binary_value_function, update_binary_value_gradient!, args...; kwargs...) 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$7856b8a0-565d-4c86-9b3c-4424ff9b86ddcell_id$7856b8a0-565d-4c86-9b3c-4424ff9b86ddcodeb#add policy gradient example on cartpole without continuous actions and parameter studies for bothmetadatashow_logsèdisabled®skip_as_script«code_folded$735b548a-88f5-4a30-ab8f-dfb3d6401b2bcell_id$735b548a-88f5-4a30-ab8f-dfb3d6401b2bcodeAmd""" ## 13.7 Policy Parameterization for Continuous Actions With a parameterized policy we are to learn statistics of the distribution that selects actions. As a foundation consider the normal distribution: $p(x) \doteq \frac{1}{\sigma \sqrt{2\pi}} \exp \left ( - \frac{(x-\mu)^2}{2\sigma^2} \right ) \tag{13.18}$ """metadatashow_logsèdisabled®skip_as_script«code_folded$7cf26604-9c2b-4a77-9674-7d4dac2f99f0cell_id$7cf26604-9c2b-4a77-9674-7d4dac2f99f0codebegin include(joinpath(@__DIR__, "..", "Chapter-09", "Chapter_09_On-policy_Prediction_with_Approximation.jl")) include(joinpath(@__DIR__, "..", "Chapter-10", "Chapter_10_On_policy_Control_with_Approximation.jl")) include(joinpath(@__DIR__, "..", "Chapter-11", "Chapter_11_Off_policy_Methods_with_Approximation.jl")) include(joinpath(@__DIR__, "..", "Chapter-12", "Chapter_12_Eligibility_Traces.jl")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$87ee21f3-16ca-4c8c-a0b9-f9e2fd258a91cell_id$87ee21f3-16ca-4c8c-a0b9-f9e2fd258a91codeEmd""" ### *REINFORCE Implementation for Continuous Action Spaces* """metadatashow_logsèdisabled®skip_as_script«code_folded$54f1546d-87ae-49d2-92ed-6fcc9b66e027cell_id$54f1546d-87ae-49d2-92ed-6fcc9b66e027code md""" ### *Mountain Car MDP* """metadatashow_logsèdisabled®skip_as_script«code_folded$63fbf8f4-e4e2-4893-be09-67450e92dbd7cell_id$63fbf8f4-e4e2-4893-be09-67450e92dbd7code,function plot_cart(s::CartPoleState, a::Int64; xmin = -50, xmax = 50, θ̇_min = -10, θ̇_max = 10) s.x s.θ t1 = scatter(x = [0, sin(s.θ)], y = [0, cos(s.θ)], mode = "lines", color = "black") t2 = scatter(x = [sin(s.θ)], y = [cos(s.θ)], mode = 
"markers", color = "black") p1 = plot([t1, t2], Layout(yaxis_range = [-.1, 1.2], xaxis_range = [-1.2, 1.2], xaxis_scaleanchor = "y", width = 250, height = 230, showlegend = false, title = "Pole Angle")) p2 = plot(scatter(x = [s.x], y = [0]), Layout(xaxis_range = [xmin, xmax], width = 250, height = 230, title = "X Location")) p3 = plot(indicator(mode = "gauge+number+delta", value = s.θ̇, title_text = "Angular Speed
in Radians per Second", delta_reference = 0, gauge_axis_range = [-10, 10]), Layout(width = 250, height = 230)) p4 = plot(indicator(mode = "gauge+number+delta", value = s.ẋ, title_text = "Horizontal Speed
in Meters per Second", delta_reference = 0, gauge_axis_range = [-50, 50]), Layout(width = 250, height = 230)) p5 = plot(indicator(mode = "gauge+number", gauge=attr(shape="bullet"), value = a - 2, title_text = "Action", delta_reference = 0, gauge_axis_range = [-1, 1]), Layout(width = 250, height = 230)) @htl("""
$p1 $p2 $p3 $p4 $p5
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$d5020a8d-1dd7-403c-9d1f-665b95543943cell_id$d5020a8d-1dd7-403c-9d1f-665b95543943codeVreinforce_with_baseline_monte_carlo_control_linear_features_gaussian_actions(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, max_episodes::Integer; policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), value_params::Vector{T} = zeros(T, num_features), x = zeros(T, num_features), action_dist_params::Vector{T} = make_gaussian_params(rand(A)), kwargs...) where {T<:Real, S, N, A<:Union{T, NTuple{N, T}}, P, F1, F2, F3} = reinforce_with_baseline_monte_carlo_control!(policy_params, copy(policy_params), value_params, copy(value_params), mdp, update_linear_action_preferences!, action_dist_params, make_gaussian_sampler(rand(A)), update_gaussian_eligibility_vector!, x, update_feature_vector!, linear_value_function, update_linear_value_gradient!, max_episodes; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$37a8ef7e-e859-4ef0-81e2-76c02a324031cell_id$37a8ef7e-e859-4ef0-81e2-76c02a324031code md""" ### Policy Gradient Theorem Proof In all cases below when a sum over states is taken, it is assumed to be over the set of non-terminal states: $\sum_s \implies \sum_{s \in \mathcal{S}}$ Note that for the case of the value function this is identical to the sum over $\mathcal{S}^+$ because the state-action values are always zero for terminal states. 
$\begin{flalign} \nabla v_\pi(s) &= \nabla \left [ \sum_a \pi(a \vert s) q_\pi(s, a) \right ] \text{, } \forall s \in \mathcal{S} \tag{definition of value functions and expected value} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla q_\pi(s, a) \right ] \tag{product rule} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla \sum_{s^\prime, r} p(s^\prime, r \vert s, a)(r + \gamma v_\pi(s^\prime)) \right ] \tag{relationship between action and state values} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \gamma \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] \tag{gradient independence}\\ \end{flalign}$ Note that the final term in the sum is the original expression evaluated at $s^\prime$ instead of $s$, so we have derived a recursive expression which can be applied repeatedly: $\begin{flalign} \nabla v_\pi(s) &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \gamma \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \left [ \nabla \pi(a^\prime \vert s^\prime) q_\pi(s^\prime, a^\prime) + \pi(a^\prime \vert s^\prime) \gamma \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \nabla v_\pi(s^{\prime \prime}) \right ] \right ] \tag{recur once}\\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) \right ] + \gamma \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \left [ \nabla \pi(a^\prime \vert s^\prime) q_\pi(s^\prime, a^\prime) \right ] \right ] + \\ &\hspace{50px} \gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \nabla v_\pi(s^{\prime \prime}) \right ] \tag{grouping terms}\\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) \right ] + \gamma \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \left [ \nabla 
\pi(a^\prime \vert s^\prime) q_\pi(s^\prime, a^\prime) \right ] \right ] + \\ &\hspace{50px} \gamma^2 \sum_a \left [ \pi(a \vert s)\sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \sum_{a^{\prime \prime}} \left [ \nabla \pi(a^{\prime \prime} \vert s^{\prime \prime}) q_\pi(s^{\prime \prime}, a^{\prime \prime}) \right ] \right ] + \cdots \tag{extend recursion}\\ \end{flalign}$ """metadatashow_logsèdisabled®skip_as_script«code_folded$98229733-a71e-44ca-a52a-b7229cf8b422cell_id$98229733-a71e-44ca-a52a-b7229cf8b422code1md""" The probability transition function is normalized over all possible transition states $\sum_{s^\prime \in \mathcal{S}^+} p(s^\prime \vert s, a) = 1$. If we only take the sum over $\mathcal{S}$ then we instead get the probability that after a single transition we have NOT reached a terminal state. Let's say we also have a policy function $\pi(a \vert s)$ which is normalized over actions: $\sum_a \pi(a \vert s) = 1$. Now if we combine the two, we can arrive at a new distribution over transition states: $p(s^\prime \vert s, \pi) = \sum_a \pi(a \vert s) p(s^\prime \vert s, a)$ which is the probability of transitioning from $s$ to $s^\prime$ under the policy. We can see that this distribution is also normalized over the transition states, as long as we include the terminal state: $\sum_{s^\prime \in \mathcal{S}^+} p(s^\prime \vert s, \pi) = \sum_{s^\prime \in \mathcal{S}^+, a} \pi(a \vert s) p(s^\prime \vert s, a) = \sum_a \pi(a \vert s) \sum_{s^\prime \in \mathcal{S}^+} p(s^\prime \vert s, a) = 1 \times 1 = 1$. If instead we take the sum over $\mathcal{S}$ we simply get the probability of NOT terminating in one step. What if we consider two steps into the future though? 
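
The one-step statement above, and the two-step question just posed, can both be explored numerically. This is a minimal sketch on a toy MDP with two non-terminal states and two actions; the transition numbers are invented for illustration, and `P1`, `P2` and `π_probs` are hypothetical names.

```julia
# P1[s, s′] and P2[s, s′]: probability of moving from s to the non-terminal state s′
# under action 1 or action 2. The mass missing from each row is the chance of terminating.
P1 = [0.5 0.3; 0.0 0.6]
P2 = [0.1 0.4; 0.2 0.2]

# π_probs[s, a] = π(a|s); each row sums to 1.
π_probs = [0.7 0.3; 0.4 0.6]

# p(s′|s, π) = Σ_a π(a|s) p(s′|s, a), restricted to non-terminal s′.
Pπ = π_probs[:, 1] .* P1 .+ π_probs[:, 2] .* P2

# Summing over non-terminal states only gives the probability of not having terminated.
not_terminated_1 = vec(sum(Pπ, dims = 2))        # after one transition from each s
not_terminated_2 = vec(sum(Pπ * Pπ, dims = 2))   # after two transitions from each s
```

The two-step values come from the matrix product `Pπ * Pπ`, which chains the one-step distribution with itself while summing the intermediate state over non-terminal states only.
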
Now we have $\sum_{s^\prime}\sum_{a^\prime}\pi(a^\prime \vert s^\prime)p(s^{\prime \prime} \vert s^\prime, a^\prime)\sum_a \pi(a \vert s) p(s^\prime \vert s, a) = \sum_{s^\prime}p(s^{\prime \prime} \vert s^\prime, \pi) p(s^\prime \vert s, \pi)$. It would appear as though we can just put the two probabilities together and consider a new distribution over $s^{\prime \prime}$ which is $p(s^{\prime \prime} \vert s, \pi, 2)$ where instead of one step this now occurs over two steps, but how is this distribution normalized? In the case of the one-step transition, we saw that its sum over all transition states is 1 as expected. If we sum both transition states over only $\mathcal{S}$ rather than $\mathcal{S}^+$ what is the result? We already know that $\sum_{s^{\prime \prime} \in \mathcal{S}} p(s^{\prime \prime} \vert s^\prime , \pi) = \Pr \{ S_1 \neq S_T \ \vert S_0 = s^\prime, \pi \}$; that is, the probability that after transitioning out of $s^\prime$ under the policy $\pi$ we have not reached a terminal state. $\sum_{s^{\prime \prime} \in \mathcal{S}} \sum_{s^\prime \in \mathcal{S}} p(s^{\prime \prime} \vert s^\prime, \pi) p(s^\prime \vert s, \pi) = \sum_{s^\prime \in \mathcal{S}} p(s^\prime \vert s, \pi) \sum_{s^{\prime \prime} \in \mathcal{S}} p(s^{\prime \prime} \vert s^\prime, \pi) = \Pr \{ S_2 \neq S_T \vert S_0 = s, \pi \}$ which is to say the probability that after two transitions from $s$ we are not in a terminal state under the policy $\pi$. For the derivations that follow, we always take sums of these distributions over $\mathcal{S}$. For episodic problems, the on policy distribution $\mu_\pi(s)$ which is the probability of being in a state $s$ during an episode always excludes the terminal state. That is because if there is a non-zero probability of reaching a terminal state under a policy, then considering all possible episodes we may have an infinite number of visits to the terminal state. 
Technically the episodes have infinite length but we are only interested in the portion of the episode that precedes the terminal state for the purpose of calculating probabilities. The more careful statement about the on policy distribution is that it measures the probability of being in a state during the non-terminal part of an episode. If we try to include the terminal states, then we cannot have a proper normalized definition of the on-policy distribution. Moreover, we have no need to measure the value of a terminal state accurately, since we always know it to be 0. The on policy distribution is used to formulate the value error objective function and it should only include states for which the value estimation is non-trivial. """metadatashow_logsèdisabled®skip_as_script«code_folded$42775fd1-5b27-48e0-abf1-9b22bb775e6dcell_id$42775fd1-5b27-48e0-abf1-9b22bb775e6dcodeKcorridor_continuing_parameter_study(continuing_study_params, 5, 3, 100_000)metadatashow_logsèdisabled®skip_as_script«code_folded$7dbb42a3-aa8c-47e5-b668-18e6325d4038cell_id$7dbb42a3-aa8c-47e5-b668-18e6325d4038code!md""" #### Tile Coding Method """metadatashow_logsèdisabled®skip_as_script«code_folded$192b9f82-8d3a-408f-91c2-829cfcd32572cell_id$192b9f82-8d3a-408f-91c2-829cfcd32572codeٝcartpole_vector_update!(x::Vector{T}, s::CartPoleState{T}) where T<:Real = cartpole_fcann_feature_setup.update_feature_vector!(x, (s.x, s.θ, s.ẋ, s.θ̇))metadatashow_logsèdisabled®skip_as_script«code_folded$b5319d8b-0420-4ebf-b603-ea0b93365ac1cell_id$b5319d8b-0420-4ebf-b603-ea0b93365ac1codenfunction show_mountaincar_continuous_trajectory(π::Function, max_steps::Integer; mdp = mountaincar_continuous_mdp) states, actions, rewards, sterm, nsteps = runepisode(mdp, π; max_steps = max_steps) positions = [s[1] for s in states] velocities = [s[2] for s in states] tr1 = scatter(x = positions, y = velocities, mode = "markers", showlegend = false) tr2 = scatter(y = positions, showlegend = false) tr3 = scatter(y = actions, 
showlegend = false) p1 = plot(tr1, Layout(xaxis_title = "position", yaxis_title = "velocity", xaxis_range = [-1.2, 0.5], yaxis_range = [-0.07, 0.07], height = 400)) p2 = plot(tr2, Layout(xaxis_title = "time", yaxis_title = "position", height = 400)) p3 = plot(tr3, Layout(xaxis_title = "time", yaxis_title = "action", height = 400)) @htl(""" Total Reward: $(sum(rewards))
$([p1 p2 p3])
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$4cbdb082-22ba-49e9-a6ed-4380917625accell_id$4cbdb082-22ba-49e9-a6ed-4380917625accodeCmd""" ### *Actor-Critic with Eligibility Traces Implementation* """metadatashow_logsèdisabled®skip_as_script«code_folded$cc80848a-6834-4272-9152-e17b45448814cell_id$cc80848a-6834-4272-9152-e17b45448814codefunction wind_speeds(directions) PlutoUI.combine() do Child @htl("""
Wind speeds
    $([ @htl("
  • $(name): $(Child(name, Slider(1:100)))
  • ") for name in directions ])
""") end endmetadatashow_logsèdisabled®skip_as_script«code_folded$05bfd818-bf4e-4bda-baa9-5ba647867097cell_id$05bfd818-bf4e-4bda-baa9-5ba647867097code.function actor_critic_with_eligibility_traces_binary_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, λ_θ::T, λ_w::T, get_active_features::Function, num_features::Integer, args...; policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} setup = setup_binary_policy_arguments(mdp, get_active_features, num_features) actor_critic_with_eligibility_traces!(policy_params, setup.eligibility_vector, value_params, BinaryFeatureVector(), mdp, λ_θ, λ_w, update_binary_action_preferences!, update_binary_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, binary_value_function, update_binary_value_gradient!, args...; action_preferences = setup.action_preferences, kwargs...) endmetadatashow_logsèdisabled®skip_as_script«code_folded$f0962801-0dfa-421f-8ffc-e64068e49913cell_id$f0962801-0dfa-421f-8ffc-e64068e49913codefconst mountaincar_fcann_feature_setup = fcann_feature_vector_setup((-1.2f0, -0.07f0), (0.5f0, 0.07f0))metadatashow_logsèdisabled®skip_as_script«code_folded$11a55af7-5301-4507-bb26-88e1e11236dbcell_id$11a55af7-5301-4507-bb26-88e1e11236dbcode٥display_cartpole_episode((runepisode(cartpole_setup.mdps.episodic.discrete; π = reinforce_test3.policy_sample_action, max_steps = 100_000) |> x -> (x[1], x[2]))...)metadatashow_logsèdisabled®skip_as_script«code_folded$ddbca73f-c692-46f2-95f3-a7dd849d33f7cell_id$ddbca73f-c692-46f2-95f3-a7dd849d33f7codeOshow_mountaincar_trajectory(mountaincar_test_train.policy_sample_action, 1_000)metadatashow_logsèdisabled®skip_as_script«code_folded$b4875f2b-5487-429f-80a3-d1032bbccfc1cell_id$b4875f2b-5487-429f-80a3-d1032bbccfc1codeCmd""" ### Policy Gradient Theorem Proof for Continuing Problems 
"""metadatashow_logsèdisabled®skip_as_script«code_folded$0cd96c44-cae6-421f-9fae-26141600bef4cell_id$0cd96c44-cae6-421f-9fae-26141600bef4code٬display_cartpole_episode((runepisode(cartpole_setup.mdps.episodic.discrete; π = cartpole_continuing_test.policy_sample_action, max_steps = 1_000) |> x -> (x[1], x[2]))...)metadatashow_logsèdisabled®skip_as_script«code_folded$51d6337d-c0bd-40a9-9129-7d88e41e4093cell_id$51d6337d-c0bd-40a9-9129-7d88e41e4093codeR#add plot under this to show the action selection or force being applied over timemetadatashow_logsèdisabled®skip_as_script«code_folded$5859ca11-90f8-4fd6-88ed-c56efe796fe8cell_id$5859ca11-90f8-4fd6-88ed-c56efe796fe8code٥display_cartpole_episode((runepisode(cartpole_setup.mdps.episodic.discrete; π = reinforce_test2.policy_sample_action, max_steps = 100_000) |> x -> (x[1], x[2]))...)metadatashow_logsèdisabled®skip_as_script«code_folded$3ea08816-705e-4be7-a175-dbd3f3e4c17dcell_id$3ea08816-705e-4be7-a175-dbd3f3e4c17dcode$md""" # Misc Utilities/Functions """metadatashow_logsèdisabled®skip_as_script«code_folded$f3e2db06-9cb7-464a-96b8-938175efd26bcell_id$f3e2db06-9cb7-464a-96b8-938175efd26bcodefunction setup_fcann_value_arguments(policy_setup::NamedTuple, input_length::Integer, hidden_layers::Vector{Int64}, reslayers::Integer, l2::T, dropout::T, use_μP::Bool, activation_list, scales) where {T<:Real} scale = (reslayers == 0) ? 1 : length(hidden_layers)/(reslayers + 1) + 1 c = scale*last(hidden_layers) f = use_μP ? 
one(T) / c : c^T(-0.5) w_θ_out = T.(FCANN.makeorthonormalrand(1, last(hidden_layers)) .* f) w_β_out = zeros(T, 1) #value function shares its params with the policy function value_params = deepcopy(policy_setup.params) for i in eachindex(hidden_layers) for j in 1:2 value_params[j][i] = policy_setup.params[j][i] end end #replace the final layer of the value network with something that outputs a single value value_params[1][end] = w_θ_out value_params[2][end] = w_β_out # value_params = FCANN.initializeparams_saxe(input_length, hidden_layers, 1) #form activations for value network value_activations = FCANN.form_activations(value_params[1]) value_activations[end] = zeros(T, 1) value_tanh_grad_z = deepcopy(value_activations) value_deltas = deepcopy(value_activations) value_function(x, params) = fcann_value_function(x, params, value_activations, reslayers) function update_value_gradient!(∇v̂, x, value_params) update_fcann_value_gradient!(∇v̂, x, value_params, hidden_layers, l2, value_tanh_grad_z, value_activations, value_deltas, dropout, reslayers, activation_list, scales) use_μP && scale_fcann_params!(∇v̂, policy_setup.scales) end return (value_params = value_params, value_gradient = deepcopy(value_params), value_function = value_function, gradient_update = update_value_gradient!) 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$b2082ab0-73a4-45a6-8772-a2e6e22b519acell_id$b2082ab0-73a4-45a6-8772-a2e6e22b519acodebegin function beta_action_sampler(p1::T, p2::T) where T<:Real isnan(p1) && return p1 isnan(p2) && return p2 ϵ = eps(zero(T)) α = max(ϵ, exp(p1)) β = max(ϵ, exp(p2)) T(rand(Beta(α, β))) end beta_action_sampler(params::Vector{T}) where T<:Real = beta_action_sampler(params[1], params[2]) make_beta_n_sampler(::Val{1}) = beta_action_sampler function make_beta_n_sampler(::Val{N}) where N function f(params::Vector{T}) where T<:Real ntuple(i -> beta_action_sampler(params[i], params[i+N]), N) end end make_beta_n_sampler(n::Integer) = make_beta_n_sampler(Val(n)) make_beta_sampler(::T) where T<:Real = beta_action_sampler make_beta_sampler(::NTuple{N, T}) where {N, T<:Real} = make_beta_n_sampler(N) endmetadatashow_logsèdisabled®skip_as_script«code_folded$a361f4c9-47ce-42ad-899c-87b611c0d471cell_id$a361f4c9-47ce-42ad-899c-87b611c0d471codefunction update_binary_action_preferences!(action_preferences::Vector{T}, binary_features::BinaryFeatureVector, params::Matrix{T}) where T<:Real @inbounds for i_a in eachindex(action_preferences) action_preferences[i_a] = zero(T) @simd for i in 1:binary_features.num_features j = binary_features.active_features[i] action_preferences[i_a] += params[j, i_a] end end return action_preferences endmetadatashow_logsèdisabled®skip_as_script«code_folded$46fea69b-599e-46ab-8455-d2da865d9a8ecell_id$46fea69b-599e-46ab-8455-d2da865d9a8ecodeFconst mountaincar_continuing_mdp = create_mountaincar_continuing_mdp()metadatashow_logsèdisabled®skip_as_scriptëcode_folded$bfe7e41d-6318-4bd4-b892-287831876abccell_id$bfe7e41d-6318-4bd4-b892-287831876abccode7begin function update_beta_eligibility_vector!(∇lnπ::Matrix{T}, action_dist_params::Vector{T}, x::Vector{T}, action::T, policy_params::Matrix{T}) where T<:Real α = exp(first(action_dist_params)) β = exp(last(action_dist_params)) c1 = digamma(α + β) δ1 = (log(action) + c1 - 
digamma(α))*α @inbounds @simd for i in eachindex(x) ∇lnπ[i, 1] = x[i]*δ1 end δ2 = (log(one(T) - action) + c1 - digamma(β))*β @inbounds @simd for i in eachindex(x) ∇lnπ[i, 2] = x[i]*δ2 end end function update_beta_eligibility_vector!(∇lnπ::Matrix{T}, action_dist_params::Vector{T}, x::Vector{T}, action::NTuple{N, T}, policy_params::Matrix{T}) where {N, T <: Real} for k = 1:N α = exp(action_dist_params[k]) β = exp(action_dist_params[k+N]) c1 = digamma(α + β) δ1 = (log(action[k]) + c1 - digamma(α))*α @inbounds @simd for i in eachindex(x) ∇lnπ[i, k] = x[i]*δ1 end δ2 = (log(one(T) - action[k]) + c1 - digamma(β))*β @inbounds @simd for i in eachindex(x) ∇lnπ[i, k+N] = x[i]*δ2 end end end endmetadatashow_logsèdisabled®skip_as_script«code_folded$c251a630-7114-4188-9323-8d8feb5c32e0cell_id$c251a630-7114-4188-9323-8d8feb5c32e0codeCmountaincar_fcann_continuing_parameter_study(layer_size::Integer, num_layers::Integer, args...; kwargs...) = actor_critic_fcann_parameter_study(mountaincar_continuing_mdp, mountaincar_fcann_feature_setup.update_feature_vector!, mountaincar_fcann_feature_setup.num_features, fill(layer_size, num_layers), args...; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$af144759-fe66-4ad0-b378-e9eb4e859db4cell_id$af144759-fe66-4ad0-b378-e9eb4e859db4codeNplot_cartpole_policy(reinforce_test4.policy_and_value; s_ref = ep[1][ep_step])metadatashow_logsèdisabled®skip_as_script«code_folded$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2cell_id$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2codeIconst mountaincar_continuous_mdp = create_continuous_action_mountaincar()metadatashow_logsèdisabled®skip_as_script«code_folded$fb8904a9-ae64-41cc-93b6-5a25855edad0cell_id$fb8904a9-ae64-41cc-93b6-5a25855edad0codefunction get_corridor_episode_stats(p::Real; ntrials=10_000) 1:ntrials |> Map(_ -> runepisode(corridor_mdp; π = s -> (rand() < p) + 1) |> first |> length) |> foldxt(+) |> a -> a / ntrials 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$a5b002c9-5e11-462a-9da0-6e060c7963f8cell_id$a5b002c9-5e11-462a-9da0-6e060c7963f8code٦const ep2 = runepisode(cartpole_setup.mdps.episodic.discrete; π = reinforce_test5.policy_sample_action, max_steps = 1000, s0 = CartPoleState(30f0, 0.8f0, 0f0, -0f0))metadatashow_logsèdisabled®skip_as_script«code_folded$83640f5b-fe13-4ec1-98a0-67a56c189ba1cell_id$83640f5b-fe13-4ec1-98a0-67a56c189ba1code0function actor_critic_with_eligibility_traces!(policy_params::P1, ∇lnπ, value_params::P2, ∇v̂, mdp::StateMDP{T, S, A, PTF, F1, F2, F3}, λ_θ::T, λ_w::T, update_action_preferences!::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, value_function::Function, update_value_gradient!::Function, max_steps::Integer; α_w::T = one(T)/10, α_θ::T = one(T)/10, α_r̄ = one(T)/10, action_preferences = zeros(T, length(mdp.actions)), z_θ::P1 = deepcopy(policy_params), z_w::P2 = deepcopy(value_params), save_step_rewards = false) where {P1, P2, T<:Real, S, A, PTF, F1, F2, F3} step_rewards = Vector{T}() #initialize variables step = 1 r̄ = zero(T) zero_params!(z_θ) zero_params!(z_w) rtot = zero(T) s = mdp.initialize_state() update_feature_vector!(x, s) while step <= max_steps update_value_gradient!(∇v̂, x, value_params) v̂ = value_function(x, value_params) update_action_preferences!(action_preferences, x, policy_params) soft_max!(action_preferences) i_a = sample_action(action_preferences) update_eligibility_vector!(∇lnπ, action_preferences, x, i_a, policy_params) (r, s′) = mdp.ptf(s, i_a) rtot += r save_step_rewards && push!(step_rewards, r) step += 1 mdp.isterm(s′) && error("$s′ is a terminal state and this method only applies to continuing tasks") update_feature_vector!(x, s′) v̂′ = value_function(x, value_params) δ = r - r̄ + v̂′ - v̂ r̄ += α_r̄*δ update_traces_with_gradient!(λ_w, z_w, ∇v̂) update_traces_with_gradient!(λ_θ, z_θ, one(T), ∇lnπ) update_params_with_gradient!(value_params, α_w*δ, z_w) 
update_params_with_gradient!(policy_params, α_θ*δ, z_θ) s = s′ end function_outputs = form_state_and_policy_function_outputs(update_feature_vector!, update_action_preferences!, value_function, x, action_preferences, policy_params, value_params) return (; step_rewards = step_rewards, total_reward = rtot, total_steps = step - 1, policy_parameters = policy_params, value_parameters = value_params, function_outputs...) endmetadatashow_logsèdisabled®skip_as_script«code_folded$61650a97-b353-4a85-b50b-93fee296ac7bcell_id$61650a97-b353-4a85-b50b-93fee296ac7bcodeqconst cartpole_fcann_feature_setup = fcann_feature_vector_setup(cartpole_setup.min_vals, cartpole_setup.max_vals)metadatashow_logsèdisabled®skip_as_script«code_folded$602a07dd-8928-4b44-97e5-01c5cbf38351cell_id$602a07dd-8928-4b44-97e5-01c5cbf38351codefunction plot_cartpole_policy(policy_and_value::Function; θ̇_range = 1, npoints = 100, s_ref::CartPoleState = CartPoleState()) θs = LinRange(-1.2f0, 1.2f0, npoints) θ̇s = LinRange(-10f0, 10f0, npoints) value_output = zeros(Float32, npoints, npoints) policy_outputs = [zeros(Float32, npoints, npoints) for _ in 1:3] x = s_ref.x ẋ = s_ref.ẋ policy_output = policy_and_value(s_ref) policy_plot = plot(bar(x = [-1, 0, 1], y = policy_output.action_probabilities), Layout(height = 350, xaxis_title = "Policy Action", yaxis_title = "Action Probability", title = "Policy Distribution Function")) for i in 1:npoints for j in 1:npoints s = CartPoleState(x, θs[i], ẋ, θ̇s[j]) output = policy_and_value(s) value_output[i, j] = output.state_value_estimate for i_a in 1:3 policy_outputs[i_a][i, j] = output.action_probabilities[i_a] end end end reference_trace = scatter(x = [s_ref.θ], y = [s_ref.θ̇], name = "reference state", marker_color = "black", marker_symbol = "x") value_plot = plot([heatmap(x = θs, y = θ̇s, z = value_output, name = "value function"), reference_trace], Layout(xaxis_title = "Pole Angle in Radians", yaxis_title = "Pole Angular Velocity", title = "Value Estimate for x = $x and 
ẋ = $ẋ", height = 350)) policy_plots = [plot([heatmap(x = θs, y = θ̇s, z = policy_outputs[i_a], zmin = 0, zmax = 1), reference_trace], Layout(title = "Action $i_a", xaxis_title = "Pole Angle in Radians", yaxis_title = "Pole Angular Velocity", height = 350)) for i_a in 1:3] @htl("""
$(vcat(value_plot, policy_plot))
$policy_plots
""") # value_traces = [begin # states = [CartPoleState(0f0, θ, 0f0, θ̇) for θ in θs] # output = [policy_and_value(s) for s in states] # scatter(x = θs, y = [a.state_value_estimate for a in output], name = "θ̇ = $θ̇") # end # for θ̇ in θ̇s] # plot(value_traces, Layout(xaxis_title = "Pole Angle in Radians", yaxis_title = "State Value Estimate")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$f7433324-acc3-49a5-b5b3-ada0c8f09d52cell_id$f7433324-acc3-49a5-b5b3-ada0c8f09d52coderunepisode(corridor_mdp)metadatashow_logsèdisabled®skip_as_script«code_folded$0c9986bb-54c0-4b08-9c29-4bfb0b68b54ecell_id$0c9986bb-54c0-4b08-9c29-4bfb0b68b54ecode2function collect_state_distributions(;num_episodes::Integer = 1_000_000, p::T = 0.5) where T<:Real function add_vecs(x::Array{T, N}, y::Array{T, N}) where {T<:Real, N} l1 = size(x, 1) l2 = size(y, 1) (l1 == l2) && return x .+ y if l1 > l2 out = copy(x) for i in 1:l2 view(out, i, :) .+= view(y, i, :) end else out = copy(y) for i in 1:l1 view(out, i, :) .+= view(x, i, :) end end return out end function π(s) rand(T) <= p && return 1 return 2 end counts = 1:num_episodes |> Map() do _ (states, actions, rewards, _, l) = runepisode(corridor_mdp; π = π) state_visits = zeros(T, l, 3) @inbounds @simd for i in eachindex(states) s = states[i] state_visits[i, s] += one(T) end return state_visits end |> foldxt(add_vecs) counts ./ num_episodes endmetadatashow_logsèdisabled®skip_as_script«code_folded$6d0925d3-af96-4b94-8e2e-4941cce39e51cell_id$6d0925d3-af96-4b94-8e2e-4941cce39e51code const mountaincar_test_train = actor_critic_with_eligibility_traces_binary_features(MountainCarTask.mdp, 0.1f0, 0.9f0, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, typemax(Int64), 100_000; α_θ = 0.008f0, α_w = 0.004f0)metadatashow_logsèdisabled®skip_as_script«code_folded$6bb0263e-368e-462a-948c-baf9cfa82512cell_id$6bb0263e-368e-462a-948c-baf9cfa82512codeget_corridor_features(s) = 
1:1metadatashow_logsèdisabled®skip_as_script«code_folded$72273f27-d0b9-4645-a609-cb65cc9332eecell_id$72273f27-d0b9-4645-a609-cb65cc9332eecodeactor_critic_with_eligibility_traces_binary_features(corridor_mdp, 0f0, 0f0, get_corridor_features, 1, 100_000, α_θ = 2f0 ^ -4, α_w = 2f0 ^ -10, policy_params = [0f0 3.7f0]).policy_and_value(1)metadatashow_logsèdisabled®skip_as_script«code_folded$87482ea5-5265-4e02-92c0-1a8bb44ff0f4cell_id$87482ea5-5265-4e02-92c0-1a8bb44ff0f4codeUfunction actor_critic_binary_continuing_squashed_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, amax::A, get_active_features::Function, num_features::Integer, λ_θ::T, λ_w::T, α_θ_list::AbstractVector{T}, α_w_list::AbstractVector{T}, α_r̄::T, max_steps::Integer; nruns::Integer = 100, seed = rand(UInt64), init_policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), init_value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) function average_runs(α_θ, α_w) 1:nruns |> Map(_ -> actor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions(mdp, λ_θ, λ_w, get_active_features, num_features, max_steps; α_θ = α_θ, α_w = α_w, α_r̄ = α_r̄, policy_params = copy(init_policy_params), value_params = copy(init_value_params), kwargs...).total_reward) |> foldxt(+) |> x -> x / nruns / max_steps end traces = [begin scatter(x = α_θ_list, y = average_runs.(α_θ_list, α_w), name = "α_w = $α_w") end for α_w in α_w_list] plot(traces, Layout(xaxis_title = "α_θ", yaxis_title = "Average Reward Per Step in the First
$max_steps Steps Averaged Over $nruns Runs", xaxis_type = "log", title = "Binary Feature Encoding with $num_features Features, λ_θ = $λ_θ, λ_w = $λ_w, α_r̄ = $α_r̄")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$3bafd7df-9bc0-4d13-874d-739590cf3ad9cell_id$3bafd7df-9bc0-4d13-874d-739590cf3ad9codeSmd""" > ### *Exercise 13.2* > Generalize the proof of the policy gradient theorem and the steps leading to the REINFORCE update equation (13.8), so that (13.8) ends up with a factor of $\gamma^t$ and thus aligns with the general algorithm given in the pseudocode. See proof above in the section on proving the policy gradient theorem. """metadatashow_logsèdisabled®skip_as_script«code_folded$f27f2bcd-05b6-44fe-bf9e-a3e51556db7ccell_id$f27f2bcd-05b6-44fe-bf9e-a3e51556db7ccode6const cartpole_functions = create_cartpole_functions()metadatashow_logsèdisabled®skip_as_scriptëcode_folded$41dc149d-c6f3-4b0d-a856-06f3aaae3049cell_id$41dc149d-c6f3-4b0d-a856-06f3aaae3049code{mutable struct BinaryEligibilityVector{T, B <: BinaryFeatureVector} binary_features::B i_a::Int64 π_dist::Vector{T} endmetadatashow_logsèdisabled®skip_as_script«code_folded$38e5d800-4d43-40d2-87ea-f7d4b4283dabcell_id$38e5d800-4d43-40d2-87ea-f7d4b4283dabcodemd""" In order to find the p that maximizes the expected value for state 1, we differentiate with respect to p and set the result to 0: $\frac{\partial v_1}{\partial p} = -\frac{2p(1-p) - 2(1+p)(1 - 2p)}{p^2(1-p)^2}$ Setting this equal to 0 implies $\begin{flalign} p-p^2 &= 1 - 2p + p - 2p^2\\ p^2 + 2p - 1 &= 0 \\ \end{flalign}$ The quadratic formula gives two solutions, but since p is a probability and must be positive, we keep only the positive root. $p = \frac{-2 \pm \sqrt{4 + 4}}{2} = \frac{-2 \pm 2\sqrt{2}}{2} = -1 \pm \sqrt{2} \implies p = \sqrt{2} - 1 \approx 0.41421$ So, in order to maximize the value at state 1, we have $p_{\text{left}} \approx 0.414$ and $p_{\text{right}} \approx 0.586$. 
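As a quick numerical sanity check of the stationary point, the sketch below evaluates the closed-form value of state 1 on a grid of probabilities and compares the grid maximizer with the analytic root. The names `v1`, `p_opt`, `grid` and `p_grid` are illustrative only and are not defined elsewhere in this notebook; the grid spacing of 0.001 is an arbitrary choice.

```julia
# Value of state 1 when the left action is chosen with probability p,
# using the closed-form expression from the derivation above.
v1(p) = -2 * (1 + p) / (p * (1 - p))

# Positive root of p^2 + 2p - 1 = 0.
p_opt = sqrt(2) - 1

# Independent check: coarse grid search over the open interval (0, 1).
grid = 0.001:0.001:0.999
p_grid = grid[argmax(v1.(grid))]

# The two estimates should agree to within one grid step.
abs(p_grid - p_opt) < 1e-3
```

Because the value is unbounded below as p approaches 0 or 1, the grid deliberately excludes both endpoints.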
That also implies that $v_1 = -2\frac{1+p}{p(1-p)} = -2\frac{\sqrt{2}}{(\sqrt{2}-1)(2 - \sqrt{2})}= \frac{-2\sqrt{2}}{2 \sqrt{2} - 2 - 2 + \sqrt{2}} = \frac{-2 \sqrt{2}}{3\sqrt{2} - 4} \approx -11.657$ """metadatashow_logsèdisabled®skip_as_script«code_folded$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2cell_id$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2code"function one_step_actor_critic_binary_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, max_episodes::Integer, max_steps::Integer; policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} setup = setup_binary_policy_arguments(mdp, get_active_features, num_features) one_step_actor_critic!(policy_params, setup.eligibility_vector, value_params, BinaryFeatureVector(), mdp, update_binary_action_preferences!, update_binary_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, binary_value_function, update_binary_value_gradient!, max_episodes, max_steps; action_preferences = setup.action_preferences, kwargs...) endmetadatashow_logsèdisabled®skip_as_script«code_folded$73b90260-d57a-449a-8db6-47f91e6a4e4fcell_id$73b90260-d57a-449a-8db6-47f91e6a4e4fcode5md""" ### Eligibility Vector with Binary Features """metadatashow_logsèdisabled®skip_as_script«code_folded$5aba4f96-e877-457e-8e95-18737348f99fcell_id$5aba4f96-e877-457e-8e95-18737348f99fcode_actor_critic_fcann_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, hidden_layers::Vector{Int64}, params::@NamedTuple{λ_θ::T, λ_w::T, α_r̄::T, α_θ_min::Int64, α_w_min::Int64}, num_θ::Integer, num_w::Integer, max_steps::Integer; kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_fcann_parameter_study(mdp, update_feature_vector!, num_features, hidden_layers, params.λ_θ, params.λ_w, params.α_r̄, 2f0 .^(params.α_θ_min:params.α_θ_min+num_θ-1), 2f0 .^(params.α_w_min:params.α_w_min+num_w-1), max_steps; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486cell_id$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486codeِ@bind mountaincar_continuing_binary_params create_actor_critic_continuing_params_UI(λ_θ = 0.1f0, λ_w = 0.98f0, log2α_θ = -5, log2α_w = -8)metadatashow_logsèdisabled®skip_as_script«code_folded$27487ad0-4779-42ce-8def-e660ef04bee0cell_id$27487ad0-4779-42ce-8def-e660ef04bee0codeZreinforce_test4.policy_and_value(cartpole_setup.mdps.episodic.discrete.initialize_state())metadatashow_logsèdisabled®skip_as_script«code_folded$0d93132d-5819-47dc-8cf2-462d480d9c3dcell_id$0d93132d-5819-47dc-8cf2-462d480d9c3dcodeif run_mountaincar_binary_episodic_countinuous_param_study2 > 0 actor_critic_binary_episodic_squashed_gaussian_parameter_study(mountaincar_continuous_mdp, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, mountaincar_binary_continuous_params2, 4, 3, 1000; max_steps = 100_000, seed = 45) else md""" Waiting to run parameter study """ endmetadatashow_logsèdisabled®skip_as_script«code_folded$9978d537-49ff-4014-a971-b42704c50a6bcell_id$9978d537-49ff-4014-a971-b42704c50a6bcodeٍ@bind fcann_cartpole_study_params create_actor_critic_fcann_params_UI(;λ_θ = 0.95f0, λ_w = 0.2f0, h = 16, log2α_θ = -10, log2α_w = -11)metadatashow_logsèdisabled®skip_as_script«code_folded$f8215517-b18f-4a03-9421-8edab4ca8089cell_id$f8215517-b18f-4a03-9421-8edab4ca8089code`show_squashed_policy(mountaincar_continuous_test_train3.policy_function, test_mountaincar_state)metadatashow_logsèdisabled®skip_as_script«code_folded$1ac9296f-047b-4051-ba5c-0c23d5f9cde9cell_id$1ac9296f-047b-4051-ba5c-0c23d5f9cde9code>const corridor_continuing_mdp = 
make_corridor_continuing_mdp()metadatashow_logsèdisabled®skip_as_script«code_folded$c87dba8c-9a96-41b3-9dc7-a6c088ec1eafcell_id$c87dba8c-9a96-41b3-9dc7-a6c088ec1eafcodefshow_mountaincar_continuous_trajectory(mountaincar_continuous_test_train.policy_sample_action, 10_000)metadatashow_logsèdisabled®skip_as_script«code_folded$5cc4d12d-b537-47e2-8109-4e7a234fdf25cell_id$5cc4d12d-b537-47e2-8109-4e7a234fdf25codefunction make_corridor_mdp() function step(s::Integer, i_a::Integer) δ = 2*i_a - 3 #calculates the s change -1 for left (1) and 1 for right (2) switch = iseven(s) #returns true in state 2 which is where actions are switched, when switch is true, multiply δ by -1, otherwise by 1 c = 1 - 2*switch s′ = max(1, s + c*δ) (-1f0, s′) end actions = [:left, :right] ptf = StateMDPTransitionSampler(step, 1) StateMDP(actions, ptf, () -> 1, s -> s == 4) endmetadatashow_logsèdisabled®skip_as_script«code_folded$5334064b-5a16-4135-afa0-86a48291725bcell_id$5334064b-5a16-4135-afa0-86a48291725bcode corridor_train.value_function(1)metadatashow_logsèdisabled®skip_as_script«code_folded$9c342958-1971-48ec-b919-5dfdcbc915a4cell_id$9c342958-1971-48ec-b919-5dfdcbc915a4codecmd""" #### Change Plot Background Color $(@bind bgcolor ColorStringPicker(default = "#121212")) """metadatashow_logsèdisabled®skip_as_script«code_folded$966ef17c-23be-49dc-bc37-4cb52b34c049cell_id$966ef17c-23be-49dc-bc37-4cb52b34c049code$md""" #### Neural Network Method """metadatashow_logsèdisabled®skip_as_script«code_folded$e7e49ff8-32df-48a4-afb2-462859592e92cell_id$e7e49ff8-32df-48a4-afb2-462859592e92code1function form_state_and_policy_function_outputs(update_feature_vector!::Function, update_action_preferences!::Function, value_function::Function, feature_vector, action_preferences::Vector, policy_params, value_params) π! = form_state_policy_function(update_feature_vector!, update_action_preferences!) 
π(s; x = deepcopy(feature_vector), action_preferences = copy(action_preferences)) = π!(x, action_preferences, s, policy_params) π_sample(s; kwargs...) = sample_action(π(s; kwargs...)) v! = form_state_value_function(update_feature_vector!, value_function) estimate_state_value(s; x = deepcopy(feature_vector)) = v!(x, s, value_params) function policy_and_value(s; x = deepcopy(feature_vector), action_preferences = copy(action_preferences)) π!(x, action_preferences, s, policy_params) v̂ = value_function(x, value_params) return (action_probabilities = action_preferences, state_value_estimate = v̂) end (policy_function = π, policy_sample_action = π_sample, estimate_state_value = estimate_state_value, policy_and_value = policy_and_value) endmetadatashow_logsèdisabled®skip_as_script«code_folded$78c83673-2117-4542-b4d8-1c243e8f610bcell_id$78c83673-2117-4542-b4d8-1c243e8f610bcodemd""" #### Eligibility Vector Recall that for the Gaussian case with linear approximation we had: $\begin{flalign} \pi(a \vert s, \boldsymbol{\theta}) &= \frac{1}{\sqrt{2 \pi \sigma(s, \boldsymbol{\theta})^2}} \exp \left ( - \frac{(a - \mu(s, \boldsymbol{\theta}))^2}{2 \sigma(s, \boldsymbol{\theta})^2} \right )\\ \mu(s, \boldsymbol{\theta}) & \doteq \boldsymbol{\theta}_\mu ^ \top \mathbf{x}_\mu(s) \\ \sigma(s, \boldsymbol{\theta}) & \doteq \exp \left ( \boldsymbol{\theta}_\sigma ^ \top \mathbf{x}_\sigma(s) \right ) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\mu) &= \frac{1}{\sigma(s, \boldsymbol{\theta})^2} \left ( a - \mu(s, \boldsymbol{\theta}) \right ) \mathbf{x}_\mu(s) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\sigma) &= \left (\frac{(a - \mu(s, \boldsymbol{\theta}))^2}{\sigma(s, \boldsymbol{\theta})^2} - 1 \right )\mathbf{x}_\sigma(s) \\ \end{flalign}$ For the squashed Gaussian, the extra Jacobian factor does not depend on $\boldsymbol{\theta}$, so the previous results carry over to the new pdf with $a$ replaced by $\tanh^{-1}(a)$: $\begin{flalign} \pi(a \vert s, \boldsymbol{\theta}) &= \frac{1}{\sqrt{2 \pi \sigma(s, \boldsymbol{\theta})^2}} \exp \left ( - \frac{(\tanh^{-1}(a) - \mu(s, 
\boldsymbol{\theta}))^2}{2 \sigma(s, \boldsymbol{\theta})^2} \right ) \left \vert \frac{1}{1 - a^2} \right \vert\\ \mu(s, \boldsymbol{\theta}) & \doteq \boldsymbol{\theta}_\mu ^ \top \mathbf{x}_\mu(s) \\ \sigma(s, \boldsymbol{\theta}) & \doteq \exp \left ( \boldsymbol{\theta}_\sigma ^ \top \mathbf{x}_\sigma(s) \right ) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\mu) &= \frac{1}{\sigma(s, \boldsymbol{\theta})^2} \left ( \tanh^{-1}(a) - \mu(s, \boldsymbol{\theta}) \right ) \mathbf{x}_\mu(s) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\sigma) &= \left (\frac{(\tanh^{-1}(a) - \mu(s, \boldsymbol{\theta}))^2}{\sigma(s, \boldsymbol{\theta})^2} - 1 \right )\mathbf{x}_\sigma(s) \\ \end{flalign}$ """metadatashow_logsèdisabled®skip_as_script«code_folded$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8fcell_id$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8fcodeّshow_mountaincar_continuous_trajectory(mountaincar_continuous_test_train_beta.policy_sample_action, 1_000; mdp = mountaincar_continuous_beta_mdp)metadatashow_logsèdisabled®skip_as_script«code_folded$396e0047-d848-462f-a769-0cc2829abc78cell_id$396e0047-d848-462f-a769-0cc2829abc78codeactor_critic_with_eligibility_traces_binary_features(corridor_mdp, .5f0, .5f0, get_corridor_features, 1, typemax(Int64), 100_000, α_θ = 2f0 ^ -4, α_w = 2f0 ^ -10, policy_params = [0f0 3.7f0]).policy_and_value(1)metadatashow_logsèdisabled®skip_as_script«code_folded$ff4f977e-48df-4c12-845c-c245b4d39d6dcell_id$ff4f977e-48df-4c12-845c-c245b4d39d6dcodefunction actor_critic_linear_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, feature_function::Function, num_features::Integer, λ_θ_list::AbstractVector{T}, λ_w_list::AbstractVector{T}, α_r̄_list::AbstractVector{T}, α_θ_list::AbstractVector{T}, α_w_list::AbstractVector{T}, num_tests::Integer, max_steps::Integer; nruns::Integer = 100, seed = rand(UInt64), init_policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), binary_features = false, kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} if binary_features algo = actor_critic_with_eligibility_traces_binary_features title_prefix = "Binary Feature Encoding" else algo = actor_critic_with_eligibility_traces_linear_features title_prefix = "Linear Encoding" end run_test(α_θ, α_w, α_r̄, λ_θ, λ_w) = average_continuing_runs(nruns, seed, α_θ, α_w, α_r̄, init_policy_params, algo, mdp, λ_θ, λ_w, feature_function, num_features, max_steps; kwargs...) test_params = [(α_θ = rand(α_θ_list), α_w = rand(α_w_list), α_r̄ = rand(α_r̄_list), λ_θ = rand(λ_θ_list), λ_w = rand(λ_w_list)) for _ in 1:num_tests] DataFrame([begin output = run_test(params...) (;params..., output = output) end for params in test_params]) endmetadatashow_logsèdisabled®skip_as_script«code_folded$aa450da4-fe84-4eea-b6c4-9820b7982437cell_id$aa450da4-fe84-4eea-b6c4-9820b7982437code:md""" With continuous policy parametrization, we can smoothly vary action selection probabilities by arbitrarily small amounts, something that was not possible with ϵ-greedy action selection. Therefore stronger convergence guarantees are possible for policy-gradient methods than for action-value methods. In the episodic case, assuming some particular non-random starting state $s_0$, we define the performance of a policy parametrized by *θ* as: $\begin{align} J(\mathbf{\theta}) \doteq v_{\pi_\mathbf{\theta}}(s_0) \tag{13.4} \end{align}$ where $v_{\pi_\mathbf{\theta}}$ is the true value function for $\pi_\mathbf{\theta}$, the policy determined by $\mathbf{\theta}$. The *policy gradient theorem* provides an analytic expression for the gradient of performance with respect to the policy parameter that does *not* involve the derivative of the state distribution: $\begin{align} \nabla J(\mathbf{\theta}) \propto \sum_s \mu (s) \sum_a q_\pi (s, a) \nabla \pi (a|s,\mathbf{\theta}) \tag{13.5} \end{align}$ where the gradients are column vectors of partial derivatives with respect to the components of $\mathbf{\theta}$. 
In the episodic case, the constant of proportionality is the average length of an episode, and in the continuing case it is 1. The distribution here $\mu$ is the on-policy distribution under $\pi$. """ metadatashow_logsèdisabled®skip_as_script«code_folded$bb1ef180-39ac-475f-beea-ef573e71a3bfcell_id$bb1ef180-39ac-475f-beea-ef573e71a3bfcode7display_cartpole_episode((ep2 |> x -> (x[1], x[2]))...)metadatashow_logsèdisabled®skip_as_script«code_folded$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27cell_id$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27codeconst cartpole_continuing_fcann_test = actor_critic_with_eligibility_traces_fcann(cartpole_continuing_mdp, 0.25f0, 0.1f0, cartpole_fcann_feature_setup.num_features, [4, 4], cartpole_vector_update!, 300_000, α_θ = 0.015f0, α_w = 0.125f0, α_r̄ = 0.01f0; save_step_rewards=true)metadatashow_logsèdisabled®skip_as_script«code_folded$5b868eba-c1af-49f6-8f93-79b78c319a6fcell_id$5b868eba-c1af-49f6-8f93-79b78c319a6fcode #version of reinforce for general function approximation function reinforce_with_baseline_monte_carlo_control!(policy_params, ∇lnπ, value_params, ∇v̂, mdp::ContinuousMDP{T, S, A, PTF, F1, F2, F3}, update_action_distribution!::Function, action_dist_params::Vector{T}, action_sampler::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, value_function::Function, update_value_gradient!::Function, max_episodes::Integer; α_w::T = one(T)/10, α_θ::T = one(T)/10, γ::T = one(T), epkwargs...) where {T<:Real, S, A, PTF, F1, F2, F3} rewards = zeros(T, max_episodes) steps = zeros(Int64, max_episodes) π! = form_state_continuous_policy_function(update_feature_vector!, update_action_distribution!) π(s) = π!(x, action_dist_params, s, policy_params) π_sample(s) = action_sampler(π(s)) v! 
= form_state_value_function(update_feature_vector!, value_function) estimate_state_value(s) = v!(x, s, value_params) state_history, action_history, reward_history, _, _ = runepisode(mdp, π_sample, max_steps = 0) #initialize variables to update episodes for i in eachindex(rewards) # @info "On episode $i of $max_episodes" state_history, action_history, reward_history, sterm, nsteps = runepisode!((state_history, action_history, reward_history), mdp, π_sample, epkwargs...) g = zero(T) rtotal = zero(T) #iterate through episode beginning at the end for i in nsteps:-1:1 g = (γ * g) + reward_history[i] update_feature_vector!(x, state_history[i]) v̂ = value_function(x, value_params) δ = g - v̂ update_value_gradient!(∇v̂, x, value_params) c = α_w*δ update_params_with_gradient!(value_params, c, ∇v̂) update_eligibility_vector!(∇lnπ, action_dist_params, x, action_history[i], policy_params) c = α_θ * γ^(i-1) * δ update_params_with_gradient!(policy_params, c, ∇lnπ) rtotal += reward_history[i] end rewards[i] = rtotal steps[i] = nsteps end π2(s; feature_vector = deepcopy(x), action_dist_params = copy(action_dist_params)) = π!(feature_vector, action_dist_params, s, policy_params) π_sample2(s; kwargs...) 
= action_sampler(π2(s; kwargs...)) function policy_and_value(s::S) π!(x, action_dist_params, s, policy_params) v̂ = value_function(x, value_params) return (action_distribution_parameters = action_dist_params, sampler_function = () -> action_sampler(action_dist_params), state_value_estimate = v̂) end return (episode_rewards = rewards, episode_steps = steps, policy_function = π2, policy_sample_action = π_sample2, policy_parameters = policy_params, estimate_state_value = estimate_state_value, value_parameters = value_params, policy_and_value = policy_and_value) endmetadatashow_logsèdisabled®skip_as_script«code_folded$68469a40-7976-48b7-b7a1-eaa4c5f33a18cell_id$68469a40-7976-48b7-b7a1-eaa4c5f33a18code)function plot_mountaincar_continuous_values(policy_and_value::Function; n1 = 100, n2 = 100) xvals = LinRange(-1.2f0, 0.5f0, n1) vvals = LinRange(-0.07f0, 0.07f0, n2) values = zeros(Float32, n1, n2) action_p1 = zeros(Float32, n1, n2) action_p2 = zeros(Float32, n1, n2) for (i, x) in enumerate(xvals) for (j, v) in enumerate(vvals) dist, v̂ = policy_and_value((x, v)) values[j, i] = v̂ action_p1[j, i] = dist[1] action_p2[j, i] = dist[2] end end p1 = plot(heatmap(x = xvals, y = vvals, z = values), Layout(xaxis_title = "position", yaxis_title = "velocity", title = "Learned Value Function", height = 400)) p2 = plot(heatmap(x = xvals, y = vvals, z = action_p1, colorscale = "rb"), Layout(xaxis_title = "position", yaxis_title = "velocity", title = "Policy Parameter 1", height = 400)) p3 = plot(heatmap(x = xvals, y = vvals, z = action_p2, colorscale = "rb"), Layout(xaxis_title = "position", yaxis_title = "velocity", title = "Policy Parameter 2", height = 400)) @htl("""
$p1 $p2 $p3
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$2a586e46-66e4-461a-85c8-5817e4d1aa43cell_id$2a586e46-66e4-461a-85c8-5817e4d1aa43codemd""" $\begin{flalign} \nabla J(\boldsymbol{\theta}) &= \nabla v_\pi(s_0) \\ &= \sum_s \sum_k \gamma^k \Pr \{ s_0 \rightarrow s, k, \pi \} f(s) \\ &= \sum_s \sum_k \gamma^k \frac{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}}{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}} \Pr \{ s_0 \rightarrow s, k, \pi \} f(s) \tag{multiply by 1}\\ &= \eta \sum_s \sum_k \gamma^k \frac{\Pr \{ s_0 \rightarrow s, k, \pi \}}{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}} f(s) \tag{average episode length}\\ &= \eta \sum_s \sum_k \gamma^k \mu_\pi(s, k) f(s) \tag{on policy distribution over states and steps}\\ &= \eta \mathbb{E}_\pi[ \gamma^k f(s) \mid S_0 = s_0, S_k = s] \tag{definition of expected value}\\ &\propto \mathbb{E}_\pi \left [ \gamma^k \sum_a \nabla \pi(a \vert s) q_\pi(s, a) \mid S_0 = s_0, S_k = s \right ] \tag{13.5}\\ \end{flalign}$ """metadatashow_logsèdisabled®skip_as_script«code_folded$a206c759-3f6e-4003-8cba-5f6ce6742646cell_id$a206c759-3f6e-4003-8cba-5f6ce6742646codeٿmd""" ### Figure 13.1 REINFORCE on short-corridor gridworld (Example 13.1). Performance varies with step size but can approach the ideal. Feature vector encodes every state identically. """metadatashow_logsèdisabled®skip_as_script«code_folded$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1dcell_id$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1dcodemd""" Consider the linear parameterization proposed with $h_a = \boldsymbol{\theta}^\top \mathbf{x}(s, a)$: $\frac{\partial{h_a}}{\partial{\theta_i}} = \mathbf{x}(s, a)_i \implies \nabla(\pi(a \vert s, \boldsymbol{\theta}))_i = \pi_a \left ( \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \right)$ Now consider $\mathbf{h} = \theta ^ \top \mathbf{x}$ with $h_a = \mathbf{h}_a$. 
Since the parameters are now represented as a matrix, we can also index the gradient partial derivatives such that $\nabla \left ( f(\theta) \right )_{i, j} = \frac{\partial f(\theta)}{\partial \theta_{i, j}}$ $\frac{\partial{h_a}}{\partial{\theta_{i, j}}} = \begin{cases} \mathbf{x}(s)_i, & \text{ if } j = a \\ 0, & \text{ else } \end{cases} \implies \nabla(\pi(s, \boldsymbol{\theta})_a)_{i, j} = \pi_a \left ( \frac{\partial h_a}{\partial \theta_{i, j}} - \sum_k \pi_k \frac{\partial h_k}{\partial \theta_{i, j}} \right)=\pi_a \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases}$ """metadatashow_logsèdisabled®skip_as_script«code_folded$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5cell_id$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5code+md""" ### Beta Distribution Alternative """metadatashow_logsèdisabled®skip_as_script«code_folded$31db0f58-28e4-454f-9394-25565687266fcell_id$31db0f58-28e4-454f-9394-25565687266fcodexdisplay_cartpole_episode((runepisode(cartpole_mdps.episodic.continuous, s -> Float32(randn())) |> x -> (x[1], x[2]))...)metadatashow_logsèdisabled®skip_as_script«code_folded$822e4d69-2582-4956-858e-06ecb091e76acell_id$822e4d69-2582-4956-858e-06ecb091e76acode^function display_cartpole_episode(states::Vector{S}, actions::Vector) where S<:CartPoleState fields = [:x, :θ, :ẋ, :θ̇] names = ["x", "θ", "ẋ", "θ̇"] yaxes = ["y", "y2", "y", "y2"] x = [s.t for s in states] #time history in seconds state_traces = [begin y = [getfield(s, f) for s in states] scatter(x = x, y = y, name = names[i], yaxis = yaxes[i]) end for (i, f) in enumerate(fields)] plot(state_traces, Layout(xaxis_title = "Time(s)", yaxis_title = "Horizontal Position", yaxis2 = attr(title = "Pole Angle (Radians)", overlaying = "y", side = "right"), legend_orientation = "h")) 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580acell_id$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580acodeVplot_mountaincar_continuous_values(mountaincar_continuous_test_train.policy_and_value)metadatashow_logsèdisabled®skip_as_script«code_folded$05b0fcad-628b-48d2-aa24-f6f562dbb660cell_id$05b0fcad-628b-48d2-aa24-f6f562dbb660code_md""" $\begin{flalign} &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \sum_{a^{\prime \prime}} [ \nabla \pi(a^{\prime \prime} \vert s^{\prime \prime}) q_\pi(s^{\prime \prime}, a^{\prime \prime})\right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) f(s^{\prime \prime}) \right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{s^{\prime \prime}} f(s^{\prime \prime}) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) p(s^{\prime \prime} \vert s^\prime, a^\prime) \right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \mathbb{E}_\pi[f(s^{\prime \prime}) \vert s^\prime] \right ] \\ &\gamma^2 \mathbb{E}_\pi[f(s^{\prime \prime}) \vert s] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) g(s^\prime) \right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \mathbb{E}[g(s^\prime) \vert s, a] \right ] \\ &\gamma^2 \mathbb{E}_\pi[g(s^\prime) \vert s]\\ &\gamma^2 \sum_{s^{\prime \prime}} \Pr(s \rightarrow s^{\prime \prime}, 2, \pi) f(s^{\prime \prime}) \end{flalign}$ """metadatashow_logsèdisabled®skip_as_script«code_folded$d2729657-d0bf-4d39-8ec7-f242a1ad48d6cell_id$d2729657-d0bf-4d39-8ec7-f242a1ad48d6codefunction create_continuous_action_mountaincar_beta() #if we sample actions from a beta distribution then the action will 
always be bounded between 0 and 1. this step function rescales it to -1 to 1 mdp = MountainCarTask.mdp function step(s, a) f = 2f0*(a - 0.5f0) (-1f0, MountainCarTask.step(s, f)) end ContinuousMDP(step, mdp.initialize_state, 0f0; isterm = mdp.isterm) endmetadatashow_logsèdisabled®skip_as_script«code_folded$5c11a92d-7496-4aba-af15-2537eac49dd7cell_id$5c11a92d-7496-4aba-af15-2537eac49dd7code Map(_ -> actor_critic_with_eligibility_traces_fcann(mdp, λ_θ, λ_w, num_features, hidden_layers, update_feature_vector!, max_episodes, max_steps; α_θ = α_θ, α_w = α_w, kwargs...) |> x -> isempty(x.episode_rewards) ? missing : mean(x.episode_rewards)) |> Filter(!ismissing) |> tcollect |> x -> isempty(x) ? missing : mean(x) end traces = [begin scatter(x = α_θ_list, y = average_runs.(α_θ_list, α_w), name = "α_w = $α_w") end for α_w in α_w_list] plot(traces, Layout(xaxis_title = "α_θ", yaxis_title = "Average Reward Per Episode in the First
$max_episodes Episodes Averaged Over $nruns Runs", xaxis_type = "log", title = "$num_features Inputs, $hidden_layers Hidden Non Linear, λ_θ = $λ_θ, λ_w = $λ_w")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$76eb6743-cac0-4174-9ba3-a0691c200b54cell_id$76eb6743-cac0-4174-9ba3-a0691c200b54code٩begin make_n_param_dist_params(n::Integer, ::T) where T<:Real = zeros(T, n) make_n_param_dist_params(n::Integer, ::NTuple{N, T}) where {N, T<:Real} = zeros(T, n*N) endmetadatashow_logsèdisabled®skip_as_script«code_folded$94517664-6988-44dc-a297-e9d5873ee540cell_id$94517664-6988-44dc-a297-e9d5873ee540codeN@bind squashed_gaussian_plot_params PlutoUI.combine() do Child md""" ### Squashed Gaussian Plot Parameters $$\mu$$: $(Child(Slider(-4:.1:4, default = 0, show_value=true))) $$\sigma$$: $(Child(Slider(0.1:0.1:2, default = .5, show_value=true))) maximum value: $(Child(Slider(.1:0.1:2., default = 1, show_value=true))) """ endmetadatashow_logsèdisabled®skip_as_script«code_folded$d037ea92-915c-4dc7-97c6-d006d92e088acell_id$d037ea92-915c-4dc7-97c6-d006d92e088acodeefunction figure_13_1(α_list; nruns = 100, num_episodes = 1_000, max_steps = 1_000) Random.seed!(45) function average_runs(α) 1:nruns |> Map(_ -> reinforce_monte_carlo_control_binary_features(corridor_mdp, get_corridor_features, 1, num_episodes; params = [0f0 3.7f0], α = α, max_steps = max_steps).episode_rewards) |> foldxt((a, b) -> a .+ b) |> v -> v ./ nruns end traces = [begin out = average_runs(α) scatter(x = 1:num_episodes, y = out, name = "α = 2^$(round(Int64, log2(α)))") end for α in α_list] baselinetrace = scatter(x = 1:num_episodes, y = fill(-2*sqrt(2) / (3*sqrt(2) - 4), num_episodes), name = "ideal value", line_dash = "dash", line_color = "gray") plot([baselinetrace; traces], Layout(yaxis_range = [-90, -10], yaxis_title = "Total reward on episode
(averaged over $nruns runs)", xaxis_title = "Episode", width = 800)) endmetadatashow_logsèdisabled®skip_as_script«code_folded$24fa139c-ad4b-49db-ac8f-23c476ed8608cell_id$24fa139c-ad4b-49db-ac8f-23c476ed8608codeconst reinforce_test = reinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions(cartpole_setup.mdps.episodic.continuous, cartpole_setup.get_active_features, cartpole_setup.num_features, 10_000; α_θ = 2f0 ^-14, α_w = 2f0 ^-6)metadatashow_logsèdisabled®skip_as_script«code_folded$2025ff38-f2ec-4224-b771-ff72ffe1af28cell_id$2025ff38-f2ec-4224-b771-ff72ffe1af28code.const mountaincar_min_vals = (-1.2f0, -0.07f0)metadatashow_logsèdisabled®skip_as_script«code_folded$cb70d400-3e9c-441c-b17c-e727e8c928f3cell_id$cb70d400-3e9c-441c-b17c-e727e8c928f3codeif start_mountaincar_continuing_fcann_param_study > 0 mountaincar_fcann_continuing_parameter_study(32, 3, mountaincar_continuing_fcann_params, 5, 3, 1_000_000; seed = 45) else md""" Waiting to run parameter study """ endmetadatashow_logsèdisabled®skip_as_script«code_folded$e034b9cb-f4ee-46f4-bea6-72c93c75d966cell_id$e034b9cb-f4ee-46f4-bea6-72c93c75d966codeusing DataFramesmetadatashow_logsèdisabled®skip_as_script«code_folded$e6cf9550-2e69-4b82-92cf-5e07a35490aacell_id$e6cf9550-2e69-4b82-92cf-5e07a35490aacodebegin zero_params!(params::Array{T, N}) where {N, T<:Real} = params .= zero(T) function zero_params!(params::FCANNParams) for i = 1:2 for j in eachindex(params[i]) zero_params!(params[i][j]) end end end end metadatashow_logsèdisabled®skip_as_script«code_folded$717e4c69-59d5-4929-923f-dd35a97fb160cell_id$717e4c69-59d5-4929-923f-dd35a97fb160codeactor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, λ_θ::T, λ_w::T, get_active_features::Function, num_features::Integer, args...; kwargs...) 
where {T<:Real, S, N, A <: Union{T, NTuple{N, T}}, P, F1, F2, F3} = actor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions(mdp, one(T), λ_θ, λ_w, get_active_features, num_features, args...; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$1386ffdb-940d-4f1b-a872-4e38647b5335cell_id$1386ffdb-940d-4f1b-a872-4e38647b5335codemd""" #### Test One-step Actor-Critic The following function calls execute the One-step Actor-Critic algorithm on Example 13.1. The output displayed is the policy function acting on the single state representation for the problem. The two values represent the probability of taking the left and right action respectively. If converged properly, the right action probability should be higher, approaching a value of about 60%. """metadatashow_logsèdisabled®skip_as_script«code_folded$a893a87b-2d07-4db5-9d1a-9da8646216f4cell_id$a893a87b-2d07-4db5-9d1a-9da8646216f4codefunction update_params_with_gradient!(w::Vector{T}, α::T, ∇w::BinaryFeatureVector) where {T<:Real} @inbounds @simd for i in 1:∇w.num_features j = ∇w.active_features[i] w[j] += α end return w endmetadatashow_logsèdisabled®skip_as_script«code_folded$2cbc972b-c685-4c1c-8a8d-9d58b197ad90cell_id$2cbc972b-c685-4c1c-8a8d-9d58b197ad90codeٹfunction update_binary_value_params!(params::Vector{T}, active_features::BinaryFeatures, c::T) where T<:Real @inbounds for i in active_features params[i] += c end return params endmetadatashow_logsèdisabled®skip_as_script«code_folded$37ec6802-d4c2-4470-ad69-439d5a732f77cell_id$37ec6802-d4c2-4470-ad69-439d5a732f77codefunction form_state_policy_function(update_feature_vector!::Function, update_action_preferences!::Function) function π!(x, action_preferences, s, params) update_feature_vector!(x, s) update_action_preferences!(action_preferences, x, params) soft_max!(action_preferences) end 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$98222fcd-b456-477c-90dd-844df36877e5cell_id$98222fcd-b456-477c-90dd-844df36877e5codeKplot_continuing_step_rewards(mountaincar_continuing_tile_test.step_rewards)metadatashow_logsèdisabled®skip_as_script«code_folded$f7f58fd2-facc-4b87-9172-5e911677c8f4cell_id$f7f58fd2-facc-4b87-9172-5e911677c8f4codez#for an episode progressing, show the point in the state space that the cart exists and use the value of x and ẋ in thatmetadatashow_logsèdisabled®skip_as_script«code_folded$58403c8e-0ee4-4466-ba25-ee0c86fb0b47cell_id$58403c8e-0ee4-4466-ba25-ee0c86fb0b47code md""" Consider $\mathbf{x}(s)$ and $\mathbf{h}(s, \boldsymbol{\theta})$ which produces a vector of action preferences. We would like to derive an expression for $\nabla \ln \pi (a \vert s, \boldsymbol{\theta})$ in the case of $\mathbf{\pi}(s, \boldsymbol{\theta}) = \sigma(\mathbf{h}(s, \boldsymbol{\theta}))$ where $\sigma(\mathbf{x})$ is the softmax function defined in section 13.1. Here I'm using the notation $\mathbf{\pi}(s, \boldsymbol{\theta})$ to refer to the vector of action probabilities at a given state. The subscript on the vector refers to selecting that element from the vector. 
To shorten expressions, the following terms are equivalent: $\begin{flalign} \mathbf{\pi} &\doteq \mathbf{\pi}(s, \boldsymbol{\theta}) \\ \mathbf{h} &\doteq \mathbf{h}(s, \boldsymbol{\theta}) \\ x_i &\doteq \mathbf{x}_i \text{ for all vectors} \\ \end{flalign}$ Using these conventions, we previously had an expression for the ith component of the gradient of the policy: $\nabla \left( \pi_a \right )_i = \pi_a \left ( \frac{\partial{h_a}}{\partial{\theta_i}} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \right )$ We can use this expression to derive the components of the eligibility vector in general: $\begin{flalign} \nabla \left( \ln \mathbf{\pi}_a \right)_i &= \frac{\nabla \left( \pi_a \right )_i}{\pi_a}\\ &=\frac{\partial{h_a}}{\partial{\theta_i}} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \\ \end{flalign}$ ### Connection to Cross-Entropy Loss Classification problems involve training a function to predict the class label of an input. The function returns a vector of class preferences which can be converted to a probability distribution by the soft-max function. The cross-entropy loss is a way of comparing this distribution with the desired output label to generate an error value. Let's denote $\mathbf{p}(s)$ as the vector of true probabilities for an example $s$ and keep our output function as $\pi(s,\theta) = \sigma(\mathbf{h}(s, \boldsymbol{\theta}))$. The cross entropy loss is defined as: $\mathcal{L}(\mathbf{p}, \mathbf{\pi}) = -\sum_i \mathbf{p}_i \ln \mathbf{\pi}_i$ omitting $s$ and $\boldsymbol{\theta}$. In a typical situation with a dataset, $\mathbf{p}(s)$ will be a one-hot vector representing the index of label of the example in the dataset. Let's call that index $a$ such that $p_a = 1$ and $p_i = 0 \: \forall i \neq a$. The loss then simplifies to $\mathcal{L}(a, \mathbf{\pi}) = -\ln \mathbf{\pi}_a$. 
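
The simplification above can be checked numerically. The following is a minimal sketch, not the notebook's implementation: the helper names `softmax`, `cross_entropy`, and `onehot`, and the preference values, are assumptions chosen for illustration. It also evaluates the eligibility formula for the special case where the parameters are the preferences themselves, so that $\partial h_k / \partial \theta_i$ is 1 when $k = i$ and 0 otherwise.

```julia
# Numerically stable soft-max over a vector of action preferences.
function softmax(h::AbstractVector{<:Real})
    e = exp.(h .- maximum(h))
    return e ./ sum(e)
end

# Cross-entropy between a target distribution p and a predicted distribution q.
cross_entropy(p, q) = -sum(p .* log.(q))

# One-hot vector of length n with a one at index a.
onehot(a, n) = [i == a ? 1.0 : 0.0 for i in 1:n]

h = [0.5, -1.0, 2.0]   # hypothetical action preferences
a = 3                  # hypothetical label / selected action index
probs = softmax(h)
loss = cross_entropy(onehot(a, length(h)), probs)   # same value as -log(probs[a])

# Gradient of ln(probs[a]) with respect to the preferences themselves:
# the eligibility formula with ∂h_k/∂θ_i replaced by the indicator k == i.
eligibility = onehot(a, length(h)) .- probs

# Central finite-difference estimate of the same gradient.
log_prob(v) = log(softmax(v)[a])
ϵ = 1e-6
numeric = [(log_prob(h .+ ϵ .* onehot(i, 3)) - log_prob(h .- ϵ .* onehot(i, 3))) / (2ϵ)
           for i in 1:3]
```

Under these assumptions, `loss` agrees with `-log(probs[a])` and `numeric` agrees with `eligibility` up to finite-difference error.
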
When we train with gradient descent on such a dataset, we must compute the gradient of this loss with respect to the parameters, or $-\nabla \ln \pi_a$, which is just negative one times the eligibility vector for general parameterized approximation. So if we have a function that computes the gradient of the cross entropy loss of the soft-max output for a vector function and a label index, we can replace the label index of the dataset with the desired action index $a$ and then that gradient will match our desired gradient after multiplying by negative one. """metadatashow_logsèdisabled®skip_as_script«code_folded$e1aec891-d95a-47d1-97d7-d2a4cfb16e64cell_id$e1aec891-d95a-47d1-97d7-d2a4cfb16e64codefunction setup_fcann_policy_and_value_arguments(policy_params::FCANNParams{T}, input_length::Integer, hidden_layers::Vector{Int64}, reslayers::Integer, l2::T, dropout::T, use_μP::Bool, activation_list) where {T<:Real} policy_setup = setup_fcann_policy_arguments(policy_params::FCANNParams{T}, input_length::Integer, hidden_layers::Vector{Int64}, reslayers::Integer, l2::T, dropout::T, use_μP::Bool, activation_list) value_setup = setup_fcann_value_arguments(policy_setup, input_length::Integer, hidden_layers::Vector{Int64}, reslayers::Integer, l2::T, dropout::T, use_μP::Bool, activation_list, policy_setup.scales) (;policy_setup..., value_setup...) endmetadatashow_logsèdisabled®skip_as_script«code_folded$3d065608-eef2-4caa-b17d-ec60714e3d58cell_id$3d065608-eef2-4caa-b17d-ec60714e3d58code;actor_critic_binary_episodic_beta_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, params::@NamedTuple{λ_θ::T, λ_w::T, α_θ_min::Int64, α_w_min::Int64}, num_θ::Integer, num_w::Integer, num_episodes::Integer; kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_binary_episodic_beta_parameter_study(mdp, get_active_features, num_features, params.λ_θ, params.λ_w, 2f0 .^(params.α_θ_min:params.α_θ_min+num_θ-1), 2f0 .^(params.α_w_min:params.α_w_min+num_w-1), num_episodes; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$b87ff1a9-abff-40f7-a1d8-f751a1c8b060cell_id$b87ff1a9-abff-40f7-a1d8-f751a1c8b060code9md""" In the episodic case, we provided a reward of -1 per step and then considered an episode finished when a failure state was reached. In the continuing case, the step function will provide a reward of 0 unless a failure occurs in which case it will provide a reward of -1 and then initialize a new state. """metadatashow_logsèdisabled®skip_as_script«code_folded$e89bdc84-dbb5-4c73-a39c-6392e5f79704cell_id$e89bdc84-dbb5-4c73-a39c-6392e5f79704codeمplot_mountaincar_values(mountaincar_continuing_tile_test.estimate_state_value, mountaincar_continuing_tile_test.policy_sample_action)metadatashow_logsèdisabled®skip_as_script«code_folded$d3b56fca-5b79-4465-8987-8d0005f854d8cell_id$d3b56fca-5b79-4465-8987-8d0005f854d8codeconst reinforce_test2 = reinforce_with_baseline_monte_carlo_control_binary_features(cartpole_setup.mdps.episodic.discrete, cartpole_setup.get_active_features, cartpole_setup.num_features, 10_000; α_θ = 2f0 ^-14, α_w = 2f0 ^-8)metadatashow_logsèdisabled®skip_as_script«code_folded$d21617aa-6f38-4a90-8586-4b32022497adcell_id$d21617aa-6f38-4a90-8586-4b32022497adcode'cartpole_setup.mdps.continuing.discretemetadatashow_logsèdisabled®skip_as_script«code_folded$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2cell_id$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2codeْplot_cartpole_policy(cartpole_continuing_test.policy_and_value; s_ref = CartPoleState(sref_cartpole_binary.x, 0f0, sref_cartpole_binary.ẋ, 0f0))metadatashow_logsèdisabled®skip_as_script«code_folded$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7dcell_id$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7dcode;const mountaincar_continuous_test_train3 
= actor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions(mountaincar_continuous_mdp, 0.2f0, 0.99f0, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, typemax(Int64), 1_000_000; α_θ = 1f-5, α_w = 0.0001f0)metadatashow_logsèdisabled®skip_as_script«code_folded$d82e7ab8-c372-4462-afb5-1617560cdb56cell_id$d82e7ab8-c372-4462-afb5-1617560cdb56codeّplot_mountaincar_values(mountaincar_continuous_test_train_beta.estimate_state_value, mountaincar_continuous_test_train_beta.policy_sample_action)metadatashow_logsèdisabled®skip_as_script«code_folded$3c89209c-9202-4d5d-841c-ea34be369616cell_id$3c89209c-9202-4d5d-841c-ea34be369616codeHconst cartpole_continuing_test = actor_critic_with_eligibility_traces_binary_features(cartpole_continuing_mdp, 0.95f0, 0.8f0, s -> cartpole_tilecoding_setup.get_active_features((s.x, s.θ, s.ẋ, s.θ̇)), cartpole_tilecoding_setup.num_features, 30_000, α_θ = .125f0, α_w = 0.006f0, α_r̄ = 0.01f0, save_step_rewards = true)metadatashow_logsèdisabled®skip_as_script«code_folded$635abb34-2c97-4f04-a74c-22fbec32f408cell_id$635abb34-2c97-4f04-a74c-22fbec32f408codefunction fcann_value_function(x::Vector{T}, params::FCANNParams, activations::FCANNActivations{T}, reslayers::Integer) where T<:Float32 FCANN.forwardNOGRAD_base!(activations, params..., x, reslayers) return first(last(activations)) endmetadatashow_logsèdisabled®skip_as_script«code_folded$0bf3b988-b3fb-49d5-8dde-b25766596363cell_id$0bf3b988-b3fb-49d5-8dde-b25766596363codeMlinear_value_function(x::Vector{T}, w::Vector{T}) where {T<:Real} = dot(x, w)metadatashow_logsèdisabled®skip_as_script«code_folded$d8222abf-139c-4220-8e92-cc987ec6900ccell_id$d8222abf-139c-4220-8e92-cc987ec6900ccodemd""" Note that for the corridor problem, the state-value learning rates have very little impact and learning is most effective when $\lambda_{\boldsymbol{\theta}}$ is close to 1 which mimics REINFORCE with baseline. 
"""metadatashow_logsèdisabled®skip_as_script«code_folded$68e6f17e-8c87-40f0-a673-1115ecd1b71dcell_id$68e6f17e-8c87-40f0-a673-1115ecd1b71dcodepmd""" > ### *Exercise 13.5* > A *Bernoulli-logistic unit* is a stochastic neuron-like unit used in some ANNs. Its input at time *t* is a feature vector $\mathbf{x}(S_t)$; its output, $A_t$, is a random variable having two values, 0 and 1, with $\Pr \{A_t=1 \}=P_t$ and $\Pr\{A_t=0\}=1-P_t$ (the Bernoulli distribution). Let $h(s, 0, \mathbf{\theta})$ and $h(s, 1, \mathbf{\theta})$ be the preferences in state $s$ for the unit's two actions given by policy parameter $\mathbf{\theta}$. Assume that the difference between the action preferences is given by a weighted sum of the unit's input vector, that is, assume that $h(s, 1, \mathbf{\theta})-h(s,0, \mathbf{\theta}) = \mathbf{\theta}^\top \mathbf{x}(s)$, where $\mathbf{\theta}$ is the unit's weight vector. > 1. Show that if the exponential soft-max distribution (13.2) is used to convert action preferences to policies, then ${P_t = \pi(1|S_t, \theta_t)=1/(1+\exp(-\theta_t^\top\mathbf{x}(S_t)))}$ (the logistic function). > 2. What is the Monte-Carlo REINFORCE update of $\theta_t$ to $\theta_{t+1}$ upon receipt of return $G_t$? > 3. Express the eligibility $\nabla \ln \pi(a|s, \theta)$ for a Bernoulli-logistic unit, in terms of $a$, $\mathbf{x}(s)$, and $\pi(a|s, \theta)$ by calculating the gradient. > Hint for part 3: Define $P=\pi(1|s,\theta)$ and compute the derivative of the logarithm, for each action, using the chain rule on $P$. Combine the two results into one expression that depends on $a$ and $P$, and then use the chain rule again, this time on $\theta^\top\mathbf{x}(s)$, noting that the derivative of the logistic function $f(x)=1/(1+e^{-x})$ is $f(x)(1-f(x))$. 
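
Parts 1 and 3 can be sanity-checked numerically before working through the algebra. This is a minimal sketch under stated assumptions: the names `prob_one`, `log_policy`, and `numeric_grad`, and the values of `θ` and `x`, are hypothetical and not part of the exercise. The candidate eligibility `(a - P) .* x` is the expression the hint leads to, compared here against central finite differences.

```julia
logistic(z) = 1 / (1 + exp(-z))

# Probability of action 1 under a two-action soft-max whose preferences
# satisfy h1 - h0 = θ'x. The baseline h0 is arbitrary and should cancel.
function prob_one(θ, x; h0 = 0.0)
    h1 = h0 + sum(θ .* x)
    return exp(h1) / (exp(h0) + exp(h1))
end

# ln π(a | s, θ) for a in {0, 1}.
log_policy(a, θ, x) = a == 1 ? log(prob_one(θ, x)) : log(1 - prob_one(θ, x))

θ = [0.3, -0.7, 1.1]   # hypothetical weight vector
x = [1.0, 2.0, -0.5]   # hypothetical feature vector
P = prob_one(θ, x)

# Central finite-difference gradient of ln π(a | s, θ) with respect to θ.
ϵ = 1e-6
basis(i) = [j == i ? 1.0 : 0.0 for j in eachindex(θ)]
numeric_grad(a) = [(log_policy(a, θ .+ ϵ .* basis(i), x) -
                    log_policy(a, θ .- ϵ .* basis(i), x)) / (2ϵ)
                   for i in eachindex(θ)]
```

With these values, `P` matches `logistic(sum(θ .* x))` for any choice of `h0`, and `numeric_grad(a)` matches `(a - P) .* x` for both actions up to finite-difference error.
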
"""metadatashow_logsèdisabled®skip_as_script«code_folded$cf1859d6-f889-4923-8c87-2d7c039f26c3cell_id$cf1859d6-f889-4923-8c87-2d7c039f26c3codeDrunepisode(cartpole_mdps.episodic.continuous, s -> Float32(randn()))metadatashow_logsèdisabled®skip_as_script«code_folded$5500fd8e-64cb-4af7-808d-230440746319cell_id$5500fd8e-64cb-4af7-808d-230440746319code/md""" ### *Continuing Mountain Car Example* """metadatashow_logsèdisabled®skip_as_script«code_folded$76d54520-baa3-44bf-b303-4cdcb8b87080cell_id$76d54520-baa3-44bf-b303-4cdcb8b87080codeكbegin make_sample_vector(::T) where T<:Real = zeros(T, 1) make_sample_vector(::NTuple{N, T}) where {N, T<:Real} = zeros(T, N) endmetadatashow_logsèdisabled®skip_as_script«code_folded$27441783-d3c6-40be-9c36-4941613e6ae9cell_id$27441783-d3c6-40be-9c36-4941613e6ae9codezplot(reinforce_test5.step_rewards |> cumsum |> x -> x ./ length(x) |> x -> x[round.(Int64, LinRange(1, length(x), 1000))])metadatashow_logsèdisabled®skip_as_script«code_folded$fac138d9-3c5d-44b0-a87c-b13872f19450cell_id$fac138d9-3c5d-44b0-a87c-b13872f19450codeusing Memoizemetadatashow_logsèdisabled®skip_as_script«code_folded$82e0e9a0-9662-429a-87e3-e6bdae02709acell_id$82e0e9a0-9662-429a-87e3-e6bdae02709acodefconst reinforce_test5 = actor_critic_with_eligibility_traces_fcann(cartpole_setup.mdps.continuing.discrete, 0.90f0, 0.1f0, cartpole_fcann_feature_setup.num_features, [32, 32], (x, s) -> cartpole_fcann_feature_setup.update_feature_vector!(x, (s.x, s.θ, s.ẋ, s.θ̇)), 1_000_000; α_θ = 0.0625f0, α_w = 0.0625f0, α_r̄ = 0.01f0, save_step_rewards = true)metadatashow_logsèdisabled®skip_as_script«code_folded$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62cell_id$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62code_@bind start_mountaincar_continuing_param_study CounterButton("Run Mountaincar Parameter 
Study")metadatashow_logsèdisabled®skip_as_script«code_folded$fad02876-efba-46a7-9cb7-43820528779fcell_id$fad02876-efba-46a7-9cb7-43820528779fcodeٽplot_cart(cartpole_fcann_continuing_test_episode[1][cartpole_fcann_continuing_episode_step_select], cartpole_fcann_continuing_test_episode[2][cartpole_fcann_continuing_episode_step_select])metadatashow_logsèdisabled®skip_as_script«code_folded$1ce4bc6c-7cde-48e9-8ff1-7281697fd121cell_id$1ce4bc6c-7cde-48e9-8ff1-7281697fd121code-plot_cart(ep2[1][ep2_step], ep2[2][ep2_step])metadatashow_logsèdisabled®skip_as_script«code_folded$024dcd1a-8eaa-4a95-8037-2f578828309ccell_id$024dcd1a-8eaa-4a95-8037-2f578828309ccode-const cartpole_mdps = create_cartpole_mdps()metadatashow_logsèdisabled®skip_as_script«code_folded$e1274f57-75cb-4659-a82f-e5870c5367e2cell_id$e1274f57-75cb-4659-a82f-e5870c5367e2codeyconst ep = runepisode(cartpole_setup.mdps.episodic.discrete; π = reinforce_test4.policy_sample_action, max_steps = 1000)metadatashow_logsèdisabled®skip_as_script«code_folded$fdd3f4fd-4706-4d6b-b150-6ee6b4b370cbcell_id$fdd3f4fd-4706-4d6b-b150-6ee6b4b370cbcodemd""" ### Notes on Probability Distributions In order to prove the policy gradient theorem, we must manipulate terms that are probability distributions over states and visit steps. In order to build intuition for these distributions, we can visualize how data is being averaged with the short corridor example. The following function simulates many episodes in the environment with a stochastic policy that has some probability of moving left regardless of the state. The simulation keeps track of the visit count for a given state and the visit step. The result of the accumulation is a matrix whose columns contain the number of times each state was visited on every step of an episode across all of the simulated episodes. 
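
A minimal sketch of such an accumulation is shown here, assuming the Example 13.1 dynamics (three non-terminal states, actions reversed in the second state, and a terminal goal state). The names `corridor_step` and `visit_counts` are illustrative and are not the notebook's own functions. The terminal state is treated as absorbing so that every column of the count matrix accounts for every simulated episode.

```julia
using Random

# Short-corridor dynamics: states 1:3 are non-terminal and state 4 is terminal.
# Actions are reversed in state 2; moving left in state 1 leaves the state unchanged.
function corridor_step(s::Int, move_left::Bool)
    s == 1 && return move_left ? 1 : 2
    s == 2 && return move_left ? 3 : 1
    return move_left ? 2 : 4
end

# counts[s, k + 1] is the number of episodes that occupy state s on step k.
function visit_counts(p_left; n_episodes = 10_000, max_steps = 50,
                      rng = Random.default_rng())
    counts = zeros(Int, 4, max_steps + 1)
    for _ in 1:n_episodes
        s = 1                              # every episode starts in state 1
        for k in 0:max_steps
            counts[s, k + 1] += 1
            s == 4 && continue             # terminal state is absorbing
            s = corridor_step(s, rand(rng) < p_left)
        end
    end
    return counts
end

counts = visit_counts(0.5; n_episodes = 1_000, max_steps = 20, rng = MersenneTwister(1))
```
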
If we divide each count by the number of episodes simulated, then we have an unbiased sample of the probability of visiting a state on each step $k$ of an episode: $\Pr \{ S_k = s \mid \pi \}$ such that $\sum_{s \in \mathcal{S}^+} \Pr \{ S_k = s \mid \pi \} = 1$. Note that this distribution is only normalized over the sum of all states including terminal states which is denoted in episodic problems by the notation $\mathcal{S}^+$. The notation $\mathcal{S}$ excludes all terminal states, so if we sum the above probabilities over that set on a given step $k$ we calculate the probability that we are NOT in a terminal state by the time we reach step $k$: $\sum_\mathcal{S} \Pr \{ S_k = s \mid \pi \} = \Pr \{ T \gt k \mid \pi \}$ where we use the notation that $T$ is the step of termination for a particular episode. """metadatashow_logsèdisabled®skip_as_script«code_folded$b02ba928-5b9f-4695-b980-07988c788bb9cell_id$b02ba928-5b9f-4695-b980-07988c788bb9code8const mountaincar_continuing_tile_test = actor_critic_with_eligibility_traces_binary_features(mountaincar_continuing_mdp, 0.1f0, 0.98f0, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, 200_000, α_θ = 0.5f0, α_w = 0.0025f0, α_r̄ = 0.005f0; save_step_rewards=true)metadatashow_logsèdisabled®skip_as_script«code_folded$f946c886-6246-4f98-a96f-f06984691ad8cell_id$f946c886-6246-4f98-a96f-f06984691ad8codebegin function ApproximationUtils.runepisode!((states, actions, rewards)::Tuple{Vector{S}, Vector{A}, Vector{T}}, mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, π::Function; s0::S = mdp.initialize_state(), a0::A = π(s0), max_steps = typemax(Int64)) where {T<:Real, S, A, P, F1<:Function, F2<:Function, F3<:Function} s = s0 l = length(states) @assert l == length(actions) == length(rewards) function add_value!(v, x, i) if i > l push!(v, x) else v[i] = x end end add_value!(states, s, 1) a = a0 # @info "Selected action is $a" (r, s′) = mdp.ptf(s, a0) add_value!(actions, a, 1) 
add_value!(rewards, r, 1) step = 2 sterm = s if mdp.isterm(s′) sterm = s′ else sterm = s end s = s′ #note that the terminal state will not be added to the state list while !mdp.isterm(s) && (step <= max_steps) add_value!(states, s, step) a = π(s) if bad_continuous_action(a) @info "Terminating episode after $step steps due to bad continuous action $a taken in state $s" step = 1 break end # @info "Selected action is $a" (r, s′) = mdp.ptf(s, a) add_value!(actions, a, step) add_value!(rewards, r, step) s = s′ step += 1 if mdp.isterm(s′) sterm = s′ end end return states, actions, rewards, sterm, step-1 end function ApproximationUtils.runepisode(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, π::Function; kwargs...) where {T<:Real, S, A, P, F1, F2, F3} states = Vector{S}() actions = Vector{A}() rewards = Vector{T}() runepisode!((states, actions, rewards), mdp, π; kwargs...) end ApproximationUtils.runepisode(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}; kwargs...) where {T<:Real, S, N, A <: Union{T, NTuple{N, T}}, P, F1, F2, F3} = runepisode(mdp, Returns(rand(A)); kwargs...) 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$3c316495-bb6c-41e2-a38f-ba867a319fbbcell_id$3c316495-bb6c-41e2-a38f-ba867a319fbbcode #create a cart pole MDP environment function create_cartpole_mdps(; m::T = 1f0, #mass at the end of the pole in kg m_c::T = 10f0, #mass of the cart in kg l::T = 1f0, #length of the pole in meters g::T = 9.8f0, #gravitational constant in meters per second squared h::T = 1f-3, #step size parameter of simulation in seconds k::T = 1f0, #inertial constant of pendulum, m_f::T = 0f0, #friction of the rotating pole μ_c::T = 0f0, #friction of the cart wheels against the track fmax::T = 100f0, #force applied by throttle x_max::T = Inf32, #maximum horizontal position θ_max::T = π/2f0, #maximum pole angle ẋ_max::T = Inf32, θ̇_max::T = Inf32, init_x::Function = () -> 0f0, #initialize each of the 4 state variables init_θ::Function = () -> Float32(rand([-π/6, π/6])), init_ẋ::Function = () -> 0f0, init_θ̇::Function = () -> 0f0) where T<:Real #the action space is full throttle forward or backwards or idle in the discrete case actions = [-fmax, zero(T), fmax] #create a vehicle to use in simulation steps vehicle = CartPoleVehicle(m, m_c, l, k, m_f, μ_c) initialize_state(;t = 0f0) = CartPoleState(init_x(), init_θ(), init_ẋ(), init_θ̇(), t) function failure(s::CartPoleState) (abs(s.x) > x_max) || (abs(s.θ) > θ_max) || (abs(s.ẋ) > ẋ_max) || (abs(s.θ̇) > θ̇_max) end step(s::CartPoleState{T}, f::T) = cartpole_runge_kutta_step(vehicle, s, g, clamp(f, -fmax, fmax), h) function episodic_step(s::CartPoleState{T}, f::T) s′ = step(s, f) return (one(T), s′) end function continuing_step(s::CartPoleState{T}, f::T) s′ = step(s, f) failure(s′) && return (-one(T), initialize_state(;t = s′.t)) return (zero(T), s′) end s0 = initialize_state() ptf = StateMDPTransitionSampler((s, i_a) -> episodic_step(s, actions[i_a]), s0) episodic_mdp = TabularRL.StateMDP(actions, ptf, initialize_state, failure) ptf = ContinuousMDPTransitionSampler(episodic_step, s0, zero(T)) 
episodic_mdp_continuous = ContinuousMDP(ptf, initialize_state; isterm = failure) ptf = StateMDPTransitionSampler((s, i_a) -> continuing_step(s, actions[i_a]), s0) continuing_mdp = TabularRL.StateMDP(actions, ptf, initialize_state, Returns(false)) ptf = ContinuousMDPTransitionSampler(continuing_step, s0, zero(T)) continuing_mdp_continuous = ContinuousMDP(ptf, initialize_state) (episodic = (discrete = episodic_mdp, continuous = episodic_mdp_continuous), continuing = (discrete = continuing_mdp, continuous = continuing_mdp_continuous)) endmetadatashow_logsèdisabled®skip_as_script«code_folded$6c5e9bb2-4c38-4613-9652-dec99e97b512cell_id$6c5e9bb2-4c38-4613-9652-dec99e97b512code%md""" #### Policy Function Output """metadatashow_logsèdisabled®skip_as_script«code_folded$b0a66a19-ee76-463b-a704-8fcee85444d0cell_id$b0a66a19-ee76-463b-a704-8fcee85444d0codelbegin function update_params_with_gradient!(θ::Array{T, N}, α::T, ∇θ::Array{T, N}) where {T<:Real, N} θ .+= α .* ∇θ end function update_params_with_gradient!(θ::Matrix{T}, α::T, ∇θ::BinaryEligibilityVector{T, B}) where {T<:Real, B<:BinaryFeatureVector} @inbounds for i in eachindex(∇θ.π_dist) @simd for j in 1:∇θ.binary_features.num_features k = ∇θ.binary_features.active_features[j] θ[k, i] -= α*∇θ.π_dist[i] end end @inbounds @simd for i in 1:∇θ.binary_features.num_features j = ∇θ.binary_features.active_features[i] θ[j, ∇θ.i_a] += α end return θ end function update_params_with_gradient!(params::FCANNParams{T}, α::T, ∇::FCANNParams{T}) where T<:Float32 for i in eachindex(first(params)) for j in 1:2 # @info "updating parameter $((j, i)) $(params[j][i]) with gradient $(∇[j][i]) and constant $α" update_params_with_gradient!(params[j][i], α, ∇[j][i]) # @info "new parameter values are: $(params[j][i])" end end end update_params_with_gradient!(::Nothing, α::T, ::Nothing) where T<:Real = return nothing 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$13ebc12f-ff6f-4266-88d3-28d6df5fcf59cell_id$13ebc12f-ff6f-4266-88d3-28d6df5fcf59codeCactor_critic_binary_episodic_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, params::@NamedTuple{λ_θ::T, λ_w::T, α_θ_min::Int64, α_w_min::Int64}, num_θ::Integer, num_w::Integer, num_episodes::Integer; kwargs...) where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_binary_episodic_gaussian_parameter_study(mdp, get_active_features, num_features, params.λ_θ, params.λ_w, 2f0 .^(params.α_θ_min:params.α_θ_min+num_θ-1), 2f0 .^(params.α_w_min:params.α_w_min+num_w-1), num_episodes; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$7a6fb1f0-fc3c-4c29-a6d9-769d32ca98a9cell_id$7a6fb1f0-fc3c-4c29-a6d9-769d32ca98a9code3md""" ### Example 13.1 Short corridor gridworld """metadatashow_logsèdisabled®skip_as_script«code_folded$f2f2dd1d-180c-4d36-b515-5079d129f93acell_id$f2f2dd1d-180c-4d36-b515-5079d129f93acodesarsa_λ(corridor_mdp, 1f0, 0.9f0, typemax(Int64), 100_000, 1, get_corridor_features; ϵ = 0.0001f0, α = 0.000001f0, save_episode_steps = true).history.episode_steps |> a -> a ./ (1:length(a)) |> plotmetadatashow_logsèdisabled®skip_as_script«code_folded$553b0ceb-f2ca-41ee-99bc-9f53a4487b49cell_id$553b0ceb-f2ca-41ee-99bc-9f53a4487b49codeTget_corridor_episode_stats(best_mc_corridor.policy_sample_action; ntrials = 100_000)metadatashow_logsèdisabled®skip_as_script«code_folded$f9facbba-39d4-483e-9066-275603156db0cell_id$f9facbba-39d4-483e-9066-275603156db0codeifunction plot_mountaincar_values(v̂_mountain_car, π; n1 = 100, n2 = 100) xvals = LinRange(-1.2f0, 0.5f0, n1) vvals = LinRange(-0.07f0, 0.07f0, n2) values = zeros(Float32, n1, n2) actions = zeros(Float32, n1, n2) for (i, x) in enumerate(xvals) for (j, v) in enumerate(vvals) v̂ = v̂_mountain_car((x, v)) values[j, i] = v̂ actions[j, i] = π((x, v)) end end p1 = plot(heatmap(x = xvals, y = vvals, z = values), 
Layout(xaxis_title = "position", yaxis_title = "velocity", title = "Learned Value Function", height = 400)) p2 = plot(heatmap(x = xvals, y = vvals, z = actions, colorscale = "rb", showscale = false), Layout(xaxis_title = "position", yaxis_title = "velocity", title = "Policy (blue = accelerate left,
red = accelerate right, gray = no acceleration)", height = 400)) @htl("""
$p1 $p2
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0cell_id$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0code٘one_step_actor_critic_fcann(corridor_mdp, 1, [1], update_corridor_features!, typemax(Int64), 100_000, α_θ = 2f0^-4, α_w = 2f0^-20).policy_function(1)metadatashow_logs¨disabled®skip_as_script«code_folded$d41f1dd1-45fe-4456-9a01-ed47fd6704a7cell_id$d41f1dd1-45fe-4456-9a01-ed47fd6704a7codeebegin function update_beta_eligibility_vector!(∇lnπ::BinaryBetaEligibilityVector{T, T, T, B}, dist_params::Vector{T}, x::B, action::T, policy_params::Matrix{T}) where {T<:Real, B<:BinaryFeatureVector} # @info "Beta eligibility vector is $∇lnπ" ∇lnπ.binary_features = x ∇lnπ.a = action ∇lnπ.α = exp(first(dist_params)) ∇lnπ.β = exp(last(dist_params)) # @info "Beta eligibility vector updated to $∇lnπ" return ∇lnπ end function update_beta_eligibility_vector!(∇lnπ::BinaryBetaEligibilityVector{T, NTuple{N, T}, Vector{T}, B}, dist_params::Vector{T}, x::B, action::NTuple{N, T}, policy_params::Matrix{T}) where {T<:Real, N, B<:BinaryFeatureVector} ∇lnπ.binary_features = x ∇lnπ.a = action for k in 1:N ∇lnπ.α[k] = exp(dist_params[k]) ∇lnπ.β[k] = exp(dist_params[k+N]) end return ∇lnπ end endmetadatashow_logsèdisabled®skip_as_script«code_folded$ba5d6311-daee-4abc-b2fb-fae2184ef3ebcell_id$ba5d6311-daee-4abc-b2fb-fae2184ef3ebcodefunction setup_binary_gaussian_policy_arguments(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer) where {T<:Real, S, N, A<:Union{T, NTuple{N, T}}, P, F1, F2, F3} x = BinaryFeatureVector() update_feature_vector!(x::BinaryFeatureVector, s) = update_binary_feature_vector!(x, s, get_active_features) sample_action = rand(A) action_dist_params = make_n_param_dist_params(2, sample_action) ∇lnπ = BinaryGaussianEligibilityVector(sample_action) return (feature_vector = x, update_feature_vector! 
= update_feature_vector!, action_distribution_parameters = action_dist_params, eligibility_vector = ∇lnπ) endmetadatashow_logsèdisabled®skip_as_script«code_folded$8e742d32-c074-4981-b35b-b596b64c869bcell_id$8e742d32-c074-4981-b35b-b596b64c869bcode٨@bind cartpole_continuing_binary_study_params create_actor_critic_continuing_params_UI(;λ_θ = 0.95f0, λ_w = 0.05f0, log2α_θ = -4, log2α_w = -16, α_r̄ = 0.005f0)metadatashow_logsèdisabled®skip_as_script«code_folded$03a218cb-aa83-4000-85b5-c6f247087053cell_id$03a218cb-aa83-4000-85b5-c6f247087053codefunction update_binary_value_gradient!(∇v̂::BinaryFeatureVector, binary_features::BinaryFeatureVector, value_params::Vector{T}) where T<:Real ∇v̂.active_features = binary_features.active_features ∇v̂.num_features = binary_features.num_features endmetadatashow_logsèdisabled®skip_as_script«code_folded$1ec1acf1-f833-4478-9b3c-88029340a629cell_id$1ec1acf1-f833-4478-9b3c-88029340a629code:md""" ##### Non-linear Features This version of REINFORCE uses non-linear features in a fully connected neural network. The number of parameters no longer matches the size of the input feature vector, but a mapping from state to feature vector is still required. One must specify the size of the feature vector, a function that updates the values in a feature vector given a state, and the size of each hidden layer in the neural network. Additional keyword arguments are available to change the construction of the neural network such as adding residual layers. 
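As a minimal sketch of the kind of feature update function described above, the hypothetical `update_scaled_features!` below (not part of this notebook) rescales a two-component state into the unit interval. It assumes mountain-car style bounds of -1.2 to 0.5 for position and -0.07 to 0.07 for velocity; the names `input_length_example` and `hidden_layers_example` and the layer sizes are illustrative only.

```julia
# Hypothetical in-place feature update for a state s = (position, velocity).
# The bounds are assumptions chosen to match a mountain-car style task.
function update_scaled_features!(x::Vector{T}, s::NTuple{2, T}) where T<:Real
    x[1] = (s[1] + T(1.2)) / T(1.7)    # position in [-1.2, 0.5] mapped to [0, 1]
    x[2] = (s[2] + T(0.07)) / T(0.14)  # velocity in [-0.07, 0.07] mapped to [0, 1]
    return x
end

input_length_example = 2          # size of the feature vector
hidden_layers_example = [64, 64]  # illustrative hidden layer sizes

x_example = zeros(Float32, input_length_example)
update_scaled_features!(x_example, (-0.5f0, 0f0))
```

The feature vector is allocated once and overwritten on every call, so the same buffer can be reused at every step of an episode.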
"""metadatashow_logsèdisabled®skip_as_script«code_folded$de3cba34-9842-44d1-9b79-47126c0a0751cell_id$de3cba34-9842-44d1-9b79-47126c0a0751codeٜconst cartpole_tilecoding_setup = tile_coding_setup(cartpole_functions.min_vals, cartpole_functions.max_vals, (1f0/4, 1f0/8, 1f0/8, 1f0/8), 8, (1, 3, 5, 7))metadatashow_logsèdisabled®skip_as_script«code_folded$04f42c09-8ab5-4233-b196-51c4aa2dcedbcell_id$04f42c09-8ab5-4233-b196-51c4aa2dcedbcodeif start_mountaincar_continuing_param_study > 0 mountaincar_binary_continuing_parameter_study(mountaincar_continuing_binary_params, 5, 3, 100_000; seed = 45) else md""" Waiting to run parameter study """ endmetadatashow_logsèdisabled®skip_as_script«code_folded$54ff46a2-489a-4dd2-bc30-df70c780cc42cell_id$54ff46a2-489a-4dd2-bc30-df70c780cc42code\cartpole_fcann_parameter_study(fill(fcann_cartpole_study_params.h, fcann_cartpole_study_params.l), fcann_cartpole_study_params.λ_θ, fcann_cartpole_study_params.λ_w, 2f0 .^(fcann_cartpole_study_params.α_θ_min:fcann_cartpole_study_params.α_θ_min+4), 2f0 .^ (fcann_cartpole_study_params.α_w_min:fcann_cartpole_study_params.α_w_min+2), 1_000)metadatashow_logsèdisabledîskip_as_script«code_folded$7126aefd-b847-497a-9545-514e9b9afa71cell_id$7126aefd-b847-497a-9545-514e9b9afa71codeactor_critic_fcann_episodic_parameter_study(MountainCarTask.mdp, mountaincar_fcann_setup.update_feature_vector!, mountaincar_fcann_setup.num_features, fill(fcann_mountaincar_study_params.h, fcann_mountaincar_study_params.l), fcann_mountaincar_study_params.λ_θ, fcann_mountaincar_study_params.λ_w, 2f0 .^ (fcann_mountaincar_study_params.α_θ_min:fcann_mountaincar_study_params.α_θ_min+4), 2f0 .^ (fcann_mountaincar_study_params.α_w_min:fcann_mountaincar_study_params.α_w_min+2), 100_000; nruns = 100, max_steps = 1_000)metadatashow_logsèdisabled®skip_as_script«code_folded$48dcd2d0-a940-41da-a097-90c780f2ec4dcell_id$48dcd2d0-a940-41da-a097-90c780f2ec4dcodezmd""" ### Alternative Paramaterization If the action space is small enough, then it 
may be convenient to create a function that simply outputs the preferences for all of the actions at a given state. Let $N_a$ be the number of available actions. We then take the vector function $\mathbf{h}(s, \boldsymbol{\theta}) \in \mathbb{R}^{N_a}$, with components $h_1, h_2, h_3, \dots, h_{N_a}$, to be the action preferences at each state. With this style of parameterization, we need only compute state feature vectors $\mathbf{x}(s) \in \mathbb{R}^d$. Similarly, the policy function would also be a vector function. In order to compute the soft-max, we must evaluate the denominator of (13.2), which requires knowing all of the action preferences. In practice the soft-max is only defined as a function on vectors, so consider the following notation to simplify expressions, where we use the symbol $\mathbf{\sigma}$ to denote the soft-max vector function. $\sigma(\mathbf{x}) = \frac{e^{\mathbf{x}}}{\sum_j{e^{x_j}}} \text{ where we abuse the notation } e^{\mathbf{x}} = \begin{pmatrix} e^{x_1} \\ e^{x_2} \\ \vdots \\ e^{x_n} \end{pmatrix}$ Using this notation, we can write down the policy function under this new parameterization: $\mathbf{\pi}(s, \boldsymbol{\theta}) = \mathbf{\sigma}(\mathbf{h}(s, \boldsymbol{\theta}))$. What do linear preferences look like with this parameterization? Instead of a parameter vector $\boldsymbol{\theta} \in \mathbb{R}^{d^\prime}$, we have a parameter matrix $\boldsymbol{\theta} \in \mathbb{R}^{d \times N_a}$ and the vector of preferences is the result of a matrix-vector multiplication: $\mathbf{h}(s, \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s) \in \mathbb{R}^{N_a}$. Subscript notation refers to single preference values, so $\mathbf{h}_i$ is the $i$th component of $\mathbf{h}$, the preference for the $i$th action, and is equivalent to $h_i$. 
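The linear soft-max parameterization above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `softmax_sketch`, `linear_preferences_sketch`, and `linear_softmax_policy_sketch` are hypothetical and are not the functions used elsewhere in this notebook, and the example parameter values are arbitrary.

```julia
# Soft-max of a preference vector. Subtracting the maximum does not change
# the result but avoids overflow in exp for large preferences.
function softmax_sketch(h::AbstractVector{T}) where T<:Real
    e = exp.(h .- maximum(h))
    return e ./ sum(e)
end

# Linear preferences: θ is a d × N_a matrix and x is a length-d feature vector,
# so θ' * x is the length-N_a vector of action preferences.
linear_preferences_sketch(θ::AbstractMatrix, x::AbstractVector) = θ' * x

# Policy as a vector function: one probability per action.
linear_softmax_policy_sketch(θ::AbstractMatrix, x::AbstractVector) =
    softmax_sketch(linear_preferences_sketch(θ, x))

θ_example = Float32[0 1 2; 1 0 -1]  # d = 2 features, N_a = 3 actions
x_example = Float32[1, 0]           # only the first feature is active
π_example = linear_softmax_policy_sketch(θ_example, x_example)
```

With this feature vector the preferences are the first row of `θ_example`, so the third action receives the highest probability.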
"""metadatashow_logsèdisabled®skip_as_script«code_folded$e1493cea-19c4-475d-98a0-86d27fb04af1cell_id$e1493cea-19c4-475d-98a0-86d27fb04af1code٠sarsa_λ(corridor_mdp, 1f0, 0.9f0, typemax(Int64), 100_000, 1, get_corridor_features; ϵ = 0.001f0, α = 0.000001f0).greedy_policy |> get_corridor_episode_statsmetadatashow_logsèdisabled®skip_as_script«code_folded$511a847f-234c-465e-8f4a-688e79d9b975cell_id$511a847f-234c-465e-8f4a-688e79d9b975codeSmd""" ## 13.6 Policy Gradient for Continuing Problems In the continuing case we need to define the average reward per time step as discussed in Section 10.3. In the update procedure the δ is calculated differently in terms of the reward compared to this long running average. The value functions in this case will also learn the reward difference from the average which is assumed to have a well defined expected value under the stationary state distribution for the policy. This shift in the value function will not affect performance since shifting the value function up and down by a constant does not affect the learned policy. To implement this we need a new learning rate $α_{\overline{R}}$ which controls how quickly the reward average updates. This replaces $γ$ in a sense since we no longer discount rewards of future time steps. """metadatashow_logsèdisabled®skip_as_script«code_folded$697b2310-9d96-4f7f-be62-c3bd6bf736f3cell_id$697b2310-9d96-4f7f-be62-c3bd6bf736f3codefunction reinforce_with_baseline_monte_carlo_control_fcann(mdp::StateMDP{T, S, A, P, F1, F2, F3}, input_length::Integer, hidden_layers::Vector{Int64}, update_feature_vector!::Function,max_episodes::Integer; policy_params::FCANNParams = FCANN.initializeparams_saxe(input_length, hidden_layers, length(mdp.actions)), reslayers = 0, l2 = 0f0, dropout = 0f0, use_μP = true, activation_list = fill(true, length(hidden_layers)), kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} setup = setup_fcann_policy_and_value_arguments(policy_params, input_length, hidden_layers, reslayers, l2, dropout, use_μP, activation_list) reinforce_with_baseline_monte_carlo_control!(policy_params, setup.eligibility_vector, setup.value_params, setup.value_gradient, mdp, setup.update_action_preferences!, setup.update_eligibility_vector!, setup.feature_vector, update_feature_vector!, setup.value_function, setup.gradient_update, max_episodes; kwargs...) endmetadatashow_logsèdisabled®skip_as_script«code_folded$056a8adc-92f4-4b33-90d9-4b3b4026bbbccell_id$056a8adc-92f4-4b33-90d9-4b3b4026bbbccodebegin function update_traces_with_gradient!(c::T, z_θ::Matrix{T}, ∇θ::BinaryGaussianEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = ∇θ.a - ∇θ.μ c2 = ∇θ.σ^(-2) # isnan(c2) && @info "warning σ of $∇θ.σ is causing nan results" # isinf(c2) && @info "warning σ of $∇θ.σ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) z_θ .*= c @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, 1] += c3 end @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, 2] += c4 end return z_θ end function update_traces_with_gradient!(c::T, z_θ::Matrix{T}, ∇θ::BinaryBetaEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = digamma(∇θ.α + ∇θ.β) δ1 = ∇θ.α*(log(∇θ.a) + c1 - digamma(∇θ.α)) z_θ .*= c @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, 1] += δ1 end δ2 = ∇θ.β*(log(one(T) - ∇θ.a) + c1 - digamma(∇θ.β)) @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, 2] += δ2 end return z_θ end function update_traces_with_gradient!(c::T, z_θ::Matrix{T}, ∇θ::BinarySquashedGaussianEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = atanh(∇θ.a / ∇θ.amax) - ∇θ.μ c2 = ∇θ.σ^(-2) # 
isnan(c2) && @info "warning σ of $∇θ.σ is causing nan results" # isinf(c2) && @info "warning σ of $∇θ.σ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) z_θ .*= c @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, 1] += c3 end @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, 2] += c4 end return z_θ end function update_traces_with_gradient!(a::T, z_θ::Matrix{T}, b::T, ∇θ::BinaryGaussianEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = ∇θ.a - ∇θ.μ c2 = ∇θ.σ^(-2) # isnan(c2) && @info "warning σ of $∇θ.σ is causing nan results" # isinf(c2) && @info "warning σ of $∇θ.σ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) z_θ .*= a δ1 = b*c3 @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, 1] += δ1 end δ2 = b*c4 @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, 2] += δ2 end return z_θ end function update_traces_with_gradient!(a::T, z_θ::Matrix{T}, b::T, ∇θ::BinarySquashedGaussianEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = atanh(∇θ.a / ∇θ.amax) - ∇θ.μ c2 = ∇θ.σ^(-2) # isnan(c2) && @info "warning σ of $∇θ.σ is causing nan results" # isinf(c2) && @info "warning σ of $∇θ.σ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) z_θ .*= a δ1 = b*c3 @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, 1] += δ1 end δ2 = b*c4 @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, 2] += δ2 end return z_θ end function update_traces_with_gradient!(a::T, z_θ::Matrix{T}, b::T, ∇θ::BinaryBetaEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = digamma(∇θ.α + ∇θ.β) z_θ .*= a δ1 = b*∇θ.α*(log(∇θ.a) + c1 - digamma(∇θ.α)) @inbounds @simd for j in 
1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, 1] += δ1 end δ2 = b*∇θ.β*(log(one(T) - ∇θ.a) + c1 - digamma(∇θ.β)) @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, 2] += δ2 end return z_θ end function update_traces_with_gradient!(a::T, z_θ::Matrix{T}, ∇θ::BinaryGaussianEligibilityVector{T, NTuple{N, T}, Vector{T}, B}) where {T<:Real, N, B<:BinaryFeatureVector} z_θ .*= a for k in 1:N c1 = ∇θ.a[k] - ∇θ.μ[k] c2 = ∇θ.σ[k] ^-2 # isnan(c2) && @info "warning σ of $∇θ.σ is causing nan results" # isinf(c2) && @info "warning σ of $∇θ.σ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, k] += c3 end @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, k+N] += c4 end end return z_θ end function update_traces_with_gradient!(a::T, z_θ::Matrix{T}, ∇θ::BinaryBetaEligibilityVector{T, NTuple{N, T}, Vector{T}, B}) where {T<:Real, N, B<:BinaryFeatureVector} z_θ .*= a for k in 1:N c1 = digamma(∇θ.α[k] + ∇θ.β[k]) δ1 = ∇θ.α[k]*(log(∇θ.a[k]) + c1 - digamma(∇θ.α[k])) @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, k] += δ1 end δ2 = ∇θ.β[k]*(log(one(T) - ∇θ.a[k]) + c1 - digamma(∇θ.β[k])) @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, k+N] += δ2 end end return z_θ end function update_traces_with_gradient!(a::T, z_θ::Matrix{T}, b::T, ∇θ::BinaryGaussianEligibilityVector{T, NTuple{N, T}, Vector{T}, B}) where {T<:Real, N, B<:BinaryFeatureVector} z_θ .*= a for k in 1:N c1 = ∇θ.a[k] - ∇θ.μ[k] c2 = ∇θ.σ[k] ^-2 isnan(c2) && @info "warning σ of $(∇θ.σ[k]) is causing nan results" isinf(c2) && @info "warning σ of $(∇θ.σ[k]) is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) δ1 = b*c3 @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, k] += δ1 end δ2 = b*c4 @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, k+N] += δ2 end end return z_θ end function update_traces_with_gradient!(a::T, z_θ::Matrix{T}, b::T, ∇θ::BinaryBetaEligibilityVector{T, NTuple{N, T}, Vector{T}, B}) where {T<:Real, N, 
B<:BinaryFeatureVector} z_θ .*= a for k in 1:N c1 = digamma(∇θ.α[k] + ∇θ.β[k]) δ1 = b*∇θ.α[k]*(log(∇θ.a[k]) + c1 - digamma(∇θ.α[k])) @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, k] += δ1 end δ2 = b*∇θ.β[k]*(log(one(T) - ∇θ.a[k]) + c1 - digamma(∇θ.β[k])) @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] z_θ[i, k+N] += δ2 end end return z_θ end endmetadatashow_logsèdisabled®skip_as_script«code_folded$bc8a399b-8864-4473-89d2-e3b0a03d15b5cell_id$bc8a399b-8864-4473-89d2-e3b0a03d15b5codeٹcorridor_parameter_study(args...; kwargs...) = actor_critic_binary_episodic_parameter_study(corridor_mdp, get_corridor_features, 1, args...; init_policy_params = [0f0 3.7f0], kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$bba13634-ff0e-47f7-a23b-8d56098f4ac6cell_id$bba13634-ff0e-47f7-a23b-8d56098f4ac6codebegin function gaussian_action_sampler(params::Vector{T}) where T<:Real σ = exp(params[2]) μ = params[1] isinf(μ) && return μ isapprox(σ, zero(T)) && return μ isnan(σ) && return μ rand(Normal(μ, σ)) end make_gaussian_n_sampler(::Val{1}) = gaussian_action_sampler function make_gaussian_n_sampler(::Val{N}) where N function f(params::Vector{T}) where T<:Real ntuple(i -> rand(Normal(params[i], exp(params[i+N]))), N) end end make_gaussian_n_sampler(n::Integer) = make_gaussian_n_sampler(Val(n)) make_gaussian_sampler(::T) where T<:Real = gaussian_action_sampler make_gaussian_sampler(::NTuple{N, T}) where {N, T<:Real} = make_gaussian_n_sampler(N) endmetadatashow_logsèdisabled®skip_as_script«code_folded$407a0724-4bb6-4c83-ab2d-17a0e19c4072cell_id$407a0724-4bb6-4c83-ab2d-17a0e19c4072codeKconst reinforce_test4 = actor_critic_with_eligibility_traces_fcann(cartpole_setup.mdps.episodic.discrete, 0.95f0, 0.2f0, cartpole_fcann_feature_setup.num_features, [64, 64], (x, s) -> cartpole_fcann_feature_setup.update_feature_vector!(x, (s.x, s.θ, s.ẋ, s.θ̇)), typemax(Int64), 1_000_000; α_θ = 4f-4, α_w = 2f-5, γ = 
1f0)metadatashow_logsèdisabled®skip_as_script«code_folded$77cf3a74-899f-4ade-99f2-5aaf7a98c02dcell_id$77cf3a74-899f-4ade-99f2-5aaf7a98c02dcodeٴfunction scale_fcann_params!(params::FCANNParams, scales::Vector{T}) where T<:Real @inbounds for i in eachindex(scales) for j in 1:2 params[j][i] ./= scales[i] end end endmetadatashow_logsèdisabled®skip_as_script«code_folded$28ce6e60-59cf-408a-8081-b978507b3c72cell_id$28ce6e60-59cf-408a-8081-b978507b3c72code@bind cartpole_fcann_continuing_test_state PlutoUI.combine() do Child md""" x position: $(Child(Slider(-50f0:50f0, default = 0, show_value=true))) pole angle: $(Child(Slider(LinRange(-deg2rad(70f0), deg2rad(70f0), 1000), default = 0, show_value=true))) x velocity: $(Child(Slider(-50f0:50f0, default = 0, show_value=true))) pole angular velocity: $(Child(Slider(-10f0:10f0, default = 0, show_value=true))) """ end |> confirmmetadatashow_logsèdisabled®skip_as_script«code_folded$7ccadf01-fbba-4dfd-a5ad-770dab9946f9cell_id$7ccadf01-fbba-4dfd-a5ad-770dab9946f9codeTmd""" We can define our policy as a normal distribution function over actions for a given state and parameter vector. $\pi(a|s, \mathbf{\theta}) \doteq \frac{1}{\sigma(s, \mathbf{\theta}) \sqrt{2\pi}} \exp \left ( - \frac{(a-\mu(s, \mathbf{\theta}))^2}{2\sigma(s, \mathbf{\theta})^2} \right ) \tag{13.19}$ This policy requires μ and σ to be parameterized by the parameter vector. To make a linear model for both parameters we can use the following formulas: $\mu(s, \mathbf{\theta}) \doteq \mathbf{\theta}_\mu ^\top \mathbf{x}_\mu(s) \text{ and } \sigma(s, \mathbf{\theta}) \doteq \exp{( \mathbf{\theta}_\sigma ^ \top \mathbf{x}_\sigma (s))} \tag{13.20}$ where $\mathbf{x}_\mu(s)$ and $\mathbf{x}_\sigma(s)$ are state feature vectors. With these formulas we can apply the previous algorithms to solve environments with real-valued actions. 
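The linear Gaussian policy of (13.19) and (13.20) can be sketched as follows. This is a minimal illustration with dense feature vectors, not the binary-feature implementation used in this notebook, and every function name ending in `_sketch` is hypothetical. The gradient expressions are the standard ones for this parameterization: the mean parameters receive (a - μ)/σ² times the features, and the log standard deviation parameters receive ((a - μ)²/σ² - 1) times the features.

```julia
# Mean and standard deviation from linear models, as in (13.20).
# For two vectors, θ' * x is their inner product (a scalar).
gaussian_mean_sketch(θ_μ::AbstractVector, x_μ::AbstractVector) = θ_μ' * x_μ
gaussian_std_sketch(θ_σ::AbstractVector, x_σ::AbstractVector) = exp(θ_σ' * x_σ)

# Density of action a under the Gaussian policy, as in (13.19).
gaussian_policy_pdf_sketch(a, μ, σ) = exp(-(a - μ)^2 / (2 * σ^2)) / (σ * sqrt(2π))

# Draw an action from the policy using a standard normal sample.
sample_gaussian_action_sketch(μ, σ) = μ + σ * randn()

# Gradient of ln π(a|s, θ) with respect to the mean and log-σ parameters.
function gaussian_log_policy_gradient_sketch(a, μ, σ, x_μ, x_σ)
    c = (a - μ) / σ^2
    ∇θ_μ = c .* x_μ
    ∇θ_σ = (c * (a - μ) - 1) .* x_σ
    return ∇θ_μ, ∇θ_σ
end

μ_example = gaussian_mean_sketch([1.0, 2.0], [1.0, 0.0])  # inner product
σ_example = gaussian_std_sketch([0.0, 0.0], [1.0, 1.0])   # exp(0)
a_example = sample_gaussian_action_sketch(μ_example, σ_example)
```

Because σ is the exponential of a linear function, it stays positive for any parameter values, which is why the gradient is taken with respect to the parameters of ln σ.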
"""metadatashow_logsèdisabled®skip_as_script«code_folded$b72e030f-7d52-481f-b4f7-2b16b227e547cell_id$b72e030f-7d52-481f-b4f7-2b16b227e547code\md""" ### Figure 13.2 Adding a baseline to REINFORCE can make it learn much faster as illustrated here on the short-corridor gridworld (Example 13.1). Here the approximate state-value function used in the baseline is $\hat v(s, \mathbf{w}) = w$. There is only one component of the feature vector and the state value approximation parameters. """metadatashow_logsèdisabled®skip_as_script«code_folded$4c5cb75e-79b5-4502-b1eb-6246e002feafcell_id$4c5cb75e-79b5-4502-b1eb-6246e002feafcodeZ@bind mountaincar_binary_params create_actor_critic_params_UI(λ_θ = 0.1f0, λ_w = 0.9f0)metadatashow_logsèdisabled®skip_as_script«code_folded$48b342f2-e48f-457a-9bd3-b3504a79f3a6cell_id$48b342f2-e48f-457a-9bd3-b3504a79f3a6codemd""" #### Binary Features This version of REINFORCE uses binary feature vectors for which one needs to specify the total number of features as well as a function that returns the active features for a given state. """metadatashow_logsèdisabled®skip_as_script«code_folded$5d50a5d0-8fe2-4c6e-b76c-d5614e4fd884cell_id$5d50a5d0-8fe2-4c6e-b76c-d5614e4fd884codem#for displaying plots that do not load by default when the notebook first runs. Displays a placeholder markdown and then if the counter is more than 0 runs the function f with the provided arguments and caches the result in the appropriate dictionary function show_or_lookup_plot(buttoncounter::Integer, args::Tuple, kwargs::NamedTuple, dict::Dict, f::Function, name::AbstractString) buttoncounter == 0 && return md""" #### Placeholder for $name plot. Click above button to run """ haskey(dict, (args, kwargs)) && return dict[(args, kwargs)] p = f(args...; kwargs...) 
dict[(args, kwargs)] = p endmetadatashow_logsèdisabled®skip_as_script«code_folded$ba645f6b-143f-4e83-9003-707770ae308dcell_id$ba645f6b-143f-4e83-9003-707770ae308dcode|function show_mountaincar_trajectory(π::Function, max_steps::Integer) states, actions, rewards, sterm, nsteps = runepisode(MountainCarTask.mdp; π = π, max_steps = max_steps) positions = [s[1] for s in states] velocities = [s[2] for s in states] tr1 = scatter(x = positions, y = velocities, mode = "markers", showlegend = false) tr2 = scatter(y = positions, showlegend = false) tr3 = scatter(y = [MountainCarTask.actions[i] for i in actions], showlegend = false) p1 = plot(tr1, Layout(xaxis_title = "position", yaxis_title = "velocity", xaxis_range = [-1.2, 0.5], yaxis_range = [-0.07, 0.07], height = 400)) p2 = plot(tr2, Layout(xaxis_title = "time", yaxis_title = "position", height = 400)) p3 = plot(tr3, Layout(xaxis_title = "time", yaxis_title = "action", height = 400)) @htl(""" Total Reward: $(sum(rewards))
$([p1 p2 p3])
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811cell_id$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811codeHupdate_corridor_features!(x::Vector{T}, s) where T<:Real = x[1] = one(T)metadatashow_logsèdisabled®skip_as_script«code_folded$8f1b2db4-ed35-44fc-a3d5-e06deae16d48cell_id$8f1b2db4-ed35-44fc-a3d5-e06deae16d48code`cartpole_tilecoding_reinforce_continuous_parameter_study(2f0 .^ (-18:-15), 2f0 .^ (-6:-4), 1000)metadatashow_logsèdisabledîskip_as_script«code_folded$57bbdb10-bed8-459d-8f67-9ea637cf12bacell_id$57bbdb10-bed8-459d-8f67-9ea637cf12bacodefunction one_step_actor_critic_fcann(mdp::StateMDP{T, S, A, P, F1, F2, F3}, input_length::Integer, hidden_layers::Vector{Int64}, update_feature_vector!::Function, max_episodes::Integer, max_steps::Integer; policy_params::FCANNParams = FCANN.initializeparams_saxe(input_length, hidden_layers, length(mdp.actions)), reslayers = 0, l2 = 0f0, dropout = 0f0, use_μP = true, activation_list = fill(true, length(hidden_layers)), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} setup = setup_fcann_policy_and_value_arguments(policy_params, input_length, hidden_layers, reslayers, l2, dropout, use_μP, activation_list) one_step_actor_critic!(policy_params, setup.eligibility_vector, setup.value_params, setup.value_gradient, mdp, setup.update_action_preferences!, setup.update_eligibility_vector!, setup.feature_vector, update_feature_vector!, setup.value_function, setup.gradient_update, max_episodes, max_steps; kwargs...) 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$ca360680-afc9-4dd9-9351-493643f91575cell_id$ca360680-afc9-4dd9-9351-493643f91575code|md""" #### Probability distributions for short corridor gridworld example with probability of left action selected below """metadatashow_logsèdisabled®skip_as_script«code_folded$d95f75b5-21d8-4862-baa7-50b58d9725b8cell_id$d95f75b5-21d8-4862-baa7-50b58d9725b8code md""" ### Soft-max notation and gradients To use policy gradient methods, we must be able to take the gradient of the policy function for every state-action pair. Using the above notation and treating the policy as a vector function, we must know the gradient of the soft-max applied to a vector function at a particular index. Each gradient is a column vector of length $d$ where $d$ is the number of parameters. There is a separate gradient for every index of the vector output, one per action, for a total of $N_a$ gradients. To simplify expressions, $\mathbf{h}(s, \boldsymbol{\theta})$ will be written as $\mathbf{h}$ and $\mathbf{\pi} = \mathbf{\sigma}(\mathbf{h})$. Our desired gradient is with respect to a particular component of $\mathbf{\sigma}(\mathbf{h})$ denoted $\mathbf{\sigma}(\mathbf{h})_a$ where $a$ represents the action index. The gradient itself is the vector of partial derivatives with respect to the parameters $\theta$. The $i$th component of the gradient is $\nabla(f(\theta))_i = \frac{\partial f(\theta)}{\partial \theta_i}$. Computing the gradient requires all of its components, whose expression is derived below. 
$\begin{align} \nabla \left ( \sigma(\mathbf{h})_a \right )_i &= \frac{\partial}{\partial \theta_i} \left ( \frac{e^{h_a}}{\sum_k{e^{h_k}}} \right ) \\ &=\left ( \frac{1}{{\sum_k{e^{h_k}}}} \right )^2 \left ( e^{h_a} \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{e^{h_k}} - e^{h_a} \sum_k{e^{h_k} \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \\ &=\left ( \frac{1}{{\sum_k{e^{h_k}}}} \right )^2 e^{h_a} \left ( \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{e^{h_k}} - \sum_k{e^{h_k} \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \tag{factoring out exponential term}\\ &=\left ( \frac{e^{h_a}}{{\sum_k{e^{h_k}}}} \right ) \left ( \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{\frac{e^{h_k}}{\sum_l e^{h_l}}} - \sum_k{\frac{e^{h_k}}{\sum_l e^{h_l}} \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \tag{distributing squared fraction}\\ &=\pi_a \left ( \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{\pi_k} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \tag{definition of policy function}\\ &=\pi_a \left ( \frac{\partial{h_a}}{\partial{\theta_i}} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \end{align}$ The final step follows from the fact that the policy function is a probability distribution, so the sum over it is always 1. """metadatashow_logsèdisabled®skip_as_script«code_folded$65be0e58-24be-4932-92a9-9e4825b14144cell_id$65be0e58-24be-4932-92a9-9e4825b14144codebactor_critic_binary_continuing_squashed_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, args...; kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_binary_continuing_squashed_gaussian_parameter_study(mdp, one(T), get_active_features, num_features, args...; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$60c21e9c-e42d-4f0b-a910-3b318440fbc8cell_id$60c21e9c-e42d-4f0b-a910-3b318440fbc8code@bind gaussian_plot_params PlutoUI.combine() do Child md""" ### Normal Distribution Plot with $$\mu$$: $(Child(Slider(-4:.01:4, default = 0, show_value=true))) $$\sigma$$: $(Child(Slider(0.01:0.01:5, default = 1, show_value=true))) """ endmetadatashow_logsèdisabled®skip_as_script«code_folded$da2d3186-a778-41cc-9b49-759bf1e9b8facell_id$da2d3186-a778-41cc-9b49-759bf1e9b8facodeٟconst BinaryFeatures{I} = Union{C1, C2, C3} where {I <: Integer, C1 <: AbstractVector{I}, N, C2 <: NTuple{N, I}, T<:AbstractVector{I}, C3 <: Base.Generator{T}}metadatashow_logsèdisabled®skip_as_script«code_folded$b695ef21-a1ac-4d1f-a0e1-71cd81cede18cell_id$b695ef21-a1ac-4d1f-a0e1-71cd81cede18codeWplot_mountaincar_continuous_values(mountaincar_continuous_test_train2.policy_and_value)metadatashow_logsèdisabled®skip_as_script«code_folded$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00cell_id$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00codefunction reinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, max_episodes::Integer; policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), value_params::Vector{T} = zeros(T, num_features), kwargs...) 
where {T<:Real, S, N, A <: Union{T, NTuple{N, T}}, P, F1, F2, F3} setup = setup_binary_gaussian_policy_arguments(mdp, get_active_features, num_features) reinforce_with_baseline_monte_carlo_control!(policy_params, setup.eligibility_vector, value_params, BinaryFeatureVector(), mdp, update_binary_action_preferences!, setup.action_distribution_parameters, make_gaussian_sampler(rand(A)), update_gaussian_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, binary_value_function, update_binary_value_gradient!, max_episodes; kwargs...) endmetadatashow_logsèdisabled®skip_as_script«code_folded$dcb306ae-a1b1-43d6-ba6e-e38668838689cell_id$dcb306ae-a1b1-43d6-ba6e-e38668838689code'md""" ### *Soft-max Implementation* """metadatashow_logsèdisabled®skip_as_script«code_folded$54f559b6-8a62-4a42-894d-c56e41d5ebefcell_id$54f559b6-8a62-4a42-894d-c56e41d5ebefcode;const corridor_state_counts = collect_state_distributions()metadatashow_logsèdisabled®skip_as_script«code_folded$f545c800-0bf3-491f-9d7d-42341cfdb573cell_id$f545c800-0bf3-491f-9d7d-42341cfdb573codefunction form_state_continuous_policy_function(update_feature_vector!::Function, update_action_preferences!::Function) function π!(x, action_preferences, s, params) # @info "Updating feature vector with state $(s)" update_feature_vector!(x, s) update_action_preferences!(action_preferences, x, params) # @info "Action distribution is $action_preferences" return action_preferences end endmetadatashow_logsèdisabled®skip_as_script«code_folded$8b35661b-5075-4d63-bc31-044407f99acfcell_id$8b35661b-5075-4d63-bc31-044407f99acfcodeactor_critic_with_eligibility_traces_binary_features(corridor_continuing_mdp, 0.75f0, 0.25f0, get_corridor_features, 1, 1_000_000, α_θ = 0.00625f0, α_w = 0.0004f0, α_r̄ = 0.004f0, policy_params = [0f0 3.7f0]; save_step_rewards = true).policy_and_value(1)metadatashow_logsèdisabled®skip_as_script«code_folded$09dd1440-5d09-421f-addc-b1ede43ff517cell_id$09dd1440-5d09-421f-addc-b1ede43ff517codeolet x = 
LinRange(-5, 5, 1000) plot(scatter(x = x, y = pdf.(Normal(gaussian_plot_params...), x)), Layout()) endmetadatashow_logsèdisabled®skip_as_script«code_folded$a0ca7a5e-0089-4a45-9278-c0f27cd096a0cell_id$a0ca7a5e-0089-4a45-9278-c0f27cd096a0codeWplot_mountaincar_continuous_values(mountaincar_continuous_test_train3.policy_and_value)metadatashow_logsèdisabled®skip_as_script«code_folded$64b38d1f-ecf9-4843-89a1-4c8953048265cell_id$64b38d1f-ecf9-4843-89a1-4c8953048265code٭const cartpole_fcann_continuing_test_episode = runepisode(cartpole_setup.mdps.episodic.discrete; π = cartpole_continuing_fcann_test.policy_sample_action, max_steps = 1_000)metadatashow_logsèdisabled®skip_as_script«code_folded$d963ff6d-f1b6-4799-aa0e-1ae100310d84cell_id$d963ff6d-f1b6-4799-aa0e-1ae100310d84codepPlutoDevMacros.@frompackage @raw_str(joinpath(@__DIR__, "..", "ApproximationUtils.jl")) using ApproximationUtilsmetadatashow_logsèdisabled®skip_as_script«code_folded$b16899b7-36bf-4a5e-8e2f-4496b8450687cell_id$b16899b7-36bf-4a5e-8e2f-4496b8450687codesquashed_gaussian_pdf(x::Union{T, AbstractArray{N, T}}, μ::T, σ::T, xmax::T) where {N, T<:Real} = inv(σ*sqrt(T(2)*π)) * exp(-inv(T(2))*((atanh(x/xmax) - μ)/σ)^2) / abs(xmax*(1 - (x/xmax)^2))metadatashow_logsèdisabled®skip_as_script«code_folded$10cdd16e-a337-4421-a7a0-6de4e4b60c0fcell_id$10cdd16e-a337-4421-a7a0-6de4e4b60c0fcodebegin mutable struct BinaryGaussianEligibilityVector{T<:Real, A<:Union{T, NTuple{N, T} where N}, P<:Union{T, Vector{T}}, B <: BinaryFeatureVector} binary_features::B a::A μ::P σ::P end BinaryGaussianEligibilityVector(a::T) where T<:Real = BinaryGaussianEligibilityVector(BinaryFeatureVector(), a, zero(T), one(T)) BinaryGaussianEligibilityVector(a::NTuple{N, T}) where {T<:Real, N} = BinaryGaussianEligibilityVector(BinaryFeatureVector(), a, zeros(T, N), ones(T, N)) endmetadatashow_logsèdisabled®skip_as_script«code_folded$a8b40b8f-051a-4e6f-a079-ece4f32873decell_id$a8b40b8f-051a-4e6f-a079-ece4f32873decodeVfunction 
create_actor_critic_params_UI(;λ_θ = 0.5f0, λ_w = 0.5f0, log2α_θ = -10, log2α_w = -10) PlutoUI.combine() do Child @htl(""" $(md""" $$\lambda_\theta$$: $(Child(:λ_θ, Slider(0.00f0:0.001f0:.999f0, default = λ_θ, show_value=true))) $$\lambda_\mathbf{w}$$: $(Child(:λ_w, Slider(0.00f0:0.001f0:.999f0, default = λ_w, show_value=true))) $$\log_2 \alpha_\theta$$ min: $(Child(:α_θ_min, NumberField(-100:0, default = log2α_θ))) $$\log_2 \alpha_{\mathbf{w}}$$ min: $(Child(:α_w_min, NumberField(-100:0, default = log2α_w))) """)""") end |> confirm endmetadatashow_logsèdisabled®skip_as_script«code_folded$5eebf3da-bfe7-46eb-81a3-f87f334ee270cell_id$5eebf3da-bfe7-46eb-81a3-f87f334ee270codefunction create_actor_critic_fcann_params_UI(;λ_θ = 0.5f0, λ_w = 0.5f0, h = 8, l = 2, log2α_θ = -10, log2α_w = -10) PlutoUI.combine() do Child md""" $$\lambda_\theta$$: $(Child(:λ_θ, Slider(0.00f0:0.001f0:.999f0, default = λ_θ, show_value=true))) $$\lambda_\mathbf{w}$$: $(Child(:λ_w, Slider(0.00f0:0.001f0:.999f0, default = λ_w, show_value=true))) hidden layer size: $(Child(:h, NumberField(1:128, default = h))), num layers: $(Child(:l, NumberField(1:5, default = l))) $$\log_2 \alpha_\theta$$ min: $(Child(:α_θ_min, NumberField(-100:0, default = log2α_θ))) $$\log_2 \alpha_{\mathbf{w}}$$ min: $(Child(:α_w_min, NumberField(-100:0, default = log2α_w))) """ end |> confirm endmetadatashow_logsèdisabled®skip_as_script«code_folded$9bce6fdb-2cbc-4758-9a8b-794e490c973dcell_id$9bce6fdb-2cbc-4758-9a8b-794e490c973dcode8@bind ep2_step Slider(1:length(ep2[1]), show_value=true)metadatashow_logsèdisabled®skip_as_script«code_folded$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbfcell_id$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbfcode"function create_continuous_action_mountaincar(;slipforce = 1f0) mdp = MountainCarTask.mdp function step(s, a) f = if abs(a) > slipforce sign(a)*0.1f0 else a end (-1f0, MountainCarTask.step(s, f)) end ContinuousMDP(step, mdp.initialize_state, 0f0; isterm = mdp.isterm) 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6cell_id$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6codeٚconst mountaincar_continuing_test_episode = runepisode(MountainCarTask.mdp, π = mountaincar_continuing_tile_test.policy_sample_action, max_steps = 1_000)metadatashow_logsèdisabled®skip_as_script«code_folded$7afb6fb0-248a-4518-b94f-9876f81eca64cell_id$7afb6fb0-248a-4518-b94f-9876f81eca64codecorridor_continuing_parameter_study(args...; kwargs...) = actor_critic_linear_parameter_study(corridor_continuing_mdp, get_corridor_features, 1, args...; init_policy_params = [0f0 3.7f0], seed = 45, binary_features=true, kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$37a273b6-b104-46f0-987a-401dc1c97327cell_id$37a273b6-b104-46f0-987a-401dc1c97327codeW@bind start_cartpole_continuing_binary_param_study CounterButton("Run Parameter Study")metadatashow_logsèdisabled®skip_as_script«code_folded$7a6f3f79-ea06-4994-8b62-90b2056e4034cell_id$7a6f3f79-ea06-4994-8b62-90b2056e4034codebegin function squashed_gaussian_action_sampler(params::Vector{T}, amax::T) where T<:Real σ = exp(params[2]) μ = params[1] isinf(μ) && return amax*sign(μ) isapprox(σ, zero(T)) && return amax*tanh(μ) isnan(σ) && return amax*tanh(μ) amax*tanh(rand(Normal(μ, σ))) end make_squashed_gaussian_n_sampler(::Val{1}, amax::T) where T<:Real = params -> squashed_gaussian_action_sampler(params, amax) function make_squashed_gaussian_n_sampler(::Val{N}, amax::NTuple{N, T}) where {N, T<:Real} function f(params::Vector{T}) where T<:Real ntuple(i -> amax[i]*tanh(rand(Normal(params[i], exp(params[i+N])))), N) end end make_squashed_gaussian_n_sampler(n::Integer, amax::T) where T<:Real = make_squashed_gaussian_n_sampler(Val(n), amax) make_squashed_gaussian_sampler(::T, amax::T) where T<:Real = params -> squashed_gaussian_action_sampler(params, amax) make_squashed_gaussian_sampler(::NTuple{N, T}, amax::NTuple{N, T}) where {N, T<:Real} = make_squashed_gaussian_n_sampler(N, amax) 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$f2ed56c9-c2b7-42cb-a083-e12aeaa126efcell_id$f2ed56c9-c2b7-42cb-a083-e12aeaa126efcodeِreinforce_monte_carlo_control_binary_features(corridor_mdp, get_corridor_features, 1, 1_000, α = 2f0^-13, max_steps = 1_000).policy_function(1)metadatashow_logsèdisabled®skip_as_script«code_folded$cbea5840-49d2-4e91-be9c-f5f15666d78acell_id$cbea5840-49d2-4e91-be9c-f5f15666d78acodeٰreinforce_with_baseline_monte_carlo_control_binary_features(corridor_mdp, get_corridor_features, 1, 1_000, α_θ = 2f0^-12, α_w = 2f0^-6, max_steps = 1_000).policy_function(1)metadatashow_logsèdisabled®skip_as_script«code_folded$1f041cb3-618c-4380-a1ec-d7bbe4a80f62cell_id$1f041cb3-618c-4380-a1ec-d7bbe4a80f62codeCfunction actor_critic_binary_episodic_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, λ_θ::T, λ_w::T, α_θ_list::AbstractVector{T}, α_w_list::AbstractVector{T}, max_episodes::Integer; nruns::Integer = 100, max_steps::Integer = 10_000, seed = rand(UInt64), init_policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), init_value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) function average_runs(α_θ, α_w) 1:nruns |> Map(_ -> actor_critic_with_eligibility_traces_binary_features(mdp, λ_θ, λ_w, get_active_features, num_features, max_episodes, max_steps; α_θ = α_θ, α_w = α_w, policy_params = copy(init_policy_params), value_params = copy(init_value_params), kwargs...) |> x -> isempty(x.episode_rewards) ? -T(Inf) : mean(x.episode_rewards)) |> foldxt(+) |> x -> x / nruns end traces = [begin scatter(x = α_θ_list, y = average_runs.(α_θ_list, α_w), name = "α_w = $α_w") end for α_w in α_w_list] plot(traces, Layout(xaxis_title = "α_θ", yaxis_title = "Average Reward Per Episode in the First
$max_episodes Episodes Averaged Over $nruns Runs", xaxis_type = "log", title = "Binary Feature Encoding with $num_features Features, λ_θ = $λ_θ, λ_w = $λ_w")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$96506201-6b66-49e6-8179-06952e2394e1cell_id$96506201-6b66-49e6-8179-06952e2394e1codeBfunction setup_binary_policy_arguments(mdp::StateMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer) where {T<:Real, S, A, P, F1, F2, F3} x = BinaryFeatureVector() update_feature_vector!(x::BinaryFeatureVector, s) = update_binary_feature_vector!(x, s, get_active_features) action_preferences = zeros(T, length(mdp.actions)) ∇lnπ = BinaryEligibilityVector(x, 1, copy(action_preferences)) return (feature_vector = x, update_feature_vector! = update_feature_vector!, action_preferences = action_preferences, eligibility_vector = ∇lnπ) endmetadatashow_logsèdisabled®skip_as_script«code_folded$76b03e72-da04-4530-8534-6d6468268cbdcell_id$76b03e72-da04-4530-8534-6d6468268cbdcode md""" $\sum_{s \in \mathcal{S}} \sum_{k = 0}^\infty \Pr \{ s_0 \rightarrow s, k, \pi \} = \sum_{k = 0}^\infty \left [ 1 - \Pr \{s_0 \rightarrow S_T, k, \pi \} \right ] = \eta$ where $\eta$ is the average length of an episode. The quantity inside the brackets is the probability that an episode has not terminated by step k and follows from the fact that the sum over states in $\mathcal{S}$ is over the set of non-terminal states. If the sum was over $\mathcal{S}^+$ instead then it would be infinite since the first sum term would be 1 for every k. Normally to calculate $\eta$, we would use the expected value with the probability of an episode lasting exactly $k$ steps, but the probability we have access to here is actually the distribution function, not the density function. That is $\Pr \{s_0 \rightarrow S_T, k, \pi \} = \sum_{t = 0}^k \Pr \{ T = t \} = \Pr \{ T \leq k \}$ where $T$ is the length of an episode. 
Using these probabilities, we can write $\eta = \mathbb{E}_\pi [T] = \sum_{k = 0}^\infty k \Pr \{ T = k \} = \Pr \{T = 1 \} + 2 \Pr \{T = 2 \} + \cdots$. Earlier we had the expression $\eta = \sum_{k = 0}^\infty \left [ 1 - \Pr \{s_0 \rightarrow S_T, k, \pi \} \right ] = \sum_{k = 0}^\infty \Pr \{T \gt k \} = \sum_{k = 0}^\infty \sum_{t = k + 1}^\infty \Pr \{T = t \}$. We can stack up the terms of this double sum to see that it is equivalent to the expected value calculation from before: $\begin{flalign} \Pr \{ T = 1 \} + \Pr \{ T = 2 \} + &\Pr \{ T = 3 \} +\cdots \\ \Pr \{ T = 2 \} + &\Pr \{ T = 3 \} + \cdots \\ &\Pr \{ T = 3 \} + \cdots \\ \vdots \end{flalign}$ If we count how many times each term appears across the rows, we see that $\Pr \{ T = k \}$ appears exactly $k$ times, matching the expected value calculation. What if we wanted to calculate the bivariate distribution over states and steps that ignores the terminal states, $\mu_\pi(s, k)$, such that $\sum_{s \in \mathcal{S}} \sum_k \mu_\pi(s, k) = 1$? This probability represents the chance of sampling a particular step and state simultaneously from an unbiased sample of non-terminal states in an episode. Luckily we can break down this probability into two components: 1) the probability of reaching step $k$ without terminating and 2) the probability of being in a particular non-terminal state $s$ on step $k$ given that the episode has not terminated. We saw already that 1) is just $\sum_{s \in \mathcal{S}} \Pr \{ s_0 \rightarrow s, k, \pi \}$ and 2) we can calculate by normalizing those probabilities over only the non-terminal states: $\frac{\Pr \{ s_0 \rightarrow s, k, \pi \}}{\sum_{s \in \mathcal{S}} \Pr \{ s_0 \rightarrow s, k, \pi \} }$. By multiplying these two together we see that the probability is just the original distribution but where the domain of possible input values is $s \in \mathcal{S}$ and all possible steps $k$. 
Therefore, we can transform this into a normalized bivariate distribution by dividing by its sum over those two sets: $\mu_\pi(s, k) = \frac{\Pr \{ s_0 \rightarrow s, k, \pi \}}{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}}$ Now that we have established the relationship between the on-policy distribution function and the probability expression we have, we can use it to complete the proof below. """metadatashow_logsèdisabled®skip_as_script«code_folded$fd89433e-643c-474b-b3c4-a997678421a6cell_id$fd89433e-643c-474b-b3c4-a997678421a6codemd""" #### Linear Features This version of REINFORCE uses linear feature vectors for which one needs to specify the total number of features as well as a function that updates the values in a feature vector given a state. """metadatashow_logsèdisabled®skip_as_script«code_folded$87feff3e-e510-4916-91a9-db3a2cd12225cell_id$87feff3e-e510-4916-91a9-db3a2cd12225code@bind fcann_continuing_cartpole_study_params PlutoUI.combine() do Child md""" $$\lambda_\theta$$: $(Child(:λ_θ, Slider(0.00f0:0.001f0:.999f0, default = 0.75f0, show_value=true))) $$\lambda_\mathbf{w}$$: $(Child(:λ_w, Slider(0.00f0:0.001f0:.999f0, default = 0.25f0, show_value=true))) $$\alpha_{\overline{r}}$$: $(Child(:α_r̄, NumberField(0.00f0:0.001f0:.999f0, default = 0.1f0))) hidden layer size: $(Child(:h, NumberField(1:128, default = 8))), num layers: $(Child(:l, NumberField(1:5, default = 3))) $$\log_2 \alpha_\theta$$ min: $(Child(:α_θ_min, NumberField(-100:0, default = -11))) $$\log_2 \alpha_{\mathbf{w}}$$ min: $(Child(:α_w_min, NumberField(-100:0, default = -10))) """ end |> confirmmetadatashow_logsèdisabled®skip_as_script«code_folded$5261651e-a51e-4e80-8e23-83a4c10e5259cell_id$5261651e-a51e-4e80-8e23-83a4c10e5259codebegin function update_gaussian_eligibility_vector!(∇lnπ::Matrix{T}, action_dist_params::Vector{T}, x::Vector{T}, action::T, policy_params::Matrix{T}) where T<:Real c1 = action - first(action_dist_params) σ = 
exp(last(action_dist_params)) c2 = σ^-2 c3 = c2*c1 c4 = c3*c1 - one(T) @inbounds @simd for i in eachindex(x) ∇lnπ[i, 1] = x[i]*c3 end @inbounds @simd for i in eachindex(x) ∇lnπ[i, 2] = x[i]*c4 end end function update_gaussian_eligibility_vector!(∇lnπ::Matrix{T}, action_dist_params::Vector{T}, x::Vector{T}, action::NTuple{N, T}, policy_params::Matrix{T}) where {N, T <: Real} for k = 1:N c1 = action - action_dist_params[k] σ = exp(action_dist_params[k+N]) c2 = σ^-2 c3 = c2*c1 c4 = c3*c1 - one(T) @inbounds @simd for i in eachindex(x) ∇lnπ[i, k] = x[i]*c3 end @inbounds @simd for i in eachindex(x) ∇lnπ[i, k+N] = x[i]*c4 end end end endmetadatashow_logsèdisabled®skip_as_script«code_folded$dddc4a2f-34b2-41dc-85b3-55aba4880fa6cell_id$dddc4a2f-34b2-41dc-85b3-55aba4880fa6codeٌdisplay_cartpole_episode((runepisode(cartpole_setup.mdps.episodic.continuous, reinforce_test.policy_sample_action) |> x -> (x[1], x[2]))...)metadatashow_logsèdisabled®skip_as_script«code_folded$54fff14b-cf53-47b0-9cfa-8b9ee33df54ecell_id$54fff14b-cf53-47b0-9cfa-8b9ee33df54ecodebegin mutable struct BinaryBetaEligibilityVector{T<:Real, A<:Union{T, NTuple{N, T} where N}, P<:Union{T, Vector{T}}, B <: BinaryFeatureVector} binary_features::B a::A α::P β::P end BinaryBetaEligibilityVector(a::T) where T<:Real = BinaryBetaEligibilityVector(BinaryFeatureVector(), a, one(T), one(T)) BinaryBetaEligibilityVector(a::NTuple{N, T}) where {T<:Real, N} = BinaryBetaEligibilityVector(BinaryFeatureVector(), a, ones(T, N), ones(T, N)) endmetadatashow_logsèdisabled®skip_as_script«code_folded$023f67b8-8f38-470a-9766-ac60a75678aacell_id$023f67b8-8f38-470a-9766-ac60a75678aacodefconst mountaincar_fcann_setup = fcann_feature_vector_setup(mountaincar_min_vals, mountaincar_max_vals)metadatashow_logsèdisabled®skip_as_script«code_folded$1558cec1-c4fd-4bc0-85ed-ae22c6067d41cell_id$1558cec1-c4fd-4bc0-85ed-ae22c6067d41codemd""" We can also repeat this derivation for the alternative linear parameterization where we only have state feature 
vectors and a parameter matrix with components $\boldsymbol{\theta}_{i, j}$: $\begin{flalign} \mathbf{h} &= \boldsymbol{\theta}^\top \mathbf{x}(s) \\ h_a &= \mathbf{h}_a \\ \mathbf{\pi}(s) &= \sigma(\mathbf{h}) \\ \pi_a &= \sigma(\mathbf{h})_a \\ \nabla(\pi_a)_{i, j} &= \pi_a \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases} \end{flalign}$ We already know how to apply the chain rule to the natural logarithm, so applying it to the above expression yields the final gradient: $\begin{flalign} \nabla \left ( \ln \pi_a \right )_{i, j} &= \frac{\nabla \left ( \pi_a \right )_{i, j}}{\pi_a} \\ &= \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases} \end{flalign}$ which is the per-component version of the desired vector expression. """metadatashow_logsèdisabled®skip_as_script«code_folded$da8d0bca-105b-4d0b-a73d-ee5c9059aeafcell_id$da8d0bca-105b-4d0b-a73d-ee5c9059aeafcodemd""" Notice now that all of the parameters associated with the state-value estimate are irrelevant since they always cancel out in the parameter update. Even though we have added a parameter, this method effectively removes two from the analysis. Also, we seem to actually benefit from an intermediate value of $\lambda_{\boldsymbol{\theta}}$, unlike in the episodic case where using the Monte Carlo method was always the best. 
"""metadatashow_logsèdisabled®skip_as_script«code_folded$3e7cecec-eb77-4862-8e3c-b510422e06dbcell_id$3e7cecec-eb77-4862-8e3c-b510422e06dbcode8plot_squashed_gaussian(squashed_gaussian_plot_params...)metadatashow_logsèdisabled®skip_as_script«code_folded$0284f0d7-b8a9-4ae6-add0-ac1078571d9bcell_id$0284f0d7-b8a9-4ae6-add0-ac1078571d9bcode3md""" $\begin{flalign} J(\boldsymbol{\theta}) \doteq r(\pi) &\doteq \lim_{h \rightarrow \infty} \frac{1}{h} \sum_{t=1}^h \mathbb{E} [R_t \mid S_0, A_{0:t-1} \sim \pi] \tag{13.15} \\ &= \lim_{t \rightarrow \infty} \mathbb{E}[R_t \vert S_0,A_{0:t-1} \sim \pi] \\ &= \sum_s \mu(s) \sum_a \pi(a \vert s) \sum_{s^\prime, r} p(s^\prime, r \vert s, a) r \end{flalign}$ where $\mu$ is the steady-state distribution under $\pi$, $\mu(s) \doteq \lim_{t \rightarrow \infty} \Pr \{ S_t = s \vert A_{0:t} \sim \pi \}$, which is assumed to exist and to be independent of $S_0$ (an ergodicity assumption). Remember that this is the special distribution under which, if you select actions according to $\pi$, you remain the same distribution: $\sum_s \mu(s) \sum_a \pi(a \vert s, \boldsymbol{\theta})p(s^\prime \vert s, a) = \mu(s^\prime), \: \forall s^\prime \in \mathcal{S}$ Naturally, in the continuing case, we define values, $v_\pi(s) \doteq \mathbb{E}_\pi [G_t \vert S_t = s]$ and $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \vert S_t = s, A_t = a]$, with respect to the differential return: $G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots \tag{13.17}$ With these alternate definitions, the policy gradient theorem as given for the episodic case (13.5) remains true for the continuing case. 
See proof below: """metadatashow_logsèdisabled®skip_as_script«code_folded$b94fc99c-f439-4df2-8da3-c01718a136c4cell_id$b94fc99c-f439-4df2-8da3-c01718a136c4code{md""" Repeating this process for state 2 yields: $\begin{flalign} v_2 &= -\frac{2+p}{p(1-p)} \\ \frac{\partial v_2}{\partial p} &= -\frac{p(1-p) - (2+p)(1 - 2p)}{p^2(1-p)^2} \end{flalign}$ Setting this equal to 0 implies $\begin{flalign} p - p^2 &= 2 - 4p + p - 2p^2 \\ p^2 + 4p - 2 &= 0 \\ \end{flalign}$ Using the quadratic formula and taking only the positive solution yields: $p = \frac{-4 + \sqrt{16 + 8}}{2} = \frac{-4 + \sqrt{24}}{2} = -2 + \sqrt{6} \approx 0.4495$ So, in order to maximize the value at state 2, we have $p_{\text{left}} \approx 0.4495$ and $p_{\text{right}} \approx 0.5505$, which is different from the value we got for state 1. So there is a different optimal policy depending on the starting state. It should be obvious, for example, that starting in the third state results in an optimal policy of choosing the right action every time. The value functions for each state are plotted below. The behavior of $v_3$ is not well defined at $p=0$ because for any finite $v_2$ it should be 0 but the limit approaching from the right side is -3. This is because for $p=0$ both $v_1$ and $v_2$ are not finite and the episode never terminates. 
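As a quick numerical check of the state 2 optimum derived above, the following minimal Julia sketch evaluates the closed-form expression for $v_2$ on a grid and compares the maximizer with the positive root of the quadratic. The names `v2`, `p_opt`, `grid` and `p_grid` are introduced here for illustration only and are not used elsewhere in the notebook.

```julia
# State 2 value as a function of p, copied from the expression above
v2(p) = -(2 + p) / (p * (1 - p))

# Positive root of p^2 + 4p - 2 = 0
p_opt = -2 + sqrt(6)

# Independent check: coarse grid search over the open interval (0, 1)
grid = 0.001:0.001:0.999
p_grid = grid[argmax(v2.(grid))]

# The grid maximizer should agree with the analytic root up to the grid spacing
println((p_opt = p_opt, p_grid = p_grid, v2_at_p_opt = v2(p_opt)))
```
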
The value of the state at this probability is: $v_2 = - \frac{2+p}{p(1-p)} = -\frac{\sqrt{6}}{(\sqrt{6}-2)(3 - \sqrt{6})} = - \frac{\sqrt{6}}{3 \sqrt{6} - 6 - 6 + 2 \sqrt{6}} = - \frac{\sqrt{6}}{5 \sqrt{6} - 12} \approx -9.9$ """metadatashow_logsèdisabled®skip_as_script«code_folded$b8532822-179b-4cd5-a279-4b71dafb544acell_id$b8532822-179b-4cd5-a279-4b71dafb544acode2const mountaincar_continuous_test_train = actor_critic_with_eligibility_traces_binary_features_gaussian_actions(mountaincar_continuous_mdp, 0.05f0, 0.8f0, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, typemax(Int64), 1_000_000; α_θ = 5f-5, α_w = 0.00008f0)metadatashow_logsèdisabled®skip_as_script«code_folded$07ba9fe4-aaa7-4123-9865-cbfa79d0d44acell_id$07ba9fe4-aaa7-4123-9865-cbfa79d0d44acode٣display_cartpole_episode((runepisode(cartpole_setup.mdps.episodic.discrete; π = reinforce_test4.policy_sample_action, max_steps = 1_000) |> x -> (x[1], x[2]))...)metadatashow_logsèdisabled®skip_as_script«code_folded$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7dcell_id$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7dcodee@bind start_mountaincar_continuing_fcann_param_study CounterButton("Run Mountaincar Parameter Study")metadatashow_logsèdisabled®skip_as_script«code_folded$5c4a383f-fcf2-4f2b-819f-6d84471dda00cell_id$5c4a383f-fcf2-4f2b-819f-6d84471dda00codeufunction update_fcann_value_gradient!(∇v̂::FCANNParams, x::Vector{T}, params::FCANNParams, hidden_layers::Vector{Int64}, l2::T, tanh_grad_z::FCANNActivations{T}, activations::FCANNActivations{T}, deltas::FCANNActivations{T}, dropout::T, reslayers::Integer, activation_list::AbstractVector{B}, scales) where {T<:Float32, B<:Bool} FCANN.nnCostFunction(params..., hidden_layers, x, 1, l2, ∇v̂..., tanh_grad_z, activations, deltas, dropout; resLayers = reslayers, loss_type = OutputIndex(), activation_list = activation_list) @inbounds for i in eachindex(params[1]) for j in 1:2 ∇v̂[j][i] .*= scales[i] end end 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$135f205a-f87e-4691-8e87-d317d6312c84cell_id$135f205a-f87e-4691-8e87-d317d6312c84code:md""" The plots below visualize these distributions for the corridor problem starting with the normalized distributions per step which include the terminal states. If we continued to create these plots for larger values of $k$, then the distribution would collapse to a value of 1 for being in a terminal state. In order to calculate other distributions such as the stationary state distribution, it is necessary to renormalize these probabilities by excluding the terminal states: #### On-policy Distributions $$\begin{flalign} &\mu_{k, \pi}(s) = \Pr\{S_k = s \mid \pi \} \; \forall s \in \mathcal{S}^+ \tag{state visits per step}\\ &\Pr \{ T \leq k \vert \pi \} = 1 - \sum_{s \in \mathcal{S}} \Pr\{S_k = s \mid \pi \} \; \forall k \tag{Chance of terminating already (distribution function not density)}\\ &\mu_\pi(s) = \frac{\sum_k \Pr \{ S_k = s \mid \pi \}}{\sum_{k} \sum_{s \in \mathcal{S}} \Pr \{ S_k = s \mid \pi \}} \; \forall s \in \mathcal{S} \tag{non-terminal state visits}\\ &\mu_\pi(s, k) = \frac{\Pr \{ S_k = s \mid \pi \}}{\sum_{k} \sum_{s \in \mathcal{S}} \Pr \{ S_k = s \mid \pi \}} \; \forall s \in \mathcal{S} \tag{non-terminal state and step visits}\\ \end{flalign}$$ Note that the final two distributions are only defined for non-terminal states. If we tried to include terminal states we would be unable to normalize the distribution since $\lim_{k \rightarrow \infty} \Pr \{ S_k = S_T \mid \pi \} = 1$ and we would have a diverging sum in the denominator. The only reason these calculations are possible is that the probabilities reach zero quickly enough at higher $k$ for the non-terminal states. The plots below visualize the four expressions above. The second expression notably is not a probability density but a cumulative distribution function since it includes a sum of all probabilities that meet the condition. 
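The four expressions above can also be sketched numerically. The following is a minimal Julia example on a hypothetical two-state episodic chain (not the corridor problem): state 1 always moves to state 2, and state 2 either returns to state 1 or terminates with equal probability. The names `P`, `d0`, `visit_probabilities`, `μ_state`, `μ_state_step` and `cdf_T` are assumptions made for this illustration, and the infinite sums over $k$ are truncated at a finite horizon.

```julia
using LinearAlgebra

# P[i, j] = probability of moving from non-terminal state i to non-terminal state j.
# Rows may sum to less than 1; the missing mass is the chance of terminating.
P = [0.0 1.0;
     0.5 0.0]
d0 = [1.0, 0.0]   # every episode starts in state 1

# Column k + 1 of the result holds Pr{S_k = s | π} for each non-terminal state s
function visit_probabilities(P, d0, kmax)
    d = copy(d0)
    out = zeros(length(d0), kmax + 1)
    for k in 0:kmax
        out[:, k + 1] .= d
        d = transpose(P) * d
    end
    return out
end

visits = visit_probabilities(P, d0, 200)

η = sum(visits)                                # truncated estimate of the average episode length
μ_state = vec(sum(visits, dims = 2)) ./ η      # non-terminal state visits
μ_state_step = visits ./ η                     # non-terminal state and step visits
cdf_T = 1 .- vec(sum(visits, dims = 1))        # Pr{T <= k | π} for k = 0, 1, ..., 200
```

For this chain both non-terminal states are visited equally often, so `μ_state` is approximately `[0.5, 0.5]` and `η` is approximately 4, the average episode length.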
"""metadatashow_logsèdisabled®skip_as_script«code_folded$4a39f9a7-72d4-44ad-895a-742cd1291f92cell_id$4a39f9a7-72d4-44ad-895a-742cd1291f92codeW@bind dist_plot_p Slider(0.1f0:0.1f0:.9f0; default = 0.5f0, show_value=true) |> confirmmetadatashow_logsèdisabled®skip_as_script«code_folded$ee72af8d-3cb8-4314-82df-580f068e1252cell_id$ee72af8d-3cb8-4314-82df-580f068e1252code md""" One common form of linear feature vector is one that selects active features per state. Tile coding is an example of this where a state is assigned a tile in each tiling used and the number of tilings control how many active features a given state will have. Because the only possible feature vector values are 1 or 0, this style of encoding need not be as complex as other methods. We can see by the form of the gradients an abbreviated algorithm that need not compute the eligibility vector explicitely. We can define a binary feature encoding by the function $\mathcal{F}(s)$ which returns the indices of active features for a state $s$ as well as the knowledge of how many total features there are, $d$. All of the values of $\mathbf{x}(s)$ are zero except for the indices in $\mathcal{F}(s)$ whose values are 1. That simplifies the expression we have before for the linear feature eligibility vector: $\begin{flalign} \nabla \left ( \ln \pi_a \right )_{i, j} &= \frac{\nabla \left ( \pi_a \right )_{i, j}}{\pi_a} \\ &= \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases} \\ &= \begin{cases} (1 - \pi_j), & \text{ if } j = a \text{, } i \in \mathcal{F}(s) \\ -\pi_j, & \text{ if } j \neq a \text{, } i \in \mathcal{F}(s) \\ 0, & \text{ otherwise} \end{cases} \end{flalign}$ We can see from this form of the eligibility vector that it need not be computed explicitely and we do not need to instantiate a feature vector either. 
Rather we can simply go through the active feature indices and subtract the policy output for the column index at each row and then add 1 to the column corresponding to the selected action: Loop for each step of the episode $t = 0, 1, \cdots, T-1$ $G \leftarrow \sum_{k=t+1} \gamma^{k-t-1}R_k$ $c = \alpha \times \gamma^t \times G$ Loop for each action index j Loop for each feature i $\theta_{i, j} \leftarrow \theta_{i, j} - c \times \pi(a_j, S_t, \mathbf{\theta})$ Define $j_a$ as the column index corresponding to action $A_t$ Loop for each feature i $\theta_{i, j_a} \leftarrow \theta_{i, j_a} + c$ Specialized versions of REINFORCE that use binary features and linear features can be found below as well as the general case that works for any type of parameterized function approximation. """metadatashow_logsèdisabled®skip_as_script«code_folded$e524f8cc-ab69-4f8b-a59f-28156696a104cell_id$e524f8cc-ab69-4f8b-a59f-28156696a104codec@bind run_mountaincar_binary_episodic_countinuous_param_study2 CounterButton("Run Parameter Study")metadatashow_logsèdisabled®skip_as_script«code_folded$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09cell_id$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09code,plot(mountaincar_test_train.episode_rewards)metadatashow_logsèdisabledîskip_as_script«code_folded$f3bc47b5-03fc-4bd9-a890-26f9608a730bcell_id$f3bc47b5-03fc-4bd9-a890-26f9608a730bcode.md""" ### *Continuing Corridor Gridworld Example* Note that if we try to apply this algorithm to the short corridor gridworld it fails because a terminal state is encountered. This condition is checked inside the algorithm because there is nothing about an MDP the way it is defined which tells you in advance if it is a continuing task or not. In the tabular case you can always check to see if a terminal state exists since every state is available, but for the non-tabular case, all we can do is note the problem if a terminal state is encountered. 
"""metadatashow_logsèdisabled®skip_as_script«code_folded$4915b1ed-ad53-4ece-9b00-bc136d47d8dccell_id$4915b1ed-ad53-4ece-9b00-bc136d47d8dccodemd""" It is implicit in all expressions below that $\pi$ is a function of $\boldsymbol{\theta}$ and that the gradients are with respect to $\boldsymbol{\theta}$. The performance measure for the continuing case is $J(\boldsymbol{\theta}) = r(\boldsymbol{\theta})$ (13.15) and all value functions use the definition of the differential return. We begin by expressing the gradient of the state value function in terms of the state-action value function, the policy, the average return and gradients thereof: $\begin{flalign} \nabla v_\pi(s) &= \nabla \left [ \sum_a \pi(a \vert s) q_\pi (s, a) \right ], \: \forall s \in \mathcal{S} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla q_\pi(s, a) \right ] \tag{product rule} \\ &=\sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla \sum_{s^\prime, r} p(s^\prime, r, \vert s, a)\left (r - r(\boldsymbol{\theta}) + v_\pi(s^\prime) \right ) \right ] \tag{differential return definitions} \\ &=\sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) [ -\nabla r(\boldsymbol{\theta}) + \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) ] \right ] \tag{distributing gradient}\\ \end{flalign}$ The purpose of this expression is to isolate the term which is the gradient of the average return since this is the performance metric gradient we originally sought. Note that if we separate the terms inside the sum, the one with the gradient of $r$ is $\sum_a \pi(a\vert s) [- \nabla r(\boldsymbol{\theta})] = -\nabla r(\boldsymbol{\theta}) \sum_a \pi(a \vert s)$. But the policy function is a probability distribution so its sum over actions is just 1. 
Therefore, this term simplifies to just $-\nabla r(\boldsymbol{\theta})$ which we can simply move to the other side of the expression swapping its place with the state value function: $\begin{flalign} \nabla v_\pi(s)&=-\nabla r(\boldsymbol{\theta}) + \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] \\ \nabla r(\boldsymbol{\theta}) &=-\nabla v_\pi(s) + \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] \end{flalign}$ Now the left hand side is $\nabla J(\boldsymbol{\theta})$ and does not depend on $s$. As such, the right hand side as a whole must be independent of $s$ as well so we are free to take a weighted sum of it over some probability distribution on $s$ since all the terms sum to 1. That is, if $f$ is independent of $s$, then $f = \sum_s \mu(s) f = f \sum_s \mu(s) = f \times 1 = f$: $\begin{flalign} \nabla J(\boldsymbol{\theta}) &= \sum_s \mu(s) \left ( \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] - \nabla v_\pi(s) \right ) \\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) + \sum_s \mu(s) \sum_a \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) - \sum_s \mu(s) \nabla v_\pi(s) \tag{separating sum terms}\\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) + \sum_{s^\prime} \sum_s \mu(s) \sum_a \pi(a \vert s) p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) - \sum_s \mu(s) \nabla v_\pi(s) \tag{swapping sum order in second term}\\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) + \sum_{s^\prime} \mu(s^\prime) \nabla v_\pi(s^\prime) - \sum_s \mu(s) \nabla v_\pi(s) \tag{stationary state distribution definition}\\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) \tag{cancelling equivalent sum terms}\\ &= \mathbb{E}_\pi \left [ \sum_a \nabla 
\pi(a \vert S_t) q_\pi(S_t, a) \right ] \tag{expected value definition}\\ &= \mathbb{E}_\pi \left [ \sum_a \pi(a \vert S_t) \frac{\nabla \pi(a \vert S_t)}{\pi(a \vert S_t)} q_\pi(S_t, a) \right ] \tag{multiplying and dividing by the policy}\\ &= \mathbb{E}_\pi \left [\frac{\nabla \pi(A_t \vert S_t)}{\pi(A_t \vert S_t)} q_\pi(S_t, A_t) \right ] \tag{expected value definition}\\ &= \mathbb{E}_\pi \left [\frac{\nabla \pi(A_t \vert S_t)}{\pi(A_t \vert S_t)} G_t \right ] \tag{differential return definition}\\ &= \mathbb{E}_\pi \left [G_t \nabla \ln \pi(A_t \vert S_t) \right ] \tag{chain rule}\\ \end{flalign}$ The expression inside the expected value can be sampled on every time step and the gradient is only in terms of the policy function which we have selected as something differentiable with respect to the parameters. Since this method will only be used for continuing problems, we cannot rely on Monte Carlo sampling for the differential return. Instead, our only option is to use a bootstrap value estimate in combination with a running estimate of the average reward and the immediate sample reward: $R - \overline{R} + \hat v^\prime$ where $\hat v^\prime$ is the differential value function estimate at the transition state and $\overline{R}$ is an estimate of the average reward. We can apply the existing actor-critic algorithms to these continuing problems as long as we track that additional information and use an additional step size parameter to update the average reward estimate. This step size parameter replaces the discount rate. See a full implementation below: """metadatashow_logsèdisabled®skip_as_script«code_folded$f924eb30-d1cc-4941-8fb5-ff70ad425ab9cell_id$f924eb30-d1cc-4941-8fb5-ff70ad425ab9codemd""" ## 13.3 REINFORCE: Monte Carlo Policy Gradient If we replace the true action-value function in (13.5) with a learned approximation $\hat q_\pi$, then we have a method called the *all-actions* method because the update involves the sum over all actions. 
For the REINFORCE algorithm, we instead sample this value using the actual return and the policy distribution. We can re-write (13.5) using an expected value under the policy and continue from there: $\begin{flalign} \nabla J(\boldsymbol{\theta}) & \propto \mathbb{E}_\pi \left [ \gamma^t \sum_a q_\pi (S_t, a) \nabla \pi(a|S_t, \boldsymbol{\theta}) \right ] \tag{13.6}\\ &= \mathbb{E}_\pi \left [\gamma^t \sum_a \pi(a|S_t, \boldsymbol{\theta}) q_\pi (S_t, a) \frac{\nabla \pi(a|S_t, \boldsymbol{\theta})}{\pi(a|S_t, \boldsymbol{\theta})} \right ] \tag{multiply and divide by policy} \\ &= \mathbb{E}_\pi \left [ \gamma^t q_\pi (S_t, A_t) \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})} \right ] \tag{replace a with sample under policy} \\ &= \mathbb{E}_\pi \left [ \gamma^t G_t \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})} \right ] \tag{replace value with sample return} \\ \end{flalign}$ Using the expression in the brackets we can write down an update rule for the parameters that can be sampled on each time step. This is the **REINFORCE update**: $\begin{align} \boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t + \alpha \gamma^t G_t \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta}_t)}{\pi(A_t|S_t, \boldsymbol{\theta}_t)} \tag{13.8} \end{align}$ Because it uses the complete return from time $t$, which includes all future rewards up to the end of the episode, REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case. For implementation purposes we can replace $\frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta}_t)}{\pi(A_t|S_t, \boldsymbol{\theta}_t)}$ with $\nabla \ln \pi(A_t|S_t, \boldsymbol{\theta}_t)$ which is usually referred to as the *eligibility vector*. With the alternative parameterization, the eligibility vector is $\nabla \ln \pi(S_t, \boldsymbol{\theta}_t)_{A_t}$ where $\pi$ is a vector and the $A_t$ subscript takes the value of that vector at the index corresponding to the action $A_t$. 
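
As a minimal sketch of update (13.8), assume a linear soft-max policy whose action preferences are $\boldsymbol{\theta}^\top \mathbf{x}(s)$, with $\boldsymbol{\theta}$ a $d \times N_a$ matrix and $\mathbf{x}(s)$ a length $d$ feature vector. Under that assumption the eligibility vector for action $a$ is $\mathbf{x}(s)$ in column $a$ minus $\pi(b \vert s)\mathbf{x}(s)$ in every column $b$. The names `softmax_probabilities`, `softmax_eligibility`, and `reinforce_update!` are hypothetical and are used for illustration only; they are not the functions defined elsewhere in this notebook.

```julia
# Hypothetical helper: numerically stable soft-max over action preferences h
function softmax_probabilities(h::AbstractVector{<:Real})
    e = exp.(h .- maximum(h))   # subtract the maximum to avoid overflow
    return e ./ sum(e)
end

# Hypothetical helper: ∇ ln π(a|s, θ) for a linear soft-max policy (assumed form).
# θ is d × N_a, x is the length-d feature vector, a is the index of the action taken.
function softmax_eligibility(θ::AbstractMatrix{<:Real}, x::AbstractVector{<:Real}, a::Integer)
    probs = softmax_probabilities(θ' * x)
    ∇lnπ = -x * probs'          # column b receives -π(b|s) * x
    ∇lnπ[:, a] .+= x            # the column of the selected action also receives +x
    return ∇lnπ
end

# One sampled REINFORCE step: θ ← θ + α γ^t G_t ∇ ln π(A_t|S_t, θ)
function reinforce_update!(θ::AbstractMatrix{<:Real}, x, a, G, α, γ, t)
    θ .+= (α * γ^t * G) .* softmax_eligibility(θ, x, a)
    return θ
end
```

In an episodic run, `reinforce_update!` would be called once per time step after the episode ends, with `G` set to the return observed from step `t` onward.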
"""metadatashow_logsèdisabled®skip_as_script«code_folded$d83dc659-dce7-41dd-a8e7-2933ab39d15ccell_id$d83dc659-dce7-41dd-a8e7-2933ab39d15ccodemd""" ### *REINFORCE with Baseline Implementation* These functions use two sets of parameters, one to calculate the policy function and another to calculate the state value function. The state representation vector is shared between the two functions, but the policy function will return a distribution of preferences over actions while the value function will return a single value. If linear approximation is used to estimate both functions, the the policy parameters $\boldsymbol{\theta}$ will be a $d \times N_a$ matrix where $d$ is the length of the state feature vector representation and the value function parameters $\mathbf{w}$ will be a length $d$ vector. It is also possible to mix linear and non-linear approximation with this method. """metadatashow_logsèdisabled®skip_as_script«code_folded$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fcecell_id$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fcecodendisplay_cartpole_episode(cartpole_fcann_continuing_test_episode[1], cartpole_fcann_continuing_test_episode[2])metadatashow_logsèdisabled®skip_as_script«code_folded$83ca0577-15d7-4448-b597-c77810b812bfcell_id$83ca0577-15d7-4448-b597-c77810b812bfcodefunction figure_13_2_test(α_list, α_pair_list; nruns = 100, num_episodes = 1_000, max_steps = 1_000) Random.seed!(45) function average_runs(α) 1:nruns |> Map(_ -> reinforce_monte_carlo_control_binary_features(corridor_mdp, get_corridor_features, 1, num_episodes, params = [0f0 3.7f0], α = α, max_steps = max_steps).episode_rewards) |> foldxt((a, b) -> a .+ b) |> v -> v ./ nruns end function average_runs(α_θ, α_w) 1:nruns |> Map(_ -> reinforce_with_baseline_monte_carlo_control_binary_features(corridor_mdp, get_corridor_features, 1, num_episodes, policy_params = [0f0 3.7f0], α_θ = α_θ, α_w = α_w, max_steps = max_steps).episode_rewards) |> foldxt((a, b) -> a .+ b) |> v -> v ./ nruns end traces1 = [begin out = 
average_runs(α) scatter(x = 1:num_episodes, y = out, name = "α = 2^$(round(Int64, log2(α)))") end for α in α_list] traces2 = [begin out = average_runs(αs...) scatter(x = 1:num_episodes, y = out, name = "α_θ = 2^$(round(Int64, log2(αs[1]))) and α_w = 2^$(round(Int64, log2(αs[2])))") end for αs in α_pair_list] baselinetrace = scatter(x = 1:num_episodes, y = fill(-2*sqrt(2) / (3*sqrt(2) - 4), num_episodes), name = "ideal value", line_dash = "dash", line_color = "gray") plot([baselinetrace; traces1; traces2], Layout(yaxis_range = [-90, -10], yaxis_title = "Total reward on episode
(averaged over $nruns runs)", xaxis_title = "Episode", width = 800)) endmetadatashow_logsèdisabled®skip_as_script«code_folded$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbcell_id$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbcode/function reinforce_with_baseline_monte_carlo_control_binary_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, max_episodes::Integer; policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} setup = setup_binary_policy_arguments(mdp, get_active_features, num_features) reinforce_with_baseline_monte_carlo_control!(policy_params, setup.eligibility_vector, value_params, BinaryFeatureVector(), mdp, update_binary_action_preferences!, update_binary_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, binary_value_function, update_binary_value_gradient!, max_episodes; action_preferences = setup.action_preferences, kwargs...) 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$a7dcc8cd-04ec-48f2-a387-116330eaffb2cell_id$a7dcc8cd-04ec-48f2-a387-116330eaffb2codegfigure_13_2_test([2f0^-13], vcat([(2f0^n, 2f0^-4) for n in -12:-10], [(2f0^n, 2f0^-2) for n in -8:-6]))metadatashow_logsèdisabled®skip_as_script«code_folded$0ab70fc3-6188-42eb-aba2-d808f319be9fcell_id$0ab70fc3-6188-42eb-aba2-d808f319be9fcodemd""" # Dependencies """metadatashow_logsèdisabled®skip_as_script«code_folded$047656d1-2921-40f2-b75b-ce4a87098007cell_id$047656d1-2921-40f2-b75b-ce4a87098007code1md""" ### Switched Corridor Parameter Studies """metadatashow_logsèdisabled®skip_as_script«code_folded$5d434c83-c9ca-499f-8695-c7733031c2decell_id$5d434c83-c9ca-499f-8695-c7733031c2decodeffunction cartpole_continuing_step(s::CartPoleState, i_a::Integer) s′ = cartpole_functions.step(s, cartpole_functions.discrete_actions[i_a]) if cartpole_functions.failure(s′) s′ = cartpole_functions.initialize_state() s′ = CartPoleState(s′.x, s′.θ, s′.ẋ, s′.θ̇, s.t+cartpole_functions.h) (-1f0, s′) else (0f0, s′) end endmetadatashow_logsèdisabled®skip_as_script«code_folded$3a37b53d-9174-4faa-9404-74a40c385b0acell_id$3a37b53d-9174-4faa-9404-74a40c385b0acodeYshow_mountaincar_trajectory(mountaincar_continuing_fcann_test.policy_sample_action, 1000)metadatashow_logsèdisabled®skip_as_script«code_folded$820752af-8966-4ee8-82f7-a40934522de5cell_id$820752af-8966-4ee8-82f7-a40934522de5codePtest_study2 = actor_critic_fcann_parameter_study(cartpole_continuing_mdp, cartpole_vector_update!, cartpole_fcann_feature_setup.num_features, [4, 4], LinRange(0f0, .95f0, 20), LinRange(0.0f0, .95f0, 20), [0.005f0, 0.01f0, 0.05f0], 2f0 .^ (-8:-2), 2f0 .^ (-8:-2), 100, 100_000; nruns = 40, seed = 45) |> df -> sort(df, :output; rev=true)metadatashow_logsèdisabledîskip_as_script«code_folded$6acb549a-5d90-4457-a347-d22448ad8071cell_id$6acb549a-5d90-4457-a347-d22448ad8071codeق@bind cartpole_fcann_continuing_episode_step_select 
Slider(1:length(cartpole_fcann_continuing_test_episode[1]); show_value = true)metadatashow_logsèdisabled®skip_as_script«code_folded$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62cell_id$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62codecartpole_fcann_continuing_parameter_study(layer_size::Integer, num_layers::Integer, args...; kwargs...) = actor_critic_fcann_parameter_study(cartpole_continuing_mdp, cartpole_vector_update!, cartpole_fcann_feature_setup.num_features, fill(layer_size, num_layers), args...; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728cell_id$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728codeactor_critic_with_eligibility_traces_binary_features(corridor_mdp, 0f0, 0f0, get_corridor_features, 1, typemax(Int64), 100_000, α_θ = 2f0 ^ -4, α_w = 2f0 ^ -10, policy_params = [0f0 3.7f0]).policy_and_value(1)metadatashow_logsèdisabled®skip_as_script«code_folded$ae0f5a96-7a4b-47f9-be1e-e803a238a071cell_id$ae0f5a96-7a4b-47f9-be1e-e803a238a071code@md""" ### *MDP Types and Transitions for Continuous Actions* """metadatashow_logsèdisabled®skip_as_script«code_folded$41d62de1-2c92-41ee-9430-b9ca3007afd9cell_id$41d62de1-2c92-41ee-9430-b9ca3007afd9codeumd""" The above matrix represents an estimate of $\Pr \{ S_k = s \mid \pi \}$; however, note that the terminal states are excluded from the rows. This corridor problem only has three non-terminal states. If we sum across each row, then we have the probability of reaching that step prior to terminating. The vector defined below measures the probability of an episode terminating prior to each step. Notably, this probability is 0 for the first three steps since no policy starting from the left can terminate that quickly. As expected, the probability of terminating under the random policy grows with time, approaching 1. 
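
The row-sum argument above can be checked with a small exact calculation. The sketch below does not use the sampled estimate from this notebook; it propagates the state distribution directly, assuming the switched corridor dynamics of Example 13.1 (three non-terminal states, start in the leftmost state, left and right swapped in the second state, and "right" from the third state terminates). The names `corridor_occupancy` and `termination_probability` are hypothetical.

```julia
# Hypothetical helper: exact Pr{S_k = s} for k = 0, ..., nsteps - 1 under a policy
# that selects "right" with probability p_right in every state.
# Assumed dynamics: switched corridor with non-terminal states 1:3, start in state 1.
function corridor_occupancy(p_right::Real, nsteps::Integer)
    occupancy = zeros(nsteps, 3)       # row k + 1 holds the distribution at step k
    d = [1.0, 0.0, 0.0]                # every episode starts in the leftmost state
    for k in 1:nsteps
        occupancy[k, :] .= d
        d_next = zeros(3)
        d_next[2] += d[1] * p_right          # state 1, right: move to state 2
        d_next[1] += d[1] * (1 - p_right)    # state 1, left: stay in state 1
        d_next[1] += d[2] * p_right          # state 2 is reversed, right: back to state 1
        d_next[3] += d[2] * (1 - p_right)    # state 2 is reversed, left: on to state 3
        d_next[2] += d[3] * (1 - p_right)    # state 3, left: back to state 2
        # state 3, right: the episode terminates, so that probability mass is dropped
        d = d_next
    end
    return occupancy
end

# Probability that the episode has already terminated before each step:
# one minus the row sum of the occupancy matrix
termination_probability(occupancy::AbstractMatrix) = 1 .- vec(sum(occupancy, dims = 2))
```

For the random policy, `termination_probability(corridor_occupancy(0.5, 50))` is zero in its first three entries and then increases toward 1, matching the behaviour described above.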
"""metadatashow_logsèdisabled®skip_as_script«code_folded$8eb42403-1234-4e59-993e-057cc3a6d5c9cell_id$8eb42403-1234-4e59-993e-057cc3a6d5c9codeCif run_mountaincar_binary_episodic_param_study > 0 actor_critic_binary_episodic_parameter_study(MountainCarTask.mdp, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, mountaincar_binary_params, 5, 3, 1000; max_steps = 100_000) else md""" Waiting to run parameter study """ endmetadatashow_logsèdisabled®skip_as_script«code_folded$bbc8864a-1545-433f-bc7c-0ddf6e907138cell_id$bbc8864a-1545-433f-bc7c-0ddf6e907138codefunction plot_mountaincar_policy_values(policy_and_value::Function; n1 = 100, n2 = 100) xvals = LinRange(-1.2f0, 0.5f0, n1) vvals = LinRange(-0.07f0, 0.07f0, n2) values = zeros(Float32, n1, n2) action_dists = [zeros(Float32, n1, n2) for i in 1:3] for (i, x) in enumerate(xvals) for (j, v) in enumerate(vvals) π, v̂ = policy_and_value((x, v)) values[j, i] = v̂ for k in 1:3 action_dists[k][j, i] = π[k] end end end p1 = plot(heatmap(x = xvals, y = vvals, z = values), Layout(xaxis_title = "position", yaxis_title = "velocity", title = "Learned Value Function", height = 400, width = 600)) p2 = [plot(heatmap(x = xvals, y = vvals, z = action_dists[k], colorscale = "rb"), Layout(xaxis_title = "position", yaxis_title = "velocity", title = "Policy Probability for Action $k", height = 400, width = 600)) for k in 1:3] @htl("""
$p1 $p2
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$a12b92d1-e045-4f92-b8cd-eee5d56fa67dcell_id$a12b92d1-e045-4f92-b8cd-eee5d56fa67dcodeٹconst best_mc_corridor = reinforce_with_baseline_monte_carlo_control_linear_features(corridor_mdp, update_corridor_features!, 1, 100; α_θ = 0.006f0, α_w = 2f0^-2, max_steps = 1_000)metadatashow_logsèdisabled®skip_as_script«code_folded$ce33f710-fd9d-4dfa-acda-40204e54d518cell_id$ce33f710-fd9d-4dfa-acda-40204e54d518codemd""" ## 13.5 Actor-Critic Methods Here we also use the value function estimator to calculate the return estimate using the one-step bootstrap return. When the state value function is used in this way we call it the *critic*. In general we can use this function with n-step returns and eligibility traces. Recall from the subject of TD learning of value functions that the one-step return is often superior to the actual return regarding variance and ease of computation, although it does introduce bias to the estimate. With the use of eligibility traces we can smoothly vary arbitrarily close to the Monte Carlo return. Note that the bias in the gradient estimate is not due to bootstrapping as such; the actor would be biased even if the critic was learned by a Monte Carlo method. The one-step actor-critic method is the analog of the one-step methods such as TD$(0)$, Sarsa$(0)$, and Q-learning. 
These methods replace the full return of REINFORCE with the one-step return as follows: $\begin{flalign} \boldsymbol{\theta}_{t+1} &\doteq \boldsymbol{\theta}_t + \alpha(G_{t:t+1} - \hat v(S_t, \mathbf{w}))\nabla \ln \pi(A_t|S_t, \boldsymbol{\theta}_t) \tag{13.12} \\ & = \boldsymbol{\theta}_t + \alpha(R_{t+1} + \gamma \hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w}))\nabla \ln \pi(A_t|S_t, \boldsymbol{\theta}_t) \tag{13.13} \\ & = \boldsymbol{\theta}_t + \alpha \delta_t \nabla \ln \pi(A_t|S_t, \boldsymbol{\theta}_t) \tag{13.14} \\ \end{flalign}$ where $\delta_t \doteq R_{t+1} + \gamma \hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w})$ is the one-step TD error. This can be implemented as a fully online algorithm because we do not have to wait until the end of an episode to calculate return estimates. The natural state-value-function learning method to pair with this is semi-gradient TD(0). See a full implementation below. """metadatashow_logsèdisabled®skip_as_script«code_folded$339b4d2b-2237-46a3-9867-ecc3332856c1cell_id$339b4d2b-2237-46a3-9867-ecc3332856c1code!md""" This expression repeats terms of the form $\nabla \pi(a \vert s) q_\pi(s, a)$ summed over different probabilities. The first appearance of this term is just a sum over all actions at the state $s$ which is the state we are using for the gradient expression. The next appearance of the expression is a sum over actions at state $s^\prime$. 
Let's define a new expression: $\begin{flalign} f(s) &\doteq \sum_a \nabla \pi(a \vert s) q_\pi(s, a) \\ \end{flalign}$ Then we can rewrite the second term as follows: $\gamma \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) f(s^\prime) \right ] = \gamma \sum_{s^\prime} f(s^\prime) \sum_a \left [ \pi(a \vert s) p(s^\prime \vert s, a) \right ] = \gamma \mathbb{E}_\pi [f(s^\prime) \vert s] = \gamma \sum_{s^\prime} f(s ^\prime) \Pr \{ S_1 = s^\prime \mid S_0 = s, A_0 \sim \pi(s) \}$ Define a new term $g(s) = \sum_{s^\prime} f(s^\prime) \Pr \{ S_1 = s^\prime \vert S_0 = s, A_0 \sim \pi(s) \} = \sum_{s^\prime} f(s^\prime) \sum_a [\pi(a \vert s) p(s^\prime \vert s, a)]$ So the second term can be written as $\gamma g(s)$ where the final expression uses the probability that the agent transitions from state $s$ to $s^\prime$ in one step under the policy $\pi$. Using this same logic, we can rewrite the third expression as well. """metadatashow_logsèdisabled®skip_as_script«code_folded$a8349352-3242-46d5-b0d5-1b6eb5d77e90cell_id$a8349352-3242-46d5-b0d5-1b6eb5d77e90code4@bind x Slider(-50:50; default = 0, show_value=true)metadatashow_logsèdisabled®skip_as_script«code_folded$7d63b960-3998-4f7b-8cbb-ccd49db9aeaccell_id$7d63b960-3998-4f7b-8cbb-ccd49db9aeaccodeٻone_step_actor_critic_binary_features(corridor_mdp, get_corridor_features, 1, typemax(Int64), 100_000, α_θ = 2f0 ^ -3, α_w = 2f0 ^ -10, policy_params = [0f0 3.7f0]).policy_and_value(1)metadatashow_logsèdisabled®skip_as_script«code_folded$65d2add6-fd6f-456c-92ed-3cd9d1862ef6cell_id$65d2add6-fd6f-456c-92ed-3cd9d1862ef6codePfunction update_binary_policy_params!(params::Matrix{T}, active_features::BinaryFeatures, i_a::Integer, π_dist::Vector{T}, c::T) where T<:Real @inbounds for i in eachindex(π_dist) for j in active_features params[j, i] -= c*π_dist[i] end end @inbounds for j in active_features params[j, i_a] += c end return params 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$f55afa58-962d-4551-8d95-a5b467d61adfcell_id$f55afa58-962d-4551-8d95-a5b467d61adfcode mbegin function update_params_with_gradient!(θ::Matrix{T}, α::T, ∇θ::BinaryGaussianEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = ∇θ.a - ∇θ.μ c2 = ∇θ.σ^(-2) # isnan(c2) && @info "warning σ of $∇θ.σ is causing nan results" # isinf(c2) && @info "warning σ of $∇θ.σ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) δ1 = α*c3 @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] θ[i, 1] += δ1 end δ2 = α*c4 @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] θ[i, 2] += δ2 end return θ end function update_params_with_gradient!(θ::Matrix{T}, α::T, ∇θ::BinaryBetaEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = digamma(∇θ.α + ∇θ.β) δ1 = α*(log(∇θ.a) + c1 - digamma(∇θ.α)) @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] θ[i, 1] += δ1 end δ2 = α*(log(one(T) - ∇θ.a) + c1 - digamma(∇θ.β)) @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] θ[i, 2] += δ2 end return θ end function update_params_with_gradient!(θ::Matrix{T}, α::T, ∇θ::BinarySquashedGaussianEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = atanh(∇θ.a/∇θ.amax) - ∇θ.μ c2 = ∇θ.σ^(-2) # isnan(c2) && @info "warning σ of $∇θ.σ is causing nan results" # isinf(c2) && @info "warning σ of $∇θ.σ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) δ1 = α*c3 @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] θ[i, 1] += δ1 end δ2 = α*c4 @inbounds @simd for j in 1:∇θ.binary_features.num_features i = ∇θ.binary_features.active_features[j] θ[i, 2] += δ2 end return θ end function update_params_with_gradient!(θ::Matrix{T}, α::T, 
∇θ::BinaryGaussianEligibilityVector{T, NTuple{N, T}, Vector{T}, B}) where {T<:Real, N, B<:BinaryFeatureVector} for k in 1:N c1 = ∇θ.a[k] - ∇θ.μ[k] c2 = ∇θ.σ[k] ^-2 # isnan(c2) && @info "warning σ of $∇θ.σ is causing nan results" # isinf(c2) && @info "warning σ of $∇θ.σ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) δ1 = α*c3 @inbounds @simd for i in 1:size(θ, 1) θ[i, k] += δ1 end δ2 = α*c4 @inbounds @simd for i in 1:size(θ, 1) θ[i, k+N] += δ2 end end return θ end function update_params_with_gradient!(θ::Matrix{T}, α::T, ∇θ::BinaryBetaEligibilityVector{T, NTuple{N, T}, Vector{T}, B}) where {T<:Real, N, B<:BinaryFeatureVector} for k in 1:N c1 = digamma(∇θ.α[k] + ∇θ.β[k]) δ1 = α*(log(∇θ.a[k]) + c1 - digamma(∇θ.α[k])) @inbounds @simd for i in 1:size(θ, 1) θ[i, k] += δ1 end δ2 = α*(log(one(T) - ∇θ.a[k]) + c1 - digamma(∇θ.β[k])) @inbounds @simd for i in 1:size(θ, 1) θ[i, k+N] += δ2 end end return θ end endmetadatashow_logsèdisabled®skip_as_script«code_folded$d9d11d69-bc16-400a-8f46-f9a8ecb8516acell_id$d9d11d69-bc16-400a-8f46-f9a8ecb8516acode,actor_critic_binary_episodic_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, params::@NamedTuple{λ_θ::T, λ_w::T, α_θ_min::Int64, α_w_min::Int64}, num_θ::Integer, num_w::Integer, num_episodes::Integer; kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_binary_episodic_parameter_study(mdp, get_active_features, num_features, params.λ_θ, params.λ_w, 2f0 .^(params.α_θ_min:params.α_θ_min+num_θ-1), 2f0 .^(params.α_w_min:params.α_w_min+num_w-1), num_episodes; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$ed93259c-7b8b-46d7-97fb-f194e0e04b3acell_id$ed93259c-7b8b-46d7-97fb-f194e0e04b3acodefunction setup_binary_beta_policy_arguments(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer) where {T<:Real, S, N, A<:Union{T, NTuple{N, T}}, P, F1, F2, F3} x = BinaryFeatureVector() update_feature_vector!(x::BinaryFeatureVector, s) = update_binary_feature_vector!(x, s, get_active_features) sample_action = rand(A) action_dist_params = make_n_param_dist_params(2, sample_action) ∇lnπ = BinaryBetaEligibilityVector(sample_action) return (feature_vector = x, update_feature_vector! = update_feature_vector!, action_distribution_parameters = action_dist_params, eligibility_vector = ∇lnπ) endmetadatashow_logsèdisabled®skip_as_script«code_folded$d1ed25e6-60c6-411f-a541-99986e5da2c5cell_id$d1ed25e6-60c6-411f-a541-99986e5da2c5codereinforce_with_baseline_monte_carlo_control_linear_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, max_episodes::Integer; policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), value_params::Vector{T} = zeros(T, num_features), x = zeros(T, num_features), action_preferences = zeros(T, length(mdp.actions)), kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = reinforce_with_baseline_monte_carlo_control!(policy_params, copy(policy_params), value_params, copy(value_params), mdp, update_linear_action_preferences!, update_linear_eligibility_vector!, x, update_feature_vector!, linear_value_function, update_linear_value_gradient!, max_episodes; action_preferences = action_preferences, kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$b966b248-fb4d-457d-90f6-114370846242cell_id$b966b248-fb4d-457d-90f6-114370846242codeٱbegin bad_continuous_action(a) = false bad_continuous_action(a::Real) = isnan(a) bad_continuous_action(a::NTuple{N, T}) where {N, T<:Real} = any(bad_continuous_action, a) endmetadatashow_logsèdisabled®skip_as_script«code_folded$4156d955-9daf-4429-b152-e8332980fb9ecell_id$4156d955-9daf-4429-b152-e8332980fb9ecode7const mountaincar_continuous_test_train_beta = actor_critic_with_eligibility_traces_binary_features_beta_actions(mountaincar_continuous_beta_mdp, 0.01f0, 0.99f0, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, typemax(Int64), 100_000; α_θ = 1f-4, α_w = 0.00002f0)metadatashow_logsèdisabled®skip_as_script«code_folded$b09e1e48-494e-4967-826a-6e70199acad4cell_id$b09e1e48-494e-4967-826a-6e70199acad4code+md""" ### Squashed Gaussian Alternative """metadatashow_logsèdisabled®skip_as_script«code_folded$734573e5-547b-4dcc-89bb-412aa6cc42d6cell_id$734573e5-547b-4dcc-89bb-412aa6cc42d6codefunction actor_critic_linear_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, feature_function::Function, num_features::Integer, λ_θ::T, λ_w::T, α_r̄::T, α_θ_list::AbstractVector{T}, α_w_list::AbstractVector{T}, max_steps::Integer; nruns::Integer = 100, seed = rand(UInt64), init_policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), binary_features = false, kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} if binary_features algo = actor_critic_with_eligibility_traces_binary_features title_prefix = "Binary Feature Encoding" else algo = actor_critic_with_eligibility_traces_linear_features title_prefix = "Linear Encoding" end make_trace_data(α_θ_list, α_w) = [average_continuing_runs(nruns, seed, α_θ, α_w, α_r̄, init_policy_params, algo, mdp, λ_θ, λ_w, feature_function, num_features, max_steps; kwargs...) for α_θ in α_θ_list] traces = [begin scatter(x = α_θ_list, y = make_trace_data(α_θ_list, α_w), name = "α_w = $α_w") end for α_w in α_w_list] plot(traces, Layout(xaxis_title = "α_θ", yaxis_title = "Average Reward Per Step in the First
$max_steps Steps Averaged Over $nruns Runs", xaxis_type = "log", title = "$title_prefix with $num_features Features, λ_θ = $λ_θ, λ_w = $λ_w")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54cell_id$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54codefunction actor_critic_with_eligibility_traces_fcann(mdp::StateMDP{T, S, A, P, F1, F2, F3}, λ_θ::T, λ_w::T, input_length::Integer, hidden_layers::Vector{Int64}, update_feature_vector!::Function, args...; policy_params::FCANNParams = FCANN.initializeparams_saxe(input_length, hidden_layers, length(mdp.actions)), reslayers = 0, l2 = 0f0, dropout = 0f0, use_μP = true, activation_list = fill(true, length(hidden_layers)), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} setup = setup_fcann_policy_and_value_arguments(policy_params, input_length, hidden_layers, reslayers, l2, dropout, use_μP, activation_list) actor_critic_with_eligibility_traces!(policy_params, setup.eligibility_vector, setup.value_params, setup.value_gradient, mdp, λ_θ, λ_w, setup.update_action_preferences!, setup.update_eligibility_vector!, setup.feature_vector, update_feature_vector!, setup.value_function, setup.gradient_update, args...; kwargs...) endmetadatashow_logsèdisabled®skip_as_script«code_folded$692c1043-4eaf-491e-b8fe-368618867f99cell_id$692c1043-4eaf-491e-b8fe-368618867f99codemd""" 1. 
The soft-max distribution is: $\pi(a|s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}}$ We only have two possible actions in each state so the policy for action 1 would be given by: $\pi(1|S_t, \theta_t) = \frac{e^{h(S_t, 1, \theta_t)}}{e^{h(S_t, 0, \theta_t)} + e^{h(S_t, 1, \theta_t)}}$ Simplify this expression by dividing the numerator and denominator by $e^{h(S_t, 1, \theta_t)}$ which results in: $\pi(1|S_t, \theta_t) = \frac{1}{e^{h(S_t, 0, \theta_t) - h(S_t, 1, \theta_t)} + 1}$ Given the assumption that $h(s, 1, \theta)-h(s, 0, \theta) = \theta^\top\mathbf{x}(s)$, we replace the expression in the exponent resulting in the final expression of: $\pi(1|S_t, \theta_t) = \frac{1}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1}$ Using the notation $f(x) = 1/(1+e^{-x})$ we can write $\pi(1|S_t, \theta_t) = f(\theta_t^\top \mathbf{x}(S_t))$ where $f$ is the logistic function. Consider this notation for the rest of the exercises. 2. The REINFORCE update is given by: $\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla\pi(A_t|S_t, \theta_t)}{\pi(A_t|S_t, \theta_t)}$, so we need to compute the gradient of the policy in terms of the parameters for this action selection: $\nabla \pi(1|S_t, \theta_t)$. Luckily, the derivative of the logistic function is simply given by: $f^\prime(x) = f(x)(1-f(x))$ where $f(x)$ is the logistic function itself. In our case $x = \theta_t^\top \mathbf{x}_t$ so after applying the chain rule we have: $\nabla\pi(1|S_t, \theta_t) = f(x)(1-f(x))\nabla x = f(x)(1-f(x)) \mathbf{x}_t$ since $x$ is just a linear function of the parameters. 
So for the parameter update step we have: $\frac{\nabla\pi(1|S_t, \theta_t)}{\pi(1|S_t, \theta_t)} = \frac{f(x)(1-f(x))\mathbf{x}_t}{f(x)} = (1 - f(x))\mathbf{x}_t$ Also note that: $1 - f(x) = 1 - \frac{1}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1} = \frac{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1 - 1}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1} = \frac{e^{-\theta_t^\top\mathbf{x}(S_t)}}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1}$ The REINFORCE update when $A_t = 1$ will then be: $\theta_{t+1} = \theta_t + \alpha G_t \left ( \frac{e^{-\theta_t^\top\mathbf{x}(S_t)}}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1} \right ) \mathbf{x}_t$ 3. For the general case, we want to calculate $\frac{\nabla\pi(a|s, \theta)}{\pi(a|s, \theta)}$. We already know the gradient for $a = 1$: $\nabla {\pi(1|s, \mathbf{\theta})} = f(x)(1 - f(x))\mathbf{x}(s) = \pi(1|s, \mathbf{\theta})(1 - \pi(1|s, \mathbf{\theta}))\mathbf{x}(s)$ Since $\pi(a|s, \theta)$ is a probability distribution across actions, we also know that $\pi(0|s, \theta) = 1 - \pi(1|s, \theta)$ which implies that $\nabla \pi(0|s, \theta) = -\nabla \pi(1|s, \theta) = -\pi(1|s, \mathbf{\theta})(1 - \pi(1|s, \mathbf{\theta}))\mathbf{x}(s)$ We can express this in terms of $\pi(0|s, \theta)$ completely: $\nabla \pi(0|s, \theta) = (\pi(0|s, \mathbf{\theta}) - 1)\pi(0|s, \theta)\mathbf{x}(s) = -\pi(0|s, \theta)(1 - \pi(0|s, \mathbf{\theta}))\mathbf{x}(s)$ Let's now compare the two expressions for the policy gradient at each action: $\begin{align} \nabla {\pi(1|s, \mathbf{\theta})} &= \pi(1|s, \mathbf{\theta})(1 - \pi(1|s, \mathbf{\theta}))\mathbf{x}(s) \\ \nabla \pi(0|s, \theta) &= -\pi(0|s, \theta)(1 - \pi(0|s, \mathbf{\theta}))\mathbf{x}(s) \\ \therefore \\ \nabla \pi(a|s, \theta) &= \chi (a) \pi(a|s, \theta)(1 - \pi(a|s, \mathbf{\theta}))\mathbf{x}(s) \\ \end{align}$ where $\chi (a)$ is a function that returns 1 for $a=1$ and -1 for $a=0$. There are many ways to achieve this but the following expression is simple and works: $\chi(a) = 2a - 1$. 
Dividing by the policy yields a unified expression for the eligibility vector: $\nabla \ln{\pi(a|s,\theta)} = (2a - 1) (1 - \pi(a|s, \mathbf{\theta}))\mathbf{x}(s)$ """metadatashow_logsèdisabled®skip_as_script«code_folded$2c5d221a-2469-49e1-9249-dfdc2457f2facell_id$2c5d221a-2469-49e1-9249-dfdc2457f2facode\@bind start_cartpole_continuing_fcann_param_study CounterButton("Run FCANN Parameter Study")metadatashow_logsèdisabled®skip_as_script«code_folded$7c592385-e8d3-4efe-962c-d39debb64405cell_id$7c592385-e8d3-4efe-962c-d39debb64405code~const mountaincar_tilecoding_setup = tile_coding_setup(mountaincar_min_vals, mountaincar_max_vals, (0.1f0, 0.1f0), 12, (1, 3))metadatashow_logsèdisabled®skip_as_script«code_folded$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebcell_id$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebcodeimport HypertextLiteral.@htlmetadatashow_logsèdisabled®skip_as_scriptëcode_folded$8eab55a5-41b7-4f5e-a02f-4c19388bc9eacell_id$8eab55a5-41b7-4f5e-a02f-4c19388bc9eacodeifunction update_binary_feature_vector!(x::BinaryFeatureVector, s::S, get_active_features::Function) where S active_features = get_active_features(s) l = length(x.active_features) n = 0 for (i, f) in enumerate(active_features) if i > l push!(x.active_features, f) else x.active_features[i] = f end n += 1 end x.num_features = n return x endmetadatashow_logsèdisabled®skip_as_script«code_folded$0ac7ea44-14f6-4e80-80f9-d6df8059bb38cell_id$0ac7ea44-14f6-4e80-80f9-d6df8059bb38codefunction reinforce_monte_carlo_control!(policy_params, ∇lnπ, mdp::StateMDP{T, S, A, PTF, F1, F2, F3}, update_action_preferences!::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, max_episodes::Integer; α = one(T)/10, kwargs...) where {T<:Real, S, A, PTF, F1, F2, F3} out = reinforce_with_baseline_monte_carlo_control!(policy_params, ∇lnπ, nothing, nothing, mdp, update_action_preferences!, update_eligibility_vector!, x, update_feature_vector!, Returns(zero(T)), Returns(nothing), max_episodes; α_θ = α, kwargs...) 
return (episode_rewards = out.episode_rewards, episode_steps = out.episode_steps, policy_function = out.policy_function, policy_sample_action = out.policy_sample_action, parameters = out.policy_parameters) endmetadatashow_logsèdisabled®skip_as_script«code_folded$5ffc271f-c73f-494a-9727-8d7516af2191cell_id$5ffc271f-c73f-494a-9727-8d7516af2191code٣@bind cartpole_continuing_fcann_study_params create_actor_critic_continuing_params_UI(;λ_θ= 0.8f0, λ_w = 0.15f0, α_r̄ = 0.05f0, log2α_θ = -6, log2α_w = -5)metadatashow_logsèdisabled®skip_as_script«code_folded$c5a2879c-e89b-47f7-bbd6-48200d7e89e3cell_id$c5a2879c-e89b-47f7-bbd6-48200d7e89e3code^actor_critic_binary_episodic_squashed_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, args...; kwargs...) where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_binary_episodic_squashed_gaussian_parameter_study(mdp, one(T), get_active_features, num_features, args...; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$537270ba-122b-4f2b-880b-31d086766295cell_id$537270ba-122b-4f2b-880b-31d086766295codebegin #the following struct represents a problem for which both the state and action space can take arbitrary values struct ContinuousMDP{T<:Real, S, A, P<:AbstractContinuousTransition{T, S, A, F} where F<:Function, StateInit<:Function, IsTerm<:Function, ValidAction <: Function} <: AbstractMDP{T, S, A, P, StateInit} ptf::P initialize_state::StateInit #function which provides an initial state index isterm::IsTerm #function that returns true if a state is terminal and false otherwise is_valid_action::ValidAction #is_valid_action(s, a) returns true if the action a is valid to take from state. 
by default every action is assumed to be available ContinuousMDP(ptf::P, initialize_state::F1, isterm::F2, is_valid_action::F3) where {T<:Real, S, A, F<:Function, P<:AbstractContinuousTransition{T, S, A, F}, F1<:Function, F2<:Function, F3<:Function} = new{T, S, A, P, F1, F2, F3}(ptf, initialize_state, isterm, is_valid_action) end ContinuousMDP(ptf::AbstractContinuousTransition{T, S, A, F}, initialize_state::StateInit; isterm::Function = Returns(false), is_valid_action::Function = Returns(true)) where {T<:Real, S, A, F<:Function, StateInit<:Function} = ContinuousMDP(ptf, initialize_state, isterm, is_valid_action) function ContinuousMDP(step::Function, initialize_state::Function, a::A; kwargs...) where A s0 = initialize_state() ptf = ContinuousMDPTransitionSampler(step, s0, a) ContinuousMDP(ptf, initialize_state; kwargs...) end endmetadatashow_logsèdisabled®skip_as_script«code_folded$dc2efc6c-8da8-425b-aa5f-290949109565cell_id$dc2efc6c-8da8-425b-aa5f-290949109565codeGplot_mountaincar_policy_values(mountaincar_test_train.policy_and_value)metadatashow_logsèdisabled®skip_as_script«code_folded$a019925a-460a-410e-a54b-50a4cfe0e90ecell_id$a019925a-460a-410e-a54b-50a4cfe0e90ecodeplot(scatter(x = 1 .- LinRange(0.01, 0.99, 100), y = -[get_corridor_episode_stats(p) for p in 1 .- LinRange(0.01, 0.99, 100)]), Layout(xaxis_title = "probability of right action", yaxis_title = "sample mean value of starting state", width = 800, yaxis_range = [-60, -10]))metadatashow_logsèdisabled®skip_as_script«code_folded$f92bb265-4b19-4f0e-a698-d7547bb6dd41cell_id$f92bb265-4b19-4f0e-a698-d7547bb6dd41code٧mutable struct BinaryFeatureVector{I <: Integer} active_features::Vector{I} num_features::I function BinaryFeatureVector() new{Int64}(Vector{Int64}(), 0) end endmetadatashow_logsèdisabled®skip_as_script«code_folded$ac9c8845-284d-4c21-b05d-d930f86598a3cell_id$ac9c8845-284d-4c21-b05d-d930f86598a3codeb@bind run_mountaincar_binary_episodic_countinuous_param_study CounterButton("Run Parameter 
Study")metadatashow_logsèdisabled®skip_as_script«code_folded$192cc1cf-9ea1-492d-baa7-f2e197abecd4cell_id$192cc1cf-9ea1-492d-baa7-f2e197abecd4codeV@bind run_mountaincar_binary_episodic_param_study CounterButton("Run Parameter Study")metadatashow_logsèdisabled®skip_as_script«code_folded$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547dcell_id$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547dcode6@bind ep_step Slider(1:length(ep[1]), show_value=true)metadatashow_logsèdisabled®skip_as_script«code_folded$c8b47eac-2d45-419a-bec6-2ae0cdc59393cell_id$c8b47eac-2d45-419a-bec6-2ae0cdc59393codehbegin #represents a transition where the state must be referenced directly instead of through a tabular index abstract type AbstractContinuousTransition{T<:Real, S, A, F<:Function} <: AbstractTransition{T, 2} end struct ContinuousMDPTransitionSampler{T <: Real, S, A, F <: Function} <: AbstractContinuousTransition{T, S, A, F} step::F function ContinuousMDPTransitionSampler(step::F, s::S, a::A) where {F<:Function, S, A} (r, s′) = step(s, a) @assert promote_type(S, typeof(s′)) != Any "There is no common type between the provided state $s and the transition state $s′" new{typeof(r), promote_type(S, typeof(s′)), A, F}(step) end end #when used as a functor just apply the step function to the state action pair indices (ptf::ContinuousMDPTransitionSampler{T, S, A, F})(s::S, a::A) where {T<:Real, S, A, F<:Function} = ptf.step(s, a) endmetadatashow_logsèdisabled®skip_as_script«code_folded$36a6e43f-6bcf-4c27-bfbb-047760e77adacell_id$36a6e43f-6bcf-4c27-bfbb-047760e77adacode md""" # Chapter 13 Policy Gradient Methods Introduction Instead of selecting actions based on *action-value estimates*, we learn a *parameterized policy* with parameters $\boldsymbol{θ}$. $\pi(a|s, \boldsymbol{\theta}) = \text{Pr}\{A_t=a|S_t=s, \boldsymbol{\theta}_t=\boldsymbol{\theta}\}$ denotes the probability that action *a* is taken at time *t* given that the environment is in state *s* at time *t* with parameter $\boldsymbol{θ}$. 
We consider methods that improve the policy parameter using the gradient of some scalar performance measure $J(\boldsymbol{\theta})$ with respect to the policy parameters. Because we are trying to maximize this value, we follow gradient ascent; methods that use this approach are called *policy gradient methods*. Methods that learn approximations to both policy and value functions are often called *actor-critic methods*, where 'actor' is a reference to the learned policy, and 'critic' refers to the learned value function, usually a state-value function. ## 13.1 Policy Approximation and its Advantages """metadatashow_logsèdisabled®skip_as_script«code_folded$436c52d2-280b-4ca4-9360-d6587b8254c7cell_id$436c52d2-280b-4ca4-9360-d6587b8254c7code~md""" To test this algorithm we need a continuing task, that is, one with no terminal state. We could simply modify the corridor MDP to be a continuing task by altering the reward structure so that a reward of 1 is received upon moving to the right from state 3, after which the state is reset to 1. See below for a version of this MDP updated to be a continuing problem. """metadatashow_logsèdisabled®skip_as_script«code_folded$e96d592d-1e54-486d-8ad9-b857f85476e8cell_id$e96d592d-1e54-486d-8ad9-b857f85476e8code.actor_critic_linear_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, params::@NamedTuple{λ_θ::T, λ_w::T, α_r̄::T, α_θ_min::Int64, α_w_min::Int64}, num_θ::Integer, num_w::Integer, max_steps::Integer; kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_linear_parameter_study(mdp, get_active_features, num_features, params.λ_θ, params.λ_w, params.α_r̄, 2f0 .^(params.α_θ_min:params.α_θ_min+num_θ-1), 2f0 .^(params.α_w_min:params.α_w_min+num_w-1), max_steps; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6ccell_id$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6ccodeLcorridor_parameter_studies(2f0 .^ (-15:-8), 2f0 .^ (-35:5:-15); nruns = 100)metadatashow_logsèdisabled®skip_as_script«code_folded$4da20fd7-b897-4f26-bf2a-f08d66ddf90fcell_id$4da20fd7-b897-4f26-bf2a-f08d66ddf90fcode b#version of reinforce for general function approximation function actor_critic_with_eligibility_traces!(policy_params::P1, ∇lnπ, value_params::P2, ∇v̂, mdp::ContinuousMDP{T, S, A, PTF, F1, F2, F3}, λ_θ::T, λ_w::T, update_action_distribution!::Function, action_dist_params::Vector{T}, action_sampler::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, value_function::Function, update_value_gradient!::Function, max_steps::Integer; α_w::T = one(T)/10, α_θ::T = one(T)/10, α_r̄ = one(T)/10, z_θ::P1 = deepcopy(policy_params), z_w::P2 = deepcopy(value_params), save_step_rewards = false) where {P1, P2, T<:Real, S, A, PTF, F1, F2, F3} step_rewards = Vector{T}() #initialize variables step = 1 rtot = zero(T) r̄ = zero(T) c = one(T) zero_params!(z_θ) zero_params!(z_w) s = mdp.initialize_state() update_feature_vector!(x, s) while step <= max_steps update_value_gradient!(∇v̂, x, value_params) v̂ = value_function(x, value_params) update_action_distribution!(action_dist_params, x, policy_params) a = action_sampler(action_dist_params) if bad_continuous_action(a) @info "terminating after $step steps due to invalid continuous action $a taken in state $s with action distribution parameters $action_dist_params" push!(episode_steps, max_steps) push!(episode_rewards, typemin(T)) break end update_eligibility_vector!(∇lnπ, action_dist_params, x, a, 
policy_params) (r, s′) = mdp.ptf(s, a) rtot += r save_step_rewards && push!(step_rewards, r) step += 1 mdp.isterm(s′) && error("$s′ is a terminal state and this method only applies to continuing tasks") update_feature_vector!(x, s′) v̂′ = value_function(x, value_params) δ = r - r̄ + v̂′ - v̂ r̄ += α_r̄*δ update_traces_with_gradient!(γ*λ_w, z_w, ∇v̂) update_traces_with_gradient!(γ*λ_θ, z_θ, c, ∇lnπ) update_params_with_gradient!(value_params, α_w*δ, z_w) update_params_with_gradient!(policy_params, α_θ*c*δ, z_θ) s = s′ end function_outputs = form_state_and_policy_function_outputs(update_feature_vector!, update_action_distribution!, action_dist_params, action_sampler, value_function, x, policy_params, value_params) return (;step_rewards = step_rewards, total_reward = rtot, policy_parameters = policy_params, value_parameters = value_params, function_outputs...) endmetadatashow_logsèdisabled®skip_as_script«code_folded$11ea640c-3981-404d-87c6-4d3d0708a2b8cell_id$11ea640c-3981-404d-87c6-4d3d0708a2b8codefunction actor_critic_linear_episodic_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, λ_θ::T, λ_w::T, α_θ_list::AbstractVector{T}, α_w_list::AbstractVector{T}, max_episodes::Integer; nruns = 100, max_steps::Integer = 10_000, seed = rand(UInt64), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) function average_runs(α_θ, α_w) 1:nruns |> Map(_ -> actor_critic_with_eligibility_traces_linear_features(mdp, λ_θ, λ_w, update_feature_vector!, num_features, max_episodes, max_steps; α_θ = α_θ, α_w = α_w, kwargs...) |> x -> isempty(x.episode_rewards) ? -T(Inf) : sum(x.episode_rewards) / length(x.episode_rewards)) |> foldxt(+) |> x -> x / nruns end traces = [begin scatter(x = α_θ_list, y = average_runs.(α_θ_list, α_w), name = "α_w = $α_w") end for α_w in α_w_list] plot(traces, Layout(xaxis_title = "α_θ", yaxis_title = "Average Reward Per Episode in the First
$max_episodes Episodes Averaged Over $nruns Runs", xaxis_type = "log2", title = "Linear Feature Encoding with $num_features Features, λ_θ = $λ_θ, λ_w = $λ_w")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$281360af-46bf-4c73-bf11-3cb1153ad3e2cell_id$281360af-46bf-4c73-bf11-3cb1153ad3e2codeUcartpole_tilecoding_reinforce_parameter_study(2f0 .^ (-12:-7), 2f0 .^ (-13:-10), 100)metadatashow_logsèdisabledîskip_as_script«code_folded$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690ccell_id$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690ccode%begin function update_squashed_gaussian_eligibility_vector!(∇lnπ::BinarySquashedGaussianEligibilityVector{T, T, T, B}, dist_params::Vector{T}, x::B, action::T, policy_params::Matrix{T}) where {T<:Real, B<:BinaryFeatureVector} ∇lnπ.binary_features = x ∇lnπ.a = action ∇lnπ.μ = first(dist_params) ∇lnπ.σ = exp(last(dist_params)) return ∇lnπ end function update_squashed_gaussian_eligibility_vector!(∇lnπ::BinarySquashedGaussianEligibilityVector{T, NTuple{N, T}, Vector{T}, B}, dist_params::Vector{T}, x::B, action::NTuple{N, T}, policy_params::Matrix{T}) where {T<:Real, N, B<:BinaryFeatureVector} ∇lnπ.binary_features = x ∇lnπ.a = action #the first N distribution parameters are the means and the last N are the log standard deviations for k in 1:N ∇lnπ.μ[k] = dist_params[k] ∇lnπ.σ[k] = exp(dist_params[k+N]) end return ∇lnπ end endmetadatashow_logsèdisabled®skip_as_script«code_folded$da3cb392-78f2-48b2-b0dc-5f016664798ccell_id$da3cb392-78f2-48b2-b0dc-5f016664798ccodeXshow_mountaincar_trajectory(mountaincar_continuing_tile_test.policy_sample_action, 1000)metadatashow_logsèdisabled®skip_as_script«code_folded$dca2f8e2-76af-4679-bf81-3824c15fc76dcell_id$dca2f8e2-76af-4679-bf81-3824c15fc76dcode const reinforce_test3 = actor_critic_with_eligibility_traces_binary_features(cartpole_setup.mdps.episodic.discrete, 0.85f0, 0.5f0, cartpole_setup.get_active_features, cartpole_setup.num_features, typemax(Int64), 100_000; α_θ = 2f0 ^-6, α_w = 2f0 ^-4, γ = 
0.99f0)metadatashow_logsèdisabled®skip_as_script«code_folded$8019bec9-1228-407b-9199-2fe29f26a981cell_id$8019bec9-1228-407b-9199-2fe29f26a981code md""" > ### *Exercise 13.1* > Use your knowledge of the gridworld and its dynamics to determine an *exact* symbolic expression for the optimal probability of selecting the right action in Example 13.1. Example 13.1 is a gridworld with 3 non-terminal states and a terminal state at the far right. The reward is -1 per step. States 1 and 3 have actions left/right that move in the expected directions but state 2 reverses the directions. We use a performance measure $J(\mathbf{\theta}) = v_{\pi_\theta}(S)$. Given our feature representations of $\mathbf{x}(s, \text{right}) = [1, 0]^{\top}$ and $\mathbf{x}(s, \text{left}) = [0, 1]^{\top}$, we can only learn policies that are stochastic in terms of left/right action selection but do not vary between states. Also observe that due to probability constraints $p_{\text{right}} = 1 - p_{\text{left}}$. For simplicity, we will use the notation $p \doteq p_{\text{left}}$ and the following for the three state values: $v_1, v_2, v_3$. 
$\begin{flalign} v_1 &= p \times v_1 + (1-p) \times v_2 - 1 \tag{1} \\ v_1 (1-p) &= v_2 (1-p) - 1 \\ v_1 &= v_2 - \frac{1}{1-p} \tag{1′}\\ v_2 &= p \times v_3 + (1-p) \times v_1 - 1 \tag{2} \\ v_3 &= p \times v_2 - 1 \tag{3}\\ v_2 &= p \times [p\times v_2 - 1] +(1-p) \times v_1 - 1 \tag{substituting 3 into 2} \\ v_2(1 - p^2) &= -p +(1-p) \times v_1 - 1 \\ v_2 &= \frac{(1-p) v_1 - (1+p)}{(1+p)(1-p)} \tag{collecting terms} \\ &= \frac{(1-p) v_2 - 1 - (1+p)}{(1+p)(1-p)} \tag{using 1′} \\ &= \frac{v_2}{1+p} - \frac{2 + p}{(1+p)(1-p)} \\ v_2 \left [1 - \frac{1}{1+p} \right ] &= - \frac{2 + p}{(1+p)(1-p)} \\ v_2 \frac{1+p-1}{1+p} &= - \frac{2 + p}{(1+p)(1-p)} \\ v_2 &= - \frac{2 + p}{(1-p)p} \\ v_1 &= - \frac{2 + p}{(1-p)p} - \frac{1}{1-p} \\ &= \frac{-2 - p - p}{(1-p)p} \\ &= -\frac{2 + 2p}{(1-p)p} \\ v_3 &= -\frac{2 + p}{1-p} - 1\\ &= \frac{-2 - p - 1 + p}{1-p}\\ &= -\frac{3}{1-p}\\ \end{flalign}$ To summarize all the state values: $\begin{flalign} v_1 &= -\frac{2 + 2p}{(1-p)p} \\ v_2 &= - \frac{2 + p}{(1-p)p} \\ v_3 &= -\frac{3}{1-p} \end{flalign}$ """metadatashow_logsèdisabled®skip_as_script«code_folded$fd964539-2baf-4ff1-b286-5a0bb1b222c4cell_id$fd964539-2baf-4ff1-b286-5a0bb1b222c4codemd""" The beta distribution has two parameters like the normal distribution but is only defined from 0 to 1. The two parameters $\alpha$ and $\beta$ are positive real numbers and control the shape of the distribution. The density function is given below: $f(x; \alpha, \beta) = \frac{x^{\alpha-1} (1-x)^{\beta - 1}}{\text{B}(\alpha, \beta)}$ where $\text{B}(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}$ and $\Gamma(z) = \int_0^\infty t^{z-1}e^{-t} \text{d} t$ We saw earlier from the treatment of the gaussian distribution that we need to find the gradient of a function of each distribution parameter with respect to the parameters of the function approximation. 
Luckily, the derivation of the maximum likelihood estimator for this distribution already provides the gradient we are interested in. Note that the log-likelihood function for a single sample $X$ of a random variable which follows the beta distribution is given by $\mathcal{L}(\alpha, \beta \vert X) = \ln(f(X; \alpha, \beta))$ and the partial derivative of this function with respect to each parameter $\alpha$ and $\beta$ is given by: $\frac{\partial \mathcal{L}(\alpha, \beta \vert X)}{\partial \alpha} = \ln X - \frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \alpha}$ $\frac{\partial \mathcal{L}(\alpha, \beta \vert X)}{\partial \beta} = \ln (1-X) - \frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \beta}$ where $\frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \alpha} = -\psi(\alpha + \beta) + \psi(\alpha)$ and $\frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \beta} = -\psi(\alpha + \beta) + \psi(\beta)$ and $\psi(\alpha)$ is the digamma function, which is the derivative of the logarithm of the gamma function. Since both $\alpha$ and $\beta$ must be greater than zero, we can estimate each one by applying the exponential function to a dot product of a parameter vector with the feature vector: $\alpha(s, \boldsymbol{\theta}) \doteq \exp \left (\boldsymbol{\theta}_\alpha^\top \mathbf{x}(s) \right )$ and $\beta(s, \boldsymbol{\theta}) \doteq \exp \left (\boldsymbol{\theta}_\beta^\top \mathbf{x}(s) \right )$. The eligibility vector for this distribution is then: $\nabla \ln f(a \vert \alpha(s, \boldsymbol{\theta}_\alpha), \beta(s, \boldsymbol{\theta}_\beta))$ where $\alpha$ is a function of its own parameter vector $\boldsymbol{\theta}_\alpha$ and $\beta$ is a function of the other parameter vector $\boldsymbol{\theta}_\beta$. The gradient components corresponding to each parameter vector depend only on the partial derivative of the log density with respect to the matching distribution parameter, $\alpha$ or $\beta$. 
That is, since $\frac{\partial \alpha}{\partial \theta_{\beta_i}} = 0 \ \forall i$ and vice versa, we can treat each part of the gradient separately. $\begin{flalign} \nabla_{\boldsymbol{\theta}_\alpha} \ln f(a \vert \alpha, \beta) &= \frac{\partial \ln f(a \vert \alpha, \beta)}{\partial \alpha} \nabla_{\boldsymbol{\theta}_\alpha}\alpha \\ &= \left ( \ln a + \psi(\alpha + \beta) - \psi(\alpha) \right ) \nabla_{\boldsymbol{\theta}_\alpha} \alpha \\ &= \left ( \ln a + \psi(\alpha + \beta) - \psi(\alpha) \right ) \nabla_{\boldsymbol{\theta}_\alpha} \exp \left ( \boldsymbol{\theta}_\alpha^\top \mathbf{x}(s) \right ) \\ &= \left ( \ln a + \psi(\alpha + \beta) - \psi(\alpha) \right ) \alpha \mathbf{x}(s)\\ \end{flalign}$ $\begin{flalign} \nabla_{\boldsymbol{\theta}_\beta} \ln f(a \vert \alpha, \beta) &= \frac{\partial \ln f(a \vert \alpha, \beta)}{\partial \beta} \nabla_{\boldsymbol{\theta}_\beta}\beta \\ &= \left ( \ln (1-a) + \psi(\alpha + \beta) - \psi(\beta) \right ) \nabla_{\boldsymbol{\theta}_\beta} \beta \\ &= \left ( \ln (1-a) + \psi(\alpha + \beta) - \psi(\beta) \right ) \nabla_{\boldsymbol{\theta}_\beta} \exp \left ( \boldsymbol{\theta}_\beta^\top \mathbf{x}(s) \right ) \\ &= \left ( \ln (1-a) + \psi(\alpha + \beta) - \psi(\beta) \right ) \beta \mathbf{x}(s)\\ \end{flalign}$ """metadatashow_logsèdisabled®skip_as_script«code_folded$5720e942-d3f8-4329-83a8-8bcedf078b6acell_id$5720e942-d3f8-4329-83a8-8bcedf078b6acodeٔreinforce_monte_carlo_control_linear_features(corridor_mdp, update_corridor_features!, 1, 1_000; α = 2f0^-14, max_steps = 1_000).policy_function(1)metadatashow_logsèdisabled®skip_as_script«code_folded$62e677ac-2070-4f6b-9df2-90849d89fa9fcell_id$62e677ac-2070-4f6b-9df2-90849d89fa9fcodeQconst corridor_terminal_probabilities = 1 .- sum(corridor_state_counts, dims = 2)metadatashow_logsèdisabled®skip_as_script«code_folded$11b9beea-b0cd-45eb-84c6-151728894df0cell_id$11b9beea-b0cd-45eb-84c6-151728894df0codefunction 
form_state_and_policy_function_outputs(update_feature_vector!::Function, update_action_distribution!::Function, action_dist_params::Vector{T}, action_sampler::Function, value_function::Function, feature_vector, policy_params, value_params) where T<:Real π! = form_state_continuous_policy_function(update_feature_vector!, update_action_distribution!) π(s) = π!(feature_vector, action_dist_params, s, policy_params) π_sample(s) = action_sampler(π(s)) v! = form_state_value_function(update_feature_vector!, value_function) estimate_state_value(s; x = deepcopy(feature_vector)) = v!(x, s, value_params) function policy_and_value(s; x = deepcopy(feature_vector), action_dist_params = copy(action_dist_params)) π!(x, action_dist_params, s, policy_params) v̂ = value_function(x, value_params) return (action_distribution_parameters = action_dist_params, state_value_estimate = v̂) end (policy_function = π, policy_sample_action = π_sample, estimate_state_value = estimate_state_value, policy_and_value = policy_and_value) endmetadatashow_logsèdisabled®skip_as_script«code_folded$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290cell_id$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290codexfunction reinforce_monte_carlo_control_binary_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, max_episodes::Integer; params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} setup = setup_binary_policy_arguments(mdp, get_active_features, num_features) reinforce_monte_carlo_control!(params, setup.eligibility_vector, mdp, update_binary_action_preferences!, update_binary_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, max_episodes; action_preferences = setup.action_preferences, kwargs...) 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$55ba8725-0ddf-4196-a41d-3f3c490a8d84cell_id$55ba8725-0ddf-4196-a41d-3f3c490a8d84codeofunction actor_critic_binary_episodic_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, λ_θ::T, λ_w::T, α_θ_list::AbstractVector{T}, α_w_list::AbstractVector{T}, max_episodes::Integer; nruns::Integer = 100, max_steps::Integer = 10_000, seed = rand(UInt64), init_policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), init_value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) function average_runs(α_θ, α_w) 1:nruns |> Map(_ -> actor_critic_with_eligibility_traces_binary_features_gaussian_actions(mdp, λ_θ, λ_w, get_active_features, num_features, max_episodes, max_steps; α_θ = α_θ, α_w = α_w, policy_params = copy(init_policy_params), value_params = copy(init_value_params), kwargs...) |> x -> isempty(x.episode_rewards) ? -T(Inf) : mean(x.episode_rewards)) |> foldxt(+) |> x -> x / nruns end traces = [begin scatter(x = α_θ_list, y = average_runs.(α_θ_list, α_w), name = "α_w = $α_w") end for α_w in α_w_list] plot(traces, Layout(xaxis_title = "α_θ", yaxis_title = "Average Reward Per Episode in the First
$max_episodes Episodes Averaged Over $nruns Runs", xaxis_type = "log", title = "Binary Feature Encoding with $num_features Features, λ_θ = $λ_θ, λ_w = $λ_w")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$a540814a-57a1-4b98-9443-59e401425444cell_id$a540814a-57a1-4b98-9443-59e401425444codefunction binary_value_function(binary_features::BinaryFeatureVector, params::Vector{T})::T where T<:Real v = zero(T) @inbounds @simd for i in 1:binary_features.num_features j = binary_features.active_features[i] v += params[j] end return v endmetadatashow_logsèdisabled®skip_as_script«code_folded$1b102220-6d78-480d-a77f-0e57bad23dcacell_id$1b102220-6d78-480d-a77f-0e57bad23dcacode!cartpole_binary_continuing_parameter_study(args...; kwargs...) = actor_critic_linear_parameter_study(cartpole_continuing_mdp, s -> cartpole_tilecoding_setup.get_active_features((s.x, s.θ, s.ẋ, s.θ̇)), cartpole_tilecoding_setup.num_features, binary_features = true, args...; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$4d4ae57b-afc3-44f9-b6fc-892f59f82921cell_id$4d4ae57b-afc3-44f9-b6fc-892f59f82921code #version of reinforce for general function approximation function one_step_actor_critic!(policy_params, ∇lnπ, value_params, ∇v̂, mdp::StateMDP{T, S, A, PTF, F1, F2, F3}, update_action_preferences!::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, value_function::Function, update_value_gradient!::Function, max_episodes::Integer, max_steps::Integer; α_w::T = one(T)/10, α_θ::T = one(T)/10, γ::T = one(T), action_preferences = zeros(T, length(mdp.actions)), save_episode_steps = false) where {T<:Real, S, A, PTF, F1, F2, F3} step_rewards = Vector{T}() episode_steps = Vector{Int64}() episode_rewards = Vector{T}() #initialize variables ep = 1 step = 1 rtot = zero(T) c = one(T) s = mdp.initialize_state() update_feature_vector!(x, s) # @info "initial value params: $value_params" while (ep <= max_episodes) && (step <= max_steps) update_value_gradient!(∇v̂, 
x, value_params) v̂ = value_function(x, value_params) update_action_preferences!(action_preferences, x, policy_params) soft_max!(action_preferences) i_a = sample_action(action_preferences) update_eligibility_vector!(∇lnπ, action_preferences, x, i_a, policy_params) (r, s′) = mdp.ptf(s, i_a) rtot += r save_episode_steps && push!(step_rewards, r) step += 1 if mdp.isterm(s′) push!(episode_steps, step) push!(episode_rewards, rtot) v̂′ = zero(T) ep += 1 rtot = zero(T) c = one(T) s = mdp.initialize_state() update_feature_vector!(x, s) else update_feature_vector!(x, s′) v̂′ = value_function(x, value_params) s = s′ c *= γ end δ = r + γ*v̂′ - v̂ # @info "About to update value params with gradient $∇v̂ and constant $(α_w * δ)" update_params_with_gradient!(value_params, α_w*δ, ∇v̂) # @info "About to update policy params with eligibility vector $∇lnπ and constant $(α_θ*c*δ)" update_params_with_gradient!(policy_params, α_θ*c*δ, ∇lnπ) # @info "policy params after $step updates: $policy_params" # @info "value params after $step updates: $value_params" end function_outputs = form_state_and_policy_function_outputs(update_feature_vector!, update_action_preferences!, value_function, x, action_preferences, policy_params, value_params) return (;step_rewards = step_rewards, episode_steps = episode_steps, episode_rewards = episode_rewards, policy_parameters = policy_params, value_parameters = value_params, function_outputs...) 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$61949faa-8174-4b7b-8fbc-01d5f850b419cell_id$61949faa-8174-4b7b-8fbc-01d5f850b419code:function actor_critic_binary_continuing_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, λ_θ::T, λ_w::T, α_θ_list::AbstractVector{T}, α_w_list::AbstractVector{T}, α_r̄::T, max_steps::Integer; nruns::Integer = 100, seed = rand(UInt64), init_policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), init_value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) function average_runs(α_θ, α_w) 1:nruns |> Map(_ -> actor_critic_with_eligibility_traces_binary_features_gaussian_actions(mdp, λ_θ, λ_w, get_active_features, num_features, max_steps; α_θ = α_θ, α_w = α_w, α_r̄ = α_r̄, policy_params = copy(init_policy_params), value_params = copy(init_value_params), kwargs...).total_reward) |> foldxt(+) |> x -> x / nruns / max_steps end traces = [begin scatter(x = α_θ_list, y = average_runs.(α_θ_list, α_w), name = "α_w = $α_w") end for α_w in α_w_list] plot(traces, Layout(xaxis_title = "α_θ", yaxis_title = "Average Reward Per Step in the First
$max_steps Steps Averaged Over $nruns Runs", xaxis_type = "log", title = "Binary Feature Encoding with $num_features Features, λ_θ = $λ_θ, λ_w = $λ_w, α_r̄ = $α_r̄")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$5b15f5c9-80bf-47f0-898a-f8dead5b927ccell_id$5b15f5c9-80bf-47f0-898a-f8dead5b927ccodemd""" ### *Continuing Case Actor-Critic Implementation* Note that this function has the same name as the episodic version. The only difference other than keyword arguments is that the `max_episodes` argument is missing. Since we already defined the versions of the algorithm for linear and non-linear cases in a generic manner, we only need to define the core version of this algorithm and the other functions will dispatch to it if they are called without the `max_episodes` argument. """metadatashow_logsèdisabled®skip_as_script«code_folded$266d2234-26c8-43f1-9e75-49440a230ed6cell_id$266d2234-26c8-43f1-9e75-49440a230ed6code #version of reinforce for general function approximation function actor_critic_with_eligibility_traces!(policy_params::P1, ∇lnπ, value_params::P2, ∇v̂, mdp::StateMDP{T, S, A, PTF, F1, F2, F3}, λ_θ::T, λ_w::T, update_action_preferences!::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, value_function::Function, update_value_gradient!::Function, max_episodes::Integer, max_steps::Integer; α_w::T = one(T)/10, α_θ::T = one(T)/10, γ::T = one(T), action_preferences = zeros(T, length(mdp.actions)), z_θ::P1 = deepcopy(policy_params), z_w::P2 = deepcopy(value_params), save_step_rewards = false) where {P1, P2, T<:Real, S, A, PTF, F1, F2, F3} step_rewards = Vector{T}() episode_steps = Vector{Int64}() episode_rewards = Vector{T}() #initialize variables ep = 1 step = 1 rtot = zero(T) c = one(T) zero_params!(z_θ) zero_params!(z_w) s = mdp.initialize_state() update_feature_vector!(x, s) while (ep <= max_episodes) && (step <= max_steps) update_value_gradient!(∇v̂, x, value_params) v̂ = value_function(x, value_params) 
update_action_preferences!(action_preferences, x, policy_params) soft_max!(action_preferences) i_a = sample_action(action_preferences) update_eligibility_vector!(∇lnπ, action_preferences, x, i_a, policy_params) (r, s′) = mdp.ptf(s, i_a) rtot += r save_step_rewards && push!(step_rewards, r) step += 1 if mdp.isterm(s′) push!(episode_steps, step) push!(episode_rewards, rtot) v̂′ = zero(T) rtot = zero(T) zero_params!(z_θ) zero_params!(z_w) ep += 1 c = one(T) s = mdp.initialize_state() update_feature_vector!(x, s) else update_feature_vector!(x, s′) v̂′ = value_function(x, value_params) s = s′ c *= γ end δ = r + γ*v̂′ - v̂ update_traces_with_gradient!(γ*λ_w, z_w, ∇v̂) update_traces_with_gradient!(γ*λ_θ, z_θ, c, ∇lnπ) update_params_with_gradient!(value_params, α_w*δ, z_w) update_params_with_gradient!(policy_params, α_θ*c*δ, z_θ) end function_outputs = form_state_and_policy_function_outputs(update_feature_vector!, update_action_preferences!, value_function, x, action_preferences, policy_params, value_params) return (;step_rewards = step_rewards, episode_steps = episode_steps, episode_rewards = episode_rewards, policy_parameters = policy_params, value_parameters = value_params, function_outputs...) 
endmetadatashow_logsèdisabled®skip_as_script«code_folded$aa69e4ea-91e0-496a-a7be-529e67f4dbeccell_id$aa69e4ea-91e0-496a-a7be-529e67f4dbeccodeٴreinforce_with_baseline_monte_carlo_control_fcann(corridor_mdp, 1, [10, 10], update_corridor_features!, 100; α_θ = 2f0^-14, α_w = 2f0^-14, max_steps = 10_000).policy_function(1)metadatashow_logsèdisabled®skip_as_script«code_folded$10ee7709-0816-48d2-abe0-9be3dd04700fcell_id$10ee7709-0816-48d2-abe0-9be3dd04700fcodeLplot_continuing_step_rewards(mountaincar_continuing_fcann_test.step_rewards)metadatashow_logsèdisabled®skip_as_script«code_folded$7d94922e-dc9f-4953-b539-24aaa2c85b12cell_id$7d94922e-dc9f-4953-b539-24aaa2c85b12code٘@bind continuing_study_params create_actor_critic_continuing_params_UI(;λ_θ = 0.75f0, λ_w = 0.25f0, log2α_θ = -6, log2α_w = -10, α_r̄ = 0.005f0)metadatashow_logsèdisabled®skip_as_script«code_folded$df7f84e8-b42a-4001-9dbf-6bc3ced94207cell_id$df7f84e8-b42a-4001-9dbf-6bc3ced94207codeَusing PlutoDevMacros, Random, Statistics, LinearAlgebra, Transducers, Base.Threads, Random, Distributions, Statistics, StatsBase, StaticArraysmetadatashow_logsèdisabled®skip_as_script«code_folded$352d2952-cb83-47d3-9078-2b2ef9927443cell_id$352d2952-cb83-47d3-9078-2b2ef9927443code#create a cart pole MDP environment function create_cartpole_functions(; m::T = 1f0, #mass at the end of the pole in kg m_c::T = 10f0, #mass of the cart in kg l::T = 1f0, #length of the pole in meters g::T = 9.8f0, #gravitational constant in meters per second squared h::T = 4f-2, #step size parameter of simulation in seconds k::T = 1f0, #inertial constant of pendulum, m_f::T = 0f0, #friction of the rotating pole μ_c::T = 0f0, #friction of the cart wheels against the track fmax::T = 300f0, #force applied by throttle x_max::T = 50f0, #maximum horizontal position θ_max::T = deg2rad(70f0), #maximum pole angle ẋ_max::T = 50f0, θ̇_max::T = 10f0, init_x::Function = () -> 0f0, #initialize each of the 4 state variables init_θ::Function = () -> 
Float32(rand([-0.02f0, 0.02f0])), init_ẋ::Function = () -> 0f0, init_θ̇::Function = () -> 0f0) where T<:Real #the action space is full throttle forward or backwards or idle in the discrete case actions = [-fmax, zero(T), fmax] #create a vehicle to use in simulation steps vehicle = CartPoleVehicle(m, m_c, l, k, m_f, μ_c) initialize_state(;t = 0f0) = CartPoleState(init_x(), init_θ(), init_ẋ(), init_θ̇(), t) function failure(s::CartPoleState) (abs(s.x) > x_max) || (abs(s.θ) > θ_max) || (abs(s.ẋ) > ẋ_max) || (abs(s.θ̇) > θ̇_max) end step(s::CartPoleState{T}, f::T) = cartpole_runge_kutta_step(vehicle, s, g, clamp(f, -fmax, fmax), h) min_vals = (-x_max, -θ_max, -ẋ_max, -θ̇_max) max_vals = (x_max, θ_max, ẋ_max, θ̇_max) (step = step, failure = failure, initialize_state = initialize_state, discrete_actions = actions, min_vals = min_vals, max_vals = max_vals, h = h) endmetadatashow_logsèdisabled®skip_as_script«code_folded$0964133c-3a5b-433b-a8c4-a97813c37583cell_id$0964133c-3a5b-433b-a8c4-a97813c37583code,function plot_continuing_step_rewards(r::Vector{T}; npoints = 1000) where T<:Real rsum = cumsum(r) ravg = rsum ./ (1:length(r)) inds = round.(Int64, LinRange(1, length(r), npoints)) plot(scatter(x = inds, y = ravg[inds]), Layout(xaxis_title = "Training Step", yaxis_title = "Reward Average")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$349631b2-4686-49a9-9f3a-1e4ad588b568cell_id$349631b2-4686-49a9-9f3a-1e4ad588b568code\const mountaincar_continuous_mdp2 = create_continuous_action_mountaincar(;slipforce = 100f0)metadatashow_logsèdisabled®skip_as_script«code_folded$8544eddb-2095-4a3c-82e0-920123a88e6dcell_id$8544eddb-2095-4a3c-82e0-920123a88e6dcodemd""" ### Test REINFORCE With and Without Baseline The following function calls execute the REINFORCE algorithm on Example 13.1. The output displayed is the policy function acting on the single state representation for the problem. The two values represent the probability of taking the left and right action respectively. 
If converged properly, the right action probability should be higher, approaching a value of about 60%. """metadatashow_logsèdisabled®skip_as_script«code_folded$31f7e903-30b6-4193-9174-88093e004de4cell_id$31f7e903-30b6-4193-9174-88093e004de4codemd""" In policy gradient methods, the policy can be parameterized in any way, as long as $\pi(a \vert s, \boldsymbol{\theta})$ is differentiable with respect to its parameters, that is, as long as $\nabla \pi(a \vert s, \boldsymbol{\theta})$ exists and is finite for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$, and $\boldsymbol{\theta} \in \mathbb{R}^{d^\prime}$ where $d^\prime$ is the number of parameters. If the action space is discrete and not too large, then we can form a numerical preference $h(s, a, \boldsymbol{\theta})$ for each state/action pair, parameterized by $\boldsymbol{\theta}$, and the corresponding policy can select actions according to the probability distribution generated by the soft-max: $\pi(a|s, \boldsymbol{\theta}) \doteq \frac{\exp{h(s, a, \boldsymbol{\theta})}}{\sum_b \exp{h(s, b, \boldsymbol{\theta})}}$. One advantage of using the soft-max is that the policy can remain stochastic when the optimal policy is stochastic, or we can obtain a deterministic policy by always selecting the action with the highest probability. If we include a temperature parameter in the soft-max then we can vary the same policy to be more or less stochastic as needed. If we calculate preferences with linear features, then we would have feature vectors $\mathbf{x}(s, a) \in \mathbb{R}^{d^\prime}$ to match with the parameter vector $\boldsymbol{\theta} \in \mathbb{R}^{d^\prime}$. Then the preferences would be calculated as $h(s, a, \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s, a)$. Another advantage of parameterizing the policy directly is that for some problems the policy may be easier to approximate than the action-value function. We can also inject some prior knowledge of the environment into how the policy is parameterized. 
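
As a small worked example of the linear soft-max, consider a single state $s$ with two actions. The numbers here are illustrative assumptions rather than values taken from this notebook. Use one-hot features $\mathbf{x}(s, \text{left}) = (1, 0)^\top$ and $\mathbf{x}(s, \text{right}) = (0, 1)^\top$ with $\boldsymbol{\theta} = (1, 2)^\top$, so that $h(s, \text{left}, \boldsymbol{\theta}) = 1$ and $h(s, \text{right}, \boldsymbol{\theta}) = 2$. Then

$\pi(\text{right}|s, \boldsymbol{\theta}) = \frac{e^{2}}{e^{1} + e^{2}} = \frac{1}{1 + e^{-1}} \approx 0.731$

The text above does not specify how the temperature enters, so assume a temperature $\tau > 0$ that divides the preferences: $\pi_\tau(a|s, \boldsymbol{\theta}) \doteq \frac{\exp{(h(s, a, \boldsymbol{\theta})/\tau)}}{\sum_b \exp{(h(s, b, \boldsymbol{\theta})/\tau)}}$. With $\tau = 2$ the same parameters give

$\pi_2(\text{right}|s, \boldsymbol{\theta}) = \frac{1}{1 + e^{-1/2}} \approx 0.622$

Under this assumed form, $\pi_\tau(\text{right}|s, \boldsymbol{\theta}) \to 1$ as $\tau \to 0$ (a deterministic policy) and $\to 0.5$ as $\tau \to \infty$ (a uniform policy). This is the sense in which one set of parameters can be made more or less stochastic.
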
"""metadatashow_logsèdisabled®skip_as_script«code_folded$fee14dfe-c5ca-4126-a830-cc9d7eda5433cell_id$fee14dfe-c5ca-4126-a830-cc9d7eda5433code1const mountaincar_continuous_test_train2 = actor_critic_with_eligibility_traces_binary_features_gaussian_actions(mountaincar_continuous_mdp2, 0.05f0, 0.8f0, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, typemax(Int64), 100_000; α_θ = 5f-4, α_w = 0.0008f0)metadatashow_logsèdisabled®skip_as_script«code_folded$b53dba81-a9e9-41da-8fc2-7736bf25f2dccell_id$b53dba81-a9e9-41da-8fc2-7736bf25f2dccodejif run_mountaincar_binary_episodic_countinuous_param_study > 0 actor_critic_binary_episodic_gaussian_parameter_study(mountaincar_continuous_mdp, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, mountaincar_binary_continuous_params, 4, 3, 1000; max_steps = 100_000) else md""" Waiting to run parameter study """ endmetadatashow_logsèdisabled®skip_as_script«code_folded$beb01fb8-c77d-4b5c-a66d-3812415e04a3cell_id$beb01fb8-c77d-4b5c-a66d-3812415e04a3code ### *Exercise 13.4* > For the Gaussian policy parameterization, derive the formula for the eligibility vector $\nabla \ln{\pi(a|s, \mathbf{\theta})}$ Starting with our expression for the parameter function, we can calculate the gradient: $\nabla \pi(a|s, \mathbf{\theta}) = \nabla \left ( \frac{1}{\sigma(s, \mathbf{\theta}) \sqrt{2\pi}} \exp \left ( - \frac{(a-\mu(s, \mathbf{\theta}))^2}{2\sigma(s, \mathbf{\theta})^2} \right ) \right )$ We will eventually need $\nabla \mu$ and $\nabla \sigma$ so let's calculate them now. 
$\nabla (\sigma(s, \mathbf{\theta})) = \nabla \exp{( \mathbf{\theta}_\sigma ^ \top \mathbf{x}_\sigma (s))} = \sigma(s, \mathbf{\theta})\mathbf{x}_\sigma (s)$ $\nabla(\mu(s, \mathbf{\theta})) = \nabla ( \mathbf{\theta}_\mu ^\top \mathbf{x}_\mu(s)) = \mathbf{x}_\mu (s)$ The first application of the quotient rule is straightforward. I will omit the input arguments to μ and σ, keeping in mind that these are functions of the parameters. Also let $\left ( - \frac{(a-\mu)^2}{2\sigma^2} \right ) = f(\mu, \sigma)$ which results in $\pi(a|s, \mathbf{\theta}) = \frac{1}{\sigma \sqrt{2\pi}} \exp{(f(\mu, \sigma))}$. Therefore: $\begin{flalign} \nabla \pi(a|s, \mathbf{\theta}) \sqrt{2\pi} &= \frac{1}{\sigma ^2} \left (- \exp{(f(\mu, \sigma))} \nabla \sigma + \sigma \exp{(f(\mu, \sigma))}\nabla f(\mu, \sigma) \right ) \\ &= \frac{1}{\sigma ^2} \left ( -\exp{(f(\mu, \sigma))} \sigma\mathbf{x}_\sigma + \sigma \exp{(f(\mu, \sigma))}\nabla f(\mu, \sigma) \right ) \\ &=\frac{\exp{(f(\mu, \sigma))}}{\sigma} \left (-\mathbf{x}_\sigma + \nabla f(\mu, \sigma) \right ) \\ \end{flalign}$ Now we need only calculate the gradient of $f$, again using the quotient rule: $\begin{flalign} \nabla f(\mu, \sigma) &= \frac{-1}{2} \nabla \left [ \frac{(a-\mu)^2}{\sigma^2} \right ] \\ & = \frac{-1}{2\sigma^4} \left [-2 \sigma^2 (a - \mu) \nabla \mu - (a - \mu)^2 2\sigma \nabla \sigma \right ] \\ & = \frac{-1}{\sigma^3} \left [ -\sigma (a - \mu) \nabla \mu - (a - \mu)^2 \nabla \sigma \right ] \\ & = \frac{-1}{\sigma^3} \left [ -\sigma (a - \mu) \mathbf{x}_\mu (s) - (a - \mu)^2 \sigma \mathbf{x}_\sigma \right ] \tag{substituting gradients}\\ & = \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu) \tag{simplifying}\\ \end{flalign}$ Now substitute this back into the policy gradient: $\nabla \pi(a|s, \mathbf{\theta}) \sqrt{2\pi} = \frac{\exp{(f(\mu, \sigma))}}{\sigma} \left (-\mathbf{x}_\sigma + \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu) \right )$ Furthermore, observe that $\pi(a|s, 
\mathbf{\theta}) = \frac{1}{\sigma\sqrt{2\pi}} \exp(f(\mu, \sigma))$ So our expression for the policy gradient is: $\nabla \pi(a|s, \mathbf{\theta}) = \pi(a|s, \mathbf{\theta}) \left (-\mathbf{x}_\sigma + \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu) \right )$ To get the eligibility vector we must divide this by the policy, which is conveniently already a factor of the expression: $\begin{flalign} \frac{\nabla \pi(a|s, \mathbf{\theta})}{\pi(a|s, \mathbf{\theta})} &= -\mathbf{x}_\sigma + \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu)\\ &= \mathbf{x}_\mu \left [ \frac{(a - \mu)}{\sigma^2} \right ] + \mathbf{x}_\sigma \left [\frac{(a-\mu)^2}{\sigma^2} -1 \right ] \\ \end{flalign}$ There are two components to the sum, one for $\mu$ and one for $\sigma$. If we think of the parameters and feature vectors as concatenated, then this sum is an element-by-element sum where $\mathbf{x}_\mu$ has a zero value for all the feature indices corresponding to $\sigma$ and vice versa. This way the sum forms one complete vector that has gradient components for all the parameters $\mathbf{\theta}_\mu$ and $\mathbf{\theta}_\sigma$. Alternatively, the two terms can be kept separate throughout the calculation, with each gradient applied only to its own parameter vector. """metadatashow_logsèdisabled®skip_as_script«code_folded$8bc280db-e57d-4e40-be46-1790f4f7d9e7cell_id$8bc280db-e57d-4e40-be46-1790f4f7d9e7codefunction actor_critic_fcann_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, hidden_layers::Vector{Int64}, λ_θ::T, λ_w::T, α_r̄::T, α_θ_list::AbstractVector{T}, α_w_list::AbstractVector{T}, max_steps::Integer; nruns::Integer = 100, seed = rand(UInt64), kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) init_policy_params = FCANN.initializeparams_saxe(num_features, hidden_layers, length(mdp.actions)) make_trace_data(α_θ_list, α_w) = [average_continuing_runs(nruns, seed, α_θ, α_w, α_r̄, init_policy_params, actor_critic_with_eligibility_traces_fcann, mdp, λ_θ, λ_w, num_features, hidden_layers, update_feature_vector!, max_steps; kwargs...) for α_θ in α_θ_list] traces = [begin scatter(x = α_θ_list, y = make_trace_data(α_θ_list, α_w), name = "α_w = $α_w") end for α_w in α_w_list] plot(traces, Layout(xaxis_title = "α_θ", yaxis_title = "Average Reward Per Step in the First
$max_steps Steps Averaged Over $nruns Runs", xaxis_type = "log", title = "$num_features Input, $hidden_layers Hidden Non Linear Approximation, λ_θ = $λ_θ, λ_w = $λ_w")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$89901156-b874-416b-89c1-6dc434a4eb17cell_id$89901156-b874-416b-89c1-6dc434a4eb17code(md""" ### *REINFORCE Implementation* """metadatashow_logsèdisabled®skip_as_script«code_folded$ff76ef94-fdf5-41f3-a31a-21c4629efabecell_id$ff76ef94-fdf5-41f3-a31a-21c4629efabecode(const corridor_mdp = make_corridor_mdp()metadatashow_logsèdisabled®skip_as_scriptëcode_folded$581f7e9b-a5c2-4841-9605-85f9585b0274cell_id$581f7e9b-a5c2-4841-9605-85f9585b0274codeٺupdate_linear_action_preferences!(action_preferences::Vector{T}, x::Vector{T}, params::Matrix{T}) where T<:AbstractFloat = BLAS.gemv!('T', one(T), params, x, zero(T), action_preferences)metadatashow_logsèdisabled®skip_as_script«code_folded$8aa16866-bfda-48df-9cf1-cf3d2e203ccbcell_id$8aa16866-bfda-48df-9cf1-cf3d2e203ccbcodeofunction cartpole_tilecoding_reinforce_continuous_parameter_study(α1_list, α2_list, max_episodes; num_trials = 100, kwargs...) setup = setup_cartpole_problem(;kwargs...) 
traces = [begin steps = [begin 1:num_trials |> Map() do i solution = reinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions(cartpole_setup.mdps.episodic.continuous, cartpole_setup.get_active_features, cartpole_setup.num_features, max_episodes; α_θ = α1, α_w = α2) steps = solution.episode_steps isempty(steps) && return 0 mean(steps) end |> foldxt(+) |> x -> x / num_trials end for α1 in α1_list] scatter(x = α1_list, y = steps, name = "α_w = $α2") end for α2 in α2_list] plot(traces, Layout(xaxis_title = "Policy Learning Rate α_θ", yaxis_title = "Average Episode Duration Over First $max_episodes Episodes", xaxis_type = "log")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$04b5929a-2058-49c9-963a-96c752a1d67dcell_id$04b5929a-2058-49c9-963a-96c752a1d67dcodeIplot_continuing_step_rewards(cartpole_continuing_fcann_test.step_rewards)metadatashow_logsèdisabled®skip_as_script«code_folded$f0104778-81a6-417b-8501-f916e5e7f3afcell_id$f0104778-81a6-417b-8501-f916e5e7f3afcodeAfunction make_corridor_continuing_mdp() function step(s::Integer, i_a::Integer) δ = 2*i_a - 3 #calculates the s change -1 for left (1) and 1 for right (2) switch = iseven(s) #returns true in state 2 which is where actions are switched, when switch is true, multiply δ by -1, otherwise by 1 c = 1 - 2*switch s′ = s + c*δ goal = s == 4 left_limit = s == 0 s′ = ifelse(left_limit || goal, 1, s′) r = Float32(goal) (r, s′) end actions = [:left, :right] ptf = StateMDPTransitionSampler(step, 1) StateMDP(actions, ptf, () -> 1, Returns(false)) endmetadatashow_logsèdisabled®skip_as_script«code_folded$3e3c5897-809f-46e3-bb58-f115b082443ecell_id$3e3c5897-809f-46e3-bb58-f115b082443ecodefunction actor_critic_with_eligibility_traces_binary_features_beta_actions(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, λ_θ::T, λ_w::T, get_active_features::Function, num_features::Integer, args...; policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), value_params::Vector{T} = 
zeros(T, num_features), kwargs...) where {T<:Real, S, N, A <: Union{T, NTuple{N, T}}, P, F1, F2, F3} setup = setup_binary_beta_policy_arguments(mdp, get_active_features, num_features) actor_critic_with_eligibility_traces!(policy_params, setup.eligibility_vector, value_params, BinaryFeatureVector(), mdp, λ_θ, λ_w, update_binary_action_preferences!, setup.action_distribution_parameters, make_beta_sampler(rand(A)), update_beta_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, binary_value_function, update_binary_value_gradient!, args...; kwargs...) endmetadatashow_logsèdisabled®skip_as_script«code_folded$a9db3f85-ff56-4bbc-be87-47b893ef3b7bcell_id$a9db3f85-ff56-4bbc-be87-47b893ef3b7bcodefunction mountaincar_continuing_step(s, i_a::Integer) a = MountainCarTask.actions[i_a] s′ = MountainCarTask.step(s, a) (s′[1] == 0.5f0) && return (1f0, MountainCarTask.initialize_state()) return (0f0, s′) endmetadatashow_logsèdisabled®skip_as_script«code_folded$08505e88-9c23-4e95-91e3-d18bf5133dbccell_id$08505e88-9c23-4e95-91e3-d18bf5133dbccodefunction actor_critic_binary_episodic_squashed_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, amax::A, get_active_features::Function, num_features::Integer, λ_θ::T, λ_w::T, α_θ_list::AbstractVector{T}, α_w_list::AbstractVector{T}, max_episodes::Integer; nruns::Integer = 100, max_steps::Integer = 10_000, seed = rand(UInt64), init_policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), init_value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) function average_runs(α_θ, α_w) 1:nruns |> Map(_ -> actor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions(mdp, amax, λ_θ, λ_w, get_active_features, num_features, max_episodes, max_steps; α_θ = α_θ, α_w = α_w, policy_params = copy(init_policy_params), value_params = copy(init_value_params), kwargs...) |> x -> isempty(x.episode_rewards) ? 
-T(Inf) : mean(x.episode_rewards)) |> foldxt(+) |> x -> x / nruns end traces = [begin scatter(x = α_θ_list, y = average_runs.(α_θ_list, α_w), name = "α_w = $α_w") end for α_w in α_w_list] plot(traces, Layout(xaxis_title = "α_θ", yaxis_title = "Average Reward Per Episode in the First
$max_episodes Episodes Averaged Over $nruns Runs", xaxis_type = "log", title = "Binary Feature Encoding with $num_features Features, λ_θ = $λ_θ, λ_w = $λ_w")) endmetadatashow_logsèdisabled®skip_as_script«code_folded$ad0009af-2cfc-4820-bd4a-698ad391f459cell_id$ad0009af-2cfc-4820-bd4a-698ad391f459codebplot(scatter(x = LinRange(0, 1, 1000), y = make_beta_dist(beta_params...).(LinRange(0, 1, 1000))))metadatashow_logsèdisabled®skip_as_script«code_folded$16fcc2d0-9f2f-4226-9dcc-6d86248cab26cell_id$16fcc2d0-9f2f-4226-9dcc-6d86248cab26codeXfunction plot_state_distributions(p_left; kwargs...) state_visits = collect_state_distributions(;p = p_left, kwargs...) η = sum(state_visits) μ = sum(state_visits, dims=1)[:] μs_plot = plot(bar(x = 1:3, y = μ ./ sum(μ)), Layout(yaxis_range = [0, 1], xaxis_tickvals = [1, 2, 3], xaxis_title = "State", yaxis_title = "Probability", title = "Stationary State Distribution")) p_not_term = sum(state_visits, dims = 2) pterm = 1 .- p_not_term termplot = plot(pterm, Layout(xaxis_title = "Step", yaxis_title = "Probability", title = "Probability of Episode Terminating On or Before Step")) (n, m) = size(state_visits) plots = [begin v = state_visits[k, :][:] vterm = pterm[k] t = bar(x = 1:4, y = [v; vterm], name = "k = $k") p = plot(t, Layout(width = 270, height = 250, yaxis_range = [0, 1], xaxis = attr(tickvals = 1:4, ticktext = ["1", "2", "3", "Term"], title = "State"), yaxis_title = "Probability", title = "Step $k")) end for k in vcat(1:5, 10:10:50)] full_p = plot(heatmap(x = 0:20, y = 1:3, z = state_visits[1:21, :]' ./ sum(state_visits)), Layout(xaxis_title = "Step", yaxis_title = "State", title = "Probability Over States and Steps", yaxis_tickvals = [1, 2, 3])) # p3 = plot(traces2) @htl(""" Policy Probability for Left Action is $p_left and Average Episode Length is $η

State Distribution Per Step Including Terminal State

$plots
$termplot $μs_plot
$full_p
""") endmetadatashow_logsèdisabled®skip_as_script«code_folded$11063fff-4d36-46d5-828f-dbed0f46b9cfcell_id$11063fff-4d36-46d5-828f-dbed0f46b9cfcoderfunction actor_critic_fcann_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, hidden_layers::Vector{Int64}, λ_θ_list::AbstractVector{T}, λ_w_list::AbstractVector{T}, α_r̄_list::AbstractVector{T}, α_θ_list::AbstractVector{T}, α_w_list::AbstractVector{T}, num_tests::Integer, max_steps::Integer; nruns::Integer = 100, seed = rand(UInt64), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) init_policy_params = FCANN.initializeparams_saxe(num_features, hidden_layers, length(mdp.actions)) run_test(α_θ, α_w, α_r̄, λ_θ, λ_w) = average_continuing_runs(nruns, seed, α_θ, α_w, α_r̄, init_policy_params, actor_critic_with_eligibility_traces_fcann, mdp, λ_θ, λ_w, num_features, hidden_layers, update_feature_vector!, max_steps; kwargs...) test_params = [(α_θ = rand(α_θ_list), α_w = rand(α_w_list), α_r̄ = rand(α_r̄_list), λ_θ = rand(λ_θ_list), λ_w = rand(λ_w_list)) for _ in 1:num_tests] DataFrame([begin output = run_test(params...) 
(;params..., output = output) end for params in test_params]) endmetadatashow_logsèdisabled®skip_as_script«code_folded$8fcdca63-01a0-4d4b-933c-06a7621d980acell_id$8fcdca63-01a0-4d4b-933c-06a7621d980acodeW#add neural network implementation of continuous policy gradient and do parameter studymetadatashow_logsèdisabled®skip_as_script«code_folded$33c99850-67cd-4754-94b9-6df97b238e27cell_id$33c99850-67cd-4754-94b9-6df97b238e27codefunction soft_max!(x::AbstractVector{T}) where T<:Real minx, maxx = extrema(x) if minx == maxx x .= one(T) / length(x) return x end s = zero(T) @inbounds @simd for i in eachindex(x) h = exp(x[i] - maxx) s += h x[i] = h end x ./= s endmetadatashow_logsèdisabled®skip_as_script«code_folded$786a5385-b648-4fc3-8e19-bf6582828136cell_id$786a5385-b648-4fc3-8e19-bf6582828136codeKmd""" #### Continuous Action Space Now that we have verified the success of policy gradient methods on this problem, we can consider using a continuous action space where the policy can output a distribution over throttles. In the original problem, the maximum throttle value is 1, but the velocity of the car is already capped at 0.07. We can see if a policy attempts to use much higher throttle values to end the episode faster even if the physics is unrealistic. That observation would confirm a successful use of continuous actions where the throttle is an unbounded continuous value. The optimal policy would likely try to use the highest throttle possible to reach the maximum speed in either direction faster. We could apply friction to the problem so that the car would actually slip if it attempts to accelerate too quickly. 
"""metadatashow_logsèdisabled®skip_as_script«code_folded$573878bb-020d-40f6-9329-3d5f91843010cell_id$573878bb-020d-40f6-9329-3d5f91843010code^get_corridor_episode_stats(corridor_train.greedy_policy; max_steps = 100, ntrials = 1_000_000)metadatashow_logsèdisabled®skip_as_script«code_folded$2e7c737c-c798-4442-a7e1-d74ccfd73119cell_id$2e7c737c-c798-4442-a7e1-d74ccfd73119code6@bind ẋ Slider(-50:50; default = 0, show_value=true)metadatashow_logsèdisabled®skip_as_script«code_folded$9d264543-33ab-498a-90f5-5f913c252484cell_id$9d264543-33ab-498a-90f5-5f913c252484code]plot(reinforce_test4.episode_steps[1:10:end] ./ (1:10:length(reinforce_test4.episode_steps)))metadatashow_logsèdisabled®skip_as_script«code_folded$9cf3dc5f-8a25-479f-93db-06e34f0d37a0cell_id$9cf3dc5f-8a25-479f-93db-06e34f0d37a0code%plot_state_distributions(dist_plot_p)metadatashow_logsèdisabled®skip_as_script«code_folded$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70cell_id$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70codeهbegin using PlutoUI, PlutoPlotly, LaTeXStrings, PlutoProfile, HypertextLiteral, ProgressLogging, BenchmarkTools TableOfContents() endmetadatashow_logsèdisabled®skip_as_scriptëcode_folded$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19fcell_id$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19fcodedactor_critic_binary_episodic_squashed_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, amax::A, get_active_features::Function, num_features::Integer, params::@NamedTuple{λ_θ::T, λ_w::T, α_θ_min::Int64, α_w_min::Int64}, num_θ::Integer, num_w::Integer, num_episodes::Integer; kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_binary_episodic_squashed_gaussian_parameter_study(mdp, amax, get_active_features, num_features, params.λ_θ, params.λ_w, 2f0 .^(params.α_θ_min:params.α_θ_min+num_θ-1), 2f0 .^(params.α_w_min:params.α_w_min+num_w-1), num_episodes; kwargs...)metadatashow_logsèdisabled®skip_as_script«code_folded$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72ccell_id$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72ccodeلcorridor_train = sarsa_λ(corridor_mdp, 1f0, 0.99f0, typemax(Int64), 1_000_000, 1, get_corridor_features; ϵ = 0.5f0, α = 0.0001f0)metadatashow_logsèdisabled®skip_as_script«code_folded«notebook_id$6d683db8-38f5-11f0-0729-898e37e867d8in_temp_dir¨metadata