Þ¥bonds€¬cell_resultsÞ§Ù$4f96be72-ef3e-4e08-ac4c-be4271dcd14cŠ¦queuedÂ¤logs§runningÂ¦output†¤body ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ”ô•Z°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$4f96be72-ef3e-4e08-ac4c-be4271dcd14c¹depends_on_disabled_cellsÂ§runtimeÎ@Uµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$19dfabda-7049-4050-8662-0385529c0c5aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ"W

x position: 0.0 x velocity: 0.0

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ• ûžV°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$19dfabda-7049-4050-8662-0385529c0c5a¹depends_on_disabled_cellsÂ§runtimeÎ KÞµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$b71145a4-2614-4f62-bfd2-7d5d1fecec56Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙGactor_critic_with_eligibility_traces! (generic function with 3 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•+ )o°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$b71145a4-2614-4f62-bfd2-7d5d1fecec56¹depends_on_disabled_cellsÂ§runtimeÎ‰Xþµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$c0876a48-ea18-494d-8bfc-e2bceb73b417Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÛF%

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•@µ!ÿ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$c0876a48-ea18-494d-8bfc-e2bceb73b417¹depends_on_disabled_cellsÂ§runtimeÎ|÷¸µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙDreinforce_monte_carlo_control_fcann (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•$0Šç°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091¹depends_on_disabled_cellsÂ§runtimeÎ=‡xµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$f4b6f10b-4cd0-4be6-98ec-4d4ffb696392Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙS

One-step Actor-Critic Implementation

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô‹ï°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$f4b6f10b-4cd0-4be6-98ec-4d4ffb696392¹depends_on_disabled_cellsÂ§runtimeÎmÈµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$9db9ff71-bee9-4bea-a45b-748f8517fed1Š¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements’’´action_probabilities’…¦prefix§Float32¨elements’’’©0.0033094ªtext/plain’’¨0.996691ªtext/plain¤type¥Array¬prefix_short ¨objectid°b337585f8106efc9Ù!application/vnd.pluto.tree+object’´state_value_estimate’¨-243.183ªtext/plain¤typeªNamedTuple¨objectid¯1ff9b53824d3895¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÀ²last_run_timestampËAÚ•*í6¤°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$9db9ff71-bee9-4bea-a45b-748f8517fed1¹depends_on_disabled_cellsÂ§runtimeÎ…Çµµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$4634267b-5dea-4164-8bb2-1eb2fd4d7954Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙBupdate_linear_eligibility_vector! (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•´Ø°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$4634267b-5dea-4164-8bb2-1eb2fd4d7954¹depends_on_disabled_cellsÂ§runtimeÎ"¥¥µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$6c5f51bb-a6be-447e-b73d-4f9c2885e809Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÀ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampË°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$6c5f51bb-a6be-447e-b73d-4f9c2885e809¹depends_on_disabled_cellsÃ§runtimeÀµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$cc45091e-b889-4d5a-9eef-84d80f792046Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚý

13.4 REINFORCE with Baseline

The policy gradient theorem (13.5) can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$:

$$\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s)\sum_a\left( q_\pi(s,a)-b(s) \right ) \nabla\pi(a|s,\boldsymbol{\theta}) \tag{13.10}$$

The baseline can be any function, even a random variable, as long as it does not vary with $a$; the equation remains valid because the subtracted quantity is zero:

$$\sum_ab(s)\nabla\pi(a|s,\boldsymbol{\theta})=b(s)\nabla\sum_a\pi(a|s,\boldsymbol{\theta})=b(s)\nabla1=0$$

The policy gradient theorem with baseline (13.10) can be used to derive an update rule using similar steps as in the previous section. The update rule that we end up with is a new version of REINFORCE that includes a general baseline:

$$\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t+\alpha(G_t-b(S_t))\frac{\nabla\pi(A_t|S_t,\boldsymbol{\theta}_t)}{\pi(A_t|S_t,\boldsymbol{\theta}_t)} \tag{13.11}$$

Since the baseline could be uniformly zero, this is a strict generalization of REINFORCE. To have an effective baseline that depends on state we can use a state value estimate that is also updated with gradient steps: $\hat v(S_t, \mathbf{w})$. Using such an estimate we can revise the previous REINFORCE algorithm.
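
As an illustration of update (13.11), the following is a minimal Python sketch of a single REINFORCE-with-baseline step. It is not the notebook's own implementation: it assumes a soft-max policy with linear action preferences, a linear baseline $\hat v(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$, and no discount factor, and it uses the identity $\frac{\nabla\pi}{\pi} = \nabla \ln \pi$. All function and argument names are hypothetical.

```python
import math


def softmax_probs(theta, action_features):
    """Soft-max over linear action preferences h_a = theta . x(s, a)."""
    prefs = [sum(t * x for t, x in zip(theta, feats)) for feats in action_features]
    m = max(prefs)  # subtract the max preference for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_baseline_step(theta, w, state_features, action_features,
                            action, G, alpha_theta, alpha_w):
    """Apply one update of the form (13.11) with b(s) = w . x(s).

    theta: policy parameters; w: baseline (value) parameters.
    state_features: x(s); action_features[a]: x(s, a) for each action a.
    G: sampled return from the step being updated (assumed undiscounted).
    """
    # delta = G_t - b(S_t), computed before either parameter vector changes
    v_hat = sum(wi * xi for wi, xi in zip(w, state_features))
    delta = G - v_hat

    # grad ln pi(a|s) = x(s, a) - sum_b pi(b|s) x(s, b) for this parameterization
    probs = softmax_probs(theta, action_features)
    n = len(theta)
    expected = [sum(p * feats[i] for p, feats in zip(probs, action_features))
                for i in range(n)]
    grad_ln_pi = [action_features[action][i] - expected[i] for i in range(n)]

    # gradient steps on the baseline and on the policy
    new_w = [wi + alpha_w * delta * xi for wi, xi in zip(w, state_features)]
    new_theta = [ti + alpha_theta * delta * gi for ti, gi in zip(theta, grad_ln_pi)]
    return new_theta, new_w
```

Computing `delta` once and reusing it for both updates keeps the baseline and the policy step consistent with the same sampled return.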

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô‰¨ø°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$cc45091e-b889-4d5a-9eef-84d80f792046¹depends_on_disabled_cellsÂ§runtimeÎ]Aµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$5b15d91e-7119-4f85-a54a-7d4f1fdaf097Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙIcreate_actor_critic_continuing_params_UI (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ• ró°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$5b15d91e-7119-4f85-a54a-7d4f1fdaf097¹depends_on_disabled_cellsÂ§runtimeÎ>¶õµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$ba41f521-4ee2-42a6-bf18-078bfa4b875eŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙAmake_n_param_dist_policy_params (generic function with 2 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•' ¶ °persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$ba41f521-4ee2-42a6-bf18-078bfa4b875e¹depends_on_disabled_cellsÂ§runtimeÎx®µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙNcartpole_tilecoding_reinforce_parameter_study (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•1<×°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03¹depends_on_disabled_cellsÂ§runtimeÎY?µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$3c695d54-c30f-4f04-bd40-f5da53be2a95Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙN

Cart Pole Continuous Action MDP

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô”—°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$3c695d54-c30f-4f04-bd40-f5da53be2a95¹depends_on_disabled_cellsÂ§runtimeÎ^Hµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$0d45ae72-572f-4d17-83cf-9814f2854131Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚcY

$\lambda_\theta$: 0.05

$\lambda_\mathbf{w}$: 0.8

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•@/Ü°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$0d45ae72-572f-4d17-83cf-9814f2854131¹depends_on_disabled_cellsÂ§runtimeÎ&Lµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$cd9c9eeb-c90d-4499-9503-7773d5250f47Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ‹¤Total Reward: -68.0

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•AtK°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$cd9c9eeb-c90d-4499-9503-7773d5250f47¹depends_on_disabled_cellsÂ§runtimeÎ-œµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$fd58402f-da65-44cf-b81a-e21192fd0e63Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÛyú

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•@`‚à°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$fd58402f-da65-44cf-b81a-e21192fd0e63¹depends_on_disabled_cellsÂ§runtimeÎ\9µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$8e39bd15-862e-4941-88f9-2794b861a523Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙNreinforce_monte_carlo_control_linear_features (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•$9?°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$8e39bd15-862e-4941-88f9-2794b861a523¹depends_on_disabled_cellsÂ§runtimeÎ;çhµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$64900586-ef92-48e4-839e-ff952a46671bŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÀ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampË°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$64900586-ef92-48e4-839e-ff952a46671b¹depends_on_disabled_cellsÃ§runtimeÀµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$fddef10c-7695-4596-9e16-987fd45a57e6Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙBsetup_cartpole_continuous_problem (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•0“uê°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$fddef10c-7695-4596-9e16-987fd45a57e6¹depends_on_disabled_cellsÂ§runtimeÎNãvµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$e2b09af1-0f22-4f7f-b806-54fa522adb20Š¦queuedÂ¤logs§runningÂ¦output†¤body ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ”ô‹š°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$e2b09af1-0f22-4f7f-b806-54fa522adb20¹depends_on_disabled_cellsÂ§runtimeÍ'Kµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$2be8a812-4f21-4fe8-a2de-50497db0345aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙg

Actor-Critic Implementation for Continuous Action Spaces

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô“æ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$2be8a812-4f21-4fe8-a2de-50497db0345a¹depends_on_disabled_cellsÂ§runtimeÎ¿uµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$68806899-9972-460a-9f11-daa708a9d610Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙUactor_critic_with_eligibility_traces_linear_features (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•+¾NY°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$68806899-9972-460a-9f11-daa708a9d610¹depends_on_disabled_cellsÂ§runtimeÎKMµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$189798b3-ec6b-48b9-918c-ee0f65935ab3Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ

Exercise 13.3

In Section 13.1 we considered policy parameterizations using the soft-max in action preferences (13.2) with linear action preferences (13.3). For this parameterization, prove that the eligibility vector is $\begin{flalign} \nabla \ln \pi(a|s, \boldsymbol{\theta}) = \mathbf{x}(s, a) - \sum_b \pi(b|s, \boldsymbol{\theta}) \mathbf{x}(s, b) \tag{13.9} \end{flalign}$ using the definitions and elementary calculus.

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôˆEé°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$189798b3-ec6b-48b9-918c-ee0f65935ab3¹depends_on_disabled_cellsÂ§runtimeÎ-µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$00152954-dc98-4120-b94b-2ea4d987832bŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙBcreate_mountaincar_continuing_mdp (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•!'r°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$00152954-dc98-4120-b94b-2ea4d987832b¹depends_on_disabled_cellsÂ§runtimeÎ’¾µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$42d4600a-bf3c-45ac-b7f5-d23917713ff5Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ‚

Layer Size: Num Layers:

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•!@°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$42d4600a-bf3c-45ac-b7f5-d23917713ff5¹depends_on_disabled_cellsÂ§runtimeÎÍÌµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$4e29c621-223e-4859-8e96-db04b967815aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙPsetup_binary_squashed_gaussian_policy_arguments (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•'4Qø°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$4e29c621-223e-4859-8e96-db04b967815a¹depends_on_disabled_cellsÂ§runtimeÎéÓµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$5981f52b-d829-4c7d-b47b-33310f7d64a2Š¦queuedÂ¤logs§runningÂ¦output†¤body…¦prefix§Float32¨elements’’’£0.5ªtext/plain’’£0.5ªtext/plain¤type¥Array¬prefix_short ¨objectid°540650b0f89a7c43¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÀ²last_run_timestampËAÚ•>°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$5981f52b-d829-4c7d-b47b-33310f7d64a2¹depends_on_disabled_cellsÂ§runtimeÍiµµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$0e9de19e-bcd4-40ac-9831-afb6cad38422Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ=setup_fcann_policy_arguments (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•Ý…n°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$0e9de19e-bcd4-40ac-9831-afb6cad38422¹depends_on_disabled_cellsÂ§runtimeÎKäåµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$ff3009eb-23f9-44fe-8e56-85dbc7b463d0Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ5show_squashed_policy (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•>^Œ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$ff3009eb-23f9-44fe-8e56-85dbc7b463d0¹depends_on_disabled_cellsÂ§runtimeÎî>µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$4fb83451-b6f8-4e6e-a131-1accc8e10b08Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙMreinforce_with_baseline_monte_carlo_control! 
(generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•#]R;°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$4fb83451-b6f8-4e6e-a131-1accc8e10b08¹depends_on_disabled_cellsÂ§runtimeÎa:›µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$406638af-1e08-44d2-9ee4-97aa9294a94bŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙF

13.2 The Policy Gradient Theorem

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô…vu°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$406638af-1e08-44d2-9ee4-97aa9294a94b¹depends_on_disabled_cellsÂ§runtimeÎÛÃµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$57e5e12a-b722-4ea3-ab3b-e5711029e640Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙFone_step_actor_critic_linear_features (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•*âí»°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$57e5e12a-b722-4ea3-ab3b-e5711029e640¹depends_on_disabled_cellsÂ§runtimeÎ:ï¤µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$374af774-3a97-49b5-a3bb-bc3f7f63a3faŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚÔ¾

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•@šá"°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$374af774-3a97-49b5-a3bb-bc3f7f63a3fa¹depends_on_disabled_cellsÂ§runtimeÎÖ·µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$7bf209c8-ef0a-46d1-937e-b1a6e45dc62eŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚV(

α: 0.01

β: 0.01

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•!hHw°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$7bf209c8-ef0a-46d1-937e-b1a6e45dc62e¹depends_on_disabled_cellsÂ§runtimeÎŠ2–µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$dd8e8cd2-7b41-46c4-8530-adefb7aea684Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙRactor_critic_binary_episodic_beta_parameter_study (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•/å¼°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$dd8e8cd2-7b41-46c4-8530-adefb7aea684¹depends_on_disabled_cellsÂ§runtimeÎþ;µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$4fea7232-f286-4a8b-93f8-a0702818ab31Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙO

Test Actor-Critic with Eligibility Traces

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô‹Òª°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$4fea7232-f286-4a8b-93f8-a0702818ab31¹depends_on_disabled_cellsÂ§runtimeÎß®µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$26880577-d267-4950-8725-7afe0d0402b6Š¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements•’¤mdps’ƒ¨elements’’¨episodic’ƒ¨elements’’¨discrete’…¦prefixÚÏStateMDP{Float32, CartPoleState{Float32}, Float32, StateMDPTransitionSampler{Float32, CartPoleState{Float32}, var"#1515#1531"{var"#episodic_step#1529"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}}, Vector{Float32}}}, var"#initialize_state#1525"{var"#initialize_state#1514#1526"{var"#1521#1537", var"#init_Î¸#1551", var"#1523#1539", var"#1524#1540"}}, var"#failure#1527"{Float32, Float32, Float32, Float32}, var"#164#169"}¨elements–’§actions’…¦prefix§Float32¨elements‘¤more¤type¥Array¬prefix_short ¨objectid°94c2589b374bb4e8Ù!application/vnd.pluto.tree+object’£ptf’…¦prefixÙÎStateMDPTransitionSampler{Float32, CartPoleState{Float32}, var"#1515#1531"{var"#episodic_step#1529"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}}, Vector{Float32}}}¨elements‘’¤step’¥#1515ªtext/plain¤type¦struct¬prefix_short¹StateMDPTransitionSampler¨objectid°fe79d9d3300a131aÙ!application/vnd.pluto.tree+object’°initialize_state’°initialize_stateªtext/plain’¦isterm’§failureªtext/plain’¯is_valid_action’¤#164ªtext/plain’¬action_index’…¦prefix´Dict{Float32, Int64}¨elements‘¤more¤type¤Dict¬prefix_short¤Dict¨objectid°604cc680086d6fbcÙ!application/vnd.pluto.tree+object¤type¦struct¬prefix_short¨StateMDP¨objectid°93f5ea03ec017b99Ù!application/vnd.pluto.tree+object’ªcontinuous’…¦prefixÚÀContinuousMDP{Float32, CartPoleState{Float32}, Float32, ContinuousMDPTransitionSampler{Float32, CartPoleState{Float32}, Float32, var"#episodic_step#1529"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}}}, 
var"#initialize_state#1525"{var"#initialize_state#1514#1526"{var"#1521#1537", var"#init_Î¸#1551", var"#1523#1539", var"#1524#1540"}}, var"#failure#1527"{Float32, Float32, Float32, Float32}, Returns{Bool}}¨elements”’£ptf’…¦prefixÙºContinuousMDPTransitionSampler{Float32, CartPoleState{Float32}, Float32, var"#episodic_step#1529"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}}}¨elements‘’¤step’episodic_stepªtext/plain¤type¦struct¬prefix_short¾ContinuousMDPTransitionSampler¨objectid°7c8829407e8c2ed7Ù!application/vnd.pluto.tree+object’°initialize_state’°initialize_stateªtext/plain’¦isterm’§failureªtext/plain’¯is_valid_action’³Returns{Bool}(true)ªtext/plain¤type¦struct¬prefix_shortContinuousMDP¨objectid°677d0c961fd63ccaÙ!application/vnd.pluto.tree+object¤typeªNamedTuple¨objectid°6ef1486dd0419d46Ù!application/vnd.pluto.tree+object’ªcontinuing’ƒ¨elements’’¨discrete’…¦prefixÚàStateMDP{Float32, CartPoleState{Float32}, Float32, StateMDPTransitionSampler{Float32, CartPoleState{Float32}, var"#1516#1532"{var"#continuing_step#1530"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}, var"#failure#1527"{Float32, Float32, Float32, Float32}}, Vector{Float32}}}, var"#initialize_state#1525"{var"#initialize_state#1514#1526"{var"#1521#1537", var"#init_Î¸#1551", var"#1523#1539", var"#1524#1540"}}, Returns{Bool}, var"#164#169"}¨elements–’§actions’…¦prefix§Float32¨elements‘¤more¤type¥Array¬prefix_short ¨objectid°98ad56d5f22ee7f4Ù!application/vnd.pluto.tree+object’£ptf’…¦prefixÚStateMDPTransitionSampler{Float32, CartPoleState{Float32}, var"#1516#1532"{var"#continuing_step#1530"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}, var"#failure#1527"{Float32, Float32, Float32, Float32}}, 
Vector{Float32}}}¨elements‘’¤step’¥#1516ªtext/plain¤type¦struct¬prefix_short¹StateMDPTransitionSampler¨objectid°1a36bcae07eee980Ù!application/vnd.pluto.tree+object’°initialize_state’°initialize_stateªtext/plain’¦isterm’´Returns{Bool}(false)ªtext/plain’¯is_valid_action’¤#164ªtext/plain’¬action_index’…¦prefix´Dict{Float32, Int64}¨elements‘¤more¤type¤Dict¬prefix_short¤Dict¨objectid°c3e528862580ef49Ù!application/vnd.pluto.tree+object¤type¦struct¬prefix_short¨StateMDP¨objectid°45fb03e2144b629bÙ!application/vnd.pluto.tree+object’ªcontinuous’…¦prefixÚÑContinuousMDP{Float32, CartPoleState{Float32}, Float32, ContinuousMDPTransitionSampler{Float32, CartPoleState{Float32}, Float32, var"#continuing_step#1530"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}, var"#failure#1527"{Float32, Float32, Float32, Float32}}}, var"#initialize_state#1525"{var"#initialize_state#1514#1526"{var"#1521#1537", var"#init_Î¸#1551", var"#1523#1539", var"#1524#1540"}}, Returns{Bool}, Returns{Bool}}¨elements”’£ptf’…¦prefixÙôContinuousMDPTransitionSampler{Float32, CartPoleState{Float32}, Float32, var"#continuing_step#1530"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}, var"#failure#1527"{Float32, Float32, Float32, 
Float32}}}¨elements‘’¤step’¯continuing_stepªtext/plain¤type¦struct¬prefix_short¾ContinuousMDPTransitionSampler¨objectid°5a3a5ccaebd85094Ù!application/vnd.pluto.tree+object’°initialize_state’°initialize_stateªtext/plain’¦isterm’´Returns{Bool}(false)ªtext/plain’¯is_valid_action’³Returns{Bool}(true)ªtext/plain¤type¦struct¬prefix_shortContinuousMDP¨objectid°8a16015925357c5aÙ!application/vnd.pluto.tree+object¤typeªNamedTuple¨objectid°db92a2c629c9c65aÙ!application/vnd.pluto.tree+object¤typeªNamedTuple¨objectid¯2f01e5bd9066e7bÙ!application/vnd.pluto.tree+object’³get_active_features’¥#1549ªtext/plain’¬num_features’¥52488ªtext/plain’¨min_vals’ƒ¨elements”’’¥-50.0ªtext/plain’’¨-1.22173ªtext/plain’’¥-50.0ªtext/plain’’¥-10.0ªtext/plain¤type¥Tuple¨objectid°e43d54ac3f2f06edÙ!application/vnd.pluto.tree+object’¨max_vals’ƒ¨elements”’’¤50.0ªtext/plain’’§1.22173ªtext/plain’’¤50.0ªtext/plain’’¤10.0ªtext/plain¤type¥Tuple¨objectid°670383347ca2d0caÙ!application/vnd.pluto.tree+object¤typeªNamedTuple¨objectid°a4287526702b37b9¤mimeÙ!application/vnd.pluto.tree+object¬rootassignee´const cartpole_setup²last_run_timestampËAÚ•0Ëµ=°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$26880577-d267-4950-8725-7afe0d0402b6¹depends_on_disabled_cellsÂ§runtimeÎ[nµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$a7891c63-18d6-4c1f-ba67-adf7c547d334Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÀ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampË°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$a7891c63-18d6-4c1f-ba67-adf7c547d334¹depends_on_disabled_cellsÃ§runtimeÀµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$44f14d4f-7414-4c6f-883a-042ca261a403Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÀ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampË°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$44f14d4f-7414-4c6f-883a-042ca261a403¹depends_on_disabled_cellsÃ§runtimeÀµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$94354552-9920-4b90-98d9-f75286d1f53eŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ.R 
¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•)øp °persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$94354552-9920-4b90-98d9-f75286d1f53e¹depends_on_disabled_cellsÂ§runtimeÏeup=µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$e5faaa1b-88cb-43e2-8d04-8972b58b4bdaŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÛk9 ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•Yî»°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$e5faaa1b-88cb-43e2-8d04-8972b58b4bda¹depends_on_disabled_cellsÂ§runtimeÎ$Ø.Cµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$70096b14-beab-4f71-9886-6355c749bb8aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ

We previously derived an expression for the gradient of the policy itself in the case of linear action preferences:

$$\begin{flalign} h_a &= \boldsymbol{\theta}^\top \mathbf{x}(s, a) \\ \pi_a &= \frac{e^{h_a}}{\sum_k e^{h_k}} \\ \nabla(\pi_a)_i &= \pi_a \left ( \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \right) \end{flalign}$$

Applying the chain rule to the natural logarithm produces:

$$\nabla \left ( \ln f(\theta) \right) = \frac{\nabla f(\theta)}{f(\theta)} \implies \nabla \left ( \ln f(\theta) \right )_i = \frac{\nabla \left ( f(\theta) \right )_i}{f(\theta)}$$

Applying this to the above expression yields:

$$\begin{flalign} \nabla \left ( \ln \pi_a \right )_i &= \frac{\nabla \left ( \pi_a \right )_i}{\pi_a} \\ &= \frac{\pi_a \left ( \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \right)}{\pi_a} \\ &= \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \end{flalign}$$

which is the per-component version of the desired vector expression (13.9); collecting the components over $i$ gives the full eligibility vector.
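
The result can also be checked numerically. The sketch below, in plain Python, compares the analytic eligibility vector from (13.9) against a central finite-difference gradient of $\ln \pi(a|s, \boldsymbol{\theta})$. The parameter values, feature vectors, and function names are arbitrary choices made for illustration and do not come from the notebook.

```python
import math


def softmax_probs(theta, action_features):
    """Soft-max over linear action preferences h_a = theta . x(s, a)."""
    prefs = [sum(t * x for t, x in zip(theta, feats)) for feats in action_features]
    m = max(prefs)  # subtract the max preference for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def eligibility_vector(theta, action_features, a):
    """Analytic form: x(s, a) - sum_b pi(b|s) x(s, b)."""
    probs = softmax_probs(theta, action_features)
    return [
        action_features[a][i]
        - sum(p * feats[i] for p, feats in zip(probs, action_features))
        for i in range(len(theta))
    ]


def numeric_grad_log_pi(theta, action_features, a, eps=1e-6):
    """Central finite-difference estimate of the gradient of ln pi(a|s)."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        lp_up = math.log(softmax_probs(up, action_features)[a])
        lp_down = math.log(softmax_probs(down, action_features)[a])
        grad.append((lp_up - lp_down) / (2 * eps))
    return grad


# Arbitrary illustrative values: three actions, three features each
theta = [0.3, -0.7, 1.1]
features = [[1.0, 0.0, 2.0], [0.5, 1.0, -1.0], [0.0, -2.0, 0.5]]

analytic = eligibility_vector(theta, features, 1)
numeric = numeric_grad_log_pi(theta, features, 1)
max_error = max(abs(x - y) for x, y in zip(analytic, numeric))
print("largest componentwise difference:", max_error)
```

A small `max_error` indicates that the closed form and the numerical gradient agree for this example; it is a sanity check rather than a proof.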

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôˆiŒ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$70096b14-beab-4f71-9886-6355c749bb8a¹depends_on_disabled_cellsÂ§runtimeÎxÕµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$90d3b96b-ad2b-405c-951b-f48ec7ccf24aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ÷

The final expected value expression (13.5) can be sampled on a step by step basis during an episode since we would have access to both the step count and some unbiased sample of the state-action value.

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô‡&x°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$90d3b96b-ad2b-405c-951b-f48ec7ccf24a¹depends_on_disabled_cellsÂ§runtimeÎµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$700dcbc4-c94c-4287-8cf0-0b2c7a320a3aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements’’´action_probabilities’…¦prefix§Float32¨elements“’’¦0.6152ªtext/plain’’¨0.107868ªtext/plain’’¨0.276932ªtext/plain¤type¥Array¬prefix_short ¨objectid°77c56ee0c0a77e30Ù!application/vnd.pluto.tree+object’´state_value_estimate’¨0.920919ªtext/plain¤typeªNamedTuple¨objectid°8b86b57a11e05d5b¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÀ²last_run_timestampËAÚ•:™J°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$700dcbc4-c94c-4287-8cf0-0b2c7a320a3a¹depends_on_disabled_cellsÂ§runtimeÍl†µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$f59a5dcd-9f4a-4336-a391-e64af35ef799Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ´ ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô—t°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$f59a5dcd-9f4a-4336-a391-e64af35ef799¹depends_on_disabled_cellsÂ§runtimeÎ<üµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$5864a5a3-a5a5-43c2-9cb4-7d13b2d20bedŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ~

Normal Distribution: $f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{(x - \mu)^2}{2 \sigma^2}}$

Consider a new random variable $Y = \tanh(X)$ where $X \sim N(\mu, \sigma^2)$. Using the change of variables theorem from probability theory we can compute the density function of $Y$:

$$f_Y(y) = f_X (g^{-1}(y)) \cdot \left \vert \frac{d}{dy} g^{-1}(y) \right \vert$$

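where $g(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, so $g^{-1}(y) = \tanh^{-1}(y)$ and $\frac{d}{dy} g^{-1}(y) = \frac{1}{1 - y^2}$ for $y \in (-1, 1)$. Therefore $f_Y(y) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{\left (\tanh^{-1}(y) - \mu \right )^2}{2 \sigma^2}} \left \vert \frac{1}{1 - y^2} \right \vert$, where $\mu$ and $\sigma^2$ are the mean and variance of $X$.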
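
As a quick sanity check on this density, the sketch below evaluates $f_Y$ in plain Python and integrates it over $(-1, 1)$ with a midpoint rule; a valid density should integrate to approximately 1. The function names, the grid size, and the choice $\mu = 0$, $\sigma = 1$ are assumptions made for illustration only.

```python
import math


def squashed_gaussian_pdf(y, mu=0.0, sigma=1.0):
    """Density of Y = tanh(X) with X ~ N(mu, sigma^2), via change of variables."""
    if y <= -1.0 or y >= 1.0:
        return 0.0  # tanh maps the real line onto the open interval (-1, 1)
    x = math.atanh(y)  # g^{-1}(y)
    normal_pdf = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(
        2 * math.pi * sigma ** 2
    )
    jacobian = 1.0 / (1.0 - y * y)  # |d/dy atanh(y)|
    return normal_pdf * jacobian


def integrate_midpoint(f, a, b, n):
    """Midpoint rule; it never evaluates f at the endpoints a or b."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


total_mass = integrate_midpoint(squashed_gaussian_pdf, -1.0, 1.0, 20000)
print("integral of f_Y over (-1, 1):", total_mass)
```

The midpoint rule is used because the Jacobian term $\frac{1}{1 - y^2}$ is undefined at $y = \pm 1$, and the rule only samples interior points.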

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô“mÎ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$5864a5a3-a5a5-43c2-9cb4-7d13b2d20bed¹depends_on_disabled_cellsÂ§runtimeÎIÂµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$e3a2fb12-37ce-4c23-ad93-5fc89991aabbŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙf

Eligibility Vector for General Soft-Max and State Feature Vector

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôˆ¦¤°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$e3a2fb12-37ce-4c23-ad93-5fc89991aabb¹depends_on_disabled_cellsÂ§runtimeÎÝqµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$e5c1aca8-7575-4835-8273-e69ca0a55fe8Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ;corridor_parameter_studies (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•&Œ‚°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$e5c1aca8-7575-4835-8273-e69ca0a55fe8¹depends_on_disabled_cellsÂ§runtimeÎˆû¥µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$44b32cc0-36a8-41fd-89bc-ce894536926cŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements’’´action_probabilities’…¦prefix§Float32¨elements’’’¨0.401392ªtext/plain’’¨0.598608ªtext/plain¤type¥Array¬prefix_short ¨objectid°bf6dd01b0c41893fÙ!application/vnd.pluto.tree+object’´state_value_estimate’¨-9.63535ªtext/plain¤typeªNamedTuple¨objectid°d3a6d0a62b65b7a5¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÀ²last_run_timestampËAÚ•&»Yæ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$44b32cc0-36a8-41fd-89bc-ce894536926c¹depends_on_disabled_cellsÂ§runtimeÍFÖµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$646bc853-b7fc-49fa-a201-ff98e8f952d4Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙupdate_traces_with_gradient! 
(generic function with 6 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ• B³)°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$25be5dcf-be63-46c4-b6de-6cf79fa28fd0¹depends_on_disabled_cellsÂ§runtimeÎ=&µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$38acd032-1d18-4760-9111-67c9cdd2e892Š¦queuedÂ¤logs§runningÂ¦output†¤body ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ”ô–V°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$38acd032-1d18-4760-9111-67c9cdd2e892¹depends_on_disabled_cellsÂ§runtimeÍ&dµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$cecc2a35-3850-4f66-84e8-e29da4f3d4b0Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ // We start by putting all the variable interpolation here at the beginning // Publish the plot object to JS let plot_obj = {"layout": {"xaxis": {"title": {"text": "Time(s)"}}, "template": {"layout": {"coloraxis": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "xaxis": {"gridcolor": "white", "zerolinewidth": 2, "title": {"standoff": 15}, "ticks": "", "zerolinecolor": "white", "automargin": true, "linecolor": "white"}, "hovermode": "closest", "paper_bgcolor": "white", "geo": {"showlakes": true, "showland": true, "landcolor": "#E5ECF6", "bgcolor": "white", "subunitcolor": "white", "lakecolor": "white"}, "colorscale": {"sequential": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]], "diverging": [[0, "#8e0152"], [0.1, "#c51b7d"], [0.2, "#de77ae"], [0.3, "#f1b6da"], [0.4, "#fde0ef"], [0.5, "#f7f7f7"], [0.6, "#e6f5d0"], [0.7, "#b8e186"], [0.8, "#7fbc41"], [0.9, "#4d9221"], [1, "#276419"]], "sequentialminus": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], 
[0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}, "yaxis": {"gridcolor": "white", "zerolinewidth": 2, "title": {"standoff": 15}, "ticks": "", "zerolinecolor": "white", "automargin": true, "linecolor": "white"}, "shapedefaults": {"line": {"color": "#2a3f5f"}}, "hoverlabel": {"align": "left"}, "mapbox": {"style": "light"}, "polar": {"angularaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "bgcolor": "#E5ECF6", "radialaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}}, "autotypenumbers": "strict", "font": {"color": "#2a3f5f"}, "ternary": {"baxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "bgcolor": "#E5ECF6", "caxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "aaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}}, "annotationdefaults": {"arrowhead": 0, "arrowwidth": 1, "arrowcolor": "#2a3f5f"}, "plot_bgcolor": "#E5ECF6", "title": {"x": 0.05}, "scene": {"xaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}, "zaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}, "yaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}}, "colorway": ["#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52"]}, "data": {"barpolar": [{"type": "barpolar", "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}}], "carpet": [{"aaxis": {"gridcolor": "white", "endlinecolor": "#2a3f5f", "minorgridcolor": "white", "startlinecolor": "#2a3f5f", "linecolor": "white"}, "type": "carpet", "baxis": {"gridcolor": "white", "endlinecolor": "#2a3f5f", 
"minorgridcolor": "white", "startlinecolor": "#2a3f5f", "linecolor": "white"}}], "scatterpolar": [{"type": "scatterpolar", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "parcoords": [{"line": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "type": "parcoords"}], "scatter": [{"type": "scatter", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram2dcontour": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "histogram2dcontour", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "contour": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "contour", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scattercarpet": [{"type": "scattercarpet", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "mesh3d": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "mesh3d"}], "surface": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "surface", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scattermapbox": [{"type": "scattermapbox", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scattergeo": [{"type": "scattergeo", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram": [{"type": "histogram", 
"marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "pie": [{"type": "pie", "automargin": true}], "choropleth": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "choropleth"}], "heatmapgl": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "heatmapgl", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "bar": [{"type": "bar", "error_y": {"color": "#2a3f5f"}, "error_x": {"color": "#2a3f5f"}, "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}}], "heatmap": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "heatmap", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "contourcarpet": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "contourcarpet"}], "table": [{"type": "table", "header": {"line": {"color": "white"}, "fill": {"color": "#C8D4E3"}}, "cells": {"line": {"color": "white"}, "fill": {"color": "#EBF0F8"}}}], "scatter3d": [{"line": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "type": "scatter3d", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scattergl": [{"type": "scattergl", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram2d": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "histogram2d", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], 
[0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scatterternary": [{"type": "scatterternary", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scatterpolargl": [{"type": "scatterpolargl", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}]}}, "legend": {"orientation": "h"}, "margin": {"l": 50, "b": 50, "r": 50, "t": 60}, "yaxis": {"title": {"text": "Horizontal Position"}}, "yaxis2": {"overlaying": "y", "title": "Pole Angle (Radians)", "side": "right"}}, "config": {"showLink": false, "editable": false, "responsive": true, "staticPlot": false, "scrollZoom": true}, "frames": [], "data": [{"y": [0.0, -1.8645987e-5, -0.022929357, -0.045886353, -0.023194335, -0.0005510263, 0.022051038, 0.044617552, 0.06715512, 0.13538401, 0.20361239, 0.22614819, 0.20299889, 0.13415128, 0.06529444, -0.0035927966, -0.09537905, -0.16440776, -0.23353508, -0.3255812, -0.39498538, -0.41907012, -0.39791384, -0.35427064, -0.2881403, -0.1767683, -0.06560569, 0.045367204, 0.15615189, 0.26676503, 0.37720996, 0.44188887, 0.5064011, 0.61633414, 0.7489039, 0.90414107, 1.1049118, 1.3056073, 1.4605527, 1.6154549, 1.7703285, 1.9023333, 2.011477, 2.1206179, 2.275472, 2.4303405, 2.539529, 2.6258953, 2.735144, 2.867276, 2.97659, 3.0402324, 3.1038985, 3.2132893, 3.3226922, 3.4321096, 3.5415447, 3.6052852, 3.6690316, 3.755628, 3.8422112, 3.9516323, 4.106752, 4.2618747, 4.371306, 4.435049, 4.4988036, 4.562556, 4.6262918, 4.73571, 4.8679576, 5.0230465, 5.223851, 5.424685, 5.579887, 5.6894994, 5.799211, 5.9090276, 5.99612, 6.060486, 6.1021104, 6.1666603, 6.2312856, 6.2731314, 6.292176, 6.31125, 6.376049, 6.486568, 6.597103, 6.7076654, 6.8411207, 6.951783, 7.0625157, 7.219008, 7.375595, 7.48668, 7.5523143, 7.6181126, 7.684078, 7.7502127, 7.816523, 7.8374333, 7.8129177, 7.788544, 7.764289, 7.7401423, 7.716083, 7.6463995, 7.5767508, 7.5528245, 7.574607, 7.642098, 7.755311, 7.8685584, 7.9590197, 8.026736, 8.094562, 8.162517, 8.185008, 8.184859, 8.162067, 8.093802, 8.025659, 8.003284, 
8.003853, 7.9817004, 7.936823, 7.8692017, 7.8016596, 7.779879, 7.80386, 7.8279204, 7.806403, 7.78499, 7.7636843, 7.742489, 7.7214103, 7.654795, 7.588292, 7.5447288, 7.5241046, 7.503594, 7.4375525, 7.3716316, 7.305827, 7.240137, 7.174559, 7.1090927, 7.089399, 7.069826, 7.004761, 6.9398437, 6.875079, 6.7648544, 6.60914, 6.453538, 6.343704, 6.2339664, 6.1243205, 6.0147614, 5.9052863, 5.818736, 5.709429, 5.6002064, 5.491068, 5.382013, 5.3187256, 5.2555337, 5.1467957, 4.9925194, 4.838351, 4.6842756, 4.530287, 4.3763723, 4.176821, 3.9773064, 3.8235195, 3.669739, 3.5159564, 3.3850207, 3.2312167, 3.0773928, 2.9235408, 2.769651, 2.6614213, 2.598863, 2.5362892, 2.4508643, 2.3426044, 2.2343729, 2.1261773, 1.9723158, 1.7727747, 1.5275265, 1.2822561, 1.0826205, 0.928563, 0.8200561, 0.71143824, 0.6027142, 0.49388582, 0.38495576, 0.29876402, 0.18964016, 0.057584107, -0.051734634, -0.16115624, -0.270684, -0.33464965, -0.39871112, -0.46285516, -0.527074, -0.61420643, -0.67855245, -0.74295354, -0.8531072, -0.98616576, -1.0964469, -1.2068076, -1.3172601, -1.4278116, -1.5384762, -1.6036144, -1.6688783, -1.757092, -1.8226123, -1.8882642, -1.9540497, -2.0199702, -2.0860305, -2.1065993, -2.1273015, -2.1709533, -2.191903, -2.212972, -2.256985, -2.278282, -2.2540185, -2.2298415, -2.2514281, -2.318779, -2.4090514, -2.4765909, -2.5442452, -2.6120355, -2.634371, -2.611266, -2.5426855, -2.4513898, -2.383013, -2.3603988, -2.337854, -2.3153737, -2.3386526, -2.3620014, -2.385429, -2.4089506, -2.386911, -2.3649797, -2.3431559, -2.3214393, -2.299831, -2.2326608, -2.1655843, -2.0985863, -2.0316575, -2.0104876, -2.0350761, -2.0825791, -2.107349, -2.0866106, -2.066016, -2.0911744, -2.1392574, -2.1647384, -2.1449573, -2.0799887, -1.969851, -1.8145049, -1.613857, -1.4133584, -1.2585917, -1.1495901, -1.0407465, -0.90925497, -0.7551049, -0.5782754, -0.42439175, -0.27062166, -0.07128258, 0.12798035, 0.30433846, 0.45779777, 0.61121714, 0.7646016, 0.8722452, 0.97984815, 1.1102513, 1.2177473, 1.3251743, 
1.4782076, 1.6768526, 1.8754377, 2.0739822, 2.318216, 2.6081653, 2.898141, 3.1425116, 3.3413675, 3.4947717, 3.6027434, 3.710875, 3.819157, 3.881948, 3.8991861, 3.9164896, 3.9795377, 4.0654726, 4.1514297, 4.21454, 4.2776437, 4.3407235, 4.4037595, 4.5124316, 4.666733, 4.8209786, 4.975182, 5.175068, 5.3749514, 5.574855, 5.7748046, 5.9291406, 6.0378923, 6.1467385, 6.301356, 6.4560766, 6.5652604, 6.62891, 6.692666, 6.7565126, 6.774744, 6.793021, 6.8113174, 6.8296103, 6.8935847, 7.0032225, 7.158531, 7.313822, 7.4691133, 7.6244264, 7.7340727, 7.843767, 7.953515, 8.063321, 8.173192, 8.283133, 8.415995, 8.526107, 8.6134815, 8.723794, 8.834218, 8.921934, 8.98695, 9.05209, 9.11735, 9.182729, 9.248226, 9.313843, 9.425231, 9.536748, 9.625601, 9.691819, 9.712618, 9.710777, 9.686263, 9.66187, 9.683248, 9.704732, 9.680645, 9.656641, 9.6327, 9.608809, 9.630663, 9.652552, 9.674477, 9.719294, 9.741303, 9.717653, 9.694035, 9.716148, 9.738283, 9.760441, 9.805481, 9.827703, 9.849967, 9.872275, 9.894631, 9.939892, 10.008061, 10.076311, 10.098997, 10.1217985, 10.144724, 10.167779, 10.190976, 10.168705, 10.1465845, 10.124613, 10.102789, 10.126733, 10.173624, 10.197894, 10.17682, 10.110425, 9.998679, 9.841503, 9.684438, 9.52745, 9.370525, 9.259346, 9.193909, 9.174213, 9.154566, 9.089285, 9.024073, 9.004623, 9.030927, 9.057321, 9.08381, 9.11043, 9.091724, 9.027802, 8.918723, 8.764486, 8.565034, 8.365782, 8.189479, 8.013363, 7.860214, 7.7072477, 7.5088744, 7.265038, 7.021312, 6.777666, 6.5340853, 6.336252, 6.1841655, 6.077826, 5.971545, 5.8196597, 5.6221848, 5.4247947, 5.2274795, 5.0073824, 4.8101892, 4.613042, 4.3702226, 4.127419, 3.9074671, 3.710352, 3.5132103, 3.2703269, 3.0273921, 2.8072305, 2.6098146, 2.457968, 2.351708, 2.2453794, 2.1390057, 2.0326037, 1.9261947, 1.8197982, 1.6905794, 1.5842582, 1.4780027, 1.3718272, 1.2657576, 1.1141831, 0.96275413, 0.8114837, 0.6147864, 0.37265286, 0.08502662, -0.20250261, -0.48996568, -0.82308984, -1.1790636, -1.512204, -1.7997215, -2.0416884, 
-2.2837818, -2.5260215, -2.7228296, -2.8742185, -2.980151, -3.086207, -3.23803, -3.4356277, -3.6333435, -3.8311834, -4.029162, -4.1816707, -4.334332, -4.5327606, -4.731348, -4.930098, -5.1517916, -5.350927, -5.52755, -5.6817102, -5.7907405, -5.854638, -5.8733435, -5.8695016, -5.888594, -5.953426, -6.018442, -6.038055, -6.0578256, -6.077734, -6.0521173, -6.0265875, -6.046815, -6.1128035, -6.2245526, -6.3363833, -6.4483085, -6.560353, -6.672525, -6.8303995, -6.988398, -7.10112, -7.1687007, -7.2365303, -7.304627, -7.3277855, -7.3060665, -7.239475, -7.173194, -7.107218, -6.9962935, -6.840341, -6.6846237, -6.574542, -6.46471, -6.3551297, -6.2457957, -6.0912538, -5.891438, -5.691799, -5.5378947, -5.3841634, -5.1849947, -4.985962, -4.7870407, -4.5882206, -4.4351687, -4.327886, -4.2207017, -4.090797, -3.961014, -3.8085406, -3.6562016, -3.5040014, -3.3063056, -3.1087403, -2.934123, -2.782454, -2.6537333, -2.5023386, -2.3054826, -2.1087809, -1.9122297, -1.6701969, -1.4282925, -1.1864984, -0.944806, -0.72604287, -0.4845205, -0.24307097, -0.0473824, 0.12539136, 0.3209296, 0.51638067, 0.71173584, 0.95266193, 1.2163333, 1.5027755, 1.789164, 2.0298085, 2.270422, 2.5567236, 2.8430214, 3.0836217, 3.3242445, 3.564895, 3.8055782, 4.0463004, 4.2870665, 4.5278835, 4.7687573, 5.0096955, 5.2050095, 5.400387, 5.641522, 5.882722, 6.123994, 6.3653502, 6.606798, 6.8940086, 7.1813316, 7.4231896, 7.619634, 7.770668, 7.9218636, 8.073206, 8.22469, 8.421948, 8.619347, 8.771265, 8.900506, 9.007041, 9.0908375, 9.197549, 9.350019, 9.50255, 9.65514, 9.807787, 9.960491, 10.158953, 10.403167, 10.647463, 10.846238, 10.999553, 11.107423, 11.215446, 11.369241, 11.52319, 11.677296, 11.8315735, 11.94046, 12.003951, 12.067605, 12.177026, 12.332209, 12.487555, 12.6203, 12.753241, 12.863646, 12.951537, 13.062389, 13.218885, 13.375585, 13.487214, 13.5990925, 13.711238, 13.77856, 13.8011465, 13.7790365, 13.712215, 13.645724, 13.579557, 13.468581, 13.312708, 13.157091, 13.024375, 12.869243, 12.714358, 12.605137, 
12.49617, 12.34203, 12.188131, 12.034465, 11.835533, 11.591249, 11.347103, 11.103061, 10.836275, 10.59239, 10.394253, 10.24186, 10.135214, 10.028622, 9.899256, 9.747138, 9.549434, 9.351813, 9.19996, 9.048188, 8.850821, 8.607839, 8.364908, 8.167719, 7.970556, 7.727698, 7.46198, 7.1962347, 6.9532814, 6.733088, 6.5356255, 6.383711, 6.2773614, 6.1709313, 6.0644436, 5.9579144, 5.8513637, 5.790524, 5.7297096, 5.623258, 5.471214, 5.3192677, 5.167427, 4.9700317, 4.72706, 4.4613104, 4.218454, 3.998476, 3.8013659, 3.6271176, 3.4300184, 3.1872046, 2.9443703, 2.7472007, 2.5728319, 2.4212656, 2.269658, 2.0723076, 1.8292117, 1.5860579, 1.3656662, 1.1223476, 0.87891394, 0.6809477, 0.52840495, 0.42129087, 0.3596588, 0.34359628, 0.3274892, 0.2656523, 0.18095392, 0.0734135, -0.07981146, -0.27872625, -0.4776377, -0.6765644, -0.8755267, -1.0745409, -1.2736295, -1.427141, -1.5807598, -1.7344948, -1.8426976, -1.9510179, -2.0822816, -2.2364883, -2.4364555, -2.6365485, -2.7912173, -2.9005198, -3.010019, -3.1197193, -3.1840816, -3.24864, -3.3133855, -3.3327315, -3.352233, -3.4175007, -3.4829164, -3.5484798, -3.614192, -3.680054, -3.7916791, -3.9034624, -3.969863, -3.99091, -3.96658, -3.9424124, -3.918385, -3.8944912, -3.870711, -3.847036, -3.869132, -3.936998, -4.004971, -4.0502467, -4.1184688, -4.1868424, -4.2098255, -4.187454, -4.1652784, -4.1432943, -4.121502, -4.099901, -4.0329285, -3.9661343, -3.9223049, -3.9014416, -3.880755, -3.8146782, -3.748783, -3.6830654, -3.5719285, -3.4153166, -3.2588098, -3.1480637, -3.0374002, -2.9268103, -2.8391314, -2.774362, -2.7553518, -2.7364206, -2.6719248, -2.5618849, -2.4291193, -2.2964444, -2.1638472, -2.0541654, -1.9445462, -1.8349862, -1.7254812, -1.6160281, -1.5523285, -1.4886901, -1.3794353, -1.2245612, -1.0697435, -0.9606747, -0.85164815, -0.69695425, -0.49657103, -0.29618263, -0.14146921, -0.032405302, 0.07671446, 0.23159096, 0.38653728, 0.49588817, 0.5596485, 0.57778823, 0.59596956, 0.65988135, 0.76951563, 0.92487574, 1.0802667, 1.190006, 
1.2769548, 1.3411089, 1.40531, 1.4695469, 1.488096, 1.5066452, 1.5708824, 1.635084, 1.699239, 1.8090404, 1.9416435, 2.0513544, 2.1610327, 2.2706828, 2.3803096, 2.4899173, 2.5537958, 2.5947924, 2.6585956, 2.745188, 2.8545609, 3.0095692, 3.2102442, 3.4109101, 3.565883, 3.7208934, 3.8759608, 3.985413, 4.0492563, 4.113168, 4.2228374, 4.3325686, 4.419517, 4.4836783, 4.5478926, 4.6350036, 4.7450104, 4.9007683, 5.056592, 5.1668296, 5.2315035, 5.2505984, 5.2697673, 5.3346925, 5.4453745, 5.556123, 5.621259, 5.6864743, 5.7517657, 5.7714353, 5.7911596, 5.810921, 5.830705, 5.896211, 5.9617243, 5.981529, 6.0013256, 6.06681, 6.1779838, 6.2891493, 6.354608, 6.420076, 6.5312676, 6.6424847, 6.7080355, 6.773631, 6.8621254, 6.927823, 6.9935784, 7.0593944, 7.0795712, 7.0540867, 7.0286183, 7.0488553, 7.091926, 7.157825, 7.2236996, 7.289553, 7.355389, 7.4212103, 7.5098777, 7.6214027, 7.7329464, 7.821673, 7.887598, 7.953579, 8.01962, 8.085726, 8.151902, 8.218153, 8.284487, 8.305224, 8.326042, 8.392626, 8.459291, 8.480364, 8.478677, 8.454208, 8.406931, 8.382527, 8.403838, 8.47086, 8.583603, 8.696378, 8.763515, 8.785043, 8.760956, 8.736926, 8.758646, 8.826115, 8.939333, 9.052623, 9.120362, 9.142606, 9.164995, 9.187535, 9.164613, 9.1190195, 9.050722, 8.982526, 8.937251, 8.914897, 8.915462, 8.938946, 8.962517, 8.940514, 8.895774, 8.873967, 8.852254, 8.78495, 8.717723, 8.650556, 8.583438, 8.562066, 8.586441, 8.610873, 8.635377, 8.659976, 8.63904, 8.57259, 8.506265, 8.485714, 8.46529, 8.399364, 8.333575, 8.313559, 8.293684, 8.251159, 8.186004, 8.075423, 7.9193773, 7.7634487, 7.6076093, 7.451846, 7.3418393], "type": "scatter", "name": "x", "yaxis": "y", "x": [0.0, 0.04, 0.08, 0.12, 0.16, 0.19999999, 0.23999998, 0.27999997, 0.31999996, 0.35999995, 0.39999995, 0.43999994, 0.47999993, 0.5199999, 0.55999994, 0.59999996, 0.64, 0.68, 0.72, 0.76000005, 0.8000001, 0.8400001, 0.8800001, 0.92000014, 0.96000016, 1.0000001, 1.0400001, 1.08, 1.12, 1.16, 1.1999999, 1.2399999, 1.2799999, 1.3199998, 1.3599998, 
1.3999997, 1.4399997, 1.4799997, 1.5199996, 1.5599996, 1.5999995, 1.6399995, 1.6799995, 1.7199994, 1.7599994, 1.7999994, 1.8399993, 1.8799993, 1.9199992, 1.9599992, 1.9999992, 2.0399992, 2.0799992, 2.1199992, 2.1599991, 2.199999, 2.239999, 2.279999, 2.319999, 2.359999, 2.399999, 2.4399989, 2.4799988, 2.5199988, 2.5599988, 2.5999987, 2.6399987, 2.6799986, 2.7199986, 2.7599986, 2.7999985, 2.8399985, 2.8799984, 2.9199984, 2.9599984, 2.9999983, 3.0399983, 3.0799983, 3.1199982, 3.1599982, 3.1999981, 3.239998, 3.279998, 3.319998, 3.359998, 3.399998, 3.439998, 3.4799979, 3.5199978, 3.5599978, 3.5999978, 3.6399977, 3.6799977, 3.7199976, 3.7599976, 3.7999976, 3.8399975, 3.8799975, 3.9199975, 3.9599974, 3.9999974, 4.0399976, 4.0799975, 4.1199975, 4.1599975, 4.1999974, 4.2399974, 4.2799973, 4.3199973, 4.3599973, 4.399997, 4.439997, 4.479997, 4.519997, 4.559997, 4.599997, 4.639997, 4.679997, 4.719997, 4.759997, 4.799997, 4.839997, 4.879997, 4.9199967, 4.9599967, 4.9999967, 5.0399966, 5.0799966, 5.1199965, 5.1599965, 5.1999965, 5.2399964, 5.2799964, 5.3199964, 5.3599963, 5.3999963, 5.4399962, 5.479996, 5.519996, 5.559996, 5.599996, 5.639996, 5.679996, 5.719996, 5.759996, 5.799996, 5.839996, 5.879996, 5.919996, 5.9599957, 5.9999957, 6.0399957, 6.0799956, 6.1199956, 6.1599956, 6.1999955, 6.2399955, 6.2799954, 6.3199954, 6.3599954, 6.3999953, 6.4399953, 6.4799953, 6.519995, 6.559995, 6.599995, 6.639995, 6.679995, 6.719995, 6.759995, 6.799995, 6.839995, 6.879995, 6.919995, 6.959995, 6.9999948, 7.0399947, 7.0799947, 7.1199946, 7.1599946, 7.1999946, 7.2399945, 7.2799945, 7.3199944, 7.3599944, 7.3999944, 7.4399943, 7.4799943, 7.5199943, 7.559994, 7.599994, 7.639994, 7.679994, 7.719994, 7.759994, 7.799994, 7.839994, 7.879994, 7.919994, 7.959994, 7.999994, 8.039994, 8.079994, 8.119994, 8.159994, 8.199994, 8.239994, 8.279994, 8.319994, 8.359994, 8.399994, 8.439994, 8.479994, 8.519994, 8.559994, 8.599994, 8.639994, 8.679994, 8.719994, 8.759994, 8.7999935, 8.8399935, 8.879993, 8.919993, 
8.959993, 8.999993, 9.039993, 9.079993, 9.119993, 9.159993, 9.199993, 9.239993, 9.279993, 9.319993, 9.359993, 9.399993, 9.439993, 9.479993, 9.519993, 9.559993, 9.599993, 9.639993, 9.679993, 9.719993, 9.759993, 9.799993, 9.839993, 9.8799925, 9.919992, 9.959992, 9.999992, 10.039992, 10.079992, 10.119992, 10.159992, 10.199992, 10.239992, 10.279992, 10.319992, 10.359992, 10.399992, 10.439992, 10.479992, 10.519992, 10.559992, 10.599992, 10.639992, 10.679992, 10.719992, 10.759992, 10.799992, 10.839992, 10.879992, 10.9199915, 10.959991, 10.999991, 11.039991, 11.079991, 11.119991, 11.159991, 11.199991, 11.239991, 11.279991, 11.319991, 11.359991, 11.399991, 11.439991, 11.479991, 11.519991, 11.559991, 11.599991, 11.639991, 11.679991, 11.719991, 11.759991, 11.799991, 11.839991, 11.879991, 11.919991, 11.9599905, 11.99999, 12.03999, 12.07999, 12.11999, 12.15999, 12.19999, 12.23999, 12.27999, 12.31999, 12.35999, 12.39999, 12.43999, 12.47999, 12.51999, 12.55999, 12.59999, 12.63999, 12.67999, 12.71999, 12.75999, 12.79999, 12.83999, 12.87999, 12.91999, 12.95999, 12.9999895, 13.039989, 13.079989, 13.119989, 13.159989, 13.199989, 13.239989, 13.279989, 13.319989, 13.359989, 13.399989, 13.439989, 13.479989, 13.519989, 13.559989, 13.599989, 13.639989, 13.679989, 13.719989, 13.759989, 13.799989, 13.839989, 13.879989, 13.919989, 13.959989, 13.999989, 14.0399885, 14.0799885, 14.119988, 14.159988, 14.199988, 14.239988, 14.279988, 14.319988, 14.359988, 14.399988, 14.439988, 14.479988, 14.519988, 14.559988, 14.599988, 14.639988, 14.679988, 14.719988, 14.759988, 14.799988, 14.839988, 14.879988, 14.919988, 14.959988, 14.999988, 15.039988, 15.079988, 15.1199875, 15.159987, 15.199987, 15.239987, 15.279987, 15.319987, 15.359987, 15.399987, 15.439987, 15.479987, 15.519987, 15.559987, 15.599987, 15.639987, 15.679987, 15.719987, 15.759987, 15.799987, 15.839987, 15.879987, 15.919987, 15.959987, 15.999987, 16.039988, 16.079988, 16.11999, 16.15999, 16.199991, 16.239992, 16.279993, 16.319994, 16.359995, 
16.399996, 16.439997, 16.479998, 16.519999, 16.56, 16.6, 16.640001, 16.680002, 16.720003, 16.760004, 16.800005, 16.840006, 16.880007, 16.920008, 16.960009, 17.00001, 17.04001, 17.080011, 17.120012, 17.160013, 17.200014, 17.240015, 17.280016, 17.320017, 17.360018, 17.400019, 17.44002, 17.48002, 17.520021, 17.560022, 17.600023, 17.640024, 17.680025, 17.720026, 17.760027, 17.800028, 17.840029, 17.88003, 17.92003, 17.960032, 18.000032, 18.040033, 18.080034, 18.120035, 18.160036, 18.200037, 18.240038, 18.280039, 18.32004, 18.36004, 18.400042, 18.440042, 18.480043, 18.520044, 18.560045, 18.600046, 18.640047, 18.680048, 18.720049, 18.76005, 18.80005, 18.840052, 18.880053, 18.920053, 18.960054, 19.000055, 19.040056, 19.080057, 19.120058, 19.160059, 19.20006, 19.24006, 19.280062, 19.320063, 19.360064, 19.400064, 19.440065, 19.480066, 19.520067, 19.560068, 19.600069, 19.64007, 19.68007, 19.720072, 19.760073, 19.800074, 19.840075, 19.880075, 19.920076, 19.960077, 20.000078, 20.04008, 20.08008, 20.12008, 20.160082, 20.200083, 20.240084, 20.280085, 20.320086, 20.360086, 20.400087, 20.440088, 20.48009, 20.52009, 20.560091, 20.600092, 20.640093, 20.680094, 20.720095, 20.760096, 20.800097, 20.840097, 20.880098, 20.9201, 20.9601, 21.000101, 21.040102, 21.080103, 21.120104, 21.160105, 21.200106, 21.240107, 21.280107, 21.320108, 21.36011, 21.40011, 21.440111, 21.480112, 21.520113, 21.560114, 21.600115, 21.640116, 21.680117, 21.720118, 21.760118, 21.80012, 21.84012, 21.880121, 21.920122, 21.960123, 22.000124, 22.040125, 22.080126, 22.120127, 22.160128, 22.200129, 22.24013, 22.28013, 22.320131, 22.360132, 22.400133, 22.440134, 22.480135, 22.520136, 22.560137, 22.600138, 22.640139, 22.68014, 22.72014, 22.760141, 22.800142, 22.840143, 22.880144, 22.920145, 22.960146, 23.000147, 23.040148, 23.080149, 23.12015, 23.16015, 23.200151, 23.240152, 23.280153, 23.320154, 23.360155, 23.400156, 23.440157, 23.480158, 23.520159, 23.56016, 23.60016, 23.640162, 23.680162, 23.720163, 23.760164, 
23.800165, 23.840166, 23.880167, 23.920168, 23.960169, 24.00017, 24.04017, 24.080172, 24.120173, 24.160173, 24.200174, 24.240175, 24.280176, 24.320177, 24.360178, 24.400179, 24.44018, 24.48018, 24.520182, 24.560183, 24.600183, 24.640184, 24.680185, 24.720186, 24.760187, 24.800188, 24.840189, 24.88019, 24.92019, 24.960192, 25.000193, 25.040194, 25.080194, 25.120195, 25.160196, 25.200197, 25.240198, 25.2802, 25.3202, 25.3602, 25.400202, 25.440203, 25.480204, 25.520205, 25.560205, 25.600206, 25.640207, 25.680208, 25.72021, 25.76021, 25.80021, 25.840212, 25.880213, 25.920214, 25.960215, 26.000216, 26.040216, 26.080217, 26.120218, 26.16022, 26.20022, 26.240221, 26.280222, 26.320223, 26.360224, 26.400225, 26.440226, 26.480227, 26.520227, 26.560228, 26.60023, 26.64023, 26.680231, 26.720232, 26.760233, 26.800234, 26.840235, 26.880236, 26.920237, 26.960238, 27.000238, 27.04024, 27.08024, 27.120241, 27.160242, 27.200243, 27.240244, 27.280245, 27.320246, 27.360247, 27.400248, 27.440248, 27.48025, 27.52025, 27.560251, 27.600252, 27.640253, 27.680254, 27.720255, 27.760256, 27.800257, 27.840258, 27.880259, 27.92026, 27.96026, 28.000261, 28.040262, 28.080263, 28.120264, 28.160265, 28.200266, 28.240267, 28.280268, 28.320269, 28.36027, 28.40027, 28.440271, 28.480272, 28.520273, 28.560274, 28.600275, 28.640276, 28.680277, 28.720278, 28.760279, 28.80028, 28.84028, 28.880281, 28.920282, 28.960283, 29.000284, 29.040285, 29.080286, 29.120287, 29.160288, 29.200289, 29.24029, 29.28029, 29.320292, 29.360292, 29.400293, 29.440294, 29.480295, 29.520296, 29.560297, 29.600298, 29.640299, 29.6803, 29.7203, 29.760302, 29.800303, 29.840303, 29.880304, 29.920305, 29.960306, 30.000307, 30.040308, 30.080309, 30.12031, 30.16031, 30.200312, 30.240313, 30.280313, 30.320314, 30.360315, 30.400316, 30.440317, 30.480318, 30.520319, 30.56032, 30.60032, 30.640322, 30.680323, 30.720324, 30.760324, 30.800325, 30.840326, 30.880327, 30.920328, 30.96033, 31.00033, 31.04033, 31.080332, 31.120333, 31.160334, 
31.200335, 31.240335, 31.280336, 31.320337, 31.360338, 31.40034, 31.44034, 31.480341, 31.520342, 31.560343, 31.600344, 31.640345, 31.680346, 31.720346, 31.760347, 31.800348, 31.84035, 31.88035, 31.920351, 31.960352, 32.00035, 32.04035, 32.080353, 32.120354, 32.160355, 32.200356, 32.240356, 32.280357, 32.32036, 32.36036, 32.40036, 32.44036, 32.480362, 32.520363, 32.560364, 32.600365, 32.640366, 32.680367, 32.720367, 32.76037, 32.80037, 32.84037, 32.88037, 32.920372, 32.960373, 33.000374, 33.040375, 33.080376, 33.120377, 33.160378, 33.20038, 33.24038, 33.28038, 33.32038, 33.360382, 33.400383, 33.440384, 33.480385, 33.520386, 33.560387, 33.600388, 33.64039, 33.68039, 33.72039, 33.76039, 33.800392, 33.840393, 33.880394, 33.920395, 33.960396, 34.000397, 34.040398, 34.0804, 34.1204, 34.1604, 34.2004, 34.240402, 34.280403, 34.320404, 34.360405, 34.400406, 34.440407, 34.480408, 34.52041, 34.56041, 34.60041, 34.64041, 34.680412, 34.720413, 34.760414, 34.800415, 34.840416, 34.880417, 34.920418, 34.96042, 35.00042, 35.04042, 35.08042, 35.120422, 35.160423, 35.200424, 35.240425, 35.280426, 35.320427, 35.360428, 35.40043, 35.44043, 35.48043, 35.52043, 35.560432, 35.600433, 35.640434, 35.680435, 35.720436, 35.760437, 35.800438, 35.84044, 35.88044, 35.92044, 35.96044, 36.000443, 36.040443, 36.080444, 36.120445, 36.160446, 36.200447, 36.240448, 36.28045, 36.32045, 36.36045, 36.40045, 36.440453, 36.480453, 36.520454, 36.560455, 36.600456, 36.640457, 36.680458, 36.72046, 36.76046, 36.80046, 36.84046, 36.880463, 36.920464, 36.960464, 37.000465, 37.040466, 37.080467, 37.12047, 37.16047, 37.20047, 37.24047, 37.28047, 37.320473, 37.360474, 37.400475, 37.440475, 37.480476, 37.520477, 37.56048, 37.60048, 37.64048, 37.68048, 37.72048, 37.760483, 37.800484, 37.840485, 37.880486, 37.920486, 37.960487, 38.00049, 38.04049, 38.08049, 38.12049, 38.160492, 38.200493, 38.240494, 38.280495, 38.320496, 38.360497, 38.400497, 38.4405, 38.4805, 38.5205, 38.5605, 38.600502, 38.640503, 38.680504, 
38.720505, 38.760506, 38.800507, 38.840508, 38.88051, 38.92051, 38.96051, 39.00051, 39.040512, 39.080513, 39.120514, 39.160515, 39.200516, 39.240517, 39.280518, 39.32052, 39.36052, 39.40052, 39.44052, 39.480522, 39.520523, 39.560524, 39.600525, 39.640526, 39.680527, 39.720528, 39.76053, 39.80053, 39.84053, 39.88053, 39.920532, 39.960533]}, {"y": [0.05, 0.050205365, 0.06224224, 0.0747987, 0.065162905, 0.056069747, 0.04743547, 0.03919608, 0.031277858, 0.00075507164, -0.029761303, -0.037659604, -0.023010574, 0.01431556, 0.051763266, 0.08961947, 0.13960168, 0.17942575, 0.22065093, 0.27481857, 0.32026565, 0.34665233, 0.35440493, 0.35432476, 0.34641054, 0.3198583, 0.29594505, 0.27437213, 0.25506857, 0.23780292, 0.22250609, 0.23126327, 0.24192622, 0.23239662, 0.21366568, 0.18549709, 0.1363802, 0.08842297, 0.06395237, 0.039995678, 0.016375918, 0.0043235607, 0.003743016, 0.003193365, -0.02020235, -0.043768235, -0.044842925, -0.034863014, -0.036598757, -0.05006402, -0.05252134, -0.03256936, -0.012891032, -0.016188297, -0.019619746, -0.023212792, -0.026998306, -0.008140886, 0.010648154, 0.018088633, 0.025677722, 0.022045769, -0.004272338, -0.030626481, -0.034370102, -0.0155366175, 0.0031663869, 0.021895057, 0.040807545, 0.037201513, 0.022474758, -0.0035022777, -0.05237659, -0.10169783, -0.12910135, -0.13489065, -0.14178333, -0.14985067, -0.14784579, -0.1357461, -0.11344425, -0.10343327, -0.094266765, -0.07449753, -0.043929234, -0.013732521, -0.0065170866, -0.022226907, -0.03812291, -0.054328553, -0.08240104, -0.09976192, -0.11792992, -0.15974927, -0.2029215, -0.22537017, -0.2273989, -0.23128062, -0.23706017, -0.24476211, -0.25447655, -0.24417375, -0.21370578, -0.18502752, -0.15781939, -0.13193637, -0.107108995, -0.060422048, -0.014250103, 0.008939279, 0.009330352, -0.013073645, -0.05845124, -0.104326844, -0.13966656, -0.16484667, -0.19134222, -0.21944101, -0.22703901, -0.22536422, -0.21439536, -0.18285766, -0.1528565, -0.14667472, -0.15299445, -0.14927594, -0.13548607, 
-0.11147166, -0.08839392, -0.088810675, -0.11272425, -0.13758545, -0.1409294, -0.14542858, -0.15112981, -0.15806605, -0.16631058, -0.15337867, -0.14172173, -0.14253174, -0.15581273, -0.17039035, -0.16384132, -0.15864584, -0.15474844, -0.15212725, -0.15075456, -0.15062255, -0.17430225, -0.19944166, -0.20381862, -0.2098563, -0.21762273, -0.20486659, -0.17141603, -0.13940877, -0.13116103, -0.12398481, -0.11783616, -0.11265293, -0.108402275, -0.11640365, -0.11400856, -0.11255479, -0.11202775, -0.11242413, -0.13644929, -0.16162021, -0.1655568, -0.14830759, -0.13229582, -0.117355466, -0.103394404, -0.090273805, -0.055117715, -0.02042894, -0.008772474, 0.0028123865, 0.014420708, 0.01471217, 0.026559092, 0.038627803, 0.051011633, 0.06382151, 0.054330673, 0.022447668, -0.009245232, -0.029577887, -0.038724385, -0.048187807, -0.05805275, -0.04556115, -0.010595109, 0.04715126, 0.10530118, 0.14156973, 0.15636508, 0.14986406, 0.14460231, 0.14052469, 0.13760729, 0.13581985, 0.12382671, 0.124190696, 0.13691449, 0.13944367, 0.14311694, 0.14797243, 0.1314321, 0.11598901, 0.10148663, 0.08783152, 0.08628196, 0.074048094, 0.06243201, 0.0741491, 0.09788033, 0.11104393, 0.12510982, 0.14021935, 0.15646371, 0.17401308, 0.17048015, 0.16835105, 0.1788588, 0.17959349, 0.18179998, 0.18550234, 0.19071966, 0.19750945, 0.18351334, 0.17104049, 0.17121336, 0.16152781, 0.15318057, 0.15737404, 0.15157188, 0.12441519, 0.098305345, 0.0957467, 0.1167256, 0.15001938, 0.17325647, 0.19787866, 0.22415598, 0.22999613, 0.21548, 0.18040812, 0.13559216, 0.10321649, 0.09441781, 0.08639128, 0.079082616, 0.095213026, 0.11214, 0.12997414, 0.14889532, 0.14643139, 0.14517416, 0.14511046, 0.14623995, 0.14857438, 0.12951991, 0.11154886, 0.09448002, 0.078201786, 0.085357174, 0.11598749, 0.1589279, 0.19190356, 0.20401822, 0.21778141, 0.25557867, 0.3064961, 0.3490549, 0.37300712, 0.37875316, 0.36639085, 0.33571056, 0.28620595, 0.23910227, 0.21607253, 0.21707936, 0.21986513, 0.21330555, 0.19731848, 0.17175871, 0.15886302, 
0.14725651, 0.11424364, 0.082195975, 0.062204696, 0.054138854, 0.046516594, 0.039281238, 0.055219755, 0.07162072, 0.077202015, 0.09481165, 0.11321539, 0.109829105, 0.08461552, 0.060114373, 0.03609631, -0.010482322, -0.08000752, -0.15021858, -0.19904372, -0.22709252, -0.23474321, -0.22210242, -0.2112967, -0.20220281, -0.17238104, -0.121447764, -0.07154875, -0.04503311, -0.030305816, -0.015827239, 0.009957148, 0.03582553, 0.061981335, 0.08865987, 0.09328683, 0.075910695, 0.059170403, 0.042910177, 0.004151102, -0.03457267, -0.07356888, -0.113189876, -0.13101831, -0.12725508, -0.124543086, -0.14552255, -0.1677208, -0.16875906, -0.14865027, -0.12978439, -0.11196618, -0.0723421, -0.03333244, 0.005414594, 0.044207633, 0.060513772, 0.054487083, 0.02606972, -0.0021273363, -0.030341433, -0.05881188, -0.064932995, -0.071586855, -0.07883545, -0.086729765, -0.09534507, -0.10473963, -0.12636536, -0.13769653, -0.13883337, -0.15241757, -0.16727075, -0.1722216, -0.16732772, -0.16381375, -0.16164199, -0.16080047, -0.16128018, -0.16308384, -0.1887501, -0.2159974, -0.23383276, -0.24248177, -0.23094587, -0.21019414, -0.17996578, -0.15124881, -0.14634484, -0.14263956, -0.117475405, -0.0932988, -0.069871575, -0.047032267, -0.047421142, -0.048201937, -0.049380545, -0.06238696, -0.064495444, -0.044309735, -0.024496289, -0.027747707, -0.031229114, -0.03496767, -0.050423212, -0.05487436, -0.059776645, -0.06517492, -0.07110833, -0.08903274, -0.1190625, -0.15009671, -0.15975831, -0.1707191, -0.18309651, -0.1969557, -0.21244971, -0.20735243, -0.20395966, -0.20223314, -0.20216462, -0.22609405, -0.26299524, -0.29104096, -0.2995954, -0.2887901, -0.2584592, -0.20815179, -0.15960562, -0.112293154, -0.06593612, -0.04292828, -0.04311979, -0.06651196, -0.09046399, -0.09238327, -0.095062956, -0.12127538, -0.17115153, -0.22248077, -0.27550936, -0.33085704, -0.36729822, -0.38543227, -0.38556793, -0.36770755, -0.33154693, -0.29813325, -0.27796754, -0.26004437, -0.25525343, -0.25253472, -0.22976989, 
-0.18663676, -0.14508101, -0.10466, -0.065127075, -0.048940565, -0.05599601, -0.086339906, -0.11741333, -0.12674172, -0.1144303, -0.103071906, -0.09255381, -0.07141727, -0.062274523, -0.05364135, -0.022609469, 0.0082310345, 0.027702972, 0.0359725, 0.044536725, 0.07631034, 0.10872948, 0.13066289, 0.14234391, 0.13256352, 0.10120758, 0.070706435, 0.04076886, 0.011176037, -0.01832755, -0.047986105, -0.06661167, -0.09718338, -0.12857601, -0.16098864, -0.194758, -0.20770252, -0.22232318, -0.23878622, -0.235013, -0.21095037, -0.16626802, -0.12299409, -0.08068067, -0.016230367, 0.05951432, 0.12434223, 0.16749325, 0.18947642, 0.2129747, 0.23825075, 0.2432815, 0.22813822, 0.1926083, 0.15869918, 0.14863212, 0.16236949, 0.17745796, 0.19397877, 0.2121134, 0.20965204, 0.20890951, 0.23218437, 0.25739357, 0.28463855, 0.3251377, 0.3574937, 0.38200796, 0.39904872, 0.39832017, 0.3798096, 0.34319732, 0.2986557, 0.2674288, 0.26033255, 0.25534362, 0.23034754, 0.20727067, 0.18584965, 0.1434878, 0.10234202, 0.084765136, 0.09066177, 0.120064974, 0.15048066, 0.18209001, 0.21523091, 0.2500635, 0.30897608, 0.37048364, 0.4137079, 0.43939537, 0.46846092, 0.5013099, 0.51819956, 0.5194567, 0.5051058, 0.4947499, 0.4882403, 0.46553683, 0.42621234, 0.39037102, 0.37861013, 0.3698511, 0.36407942, 0.3612191, 0.33996105, 0.29995424, 0.26244846, 0.24906288, 0.23768215, 0.20604627, 0.1761384, 0.1476286, 0.120360255, 0.11676107, 0.13681401, 0.15801261, 0.16921076, 0.18179, 0.18462768, 0.18897463, 0.19487867, 0.17996296, 0.16654257, 0.16574045, 0.17755301, 0.20206372, 0.21705125, 0.2115085, 0.20770442, 0.20559596, 0.18280533, 0.16154218, 0.14157687, 0.12279576, 0.11635507, 0.09951022, 0.08349773, 0.09095123, 0.1105391, 0.11967862, 0.12979448, 0.14098935, 0.13070709, 0.11017242, 0.07916644, 0.048829578, 0.04173228, 0.034977335, 0.0056509897, -0.023627548, -0.030233867, -0.037088342, -0.0442512, -0.05177694, -0.059733357, -0.06817899, -0.07719198, -0.08683607, -0.097203106, -0.085608274, -0.074726984, 
-0.08725783, -0.100516856, -0.114592984, -0.12962551, -0.14570795, -0.18557462, -0.22700852, -0.24803777, -0.2489583, -0.22978333, -0.2125151, -0.19695586, -0.18303159, -0.19304578, -0.2046579, -0.19557674, -0.17690118, -0.14840822, -0.10984366, -0.08354738, -0.08071613, -0.07854937, -0.07703207, -0.076149814, -0.07589647, -0.09906178, -0.14577515, -0.19372675, -0.22083451, -0.22746035, -0.2136956, -0.20169951, -0.21370171, -0.22747259, -0.24307387, -0.26068693, -0.2583702, -0.23608983, -0.21577166, -0.21948993, -0.24725181, -0.27707642, -0.29814583, -0.3216128, -0.33688688, -0.34411067, -0.36477613, -0.4095548, -0.45770496, -0.48908275, -0.52415586, -0.56342846, -0.58791894, -0.59812385, -0.5942526, -0.5762256, -0.56269485, -0.5534592, -0.5292621, -0.48962438, -0.45393956, -0.43194133, -0.40300402, -0.37735358, -0.37581414, -0.37728703, -0.36062664, -0.34690246, -0.33593604, -0.30616412, -0.25711364, -0.21022697, -0.16496378, -0.10979815, -0.06689888, -0.04735631, -0.051044486, -0.07798806, -0.105589405, -0.122678585, -0.12943567, -0.11458328, -0.10068713, -0.110357046, -0.1209449, -0.10982812, -0.07688324, -0.044588957, -0.035504334, -0.026710369, 0.0047286414, 0.04764131, 0.09094106, 0.12361148, 0.14593668, 0.1581608, 0.14911012, 0.118675336, 0.08924145, 0.060519524, 0.032308724, 0.0043563936, -0.046429135, -0.09761117, -0.12683208, -0.13441299, -0.14309132, -0.15295687, -0.1414856, -0.10854083, -0.0651339, -0.033674937, -0.013916526, -0.005707996, -0.008982436, -0.0008947514, 0.030057281, 0.06126368, 0.07014433, 0.06819432, 0.05539483, 0.043057982, 0.053921998, 0.08805965, 0.12294291, 0.1474729, 0.18448356, 0.22305016, 0.24116033, 0.23906678, 0.21673986, 0.17387041, 0.10990057, 0.046870723, 0.0070558414, -0.021267852, -0.03833483, -0.032860655, -0.0047959927, 0.023228124, 0.051438116, 0.08008258, 0.10936628, 0.13957265, 0.14828391, 0.15820329, 0.16943541, 0.15952925, 0.15094529, 0.15488805, 0.17137618, 0.2117523, 0.25391328, 0.27603933, 0.2784556, 0.2831269, 
0.29011014, 0.2775919, 0.26735282, 0.25926903, 0.23123267, 0.20512655, 0.20302631, 0.20258693, 0.20380636, 0.20669824, 0.21127638, 0.23987654, 0.27048042, 0.28129113, 0.27247033, 0.24388589, 0.21733531, 0.19251184, 0.16929787, 0.14744008, 0.1268167, 0.12989609, 0.15669195, 0.18480429, 0.20318009, 0.23436315, 0.26750797, 0.28081575, 0.2744849, 0.2703966, 0.26850134, 0.26879236, 0.2712731, 0.2539759, 0.23877773, 0.23659036, 0.2474077, 0.26026407, 0.2531851, 0.24818122, 0.2451932, 0.22205637, 0.17844734, 0.13634777, 0.117989056, 0.100583926, 0.08402044, 0.079534315, 0.08709705, 0.11814495, 0.1501902, 0.16086374, 0.15029494, 0.12966806, 0.110100545, 0.09143365, 0.084904365, 0.07907132, 0.07389425, 0.06932417, 0.06532876, 0.08468515, 0.10475198, 0.10293707, 0.079220794, 0.056171007, 0.056412008, 0.057119094, 0.035460655, -0.008764204, -0.05306372, -0.074959025, -0.07466355, -0.0749844, -0.09871813, -0.12328213, -0.1261669, -0.10740516, -0.0667881, -0.02673867, -0.009768106, -0.015748646, -0.04472629, -0.074081935, -0.08123779, -0.07766474, -0.06332756, -0.049519803, -0.03611526, -0.00014982373, 0.03581432, 0.049214847, 0.06301615, 0.077344224, 0.069505595, 0.050834805, 0.044003617, 0.0375335, 0.03137532, 0.025474804, 0.019785952, 0.037126265, 0.066201024, 0.084415555, 0.09192626, 0.0888084, 0.06364058, 0.01616954, -0.031163506, -0.055891916, -0.081069164, -0.10692933, -0.11093081, -0.093117386, -0.07608364, -0.082470335, -0.089541815, -0.08596114, -0.071692415, -0.058022846, -0.05624292, -0.06634036, -0.09979378, -0.13409147, -0.1468305, -0.13816045, -0.10797903, -0.07870866, -0.07287324, -0.090438105, -0.108760536, -0.1052445, -0.10259872, -0.10079707, -0.07707125, -0.05399389, -0.03135126, -0.008972554, -0.009538375, -0.010183048, 0.011960189, 0.034204077, 0.03386942, 0.010953216, -0.011870727, -0.011921028, -0.01206986, -0.035187628, -0.0586017, -0.05966446, -0.06121934, -0.07469129, -0.077377155, -0.08069982, -0.08469075, -0.06658958, -0.026214156, 0.013938282, 
0.031335916, 0.03755931, 0.032663845, 0.028039493, 0.02364578, 0.019448273, 0.015410811, 6.57849e-5, -0.026714515, -0.053720325, -0.06974385, -0.07493675, -0.08074484, -0.087223075, -0.09441606, -0.10239325, -0.11120758, -0.12094623, -0.10898025, -0.097921915, -0.1104141, -0.12382683, -0.11556704, -0.096906945, -0.067652866, -0.02755348, 0.00088725425, 0.006463351, -0.010779781, -0.050979577, -0.0916128, -0.11022386, -0.10701231, -0.08194301, -0.057563663, -0.056484282, -0.0786979, -0.124341056, -0.17104116, -0.19661525, -0.20139043, -0.20780624, -0.21593478, -0.20351833, -0.18159287, -0.14989418, -0.11945823, -0.10133224, -0.095413044, -0.10165569, -0.1200975, -0.13954471, -0.13749777, -0.12526177, -0.12539245, -0.12655653, -0.10607838, -0.086489625, -0.0675997, -0.0492768, -0.05419806, -0.08239602, -0.11128938, -0.14106949, -0.17203814, -0.18189804, -0.17077395, -0.1610655, -0.17521349, -0.19081862, -0.18555215, -0.18181418, -0.20200977, -0.22388937, -0.23644717, -0.23984325, -0.22301759, -0.18573564, -0.15001905, -0.115485616, -0.08192864, -0.071825184], "type": "scatter", "name": "Î¸", "yaxis": "y2", "x": [0.0, 0.04, 0.08, 0.12, 0.16, 0.19999999, 0.23999998, 0.27999997, 0.31999996, 0.35999995, 0.39999995, 0.43999994, 0.47999993, 0.5199999, 0.55999994, 0.59999996, 0.64, 0.68, 0.72, 0.76000005, 0.8000001, 0.8400001, 0.8800001, 0.92000014, 0.96000016, 1.0000001, 1.0400001, 1.08, 1.12, 1.16, 1.1999999, 1.2399999, 1.2799999, 1.3199998, 1.3599998, 1.3999997, 1.4399997, 1.4799997, 1.5199996, 1.5599996, 1.5999995, 1.6399995, 1.6799995, 1.7199994, 1.7599994, 1.7999994, 1.8399993, 1.8799993, 1.9199992, 1.9599992, 1.9999992, 2.0399992, 2.0799992, 2.1199992, 2.1599991, 2.199999, 2.239999, 2.279999, 2.319999, 2.359999, 2.399999, 2.4399989, 2.4799988, 2.5199988, 2.5599988, 2.5999987, 2.6399987, 2.6799986, 2.7199986, 2.7599986, 2.7999985, 2.8399985, 2.8799984, 2.9199984, 2.9599984, 2.9999983, 3.0399983, 3.0799983, 3.1199982, 3.1599982, 3.1999981, 3.239998, 3.279998, 3.319998, 
3.359998, 3.399998, 3.439998, 3.4799979, 3.5199978, 3.5599978, 3.5999978, 3.6399977, 3.6799977, 3.7199976, 3.7599976, 3.7999976, 3.8399975, 3.8799975, 3.9199975, 3.9599974, 3.9999974, 4.0399976, 4.0799975, 4.1199975, 4.1599975, 4.1999974, 4.2399974, 4.2799973, 4.3199973, 4.3599973, 4.399997, 4.439997, 4.479997, 4.519997, 4.559997, 4.599997, 4.639997, 4.679997, 4.719997, 4.759997, 4.799997, 4.839997, 4.879997, 4.9199967, 4.9599967, 4.9999967, 5.0399966, 5.0799966, 5.1199965, 5.1599965, 5.1999965, 5.2399964, 5.2799964, 5.3199964, 5.3599963, 5.3999963, 5.4399962, 5.479996, 5.519996, 5.559996, 5.599996, 5.639996, 5.679996, 5.719996, 5.759996, 5.799996, 5.839996, 5.879996, 5.919996, 5.9599957, 5.9999957, 6.0399957, 6.0799956, 6.1199956, 6.1599956, 6.1999955, 6.2399955, 6.2799954, 6.3199954, 6.3599954, 6.3999953, 6.4399953, 6.4799953, 6.519995, 6.559995, 6.599995, 6.639995, 6.679995, 6.719995, 6.759995, 6.799995, 6.839995, 6.879995, 6.919995, 6.959995, 6.9999948, 7.0399947, 7.0799947, 7.1199946, 7.1599946, 7.1999946, 7.2399945, 7.2799945, 7.3199944, 7.3599944, 7.3999944, 7.4399943, 7.4799943, 7.5199943, 7.559994, 7.599994, 7.639994, 7.679994, 7.719994, 7.759994, 7.799994, 7.839994, 7.879994, 7.919994, 7.959994, 7.999994, 8.039994, 8.079994, 8.119994, 8.159994, 8.199994, 8.239994, 8.279994, 8.319994, 8.359994, 8.399994, 8.439994, 8.479994, 8.519994, 8.559994, 8.599994, 8.639994, 8.679994, 8.719994, 8.759994, 8.7999935, 8.8399935, 8.879993, 8.919993, 8.959993, 8.999993, 9.039993, 9.079993, 9.119993, 9.159993, 9.199993, 9.239993, 9.279993, 9.319993, 9.359993, 9.399993, 9.439993, 9.479993, 9.519993, 9.559993, 9.599993, 9.639993, 9.679993, 9.719993, 9.759993, 9.799993, 9.839993, 9.8799925, 9.919992, 9.959992, 9.999992, 10.039992, 10.079992, 10.119992, 10.159992, 10.199992, 10.239992, 10.279992, 10.319992, 10.359992, 10.399992, 10.439992, 10.479992, 10.519992, 10.559992, 10.599992, 10.639992, 10.679992, 10.719992, 10.759992, 10.799992, 10.839992, 10.879992, 10.9199915, 
10.959991, 10.999991, 11.039991, 11.079991, 11.119991, 11.159991, 11.199991, 11.239991, 11.279991, 11.319991, 11.359991, 11.399991, 11.439991, 11.479991, 11.519991, 11.559991, 11.599991, 11.639991, 11.679991, 11.719991, 11.759991, 11.799991, 11.839991, 11.879991, 11.919991, 11.9599905, 11.99999, 12.03999, 12.07999, 12.11999, 12.15999, 12.19999, 12.23999, 12.27999, 12.31999, 12.35999, 12.39999, 12.43999, 12.47999, 12.51999, 12.55999, 12.59999, 12.63999, 12.67999, 12.71999, 12.75999, 12.79999, 12.83999, 12.87999, 12.91999, 12.95999, 12.9999895, 13.039989, 13.079989, 13.119989, 13.159989, 13.199989, 13.239989, 13.279989, 13.319989, 13.359989, 13.399989, 13.439989, 13.479989, 13.519989, 13.559989, 13.599989, 13.639989, 13.679989, 13.719989, 13.759989, 13.799989, 13.839989, 13.879989, 13.919989, 13.959989, 13.999989, 14.0399885, 14.0799885, 14.119988, 14.159988, 14.199988, 14.239988, 14.279988, 14.319988, 14.359988, 14.399988, 14.439988, 14.479988, 14.519988, 14.559988, 14.599988, 14.639988, 14.679988, 14.719988, 14.759988, 14.799988, 14.839988, 14.879988, 14.919988, 14.959988, 14.999988, 15.039988, 15.079988, 15.1199875, 15.159987, 15.199987, 15.239987, 15.279987, 15.319987, 15.359987, 15.399987, 15.439987, 15.479987, 15.519987, 15.559987, 15.599987, 15.639987, 15.679987, 15.719987, 15.759987, 15.799987, 15.839987, 15.879987, 15.919987, 15.959987, 15.999987, 16.039988, 16.079988, 16.11999, 16.15999, 16.199991, 16.239992, 16.279993, 16.319994, 16.359995, 16.399996, 16.439997, 16.479998, 16.519999, 16.56, 16.6, 16.640001, 16.680002, 16.720003, 16.760004, 16.800005, 16.840006, 16.880007, 16.920008, 16.960009, 17.00001, 17.04001, 17.080011, 17.120012, 17.160013, 17.200014, 17.240015, 17.280016, 17.320017, 17.360018, 17.400019, 17.44002, 17.48002, 17.520021, 17.560022, 17.600023, 17.640024, 17.680025, 17.720026, 17.760027, 17.800028, 17.840029, 17.88003, 17.92003, 17.960032, 18.000032, 18.040033, 18.080034, 18.120035, 18.160036, 18.200037, 18.240038, 18.280039, 18.32004, 
18.36004, 18.400042, 18.440042, 18.480043, 18.520044, 18.560045, 18.600046, 18.640047, 18.680048, 18.720049, 18.76005, 18.80005, 18.840052, 18.880053, 18.920053, 18.960054, 19.000055, 19.040056, 19.080057, 19.120058, 19.160059, 19.20006, 19.24006, 19.280062, 19.320063, 19.360064, 19.400064, 19.440065, 19.480066, 19.520067, 19.560068, 19.600069, 19.64007, 19.68007, 19.720072, 19.760073, 19.800074, 19.840075, 19.880075, 19.920076, 19.960077, 20.000078, 20.04008, 20.08008, 20.12008, 20.160082, 20.200083, 20.240084, 20.280085, 20.320086, 20.360086, 20.400087, 20.440088, 20.48009, 20.52009, 20.560091, 20.600092, 20.640093, 20.680094, 20.720095, 20.760096, 20.800097, 20.840097, 20.880098, 20.9201, 20.9601, 21.000101, 21.040102, 21.080103, 21.120104, 21.160105, 21.200106, 21.240107, 21.280107, 21.320108, 21.36011, 21.40011, 21.440111, 21.480112, 21.520113, 21.560114, 21.600115, 21.640116, 21.680117, 21.720118, 21.760118, 21.80012, 21.84012, 21.880121, 21.920122, 21.960123, 22.000124, 22.040125, 22.080126, 22.120127, 22.160128, 22.200129, 22.24013, 22.28013, 22.320131, 22.360132, 22.400133, 22.440134, 22.480135, 22.520136, 22.560137, 22.600138, 22.640139, 22.68014, 22.72014, 22.760141, 22.800142, 22.840143, 22.880144, 22.920145, 22.960146, 23.000147, 23.040148, 23.080149, 23.12015, 23.16015, 23.200151, 23.240152, 23.280153, 23.320154, 23.360155, 23.400156, 23.440157, 23.480158, 23.520159, 23.56016, 23.60016, 23.640162, 23.680162, 23.720163, 23.760164, 23.800165, 23.840166, 23.880167, 23.920168, 23.960169, 24.00017, 24.04017, 24.080172, 24.120173, 24.160173, 24.200174, 24.240175, 24.280176, 24.320177, 24.360178, 24.400179, 24.44018, 24.48018, 24.520182, 24.560183, 24.600183, 24.640184, 24.680185, 24.720186, 24.760187, 24.800188, 24.840189, 24.88019, 24.92019, 24.960192, 25.000193, 25.040194, 25.080194, 25.120195, 25.160196, 25.200197, 25.240198, 25.2802, 25.3202, 25.3602, 25.400202, 25.440203, 25.480204, 25.520205, 25.560205, 25.600206, 25.640207, 25.680208, 25.72021, 
25.76021, 25.80021, 25.840212, 25.880213, 25.920214, 25.960215, 26.000216, 26.040216, 26.080217, 26.120218, 26.16022, 26.20022, 26.240221, 26.280222, 26.320223, 26.360224, 26.400225, 26.440226, 26.480227, 26.520227, 26.560228, 26.60023, 26.64023, 26.680231, 26.720232, 26.760233, 26.800234, 26.840235, 26.880236, 26.920237, 26.960238, 27.000238, 27.04024, 27.08024, 27.120241, 27.160242, 27.200243, 27.240244, 27.280245, 27.320246, 27.360247, 27.400248, 27.440248, 27.48025, 27.52025, 27.560251, 27.600252, 27.640253, 27.680254, 27.720255, 27.760256, 27.800257, 27.840258, 27.880259, 27.92026, 27.96026, 28.000261, 28.040262, 28.080263, 28.120264, 28.160265, 28.200266, 28.240267, 28.280268, 28.320269, 28.36027, 28.40027, 28.440271, 28.480272, 28.520273, 28.560274, 28.600275, 28.640276, 28.680277, 28.720278, 28.760279, 28.80028, 28.84028, 28.880281, 28.920282, 28.960283, 29.000284, 29.040285, 29.080286, 29.120287, 29.160288, 29.200289, 29.24029, 29.28029, 29.320292, 29.360292, 29.400293, 29.440294, 29.480295, 29.520296, 29.560297, 29.600298, 29.640299, 29.6803, 29.7203, 29.760302, 29.800303, 29.840303, 29.880304, 29.920305, 29.960306, 30.000307, 30.040308, 30.080309, 30.12031, 30.16031, 30.200312, 30.240313, 30.280313, 30.320314, 30.360315, 30.400316, 30.440317, 30.480318, 30.520319, 30.56032, 30.60032, 30.640322, 30.680323, 30.720324, 30.760324, 30.800325, 30.840326, 30.880327, 30.920328, 30.96033, 31.00033, 31.04033, 31.080332, 31.120333, 31.160334, 31.200335, 31.240335, 31.280336, 31.320337, 31.360338, 31.40034, 31.44034, 31.480341, 31.520342, 31.560343, 31.600344, 31.640345, 31.680346, 31.720346, 31.760347, 31.800348, 31.84035, 31.88035, 31.920351, 31.960352, 32.00035, 32.04035, 32.080353, 32.120354, 32.160355, 32.200356, 32.240356, 32.280357, 32.32036, 32.36036, 32.40036, 32.44036, 32.480362, 32.520363, 32.560364, 32.600365, 32.640366, 32.680367, 32.720367, 32.76037, 32.80037, 32.84037, 32.88037, 32.920372, 32.960373, 33.000374, 33.040375, 33.080376, 33.120377, 
33.160378, 33.20038, 33.24038, 33.28038, 33.32038, 33.360382, 33.400383, 33.440384, 33.480385, 33.520386, 33.560387, 33.600388, 33.64039, 33.68039, 33.72039, 33.76039, 33.800392, 33.840393, 33.880394, 33.920395, 33.960396, 34.000397, 34.040398, 34.0804, 34.1204, 34.1604, 34.2004, 34.240402, 34.280403, 34.320404, 34.360405, 34.400406, 34.440407, 34.480408, 34.52041, 34.56041, 34.60041, 34.64041, 34.680412, 34.720413, 34.760414, 34.800415, 34.840416, 34.880417, 34.920418, 34.96042, 35.00042, 35.04042, 35.08042, 35.120422, 35.160423, 35.200424, 35.240425, 35.280426, 35.320427, 35.360428, 35.40043, 35.44043, 35.48043, 35.52043, 35.560432, 35.600433, 35.640434, 35.680435, 35.720436, 35.760437, 35.800438, 35.84044, 35.88044, 35.92044, 35.96044, 36.000443, 36.040443, 36.080444, 36.120445, 36.160446, 36.200447, 36.240448, 36.28045, 36.32045, 36.36045, 36.40045, 36.440453, 36.480453, 36.520454, 36.560455, 36.600456, 36.640457, 36.680458, 36.72046, 36.76046, 36.80046, 36.84046, 36.880463, 36.920464, 36.960464, 37.000465, 37.040466, 37.080467, 37.12047, 37.16047, 37.20047, 37.24047, 37.28047, 37.320473, 37.360474, 37.400475, 37.440475, 37.480476, 37.520477, 37.56048, 37.60048, 37.64048, 37.68048, 37.72048, 37.760483, 37.800484, 37.840485, 37.880486, 37.920486, 37.960487, 38.00049, 38.04049, 38.08049, 38.12049, 38.160492, 38.200493, 38.240494, 38.280495, 38.320496, 38.360497, 38.400497, 38.4405, 38.4805, 38.5205, 38.5605, 38.600502, 38.640503, 38.680504, 38.720505, 38.760506, 38.800507, 38.840508, 38.88051, 38.92051, 38.96051, 39.00051, 39.040512, 39.080513, 39.120514, 39.160515, 39.200516, 39.240517, 39.280518, 39.32052, 39.36052, 39.40052, 39.44052, 39.480522, 39.520523, 39.560524, 39.600525, 39.640526, 39.680527, 39.720528, 39.76053, 39.80053, 39.84053, 39.88053, 39.920532, 39.960533]}, {"y": [0.0, -0.00093293114, -1.1446139, -0.0033041239, 1.1379578, -0.0057849884, 1.1359303, -0.007592559, 1.1345052, 2.277039, 1.1344777, -0.007654071, -1.1498702, -2.2926104, -1.1503701, 
-2.2939868, -2.2954237, -1.1563231, -2.29983, -2.3025405, -1.1681821, -0.03638637, 1.0940988, 1.0880625, 2.2185578, 3.3503754, 2.2075465, 3.3413486, 2.1977491, 3.3330925, 2.1890655, 1.044928, 2.180572, 3.3161805, 3.3123507, 4.4497705, 5.589135, 4.4455433, 3.301728, 4.4434953, 3.3002317, 3.3000422, 2.1571465, 3.2999017, 4.4428625, 3.3006592, 2.1587746, 2.1595078, 3.3029284, 3.303717, 2.1619983, 1.0200357, 2.1632304, 3.306312, 2.1638439, 3.3070407, 2.1647239, 1.0222374, 2.1650326, 2.1647666, 2.1643615, 3.3067048, 4.4493628, 3.306863, 2.1647077, 1.0223705, 2.165297, 1.0222633, 2.1644666, 3.3064752, 3.3059344, 4.4485817, 5.5917034, 4.450218, 3.3100615, 2.1706157, 3.3149478, 2.1759443, 2.178676, 1.0395123, 1.0416634, 2.1858385, 1.0453684, 1.0468559, -0.09479046, 1.0484549, 2.1914582, 3.3345265, 2.1923046, 3.335844, 3.3369865, 2.1962504, 3.3403666, 4.484124, 3.3456044, 2.208867, 1.0728667, 2.21702, 1.0813203, 2.2253563, 1.0902627, -0.04485023, -1.1812388, -0.037305593, -1.1756942, -0.031563997, -1.1715637, -2.3128567, -1.169621, -0.02675891, 1.1158862, 2.2587352, 3.4019625, 2.2606308, 2.26252, 1.1235325, 2.267614, 1.1304086, -0.0057462454, -0.0016803304, -1.1380011, -2.2755482, -1.1315224, 0.012840748, 0.015576998, -1.12323, -1.1206765, -2.2605472, -1.1165371, 0.02751112, 1.1715167, 0.0316844, -1.1075186, 0.036836624, -1.1020733, 0.04229331, -1.0961531, -2.2347178, -1.0903941, -1.0877812, 0.056530714, -1.0819386, -2.2202027, -1.0758225, -2.2144396, -1.0700591, -2.2088516, -1.0644729, 0.07969749, -1.0581074, -2.1950898, -1.0508118, -2.1873443, -3.3240247, -4.461997, -3.3179982, -2.1736827, -3.313275, -2.1690109, -3.30898, -2.1647696, -2.1627088, -3.3026733, -2.1584415, -3.2984936, -2.1542675, -1.0101506, -2.1492424, -3.28762, -4.426332, -3.2820654, -4.421794, -3.2776322, -4.4181795, -5.5595584, -4.4162264, -3.2731493, -4.4159136, -3.2732568, -3.2735283, -4.416705, -3.2745428, -4.4180765, -3.2764878, -2.1349497, -0.9928442, -2.1357617, -2.135419, -3.2775426, -2.1340203, 
-3.2757053, -4.4174285, -5.559751, -6.7027206, -5.561055, -4.420973, -3.2820215, -2.143272, -3.287634, -2.1485384, -3.292886, -2.1536024, -2.1559508, -3.3002415, -3.302597, -2.16336, -3.3077126, -2.168714, -1.0294409, -2.173658, -1.0334461, -2.1775022, -2.1791167, -1.0381136, -2.1819346, -3.3257556, -3.3272362, -2.1869082, -3.3311112, -2.191627, -3.335906, -2.197474, -1.0594043, -2.203794, -2.2069178, -1.0691053, -2.2134829, -1.0758271, -2.2201786, -1.0828984, 0.05458331, -1.0897348, -1.0928607, 0.045449495, -1.0989118, -1.1017523, 0.036947966, 1.176424, 0.032390118, -1.1117274, -2.2557993, -2.2579043, -1.1192616, -2.26336, -1.1264122, 0.009576082, 1.1458302, 2.2835102, 2.281368, 1.1374075, -0.006706476, 1.1339993, -0.009984016, -1.1539629, -0.013584971, -1.1577667, -0.018457651, 1.120464, -0.023904324, 1.1150998, -0.029268146, 1.1096619, 2.2489848, 1.1048069, 2.245205, 1.1012338, -0.04274082, -1.1866548, -1.1885823, -0.050189137, 1.086992, -0.057188153, -1.2004849, -1.2037287, -0.07082772, 1.0595709, 2.1887844, 3.3182652, 4.449437, 5.583518, 4.441048, 3.297163, 2.1529183, 3.2892375, 3.285353, 4.4223065, 4.419233, 3.2749076, 4.413692, 5.553497, 4.409618, 4.4083447, 3.2646286, 4.4063764, 3.2628577, 2.119304, 3.2607589, 3.2593818, 2.1154232, 3.2558038, 4.395882, 5.5365214, 4.392741, 5.5345964, 6.6772146, 7.8202066, 6.678935, 5.53999, 4.4030714, 3.2672257, 2.131225, 3.2754087, 2.1385932, 1.0006912, -0.13913572, 1.0043849, 2.148008, 2.1486907, 2.1491094, 1.0063454, 2.1487389, 1.005218, 2.1464458, 3.2871265, 4.428039, 3.2842505, 4.4260044, 5.5684247, 4.425861, 5.5693254, 4.428382, 3.2885427, 2.1490111, 3.2933087, 4.437501, 3.2987077, 2.1604996, 1.0218084, 2.1660385, 1.0261619, -0.11481178, 1.0286413, -0.11394358, 1.0284542, 2.170177, 3.3117473, 4.453796, 3.3107882, 4.453861, 3.3119044, 2.1704433, 3.31428, 2.173162, 3.317144, 2.1764536, 3.320574, 3.3225806, 2.183107, 2.1856506, 3.3299537, 2.191369, 2.1944609, 1.0562984, 2.2006855, 1.0622816, 2.2066722, 1.068211, 
2.212601, 3.356707, 2.2194133, 2.2232904, 1.0877316, -0.047918558, -0.044169463, -1.1818066, -0.037745595, 1.1066209, -0.03240955, -1.1720976, -0.028068304, -1.1691073, -0.025455594, 1.1181288, -0.02367282, 1.1199324, 1.12095, -0.020466805, -1.162158, -0.01879239, 1.1244806, -0.017740965, 1.1256397, 1.1264107, -0.015276551, 1.1284353, -0.013009548, 1.1308274, 1.1322529, 2.2761931, 1.1365414, -0.0021648407, 1.1421791, 0.004219532, 1.1484903, 0.011521101, -1.1251447, 0.01915133, -1.1177672, 0.026549459, 1.170514, 1.1740997, 0.039738417, -1.0933486, -2.2265162, -3.3611054, -4.4981937, -3.3548665, -4.4948773, -3.3512998, -2.2076943, -1.064168, 0.0794394, -1.0616841, -2.2023487, -1.0582582, 0.085758805, 1.229243, 0.09088147, 1.2332077, 0.09838104, -1.033194, -2.1626387, -3.2913399, -4.420763, -5.5522966, -4.409995, -4.405208, -4.4006476, -3.2567506, -4.391613, -5.5272965, -6.6649156, -5.521218, -6.661351, -5.517667, -4.373993, -3.230346, -2.0866342, -3.2272224, -4.3669972, -5.5068364, -4.362667, -5.503173, -5.501738, -4.357924, -5.499488, -6.6415944, -5.4986663, -5.4989853, -4.3568115, -5.500291, -6.6438894, -5.5030346, -5.505108, -4.365775, -3.2264862, -2.08631, -3.2301354, -2.0884151, -3.2316241, -2.088752, -3.2309535, -3.2299345, -2.086139, -3.226429, -2.0824332, -3.2207656, -4.357833, -3.2136893, -4.3496637, -5.485241, -6.621676, -7.7600164, -6.61631, -7.757092, -8.899321, -8.899428, -7.757897, -6.61829, -5.4802504, -6.624304, -5.48794, -4.352516, -3.2167726, -2.079521, -3.2234192, -4.3677645, -5.5120792, -4.3738413, -5.5180883, -4.3810143, -3.2443967, -4.388681, -5.5326157, -4.3970127, -5.5402956, -5.544445, -4.412738, -4.41847, -3.289763, -2.1617308, -1.0329041, 0.098109245, 0.09403723, -1.0489174, -2.1927295, -1.0580246, 0.07765055, -1.0663018, 0.071074724, 1.2100941, 0.06631839, -1.0777009, -2.2217336, -3.36569, -2.2260714, -3.370061, -2.232467, -3.3759284, -4.517281, -3.383359, -2.2533135, -1.1260805, -2.2649891, -1.1403214, -0.017860174, 1.1037798, 2.2260132, 
1.0878642, 2.2110398, 3.3355136, 4.4626555, 3.3227706, 2.1811907, 3.310524, 2.1684418, 3.298291, 4.4290895, 5.562172, 4.419477, 3.2756557, 4.41103, 5.5477257, 4.403782, 5.5425186, 4.398429, 3.25417, 2.109991, 3.2490726, 3.2461271, 3.2429905, 4.38066, 3.2363014, 4.373659, 5.511266, 4.366953, 4.3639145, 3.2195823, 3.2164018, 4.3531775, 5.489674, 4.345399, 5.4821806, 6.6196713, 5.4754696, 6.614397, 5.470185, 5.467989, 6.6082425, 5.464227, 4.3202033, 4.3184295, 5.458418, 4.3141546, 5.4535193, 6.592863, 6.59076, 7.7315397, 6.5878925, 5.4443407, 6.5863667, 7.728818, 6.586159, 5.44388, 6.587266, 5.4452925, 6.5888805, 5.4472637, 6.5910473, 5.4498577, 6.5938263, 5.453157, 4.3124647, 5.4564104, 6.600356, 5.4597244, 6.603868, 5.4640546, 6.6083136, 7.7520676, 6.6144514, 5.478664, 4.3435774, 3.2079058, 4.351965, 3.2149959, 4.359266, 5.5035844, 4.366442, 3.2293725, 3.2326076, 2.0938966, 2.0958202, 3.2397714, 4.3837585, 3.2427673, 4.3867126, 3.24564, 4.389568, 5.533489, 6.677118, 5.5380583, 4.4009495, 3.2648382, 2.1285095, 3.2727299, 4.41695, 3.280622, 4.4246016, 3.2894766, 2.1548557, 1.0194312, 2.1633935, 3.3076253, 4.4513264, 3.3162813, 3.3210368, 3.3260317, 2.1943939, 2.2001936, 3.3422213, 4.482053, 3.3535428, 2.228382, 3.3650022, 2.2428656, 1.1235981, 0.0059211254, -1.1114637, -2.2298563, -1.0944554, -2.2140377, -3.3351445, -4.4590945, -3.3212461, -3.3145752, -4.442437, -3.3014696, -2.1595802, -3.2887335, -4.418499, -3.2762587, -4.4072313, -5.5397367, -6.674958, -5.5320582, -6.6704545, -6.6689196, -5.525277, -4.38162, -3.238011, -2.0943274, -3.2351403, -3.2331104, -4.3727493, -5.512564, -4.3684187, -3.2242608, -4.36424, -5.5042305, -6.645071, -5.5014744, -4.357999, -5.5001893, -6.6428146, -6.643188, -6.6441665, -5.5037007, -5.506023, -4.367199, -3.2284527, -2.0887918, -3.2327409, -2.091483, -3.234961, -2.092475, -0.9494363, -2.091056, -3.2313304, -4.3708277, -3.2265038, -4.3654447, -5.504424, -6.6443954, -6.6431737, -5.499686, -5.4992642, -4.356267, -4.35613, -5.4988575, 
-6.641898, -5.4999485, -4.3585787, -4.3598638, -3.2183833, -4.361975, -5.505562, -6.6492286, -5.5086784, -5.5109744, -6.6548157, -5.5172205, -4.381278, -3.245839, -2.1096385, -0.9715903, 0.16886532, -0.97430885, -2.1174793, -2.1173604, -3.2595925, -4.40168, -5.544157, -4.4015036, -5.5448704, -4.4033923, -5.5472918, -4.4073544, -3.2682843, -4.41263, -3.2742183, -2.1358404, -3.280198, -3.2829976, -4.427282, -5.5708833, -4.434186, -3.2995024, -2.1656525, -3.309277, -2.1758153, -1.0421592, -2.1858351, -1.0513406, 0.084329486, -1.0595492, -2.2038558, -1.0669276, -2.2112393, -1.0744015, -2.2186768, -3.362423, -2.227065, -1.0930841, 0.040828228, 1.1759932, 0.0322299, 1.1693698, 0.025226116, 1.1639556, 0.019753695, -1.124554, -2.2686658, -1.1302344, -1.1336018, -2.2773235, -1.1417001, -0.0076055527, 1.1262486, -0.017484665, 1.116706, -0.027083158, 1.1070837, 2.2417502, 1.0978645, 1.0936097, -0.050383568, 1.0845743, 2.2193418, 1.0753756, 2.2105432, 3.3465374, 4.484448, 3.3407373, 2.1965365, 3.3367553, 2.1927376, 2.1912234, 1.0472354, -0.09668207, 1.0430018, 2.181695, 3.3203855, 3.3179522, 3.3158479, 3.3140655, 2.1700304, 3.3109632, 2.167036, 3.3082442, 2.164413, 1.0205662, 2.1612234, 3.301529, 4.4423194, 3.298572, 2.1548696, 3.2964535, 4.4383388, 5.580944, 4.4386272, 3.2971568, 2.1560369, 3.299952, 4.443863, 3.303618, 2.163948, 1.0239388, -0.11717379, 1.0262108, 2.1693406, 3.3123908, 4.455661, 3.3140285, 2.1729758, 2.1744487, 1.0331821, 2.176859, 1.0349228, -0.10758448, 1.0349281, 2.176871, 1.0331981, 2.1744728, 3.3156397, 3.3145702, 2.170989, 3.3129597, 2.169561, 3.3118072, 2.1685915, 1.0253042, 1.0244465, 2.1656098, 2.163986, 3.3046815, 4.4458733, 5.588054, 4.4453573, 3.3033967, 4.4471345, 3.3064003, 2.1662428, 1.0258102, 2.1697612, 3.313714, 2.1728888, 2.174514, 1.0334655, 2.1772404, 2.1783032, 3.322051, 4.465817, 3.3255904, 2.186378, 1.0472609, -0.09271705, 1.051174, 2.195093, 3.339014, 2.1985083, 1.0582942, 2.2024632, 1.0620944, -0.07875371, 1.0649717, -0.07699859, 
1.0661496, 2.2091396, 1.0665089, -0.07632327, 1.0660586, 2.208179, 3.350576, 2.2077677, 1.0651776, 2.2082157, 3.3514144, 2.2095346, 1.0680096, 2.2117627, 2.2129967, 1.0719081, 2.2158642, 1.0749542, -0.0661999, -1.2082006, -0.065304756, 1.0770775, 1.0764387, 2.2185364, 1.0751984, 2.2174997, 1.0743011, 2.2167792, 2.2166393, 3.359671, 2.217628, 2.2187386, 1.0775486, 2.2214882, 1.0806087, 2.2246647, 1.0841864, 2.2283568, 1.088395, -0.05161345, 1.0925227, 2.2366552, 1.0967023, -0.043141127, -0.041263357, -1.1823418, -1.1816378, -0.038609505, 1.1041417, 2.2469988, 3.390213, 2.2487345, 1.1082637, -0.031901598, -1.1726143, -0.028857589, 1.1148539, 2.258607, 3.4022384, 2.262574, 1.1246513, -0.012433171, 1.131853, -0.004755497, -1.1414905, -1.1382118, -2.276894, -1.1328629, -1.1309078, 0.013213277, 0.015029476, 1.1591543, 0.019535303, -1.1196895, -1.1173155, 0.026981592, -1.1126468, -2.2526882, -1.1086665, -2.2497892, -1.10611, 0.03753102, 1.1812391, 0.040545225, 1.1845723, 0.045607686, -1.0923266, -2.2302945, -1.0859402, 0.058369994, -1.0794067, -2.2169008, -1.072533, 0.07163179, -1.0651387, -1.061091, -2.1966152, -3.332561, -4.4700885, -3.3262155, -4.4659925, -3.322135, -2.1782184], "type": "scatter", "name": "xÌ‡", "yaxis": "y", "x": [0.0, 0.04, 0.08, 0.12, 0.16, 0.19999999, 0.23999998, 0.27999997, 0.31999996, 0.35999995, 0.39999995, 0.43999994, 0.47999993, 0.5199999, 0.55999994, 0.59999996, 0.64, 0.68, 0.72, 0.76000005, 0.8000001, 0.8400001, 0.8800001, 0.92000014, 0.96000016, 1.0000001, 1.0400001, 1.08, 1.12, 1.16, 1.1999999, 1.2399999, 1.2799999, 1.3199998, 1.3599998, 1.3999997, 1.4399997, 1.4799997, 1.5199996, 1.5599996, 1.5999995, 1.6399995, 1.6799995, 1.7199994, 1.7599994, 1.7999994, 1.8399993, 1.8799993, 1.9199992, 1.9599992, 1.9999992, 2.0399992, 2.0799992, 2.1199992, 2.1599991, 2.199999, 2.239999, 2.279999, 2.319999, 2.359999, 2.399999, 2.4399989, 2.4799988, 2.5199988, 2.5599988, 2.5999987, 2.6399987, 2.6799986, 2.7199986, 2.7599986, 2.7999985, 2.8399985, 
2.8799984, 2.9199984, 2.9599984, 2.9999983, 3.0399983, 3.0799983, 3.1199982, 3.1599982, 3.1999981, 3.239998, 3.279998, 3.319998, 3.359998, 3.399998, 3.439998, 3.4799979, 3.5199978, 3.5599978, 3.5999978, 3.6399977, 3.6799977, 3.7199976, 3.7599976, 3.7999976, 3.8399975, 3.8799975, 3.9199975, 3.9599974, 3.9999974, 4.0399976, 4.0799975, 4.1199975, 4.1599975, 4.1999974, 4.2399974, 4.2799973, 4.3199973, 4.3599973, 4.399997, 4.439997, 4.479997, 4.519997, 4.559997, 4.599997, 4.639997, 4.679997, 4.719997, 4.759997, 4.799997, 4.839997, 4.879997, 4.9199967, 4.9599967, 4.9999967, 5.0399966, 5.0799966, 5.1199965, 5.1599965, 5.1999965, 5.2399964, 5.2799964, 5.3199964, 5.3599963, 5.3999963, 5.4399962, 5.479996, 5.519996, 5.559996, 5.599996, 5.639996, 5.679996, 5.719996, 5.759996, 5.799996, 5.839996, 5.879996, 5.919996, 5.9599957, 5.9999957, 6.0399957, 6.0799956, 6.1199956, 6.1599956, 6.1999955, 6.2399955, 6.2799954, 6.3199954, 6.3599954, 6.3999953, 6.4399953, 6.4799953, 6.519995, 6.559995, 6.599995, 6.639995, 6.679995, 6.719995, 6.759995, 6.799995, 6.839995, 6.879995, 6.919995, 6.959995, 6.9999948, 7.0399947, 7.0799947, 7.1199946, 7.1599946, 7.1999946, 7.2399945, 7.2799945, 7.3199944, 7.3599944, 7.3999944, 7.4399943, 7.4799943, 7.5199943, 7.559994, 7.599994, 7.639994, 7.679994, 7.719994, 7.759994, 7.799994, 7.839994, 7.879994, 7.919994, 7.959994, 7.999994, 8.039994, 8.079994, 8.119994, 8.159994, 8.199994, 8.239994, 8.279994, 8.319994, 8.359994, 8.399994, 8.439994, 8.479994, 8.519994, 8.559994, 8.599994, 8.639994, 8.679994, 8.719994, 8.759994, 8.7999935, 8.8399935, 8.879993, 8.919993, 8.959993, 8.999993, 9.039993, 9.079993, 9.119993, 9.159993, 9.199993, 9.239993, 9.279993, 9.319993, 9.359993, 9.399993, 9.439993, 9.479993, 9.519993, 9.559993, 9.599993, 9.639993, 9.679993, 9.719993, 9.759993, 9.799993, 9.839993, 9.8799925, 9.919992, 9.959992, 9.999992, 10.039992, 10.079992, 10.119992, 10.159992, 10.199992, 10.239992, 10.279992, 10.319992, 10.359992, 10.399992, 10.439992, 10.479992, 
10.519992, 10.559992, 10.599992, 10.639992, 10.679992, 10.719992, 10.759992, 10.799992, 10.839992, 10.879992, 10.9199915, 10.959991, 10.999991, 11.039991, 11.079991, 11.119991, 11.159991, 11.199991, 11.239991, 11.279991, 11.319991, 11.359991, 11.399991, 11.439991, 11.479991, 11.519991, 11.559991, 11.599991, 11.639991, 11.679991, 11.719991, 11.759991, 11.799991, 11.839991, 11.879991, 11.919991, 11.9599905, 11.99999, 12.03999, 12.07999, 12.11999, 12.15999, 12.19999, 12.23999, 12.27999, 12.31999, 12.35999, 12.39999, 12.43999, 12.47999, 12.51999, 12.55999, 12.59999, 12.63999, 12.67999, 12.71999, 12.75999, 12.79999, 12.83999, 12.87999, 12.91999, 12.95999, 12.9999895, 13.039989, 13.079989, 13.119989, 13.159989, 13.199989, 13.239989, 13.279989, 13.319989, 13.359989, 13.399989, 13.439989, 13.479989, 13.519989, 13.559989, 13.599989, 13.639989, 13.679989, 13.719989, 13.759989, 13.799989, 13.839989, 13.879989, 13.919989, 13.959989, 13.999989, 14.0399885, 14.0799885, 14.119988, 14.159988, 14.199988, 14.239988, 14.279988, 14.319988, 14.359988, 14.399988, 14.439988, 14.479988, 14.519988, 14.559988, 14.599988, 14.639988, 14.679988, 14.719988, 14.759988, 14.799988, 14.839988, 14.879988, 14.919988, 14.959988, 14.999988, 15.039988, 15.079988, 15.1199875, 15.159987, 15.199987, 15.239987, 15.279987, 15.319987, 15.359987, 15.399987, 15.439987, 15.479987, 15.519987, 15.559987, 15.599987, 15.639987, 15.679987, 15.719987, 15.759987, 15.799987, 15.839987, 15.879987, 15.919987, 15.959987, 15.999987, 16.039988, 16.079988, 16.11999, 16.15999, 16.199991, 16.239992, 16.279993, 16.319994, 16.359995, 16.399996, 16.439997, 16.479998, 16.519999, 16.56, 16.6, 16.640001, 16.680002, 16.720003, 16.760004, 16.800005, 16.840006, 16.880007, 16.920008, 16.960009, 17.00001, 17.04001, 17.080011, 17.120012, 17.160013, 17.200014, 17.240015, 17.280016, 17.320017, 17.360018, 17.400019, 17.44002, 17.48002, 17.520021, 17.560022, 17.600023, 17.640024, 17.680025, 17.720026, 17.760027, 17.800028, 17.840029, 17.88003, 
17.92003, 17.960032, 18.000032, 18.040033, 18.080034, 18.120035, 18.160036, 18.200037, 18.240038, 18.280039, 18.32004, 18.36004, 18.400042, 18.440042, 18.480043, 18.520044, 18.560045, 18.600046, 18.640047, 18.680048, 18.720049, 18.76005, 18.80005, 18.840052, 18.880053, 18.920053, 18.960054, 19.000055, 19.040056, 19.080057, 19.120058, 19.160059, 19.20006, 19.24006, 19.280062, 19.320063, 19.360064, 19.400064, 19.440065, 19.480066, 19.520067, 19.560068, 19.600069, 19.64007, 19.68007, 19.720072, 19.760073, 19.800074, 19.840075, 19.880075, 19.920076, 19.960077, 20.000078, 20.04008, 20.08008, 20.12008, 20.160082, 20.200083, 20.240084, 20.280085, 20.320086, 20.360086, 20.400087, 20.440088, 20.48009, 20.52009, 20.560091, 20.600092, 20.640093, 20.680094, 20.720095, 20.760096, 20.800097, 20.840097, 20.880098, 20.9201, 20.9601, 21.000101, 21.040102, 21.080103, 21.120104, 21.160105, 21.200106, 21.240107, 21.280107, 21.320108, 21.36011, 21.40011, 21.440111, 21.480112, 21.520113, 21.560114, 21.600115, 21.640116, 21.680117, 21.720118, 21.760118, 21.80012, 21.84012, 21.880121, 21.920122, 21.960123, 22.000124, 22.040125, 22.080126, 22.120127, 22.160128, 22.200129, 22.24013, 22.28013, 22.320131, 22.360132, 22.400133, 22.440134, 22.480135, 22.520136, 22.560137, 22.600138, 22.640139, 22.68014, 22.72014, 22.760141, 22.800142, 22.840143, 22.880144, 22.920145, 22.960146, 23.000147, 23.040148, 23.080149, 23.12015, 23.16015, 23.200151, 23.240152, 23.280153, 23.320154, 23.360155, 23.400156, 23.440157, 23.480158, 23.520159, 23.56016, 23.60016, 23.640162, 23.680162, 23.720163, 23.760164, 23.800165, 23.840166, 23.880167, 23.920168, 23.960169, 24.00017, 24.04017, 24.080172, 24.120173, 24.160173, 24.200174, 24.240175, 24.280176, 24.320177, 24.360178, 24.400179, 24.44018, 24.48018, 24.520182, 24.560183, 24.600183, 24.640184, 24.680185, 24.720186, 24.760187, 24.800188, 24.840189, 24.88019, 24.92019, 24.960192, 25.000193, 25.040194, 25.080194, 25.120195, 25.160196, 25.200197, 25.240198, 25.2802, 
25.3202, 25.3602, 25.400202, 25.440203, 25.480204, 25.520205, 25.560205, 25.600206, 25.640207, 25.680208, 25.72021, 25.76021, 25.80021, 25.840212, 25.880213, 25.920214, 25.960215, 26.000216, 26.040216, 26.080217, 26.120218, 26.16022, 26.20022, 26.240221, 26.280222, 26.320223, 26.360224, 26.400225, 26.440226, 26.480227, 26.520227, 26.560228, 26.60023, 26.64023, 26.680231, 26.720232, 26.760233, 26.800234, 26.840235, 26.880236, 26.920237, 26.960238, 27.000238, 27.04024, 27.08024, 27.120241, 27.160242, 27.200243, 27.240244, 27.280245, 27.320246, 27.360247, 27.400248, 27.440248, 27.48025, 27.52025, 27.560251, 27.600252, 27.640253, 27.680254, 27.720255, 27.760256, 27.800257, 27.840258, 27.880259, 27.92026, 27.96026, 28.000261, 28.040262, 28.080263, 28.120264, 28.160265, 28.200266, 28.240267, 28.280268, 28.320269, 28.36027, 28.40027, 28.440271, 28.480272, 28.520273, 28.560274, 28.600275, 28.640276, 28.680277, 28.720278, 28.760279, 28.80028, 28.84028, 28.880281, 28.920282, 28.960283, 29.000284, 29.040285, 29.080286, 29.120287, 29.160288, 29.200289, 29.24029, 29.28029, 29.320292, 29.360292, 29.400293, 29.440294, 29.480295, 29.520296, 29.560297, 29.600298, 29.640299, 29.6803, 29.7203, 29.760302, 29.800303, 29.840303, 29.880304, 29.920305, 29.960306, 30.000307, 30.040308, 30.080309, 30.12031, 30.16031, 30.200312, 30.240313, 30.280313, 30.320314, 30.360315, 30.400316, 30.440317, 30.480318, 30.520319, 30.56032, 30.60032, 30.640322, 30.680323, 30.720324, 30.760324, 30.800325, 30.840326, 30.880327, 30.920328, 30.96033, 31.00033, 31.04033, 31.080332, 31.120333, 31.160334, 31.200335, 31.240335, 31.280336, 31.320337, 31.360338, 31.40034, 31.44034, 31.480341, 31.520342, 31.560343, 31.600344, 31.640345, 31.680346, 31.720346, 31.760347, 31.800348, 31.84035, 31.88035, 31.920351, 31.960352, 32.00035, 32.04035, 32.080353, 32.120354, 32.160355, 32.200356, 32.240356, 32.280357, 32.32036, 32.36036, 32.40036, 32.44036, 32.480362, 32.520363, 32.560364, 32.600365, 32.640366, 32.680367, 
32.720367, 32.76037, 32.80037, 32.84037, 32.88037, 32.920372, 32.960373, 33.000374, 33.040375, 33.080376, 33.120377, 33.160378, 33.20038, 33.24038, 33.28038, 33.32038, 33.360382, 33.400383, 33.440384, 33.480385, 33.520386, 33.560387, 33.600388, 33.64039, 33.68039, 33.72039, 33.76039, 33.800392, 33.840393, 33.880394, 33.920395, 33.960396, 34.000397, 34.040398, 34.0804, 34.1204, 34.1604, 34.2004, 34.240402, 34.280403, 34.320404, 34.360405, 34.400406, 34.440407, 34.480408, 34.52041, 34.56041, 34.60041, 34.64041, 34.680412, 34.720413, 34.760414, 34.800415, 34.840416, 34.880417, 34.920418, 34.96042, 35.00042, 35.04042, 35.08042, 35.120422, 35.160423, 35.200424, 35.240425, 35.280426, 35.320427, 35.360428, 35.40043, 35.44043, 35.48043, 35.52043, 35.560432, 35.600433, 35.640434, 35.680435, 35.720436, 35.760437, 35.800438, 35.84044, 35.88044, 35.92044, 35.96044, 36.000443, 36.040443, 36.080444, 36.120445, 36.160446, 36.200447, 36.240448, 36.28045, 36.32045, 36.36045, 36.40045, 36.440453, 36.480453, 36.520454, 36.560455, 36.600456, 36.640457, 36.680458, 36.72046, 36.76046, 36.80046, 36.84046, 36.880463, 36.920464, 36.960464, 37.000465, 37.040466, 37.080467, 37.12047, 37.16047, 37.20047, 37.24047, 37.28047, 37.320473, 37.360474, 37.400475, 37.440475, 37.480476, 37.520477, 37.56048, 37.60048, 37.64048, 37.68048, 37.72048, 37.760483, 37.800484, 37.840485, 37.880486, 37.920486, 37.960487, 38.00049, 38.04049, 38.08049, 38.12049, 38.160492, 38.200493, 38.240494, 38.280495, 38.320496, 38.360497, 38.400497, 38.4405, 38.4805, 38.5205, 38.5605, 38.600502, 38.640503, 38.680504, 38.720505, 38.760506, 38.800507, 38.840508, 38.88051, 38.92051, 38.96051, 39.00051, 39.040512, 39.080513, 39.120514, 39.160515, 39.200516, 39.240517, 39.280518, 39.32052, 39.36052, 39.40052, 39.44052, 39.480522, 39.520523, 39.560524, 39.600525, 39.640526, 39.680527, 39.720528, 39.76053, 39.80053, 39.84053, 39.88053, 39.920532, 39.960533]}, {"y": [0.0, 0.010275189, 0.5919021, 0.036447108, -0.5186413, 0.063735366, 
-0.49579382, 0.08358312, -0.47979593, -1.047437, -0.47947115, 0.08425534, 0.64874625, 1.2188451, 0.65495133, 1.2388353, 1.2619461, 0.7312988, 1.3303951, 1.3797325, 0.8955504, 0.4255738, -0.037404686, 0.033393748, -0.4296503, -0.8997574, -0.295861, -0.7841549, -0.18108445, -0.68323296, -0.08172017, 0.51964605, 0.014134049, -0.49117744, -0.4459868, -0.9640028, -1.4943563, -0.90450126, -0.31964856, -0.87914205, -0.30257607, -0.30045423, 0.27140686, -0.29890785, -0.87165296, -0.30753064, 0.25375336, 0.24558337, -0.33242244, -0.34130073, 0.21833569, 0.7800392, 0.20449859, -0.36946875, 0.19777131, -0.3775373, 0.18812078, 0.7554311, 0.18466574, 0.18761295, 0.19210126, -0.37383235, -0.9429959, -0.37566215, 0.1883387, 0.75403106, 0.18174535, 0.7553005, 0.19103718, -0.37147737, -0.36536372, -0.934401, -1.5108074, -0.95734286, -0.41411573, 0.12436801, -0.46912694, 0.065354645, 0.03495699, 0.5706269, 0.5452192, -0.044453144, 0.5031913, 0.48594245, 1.0437146, 0.46704984, -0.10603905, -0.67996264, -0.11543542, -0.69531417, -0.7092636, -0.15954977, -0.74924165, -1.3424687, -0.8184492, -0.30526328, 0.20370746, -0.39782208, 0.108502865, -0.49364573, 0.00733608, 0.50842416, 1.0167477, 0.4174745, 0.94436276, 0.35023475, 0.892306, 1.4440526, 0.86590827, 0.29434562, -0.2747776, -0.84618133, -1.4240464, -0.87170094, -0.8964732, -0.36380917, -0.9613208, -0.4451931, 0.0648523, 0.01894192, 0.53013074, 1.0484962, 0.45199734, -0.1428042, -0.17339501, 0.35950813, 0.3304509, 0.8714132, 0.28299493, -0.3038423, -0.89237326, -0.35187668, 0.18451267, -0.4095508, 0.12420082, -0.4711247, 0.05847037, 0.5887905, -0.0057449937, -0.03478267, -0.62949026, -0.1001485, 0.42794156, -0.16808718, 0.36315578, -0.23205346, 0.30075705, -0.294154, -0.89019084, -0.3681425, 0.14904904, -0.45099348, 0.062228978, 0.5762999, 1.0980402, 0.5028337, -0.09029454, 0.44945037, -0.14189345, 0.4012974, -0.18867326, -0.21166676, 0.33153236, -0.2588128, 0.28518826, -0.3050152, -0.89671576, -0.3630982, 0.16606307, 0.69727993, 
0.10359514, 0.64413685, 0.054215193, 0.60240316, 1.1568799, 0.57858837, 0.004614711, 0.57502776, 0.0057971478, 0.008785939, 0.583937, 0.01995629, 0.59959465, 0.04142046, -0.51634896, -1.079023, -0.50667703, -0.51065016, 0.05297643, -0.5264238, 0.032779932, 0.592299, 1.1573012, 1.7318343, 1.178107, 0.6370449, 0.103469014, -0.42884958, 0.1656723, -0.36975545, 0.22383279, -0.31329167, -0.2867708, 0.3049768, 0.33164278, -0.20505843, 0.3887873, -0.14576831, -0.68206465, -0.09039682, -0.6353908, -0.047682524, -0.029848278, -0.58237165, 0.0012610555, 0.58490175, 0.60246557, 0.05631715, 0.6472654, 0.108947754, 0.7035455, 0.17483771, -0.3516698, 0.24518383, 0.28055525, -0.24378037, 0.3541335, -0.1688149, 0.42974496, -0.089880586, -0.6106886, -0.013122916, 0.021772183, -0.5065544, 0.08905971, 0.12075459, -0.41115737, -0.9480115, -0.35803455, 0.23004341, 0.81935936, 0.8464473, 0.31660867, 0.9148074, 0.40054625, -0.10819721, -0.6184471, -1.1370807, -1.105207, -0.5142302, 0.07409549, -0.4757763, 0.110162616, 0.69674647, 0.1503756, 0.7416902, 0.20530474, -0.32862467, 0.2657414, -0.268929, 0.32542264, -0.20858228, -0.7450808, -0.15383452, -0.7003839, -0.11392254, 0.4718691, 1.0603323, 1.0881237, 0.5624061, 0.043999016, 0.64429164, 1.2458086, 1.3016868, 0.82910967, 0.37018055, -0.0824675, -0.53652465, -0.9996115, -1.4788622, -0.87646836, -0.27519196, 0.32554197, -0.18609232, -0.14210397, -0.65814453, -0.6206942, -0.024282932, -0.5566321, -1.0956061, -0.50750417, -0.4927402, 0.08922565, -0.4706418, 0.10866165, 0.6887221, 0.13200372, 0.14725156, 0.73365647, 0.18737125, -0.35684186, -0.90496767, -0.3207214, -0.8811299, -1.4494421, -2.0287697, -1.4850092, -0.95881885, -0.44521356, 0.062229514, 0.5705459, -0.030164003, 0.48536903, 1.007334, 1.5418619, 0.9542161, 0.37229264, 0.3645753, 0.35984856, 0.9302578, 0.3641106, 0.9444241, 0.39063668, -0.15908283, -0.7104798, -0.12697715, -0.6866821, -1.2526889, -0.68488055, -1.2660105, -0.7167938, -0.17547703, 0.36381942, -0.22816622, 
-0.8211944, -0.2898538, 0.23788488, 0.76858634, 0.17504734, 0.7167076, 1.2662444, 0.68532544, 1.2534013, 0.68767583, 0.12828499, -0.4298638, -0.99210024, -0.418675, -0.99293816, -0.43169725, 0.12538904, -0.4582569, 0.0955174, -0.4904275, 0.05927956, -0.52921975, -0.5527984, -0.014312446, -0.042567555, -0.63687694, -0.10654819, -0.14116032, 0.38611096, -0.21036059, 0.31906044, -0.27697176, 0.25295997, -0.34317034, -0.940492, -0.42339152, -0.46896425, 0.036000043, 0.541476, 0.49679416, 1.0162838, 0.41998947, -0.17470759, 0.36015546, 0.89926827, 0.3100921, 0.8622801, 0.28031158, -0.29976618, 0.26069474, -0.3196584, -0.33110636, 0.22559434, 0.7845026, 0.20676875, -0.36944067, 0.19523889, -0.38228, -0.39102513, 0.16828847, -0.41353774, 0.14340067, -0.4402269, -0.45660353, -1.0455453, -0.5076798, 0.02410543, -0.5723089, -0.04722202, -0.64590317, -0.12966675, 0.38481778, -0.21514452, 0.30156624, -0.29813945, -0.8985501, -0.94771, -0.45633784, 0.028062105, 0.5128964, 1.0055403, 1.5127672, 0.9151408, 1.4527836, 0.8661588, 0.28487474, -0.29445535, -0.8758035, -0.3228252, 0.22677499, -0.36082155, -0.95036477, -1.5442784, -1.024998, -1.6266755, -1.1442943, -0.6802794, -0.22771841, 0.22092322, 0.6733681, 1.1371562, 0.53343564, 0.47549757, 0.42124262, -0.1816769, 0.31777713, 0.8218175, 1.3372558, 0.74113774, 1.2818367, 0.69576323, 0.11400974, -0.46698034, -1.0509986, -0.504081, 0.037216187, 0.57893795, -0.0107726455, 0.53715014, 0.5203953, -0.06301516, 0.49502456, 1.057758, 0.48529196, 0.48897022, -0.07518065, 0.50364494, 1.0858959, 0.5364973, 0.5609145, 0.023711562, -0.5132116, -1.0560546, -0.46973848, -1.0283455, -0.45221597, -1.0239553, -0.46009392, -0.47181916, -1.0575178, -0.5135736, -1.107615, -0.58265007, -0.06529772, -0.66586745, -0.15824968, 0.34713298, 0.8573861, 1.379148, 0.7852871, 1.3322921, 1.8927246, 1.897016, 1.347206, 0.81249386, 0.28783864, 0.88732517, 0.37794453, -0.12610584, -0.63195044, -1.1465483, -0.54936594, 0.04584819, 0.64124066, 0.113977015, 0.7122704, 
0.19547498, -0.31868124, 0.28154784, 0.882389, 0.37958056, 0.98273414, 1.0435076, 0.57650614, 0.6499632, 0.20331079, -0.23979002, -0.6870763, -1.1460975, -1.0823829, -0.47899115, 0.12415546, -0.37390608, -0.8773868, -0.2766546, -0.7955769, -1.3247055, -0.73339206, -0.1458627, 0.4408363, 1.0299662, 0.49230444, 1.0886269, 0.5702512, 1.1716162, 1.7740679, 1.3054819, 0.8588613, 0.427443, 1.0252987, 0.61975026, 0.22610283, -0.16314113, -0.5555657, 0.038012385, -0.3640089, -0.7729492, -1.1962781, -0.5953055, 0.0073860884, -0.44595912, 0.15742376, -0.30063856, -0.7637336, -1.2392648, -0.63606447, -0.03327161, -0.53644407, -1.0471697, -0.44857574, -0.9783848, -0.38553786, 0.20550472, 0.79753166, 0.2634635, 0.29681987, 0.33256257, -0.19052416, 0.4079253, -0.1123991, -0.63419515, -0.03701216, -0.0031216294, 0.59391737, 0.63243276, 0.11779195, -0.3952446, 0.20500398, -0.3105449, -0.8302374, -0.2332198, -0.76605517, -0.17335397, -0.14889872, -0.6941189, -0.10688406, 0.4797384, 0.5003197, -0.04291469, 0.5489074, 0.011385739, -0.52600193, -0.50142545, -1.050253, -0.4674, 0.11232889, -0.45033586, -1.0170467, -0.44790745, 0.11734474, -0.46027607, 0.10185462, -0.47835714, 0.08021361, -0.50272155, 0.051689923, -0.5341327, 0.015319586, 0.564935, -0.02059853, -0.60625523, -0.05728829, -0.6468238, -0.105519235, -0.6988947, -1.2950221, -0.7789909, -0.27371728, 0.22763348, 0.7322519, 0.13130337, 0.64752954, 0.04885167, -0.5496839, -0.031567276, 0.48612875, 0.4482719, 0.9778476, 0.95167404, 0.36374223, -0.22210926, 0.33054018, -0.2546363, 0.29878598, -0.2861122, -0.8727093, -1.4638754, -0.9362054, -0.42070693, 0.08902931, 0.5999999, -7.6413155e-5, -0.6001529, -0.08918542, -0.6909821, -0.1907385, 0.30671647, 0.8086387, 0.20742708, -0.39337343, -0.9949093, -0.49815333, -0.555994, -0.6181014, -0.14663455, -0.21478167, -0.8183415, -1.4200926, -0.9910482, -0.5803027, -1.1725056, -0.7943779, -0.43222106, -0.0789015, 0.27279416, 0.63009065, 0.046031713, 0.4165213, 0.79533434, 1.1897497, 
0.59376407, 0.506798, 0.9422202, 0.34001166, -0.26305497, 0.18930426, 0.64489245, 0.04121822, 0.5078529, 0.98272586, 1.4728837, 0.87175506, 1.3938469, 1.366266, 0.779713, 0.19795507, -0.38246882, -0.9654277, -0.4158616, -0.4391763, 0.10099664, 0.6423285, 0.052781224, -0.53648925, 0.006598592, 0.5497608, 1.0989437, 0.516644, -0.062143266, 0.5021725, 1.0708861, 1.0762005, 1.0902485, 0.5447721, 0.57224, 0.039587796, -0.49258155, -1.0306388, -0.44170415, -0.9956093, -0.41574204, -0.98288405, -1.5579817, -1.0032566, -0.45914644, 0.07973248, -0.5138059, 0.020030499, 0.5541087, 1.0947015, 1.0771073, 0.4967332, 0.49186224, -0.081163526, -0.0826706, 0.4873354, 1.0612676, 0.50027514, -0.05587268, -0.04169446, -0.5988056, -0.018389344, 0.5619008, 1.1458629, 0.59989315, 0.6274322, 1.2236418, 0.70683765, 0.19973493, -0.3045361, -0.813109, -1.3327119, -1.8689085, -1.2841227, -0.70784944, -0.7093007, -0.14468938, 0.41860688, 0.9856413, 0.4165489, 0.99479055, 0.43861932, 1.0262425, 0.48551792, -0.04951918, 0.54564995, 0.016540468, -0.51236266, 0.08302599, 0.114244245, 0.7104124, 1.3088603, 0.8016553, 0.30601466, -0.18504849, 0.41862008, -0.06901041, -0.5576949, 0.04571694, -0.45040363, -0.9530899, -0.35244048, 0.24740607, -0.2693985, 0.33038253, -0.18562514, 0.41457742, 1.015651, 0.5164041, 0.024812073, -0.4664058, -0.96455973, -0.36315894, -0.8794028, -0.2815982, -0.8124123, -0.21913224, 0.37316072, 0.9670996, 0.4399951, 0.47940582, 1.080012, 0.57923007, 0.08699331, -0.4039325, 0.19950801, -0.29438764, 0.30893898, -0.18474644, -0.68117666, -0.078813195, -0.030627482, 0.5715585, 0.07204214, -0.42642143, 0.17620352, -0.32578292, -0.8324168, -1.3504446, -0.7551892, -0.1631031, -0.707957, -0.12060672, -0.10385254, 0.48217484, 1.0709082, 0.53291637, 0.0013072491, -0.5302859, -0.5017533, -0.4772839, -0.45669392, 0.1300725, -0.4219766, 0.16299096, -0.39168665, 0.1918098, 0.7765027, 0.22773558, -0.31856304, -0.86829853, -0.28479797, 0.29685426, -0.26147056, -0.82230246, -1.3905027, 
-0.8261269, -0.26954323, 0.2843287, -0.30037934, -0.88687843, -0.3424608, 0.19808352, 0.74088407, 1.2917378, 0.7118773, 0.13719583, -0.43641812, -1.0133502, -0.45562738, 0.09752709, 0.081247464, 0.63621366, 0.054557085, 0.6161923, 1.1833742, 0.6161275, 0.054421842, 0.6360258, 0.0809809, -0.47324383, -0.46093205, 0.1191757, -0.44293198, 0.13483617, -0.4300821, 0.14545915, 0.72209203, 0.7326346, 0.17886752, 0.19692357, -0.3529536, -0.9065085, -1.468859, -0.8994328, -0.3379526, -0.9215696, -0.37259066, 0.1723311, 0.7191487, 0.13295484, -0.45244962, 0.09856528, 0.08059027, 0.63345814, 0.050385952, 0.038671095, -0.54381686, -1.1296707, -0.5868221, -0.0507614, 0.48469448, 1.025811, 0.43838876, -0.14646864, -0.73220885, -0.18474144, 0.36070412, -0.22835606, 0.31851965, 0.8688142, 0.28566766, 0.8473443, 0.27230382, -0.3006131, 0.26835603, 0.8395587, 0.2734533, -0.29019827, -0.8564512, -0.28552645, 0.28300905, -0.29045498, -0.86615926, -0.30546707, 0.25228477, -0.33007067, -0.3439861, 0.20957744, -0.3757934, 0.17607284, 0.729756, 1.2905934, 0.7183624, 0.1521588, 0.15922381, -0.40418333, 0.17282307, -0.3926713, 0.18266225, -0.38468087, -0.38309523, -0.95679116, -0.3945424, -0.40718007, 0.14731485, -0.43786567, 0.113670886, -0.4734906, 0.074270606, -0.5151796, 0.027788639, 0.5710698, -0.017906904, -0.6069789, -0.064290166, 0.47767007, 0.45596763, 1.0079994, 0.99832815, 0.42463464, -0.14563608, -0.7171048, -1.2941056, -0.73925674, -0.19214052, 0.35286564, 0.9017154, 0.31788814, -0.26388896, -0.8473767, -1.4357886, -0.9015481, -0.37853515, 0.13950962, -0.46036625, 0.05347663, 0.56804645, 0.5289552, 1.057626, 0.4647318, 0.44218165, -0.14608634, -0.16625838, -0.7562249, -0.2170772, 0.31952465, 0.29268906, -0.29922578, 0.24096516, 0.78390265, 0.19598621, 0.7493163, 0.16733187, -0.4135337, -0.99710304, -0.44885945, -1.0407242, -0.50929, 0.015768051, 0.5410276, -0.05546063, -0.65214497, -0.12895352, 0.39256233, -0.20561612, -0.80440015, -0.29082358, -0.33747983, 0.1674729, 
0.67479515, 1.1913807, 0.59496444, 1.133382, 0.545225, -0.039795697], "type": "scatter", "name": "Î¸Ì‡", "yaxis": "y2", "x": [0.0, 0.04, 0.08, 0.12, 0.16, 0.19999999, 0.23999998, 0.27999997, 0.31999996, 0.35999995, 0.39999995, 0.43999994, 0.47999993, 0.5199999, 0.55999994, 0.59999996, 0.64, 0.68, 0.72, 0.76000005, 0.8000001, 0.8400001, 0.8800001, 0.92000014, 0.96000016, 1.0000001, 1.0400001, 1.08, 1.12, 1.16, 1.1999999, 1.2399999, 1.2799999, 1.3199998, 1.3599998, 1.3999997, 1.4399997, 1.4799997, 1.5199996, 1.5599996, 1.5999995, 1.6399995, 1.6799995, 1.7199994, 1.7599994, 1.7999994, 1.8399993, 1.8799993, 1.9199992, 1.9599992, 1.9999992, 2.0399992, 2.0799992, 2.1199992, 2.1599991, 2.199999, 2.239999, 2.279999, 2.319999, 2.359999, 2.399999, 2.4399989, 2.4799988, 2.5199988, 2.5599988, 2.5999987, 2.6399987, 2.6799986, 2.7199986, 2.7599986, 2.7999985, 2.8399985, 2.8799984, 2.9199984, 2.9599984, 2.9999983, 3.0399983, 3.0799983, 3.1199982, 3.1599982, 3.1999981, 3.239998, 3.279998, 3.319998, 3.359998, 3.399998, 3.439998, 3.4799979, 3.5199978, 3.5599978, 3.5999978, 3.6399977, 3.6799977, 3.7199976, 3.7599976, 3.7999976, 3.8399975, 3.8799975, 3.9199975, 3.9599974, 3.9999974, 4.0399976, 4.0799975, 4.1199975, 4.1599975, 4.1999974, 4.2399974, 4.2799973, 4.3199973, 4.3599973, 4.399997, 4.439997, 4.479997, 4.519997, 4.559997, 4.599997, 4.639997, 4.679997, 4.719997, 4.759997, 4.799997, 4.839997, 4.879997, 4.9199967, 4.9599967, 4.9999967, 5.0399966, 5.0799966, 5.1199965, 5.1599965, 5.1999965, 5.2399964, 5.2799964, 5.3199964, 5.3599963, 5.3999963, 5.4399962, 5.479996, 5.519996, 5.559996, 5.599996, 5.639996, 5.679996, 5.719996, 5.759996, 5.799996, 5.839996, 5.879996, 5.919996, 5.9599957, 5.9999957, 6.0399957, 6.0799956, 6.1199956, 6.1599956, 6.1999955, 6.2399955, 6.2799954, 6.3199954, 6.3599954, 6.3999953, 6.4399953, 6.4799953, 6.519995, 6.559995, 6.599995, 6.639995, 6.679995, 6.719995, 6.759995, 6.799995, 6.839995, 6.879995, 6.919995, 6.959995, 6.9999948, 7.0399947, 7.0799947, 
7.1199946, 7.1599946, 7.1999946, 7.2399945, 7.2799945, 7.3199944, 7.3599944, 7.3999944, 7.4399943, 7.4799943, 7.5199943, 7.559994, 7.599994, 7.639994, 7.679994, 7.719994, 7.759994, 7.799994, 7.839994, 7.879994, 7.919994, 7.959994, 7.999994, 8.039994, 8.079994, 8.119994, 8.159994, 8.199994, 8.239994, 8.279994, 8.319994, 8.359994, 8.399994, 8.439994, 8.479994, 8.519994, 8.559994, 8.599994, 8.639994, 8.679994, 8.719994, 8.759994, 8.7999935, 8.8399935, 8.879993, 8.919993, 8.959993, 8.999993, 9.039993, 9.079993, 9.119993, 9.159993, 9.199993, 9.239993, 9.279993, 9.319993, 9.359993, 9.399993, 9.439993, 9.479993, 9.519993, 9.559993, 9.599993, 9.639993, 9.679993, 9.719993, 9.759993, 9.799993, 9.839993, 9.8799925, 9.919992, 9.959992, 9.999992, 10.039992, 10.079992, 10.119992, 10.159992, 10.199992, 10.239992, 10.279992, 10.319992, 10.359992, 10.399992, 10.439992, 10.479992, 10.519992, 10.559992, 10.599992, 10.639992, 10.679992, 10.719992, 10.759992, 10.799992, 10.839992, 10.879992, 10.9199915, 10.959991, 10.999991, 11.039991, 11.079991, 11.119991, 11.159991, 11.199991, 11.239991, 11.279991, 11.319991, 11.359991, 11.399991, 11.439991, 11.479991, 11.519991, 11.559991, 11.599991, 11.639991, 11.679991, 11.719991, 11.759991, 11.799991, 11.839991, 11.879991, 11.919991, 11.9599905, 11.99999, 12.03999, 12.07999, 12.11999, 12.15999, 12.19999, 12.23999, 12.27999, 12.31999, 12.35999, 12.39999, 12.43999, 12.47999, 12.51999, 12.55999, 12.59999, 12.63999, 12.67999, 12.71999, 12.75999, 12.79999, 12.83999, 12.87999, 12.91999, 12.95999, 12.9999895, 13.039989, 13.079989, 13.119989, 13.159989, 13.199989, 13.239989, 13.279989, 13.319989, 13.359989, 13.399989, 13.439989, 13.479989, 13.519989, 13.559989, 13.599989, 13.639989, 13.679989, 13.719989, 13.759989, 13.799989, 13.839989, 13.879989, 13.919989, 13.959989, 13.999989, 14.0399885, 14.0799885, 14.119988, 14.159988, 14.199988, 14.239988, 14.279988, 14.319988, 14.359988, 14.399988, 14.439988, 14.479988, 14.519988, 14.559988, 14.599988, 14.639988, 
14.679988, 14.719988, 14.759988, 14.799988, 14.839988, 14.879988, 14.919988, 14.959988, 14.999988, 15.039988, 15.079988, 15.1199875, 15.159987, 15.199987, 15.239987, 15.279987, 15.319987, 15.359987, 15.399987, 15.439987, 15.479987, 15.519987, 15.559987, 15.599987, 15.639987, 15.679987, 15.719987, 15.759987, 15.799987, 15.839987, 15.879987, 15.919987, 15.959987, 15.999987, 16.039988, 16.079988, 16.11999, 16.15999, 16.199991, 16.239992, 16.279993, 16.319994, 16.359995, 16.399996, 16.439997, 16.479998, 16.519999, 16.56, 16.6, 16.640001, 16.680002, 16.720003, 16.760004, 16.800005, 16.840006, 16.880007, 16.920008, 16.960009, 17.00001, 17.04001, 17.080011, 17.120012, 17.160013, 17.200014, 17.240015, 17.280016, 17.320017, 17.360018, 17.400019, 17.44002, 17.48002, 17.520021, 17.560022, 17.600023, 17.640024, 17.680025, 17.720026, 17.760027, 17.800028, 17.840029, 17.88003, 17.92003, 17.960032, 18.000032, 18.040033, 18.080034, 18.120035, 18.160036, 18.200037, 18.240038, 18.280039, 18.32004, 18.36004, 18.400042, 18.440042, 18.480043, 18.520044, 18.560045, 18.600046, 18.640047, 18.680048, 18.720049, 18.76005, 18.80005, 18.840052, 18.880053, 18.920053, 18.960054, 19.000055, 19.040056, 19.080057, 19.120058, 19.160059, 19.20006, 19.24006, 19.280062, 19.320063, 19.360064, 19.400064, 19.440065, 19.480066, 19.520067, 19.560068, 19.600069, 19.64007, 19.68007, 19.720072, 19.760073, 19.800074, 19.840075, 19.880075, 19.920076, 19.960077, 20.000078, 20.04008, 20.08008, 20.12008, 20.160082, 20.200083, 20.240084, 20.280085, 20.320086, 20.360086, 20.400087, 20.440088, 20.48009, 20.52009, 20.560091, 20.600092, 20.640093, 20.680094, 20.720095, 20.760096, 20.800097, 20.840097, 20.880098, 20.9201, 20.9601, 21.000101, 21.040102, 21.080103, 21.120104, 21.160105, 21.200106, 21.240107, 21.280107, 21.320108, 21.36011, 21.40011, 21.440111, 21.480112, 21.520113, 21.560114, 21.600115, 21.640116, 21.680117, 21.720118, 21.760118, 21.80012, 21.84012, 21.880121, 21.920122, 21.960123, 22.000124, 22.040125, 
22.080126, 22.120127, 22.160128, 22.200129, 22.24013, 22.28013, 22.320131, 22.360132, 22.400133, 22.440134, 22.480135, 22.520136, 22.560137, 22.600138, 22.640139, 22.68014, 22.72014, 22.760141, 22.800142, 22.840143, 22.880144, 22.920145, 22.960146, 23.000147, 23.040148, 23.080149, 23.12015, 23.16015, 23.200151, 23.240152, 23.280153, 23.320154, 23.360155, 23.400156, 23.440157, 23.480158, 23.520159, 23.56016, 23.60016, 23.640162, 23.680162, 23.720163, 23.760164, 23.800165, 23.840166, 23.880167, 23.920168, 23.960169, 24.00017, 24.04017, 24.080172, 24.120173, 24.160173, 24.200174, 24.240175, 24.280176, 24.320177, 24.360178, 24.400179, 24.44018, 24.48018, 24.520182, 24.560183, 24.600183, 24.640184, 24.680185, 24.720186, 24.760187, 24.800188, 24.840189, 24.88019, 24.92019, 24.960192, 25.000193, 25.040194, 25.080194, 25.120195, 25.160196, 25.200197, 25.240198, 25.2802, 25.3202, 25.3602, 25.400202, 25.440203, 25.480204, 25.520205, 25.560205, 25.600206, 25.640207, 25.680208, 25.72021, 25.76021, 25.80021, 25.840212, 25.880213, 25.920214, 25.960215, 26.000216, 26.040216, 26.080217, 26.120218, 26.16022, 26.20022, 26.240221, 26.280222, 26.320223, 26.360224, 26.400225, 26.440226, 26.480227, 26.520227, 26.560228, 26.60023, 26.64023, 26.680231, 26.720232, 26.760233, 26.800234, 26.840235, 26.880236, 26.920237, 26.960238, 27.000238, 27.04024, 27.08024, 27.120241, 27.160242, 27.200243, 27.240244, 27.280245, 27.320246, 27.360247, 27.400248, 27.440248, 27.48025, 27.52025, 27.560251, 27.600252, 27.640253, 27.680254, 27.720255, 27.760256, 27.800257, 27.840258, 27.880259, 27.92026, 27.96026, 28.000261, 28.040262, 28.080263, 28.120264, 28.160265, 28.200266, 28.240267, 28.280268, 28.320269, 28.36027, 28.40027, 28.440271, 28.480272, 28.520273, 28.560274, 28.600275, 28.640276, 28.680277, 28.720278, 28.760279, 28.80028, 28.84028, 28.880281, 28.920282, 28.960283, 29.000284, 29.040285, 29.080286, 29.120287, 29.160288, 29.200289, 29.24029, 29.28029, 29.320292, 29.360292, 29.400293, 29.440294, 
29.480295, 29.520296, 29.560297, 29.600298, 29.640299, 29.6803, 29.7203, 29.760302, 29.800303, 29.840303, 29.880304, 29.920305, 29.960306, 30.000307, 30.040308, 30.080309, 30.12031, 30.16031, 30.200312, 30.240313, 30.280313, 30.320314, 30.360315, 30.400316, 30.440317, 30.480318, 30.520319, 30.56032, 30.60032, 30.640322, 30.680323, 30.720324, 30.760324, 30.800325, 30.840326, 30.880327, 30.920328, 30.96033, 31.00033, 31.04033, 31.080332, 31.120333, 31.160334, 31.200335, 31.240335, 31.280336, 31.320337, 31.360338, 31.40034, 31.44034, 31.480341, 31.520342, 31.560343, 31.600344, 31.640345, 31.680346, 31.720346, 31.760347, 31.800348, 31.84035, 31.88035, 31.920351, 31.960352, 32.00035, 32.04035, 32.080353, 32.120354, 32.160355, 32.200356, 32.240356, 32.280357, 32.32036, 32.36036, 32.40036, 32.44036, 32.480362, 32.520363, 32.560364, 32.600365, 32.640366, 32.680367, 32.720367, 32.76037, 32.80037, 32.84037, 32.88037, 32.920372, 32.960373, 33.000374, 33.040375, 33.080376, 33.120377, 33.160378, 33.20038, 33.24038, 33.28038, 33.32038, 33.360382, 33.400383, 33.440384, 33.480385, 33.520386, 33.560387, 33.600388, 33.64039, 33.68039, 33.72039, 33.76039, 33.800392, 33.840393, 33.880394, 33.920395, 33.960396, 34.000397, 34.040398, 34.0804, 34.1204, 34.1604, 34.2004, 34.240402, 34.280403, 34.320404, 34.360405, 34.400406, 34.440407, 34.480408, 34.52041, 34.56041, 34.60041, 34.64041, 34.680412, 34.720413, 34.760414, 34.800415, 34.840416, 34.880417, 34.920418, 34.96042, 35.00042, 35.04042, 35.08042, 35.120422, 35.160423, 35.200424, 35.240425, 35.280426, 35.320427, 35.360428, 35.40043, 35.44043, 35.48043, 35.52043, 35.560432, 35.600433, 35.640434, 35.680435, 35.720436, 35.760437, 35.800438, 35.84044, 35.88044, 35.92044, 35.96044, 36.000443, 36.040443, 36.080444, 36.120445, 36.160446, 36.200447, 36.240448, 36.28045, 36.32045, 36.36045, 36.40045, 36.440453, 36.480453, 36.520454, 36.560455, 36.600456, 36.640457, 36.680458, 36.72046, 36.76046, 36.80046, 36.84046, 36.880463, 36.920464, 
36.960464, 37.000465, 37.040466, 37.080467, 37.12047, 37.16047, 37.20047, 37.24047, 37.28047, 37.320473, 37.360474, 37.400475, 37.440475, 37.480476, 37.520477, 37.56048, 37.60048, 37.64048, 37.68048, 37.72048, 37.760483, 37.800484, 37.840485, 37.880486, 37.920486, 37.960487, 38.00049, 38.04049, 38.08049, 38.12049, 38.160492, 38.200493, 38.240494, 38.280495, 38.320496, 38.360497, 38.400497, 38.4405, 38.4805, 38.5205, 38.5605, 38.600502, 38.640503, 38.680504, 38.720505, 38.760506, 38.800507, 38.840508, 38.88051, 38.92051, 38.96051, 39.00051, 39.040512, 39.080513, 39.120514, 39.160515, 39.200516, 39.240517, 39.280518, 39.32052, 39.36052, 39.40052, 39.44052, 39.480522, 39.520523, 39.560524, 39.600525, 39.640526, 39.680527, 39.720528, 39.76053, 39.80053, 39.84053, 39.88053, 39.920532, 39.960533]}]} // Get the plotly listeners const plotly_listeners = {} // Get the JS listeners const js_listeners = {} // Deal with eventual custom classes let custom_classlist = [] // Load the plotly library if (!window.Plotly) { const {plotly} = await import('https://cdn.plot.ly/plotly-2.16.1.min.js') } // Check if we have to force local mathjax font cache if (false && window?.MathJax?.config?.svg?.fontCache === 'global') { window.MathJax.config.svg.fontCache = 'local' } // Flag to check if this cell was manually ran or reactively ran const firstRun = this ? false : true const PLOT = this ?? document.createElement("div"); const parent = currentScript.parentElement const isPlutoWrapper = parent.classList.contains('raw-html-wrapper') if (firstRun) { // It seem plot divs would not autosize themself inside flexbox containers without this parent.appendChild(PLOT) } // If width is not specified, set it to 100% PLOT.style.width = plot_obj.layout.width ? "" : "100%" // For the height we have to also put a fixed value in case the plot is put on a non-fixed-size container (like the default wrapper) PLOT.style.height = plot_obj.layout.height ? "" : (isPlutoWrapper || parent.clientHeight == 0) ? 
"400px" : "100%" PLOT.classList.forEach(cn => { if (cn !== 'js-plotly-plot' && !custom_classlist.includes(cn)) { PLOT.classList.toggle(cn, false) } }) for (const className of custom_classlist) { PLOT.classList.toggle(className, true) } // Create the resizeObserver to make the plot even more responsive! :magic: const resizeObserver = new ResizeObserver(entries => { PLOT.style.height = plot_obj.layout.height ? "" : (isPlutoWrapper || parent.clientHeight == 0) ? "400px" : "100%" /* The addition of the invalid argument `plutoresize` seems to fix the problem with calling `relayout` simply with `{autosize: true}` as update breaking mouse relayout events tracking. See https://github.com/plotly/plotly.js/issues/6156 for details */ Plotly.relayout(PLOT, {..._.pick(PLOT.layout, ['width','height']), autosize: true, plutoresize: true}) }) resizeObserver.observe(PLOT) Plotly.react(PLOT, plot_obj).then(() => { // Assign the Plotly event listeners for (const [key, listener_vec] of Object.entries(plotly_listeners)) { for (const listener of listener_vec) { PLOT.on(key, listener) } } // Assign the JS event listeners for (const [key, listener_vec] of Object.entries(js_listeners)) { for (const listener of listener_vec) { PLOT.addEventListener(key, listener) } } } ) invalidation.then(() => { // Remove all plotly listeners PLOT.removeAllListeners() // Remove all JS listeners for (const [key, listener_vec] of Object.entries(js_listeners)) { for (const listener of listener_vec) { PLOT.removeEventListener(key, listener) } } // Remove the resizeObserver resizeObserver.disconnect() }) return PLOT ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•:Ðå°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$daf35bfe-8f9c-4f55-971d-4d443be8f8bf¹depends_on_disabled_cellsÂ§runtimeÎ¾Š½µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$8e096fae-9941-49d8-ae87-c68b02f68da5Š¦queuedÂ¤logs§runningÂ¦output†¤body…¦prefixÚContinuousMDP{Float32, Tuple{Float32, Float32}, Float32, 
ContinuousMDPTransitionSampler{Float32, Tuple{Float32, Float32}, Float32, var"#step#1606"}, typeof(Main.var"workspace#8".MountainCarTask.initialize_state), typeof(Main.var"workspace#8".MountainCarTask.isterm), Returns{Bool}}¨elements”’£ptf’…¦prefixÙZContinuousMDPTransitionSampler{Float32, Tuple{Float32, Float32}, Float32, var"#step#1606"}¨elements‘’¤step’ÙJ(::Main.var"workspace#8".var"#step#1606") (generic function with 1 method)ªtext/plain¤type¦struct¬prefix_short¾ContinuousMDPTransitionSampler¨objectid°ffffffffd87e3d3fÙ!application/vnd.pluto.tree+object’°initialize_state’Ù1initialize_state (generic function with 1 method)ªtext/plain’¦isterm’Ù'isterm (generic function with 1 method)ªtext/plain’¯is_valid_action’³Returns{Bool}(true)ªtext/plain¤type¦struct¬prefix_shortContinuousMDP¨objectid¨da132fcf¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÙ%const mountaincar_continuous_beta_mdp²last_run_timestampËAÚ•>‚ J°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$8e096fae-9941-49d8-ae87-c68b02f68da5¹depends_on_disabled_cellsÂ§runtimeÎù÷Mµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$666a4e89-306b-4fb2-bdc4-3dda2c63153fŠ¦queuedÂ¤logs§runningÂ¦output†¤body ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•s6Ò°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$666a4e89-306b-4fb2-bdc4-3dda2c63153f¹depends_on_disabled_cellsÂ§runtimeÎÆ µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$5d35e515-e2d3-443e-becf-eb28c25db346Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚd>

$\lambda_\theta$: 0.85

$\lambda_\mathbf{w}$: 0.95

$\alpha_{\overline{r}}$:

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•!8žg°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$5d35e515-e2d3-443e-becf-eb28c25db346¹depends_on_disabled_cellsÂ§runtimeÎÍ…‡µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$4c34640f-efa2-4e1d-8a70-0acd2ce45428Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ

Bonus Problems: Comparing Techniques

Consider applying the techniques in this chapter to problems where we choose feature vectors and parameters that effectively reproduce the tabular case. That is, we enumerate every state and every state/action pair, and the parameters for each function store a single value for each case. Let's consider the gradients for both the state-value estimate and the policy. We will use two sets of parameters: $\mathbf{w}$ and $\mathbf{\theta}$. $\mathbf{w}_s$ is the parameter for state $s$ and $\mathbf{\theta}_{s, a}$ is the parameter for state/action pair $(s, a)$. In this notation, $\mathbf{w}$ is a vector with one entry per state and $\mathbf{\theta}$ is a matrix with one row per state and one column per action.

Starting with the state-value function:

$$\begin{align} \hat v(s, \mathbf{w}) &= \mathbf{w}_s \\ \nabla \hat v(s, \mathbf{w}) &= \nabla \mathbf{w}_s \\ &= \mathbf{e}_s \end{align}$$

where $\mathbf{e}_s$ is the one-hot vector with a 1 at index $s$ and length equal to the number of states.
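Because the gradient is one-hot, a semi-gradient update of the form $\mathbf{w} \leftarrow \mathbf{w} + \alpha \delta \nabla \hat v(s, \mathbf{w})$ changes only the entry $\mathbf{w}_s$. The following is a minimal sketch of that observation, assuming a one-step TD error as the critic target; the function name `tabular_td0_update!` and its arguments are hypothetical and are not taken from the notebook's own code.

```julia
# Hypothetical helper (not from the notebook): one semi-gradient TD(0) step
# for a tabular state-value estimate stored in the vector w.
# Since the gradient of v_hat(s, w) is the one-hot vector e_s,
# adding alpha * delta * e_s to w only modifies w[s].
function tabular_td0_update!(w::Vector{Float64}, s::Int, r::Float64, s_next::Int,
                             gamma::Float64, alpha::Float64; terminal::Bool = false)
    v_next = terminal ? 0.0 : w[s_next]   # a terminal state has value zero
    delta = r + gamma * v_next - w[s]     # one-step TD error
    w[s] += alpha * delta                 # same as w .+= alpha * delta * e_s
    return delta
end

w = zeros(3)                                        # three states, all estimates zero
delta = tabular_td0_update!(w, 2, 1.0, 3, 0.9, 0.5) # transition 2 -> 3 with reward 1
```

Only the second entry of `w` moves after this call; the entries for states 1 and 3 are untouched, which is what the one-hot gradient implies.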

Now moving on to the policy, we will use a soft-max function to convert action preferences into probabilities.

$$\begin{align} \pi(a|s, \theta) &= \frac{\exp{\theta_{s, a}}}{\sum_{i = 1}^{n_A}{\exp{\theta_{s, i}}}} \\ \nabla \pi(a|s, \theta) &= \nabla \frac{\exp{\theta_{s, a}}}{\sum_{i = 1}^{n_A}{\exp{\theta_{s, i}}}} \\ \end{align}$$

But we already calculated the gradient of the soft-max function of a vector $\mathbf{x}$.

$$\nabla\sigma(\mathbf{x})_{i, j} = \sigma(\mathbf{x})_i \left ( \delta_{i, j} - \sigma(\mathbf{x})_j \right )$$

Comparing this to what we want, $\mathbf{x} = \mathbf{\theta}_s$, the parameter vector for state $s$, and $\sigma = \pi$. So we can immediately write down the components of this gradient:

$$\begin{align} \nabla \pi(a|\theta_s)_i &= \pi(a|\theta_s) \left (\delta_{a, i} - \pi(i|\theta_s) \right ) \\ \frac{\nabla \pi(a|\theta_s)_i}{\pi(a|\theta_s)} = \nabla \ln \pi(a|\theta_s)_i &= \left (\delta_{a, i} - \pi(i|\theta_s) \right ) \\ \end{align}$$

$$\begin{equation} \nabla \ln{\pi(a|\theta_s)}_i = \begin{cases} -\pi(i|\theta_s) & i \neq a \\ 1 - \pi(i|\theta_s) & i = a \end{cases} \end{equation}$$

This gradient has one component for each entry of $\theta_s$, the vector of action preferences at state $s$; the components belonging to every other state are zero because $\pi(a|s, \theta)$ does not depend on them. Each observed state/action pair gives a different update, but once $s$ and $a$ are fixed, all that needs to be calculated is a vector with length equal to the number of actions.
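As a rough illustration of the result above, the sketch below computes $\nabla \ln \pi(a|\theta_s)$ for a single state as $\mathbf{e}_a - \pi(\cdot|\theta_s)$. The names `softmax` and `grad_log_softmax`, and the layout of `theta` as a states-by-actions matrix, are assumptions made for this example rather than the notebook's implementation.

```julia
# Soft-max over the action preferences of a single state (one row of theta).
function softmax(prefs::AbstractVector{<:Real})
    shifted = prefs .- maximum(prefs)   # subtract the max for numerical stability
    expd = exp.(shifted)
    return expd ./ sum(expd)
end

# Gradient of ln pi(a | theta_s) with respect to theta_s.
# Component i is -pi(i | theta_s) for i != a and 1 - pi(a | theta_s) for i == a.
function grad_log_softmax(prefs::AbstractVector{<:Real}, a::Integer)
    g = -softmax(prefs)
    g[a] += 1
    return g
end

theta = zeros(2, 3)                    # assumed layout: 2 states, 3 actions
g = grad_log_softmax(theta[1, :], 2)   # state 1, action 2, uniform policy
```

With all preferences equal the policy is uniform, so the chosen action gets a component of $2/3$ and the other two get $-1/3$. The components always sum to zero, since the probabilities sum to one.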


Noticing this pattern, the kth term will be of the form $\gamma^k \sum_{x \in \mathcal{S}} \Pr(s \rightarrow x, k, \pi)f(x)$ and the total expression will just be a sum of all of these terms to infinity or the maximum length of an episode under the policy. Looking more closely at the probability term, we can equate it to some other probabilities regarding episode length.
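As a rough numerical illustration of this pattern, suppose the one-step probabilities $p(x \vert s, \pi)$ over non-terminal states are collected in a matrix `P`, so that the $k$-step probabilities $\Pr(s \rightarrow x, k, \pi)$ are the entries of `P^k`. The chain, the vector `f`, and the function name `discounted_sum` below are made up for the sketch. The comparison against $(I - \gamma P)^{-1} f$ assumes the series converges, which holds here because every row of $\gamma P$ sums to less than 1.

```julia
using LinearAlgebra

# Hypothetical 3-state example. P[s, x] = p(x | s, π) over non-terminal states only;
# rows may sum to less than 1, the remainder being the probability of terminating.
P = [0.5 0.3 0.0;
     0.1 0.6 0.2;
     0.0 0.4 0.4]
f = [1.0, 2.0, 3.0]   # arbitrary stand-in for f(x)
γ = 0.9

# Sum the terms γ^k Σ_x Pr(s → x, k, π) f(x) for k = 0, ..., K and every start state s.
function discounted_sum(P, f, γ, K)
    total = zeros(length(f))
    term = copy(f)              # k = 0 term: P^0 f = f
    for k in 0:K
        total .+= term
        term = γ .* (P * term)  # advance to γ^(k+1) P^(k+1) f
    end
    return total
end

truncated = discounted_sum(P, f, γ, 500)
closed_form = (I - γ * P) \ f   # limit of the series when it converges
```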


Discrete Action Space

As an initial test, consider the discrete action space originally used for the mountain car problem where there are three actions (-1, 0, 1) corresponding to full throttle reverse, idle, and full throttle forward. We can apply the same tile coding solution technique from before but with a policy gradient method instead of Sarsa.
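One way such a policy could be parameterized is a soft-max over linear action preferences, where the preference for each action is the sum of the weights of the active tiles. The sketch below assumes binary features supplied as a list of distinct active indices. The names `ACTIONS`, `action_probabilities`, and `tile_grad_log_policy` and the choice of 8 tiles are hypothetical and do not come from the notebook.

```julia
# Soft-max policy with linear action preferences over binary (tile-coded) features.
# θ[i, a] is the weight of feature i for action a; `active` lists the indices of
# the features that are on for the current state. Names are hypothetical.

const ACTIONS = (-1, 0, 1)   # full throttle reverse, idle, full throttle forward

function action_probabilities(θ::AbstractMatrix, active::AbstractVector{<:Integer})
    # Preference h(s, a) is the sum of the weights of the active features.
    h = vec(sum(view(θ, active, :), dims = 1))
    z = exp.(h .- maximum(h))
    return z ./ sum(z)
end

# ∇_θ ln π(a|s): non-zero only in the rows of active features,
# where entry (i, b) equals δ_{a,b} - π(b|s).
function tile_grad_log_policy(θ::AbstractMatrix, active::AbstractVector{<:Integer}, a::Integer)
    π_s = action_probabilities(θ, active)
    g = zeros(size(θ))
    for i in active, b in axes(θ, 2)
        g[i, b] = (b == a) - π_s[b]
    end
    return g
end

θ = zeros(8, length(ACTIONS))            # 8 hypothetical tiles, equal preferences
probs = action_probabilities(θ, [2, 5])  # uniform over the three actions
g = tile_grad_log_policy(θ, [2, 5], 3)   # gradient for choosing action index 3
```

With binary features the gradient is sparse, so an update only needs to touch the rows of the currently active tiles.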


$\lambda_\theta$: 0.05

$\lambda_\mathbf{w}$: 0.8

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:

Total Reward: -147.0


Evaluation State for Policy Function

x position: 0.0

velocity: 0.0


Continuing Cartpole Example


13.7 Policy Parameterization for Continuous Actions

With a parameterized policy for continuous actions, we learn the statistics of the probability distribution from which actions are selected. As a foundation, consider the normal distribution:

$$p(x) \doteq \frac{1}{\sigma \sqrt{2\pi}} \exp \left ( - \frac{(x-\mu)^2}{2\sigma^2} \right ) \tag{13.18}$$
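The density above and the derivatives of its logarithm with respect to $\mu$ and $\sigma$ can be written out directly. These derivatives follow from $\ln p(x) = -\ln \sigma - \frac{1}{2}\ln(2\pi) - \frac{(x-\mu)^2}{2\sigma^2}$. The function names `normal_pdf`, `dlogp_dμ`, and `dlogp_dσ` are illustrative assumptions. The derivatives are taken with respect to the statistics themselves, not with respect to any policy parameters that might produce them.

```julia
# Normal density (13.18); names are illustrative, not from the notebook.
normal_pdf(x, μ, σ) = exp(-(x - μ)^2 / (2σ^2)) / (σ * sqrt(2π))

# Derivatives of ln p(x) = -ln σ - ½ ln(2π) - (x - μ)² / (2σ²)
dlogp_dμ(x, μ, σ) = (x - μ) / σ^2
dlogp_dσ(x, μ, σ) = ((x - μ)^2 / σ^2 - 1) / σ

x, μ, σ = 0.7, 0.2, 1.5
density = normal_pdf(x, μ, σ)
grad = (dlogp_dμ(x, μ, σ), dlogp_dσ(x, μ, σ))
```

If $\mu$ and $\sigma$ are themselves functions of policy parameters, the chain rule multiplies these derivatives by the gradients of $\mu$ and $\sigma$ with respect to those parameters.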


REINFORCE Implementation for Continuous Action Spaces


Mountain Car MDP


Policy Gradient Theorem Proof

In all cases below, when a sum over states is taken it is assumed to be over the set of non-terminal states: $\sum_s \implies \sum_{s \in \mathcal{S}}$. Note that for the case of the value function this is identical to the sum over $\mathcal{S}^+$ because state and state-action values are always zero for terminal states.

$$\begin{flalign} \nabla v_\pi(s) &= \nabla \left [ \sum_a \pi(a \vert s) q_\pi(s, a) \right ] \text{, } \forall s \in \mathcal{S} \tag{definition of value functions and expected value} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla q_\pi(s, a) \right ] \tag{product rule} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla \sum_{s^\prime, r} p(s^\prime, r \vert s, a)(r + \gamma v_\pi(s^\prime)) \right ] \tag{relationship between action and state values} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \gamma \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] \tag{gradient independence}\\ \end{flalign}$$

Note that the final term in the sum is the original expression evaluated at $s^\prime$ instead of $s$, so we have derived a recursive expression which can be applied repeatedly:

$$\begin{flalign} \nabla v_\pi(s) &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \gamma \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \left [ \nabla \pi(a^\prime \vert s^\prime) q_\pi(s^\prime, a^\prime) + \pi(a^\prime \vert s^\prime) \gamma \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \nabla v_\pi(s^{\prime \prime}) \right ] \right ] \tag{recur once}\\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) \right ] + \gamma \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \left [ \nabla \pi(a^\prime \vert s^\prime) q_\pi(s^\prime, a^\prime) \right ] \right ] + \\ &\hspace{50px} \gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \nabla v_\pi(s^{\prime \prime}) \right ] \tag{grouping terms}\\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) \right ] + \gamma \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \left [ \nabla \pi(a^\prime \vert s^\prime) q_\pi(s^\prime, a^\prime) \right ] \right ] + \\ &\hspace{50px} \gamma^2 \sum_a \left [ \pi(a \vert s)\sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \sum_{a^{\prime \prime}} \left [ \nabla \pi(a^{\prime \prime} \vert s^{\prime \prime}) q_\pi(s^{\prime \prime}, a^{\prime \prime}) \right ] \right ] + \cdots \tag{extend recursion}\\ \end{flalign}$$


The probability transition function is normalized over all possible transition states: $\sum_{s^\prime \in \mathcal{S}^+} p(s^\prime \vert s, a) = 1$. If we only take the sum over $\mathcal{S}$ then we instead get the probability that after a single transition we have NOT reached a terminal state. Let's say we also have a policy function $\pi(a \vert s)$ which is normalized over actions: $\sum_a \pi(a \vert s) = 1$. Combining the two, we arrive at a new distribution over transition states: $p(s^\prime \vert s, \pi) = \sum_a \pi(a \vert s) p(s^\prime \vert s, a)$, which is the probability of transitioning from $s$ to $s^\prime$ under the policy. This distribution is also normalized over the transition states as long as we include the terminal state: $\sum_{s^\prime \in \mathcal{S}^+} p(s^\prime \vert s, \pi) = \sum_{s^\prime \in \mathcal{S}^+, a} \pi(a \vert s) p(s^\prime \vert s, a) = \sum_a \pi(a \vert s) \sum_{s^\prime \in \mathcal{S}^+} p(s^\prime \vert s, a) = 1 \times 1 = 1$. If instead we take the sum over $\mathcal{S}$, we again get the probability of NOT terminating in one step, now under the policy.

What if we consider two steps into the future though? Now we have $\sum_{s^\prime}\sum_{a^\prime}\pi(a^\prime \vert s^\prime)p(s^{\prime \prime} \vert s^\prime, a^\prime)\sum_a \pi(a \vert s) p(s^\prime \vert s, a) = \sum_{s^\prime}p(s^{\prime \prime} \vert s^\prime, \pi) p(s^\prime \vert s, \pi)$. It would appear as though we can just put the two probabilities together and consider a new distribution over $s^{\prime \prime}$, $p(s^{\prime \prime} \vert s, \pi, 2)$, where instead of one step the transition now occurs over two steps, but how is this distribution normalized? In the case of the one-step transition, we saw that its sum over all transition states in $\mathcal{S}^+$ is 1 as expected. If we sum both transition states over only $\mathcal{S}$ rather than $\mathcal{S}^+$, what is the result? We already know that $\sum_{s^{\prime \prime} \in \mathcal{S}} p(s^{\prime \prime} \vert s^\prime , \pi) = \Pr \{ S_1 \neq S_T \ \vert S_0 = s^\prime, \pi \}$, that is, the probability that after transitioning out of $s^\prime$ under the policy $\pi$ we have not reached a terminal state.

$$\sum_{s^{\prime \prime} \in \mathcal{S}} \sum_{s^\prime \in \mathcal{S}} p(s^{\prime \prime} \vert s^\prime, \pi) p(s^\prime \vert s, \pi) = \sum_{s^\prime \in \mathcal{S}} p(s^\prime \vert s, \pi) \sum_{s^{\prime \prime} \in \mathcal{S}} p(s^{\prime \prime} \vert s^\prime, \pi) = \Pr \{ S_2 \neq S_T \vert S_0 = s, \pi \}$$

which is to say the probability that after two transitions from $s$ we are not in a terminal state under the policy $\pi$.
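The two-step result can be checked numerically on a small made-up chain. In this sketch `P[s, x]` holds $p(x \vert s, \pi)$ for non-terminal $x$ only, so each row sum is the one-step probability of not terminating. The matrix values and the name `survival` are assumptions chosen for illustration.

```julia
# Hypothetical episodic chain with non-terminal states 1:3 and one terminal state.
# P[s, x] = p(x | s, π) restricted to non-terminal x, so each row sum is the
# probability of NOT terminating in one step from s.
P = [0.5 0.3 0.0;
     0.1 0.6 0.2;
     0.0 0.4 0.4]

# Probability of still being in a non-terminal state after k transitions,
# for each start state: the row sums of P^k.
survival(P, k) = vec(sum(P^k, dims = 2))

one_step = survival(P, 1)
two_step = survival(P, 2)

# The same quantity written as the nested sum from the text:
# Σ_{s′} p(s′|s, π) Σ_{s″} p(s″|s′, π)
nested = P * one_step
```

The row sums of `P^k` give the probability of surviving $k$ transitions, so they can only shrink or stay the same as $k$ grows.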

For the derivations that follow, we always take sums of these distributions over $\mathcal{S}$. For episodic problems, the on-policy distribution $\mu_\pi(s)$, which is the probability of being in a state $s$ during an episode, always excludes the terminal state. That is because if there is a non-zero probability of reaching a terminal state under a policy, then considering all possible episodes we may have an infinite number of visits to the terminal state. Technically the episodes have infinite length, but we are only interested in the portion of the episode that precedes the terminal state for the purpose of calculating probabilities. The more careful statement about the on-policy distribution is that it measures the probability of being in a state during the non-terminal part of an episode. If we try to include the terminal states, then we cannot have a properly normalized definition of the on-policy distribution. Moreover, we have no need to measure the value of a terminal state accurately, since we always know it to be 0. The on-policy distribution is used to formulate the value error objective function and it should only include states for which the value estimation is non-trivial.


Tile Coding Method


Actor-Critic with Eligibility Traces Implementation

Total Reward: -113.0

Policy Gradient Theorem Proof for Continuing Problems


Misc Utilities/Functions


Exercise 13.2

Generalize the proof of the policy gradient theorem and the steps leading to the REINFORCE update equation (13.8), so that (13.8) ends up with a factor of $\gamma^t$ and thus aligns with the general algorithm given in the pseudocode.

See proof above in the section on proving the policy gradient theorem.


To find the $p$ that maximizes the expected value for state 1, we differentiate $v_1$ with respect to $p$ and set the result to 0:

$$\frac{\partial v_1}{\partial p} = -\frac{2p(1-p) - 2(1+p)(1 - 2p)}{p^2(1-p)^2}$$

Setting this equal to 0 implies

$$\begin{flalign} p-p^2 &= 1 - 2p + p - 2p^2\\ p^2 + 2p - 1 &= 0 \\ \end{flalign}$$

The quadratic formula gives two solutions, but since $p$ is a probability and must be positive, we keep only the positive root.

$$p = \frac{-2 \pm \sqrt{4 + 4}}{2} = \frac{-2 \pm 2\sqrt{2}}{2} = -1 \pm \sqrt{2} \implies p = \sqrt{2} - 1 \approx 0.41421$$

So, in order to maximize the value at state 1, we have $p_{\text{left}} \approx 0.414$ and $p_{\text{right}} \approx 0.586$. That also implies that $v_1 = -2\frac{1+p}{p(1-p)} = -2\frac{\sqrt{2}}{(\sqrt{2}-1)(2 - \sqrt{2})}= \frac{-2\sqrt{2}}{2 \sqrt{2} - 2 - 2 + \sqrt{2}} = \frac{-2 \sqrt{2}}{3\sqrt{2} - 4} \approx -11.657$
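As a quick sanity check on the algebra, the closed-form maximizer can be compared with a brute-force search. This is only an illustrative sketch: the function name `v1`, the grid resolution, and the variable names are assumptions introduced here, with $p$ taken to be the probability of choosing left, as in the text.

```julia
# v₁ as a function of p = probability of choosing left, using the expression from the text
v1(p) = -2 * (1 + p) / (p * (1 - p))

# Closed-form maximizer and the corresponding value from the derivation
p_star = sqrt(2) - 1
v_star = v1(p_star)

# Brute-force search on a fine grid over the open interval (0, 1)
grid = range(0.001, 0.999; length = 9981)
p_grid = grid[argmax(v1.(grid))]
```

The grid maximizer `p_grid` should land within one grid step of `p_star`. Rationalizing the denominator of the final fraction gives $v_1 = -(6 + 4\sqrt{2})$, which matches the decimal value quoted above.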


Eligibility Vector with Binary Features


$\lambda_\theta$: 0.1

$\lambda_\mathbf{w}$: 0.98


$\lambda_\theta$: 0.95

$\lambda_\mathbf{w}$: 0.2

Total Reward: -160.0

Change Plot Background Color


Neural Network Method


Eligibility Vector

Recall that for the Gaussian case with linear approximation we had:

$$\begin{flalign} \pi(a \vert s, \boldsymbol{\theta}) &= \frac{1}{\sqrt{2 \pi \sigma(s, \boldsymbol{\theta})^2}} \exp \left ( - \frac{(a - \mu(s, \boldsymbol{\theta}))^2}{2 \sigma(s, \boldsymbol{\theta})^2} \right )\\ \mu(s, \boldsymbol{\theta}) & \doteq \boldsymbol{\theta}_\mu ^ \top \mathbf{x}_\mu(s) \\ \sigma(s, \boldsymbol{\theta}) & \doteq \exp \left ( \boldsymbol{\theta}_\sigma ^ \top \mathbf{x}_\sigma(s) \right ) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\mu) &= \frac{1}{\sigma(s, \boldsymbol{\theta})^2} \left ( a - \mu(s, \boldsymbol{\theta}) \right ) \mathbf{x}_\mu(s) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\sigma) &= \left (\frac{(a - \mu(s, \boldsymbol{\theta}))^2}{\sigma(s, \boldsymbol{\theta})^2} - 1 \right )\mathbf{x}_\sigma(s) \\ \end{flalign}$$

For the squashed Gaussian, the change-of-variables factor $\left \vert \frac{1}{1 - a^2} \right \vert$ does not depend on $\boldsymbol{\theta}$, so we can apply the previous results to the new pdf with $a$ replaced by $\tanh^{-1}(a)$:

$$\begin{flalign} \pi(a \vert s, \boldsymbol{\theta}) &= \frac{1}{\sqrt{2 \pi \sigma(s, \boldsymbol{\theta})^2}} \exp \left ( - \frac{(\tanh^{-1}(a) - \mu(s, \boldsymbol{\theta}))^2}{2 \sigma(s, \boldsymbol{\theta})^2} \right ) \left \vert \frac{1}{1 - a^2} \right \vert\\ \mu(s, \boldsymbol{\theta}) & \doteq \boldsymbol{\theta}_\mu ^ \top \mathbf{x}_\mu(s) \\ \sigma(s, \boldsymbol{\theta}) & \doteq \exp \left ( \boldsymbol{\theta}_\sigma ^ \top \mathbf{x}_\sigma(s) \right ) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\mu) &= \frac{1}{\sigma(s, \boldsymbol{\theta})^2} \left ( \tanh^{-1}(a) - \mu(s, \boldsymbol{\theta}) \right ) \mathbf{x}_\mu(s) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\sigma) &= \left (\frac{(\tanh^{-1}(a) - \mu(s, \boldsymbol{\theta}))^2}{\sigma(s, \boldsymbol{\theta})^2} - 1 \right )\mathbf{x}_\sigma(s) \\ \end{flalign}$$
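One way to gain confidence in eligibility vectors like these is to compare them against finite differences of $\ln \pi$. The following is a minimal sketch, not the notebook's implementation: the function names (`log_pi`, `eligibility`, `fd_grad`), the feature vectors, and the parameter values are all hypothetical. The $\boldsymbol{\theta}_\sigma$ gradient used here is $\left( (\tanh^{-1}(a) - \mu)^2 / \sigma^2 - 1 \right) \mathbf{x}_\sigma(s)$, where the $-1$ comes from differentiating the $-\ln \sigma$ term of the log-density.

```julia
using LinearAlgebra

# Log-density of the squashed Gaussian policy for a scalar action a ∈ (-1, 1)
function log_pi(a, θμ, θσ, xμ, xσ)
    μ = dot(θμ, xμ)
    σ = exp(dot(θσ, xσ))
    u = atanh(a)  # pre-squash action
    return -0.5 * log(2π * σ^2) - (u - μ)^2 / (2 * σ^2) - log(1 - a^2)
end

# Analytic eligibility vectors; the Jacobian term log(1 - a^2) has no θ dependence
function eligibility(a, θμ, θσ, xμ, xσ)
    μ = dot(θμ, xμ)
    σ = exp(dot(θσ, xσ))
    u = atanh(a)
    grad_μ = ((u - μ) / σ^2) .* xμ
    grad_σ = ((u - μ)^2 / σ^2 - 1) .* xσ
    return grad_μ, grad_σ
end

# Central finite-difference gradient of f at θ (illustrative helper)
function fd_grad(f, θ; h = 1e-6)
    g = similar(θ)
    for i in eachindex(θ)
        e = zeros(length(θ))
        e[i] = h
        g[i] = (f(θ .+ e) - f(θ .- e)) / (2 * h)
    end
    return g
end

# Hypothetical parameters, features, and action
θμ, θσ = [0.3, -0.2], [0.1, 0.05]
xμ, xσ = [1.0, 0.5], [1.0, -1.0]
a = 0.4

grad_μ, grad_σ = eligibility(a, θμ, θσ, xμ, xσ)
fd_μ = fd_grad(t -> log_pi(a, t, θσ, xμ, xσ), θμ)
fd_σ = fd_grad(t -> log_pi(a, θμ, t, xμ, xσ), θσ)
```

The analytic and finite-difference gradients should agree to within the truncation error of the central difference. Dropping the `- 1` in `grad_σ` makes the $\boldsymbol{\theta}_\sigma$ comparison fail, which is a convenient way to see why that term is needed.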

Total Reward: -626.0

With continuous policy parametrization, we can smoothly vary action selection probabilities by arbitrarily small amounts, something that was not possible with ε-greedy action selection. Therefore, stronger convergence guarantees are possible for policy-gradient methods than for action-value methods.

In the episodic case, assuming some particular non-random starting state $s_0$, we define the performance of a policy parametrized by $\mathbf{\theta}$ as:

$$\begin{align} J(\mathbf{\theta}) \doteq v_{\pi_\mathbf{\theta}}(s_0) \tag{13.4} \end{align}$$

where $v_{\pi_\mathbf{\theta}}$ is the true value function for $\pi_\mathbf{\theta}$, the policy determined by $\mathbf{\theta}$.

The policy gradient theorem provides an analytic expression for the gradient of performance with respect to the policy parameter that does not involve the derivative of the state distribution:

$$\begin{align} \nabla J(\mathbf{\theta}) \propto \sum_s \mu (s) \sum_a q_\pi (s, a) \nabla \pi (a|s,\mathbf{\theta}) \tag{13.5} \end{align}$$

where the gradients are column vectors of partial derivatives with respect to the components of $\mathbf{\theta}$. In the episodic case, the constant of proportionality is the average length of an episode, and in the continuing case it is 1. Here, $\mu$ is the on-policy distribution under $\pi$.


$$\begin{flalign} \nabla J(\boldsymbol{\theta}) &= \nabla v_\pi(s_0) \\ &= \sum_s \sum_k \gamma^k \Pr \{ s_0 \rightarrow s, k, \pi \} f(s) \\ &= \sum_s \sum_k \gamma^k \frac{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}}{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}} \Pr \{ s_0 \rightarrow s, k, \pi \} f(s) \tag{multiply by 1}\\ &= \eta \sum_s \sum_k \gamma^k \frac{\Pr \{ s_0 \rightarrow s, k, \pi \}}{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}} f(s) \tag{average episode length}\\ &= \eta \sum_s \sum_k \gamma^k \mu_\pi(s, k) f(s) \tag{on policy distribution over states and steps}\\ &= \eta \mathbb{E}_\pi[ \gamma^k f(s) \mid S_0 = s_0, S_k = s] \tag{definition of expected value}\\ &\propto \mathbb{E}_\pi \left [ \gamma^k \sum_a \nabla \pi(a \vert s) q_\pi(s, a) \mid S_0 = s_0, S_k = s \right ] \tag{13.5}\\ \end{flalign}$$

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô‡°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$2a586e46-66e4-461a-85c8-5817e4d1aa43¹depends_on_disabled_cellsÂ§runtimeÎ±”µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$a206c759-3f6e-4003-8cba-5f6ce6742646Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙæ

Figure 13.1

REINFORCE on short-corridor gridworld (Example 13.1). Performance varies with step size but can approach the ideal. Feature vector encodes every state identically.

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô‰d(°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$a206c759-3f6e-4003-8cba-5f6ce6742646¹depends_on_disabled_cellsÂ§runtimeÎ¨µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1dŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ

Consider the linear parameterization proposed with $h_a = \boldsymbol{\theta}^\top \mathbf{x}(s, a)$:

$$\frac{\partial{h_a}}{\partial{\theta_i}} = \mathbf{x}(s, a)_i \implies \nabla(\pi(a \vert s, \boldsymbol{\theta}))_i = \pi_a \left ( \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \right)$$

Now consider $\mathbf{h} = \boldsymbol{\theta}^\top \mathbf{x}$ with $h_a = \mathbf{h}_a$, where column $j$ of $\boldsymbol{\theta}$ holds the weights for action $j$. Since the parameters are now represented as a matrix, we can index the partial derivatives of the gradient such that $\nabla \left ( f(\boldsymbol{\theta}) \right )_{i, j} = \frac{\partial f(\boldsymbol{\theta})}{\partial \theta_{i, j}}$:

$$\frac{\partial{h_a}}{\partial{\theta_{i, j}}} = \begin{cases} \mathbf{x}(s)_i, & \text{ if } j = a \\ 0, & \text{ else } \end{cases} \implies \nabla(\pi(s, \boldsymbol{\theta})_a)_{i, j} = \pi_a \left ( \frac{\partial h_a}{\partial \theta_{i, j}} - \sum_k \pi_k \frac{\partial h_k}{\partial \theta_{i, j}} \right)=\pi_a \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases}$$
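
The case formula above can be checked numerically. The following is a minimal sketch, not the notebook's own implementation: the function names (`softmax`, `policy`, `policy_gradient`, `finite_difference_gradient`) and the example values of `θ` and `x` are assumptions introduced for illustration. It compares the analytic gradient of $\pi_a$ against a central finite-difference estimate.

```julia
# Softmax over action preferences, shifted by the maximum for numerical stability
function softmax(h)
    e = exp.(h .- maximum(h))
    return e ./ sum(e)
end

# Policy probabilities for h = θᵀx, where θ has one column per action
policy(θ, x) = softmax(θ' * x)

# Analytic gradient of π_a with respect to θ, following the case formula:
# entry (i, j) is π_a * x_i * ((j == a) - π_j)
function policy_gradient(θ, x, a)
    π_s = policy(θ, x)
    return [π_s[a] * x[i] * ((j == a) - π_s[j]) for i in eachindex(x), j in eachindex(π_s)]
end

# Central finite-difference estimate of the same gradient, used only as a check
function finite_difference_gradient(θ, x, a; ϵ = 1e-6)
    g = zero(θ)
    for idx in eachindex(θ)
        θ_plus = copy(θ); θ_plus[idx] += ϵ
        θ_minus = copy(θ); θ_minus[idx] -= ϵ
        g[idx] = (policy(θ_plus, x)[a] - policy(θ_minus, x)[a]) / (2ϵ)
    end
    return g
end

θ = [0.1 -0.3 0.5; 0.7 0.2 -0.4]   # 2 features × 3 actions (arbitrary values)
x = [1.0, -2.0]                     # arbitrary feature vector
analytic = policy_gradient(θ, x, 2)
numeric = finite_difference_gradient(θ, x, 2)
max_error = maximum(abs.(analytic .- numeric))  # should be close to zero
```

Because the action probabilities always sum to one, the gradients for the individual actions should also sum to the zero matrix, which gives a second quick sanity check.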

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô‚ò°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1d¹depends_on_disabled_cellsÂ§runtimeÎ¶ýµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙC

Beta Distribution Alternative

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô’új°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5¹depends_on_disabled_cellsÂ§runtimeÎÇµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$31db0f58-28e4-454f-9394-25565687266fŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÛb ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•0Œ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$31db0f58-28e4-454f-9394-25565687266f¹depends_on_disabled_cellsÂ§runtimeÎ aG®µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$822e4d69-2582-4956-858e-06ecb091e76aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ9display_cartpole_episode (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•0`¦Ò°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$822e4d69-2582-4956-858e-06ecb091e76a¹depends_on_disabled_cellsÂ§runtimeÎHFµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÛõ€

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•@âN×°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580a¹depends_on_disabled_cellsÂ§runtimeÎL—2µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$05b0fcad-628b-48d2-aa24-f6f562dbb660Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚQ

$$\begin{flalign} &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \sum_{a^{\prime \prime}} [ \nabla \pi(a^{\prime \prime} \vert s^{\prime \prime}) q_\pi(s^{\prime \prime}, a^{\prime \prime}) ] \right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) f(s^{\prime \prime}) \right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{s^{\prime \prime}} f(s^{\prime \prime}) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) p(s^{\prime \prime} \vert s^\prime, a^\prime) \right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \mathbb{E}_\pi[f(s^{\prime \prime}) \vert s^\prime] \right ] \\ &\gamma^2 \mathbb{E}_\pi[f(s^{\prime \prime}) \vert s] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) g(s^\prime) \right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \mathbb{E}[g(s^\prime) \vert s, a] \right ] \\ &\gamma^2 \mathbb{E}_\pi[g(s^\prime) \vert s]\\ &\gamma^2 \sum_{s^{\prime \prime}} \Pr(s \rightarrow s^{\prime \prime}, 2, \pi) f(s^{\prime \prime}) \end{flalign}$$
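
The equivalence between the nested sums and the two-step transition probability form can be confirmed on a small example. This is a sketch under stated assumptions: the three-state, two-action MDP, the policy table `pol`, and the per-state values `f` are all hypothetical numbers chosen for illustration, and $f(s)$ is treated as a scalar per state rather than a gradient vector.

```julia
# Hypothetical toy MDP: 3 states, 2 actions, values chosen arbitrarily
# p[s1, s, a] is the probability of moving to state s1 from state s under action a
p = zeros(3, 3, 2)
p[:, 1, 1] = [0.2, 0.8, 0.0]; p[:, 1, 2] = [0.0, 0.5, 0.5]
p[:, 2, 1] = [0.6, 0.1, 0.3]; p[:, 2, 2] = [0.3, 0.3, 0.4]
p[:, 3, 1] = [0.0, 0.0, 1.0]; p[:, 3, 2] = [0.5, 0.25, 0.25]

# pol[a, s] is π(a | s); each column sums to one
pol = [0.3 0.6 0.5;
       0.7 0.4 0.5]

f = [1.0, -2.0, 0.5]   # scalar stand-in for f(s) = Σ_a ∇π(a|s) q_π(s, a)
γ = 0.9

# First expression: nested sums over a, s′, a′, s′′
function nested_two_step(s)
    total = 0.0
    for a in 1:2, s1 in 1:3, a1 in 1:2, s2 in 1:3
        total += pol[a, s] * p[s1, s, a] * pol[a1, s1] * p[s2, s1, a1] * f[s2]
    end
    return γ^2 * total
end

# Last expression: P[s1, s] = Σ_a π(a|s) p(s1|s, a), so (P^2)[s2, s] = Pr(s → s2, 2, π)
P = [sum(pol[a, s] * p[s1, s, a] for a in 1:2) for s1 in 1:3, s in 1:3]
matrix_two_step(s) = γ^2 * sum((P^2)[s2, s] * f[s2] for s2 in 1:3)

differences = [nested_two_step(s) - matrix_two_step(s) for s in 1:3]
```

Every entry of `differences` should be zero up to floating-point rounding, since squaring the policy-averaged transition matrix performs the same marginalization over the intermediate action, state, and second action that the nested sums spell out.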

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô†`°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$05b0fcad-628b-48d2-aa24-f6f562dbb660¹depends_on_disabled_cellsÂ§runtimeÎß{µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$d2729657-d0bf-4d39-8ec7-f242a1ad48d6Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙJcreate_continuous_action_mountaincar_beta (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•>t•ê°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$d2729657-d0bf-4d39-8ec7-f242a1ad48d6¹depends_on_disabled_cellsÂ§runtimeÎÁµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$5c11a92d-7496-4aba-af15-2537eac49dd7Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ!Array{Vector{T}, 1} where T<:Real¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•ŒNQ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$5c11a92d-7496-4aba-af15-2537eac49dd7¹depends_on_disabled_cellsÂ§runtimeÎ>cµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$1753b5ed-c00b-4b60-b492-822180778e8cŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ>update_linear_value_gradient! (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•ãêÀ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$1753b5ed-c00b-4b60-b492-822180778e8c¹depends_on_disabled_cellsÂ§runtimeÎ qžµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$f7ede764-5ad8-426b-a805-cc21b622d977Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ5

Results Caching

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô–¼™°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$f7ede764-5ad8-426b-a805-cc21b622d977¹depends_on_disabled_cellsÂ§runtimeÎÚûµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$36d514fa-b27a-4c6b-8399-9d108377b9b5Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚcA

$\lambda_\theta$: 0.75

$\lambda_\mathbf{w}$: 0.25

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•?×ó°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$36d514fa-b27a-4c6b-8399-9d108377b9b5¹depends_on_disabled_cellsÂ§runtimeÎ”·€µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$6b1acb57-159a-4b7f-99fe-5f996522243bŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÀ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampË°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$6b1acb57-159a-4b7f-99fe-5f996522243b¹depends_on_disabled_cellsÃ§runtimeÀµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$45f0a385-6465-4acc-8637-1b007a0fe215Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙAupdate_fcann_eligibility_vector! (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•ºü°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$45f0a385-6465-4acc-8637-1b007a0fe215¹depends_on_disabled_cellsÂ§runtimeÎêfµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$c52c4cec-0ea8-4af3-831a-d284f0e086eeŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ- ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•@-ÝU°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$c52c4cec-0ea8-4af3-831a-d284f0e086ee¹depends_on_disabled_cellsÂ§runtimeÎMØµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$f8614042-7c94-4d47-a1b6-4e96676b4e8bŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙLactor_critic_fcann_episodic_parameter_study (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•/·qÞ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$f8614042-7c94-4d47-a1b6-4e96676b4e8b¹depends_on_disabled_cellsÂ§runtimeÎ~¦±µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$76eb6743-cac0-4174-9ba3-a0691c200b54Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ:make_n_param_dist_params (generic function with 2 
methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•'$:°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$76eb6743-cac0-4174-9ba3-a0691c200b54¹depends_on_disabled_cellsÂ§runtimeÎºµµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$94517664-6988-44dc-a297-e9d5873ee540Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚV

Squashed Gaussian Plot Parameters

$\mu$: 0.0

$\sigma$: 0.5

maximum value: 1.0

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•!z"2°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$94517664-6988-44dc-a297-e9d5873ee540¹depends_on_disabled_cellsÂ§runtimeÎ ¤èµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$d037ea92-915c-4dc7-97c6-d006d92e088aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ,figure_13_1 (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•#—}°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$d037ea92-915c-4dc7-97c6-d006d92e088a¹depends_on_disabled_cellsÂ§runtimeÎo© µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$24fa139c-ad4b-49db-ac8f-23c476ed8608Š¦queuedÂ¤logs§runningÂ¦output†¤body‚£msgºInexactError: Int64(NaN32)ªstacktraceÜŒªcall_short¥Int64§inlinedÃ£urlÀ¤pathª./float.jl®source_packageÀ¤call¥Int64ªlinfo_type§Nothing¤lineÍâ¤file¨float.jl¤func¥Int64parent_moduleÀ¦from_cÂŒªcall_short§convert§inlinedÃ£urlÀ¤path«./number.jl®source_packageÀ¤call§convertªlinfo_type§Nothing¤line¤file©number.jl¤func§convertparent_moduleÀ¦from_cÂŒªcall_short®_round_convert§inlinedÃ£urlÀ¤path./rounding.jl®source_packageÀ¤call®_round_convertªlinfo_type§Nothing¤lineÍà¤file«rounding.jl¤func®_round_convertparent_moduleÀ¦from_cÂŒªcall_short¥round§inlinedÃ£urlÀ¤path./rounding.jl®source_packageÀ¤call¥roundªlinfo_type§Nothing¤lineÍß¤file«rounding.jl¤func¥roundparent_moduleÀ¦from_cÂŒªcall_short¤ceil§inlinedÃ£urlÀ¤path./rounding.jl®source_packageÀ¤call¤ceilªlinfo_type§Nothing¤lineÍÜ¤file«rounding.jl¤func¤ceilparent_moduleÀ¦from_cÂŒªcall_short¥#1114§inlinedÃ£urlÀ¤pathÙ«/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-12/Chapter_12_Eligibility_Traces.jl®source_packageÀ¤call¥#1114ªlinfo_type§Nothing¤lineÍÊ¤fileÙ Chapter_12_Eligibility_Traces.jl¤func¥#1114parent_moduleÀ¦from_cÂŒªcall_shortÙ'(::var"#1114#1116"{â€¦})(tiling::Int64)§inlinedÂ£url ¤path¦./none®source_package¤Main¤callÙ»(::var"#1114#1116"{4, Int64, NTuple{4, Float32}, 
NTuple{4, Float32}, NTuple{4, Int64}, NTuple{4, Float32}, NTuple{4, Int64}, NTuple{4, Float32}, NTuple{4, Float32}, Int64})(tiling::Int64)ªlinfo_type³Core.MethodInstance¤line¤file¤none¤func¥#1114parent_moduleµMain.var"workspace#8"¦from_cÂŒªcall_short§iterate§inlinedÃ£urlÀ¤path®./generator.jl®source_packageÀ¤call§iterateªlinfo_type§Nothing¤line0¤file¬generator.jl¤func§iterateparent_moduleÀ¦from_cÂŒªcall_short§iterate§inlinedÃ£urlÀ¤path®./iterators.jl®source_packageÀ¤call§iterateªlinfo_type§Nothing¤lineÌÎ¤file¬iterators.jl¤func§iterateparent_moduleÀ¦from_cÂŒªcall_short§iterate§inlinedÃ£urlÀ¤path®./iterators.jl®source_packageÀ¤call§iterateªlinfo_type§Nothing¤lineÌÍ¤file¬iterators.jl¤func§iterateparent_moduleÀ¦from_cÂŒªcall_shortÙ|update_binary_feature_vector!(x::BinaryFeatureVector{â€¦}, s::CartPoleState{â€¦}, get_active_features::var"#1549#1552"{â€¦})§inlinedÂ£urlÙàhttps://github.com/jekyllstein/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/tree/fc4495701c659f9d92b015bfca6e6d3b480d4178//Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#8eab55a5-41b7-4f5e-a02f-4c19388bc9ea#L1¤pathÙØ/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#8eab55a5-41b7-4f5e-a02f-4c19388bc9ea®source_package¤Main¤callÚXupdate_binary_feature_vector!(x::BinaryFeatureVector{Int64}, s::CartPoleState{Float32}, get_active_features::var"#1549#1552"{@NamedTuple{num_features::Int64, get_active_features::var"#f#1125"{NTuple{4, Float32}, NTuple{4, Float32}, NTuple{4, Float32}, Int64, NTuple{4, Int64}, NTuple{4, Float32}, Int64, NTuple{4, Int64}, NTuple{4, 
Float32}}}})ªlinfo_type³Core.MethodInstance¤line¤fileÙMChapter_13_Policy_Gradient_Methods.jl#==#8eab55a5-41b7-4f5e-a02f-4c19388bc9ea¤func½update_binary_feature_vector!parent_moduleµMain.var"workspace#8"¦from_cÂŒªcall_short¶update_feature_vector!§inlinedÃ£urlÀ¤pathÙØ/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#ba5d6311-daee-4abc-b2fb-fae2184ef3eb®source_packageÀ¤call¶update_feature_vector!ªlinfo_type§Nothing¤line¤fileÙMChapter_13_Policy_Gradient_Methods.jl#==#ba5d6311-daee-4abc-b2fb-fae2184ef3eb¤func¶update_feature_vector!parent_moduleÀ¦from_cÂŒªcall_short£Ï€!§inlinedÃ£urlÀ¤pathÙØ/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#f545c800-0bf3-491f-9d7d-42341cfdb573®source_packageÀ¤call£Ï€!ªlinfo_type§Nothing¤line¤fileÙMChapter_13_Policy_Gradient_Methods.jl#==#f545c800-0bf3-491f-9d7d-42341cfdb573¤func£Ï€!parent_moduleÀ¦from_cÂŒªcall_short¢Ï€§inlinedÃ£urlÀ¤pathÙØ/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#5b868eba-c1af-49f6-8f93-79b78c319a6f®source_packageÀ¤call¢Ï€ªlinfo_type§Nothing¤line¤fileÙMChapter_13_Policy_Gradient_Methods.jl#==#5b868eba-c1af-49f6-8f93-79b78c319a6f¤func¢Ï€parent_moduleÀ¦from_cÂŒªcall_shortÙ4(::var"#Ï€_sample#1309"{â€¦})(s::CartPoleState{â€¦})§inlinedÂ£urlÙàhttps://github.com/jekyllstein/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/tree/fc4495701c659f9d92b015bfca6e6d3b480d4178//Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#5b868eba-c1af-49f6-8f93-79b78c319a6f#L8¤pathÙØ/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-13/Chapter_13_Policy_Gradient_Methods
.jl#==#5b868eba-c1af-49f6-8f93-79b78c319a6f®source_package¤Main¤callÚë(::var"#Ï€_sample#1309"{typeof(gaussian_action_sampler), var"#Ï€#1308"{Matrix{Float32}, Vector{Float32}, BinaryFeatureVector{Int64}, var"#Ï€!#1303"{var"#update_feature_vector!#1364"{var"#1549#1552"{@NamedTuple{num_features::Int64, get_active_features::var"#f#1125"{NTuple{4, Float32}, NTuple{4, Float32}, NTuple{4, Float32}, Int64, NTuple{4, Int64}, NTuple{4, Float32}, Int64, NTuple{4, Int64}, NTuple{4, Float32}}}}}, typeof(update_binary_action_preferences!)}}})(s::CartPoleState{Float32})ªlinfo_type³Core.MethodInstance¤line¤fileÙMChapter_13_Policy_Gradient_Methods.jl#==#5b868eba-c1af-49f6-8f93-79b78c319a6f¤func©Ï€_sampleparent_moduleµMain.var"workspace#8"¦from_cÂŒªcall_shortÙˆrunepisode!(::Tuple{â€¦}, mdp::ContinuousMDP{â€¦}, Ï€::var"#Ï€_sample#1309"{â€¦}; s0::CartPoleState{â€¦}, a0::Float32, max_steps::Int64)§inlinedÂ£urlÙàhttps://github.com/jekyllstein/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/tree/fc4495701c659f9d92b015bfca6e6d3b480d4178//Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#f946c886-6246-4f98-a96f-f06984691ad8#L2¤pathÙØ/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#f946c886-6246-4f98-a96f-f06984691ad8®source_package¤Main¤callÚ*runepisode!(::Tuple{Vector{CartPoleState{Float32}}, Vector{Float32}, Vector{Float32}}, mdp::ContinuousMDP{Float32, CartPoleState{Float32}, Float32, ContinuousMDPTransitionSampler{Float32, CartPoleState{Float32}, Float32, var"#episodic_step#1529"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}}}, var"#initialize_state#1525"{var"#initialize_state#1514#1526"{var"#1521#1537", var"#init_Î¸#1551", var"#1523#1539", var"#1524#1540"}}, var"#failure#1527"{Float32, Float32, Float32, Float32}, Returns{Bool}}, Ï€::var"#Ï€_sample#1309"{typeof(gaussian_action_sampler), 
var"#Ï€#1308"{Matrix{Float32}, Vector{Float32}, BinaryFeatureVector{Int64}, var"#Ï€!#1303"{var"#update_feature_vector!#1364"{var"#1549#1552"{@NamedTuple{num_features::Int64, get_active_features::var"#f#1125"{NTuple{4, Float32}, NTuple{4, Float32}, NTuple{4, Float32}, Int64, NTuple{4, Int64}, NTuple{4, Float32}, Int64, NTuple{4, Int64}, NTuple{4, Float32}}}}}, typeof(update_binary_action_preferences!)}}}; s0::CartPoleState{Float32}, a0::Float32, max_steps::Int64)ªlinfo_type³Core.MethodInstance¤line ¤fileÙMChapter_13_Policy_Gradient_Methods.jl#==#f946c886-6246-4f98-a96f-f06984691ad8¤func±#runepisode!#1258parent_moduleµMain.var"workspace#8"¦from_cÂŒªcall_short«runepisode!§inlinedÃ£urlÀ¤pathÙØ/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#f946c886-6246-4f98-a96f-f06984691ad8®source_packageÀ¤call«runepisode!ªlinfo_type§Nothing¤line¤fileÙMChapter_13_Policy_Gradient_Methods.jl#==#f946c886-6246-4f98-a96f-f06984691ad8¤func«runepisode!parent_moduleÀ¦from_cÂŒªcall_shortÚÑreinforce_with_baseline_monte_carlo_control!(policy_params::Matrix{â€¦}, âˆ‡lnÏ€::BinaryGaussianEligibilityVector{â€¦}, value_params::Vector{â€¦}, âˆ‡vÌ‚::BinaryFeatureVector{â€¦}, mdp::ContinuousMDP{â€¦}, update_action_distribution!::typeof(update_binary_action_preferences!), action_dist_params::Vector{â€¦}, action_sampler::typeof(gaussian_action_sampler), update_eligibility_vector!::typeof(update_gaussian_eligibility_vector!), x::BinaryFeatureVector{â€¦}, update_feature_vector!::var"#update_feature_vector!#1364"{â€¦}, value_function::typeof(binary_value_function), update_value_gradient!::typeof(update_binary_value_gradient!), max_episodes::Int64; Î±_w::Float32, Î±_Î¸::Float32, Î³::Float32, 
epkwargs::@Kwargs{})§inlinedÂ£urlÙàhttps://github.com/jekyllstein/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/tree/fc4495701c659f9d92b015bfca6e6d3b480d4178//Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#5b868eba-c1af-49f6-8f93-79b78c319a6f#L2¤pathÙØ/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#5b868eba-c1af-49f6-8f93-79b78c319a6f®source_package¤Main¤callÚ¨reinforce_with_baseline_monte_carlo_control!(policy_params::Matrix{Float32}, âˆ‡lnÏ€::BinaryGaussianEligibilityVector{Float32, Float32, Float32, BinaryFeatureVector{Int64}}, value_params::Vector{Float32}, âˆ‡vÌ‚::BinaryFeatureVector{Int64}, mdp::ContinuousMDP{Float32, CartPoleState{Float32}, Float32, ContinuousMDPTransitionSampler{Float32, CartPoleState{Float32}, Float32, var"#episodic_step#1529"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}}}, var"#initialize_state#1525"{var"#initialize_state#1514#1526"{var"#1521#1537", var"#init_Î¸#1551", var"#1523#1539", var"#1524#1540"}}, var"#failure#1527"{Float32, Float32, Float32, Float32}, Returns{Bool}}, update_action_distribution!::typeof(update_binary_action_preferences!), action_dist_params::Vector{Float32}, action_sampler::typeof(gaussian_action_sampler), update_eligibility_vector!::typeof(update_gaussian_eligibility_vector!), x::BinaryFeatureVector{Int64}, update_feature_vector!::var"#update_feature_vector!#1364"{var"#1549#1552"{@NamedTuple{num_features::Int64, get_active_features::var"#f#1125"{NTuple{4, Float32}, NTuple{4, Float32}, NTuple{4, Float32}, Int64, NTuple{4, Int64}, NTuple{4, Float32}, Int64, NTuple{4, Int64}, NTuple{4, Float32}}}}}, value_function::typeof(binary_value_function), update_value_gradient!::typeof(update_binary_value_gradient!), max_episodes::Int64; Î±_w::Float32, Î±_Î¸::Float32, Î³::Float32, 
epkwargs::@Kwargs{})ªlinfo_type³Core.MethodInstance¤line¤fileÙMChapter_13_Policy_Gradient_Methods.jl#==#5b868eba-c1af-49f6-8f93-79b78c319a6f¤funcÙ2#reinforce_with_baseline_monte_carlo_control!#1304parent_moduleµMain.var"workspace#8"¦from_cÂŒªcall_shortÙûreinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions(mdp::ContinuousMDP{â€¦}, get_active_features::Function, num_features::Int64, max_episodes::Int64; policy_params::Matrix{â€¦}, value_params::Vector{â€¦}, kwargs::@Kwargs{â€¦})§inlinedÂ£urlÙàhttps://github.com/jekyllstein/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/tree/fc4495701c659f9d92b015bfca6e6d3b480d4178//Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00#L1¤pathÙØ/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00®source_package¤Main¤callÚËreinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions(mdp::ContinuousMDP{Float32, CartPoleState{Float32}, Float32, ContinuousMDPTransitionSampler{Float32, CartPoleState{Float32}, Float32, var"#episodic_step#1529"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}}}, var"#initialize_state#1525"{var"#initialize_state#1514#1526"{var"#1521#1537", var"#init_Î¸#1551", var"#1523#1539", var"#1524#1540"}}, var"#failure#1527"{Float32, Float32, Float32, Float32}, Returns{Bool}}, get_active_features::Function, num_features::Int64, max_episodes::Int64; policy_params::Matrix{Float32}, value_params::Vector{Float32}, kwargs::@Kwargs{Î±_Î¸::Float32, Î±_w::Float32})ªlinfo_type³Core.MethodInstance¤line¤fileÙMChapter_13_Policy_Gradient_Methods.jl#==#7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00¤funcÙR#reinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions#1367parent_moduleµMain.var"workspace#8"¦from_cÂŒªcall_short¯top-level 
scope§inlinedÂ£urlÀ¤pathÙØ/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#24fa139c-ad4b-49db-ac8f-23c476ed8608®source_packageÀ¤call¯top-level scopeªlinfo_typeCore.CodeInfo¤line¤fileÙMChapter_13_Policy_Gradient_Methods.jl#==#24fa139c-ad4b-49db-ac8f-23c476ed8608¤func¯top-level scopeparent_moduleÀ¦from_cÂ¤mimeÙ'application/vnd.pluto.stacktrace+object¬rootassignee´const reinforce_test²last_run_timestampËAÚ•0ôU“°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$24fa139c-ad4b-49db-ac8f-23c476ed8608¹depends_on_disabled_cellsÂ§runtimeÀµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÃÙ$2025ff38-f2ec-4224-b771-ff72ffe1af28Š¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements’’’¤-1.2ªtext/plain’’¥-0.07ªtext/plain¤type¥Tuple¨objectid°8f2fc5f476d649ac¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeºconst mountaincar_min_vals²last_run_timestampËAÚ•:8L°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$2025ff38-f2ec-4224-b771-ff72ffe1af28¹depends_on_disabled_cellsÂ§runtimeÎ6µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$cb70d400-3e9c-441c-b17c-e727e8c928f3Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙB

Waiting to run parameter study

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•2z°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$cb70d400-3e9c-441c-b17c-e727e8c928f3¹depends_on_disabled_cellsÂ§runtimeÎXóµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$e034b9cb-f4ee-46f4-bea6-72c93c75d966Š¦queuedÂ¤logs§runningÂ¦output†¤body ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•%Âz°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$e034b9cb-f4ee-46f4-bea6-72c93c75d966¹depends_on_disabled_cellsÂ§runtimeÎ`?9µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$e6cf9550-2e69-4b82-92cf-5e07a35490aaŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ.zero_params! (generic function with 2 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ• ;»Œ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$e6cf9550-2e69-4b82-92cf-5e07a35490aa¹depends_on_disabled_cellsÂ§runtimeÎÅËµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$717e4c69-59d5-4929-923f-dd35a97fb160Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙpactor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions (generic function with 2 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•/Í³Õ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$717e4c69-59d5-4929-923f-dd35a97fb160¹depends_on_disabled_cellsÂ§runtimeÎ.´Åµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$1386ffdb-940d-4f1b-a872-4e38647b5335Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚÒ

Test One-step Actor-Critic

The following function calls execute the One-step Actor-Critic algorithm on Example 13.1. The output displayed is the policy function evaluated on the single state representation for the problem. The two values are the probabilities of taking the left and right actions, respectively. If training has converged, the right action probability should be the higher of the two, approaching a value of about 60%.
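
The target probability can be recovered directly from the short-corridor dynamics described in Example 13.1 of the textbook (reward of $-1$ per step, left and right reversed in the second state). The sketch below is an independent check rather than part of the notebook: the name `corridor_start_value` and the grid of probabilities are assumptions made for illustration, and `argmax(f, domain)` assumes Julia 1.7 or later.

```julia
# Start-state value when "right" is chosen with probability p in every state.
# Unknowns are v1, v2, v3 for the three non-terminal states:
#   v1 = -1 + p*v2 + q*v1   (left from state 1 stays in state 1)
#   v2 = -1 + p*v1 + q*v3   (actions are reversed in state 2)
#   v3 = -1 + q*v2          (right from state 3 terminates)
function corridor_start_value(p)
    q = 1 - p
    A = [p -p 0; -p 1 -q; 0 -q 1]
    b = fill(-1.0, 3)
    return (A \ b)[1]
end

# Grid search over stochastic policies for the best right-action probability
ps = 0.01:0.001:0.99
best_p = argmax(corridor_start_value, ps)
best_value = corridor_start_value(best_p)
```

Solving the same equations by hand gives $v(s_0) = -\frac{2(2 - p)}{p(1 - p)}$, which is maximized at $p = 2 - \sqrt{2} \approx 0.586$, so the grid search should land next to that value.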

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô‹8´°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$1386ffdb-940d-4f1b-a872-4e38647b5335¹depends_on_disabled_cellsÂ§runtimeÎÑkµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$a893a87b-2d07-4db5-9d1a-9da8646216f4Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ>update_params_with_gradient! (generic function with 5 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•êÓN°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$a893a87b-2d07-4db5-9d1a-9da8646216f4¹depends_on_disabled_cellsÂ§runtimeÎÛ|µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$2cbc972b-c685-4c1c-8a8d-9d58b197ad90Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ // We start by putting all the variable interpolation here at the beginning // Publish the plot object to JS let plot_obj = {"layout": {"xaxis": {"title": {"text": "Training Step"}}, "template": {"layout": {"coloraxis": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "xaxis": {"gridcolor": "white", "zerolinewidth": 2, "title": {"standoff": 15}, "ticks": "", "zerolinecolor": "white", "automargin": true, "linecolor": "white"}, "hovermode": "closest", "paper_bgcolor": "white", "geo": {"showlakes": true, "showland": true, "landcolor": "#E5ECF6", "bgcolor": "white", "subunitcolor": "white", "lakecolor": "white"}, "colorscale": {"sequential": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]], "diverging": [[0, "#8e0152"], [0.1, "#c51b7d"], [0.2, "#de77ae"], [0.3, "#f1b6da"], [0.4, "#fde0ef"], [0.5, "#f7f7f7"], [0.6, "#e6f5d0"], [0.7, "#b8e186"], [0.8, "#7fbc41"], [0.9, "#4d9221"], [1, "#276419"]], "sequentialminus": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], 
[0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}, "yaxis": {"gridcolor": "white", "zerolinewidth": 2, "title": {"standoff": 15}, "ticks": "", "zerolinecolor": "white", "automargin": true, "linecolor": "white"}, "shapedefaults": {"line": {"color": "#2a3f5f"}}, "hoverlabel": {"align": "left"}, "mapbox": {"style": "light"}, "polar": {"angularaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "bgcolor": "#E5ECF6", "radialaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}}, "autotypenumbers": "strict", "font": {"color": "#2a3f5f"}, "ternary": {"baxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "bgcolor": "#E5ECF6", "caxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "aaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}}, "annotationdefaults": {"arrowhead": 0, "arrowwidth": 1, "arrowcolor": "#2a3f5f"}, "plot_bgcolor": "#E5ECF6", "title": {"x": 0.05}, "scene": {"xaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}, "zaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}, "yaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}}, "colorway": ["#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52"]}, "data": {"barpolar": [{"type": "barpolar", "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}}], "carpet": [{"aaxis": {"gridcolor": "white", "endlinecolor": "#2a3f5f", "minorgridcolor": "white", "startlinecolor": "#2a3f5f", "linecolor": "white"}, "type": "carpet", "baxis": {"gridcolor": "white", 
"endlinecolor": "#2a3f5f", "minorgridcolor": "white", "startlinecolor": "#2a3f5f", "linecolor": "white"}}], "scatterpolar": [{"type": "scatterpolar", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "parcoords": [{"line": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "type": "parcoords"}], "scatter": [{"type": "scatter", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram2dcontour": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "histogram2dcontour", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "contour": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "contour", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scattercarpet": [{"type": "scattercarpet", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "mesh3d": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "mesh3d"}], "surface": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "surface", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scattermapbox": [{"type": "scattermapbox", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scattergeo": [{"type": "scattergeo", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], 
"histogram": [{"type": "histogram", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "pie": [{"type": "pie", "automargin": true}], "choropleth": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "choropleth"}], "heatmapgl": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "heatmapgl", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "bar": [{"type": "bar", "error_y": {"color": "#2a3f5f"}, "error_x": {"color": "#2a3f5f"}, "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}}], "heatmap": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "heatmap", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "contourcarpet": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "contourcarpet"}], "table": [{"type": "table", "header": {"line": {"color": "white"}, "fill": {"color": "#C8D4E3"}}, "cells": {"line": {"color": "white"}, "fill": {"color": "#EBF0F8"}}}], "scatter3d": [{"line": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "type": "scatter3d", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scattergl": [{"type": "scattergl", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram2d": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "histogram2d", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], 
[0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scatterternary": [{"type": "scatterternary", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scatterpolargl": [{"type": "scatterpolargl", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}]}}, "margin": {"l": 50, "b": 50, "r": 50, "t": 60}, "yaxis": {"title": {"text": "Reward Average"}}}, "config": {"showLink": false, "editable": false, "responsive": true, "staticPlot": false, "scrollZoom": true}, "frames": [], "data": [{"y": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.370569e-5, 5.313496e-5, 5.2576237e-5, 5.2029136e-5, 5.1493305e-5, 5.0965802e-5, 5.045154e-5, 4.9947554e-5, 4.9453538e-5, 4.89692e-5, 4.84919e-5, 4.8026126e-5, 4.7569214e-5, 4.7120913e-5, 9.336197e-5, 9.2494105e-5, 9.164643e-5, 9.081415e-5, 8.999685e-5, 8.919413e-5, 0.00013260255, 0.00013144058, 0.00017373177, 0.00017223562, 0.00017076502, 0.00016931217, 0.00020986359, 0.00020811654, 0.00024767802, 0.00028659162, 0.0003248599, 0.00036252316, 0.00039958445, 0.00043605804, 0.00047195784, 0.0004682744, 0.00050334923, 0.0005763246, 0.0006100583, 0.0006432815, 0.00067600555, 0.0007082153, 0.0007399734, 0.0007712649, 0.00080210005, 0.00083248876, 0.0008624097, 0.00089193333, 0.0009210386, 0.0009497344, 0.0010129589, 0.0010405828, 0.0010678609, 0.0010947656, 0.0011213048, 0.0011474857, 0.0011732761, 0.0012320603, 0.0012239092, 0.0012487266, 0.0013058666, 0.0013297872, 0.0013534416, 0.001376793, 0.001431662, 0.0014542235, 0.0014764552, 0.0015296725, 0.0015512053, 0.0016033052, 
0.0016241228, 0.0016750928, 0.0016952232, 0.0017151111, 0.001734761, 0.0017541773, 0.0018028668, 0.0018510357, 0.0018694318, 0.0018876144, 0.0019344593, 0.0019520037, 0.001997945, 0.0020149846, 0.002060052, 0.0020765518, 0.0020928092, 0.0021366929, 0.0021800923, 0.0021955704, 0.0022381744, 0.0022531082, 0.0022679411, 0.0023094688, 0.0023238421, 0.0023646315, 0.0023784984, 0.0023922815, 0.002432072, 0.002445433, 0.0024586557, 0.0024716787, 0.002510246, 0.0025229359, 0.002535497, 0.0025731584, 0.0025852765, 0.0025973378, 0.0026341293, 0.002670557, 0.0026820207, 0.0027177904, 0.0027289118, 0.0027641724, 0.0027749627, 0.0028096633, 0.0028200655, 0.0028542206, 0.0028880525, 0.0029215654, 0.0029313134, 0.0029642424, 0.0029969334, 0.003029322, 0.0030383943, 0.003070296, 0.003101836, 0.003110455, 0.0031189965, 0.0031499607, 0.0031806473, 0.0031886902, 0.0032189318, 0.003248906, 0.0032566122, 0.0032861587, 0.003315375, 0.0033226921, 0.0033515687, 0.0033801969, 0.0033871417, 0.0034153005, 0.0034432919, 0.0034710465, 0.0034774912, 0.003504869, 0.0035319442, 0.0035380549, 0.0035648406, 0.0035914055, 0.0036177516, 0.0036438075, 0.0036493375, 0.003675127, 0.0036804853, 0.0037059416, 0.003711059, 0.003736189, 0.0037412192, 0.0037660303, 0.0037906459, 0.0037953276, 0.003800047, 0.0038047296, 0.0038093757, 0.0038333463, 0.0038570575, 0.003861449, 0.0038849444, 0.0039082607, 0.0039124074, 0.003935369, 0.0039582313, 0.0039809216, 0.0039847344, 0.004007157, 0.0040293382, 0.0040329294, 0.0040364945, 0.0040767607, 0.004080139, 0.004101648, 0.0041230745, 0.004126249, 0.0041474323, 0.0041684634, 0.0041713663, 0.0041921614, 0.0041950336, 0.0042155976, 0.0042183665, 0.0042210417, 0.0042412984, 0.0042439485, 0.004263984, 0.0042665373, 0.004286282, 0.004288741, 0.004308347, 0.0043278197, 0.004330111, 0.0043493034, 0.004368439, 0.0043874453, 0.0043895054, 0.004408314, 0.004410217, 0.0044288305, 0.004447321, 0.0044656885, 0.00446745, 0.0044855573, 0.0045036194, 0.004521563, 0.00452312, 
0.0045408844, 0.004542295, 0.004559883, 0.0045773573, 0.0045787105, 0.0045960136, 0.004613133, 0.00461436, 0.0046155793, 0.0046325475, 0.0046494096, 0.0046504345, 0.004667135, 0.004668171, 0.0046847127, 0.004701152, 0.0047174175, 0.004718286, 0.0047344714, 0.0047352826, 0.004751317, 0.0047671823, 0.004783023, 0.0047836783, 0.004784329, 0.004799976, 0.0048005027, 0.004816008, 0.0048314207, 0.0048319204, 0.0048324172, 0.004847574, 0.0048627127, 0.004877763, 0.00487812, 0.0048930375, 0.0048932773, 0.0049080644, 0.004922766, 0.0049229884, 0.0049232095, 0.0049376707, 0.0049378485, 0.0049380255, 0.0049382015, 0.0049525267, 0.0049667004, 0.004966794, 0.0049668876, 0.0049809716, 0.0049949773, 0.005008835, 0.0050226855, 0.0050226226, 0.0050363583, 0.005036258, 0.005049812, 0.0050496757, 0.00504954, 0.0050630155, 0.0050628446, 0.005076142, 0.005089436, 0.0050891954, 0.005088956, 0.005102109, 0.0051017683, 0.005114817, 0.005114512, 0.005127458, 0.0051403353, 0.0051531447, 0.0051658186, 0.005165384, 0.0051780273, 0.005190605, 0.005190109, 0.005202522, 0.0052149384, 0.0052143834, 0.005226705, 0.0052389633, 0.0052382844, 0.0052376753, 0.0052498123, 0.005249177, 0.005248545, 0.0052604955, 0.0052724523, 0.0052717663, 0.005271084, 0.0052704057, 0.005282152, 0.005281449, 0.0052807494, 0.005292448, 0.005291725, 0.00529094, 0.005302527, 0.0053017843, 0.005313288, 0.005312523, 0.005311697, 0.0053230925, 0.0053223087, 0.0053336234, 0.0053448835, 0.005343989, 0.0053551705, 0.005366298, 0.005377372, 0.005376472, 0.0053874054, 0.005398351, 0.005397408, 0.0054082777, 0.0054190964, 0.005418048, 0.0054287924, 0.0054277894, 0.0054384614, 0.0054374402, 0.005447977, 0.0054585277, 0.0054690302, 0.005467948, 0.005466871, 0.0054772184, 0.0054875812, 0.005497897, 0.0055081653, 0.005518387, 0.0055171475, 0.005527302, 0.0055261105, 0.0055361995, 0.005534993, 0.00553373, 0.005543734, 0.0055425186, 0.0055524586, 0.005562354, 0.005572144, 0.0055708764, 0.0055696145, 0.005579385, 0.00557811, 0.0055767796, 
0.005575516, 0.0055851876, 0.0055948175, 0.005593523, 0.0056030317, 0.00561256, 0.0056112353, 0.005620705, 0.0056301337, 0.0056394613, 0.005648809, 0.005658117, 0.005667385, 0.005676614, 0.005685743, 0.0056842887, 0.005693423, 0.0057025184, 0.005701038, 0.0056995037, 0.005708529, 0.005717517, 0.0057160174, 0.005724952, 0.0057233837, 0.0057322658, 0.005730748, 0.005739578, 0.005748372, 0.005746771, 0.0057452363, 0.005753964, 0.005762656, 0.005771313, 0.0057798754, 0.005788462, 0.005786861, 0.0057852664, 0.0057937894, 0.00580222, 0.0058106747, 0.0058190953, 0.0058174524, 0.005825826, 0.0058241175, 0.0058324444, 0.005840738, 0.0058489987, 0.0058572264, 0.005865364, 0.0058636554, 0.0058718054, 0.0058799237, 0.0058781966, 0.005886213, 0.00588448, 0.0058925105, 0.005900509, 0.005908477, 0.005916356, 0.0059242626, 0.005932138, 0.005939983, 0.0059381737, 0.00594592, 0.0059441063, 0.0059518684, 0.00595005, 0.005948239, 0.005955892, 0.0059635728, 0.0059617464, 0.005969387, 0.0059769982, 0.0059751007, 0.005982673, 0.0059902165, 0.0059977323, 0.0059958654, 0.006003286, 0.0060014166, 0.006008855, 0.006016266, 0.0060236496, 0.0060217003, 0.0060290466, 0.006036366, 0.0060436577, 0.0060509234, 0.0060581067, 0.006065319, 0.0060633733, 0.0060614347, 0.0060686017, 0.0060666054, 0.006073737, 0.0060808426, 0.0060879225, 0.0060949773, 0.006101951, 0.006108955, 0.006106966, 0.0061139357, 0.006111945, 0.0061188266, 0.006116834, 0.006123737, 0.006130615, 0.0061286124, 0.006135403, 0.006142224, 0.0061490214, 0.0061557945, 0.006153765, 0.006160452, 0.0061584217, 0.006165131, 0.0061631, 0.006169778, 0.0061763786, 0.00618301, 0.006180962, 0.006187563, 0.0061941408, 0.0062006423, 0.006198578, 0.006205103, 0.006203038, 0.006209533, 0.006207415, 0.00621388, 0.0062203235, 0.0062267454, 0.0062246644, 0.0062310044, 0.006237375, 0.006243725, 0.0062500527, 0.0062479503, 0.0062541976, 0.006252095, 0.0062583666, 0.0062646177, 0.006262508, 0.006268679, 0.0062748813, 0.006281063, 0.0062789405, 
0.0062933653, 0.0062911776, 0.0062972917, 0.0063116145, 0.0063094595, 0.006315513, 0.0063214954, 0.0063193347, 0.006317181, 0.006315035, 0.00632103, 0.006326955, 0.0063329116, 0.006338849, 0.0063447673, 0.0063506668, 0.006348439, 0.0063543133, 0.0063521382, 0.0063579874, 0.0063638184, 0.006361638, 0.006367394, 0.0063731815, 0.006378951, 0.006376761, 0.0063825063, 0.006388183, 0.0063859886, 0.0063916924, 0.006397378, 0.0064030457, 0.0064007915, 0.006406436, 0.006404234, 0.0064098556, 0.0064154593, 0.006420996, 0.0064265653, 0.006432117, 0.006437652, 0.006435425, 0.0064408877, 0.006446383, 0.0064441534, 0.0064496268, 0.0064473986, 0.006452801, 0.006458236, 0.006456005, 0.0064614187, 0.0064668157, 0.006472147, 0.006477511, 0.006482859, 0.0064806114, 0.006485938, 0.0064912, 0.006496495, 0.0065017743, 0.006507037, 0.0065122847, 0.0065174676, 0.0065226834, 0.006527884, 0.0065330686, 0.006538238, 0.006543343, 0.006548482, 0.006553605, 0.0065513025, 0.0065564066, 0.006554058, 0.006559143, 0.0065568457, 0.0065619117, 0.006559617, 0.0065646158, 0.006562324, 0.006567352, 0.0065723653, 0.0065700724, 0.006575019, 0.00658, 0.0065849656, 0.0065826676, 0.0065876152, 0.006592501, 0.00659742, 0.006602325, 0.0066072163, 0.0066049057, 0.006609732, 0.006614591, 0.006619436, 0.0066171214, 0.0066219494, 0.006619591, 0.0066244015, 0.0066220933, 0.006619791, 0.0066245813, 0.0066293105, 0.0066270083, 0.0066317674, 0.0066365134, 0.0066342107, 0.0066388934, 0.006643609, 0.006648312, 0.0066530015, 0.006650692, 0.0066553187, 0.006659979, 0.0066576693, 0.006662313, 0.0066600065, 0.006664588, 0.006662285, 0.006666897, 0.006664597, 0.006669193, 0.0066668503, 0.0066714305, 0.0066759977, 0.0066805533, 0.006685096, 0.00668958, 0.006694098, 0.0066917893, 0.0066962917, 0.0066939862, 0.006698428, 0.006696126, 0.006700598, 0.006698299, 0.0067027565, 0.0067071565, 0.00671159, 0.0067160116, 0.006720421, 0.0067248186, 0.0067291595, 0.006733534, 0.0067378967, 0.0067422474, 0.0067399265, 0.0067442185, 
0.0067419014, 0.006746224, 0.0067505348, 0.0067482186, 0.006752471, 0.006756757, 0.006761031, 0.0067652944, 0.0067695463, 0.0067737424, 0.0067714173, 0.006769098, 0.0067733224, 0.006777536, 0.0067816945, 0.006779374, 0.0067770593, 0.0067812465, 0.0067854226, 0.006783065, 0.006787228, 0.00679138, 0.0067955214, 0.006793207, 0.0067972913, 0.006801409, 0.0068055163, 0.006803201, 0.006807295, 0.00680494, 0.0068026343, 0.0068067135, 0.0068107825, 0.006808478, 0.0068124915, 0.0068165376, 0.0068205735, 0.006824599, 0.0068286145, 0.006832577, 0.0068302653, 0.0068342583, 0.006838241, 0.0068422146, 0.0068461346, 0.0068500875, 0.0068540312, 0.0068579647, 0.0068556443, 0.0068595232, 0.006857207, 0.0068611167, 0.006865017, 0.0068689077, 0.0068727457, 0.006876617, 0.006874297, 0.0068781567, 0.00687584, 0.0068796463, 0.006877334, 0.006881171, 0.0068849986, 0.0068949456, 0.006898705, 0.006902497, 0.00690628, 0.0069100535, 0.006913818, 0.006917531, 0.0069152005, 0.0069189453, 0.006916619, 0.006926407, 0.006924035, 0.006927751, 0.0069314577, 0.0069351555, 0.0069388445, 0.0069424827, 0.006946154, 0.0069498164, 0.00695347, 0.006951133, 0.00694876, 0.0069524013, 0.0069560343, 0.006959659, 0.0069573284, 0.006954962, 0.006958575, 0.0069562537, 0.0069598565, 0.0069575394, 0.006961091, 0.006964675, 0.006968251, 0.0069718184, 0.0069753774, 0.006978887, 0.0069765667, 0.0069801076, 0.0069836406, 0.0069871647, 0.0069906404, 0.0069941483, 0.006997648, 0.006995325, 0.006998815, 0.007002257, 0.0069999364, 0.0070034093, 0.007006874, 0.007010331, 0.007013739, 0.00701718, 0.007020613, 0.00701829, 0.007021714, 0.0070250896, 0.0070284978, 0.0070261764, 0.0070295758, 0.007032967, 0.0070363507, 0.007033991, 0.007037366, 0.007040733, 0.0070440923, 0.007047444, 0.0070507484, 0.007054085, 0.0070517636, 0.0070550917, 0.0070584123, 0.007056054, 0.0070537413, 0.007057052, 0.0070603555, 0.0070580454, 0.007061301, 0.007064588, 0.0070678685, 0.00706556, 0.007068832, 0.007072057, 0.007069752, 0.007073008, 
0.007076257, 0.007079499, 0.007082694, 0.00708039, 0.0070836167, 0.0070813163, 0.007084535, 0.0070821997, 0.0070909113, 0.007094108, 0.007091809, 0.0070949984, 0.0070981416, 0.007101317, 0.0070990203, 0.0071021873, 0.007099895, 0.0071030157, 0.007106168, 0.0071093137, 0.0071070236, 0.0071101612, 0.007113254, 0.0071163783, 0.0071140905, 0.007111807, 0.007114923, 0.0071179937, 0.0071210964, 0.0071134386, 0.0071165394, 0.007108903, 0.0071066045, 0.007104349, 0.007102098, 0.0071051945, 0.0071082837, 0.007111329, 0.0071090804, 0.007112156, 0.0071045975, 0.007097055, 0.007089491, 0.0070819804, 0.007074486, 0.007067007, 0.0070595443, 0.0070573343, 0.007060435, 0.0070530027, 0.007045586, 0.007038185, 0.0070307623, 0.0070233922, 0.0070160376, 0.0070086983, 0.0070013744, 0.006994029, 0.006986736, 0.0069794576, 0.0069721946, 0.0069649466, 0.0069576777, 0.00695046, 0.006943257, 0.006936069, 0.006928896, 0.006921702, 0.006914559, 0.00690743, 0.006900316, 0.006893217, 0.0068860967, 0.0068790265, 0.0068719713, 0.00686493, 0.006857903, 0.006850856, 0.006843858, 0.0068368744, 0.006829905, 0.0068229497, 0.0068159737, 0.006809047, 0.006802134, 0.0067952354, 0.00678835, 0.0067814454, 0.006774588, 0.006767745, 0.0067609157, 0.0067541003, 0.0067472644, 0.0067404765, 0.006733702, 0.0067269416, 0.006720194, 0.0067134267, 0.0067067067, 0.0067], "type": "scatter", "x": [1, 201, 401, 602, 802, 1002, 1202, 1402, 1603, 1803, 2003, 2203, 2403, 2604, 2804, 3004, 3204, 3404, 3605, 3805, 4005, 4205, 4405, 4606, 4806, 5006, 5206, 5406, 5607, 5807, 6007, 6207, 6407, 6608, 6808, 7008, 7208, 7408, 7609, 7809, 8009, 8209, 8409, 8610, 8810, 9010, 9210, 9410, 9611, 9811, 10011, 10211, 10411, 10612, 10812, 11012, 11212, 11412, 11613, 11813, 12013, 12213, 12413, 12614, 12814, 13014, 13214, 13414, 13615, 13815, 14015, 14215, 14415, 14616, 14816, 15016, 15216, 15416, 15617, 15817, 16017, 16217, 16417, 16618, 16818, 17018, 17218, 17418, 17619, 17819, 18019, 18219, 18419, 18620, 18820, 19020, 19220, 19420, 
19621, 19821, 20021, 20221, 20421, 20622, 20822, 21022, 21222, 21422, 21623, 21823, 22023, 22223, 22423, 22624, 22824, 23024, 23224, 23424, 23625, 23825, 24025, 24225, 24425, 24626, 24826, 25026, 25226, 25426, 25626, 25827, 26027, 26227, 26427, 26627, 26828, 27028, 27228, 27428, 27628, 27829, 28029, 28229, 28429, 28629, 28830, 29030, 29230, 29430, 29630, 29831, 30031, 30231, 30431, 30631, 30832, 31032, 31232, 31432, 31632, 31833, 32033, 32233, 32433, 32633, 32834, 33034, 33234, 33434, 33634, 33835, 34035, 34235, 34435, 34635, 34836, 35036, 35236, 35436, 35636, 35837, 36037, 36237, 36437, 36637, 36838, 37038, 37238, 37438, 37638, 37839, 38039, 38239, 38439, 38639, 38840, 39040, 39240, 39440, 39640, 39841, 40041, 40241, 40441, 40641, 40842, 41042, 41242, 41442, 41642, 41843, 42043, 42243, 42443, 42643, 42844, 43044, 43244, 43444, 43644, 43845, 44045, 44245, 44445, 44645, 44846, 45046, 45246, 45446, 45646, 45847, 46047, 46247, 46447, 46647, 46848, 47048, 47248, 47448, 47648, 47849, 48049, 48249, 48449, 48649, 48850, 49050, 49250, 49450, 49650, 49851, 50051, 50251, 50451, 50651, 50852, 51052, 51252, 51452, 51652, 51853, 52053, 52253, 52453, 52653, 52854, 53054, 53254, 53454, 53654, 53855, 54055, 54255, 54455, 54655, 54856, 55056, 55256, 55456, 55656, 55857, 56057, 56257, 56457, 56657, 56858, 57058, 57258, 57458, 57658, 57859, 58059, 58259, 58459, 58659, 58860, 59060, 59260, 59460, 59660, 59861, 60061, 60261, 60461, 60661, 60862, 61062, 61262, 61462, 61662, 61863, 62063, 62263, 62463, 62663, 62864, 63064, 63264, 63464, 63664, 63865, 64065, 64265, 64465, 64665, 64866, 65066, 65266, 65466, 65666, 65867, 66067, 66267, 66467, 66667, 66868, 67068, 67268, 67468, 67668, 67869, 68069, 68269, 68469, 68669, 68870, 69070, 69270, 69470, 69670, 69871, 70071, 70271, 70471, 70671, 70872, 71072, 71272, 71472, 71672, 71873, 72073, 72273, 72473, 72673, 72874, 73074, 73274, 73474, 73674, 73875, 74075, 74275, 74475, 74675, 74876, 75076, 75276, 75476, 75676, 75876, 76077, 76277, 76477, 
76677, 76877, 77078, 77278, 77478, 77678, 77878, 78079, 78279, 78479, 78679, 78879, 79080, 79280, 79480, 79680, 79880, 80081, 80281, 80481, 80681, 80881, 81082, 81282, 81482, 81682, 81882, 82083, 82283, 82483, 82683, 82883, 83084, 83284, 83484, 83684, 83884, 84085, 84285, 84485, 84685, 84885, 85086, 85286, 85486, 85686, 85886, 86087, 86287, 86487, 86687, 86887, 87088, 87288, 87488, 87688, 87888, 88089, 88289, 88489, 88689, 88889, 89090, 89290, 89490, 89690, 89890, 90091, 90291, 90491, 90691, 90891, 91092, 91292, 91492, 91692, 91892, 92093, 92293, 92493, 92693, 92893, 93094, 93294, 93494, 93694, 93894, 94095, 94295, 94495, 94695, 94895, 95096, 95296, 95496, 95696, 95896, 96097, 96297, 96497, 96697, 96897, 97098, 97298, 97498, 97698, 97898, 98099, 98299, 98499, 98699, 98899, 99100, 99300, 99500, 99700, 99900, 100101, 100301, 100501, 100701, 100901, 101102, 101302, 101502, 101702, 101902, 102103, 102303, 102503, 102703, 102903, 103104, 103304, 103504, 103704, 103904, 104105, 104305, 104505, 104705, 104905, 105106, 105306, 105506, 105706, 105906, 106107, 106307, 106507, 106707, 106907, 107108, 107308, 107508, 107708, 107908, 108109, 108309, 108509, 108709, 108909, 109110, 109310, 109510, 109710, 109910, 110111, 110311, 110511, 110711, 110911, 111112, 111312, 111512, 111712, 111912, 112113, 112313, 112513, 112713, 112913, 113114, 113314, 113514, 113714, 113914, 114115, 114315, 114515, 114715, 114915, 115116, 115316, 115516, 115716, 115916, 116117, 116317, 116517, 116717, 116917, 117118, 117318, 117518, 117718, 117918, 118119, 118319, 118519, 118719, 118919, 119120, 119320, 119520, 119720, 119920, 120121, 120321, 120521, 120721, 120921, 121122, 121322, 121522, 121722, 121922, 122123, 122323, 122523, 122723, 122923, 123124, 123324, 123524, 123724, 123924, 124125, 124325, 124525, 124725, 124925, 125125, 125326, 125526, 125726, 125926, 126126, 126327, 126527, 126727, 126927, 127127, 127328, 127528, 127728, 127928, 128128, 128329, 128529, 128729, 128929, 129129, 129330, 
129530, 129730, 129930, 130130, 130331, 130531, 130731, 130931, 131131, 131332, 131532, 131732, 131932, 132132, 132333, 132533, 132733, 132933, 133133, 133334, 133534, 133734, 133934, 134134, 134335, 134535, 134735, 134935, 135135, 135336, 135536, 135736, 135936, 136136, 136337, 136537, 136737, 136937, 137137, 137338, 137538, 137738, 137938, 138138, 138339, 138539, 138739, 138939, 139139, 139340, 139540, 139740, 139940, 140140, 140341, 140541, 140741, 140941, 141141, 141342, 141542, 141742, 141942, 142142, 142343, 142543, 142743, 142943, 143143, 143344, 143544, 143744, 143944, 144144, 144345, 144545, 144745, 144945, 145145, 145346, 145546, 145746, 145946, 146146, 146347, 146547, 146747, 146947, 147147, 147348, 147548, 147748, 147948, 148148, 148349, 148549, 148749, 148949, 149149, 149350, 149550, 149750, 149950, 150150, 150351, 150551, 150751, 150951, 151151, 151352, 151552, 151752, 151952, 152152, 152353, 152553, 152753, 152953, 153153, 153354, 153554, 153754, 153954, 154154, 154355, 154555, 154755, 154955, 155155, 155356, 155556, 155756, 155956, 156156, 156357, 156557, 156757, 156957, 157157, 157358, 157558, 157758, 157958, 158158, 158359, 158559, 158759, 158959, 159159, 159360, 159560, 159760, 159960, 160160, 160361, 160561, 160761, 160961, 161161, 161362, 161562, 161762, 161962, 162162, 162363, 162563, 162763, 162963, 163163, 163364, 163564, 163764, 163964, 164164, 164365, 164565, 164765, 164965, 165165, 165366, 165566, 165766, 165966, 166166, 166367, 166567, 166767, 166967, 167167, 167368, 167568, 167768, 167968, 168168, 168369, 168569, 168769, 168969, 169169, 169370, 169570, 169770, 169970, 170170, 170371, 170571, 170771, 170971, 171171, 171372, 171572, 171772, 171972, 172172, 172373, 172573, 172773, 172973, 173173, 173374, 173574, 173774, 173974, 174174, 174375, 174575, 174775, 174975, 175175, 175375, 175576, 175776, 175976, 176176, 176376, 176577, 176777, 176977, 177177, 177377, 177578, 177778, 177978, 178178, 178378, 178579, 178779, 178979, 179179, 179379, 
179580, 179780, 179980, 180180, 180380, 180581, 180781, 180981, 181181, 181381, 181582, 181782, 181982, 182182, 182382, 182583, 182783, 182983, 183183, 183383, 183584, 183784, 183984, 184184, 184384, 184585, 184785, 184985, 185185, 185385, 185586, 185786, 185986, 186186, 186386, 186587, 186787, 186987, 187187, 187387, 187588, 187788, 187988, 188188, 188388, 188589, 188789, 188989, 189189, 189389, 189590, 189790, 189990, 190190, 190390, 190591, 190791, 190991, 191191, 191391, 191592, 191792, 191992, 192192, 192392, 192593, 192793, 192993, 193193, 193393, 193594, 193794, 193994, 194194, 194394, 194595, 194795, 194995, 195195, 195395, 195596, 195796, 195996, 196196, 196396, 196597, 196797, 196997, 197197, 197397, 197598, 197798, 197998, 198198, 198398, 198599, 198799, 198999, 199199, 199399, 199600, 199800, 200000]}]} // Get the plotly listeners const plotly_listeners = {} // Get the JS listeners const js_listeners = {} // Deal with eventual custom classes let custom_classlist = [] // Load the plotly library if (!window.Plotly) { const {plotly} = await import('https://cdn.plot.ly/plotly-2.16.1.min.js') } // Check if we have to force local mathjax font cache if (false && window?.MathJax?.config?.svg?.fontCache === 'global') { window.MathJax.config.svg.fontCache = 'local' } // Flag to check if this cell was manually ran or reactively ran const firstRun = this ? false : true const PLOT = this ?? document.createElement("div"); const parent = currentScript.parentElement const isPlutoWrapper = parent.classList.contains('raw-html-wrapper') if (firstRun) { // It seem plot divs would not autosize themself inside flexbox containers without this parent.appendChild(PLOT) } // If width is not specified, set it to 100% PLOT.style.width = plot_obj.layout.width ? "" : "100%" // For the height we have to also put a fixed value in case the plot is put on a non-fixed-size container (like the default wrapper) PLOT.style.height = plot_obj.layout.height ? 
"" : (isPlutoWrapper || parent.clientHeight == 0) ? "400px" : "100%" PLOT.classList.forEach(cn => { if (cn !== 'js-plotly-plot' && !custom_classlist.includes(cn)) { PLOT.classList.toggle(cn, false) } }) for (const className of custom_classlist) { PLOT.classList.toggle(className, true) } // Create the resizeObserver to make the plot even more responsive! :magic: const resizeObserver = new ResizeObserver(entries => { PLOT.style.height = plot_obj.layout.height ? "" : (isPlutoWrapper || parent.clientHeight == 0) ? "400px" : "100%" /* The addition of the invalid argument `plutoresize` seems to fix the problem with calling `relayout` simply with `{autosize: true}` as update breaking mouse relayout events tracking. See https://github.com/plotly/plotly.js/issues/6156 for details */ Plotly.relayout(PLOT, {..._.pick(PLOT.layout, ['width','height']), autosize: true, plutoresize: true}) }) resizeObserver.observe(PLOT) Plotly.react(PLOT, plot_obj).then(() => { // Assign the Plotly event listeners for (const [key, listener_vec] of Object.entries(plotly_listeners)) { for (const listener of listener_vec) { PLOT.on(key, listener) } } // Assign the JS event listeners for (const [key, listener_vec] of Object.entries(js_listeners)) { for (const listener of listener_vec) { PLOT.addEventListener(key, listener) } } } ) invalidation.then(() => { // Remove all plotly listeners PLOT.removeAllListeners() // Remove all JS listeners for (const [key, listener_vec] of Object.entries(js_listeners)) { for (const listener of listener_vec) { PLOT.removeEventListener(key, listener) } } // Remove the resizeObserver resizeObserver.disconnect() }) return PLOT ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•=Z×(°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$98222fcd-b456-477c-90dd-844df36877e5¹depends_on_disabled_cellsÂ§runtimeÎ£µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$f7f58fd2-facc-4b87-9172-5e911677c8f4Š¦queuedÂ¤logs§runningÂ¦output†¤body 
¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ”ô•ˆa°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$f7f58fd2-facc-4b87-9172-5e911677c8f4¹depends_on_disabled_cellsÂ§runtimeÍ(©µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$58403c8e-0ee4-4466-ba25-ee0c86fb0b47Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚn

Consider the feature vector $\mathbf{x}(s)$ and a function $\mathbf{h}(s, \boldsymbol{\theta})$ that produces a vector of action preferences. We would like to derive an expression for $\nabla \ln \pi (a \vert s, \boldsymbol{\theta})$ in the case of $\mathbf{\pi}(s, \boldsymbol{\theta}) = \sigma(\mathbf{h}(s, \boldsymbol{\theta}))$, where $\sigma(\mathbf{x})$ is the softmax function defined in section 13.1. Here $\mathbf{\pi}(s, \boldsymbol{\theta})$ denotes the vector of action probabilities at a given state, and a subscript on a vector selects that element of the vector. To shorten expressions, the following terms are equivalent:

$$\begin{flalign} \mathbf{\pi} &\doteq \mathbf{\pi}(s, \boldsymbol{\theta}) \\ \mathbf{h} &\doteq \mathbf{h}(s, \boldsymbol{\theta}) \\ x_i &\doteq \mathbf{x}_i \text{ for all vectors} \\ \end{flalign}$$

Using these conventions, we previously had an expression for the $i$th component of the gradient of the policy:

$$\nabla \left( \pi_a \right )_i = \pi_a \left ( \frac{\partial{h_a}}{\partial{\theta_i}} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \right )$$

We can use this expression to derive the components of the eligibility vector in general:

$$\begin{flalign} \nabla \left( \ln \mathbf{\pi}_a \right)_i &= \frac{\nabla \left( \pi_a \right )_i}{\pi_a}\\ &=\frac{\partial{h_a}}{\partial{\theta_i}} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \\ \end{flalign}$$
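
As a worked special case (an assumption made only for illustration; the derivation above does not require it), suppose the preferences are linear in the features with one parameter column per action, $h_k = \sum_j \theta_{j,k} x_j$. Then $\frac{\partial{h_k}}{\partial{\theta_{j,b}}} = x_j$ when $k = b$ and $0$ otherwise. Writing $\delta_{ab}$ for the Kronecker delta (1 if $a = b$, else 0, not to be confused with a TD error), the general expression reduces to:

$$\begin{flalign} \nabla \left( \ln \mathbf{\pi}_a \right)_{j,b} &= x_j \delta_{ab} - \sum_k{\pi_k x_j \delta_{kb}} \\ &= x_j \left( \delta_{ab} - \pi_b \right) \\ \end{flalign}$$

In matrix form this is the outer product $\mathbf{x} \left( \mathbf{e}_a - \mathbf{\pi} \right)^\top$, where $\mathbf{e}_a$ is the one-hot vector for action $a$.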

Connection to Cross-Entropy Loss

Classification problems involve training a function to predict the class label of an input. The function returns a vector of class preferences, which the softmax function converts to a probability distribution. The cross-entropy loss compares this distribution with the desired output label to generate an error value.

Let $\mathbf{p}(s)$ denote the vector of true probabilities for an example $s$ and keep our output function as $\mathbf{\pi}(s, \boldsymbol{\theta}) = \sigma(\mathbf{h}(s, \boldsymbol{\theta}))$. The cross-entropy loss is defined as:

$$\mathcal{L}(\mathbf{p}, \mathbf{\pi}) = -\sum_i \mathbf{p}_i \ln \mathbf{\pi}_i$$

omitting $s$ and $\boldsymbol{\theta}$.

In a typical situation with a dataset, $\mathbf{p}(s)$ will be a one-hot vector marking the index of the example's label. Let's call that index $a$, so that $p_a = 1$ and $p_i = 0 \: \forall i \neq a$. The loss then simplifies to $\mathcal{L}(a, \mathbf{\pi}) = -\ln \mathbf{\pi}_a$. When we train with gradient descent on such a dataset, we must compute the gradient of this loss with respect to the parameters, $-\nabla \ln \pi_a$, which is just negative one times the eligibility vector for general parameterized approximation. So if we have a function that computes the gradient of the cross-entropy loss of the softmax output for a vector function and a label index, we can replace the dataset's label index with the desired action index $a$; multiplying that gradient by negative one then gives our desired gradient.
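
The equivalence can be checked numerically. The sketch below is not the notebook's implementation: it assumes a linear preference function with one parameter column per action, and the helper names (`softmax`, `preferences`, `cross_entropy`, `eligibility`, `finite_difference_gradient`) are hypothetical. It compares a finite-difference gradient of the cross-entropy loss against the analytic eligibility vector.

```julia
# Softmax over a vector of action preferences (shifted by the max for numerical stability).
function softmax(h::AbstractVector)
    z = exp.(h .- maximum(h))
    return z ./ sum(z)
end

# Hypothetical linear preference function: one parameter column per action.
preferences(θ::AbstractMatrix, x::AbstractVector) = θ' * x

# Cross-entropy loss for a one-hot label at index a: -ln π_a.
cross_entropy(θ, x, a) = -log(softmax(preferences(θ, x))[a])

# Analytic eligibility vector ∇ ln π_a for the assumed linear case: x * (e_a - π)'.
function eligibility(θ, x, a)
    π_vec = softmax(preferences(θ, x))
    e_a = zeros(length(π_vec))
    e_a[a] = 1.0
    return x * (e_a .- π_vec)'
end

# Central finite-difference gradient of a scalar function f with respect to θ.
function finite_difference_gradient(f, θ; ϵ = 1e-6)
    g = similar(θ)
    for i in eachindex(θ)
        θ_plus = copy(θ); θ_plus[i] += ϵ
        θ_minus = copy(θ); θ_minus[i] -= ϵ
        g[i] = (f(θ_plus) - f(θ_minus)) / (2ϵ)
    end
    return g
end

θ = [0.1 -0.2 0.3; 0.4 0.0 -0.5]   # 2 features × 3 actions (arbitrary test values)
x = [1.0, 2.0]
a = 2                               # the "label" index, i.e. the selected action

loss_gradient = finite_difference_gradient(θ_ -> cross_entropy(θ_, x, a), θ)

# Largest absolute difference between the loss gradient and -1 times the eligibility vector.
max_difference = maximum(abs.(loss_gradient .+ eligibility(θ, x, a)))
```

If the two gradients agree, `max_difference` should be on the order of the finite-difference error. With an automatic differentiation library, the finite-difference helper would be replaced by that library's gradient of the loss.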

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôˆ×þ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$58403c8e-0ee4-4466-ba25-ee0c86fb0b47¹depends_on_disabled_cellsÂ§runtimeÎ µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$e1aec891-d95a-47d1-97d7-d2a4cfb16e64Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙGsetup_fcann_policy_and_value_arguments (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ• -ëÿ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$e1aec891-d95a-47d1-97d7-d2a4cfb16e64¹depends_on_disabled_cellsÂ§runtimeÎïëµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$3d065608-eef2-4caa-b17d-ec60714e3d58Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙSactor_critic_binary_episodic_beta_parameter_study (generic function with 2 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•/ôóº°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$3d065608-eef2-4caa-b17d-ec60714e3d58¹depends_on_disabled_cellsÂ§runtimeÎ0ïTµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$b87ff1a9-abff-40f7-a1d8-f751a1c8b060Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚS

In the episodic case, we provided a reward of -1 per step and then considered an episode finished when a failure state was reached. In the continuing case, the step function will provide a reward of 0 unless a failure occurs, in which case it will provide a reward of -1 and then initialize a new state.
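
The continuing reward scheme described above can be sketched as a wrapper around an ordinary transition function. This is a minimal illustration under stated assumptions, not the notebook's code: the signatures `step(s, a)`, `failure(s)`, and `initialize_state()` are hypothetical, and the toy 1-D dynamics stand in for the real environment.

```julia
# Hypothetical wrapper that turns a plain transition function into a continuing-task step.
# `step(s, a)` returns the next state, `failure(s)` reports whether a state is a failure,
# and `initialize_state()` produces a fresh starting state; all three are placeholders.
function make_continuing_step(step, failure, initialize_state)
    function continuing_step(s, a)
        s_next = step(s, a)
        # Reward is 0 on ordinary steps; on failure it is -1 and the state is re-initialized.
        return failure(s_next) ? (-1.0f0, initialize_state()) : (0.0f0, s_next)
    end
    return continuing_step
end

# Toy usage: a 1-D position that fails once it leaves the interval [-2, 2].
toy_step(s, a) = s + a
toy_failure(s) = abs(s) > 2
toy_init() = 0.0f0

toy_continuing_step = make_continuing_step(toy_step, toy_failure, toy_init)

ordinary = toy_continuing_step(1.0f0, 0.5f0)   # stays inside the interval
failed = toy_continuing_step(1.5f0, 1.0f0)     # leaves the interval, so the state resets
```

Because the wrapper never signals termination, the task's terminal-state check can simply always return false, and learning proceeds over a single unbroken stream of transitions.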

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô#…°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$b87ff1a9-abff-40f7-a1d8-f751a1c8b060¹depends_on_disabled_cellsÂ§runtimeÎ#¼µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$e89bdc84-dbb5-4c73-a39c-6392e5f79704Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÛW

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•@®&³°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$e89bdc84-dbb5-4c73-a39c-6392e5f79704¹depends_on_disabled_cellsÂ§runtimeÎÃPµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$d3b56fca-5b79-4465-8987-8d0005f854d8Š¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements˜’¯episode_rewards’…¦prefix§Float32¨elements›’’¤25.0ªtext/plain’’¤32.0ªtext/plain’’¤32.0ªtext/plain’’¤29.0ªtext/plain’’¤27.0ªtext/plain’’¤57.0ªtext/plain’’¤25.0ªtext/plain’’¤22.0ªtext/plain’ ’¤29.0ªtext/plain¤more’Í'’¤20.0ªtext/plain¤type¥Array¬prefix_short ¨objectid°a84c5aa4de73c4dcÙ!application/vnd.pluto.tree+object’episode_steps’…¦prefix¥Int64¨elements›’’¢25ªtext/plain’’¢32ªtext/plain’’¢32ªtext/plain’’¢29ªtext/plain’’¢27ªtext/plain’’¢57ªtext/plain’’¢25ªtext/plain’’¢22ªtext/plain’ ’¢29ªtext/plain¤more’Í'’¢20ªtext/plain¤type¥Array¬prefix_short ¨objectid°8ee6ad21aba86826Ù!application/vnd.pluto.tree+object’¯policy_function’£Ï€2ªtext/plain’´policy_sample_action’ªÏ€_sample2ªtext/plain’±policy_parameters’Ùê52488Ã—3 Matrix{Float32}: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 â‹® 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0ªtext/plain’´estimate_state_value’´estimate_state_valueªtext/plain’°value_parameters’…¦prefix§Float32¨elements›’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’ ’£0.0ªtext/plain¤more’ÍÍ’£0.0ªtext/plain¤type¥Array¬prefix_short ¨objectid°1f27a50400f1227eÙ!application/vnd.pluto.tree+object’°policy_and_value’°policy_and_valueªtext/plain¤typeªNamedTuple¨objectid°cb87f538fb76a21f¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeµconst reinforce_test2²last_run_timestampËAÚ•1/°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$d3b56fca-5b79-4465-8987-8d0005f854d8¹depends_on_disabled_cellsÂ§runtimeÎ&š 
Œµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$d21617aa-6f38-4a90-8586-4b32022497adŠ¦queuedÂ¤logs§runningÂ¦output†¤body…¦prefixÚêStateMDP{Float32, CartPoleState{Float32}, Float32, StateMDPTransitionSampler{Float32, CartPoleState{Float32}, var"#1516#1532"{var"#continuing_step#1530"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}, var"#failure#1527"{Float32, Float32, Float32, Float32}}, Vector{Float32}}}, var"#initialize_state#1525"{var"#initialize_state#1514#1526"{var"#1521#1537", var"#init_Î¸#1551", var"#1523#1539", var"#1524#1540"}}, Returns{Bool}, TabularRL.var"#164#169"}¨elements–’§actions’…¦prefix§Float32¨elements“’’¦-300.0ªtext/plain’’£0.0ªtext/plain’’¥300.0ªtext/plain¤type¥Array¬prefix_short ¨objectid°98ad56d5f22ee7f4Ù!application/vnd.pluto.tree+object’£ptf’…¦prefixÚStateMDPTransitionSampler{Float32, CartPoleState{Float32}, var"#1516#1532"{var"#continuing_step#1530"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}, var"#failure#1527"{Float32, Float32, Float32, Float32}}, Vector{Float32}}}¨elements‘’¤step’Ù׬ (generic function with 1 method)ªtext/plain¤type¦struct¬prefix_short¹StateMDPTransitionSampler¨objectid°1a36bcae07eee980Ù!application/vnd.pluto.tree+object’°initialize_state’Ú-(::Main.var"workspace#8".var"#initialize_state#1525"{Main.var"workspace#8".var"#initialize_state#1514#1526"{Main.var"workspace#8".var"#1521#1537", Main.var"workspace#8".var"#init_Î¸#1551", Main.var"workspace#8".var"#1523#1539", Main.var"workspace#8".var"#1524#1540"}}) (generic function with 1 method)ªtext/plain’¦isterm’´Returns{Bool}(false)ªtext/plain’¯is_valid_action’Ù%#164 (generic function with 1 method)ªtext/plain’¬action_index’…¦prefix´Dict{Float32, 
Int64}¨elements“’’£0.0ªtext/plain’¡2ªtext/plain’’¥300.0ªtext/plain’¡3ªtext/plain’’¦-300.0ªtext/plain’¡1ªtext/plain¤type¤Dict¬prefix_short¤Dict¨objectid°c3e528862580ef49Ù!application/vnd.pluto.tree+object¤type¦struct¬prefix_short¨StateMDP¨objectid°45fb03e2144b629b¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÀ²last_run_timestampËAÚ•:Î6°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$d21617aa-6f38-4a90-8586-4b32022497ad¹depends_on_disabled_cellsÂ§runtimeÍCnµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÛÄh

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•@Qs°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2¹depends_on_disabled_cellsÂ§runtimeÎ{C©µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7dŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements™’¬step_rewards’…¦prefix§Float32¨elements¤type¥Array¬prefix_short ¨objectid°8bf530e829da7fcbÙ!application/vnd.pluto.tree+object’episode_steps’…¦prefix¥Int64¨elements›’’¥50726ªtext/plain’’¥56742ªtext/plain’’¥58488ªtext/plain’’¥59843ªtext/plain’’¥61006ªtext/plain’’¥61966ªtext/plain’’¥63617ªtext/plain’’¥67188ªtext/plain’ ’¥70662ªtext/plain¤more’ÍÆ’¦999851ªtext/plain¤type¥Array¬prefix_short ¨objectid¯11b0fa894263dfeÙ!application/vnd.pluto.tree+object’¯episode_rewards’…¦prefix§Float32¨elements›’’¨-50725.0ªtext/plain’’§-6016.0ªtext/plain’’§-1746.0ªtext/plain’’§-1355.0ªtext/plain’’§-1163.0ªtext/plain’’¦-960.0ªtext/plain’’§-1651.0ªtext/plain’’§-3571.0ªtext/plain’ ’§-3474.0ªtext/plain¤more’ÍÆ’¦-150.0ªtext/plain¤type¥Array¬prefix_short ¨objectid°466b48119ec6acefÙ!application/vnd.pluto.tree+object’±policy_parameters’ÚF1452Ã—2 Matrix{Float32}: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 â‹® 0.202401 -0.0946094 0.181695 -0.0776743 0.0960714 -0.0431665 0.0161015 -0.00757248 0.000145402 9.99713f-5 0.115561 -0.0530652ªtext/plain’°value_parameters’…¦prefix§Float32¨elements›’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’ ’£0.0ªtext/plain¤more’Í¬’¨-2.51317ªtext/plain¤type¥Array¬prefix_short ¨objectid°d12e54d6c0758517Ù!application/vnd.pluto.tree+object’¯policy_function’¢Ï€ªtext/plain’´policy_sample_action’©Ï€_sampleªtext/plain’´estimate_state_value’´estimate_state_valueªtext/plain’°policy_and_value’°policy_and_valueªtext/plain¤typeªNamedTuple¨objectid°4211faf2c3e65661¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÙ(const 
mountaincar_continuous_test_train3²last_run_timestampËAÚ•>W ×°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7d¹depends_on_disabled_cellsÂ§runtimeÎH¿‘µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$d82e7ab8-c372-4462-afb5-1617560cdb56Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÛÛl

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•@»dA°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$d82e7ab8-c372-4462-afb5-1617560cdb56¹depends_on_disabled_cellsÂ§runtimeÎekbµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$3c89209c-9202-4d5d-841c-ea34be369616Š¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements™’¬step_rewards’…¦prefix§Float32¨elements›’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’ ’£0.0ªtext/plain¤more’Íu0’£0.0ªtext/plain¤type¥Array¬prefix_short ¨objectid°55b4385e79639d7eÙ!application/vnd.pluto.tree+object’¬total_reward’¦-266.0ªtext/plain’«total_steps’¥30000ªtext/plain’±policy_parameters’Ùê29160Ã—3 Matrix{Float32}: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 â‹® 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0ªtext/plain’°value_parameters’…¦prefix§Float32¨elements›’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’ ’£0.0ªtext/plain¤more’Íqè’£0.0ªtext/plain¤type¥Array¬prefix_short ¨objectid°7ce03c391e3787fdÙ!application/vnd.pluto.tree+object’¯policy_function’¢Ï€ªtext/plain’´policy_sample_action’©Ï€_sampleªtext/plain’´estimate_state_value’´estimate_state_valueªtext/plain’°policy_and_value’°policy_and_valueªtext/plain¤typeªNamedTuple¨objectid°9e144f403694fa96¤mimeÙ!application/vnd.pluto.tree+object¬rootassignee¾const cartpole_continuing_test²last_run_timestampËAÚ•+¦æd°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$3c89209c-9202-4d5d-841c-ea34be369616¹depends_on_disabled_cellsÂ§runtimeÎ.¯‰…µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$635abb34-2c97-4f04-a74c-22fbec32f408Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ5fcann_value_function (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ• 
Lb°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$635abb34-2c97-4f04-a74c-22fbec32f408¹depends_on_disabled_cellsÂ§runtimeÎpåµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$0bf3b988-b3fb-49d5-8dde-b25766596363Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ6linear_value_function (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ• vÀ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$0bf3b988-b3fb-49d5-8dde-b25766596363¹depends_on_disabled_cellsÂ§runtimeÎqµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$d8222abf-139c-4220-8e92-cc987ec6900cŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ#

Note that for the corridor problem, the state-value learning rates have very little impact, and learning is most effective when $\lambda_{\boldsymbol{\theta}}$ is close to 1, which mimics REINFORCE with baseline.

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô‹í °persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$d8222abf-139c-4220-8e92-cc987ec6900c¹depends_on_disabled_cellsÂ§runtimeÎ7±µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$68e6f17e-8c87-40f0-a673-1115ecd1b71dŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ

Exercise 13.5

A Bernoulli-logistic unit is a stochastic neuron-like unit used in some ANNs. Its input at time $t$ is a feature vector $\mathbf{x}(S_t)$; its output, $A_t$, is a random variable having two values, 0 and 1, with $\Pr \{A_t=1 \}=P_t$ and $\Pr\{A_t=0\}=1-P_t$ (the Bernoulli distribution). Let $h(s, 0, \mathbf{\theta})$ and $h(s, 1, \mathbf{\theta})$ be the preferences in state $s$ for the unit's two actions given by policy parameter $\mathbf{\theta}$. Assume that the difference between the action preferences is given by a weighted sum of the unit's input vector, that is, assume that $h(s, 1, \mathbf{\theta})-h(s,0, \mathbf{\theta}) = \mathbf{\theta}^\top \mathbf{x}(s)$, where $\mathbf{\theta}$ is the unit's weight vector.

(a) Show that if the exponential soft-max distribution (13.2) is used to convert action preferences to policies, then ${P_t = \pi(1|S_t, \theta_t)=1/(1+\exp(-\theta_t^\top\mathbf{x}(S_t)))}$ (the logistic function).

(b) What is the Monte-Carlo REINFORCE update of $\theta_t$ to $\theta_{t+1}$ upon receipt of return $G_t$?

(c) Express the eligibility $\nabla \ln \pi(a|s, \theta)$ for a Bernoulli-logistic unit, in terms of $a$, $\mathbf{x}(s)$, and $\pi(a|s, \theta)$ by calculating the gradient.

Hint for part (c): Define $P=\pi(1|s,\theta)$ and compute the derivative of the logarithm, for each action, using the chain rule on $P$. Combine the two results into one expression that depends on $a$ and $P$, and then use the chain rule again, this time on $\theta^\top\mathbf{x}(s)$, noting that the derivative of the logistic function $f(x)=1/(1+e^{-x})$ is $f(x)(1-f(x))$.
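
One way to work through the three parts is sketched here, under stated assumptions rather than as the definitive solution. The shorthand $h_a = h(s, a, \mathbf{\theta})$ is introduced only for this sketch, and the step size $\alpha$ and discount $\gamma$ are assumed to be those of the standard REINFORCE update; neither is defined in the exercise text. Applying the soft-max to the two preferences and dividing through by $e^{h_1}$ gives the logistic form:

$$\pi(1|s,\theta) = \frac{e^{h_1}}{e^{h_1}+e^{h_0}} = \frac{1}{1+e^{-(h_1-h_0)}} = \frac{1}{1+\exp(-\theta^\top\mathbf{x}(s))} = P$$

Following the hint, the derivative of the log-probability with respect to $P$ is $1/P$ for $a=1$ and $-1/(1-P)$ for $a=0$, and both cases combine into one expression:

$$\frac{\partial \ln \pi(a|s,\theta)}{\partial P} = \frac{a-P}{P(1-P)}$$

The second use of the chain rule, with the logistic derivative $f'(x)=f(x)(1-f(x))$, gives $\nabla P = P(1-P)\,\mathbf{x}(s)$, so the factors of $P(1-P)$ cancel:

$$\nabla \ln \pi(a|s,\theta) = (a-P)\,\mathbf{x}(s) = (2a-1)\,\bigl(1-\pi(a|s,\theta)\bigr)\,\mathbf{x}(s)$$

Substituting this eligibility into the REINFORCE update then yields, with $\gamma^t = 1$ in the undiscounted case:

$$\theta_{t+1} = \theta_t + \alpha\,\gamma^t\,G_t\,(A_t - P_t)\,\mathbf{x}(S_t)$$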

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô(N°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$68e6f17e-8c87-40f0-a673-1115ecd1b71d¹depends_on_disabled_cellsÂ§runtimeÎ<µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$cf1859d6-f889-4923-8c87-2d7c039f26c3Š¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements•’’…¦prefixÙ,Main.var"workspace#8".CartPoleState{Float32}¨elements›’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’£0.0ªtext/plain’¢Î¸’©-0.523599ªtext/plain’£áº‹’£0.0ªtext/plain’¤Î¸Ì‡’£0.0ªtext/plain’¡t’£0.0ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°d91ae9eead284024Ù!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’ª1.14627f-7ªtext/plain’¢Î¸’§-0.5236ªtext/plain’£áº‹’«0.000229253ªtext/plain’¤Î¸Ì‡’«-0.00254927ªtext/plain’¡t’¥0.001ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°3cb8eeb34cc5b920Ù!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’ª4.94244f-7ªtext/plain’¢Î¸’©-0.523604ªtext/plain’£áº‹’«0.000529982ªtext/plain’¤Î¸Ì‡’ª-0.0051295ªtext/plain’¡t’¥0.002ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°5925df27c7c0a403Ù!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’©1.2003f-6ªtext/plain’¢Î¸’¨-0.52361ªtext/plain’£áº‹’ª0.00088212ªtext/plain’¤Î¸Ì‡’«-0.00773202ªtext/plain’¡t’¥0.003ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°799cb419fdaa824aÙ!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’ª2.17078f-6ªtext/plain’¢Î¸’©-0.523619ªtext/plain’£áº‹’ª0.00105886ªtext/plain’¤Î¸Ì‡’ª-0.0102586ªtext/plain’¡t’¥0.004ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°e537319306b8eb4cÙ!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’ª3.33277f-6ªtext/plain’¢Î¸’©-0.523631ªtext/plain’£áº‹’ª0.00126512ªtext/plain’¤Î¸Ì‡’©-0.012798ªtext/plain’¡t’¥0.005ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°5f23ac83d9a77f3eÙ!application/vnd.pluto.tree+object’’…
¦prefix¶CartPoleState{Float32}¨elements•’¡x’ª4.61946f-6ªtext/plain’¢Î¸’©-0.523645ªtext/plain’£áº‹’ª0.00130825ªtext/plain’¤Î¸Ì‡’ª-0.0152669ªtext/plain’¡t’¥0.006ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°c033031b74ffa399Ù!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’ª6.06162f-6ªtext/plain’¢Î¸’©-0.523661ªtext/plain’£áº‹’ª0.00157607ªtext/plain’¤Î¸Ì‡’ª-0.0178331ªtext/plain’¡t’¥0.007ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°c3aa3a90ea2893c0Ù!application/vnd.pluto.tree+object’ ’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’ª7.67533f-6ªtext/plain’¢Î¸’¨-0.52368ªtext/plain’£áº‹’ª0.00165135ªtext/plain’¤Î¸Ì‡’©-0.020316ªtext/plain’¡t’¥0.008ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°1b8c6c0669b914c2Ù!application/vnd.pluto.tree+object¤more’Í9’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’©0.0456676ªtext/plain’¢Î¸’¨-1.56841ªtext/plain’£áº‹’«0.000772072ªtext/plain’¤Î¸Ì‡’¨-2.90933ªtext/plain’¡t’¨0.823993ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°904ba40814a327d6Ù!application/vnd.pluto.tree+object¤type¥Array¬prefix_short ¨objectid°3d2481548487fb39Ù!application/vnd.pluto.tree+object’’…¦prefix§Float32¨elements›’’¨0.314051ªtext/plain’’§1.07348ªtext/plain’’¦1.6197ªtext/plain’’¨-0.24391ªtext/plain’’©0.0697864ªtext/plain’’¨-1.66355ªtext/plain’’§0.72386ªtext/plain’’§-1.3219ªtext/plain’ ’¨0.500745ªtext/plain¤more’Í9’¨0.871042ªtext/plain¤type¥Array¬prefix_short ¨objectid°cfcb72d929d5dd7cÙ!application/vnd.pluto.tree+object’’…¦prefix§Float32¨elements›’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’ ’£1.0ªtext/plain¤more’Í9’£1.0ªtext/plain¤type¥Array¬prefix_short 
¨objectid°4bd6ccd0a1ec8e2dÙ!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’¨0.045668ªtext/plain’¢Î¸’¨-1.57132ªtext/plain’£áº‹’ª8.09035f-5ªtext/plain’¤Î¸Ì‡’¨-2.91423ªtext/plain’¡t’¨0.824993ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°8d5acb06e1f87e75Ù!application/vnd.pluto.tree+object’’£825ªtext/plain¤type¥Tuple¨objectid°a3522606ceb35086¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÀ²last_run_timestampËAÚ•0w·,°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$cf1859d6-f889-4923-8c87-2d7c039f26c3¹depends_on_disabled_cellsÂ§runtimeÎ§çÁµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$5500fd8e-64cb-4af7-808d-230440746319Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙN

Continuing Mountain Car Example

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôsÒ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$5500fd8e-64cb-4af7-808d-230440746319¹depends_on_disabled_cellsÂ§runtimeÎjµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$76d54520-baa3-44bf-b303-4cdcb8b87080Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ4make_sample_vector (generic function with 2 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•2vŽ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$76d54520-baa3-44bf-b303-4cdcb8b87080¹depends_on_disabled_cellsÂ§runtimeÎ x‹µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$27441783-d3c6-40be-9c36-4941613e6ae9Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚg– ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•: îÔ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$27441783-d3c6-40be-9c36-4941613e6ae9¹depends_on_disabled_cellsÂ§runtimeÎû8®µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$fac138d9-3c5d-44b0-a87c-b13872f19450Š¦queuedÂ¤logs§runningÂ¦output†¤body ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ• ™z_°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$fac138d9-3c5d-44b0-a87c-b13872f19450¹depends_on_disabled_cellsÂ§runtimeÎ0µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$82e0e9a0-9662-429a-87e3-e6bdae02709aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements™’¬step_rewards’…¦prefix§Float32¨elements›’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’ ’£0.0ªtext/plain¤more’ÎB@’£0.0ªtext/plain¤type¥Array¬prefix_short ¨objectid°7d0e4ef3f6cc0757Ù!application/vnd.pluto.tree+object’¬total_reward’§-2782.0ªtext/plain’«total_steps’§1000000ªtext/plain’±policy_parameters’ƒ¨elements’’’…¦prefix¯Matrix{Float32}¨elements“’’Ú~32Ã—4 Matrix{Float32}: 0.321239 0.977521 -0.464886 0.62383 -0.713521 -0.761162 0.719418 -0.470045 1.61797 1.53386 -0.401547 1.00554 0.367269 1.45895 0.371751 1.28611 -0.158712 0.279256 1.20228 1.70069 0.283697 1.21211 -0.329526 
0.839996 1.00743 0.554876 -0.24803 -0.450346 â‹® -0.762991 1.46526 -0.173236 0.414643 0.215731 1.20015 -0.432446 0.804414 1.15654 1.65033 -0.0695841 2.1971 0.25408 1.3398 -1.09059 0.634848 -0.359608 -1.55853 1.21788 -1.01226 0.450864 -0.360808 0.469054 -0.341672ªtext/plain’’Ú832Ã—32 Matrix{Float32}: 0.298359 -0.109246 -0.227421 â€¦ 0.109085 0.180137 0.256977 -0.117696 -0.165926 0.10662 -0.149729 -0.0800637 0.20747 -0.026369 -0.0294542 0.277826 0.0174747 0.0704003 -0.0902252 0.0273941 -0.21855 -0.0283726 0.0638221 -0.11311 0.0278709 -0.100952 0.414816 0.261797 0.351694 -0.0347838 0.187904 -0.242686 0.070586 -0.317466 â€¦ 0.0113853 -0.0128254 -0.327565 0.0240385 0.212511 0.0172485 0.174674 -0.134726 0.277685 â‹® â‹± â‹® -0.0967271 -0.165915 -0.405567 -0.159964 -0.421851 -0.0303206 0.302258 0.234183 -0.597393 0.351072 -0.75114 -0.125634 -0.182872 -0.197682 0.0573918 0.217619 0.0951767 0.0406607 0.025299 -0.139596 0.169242 -0.118836 -0.0188807 0.305599 0.140254 -0.0946509 0.0477902 â€¦ -0.0605054 0.082238 0.174884 0.291454 -0.00377796 0.28423 -0.0312474 -0.359758 0.0568826ªtext/plain’’Ù÷3Ã—32 Matrix{Float32}: -0.264022 0.00228838 -0.296634 â€¦ -0.856357 0.00620197 -0.0195779 -0.0371509 0.21098 0.21658 0.122437 0.176658 -0.142418 0.039277 -0.021776 0.0372981 0.453329 0.119452 0.134094ªtext/plain¤type¥Array¬prefix_short ¨objectid°d11f8b5eec6ee50bÙ!application/vnd.pluto.tree+object’’…¦prefix¯Vector{Float32}¨elements“’’…¦prefix§Float32¨elements‘¤more¤type¥Array¬prefix_short ¨objectid°5489af23f3423d7fÙ!application/vnd.pluto.tree+object’’…¦prefix§Float32¨elements‘¤more¤type¥Array¬prefix_short ¨objectid°f10200752c32f98bÙ!application/vnd.pluto.tree+object’’…¦prefix§Float32¨elements‘¤more¤type¥Array¬prefix_short ¨objectid°2fe67d2d29fbf4beÙ!application/vnd.pluto.tree+object¤type¥Array¬prefix_short 
¨objectid°4c8103cd6331d69aÙ!application/vnd.pluto.tree+object¤type¥Tuple¨objectid°cba062158ba2207aÙ!application/vnd.pluto.tree+object’°value_parameters’ƒ¨elements’’’…¦prefix¯Matrix{Float32}¨elements“’’Ú~32Ã—4 Matrix{Float32}: 0.321239 0.977521 -0.464886 0.62383 -0.713521 -0.761162 0.719418 -0.470045 1.61797 1.53386 -0.401547 1.00554 0.367269 1.45895 0.371751 1.28611 -0.158712 0.279256 1.20228 1.70069 0.283697 1.21211 -0.329526 0.839996 1.00743 0.554876 -0.24803 -0.450346 â‹® -0.762991 1.46526 -0.173236 0.414643 0.215731 1.20015 -0.432446 0.804414 1.15654 1.65033 -0.0695841 2.1971 0.25408 1.3398 -1.09059 0.634848 -0.359608 -1.55853 1.21788 -1.01226 0.450864 -0.360808 0.469054 -0.341672ªtext/plain’’Ú832Ã—32 Matrix{Float32}: 0.298359 -0.109246 -0.227421 â€¦ 0.109085 0.180137 0.256977 -0.117696 -0.165926 0.10662 -0.149729 -0.0800637 0.20747 -0.026369 -0.0294542 0.277826 0.0174747 0.0704003 -0.0902252 0.0273941 -0.21855 -0.0283726 0.0638221 -0.11311 0.0278709 -0.100952 0.414816 0.261797 0.351694 -0.0347838 0.187904 -0.242686 0.070586 -0.317466 â€¦ 0.0113853 -0.0128254 -0.327565 0.0240385 0.212511 0.0172485 0.174674 -0.134726 0.277685 â‹® â‹± â‹® -0.0967271 -0.165915 -0.405567 -0.159964 -0.421851 -0.0303206 0.302258 0.234183 -0.597393 0.351072 -0.75114 -0.125634 -0.182872 -0.197682 0.0573918 0.217619 0.0951767 0.0406607 0.025299 -0.139596 0.169242 -0.118836 -0.0188807 0.305599 0.140254 -0.0946509 0.0477902 â€¦ -0.0605054 0.082238 0.174884 0.291454 -0.00377796 0.28423 -0.0312474 -0.359758 0.0568826ªtext/plain’’Ùl1Ã—32 Matrix{Float32}: -0.0115592 -0.0158553 0.0209644 -0.0216808 â€¦ 0.0455652 0.0202239 -0.0194471ªtext/plain¤type¥Array¬prefix_short ¨objectid°ada0492357e9f8c9Ù!application/vnd.pluto.tree+object’’…¦prefix¯Vector{Float32}¨elements“’’…¦prefix§Float32¨elements‘¤more¤type¥Array¬prefix_short ¨objectid°5489af23f3423d7fÙ!application/vnd.pluto.tree+object’’…¦prefix§Float32¨elements‘¤more¤type¥Array¬prefix_short 
¨objectid°f10200752c32f98bÙ!application/vnd.pluto.tree+object’’…¦prefix§Float32¨elements‘¤more¤type¥Array¬prefix_short ¨objectid°aea01b1ad9e99c14Ù!application/vnd.pluto.tree+object¤type¥Array¬prefix_short ¨objectid°fca9734af5fc5753Ù!application/vnd.pluto.tree+object¤type¥Tuple¨objectid°8f92ee17ba43b477Ù!application/vnd.pluto.tree+object’¯policy_function’¢Ï€ªtext/plain’´policy_sample_action’©Ï€_sampleªtext/plain’´estimate_state_value’´estimate_state_valueªtext/plain’°policy_and_value’°policy_and_valueªtext/plain¤typeªNamedTuple¨objectid°a50652c9a06ac8ae¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeµconst reinforce_test5²last_run_timestampËAÚ•:iw°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$82e0e9a0-9662-429a-87e3-e6bdae02709a¹depends_on_disabled_cellsÂ§runtimeÏË]_Îµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ±¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•!6:°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62¹depends_on_disabled_cellsÂ§runtimeÎv,µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$fad02876-efba-46a7-9cb7-43820528779fŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚÔ¼

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•@š°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$fad02876-efba-46a7-9cb7-43820528779f¹depends_on_disabled_cellsÂ§runtimeÎ ª_Áµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$1ce4bc6c-7cde-48e9-8ff1-7281697fd121Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚÔ¹

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•@›ÈÍ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$1ce4bc6c-7cde-48e9-8ff1-7281697fd121¹depends_on_disabled_cellsÂ§runtimeÎ‹µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$024dcd1a-8eaa-4a95-8037-2f578828309cŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements’’¨episodic’ƒ¨elements’’¨discrete’…¦prefixÚÌStateMDP{Float32, CartPoleState{Float32}, Float32, StateMDPTransitionSampler{Float32, CartPoleState{Float32}, var"#1515#1531"{var"#episodic_step#1529"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}}, Vector{Float32}}}, var"#initialize_state#1525"{var"#initialize_state#1514#1526"{var"#1517#1533", var"#1518#1534", var"#1519#1535", var"#1520#1536"}}, var"#failure#1527"{Float32, Float32, Float32, Float32}, var"#164#169"}¨elements–’§actions’…¦prefix§Float32¨elements‘¤more¤type¥Array¬prefix_short ¨objectid°c0c262704df82f17Ù!application/vnd.pluto.tree+object’£ptf’…¦prefixÙÎStateMDPTransitionSampler{Float32, CartPoleState{Float32}, var"#1515#1531"{var"#episodic_step#1529"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}}, Vector{Float32}}}¨elements‘’¤step’¥#1515ªtext/plain¤type¦struct¬prefix_short¹StateMDPTransitionSampler¨objectid¯9ee4213a397680fÙ!application/vnd.pluto.tree+object’°initialize_state’°initialize_stateªtext/plain’¦isterm’§failureªtext/plain’¯is_valid_action’¤#164ªtext/plain’¬action_index’…¦prefix´Dict{Float32, Int64}¨elements‘¤more¤type¤Dict¬prefix_short¤Dict¨objectid°e7e786fa670e0f8aÙ!application/vnd.pluto.tree+object¤type¦struct¬prefix_short¨StateMDP¨objectid°bd86d1ba5aef60c1Ù!application/vnd.pluto.tree+object’ªcontinuous’…¦prefixÚ½ContinuousMDP{Float32, CartPoleState{Float32}, Float32, ContinuousMDPTransitionSampler{Float32, CartPoleState{Float32}, Float32, var"#episodic_step#1529"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}}}, 
var"#initialize_state#1525"{var"#initialize_state#1514#1526"{var"#1517#1533", var"#1518#1534", var"#1519#1535", var"#1520#1536"}}, var"#failure#1527"{Float32, Float32, Float32, Float32}, Returns{Bool}}¨elements”’£ptf’…¦prefixÙºContinuousMDPTransitionSampler{Float32, CartPoleState{Float32}, Float32, var"#episodic_step#1529"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}}}¨elements‘’¤step’episodic_stepªtext/plain¤type¦struct¬prefix_short¾ContinuousMDPTransitionSampler¨objectid°af5b729c12e09a75Ù!application/vnd.pluto.tree+object’°initialize_state’°initialize_stateªtext/plain’¦isterm’§failureªtext/plain’¯is_valid_action’³Returns{Bool}(true)ªtext/plain¤type¦struct¬prefix_shortContinuousMDP¨objectid°218961bfbacff4daÙ!application/vnd.pluto.tree+object¤typeªNamedTuple¨objectid°fb8d4c7a0f6d0ab6Ù!application/vnd.pluto.tree+object’ªcontinuing’ƒ¨elements’’¨discrete’…¦prefixÚÝStateMDP{Float32, CartPoleState{Float32}, Float32, StateMDPTransitionSampler{Float32, CartPoleState{Float32}, var"#1516#1532"{var"#continuing_step#1530"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}, var"#failure#1527"{Float32, Float32, Float32, Float32}}, Vector{Float32}}}, var"#initialize_state#1525"{var"#initialize_state#1514#1526"{var"#1517#1533", var"#1518#1534", var"#1519#1535", var"#1520#1536"}}, Returns{Bool}, var"#164#169"}¨elements–’§actions’…¦prefix§Float32¨elements‘¤more¤type¥Array¬prefix_short ¨objectid°9ea52cd90e32ffacÙ!application/vnd.pluto.tree+object’£ptf’…¦prefixÚStateMDPTransitionSampler{Float32, CartPoleState{Float32}, var"#1516#1532"{var"#continuing_step#1530"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}, var"#failure#1527"{Float32, Float32, Float32, Float32}}, 
Vector{Float32}}}¨elements‘’¤step’¥#1516ªtext/plain¤type¦struct¬prefix_short¹StateMDPTransitionSampler¨objectid°d8cb80a42895dd80Ù!application/vnd.pluto.tree+object’°initialize_state’°initialize_stateªtext/plain’¦isterm’´Returns{Bool}(false)ªtext/plain’¯is_valid_action’¤#164ªtext/plain’¬action_index’…¦prefix´Dict{Float32, Int64}¨elements‘¤more¤type¤Dict¬prefix_short¤Dict¨objectid°cadcdafa3882e5d2Ù!application/vnd.pluto.tree+object¤type¦struct¬prefix_short¨StateMDP¨objectid°be73bc3cc632ca17Ù!application/vnd.pluto.tree+object’ªcontinuous’…¦prefixÚÎContinuousMDP{Float32, CartPoleState{Float32}, Float32, ContinuousMDPTransitionSampler{Float32, CartPoleState{Float32}, Float32, var"#continuing_step#1530"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}, var"#failure#1527"{Float32, Float32, Float32, Float32}}}, var"#initialize_state#1525"{var"#initialize_state#1514#1526"{var"#1517#1533", var"#1518#1534", var"#1519#1535", var"#1520#1536"}}, Returns{Bool}, Returns{Bool}}¨elements”’£ptf’…¦prefixÙôContinuousMDPTransitionSampler{Float32, CartPoleState{Float32}, Float32, var"#continuing_step#1530"{Float32, var"#step#1528"{Float32, Float32, Float32, Float32, CartPoleVehicle{Float32}}, var"#failure#1527"{Float32, Float32, Float32, Float32}}}¨elements‘’¤step’¯continuing_stepªtext/plain¤type¦struct¬prefix_short¾ContinuousMDPTransitionSampler¨objectid°56c89dca7fef04f4Ù!application/vnd.pluto.tree+object’°initialize_state’°initialize_stateªtext/plain’¦isterm’´Returns{Bool}(false)ªtext/plain’¯is_valid_action’³Returns{Bool}(true)ªtext/plain¤type¦struct¬prefix_shortContinuousMDP¨objectid°1aec13aa82c2a793Ù!application/vnd.pluto.tree+object¤typeªNamedTuple¨objectid°a6cba0b5616fa636Ù!application/vnd.pluto.tree+object¤typeªNamedTuple¨objectid°1bb5ff08b1fd3f62¤mimeÙ!application/vnd.pluto.tree+object¬rootassignee³const 
cartpole_mdps²last_run_timestampËAÚ•0Yèû°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$024dcd1a-8eaa-4a95-8037-2f578828309c¹depends_on_disabled_cellsÂ§runtimeÎhiÍµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$e1274f57-75cb-4659-a82f-e5870c5367e2Š¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements•’’…¦prefixÙ,Main.var"workspace#8".CartPoleState{Float32}¨elements›’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’£0.0ªtext/plain’¢Î¸’¥-0.05ªtext/plain’£áº‹’£0.0ªtext/plain’¤Î¸Ì‡’£0.0ªtext/plain’¡t’£0.0ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°ae4ff4dafbab7702Ù!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’ª-0.0228369ªtext/plain’¢Î¸’ª-0.0387834ªtext/plain’£áº‹’¨-1.14189ªtext/plain’¤Î¸Ì‡’¨0.561267ªtext/plain’¡t’¤0.04ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°f4f18354d3b0f1bfÙ!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’ª-0.0456448ªtext/plain’¢Î¸’ª-0.0278905ªtext/plain’£áº‹’ª0.00148022ªtext/plain’¤Î¸Ì‡’ª-0.0162889ªtext/plain’¡t’¤0.08ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°5b583c0a7544def1Ù!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’ª-0.0227183ªtext/plain’¢Î¸’ª-0.0400882ªtext/plain’£áº‹’§1.14486ªtext/plain’¤Î¸Ì‡’©-0.593963ªtext/plain’¡t’¤0.12ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°72477b9cd9f8e47dÙ!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’©0.0459466ªtext/plain’¢Î¸’©-0.075463ªtext/plain’£áº‹’¦2.2884ªtext/plain’¤Î¸Ì‡’¨-1.17575ªtext/plain’¡t’¤0.16ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°1f26dac2911cc407Ù!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’¨0.114661ªtext/plain’¢Î¸’©-0.111478ªtext/plain’£áº‹’§1.14752ªtext/plain’¤Î¸Ì‡’©-0.626576ªtext/plain’¡t’£0.2ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°b6636b0b1a7abc71Ù!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•
’¡x’¨0.183441ªtext/plain’¢Î¸’©-0.148371ªtext/plain’£áº‹’¦2.2914ªtext/plain’¤Î¸Ì‡’¨-1.21882ªtext/plain’¡t’¤0.24ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°73ab8a844681fbc1Ù!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’¨0.252315ªtext/plain’¢Î¸’¨-0.18652ªtext/plain’£áº‹’¦1.1526ªtext/plain’¤Î¸Ì‡’©-0.690589ªtext/plain’¡t’¤0.28ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°c0e09a18786c0171Ù!application/vnd.pluto.tree+object’ ’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’¨0.275668ªtext/plain’¢Î¸’©-0.203737ªtext/plain’£áº‹’©0.0152452ªtext/plain’¤Î¸Ì‡’©-0.171239ªtext/plain’¡t’¤0.32ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°dbe0f46ac8b4faa8Ù!application/vnd.pluto.tree+object¤more’Íè’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’¨-21.9987ªtext/plain’¢Î¸’©-0.229379ªtext/plain’£áº‹’¨-7.55936ªtext/plain’¤Î¸Ì‡’©-0.870074ªtext/plain’¡t’§39.9605ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°3d78166ff8c9568dÙ!application/vnd.pluto.tree+object¤type¥Array¬prefix_short ¨objectid°2eb7920274ac0e71Ù!application/vnd.pluto.tree+object’’…¦prefix¥Int64¨elements›’’¡1ªtext/plain’’¡3ªtext/plain’’¡3ªtext/plain’’¡3ªtext/plain’’¡1ªtext/plain’’¡3ªtext/plain’’¡1ªtext/plain’’¡1ªtext/plain’ ’¡3ªtext/plain¤more’Íè’¡1ªtext/plain¤type¥Array¬prefix_short ¨objectid°c5435db221c3a676Ù!application/vnd.pluto.tree+object’’…¦prefix§Float32¨elements›’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’ ’£1.0ªtext/plain¤more’Íè’£1.0ªtext/plain¤type¥Array¬prefix_short 
¨objectid°eb8910706e4f0966Ù!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’£0.0ªtext/plain’¢Î¸’¥-0.05ªtext/plain’£áº‹’£0.0ªtext/plain’¤Î¸Ì‡’£0.0ªtext/plain’¡t’£0.0ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°ae4ff4dafbab7702Ù!application/vnd.pluto.tree+object’’¤1000ªtext/plain¤type¥Tuple¨objectid°ee1344ec172eb9ce¤mimeÙ!application/vnd.pluto.tree+object¬rootassignee¨const ep²last_run_timestampËAÚ•7û”Í°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$e1274f57-75cb-4659-a82f-e5870c5367e2¹depends_on_disabled_cellsÂ§runtimeÎ«ˆµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$fdd3f4fd-4706-4d6b-b150-6ee6b4b370cbŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ•

Notes on Probability Distributions

In order to prove the policy gradient theorem, we must manipulate terms that are probability distributions over states and visit steps. To build intuition for these distributions, we can visualize how data are averaged using the short corridor example. The following function simulates many episodes in the environment with a stochastic policy that has some probability of moving left regardless of the state. The simulation keeps track of the visit count for each state and visit step. The result of the accumulation is a matrix whose columns contain the number of times each state was visited on every step of an episode across all of the simulated episodes. If we divide each count by the number of episodes simulated, then we have an unbiased estimate of the probability of visiting a state on each step $k$ of an episode: $\Pr \{ S_k = s \mid \pi \}$ such that $\sum_{s \in \mathcal{S}^+} \Pr \{ S_k = s \mid \pi \} = 1$.

Note that this distribution is normalized only when summed over all states including terminal states, a set denoted in episodic problems by $\mathcal{S}^+$. The set $\mathcal{S}$ excludes all terminal states, so if we sum the above probabilities over that set on a given step $k$, we obtain the probability that the episode has NOT terminated by step $k$: $\sum_{s \in \mathcal{S}} \Pr \{ S_k = s \mid \pi \} = \Pr \{ T \gt k \mid \pi \}$, where $T$ denotes the step of termination for a particular episode.
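
As a rough illustration of the counting procedure described above, the following sketch estimates $\Pr \{ S_k = s \mid \pi \}$ for the short corridor. It is not the notebook's own simulation function: the names `corridor_step` and `estimate_visit_probabilities`, the state numbering (1 to 3 nonterminal, 4 terminal), and the step cap `max_steps` are all assumptions made for this example.

```julia
using Random

# Short corridor (Example 13.1): nonterminal states 1:3, terminal state 4.
# Action effects are reversed in state 2; moving left in state 1 stays in place.
function corridor_step(s::Int, go_left::Bool)
    move_left = (s == 2) ? !go_left : go_left
    return move_left ? max(s - 1, 1) : s + 1
end

# Estimate Pr{S_k = s | π} for steps k = 0:max_steps (hypothetical helper).
# Row k + 1 holds step k; columns 1:4 are the states, column 4 being terminal.
function estimate_visit_probabilities(p_left; num_episodes = 100_000,
                                      max_steps = 50, rng = Xoshiro(1))
    counts = zeros(Int, max_steps + 1, 4)
    for _ in 1:num_episodes
        s = 1
        for k in 0:max_steps
            counts[k + 1, s] += 1
            # After termination the episode remains in the terminal state
            s == 4 || (s = corridor_step(s, rand(rng) < p_left))
        end
    end
    return counts ./ num_episodes
end

probs = estimate_visit_probabilities(0.5)
row_sums = sum(probs; dims = 2)               # sum over S⁺, approximately 1 at every step
still_running = sum(probs[:, 1:3]; dims = 2)  # estimate of Pr{T > k | π}
```

Summing a row over all four columns recovers the normalization over $\mathcal{S}^+$, while summing only the first three columns gives the estimate of $\Pr \{ T \gt k \mid \pi \}$.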


Policy Function Output


Example 13.1 Short corridor gridworld



Non-linear Features

This version of REINFORCE uses non-linear features in a fully connected neural network. The number of parameters no longer matches the size of the input feature vector, but a mapping from state to feature vector is still required. One must specify the size of the feature vector, a function that updates the values in a feature vector given a state, and the size of each hidden layer in the neural network. Additional keyword arguments are available to change the construction of the neural network such as adding residual layers.


Alternative Parameterization

If the action space is small enough, then it may be convenient to create a function that simply outputs the preferences for all of the actions at a given state. Let $N_a$ be the number of available actions. We would then consider the vector function $\mathbf{h}(s, \boldsymbol{\theta}) \in \mathbb{R}^{N_a}$, whose components $h_1, h_2, h_3, \dots, h_{N_a}$ are the action preferences at each state. With this style of parameterization, we need only compute state feature vectors $\mathbf{x}(s) \in \mathbb{R}^d$.

Similarly, the policy function would also be a vector function. In order to compute the softmax, we must evaluate the denominator of (13.2) which requires knowing all of the action preferences. Practically, it is only defined as a function on vectors, so consider the following notation to simplify expressions where we use the symbol $\mathbf{\sigma}$ to denote the soft-max vector function.

$$\sigma(\mathbf{x}) = \frac{e^{\mathbf{x}}}{\sum_j{e^{x_j}}} \text{ where we abuse the notation } e^{\mathbf{x}} = \begin{pmatrix} e^{x_1} \\ e^{x_2} \\ \vdots \\ e^{x_n} \end{pmatrix}$$

Using this notation, we can write down the policy function under this new parameterization: $\mathbf{\pi}(s, \boldsymbol{\theta}) = \mathbf{\sigma}(\mathbf{h}(s, \boldsymbol{\theta}))$. What do linear preferences look like with this parameterization? Instead of a parameter vector $\boldsymbol{\theta} \in \mathbb{R}^{d^\prime}$, we have a parameter matrix $\boldsymbol{\theta} \in \mathbb{R}^{d \times N_a}$, and the vector of preferences is the result of a matrix-vector multiplication: $\mathbf{h}(s, \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s) \in \mathbb{R}^{N_a}$. Subscript notation is used to refer to single preference values, so $\mathbf{h}_i$ is the $i$th component of $\mathbf{h}$, i.e. the preference for the $i$th action, equivalent to $h_i$.
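
A minimal sketch of this parameterization with linear preferences is shown below. The function names, the dimensions, and the feature vector `x` are illustrative assumptions and are not taken from the notebook's implementation; subtracting the maximum before exponentiating is a common numerical safeguard that does not change the soft-max output.

```julia
# Numerically stable soft-max: subtracting the maximum leaves the result unchanged
function softmax(h::AbstractVector)
    e = exp.(h .- maximum(h))
    return e ./ sum(e)
end

# Linear preferences h(s, θ) = θᵀ x(s) with a parameter matrix θ of size d × N_a
linear_preferences(θ::AbstractMatrix, x::AbstractVector) = θ' * x

# Policy vector π(s, θ) = σ(h(s, θ)), one probability per action
policy(θ::AbstractMatrix, x::AbstractVector) = softmax(linear_preferences(θ, x))

d, N_a = 4, 3
θ = zeros(d, N_a)             # all preferences equal
x = [1.0, 0.0, 0.5, 0.0]      # hypothetical feature vector x(s)
π_s = policy(θ, x)            # uniform over the N_a actions when θ is all zeros
```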


13.6 Policy Gradient for Continuing Problems

In the continuing case we need to define the average reward per time step as discussed in Section 10.3. In the update procedure, $\delta$ is calculated from the difference between the reward and this long-running average. The value function in this case learns the expected reward difference from the average, which is assumed to have a well-defined expected value under the stationary state distribution for the policy. This shift in the value function does not affect performance, since shifting the value function up or down by a constant does not affect the learned policy. To implement this we need a new learning rate $\alpha_{\overline{R}}$, which controls how quickly the reward average updates. In a sense this replaces $\gamma$, since we no longer discount rewards of future time steps.
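
The following sketch shows one way the differential TD error and the average-reward update could be written. The helper name `differential_td_step` and its argument names are assumptions for illustration; the notebook's own update functions may be organized differently.

```julia
# One differential TD step for the continuing setting (hypothetical helper).
# r_avg is the running average-reward estimate; v and v_next are the value
# estimates for the current and next state.
function differential_td_step(r_avg, r, v, v_next, α_avg)
    δ = r - r_avg + v_next - v        # no discount factor γ appears
    r_avg_new = r_avg + α_avg * δ     # move the average-reward estimate by α_avg * δ
    return δ, r_avg_new
end

δ, r_avg = differential_td_step(0.0, 1.0, 0.0, 0.0, 0.1)
```

The same $\delta$ would then drive the value and policy parameter updates in place of the discounted TD error.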


x position: 0.0

pole angle: 0.0012229534

x velocity: 0.0

pole angular velocity: 0.0


We can define our policy as a normal distribution function over actions for a given state and parameter vector.

$$\pi(a|s, \mathbf{\theta}) \doteq \frac{1}{\sigma(s, \mathbf{\theta}) \sqrt{2\pi}} \exp \left ( - \frac{(a-\mu(s, \mathbf{\theta}))^2}{2\sigma(s, \mathbf{\theta})^2} \right ) \tag{13.19}$$

This policy requires $\mu$ and $\sigma$ to be parameterized by the parameter vector. To make a linear model for both parameters we can use the following formulas:

$$\mu(s, \mathbf{\theta}) \doteq \mathbf{\theta}_\mu ^\top \mathbf{x}_\mu(s) \text{ and } \sigma(s, \mathbf{\theta}) \doteq \exp{( \mathbf{\theta}_\sigma ^ \top \mathbf{x}_\sigma (s))} \tag{13.20}$$

where $\mathbf{x}_\mu(s)$ and $\mathbf{x}_\sigma(s)$ are state feature vectors. With these formulas we can apply the previous algorithms to solve environments with real-valued actions.
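
As a hedged sketch of how (13.19) and (13.20) could be used in practice, the code below evaluates the density, samples an action, and computes the gradients of $\ln \pi(a|s, \mathbf{\theta})$ with respect to $\mathbf{\theta}_\mu$ and $\mathbf{\theta}_\sigma$. The gradient expressions follow from differentiating the log of (13.19); all function names and the example feature vectors are assumptions for illustration.

```julia
using LinearAlgebra: dot
using Random

# Density π(a | s, θ) of (13.19) for a given mean μ and standard deviation σ
gaussian_density(a, μ, σ) = exp(-(a - μ)^2 / (2 * σ^2)) / (σ * sqrt(2 * pi))

# Linear mean and exp-linear standard deviation from (13.20)
policy_mean(θ_μ, x_μ) = dot(θ_μ, x_μ)
policy_std(θ_σ, x_σ) = exp(dot(θ_σ, x_σ))

# Gradients of ln π(a | s, θ) with respect to θ_μ and θ_σ
function log_policy_gradients(a, θ_μ, θ_σ, x_μ, x_σ)
    μ = policy_mean(θ_μ, x_μ)
    σ = policy_std(θ_σ, x_σ)
    grad_μ = ((a - μ) / σ^2) .* x_μ
    grad_σ = ((a - μ)^2 / σ^2 - 1) .* x_σ
    return grad_μ, grad_σ
end

# Draw an action a ~ N(μ, σ²)
sample_action(rng, μ, σ) = μ + σ * randn(rng)

x_μ = [1.0, 0.5]; x_σ = [1.0, -1.0]    # hypothetical feature vectors
θ_μ = [0.2, -0.4]; θ_σ = [0.0, 0.0]    # gives σ = exp(0) = 1
a = sample_action(Xoshiro(0), policy_mean(θ_μ, x_μ), policy_std(θ_σ, x_σ))
grad_μ, grad_σ = log_policy_gradients(a, θ_μ, θ_σ, x_μ, x_σ)
```

Because $\sigma$ is the exponential of a linear function, it stays positive for any parameter values, which is why no constraint on $\mathbf{\theta}_\sigma$ is needed.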


Figure 13.2

Adding a baseline to REINFORCE can make it learn much faster, as illustrated here on the short-corridor gridworld (Example 13.1). Here the approximate state-value function used in the baseline is $\hat v(s, \mathbf{w}) = w$, so the feature vector and the state-value parameter vector each have a single component.



Binary Features

This version of REINFORCE uses binary feature vectors for which one needs to specify the total number of features as well as a function that returns the active features for a given state.
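
To illustrate why binary features only require the number of features and a list of active indices, the sketch below computes linear action preferences by summing rows of the parameter matrix. The helper `binary_preferences` and the toy `get_active_features` mapping are assumptions for this example and do not reproduce the notebook's tile-coding setup.

```julia
# With binary features, θᵀx(s) reduces to summing the rows of θ at the active indices
function binary_preferences(θ::AbstractMatrix, active_features)
    h = zeros(eltype(θ), size(θ, 2))
    for i in active_features
        h .+= @view θ[i, :]
    end
    return h
end

num_features = 6
# Hypothetical mapping: each state s in 1:3 activates exactly two features
get_active_features(s::Int) = (s, s + 3)

θ = reshape(collect(1.0:12.0), num_features, 2)   # 6 features, 2 actions
h = binary_preferences(θ, get_active_features(2))  # sums rows 2 and 5 of θ
```

The cost of evaluating the preferences then scales with the number of active features rather than with the total number of features.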


Probability distributions for short corridor gridworld example with probability of left action selected below


Soft-max notation and gradients

To use policy gradient methods, we must be able to take the gradient of the policy function for every state-action pair. Using the above notation and treating the policy as a vector function, we must know the gradient of the soft-max applied to a vector function at a particular index. Each gradient is a column vector of length $d$, where $d$ is the number of parameters. There is a separate gradient for every index in the vector output, one for each action, for a total of $N_a$. To simplify expressions, $\mathbf{h}(s, \boldsymbol{\theta})$ will be written as $\mathbf{h}$ and $\mathbf{\pi} = \mathbf{\sigma}(\mathbf{h})$. Our desired gradient is with respect to a particular component of $\mathbf{\sigma}(\mathbf{h})$, denoted $\mathbf{\sigma}(\mathbf{h})_a$, where $a$ represents the action index. The gradient itself is the vector of partial derivatives with respect to the parameters $\theta$. The $i$th component of the gradient is $\nabla(f(\theta))_i = \frac{\partial f(\theta)}{\partial \theta_i}$. When we compute the gradient we need all of its components, whose expression is derived below.

$$\begin{align} \nabla \left ( \sigma(\mathbf{h})_a \right )_i &= \frac{\partial}{\partial \theta_i} \left ( \frac{e^{h_a}}{\sum_k{e^{h_k}}} \right ) \\ &=\left ( \frac{1}{{\sum_k{e^{h_k}}}} \right )^2 \left ( e^{h_a} \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{e^{h_k}} - e^{h_a} \sum_k{e^{h_k} \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \\ &=\left ( \frac{1}{{\sum_k{e^{h_k}}}} \right )^2 e^{h_a} \left ( \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{e^{h_k}} - \sum_k{e^{h_k} \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \tag{factoring out exponential term}\\ &=\left ( \frac{e^{h_a}}{{\sum_k{e^{h_k}}}} \right ) \left ( \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{\frac{e^{h_k}}{\sum_l e^{h_l}}} - \sum_k{\frac{e^{h_k}}{\sum_l e^{h_l}} \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \tag{distributing squared fraction}\\ &=\pi_a \left ( \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{\pi_k} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \tag{definition of policy function}\\ &=\pi_a \left ( \frac{\partial{h_a}}{\partial{\theta_i}} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \end{align}$$

The final step results from the fact that the policy function is a probability distribution, so the sum over it is always 1.
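
The final expression can be checked numerically. The sketch below specializes it to linear preferences $\mathbf{h} = \boldsymbol{\theta}^\top \mathbf{x}$ with a parameter matrix $\boldsymbol{\theta} \in \mathbb{R}^{d \times N_a}$, for which $\partial h_k / \partial \theta_{ij}$ equals $x_i$ when $j = k$ and 0 otherwise, and compares the result against a central finite difference. The function names and test values are assumptions for illustration.

```julia
softmax(h) = (e = exp.(h .- maximum(h)); e ./ sum(e))

# Analytic gradient of π_a with respect to θ for linear preferences h = θᵀx.
# From the last line of the derivation: ∂π_a/∂θ[i, j] = π_a * x[i] * ((j == a) - π_j).
function softmax_policy_gradient(θ, x, a)
    π_s = softmax(θ' * x)
    onehot = [j == a ? 1.0 : 0.0 for j in eachindex(π_s)]
    return π_s[a] .* (x * (onehot .- π_s)')
end

# Central finite-difference approximation of the same gradient
function finite_difference_gradient(θ, x, a; step = 1e-6)
    g = zeros(size(θ))
    for idx in eachindex(θ)
        θ_plus = copy(θ);  θ_plus[idx] += step
        θ_minus = copy(θ); θ_minus[idx] -= step
        g[idx] = (softmax(θ_plus' * x)[a] - softmax(θ_minus' * x)[a]) / (2 * step)
    end
    return g
end

θ = [0.1 -0.3 0.2; 0.4 0.0 -0.1]   # d = 2 features, N_a = 3 actions
x = [1.0, 2.0]
analytic = softmax_policy_gradient(θ, x, 2)
numeric = finite_difference_gradient(θ, x, 2)
max_error = maximum(abs.(analytic .- numeric))   # expected to be very small
```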


Normal Distribution Plot with

$\mu$: 0.0

$\sigma$: 1.0


Soft-max Implementation

ents•’¡x’©-0.206936ªtext/plain’¢Î¸’¨0.167051ªtext/plain’£áº‹’§1.12742ªtext/plain’¤Î¸Ì‡’©-0.407285ªtext/plain’¡t’¤0.24ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°2075d559f4fb1949Ù!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’©-0.184727ªtext/plain’¢Î¸’¨0.162689ªtext/plain’£áº‹’ª-0.0169652ªtext/plain’¤Î¸Ì‡’¨0.189115ªtext/plain’¡t’¤0.28ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°172373efb3067c18Ù!application/vnd.pluto.tree+object’ ’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’©-0.162637ªtext/plain’¢Î¸’¨0.159659ªtext/plain’£áº‹’§1.12149ªtext/plain’¤Î¸Ì‡’©-0.340778ªtext/plain’¡t’¤0.32ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°e7d63656f3863dbdÙ!application/vnd.pluto.tree+object¤more’Ì’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’¨-49.8897ªtext/plain’¢Î¸’¨0.197691ªtext/plain’£áº‹’¨-13.1155ªtext/plain’¤Î¸Ì‡’¨0.275797ªtext/plain’¡t’¤5.12ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°843cef00b85f04b3Ù!application/vnd.pluto.tree+object¤type¥Array¬prefix_short ¨objectid°7e1165de0d244ff3Ù!application/vnd.pluto.tree+object’’…¦prefix¥Int64¨elements›’’¡1ªtext/plain’’¡2ªtext/plain’’¡1ªtext/plain’’¡3ªtext/plain’’¡3ªtext/plain’’¡3ªtext/plain’’¡1ªtext/plain’’¡3ªtext/plain’ ’¡3ªtext/plain¤more’Ì’¡3ªtext/plain¤type¥Array¬prefix_short ¨objectid°a5d9f6c7dcb5726aÙ!application/vnd.pluto.tree+object’’…¦prefix§Float32¨elements›’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’’£1.0ªtext/plain’ ’£1.0ªtext/plain¤more’Ì’£1.0ªtext/plain¤type¥Array¬prefix_short 
¨objectid°132d26ffb75bd388Ù!application/vnd.pluto.tree+object’’…¦prefix¶CartPoleState{Float32}¨elements•’¡x’¨-50.3916ªtext/plain’¢Î¸’¨0.198355ªtext/plain’£áº‹’¨-11.9784ªtext/plain’¤Î¸Ì‡’¨-0.24256ªtext/plain’¡t’¤5.16ªtext/plain¤type¦struct¬prefix_shortCartPoleState¨objectid°72ca98acf18b1cfbÙ!application/vnd.pluto.tree+object’’£129ªtext/plain¤type¥Tuple¨objectid°2d80665ecf07fdf6¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÙ,const cartpole_fcann_continuing_test_episode²last_run_timestampËAÚ•3 ¢°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$64b38d1f-ecf9-4843-89a1-4c8953048265¹depends_on_disabled_cellsÂ§runtimeÎÁÇµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$d963ff6d-f1b6-4799-aa0e-1ae100310d84Š¦queuedÂ¤logs‘ˆ¤lineÿ£msg’ÚB]Checking for cuda toolkit versions No cuda toolkit appears to be installed. If this sytem has an NVIDIA GPU, install the cuda toolkit and add nvcc to the system path to use the GPU backend. Available backends are: CPU Backend is set to CPU Num Grads Func Grads 0.008702 0.008294 -0.211358 -0.208870 -0.271201 -0.272823 -0.794768 -0.795432 1.321912 1.321019 0.010610 0.009960 0.003338 0.004296 -0.009298 -0.010741 0.107527 0.106389 -0.009775 -0.008679 0.024915 0.026168 0.010610 0.010030 0.008583 0.005909 0.058174 0.055782 -0.127435 -0.126989 0.002742 0.001004 -0.070930 -0.071241 -0.005364 -0.006470 0.013947 0.012872 0.053763 0.054418 0.311136 0.311052 0.015497 0.014644 -0.219107 -0.219193 -0.973701 -0.974077 0.538349 0.538771 0.105143 0.105787 0.056386 0.056097 -0.072718 -0.072026 -0.361681 -0.361433 0.279307 0.280814 0.145793 0.143951 0.048995 0.049274 -0.100732 -0.102160 -0.521660 -0.520240 0.269771 0.269774 0.070095 0.070005 0.011086 0.010243 -0.044942 -0.044660 -0.252724 -0.253604 0.093222 0.095847 -0.124216 -0.123114 0.012040 0.010086 0.074387 0.072455 0.361204 0.359321 -0.185847 -0.187581 0.121474 0.122355 0.006318 0.005624 -0.085115 -0.086171 -0.395298 -0.393476 0.196457 0.197631 0.226021 0.226000 0.903606 0.904000 
-0.032306 -0.032251 -0.084519 -0.082810 0.115752 0.115757 0.007033 0.004606 -0.187397 -0.188008 -0.291586 -0.292546 -0.107527 -0.107447 -0.564456 -0.563941 -0.046372 -0.045384 0.419974 0.420656 Relative differences for method are 0.0018749096. Should be small (1e-9) Num Grads Func Grads 0.226974 0.228291 -0.147700 -0.150010 -0.397086 -0.396523 -1.448035 -1.449270 1.557469 1.558208 0.043392 0.041430 0.014067 0.013863 0.005603 0.007342 0.279069 0.281456 -0.018597 -0.020660 0.065207 0.068021 0.017762 0.019240 -0.002742 -0.001772 0.151992 0.152121 -0.370026 -0.369779 0.032902 0.033341 -0.059605 -0.061379 -0.015259 -0.016710 0.090718 0.089624 0.031114 0.032405 0.222206 0.224066 0.015378 0.014975 -0.239253 -0.239561 -0.717878 -0.717608 0.439525 0.440638 0.070691 0.070417 0.032902 0.032117 -0.102282 -0.104047 -0.301361 -0.303399 0.291467 0.290279 0.103831 0.102325 0.028253 0.028958 -0.135541 -0.133932 -0.377774 -0.378357 0.208259 0.210493 0.059247 0.058751 0.004411 0.005936 -0.066400 -0.066114 -0.147343 -0.145967 0.050664 0.048794 -0.101089 -0.103440 -0.002027 -0.000536 0.099182 0.098851 0.195622 0.195177 -0.158787 -0.159284 0.091553 0.091211 0.005245 0.005407 -0.094295 -0.097222 -0.277638 -0.277581 0.146151 0.146036 0.285268 0.286000 0.882745 0.884000 0.249863 0.249797 0.197411 0.198590 0.402927 0.403551 0.336289 0.335512 -0.161886 -0.161895 -0.155091 -0.155041 -0.186682 -0.186942 -0.842452 -0.842551 0.088453 0.088449 0.687838 0.688715 Relative differences for method are 0.0016782036. 
Should be small (1e-9) Num Grads Func Grads 0.028610 0.032211 -0.963449 -0.962050 -1.136065 -1.137463 -3.270149 -3.271130 5.587101 5.584826 0.077486 0.076260 -0.095367 -0.098550 -0.215054 -0.215321 0.079632 0.081413 0.739098 0.737280 0.122309 0.123499 0.018835 0.013728 -0.069857 -0.072780 0.049591 0.052830 -0.226498 -0.222941 0.063896 0.062422 -0.396490 -0.397996 -0.261545 -0.266012 -0.387192 -0.387975 1.085997 1.086820 1.346350 1.346262 -0.049829 -0.046599 -0.909328 -0.913211 -4.127741 -4.128638 2.273083 2.274081 0.267029 0.266290 0.149250 0.147556 -0.131607 -0.133167 -0.934601 -0.935876 1.001120 0.998651 0.482559 0.481657 0.104427 0.105235 -0.309229 -0.307927 -1.800060 -1.797327 0.991821 0.989244 0.315428 0.320147 0.010014 0.010303 -0.196934 -0.198439 -1.142979 -1.142353 0.427246 0.429229 -0.661611 -0.662484 0.063658 0.064201 0.410557 0.409356 1.935720 1.932071 -0.965834 -0.966196 0.538349 0.538219 -0.018835 -0.019689 -0.362873 -0.364258 -1.701593 -1.706390 0.842810 0.843614 0.613928 0.614286 3.943205 3.944720 -0.113726 -0.114396 -0.466585 -0.463806 0.290155 0.290062 -0.287771 -0.287623 -0.538588 -0.538925 -1.173973 -1.177111 -0.309706 -0.309572 -2.575636 -2.575907 -0.058651 -0.058277 2.411604 2.410151 Relative differences for method are 0.00076386787. 
Should be small (1e-9) Num Grads Func Grads 6.711959 6.693584 -68.183899 -68.190453 38.049698 38.039562 -54.199215 -54.186806 97.707741 97.718063 10.526656 10.566038 -70.594788 -70.581596 38.515091 38.487904 -30.736921 -30.749435 89.469902 89.493088 3.854751 3.846486 -25.175093 -25.155622 10.587691 10.596359 -19.964218 -19.955395 28.196333 28.179987 10.213851 10.200496 -82.588188 -82.617134 52.881237 52.895470 -65.713882 -65.718040 130.901337 130.928772 6.317138 6.333028 -86.460106 -86.465569 -36.693573 -36.688496 -36.502838 -36.511711 44.736858 44.745163 -4.116058 -4.114252 65.225601 65.216415 30.076979 30.074966 27.200697 27.205814 -20.799635 -20.794403 -2.260208 -2.282639 31.826017 31.828533 16.363144 16.344271 13.311385 13.307327 -16.103745 -16.111374 1.338959 1.361882 -24.999617 -24.977449 -10.320662 -10.327084 -10.065078 -10.083112 5.676269 5.666134 -5.649566 -5.641539 85.954659 85.937538 37.216187 37.207138 34.606934 34.607189 -39.308548 -39.303772 2.691269 2.685248 -38.249969 -38.242092 -16.252518 -16.261158 -16.225815 -16.215649 16.819000 16.811707 -1.462936 -1.462608 30.584333 30.600145 2.822876 2.822504 82.855217 82.856033 0.379562 0.378145 -5.743026 -5.740044 -0.720978 -0.720155 -17.814636 -17.773310 -0.083923 -0.083204 -20.568846 -20.545242 0.076294 0.076714 -57.123180 -57.119926 1.134872 1.135409 3.015518 3.019630 -2.380371 -2.382002 3.852844 3.869895 1.008987 1.007515 -25.325773 -25.326454 -2.027512 -2.027638 -69.564819 -69.550819 -0.858307 -0.857046 40.840145 40.822083 1.695633 1.695448 112.745277 112.731529 Relative differences for method are 0.00016244102. 
Should be small (1e-9) Num Grads Func Grads 0.370979 0.371239 -3.942966 -3.939380 1.880169 1.885892 -4.216194 -4.219157 6.735086 6.733737 0.416279 0.415627 -2.906561 -2.902907 1.408815 1.409868 -1.617193 -1.616733 4.136801 4.136244 0.260353 0.259042 -1.281023 -1.276769 0.546217 0.549732 -1.114845 -1.114833 1.324415 1.325205 0.393391 0.392242 -3.700971 -3.701753 2.170086 2.169993 -3.574133 -3.574246 6.063461 6.066147 0.670671 0.670281 -4.621267 -4.624799 -2.263308 -2.262765 -2.675056 -2.679310 3.272533 3.272666 -0.095606 -0.094364 2.565861 2.564243 1.099110 1.098414 0.891924 0.892591 -0.654697 -0.656995 0.049114 0.045307 0.544071 0.541945 0.272989 0.275064 -0.040054 -0.042496 -0.256300 -0.260528 0.154018 0.157435 -1.469135 -1.473261 -0.620604 -0.615052 -0.800848 -0.800142 0.550508 0.553296 -0.478268 -0.478678 4.086733 4.090587 1.882315 1.881419 2.112389 2.115821 -2.469778 -2.472500 0.278950 0.274027 -2.095699 -2.095469 -0.984669 -0.983863 -1.176834 -1.176937 1.266718 1.266336 -0.509024 -0.509570 2.299547 2.298040 0.456572 0.455321 3.930330 3.928681 0.109434 0.109184 -0.335455 -0.336301 -0.148058 -0.148257 -0.793695 -0.796763 -0.050306 -0.050607 -0.918627 -0.920102 0.015974 0.014915 -2.103090 -2.102135 0.313044 0.312434 -0.062227 -0.059504 -0.477552 -0.477148 -0.045061 -0.046998 0.325680 0.325307 -1.706362 -1.710228 -0.344992 -0.345803 -3.120184 -3.121110 -0.256777 -0.255635 2.459288 2.462493 0.304222 0.304774 5.084037 5.084333 Relative differences for method are 0.000524794. 
Should be small (1e-9) Num Grads Func Grads 0.312567 0.311857 0.060797 0.061075 -0.182033 -0.180999 -0.234008 -0.232954 -0.136733 -0.139338 -0.059962 -0.057574 -0.219822 -0.221368 0.201702 0.202568 0.025988 0.025125 0.081062 0.083038 -0.260592 -0.261443 0.021577 0.023480 -0.099421 -0.099594 -0.292659 -0.291759 -0.164032 -0.163031 -0.078917 -0.080009 -0.034451 -0.033984 0.128984 0.130029 0.008345 0.010039 0.174046 0.174043 -0.094056 -0.093796 -0.013828 -0.013256 0.057578 0.057035 0.092030 0.092475 -0.010848 -0.012292 0.020504 0.020589 0.103712 0.105378 -0.113130 -0.113505 0.017166 0.019778 -0.064254 -0.062586 0.159979 0.164053 -0.026703 -0.027487 0.051618 0.052212 0.198841 0.202832 0.144124 0.142701 0.209331 0.209704 -0.006795 -0.008453 0.004172 0.003124 -0.060678 -0.063686 -0.091195 -0.091876 -0.159264 -0.161411 -0.062108 -0.065715 -0.352025 -0.351371 -0.360966 -0.360692 0.148058 0.148704 0.161529 0.160150 -0.209570 -0.209143 -0.133514 -0.136240 0.100493 0.101331 0.019431 0.018995 0.087619 0.089369 0.033379 0.037169 0.201225 0.199820 0.211835 0.213000 -0.085711 -0.088189 -0.093460 -0.092003 0.123620 0.121501 0.061631 0.062936 -0.060201 -0.058970 0.001669 0.004213 0.108123 0.106151 -0.020027 -0.020401 0.338316 0.336111 0.303030 0.302518 -0.206590 -0.210287 -0.223994 -0.224138 0.186324 0.186575 0.137210 0.137436 -0.026226 -0.025840 -0.056744 -0.056368 0.064492 0.064329 0.001669 0.000783 0.207186 0.205043 0.207663 0.206123 -0.127316 -0.125652 -0.126839 -0.126457 0.117660 0.117819 0.068069 0.068519 -0.024915 -0.023798 -0.009179 -0.009422 -0.173569 -0.175902 -0.274539 -0.273241 -0.158906 -0.157542 -0.372171 -0.370569 -0.087976 -0.089809 -0.123262 -0.121120 -0.140309 -0.139166 0.042439 0.040605 0.222564 0.222288 -0.209928 -0.209915 0.201821 0.201518 0.084519 0.082732 0.426531 0.426795 0.420213 0.422846 -0.174165 -0.176922 -0.193238 -0.195676 0.250816 0.248087 0.196815 0.194071 -0.122786 -0.123045 -0.048637 -0.050270 0.039816 0.038292 -0.198364 -0.200369 0.256062 0.254771 
0.003695 0.001437 -0.253201 -0.251856 -0.325084 -0.325386 0.086784 0.085402 0.296593 0.296466 0.129580 0.127823 -0.335813 -0.337562 0.062466 0.063843 0.195265 0.196389 -0.074983 -0.075120 0.080705 0.082724 0.173926 0.172865 0.202298 0.202088 -0.005603 -0.004043 -0.108004 -0.108222 -0.144482 -0.143103 0.176907 0.174852 0.063300 0.065570 0.256062 0.261189 -0.117183 -0.118304 0.119209 0.120075 0.228763 0.228869 0.277638 0.279610 -0.003099 -0.006243 -0.204802 -0.203730 -0.191450 -0.194106 0.292301 0.293116 -0.031471 -0.030124 0.141859 0.142707 -0.208497 -0.209867 -0.026941 -0.026536 0.196695 0.196141 0.244260 0.246189 -0.070572 -0.072115 -0.244141 -0.243417 -0.098586 -0.096951 0.258088 0.258783 0.208735 0.208590 0.164986 0.164989 0.353813 0.352502 0.459433 0.460969 -0.103235 -0.103246 -0.096798 -0.096121 0.235319 0.234387 0.061750 0.064499 -0.175953 -0.176755 0.091076 0.092398 -0.173450 -0.172762 0.401378 0.405599 0.116706 0.115096 -0.196934 -0.199266 -0.090599 -0.094570 -0.127792 -0.126921 -0.134468 -0.134261 0.315547 0.316991 0.520587 0.518939 -0.836611 -0.836996 0.042558 0.041820 -0.083089 -0.081949 -0.023961 -0.023933 0.035644 0.035489 0.026226 0.026675 0.028372 0.029101 0.048518 0.047496 -0.060201 -0.061848 -0.109315 -0.110237 0.200987 0.201105 0.235200 0.233015 0.101566 0.100780 0.259161 0.259513 -0.268936 -0.271424 -0.100374 -0.100878 0.051141 0.051580 -0.361204 -0.359039 0.009775 0.008971 -0.037432 -0.038072 1.041055 1.039986 -0.142574 -0.143103 0.441074 0.442194 0.192046 0.191359 -0.230789 -0.231685 -0.102997 -0.102431 -0.130296 -0.130364 -0.191212 -0.192988 0.326395 0.328803 0.539780 0.539642 -0.720501 -0.721265 -0.059605 -0.061262 -0.048280 -0.047782 -0.104427 -0.104571 0.070691 0.070518 0.013947 0.014226 -0.003576 -0.003398 0.090122 0.089000 -0.016570 -0.017456 -0.011921 -0.013042 -0.263453 -0.266497 -0.208616 -0.207550 0.619888 0.619279 0.316024 0.315512 -0.364423 -0.364230 -0.186801 -0.185087 -0.189185 -0.189708 -0.403643 -0.404409 0.485420 0.485202 
0.818610 0.818659 -1.041293 -1.041815 -0.081539 -0.080142 0.436425 0.439344 0.216126 0.218905 -0.286102 -0.287349 -0.126243 -0.126274 -0.111699 -0.111244 -0.265479 -0.263417 0.304103 0.303301 0.483394 0.483243 -0.408053 -0.409209 0.062704 0.062238 -0.412106 -0.412213 -0.278354 -0.277767 0.311017 0.310491 0.159025 0.162666 0.103116 0.104658 0.409007 0.405954 -0.302196 -0.303278 -0.511885 -0.511293 0.366926 0.366982 -0.054240 -0.053122 0.166178 0.166554 0.028968 0.028728 -0.083208 -0.085248 -0.049710 -0.049947 -0.041842 -0.040493 -0.067949 -0.065471 0.110388 0.110542 0.176191 0.177371 -0.276923 -0.277232 -0.030756 -0.028911 0.551701 0.557508 0.312448 0.313525 -0.409842 -0.413396 -0.157475 -0.157533 -0.108600 -0.107965 -0.349045 -0.346642 0.376224 0.377657 0.558853 0.558513 -0.209928 -0.209954 0.193953 0.192894 0.428915 0.428110 0.445962 0.446616 -0.499606 -0.500053 -0.192881 -0.190013 -0.024080 -0.024177 -0.561595 -0.561750 0.243306 0.243692 0.326037 0.324034 0.806570 0.804877 1.450896 1.450685 -0.115156 -0.118111 1.001477 1.000420 0.400543 0.401283 1.772046 1.771781 0.511885 0.512357 -1.435280 -1.434563 -0.493884 -0.491516 0.315666 0.315025 -0.525355 -0.524058 0.869632 0.869365 0.763893 0.764892 -2.297044 -2.297577 -0.349283 -0.350460 -0.802994 -0.803257 -0.894308 -0.894548 -0.591516 -0.591840 0.586987 0.585220 -0.531197 -0.531900 0.256062 0.254967 0.478744 0.479863 0.458121 0.457866 Relative differences for method are 0.002172598. 
Should be small (1e-9) [32m[1mBeginning training with the following parameters:[22m[39m input size = 1, hidden layers = [1], output size = 1, batch size = 1024, num epochs = 150, training alpha = 0.002, decay rate = 0.1, L2 Reg Constant = 0.0, max norm reg constant = Inf, dropout rate = 0.0, residual layer size = 0 ------------------------------------------------------------------- Initial cost is 4.5569577 ------------------------------------------------------------------- [32m[1mCompleted training on CPU with the following parameters: [22m[39m input size = 1, hidden layers = [1], output size = 1, batch size = 1024, num epochs = 150, training alpha = 0.002, decay rate = 0.1, L2 Reg Constant = 0.0, max norm reg constant = Inf, dropout rate = 0.0, residual layer size = 0 [31m[1mTraining Results: Cost reduced from 4.7715697to 1.9026493 after 1 seconds and 150 epochs[22m[39m Median time of 48.99338819086552 ns per example Total operations per example = 32.0 foward prop ops + 9.00390625 backprop ops + 0.03515625 update ops = 41.0390625 Approximate GFLOPS = 0.8376449157444136 ------------------------------------------------------------------- Completed benchmark with 1 input [1] hidden 1 output, and 1024 batchSize on a AMD EPYC 7763 64-Core Processor Time to train on CPU took 0.8029501438140869 seconds for 150 epochs Average time of 52.275399987896286 ns per example Total operations per example = 32.0 foward prop ops + 9.00390625 backprop ops + 0.03515625 update ops = 41.0390625 Approximate GFLOPS = 0.7850549686755544 Backend is set to CPU Num Grads Func Grads -0.167504 -0.167347 0.543714 0.543676 -0.144422 -0.144394 -0.008911 -0.008839 -0.272021 -0.272223 0.014335 0.014266 -0.214294 -0.214307 0.017107 0.017179 0.016794 0.016594 0.023171 0.023050 0.037402 0.037344 -0.182405 -0.182453 0.002950 0.002887 0.037596 0.037637 0.062510 0.062431 0.026956 0.027002 -0.002280 -0.002004 -0.009179 -0.009318 0.009418 0.009481 0.028744 0.028698 0.206247 0.206354 -0.404969 -0.404844 
-0.597507 -0.597381 0.240430 0.240348 0.100955 0.101023 -0.034809 -0.034889 0.057399 0.057441 0.016525 0.016654 -0.005126 -0.005139 -0.028357 -0.028296 0.161871 0.161984 -0.274017 -0.274126 -0.309244 -0.309130 0.141218 0.141186 0.066102 0.066191 0.082031 0.082081 -0.197053 -0.197146 -0.400871 -0.400995 0.162691 0.162752 0.036269 0.036158 0.120744 0.120639 -0.170439 -0.170605 -0.198990 -0.198996 0.092894 0.092803 0.036508 0.036529 0.027657 0.027803 -0.045031 -0.045067 0.024885 0.024758 -0.005856 -0.005797 0.021622 0.021706 0.999942 1.000000 0.000000 0.000000 -0.260040 -0.260063 0.000000 0.000000 -0.155240 -0.155227 0.000000 0.000000 -0.826180 -0.826232 0.000000 0.000000 -0.518218 -0.518207 0.000000 0.000000 0.204593 0.204622 0.000000 0.000000 Relative differences for method are 0.0001881188. Should be small (1e-9) Checking for cuda toolkit versions No cuda toolkit appears to be installed. If this sytem has an NVIDIA GPU, install the cuda toolkit and add nvcc to the system path to use the GPU backend. 
Available backends are: CPU ªtext/plain§cell_idÙ$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦kwargs¢id´PlutoRunner_d1acb81e¤fileÙP/home/runner/.julia/packages/Pluto/5ete1/src/runner/PlutoRunner/src/io/stdout.jl¥group¦stdout¥level®LogLevel(-555)§runningÂ¦output†¤bodyÚ ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•k÷Y°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$d963ff6d-f1b6-4799-aa0e-1ae100310d84¹depends_on_disabled_cellsÂ§runtimeÏ oðµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$b16899b7-36bf-4a5e-8e2f-4496b8450687Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ6squashed_gaussian_pdf (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•!€»M°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$b16899b7-36bf-4a5e-8e2f-4496b8450687¹depends_on_disabled_cellsÂ§runtimeÎïŒµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$10cdd16e-a337-4421-a7a0-6de4e4b60c0fŠ¦queuedÂ¤logs§runningÂ¦output†¤body¿BinaryGaussianEligibilityVector¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•!‘Êâ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$10cdd16e-a337-4421-a7a0-6de4e4b60c0f¹depends_on_disabled_cellsÂ§runtimeÎM>Ýµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$a8b40b8f-051a-4e6f-a079-ece4f32873deŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ>create_actor_critic_params_UI (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•?Í§~°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$a8b40b8f-051a-4e6f-a079-ece4f32873de¹depends_on_disabled_cellsÂ§runtimeÎ:2¶µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$5eebf3da-bfe7-46eb-81a3-f87f334ee270Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙDcreate_actor_critic_fcann_params_UI (generic function with 1 
method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•35ÙQ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$5eebf3da-bfe7-46eb-81a3-f87f334ee270¹depends_on_disabled_cellsÂ§runtimeÎFUöµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$9bce6fdb-2cbc-4758-9a8b-794e490c973dŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ/1¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•:„×°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$9bce6fdb-2cbc-4758-9a8b-794e490c973d¹depends_on_disabled_cellsÂ§runtimeÎääµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbfŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙEcreate_continuous_action_mountaincar (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•=’³_°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbf¹depends_on_disabled_cellsÂ§runtimeÎ%õ†µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6Š¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements•’’…¦prefix·Tuple{Float32, 
Float32}¨elements›’’ƒ¨elements’’’©-0.531205ªtext/plain’’£0.0ªtext/plain¤type¥Tuple¨objectid°82cf4db9a775ec1dÙ!application/vnd.pluto.tree+object’’ƒ¨elements’’’©-0.530148ªtext/plain’’ª0.00105704ªtext/plain¤type¥Tuple¨objectid°7ce0ca28f7354422Ù!application/vnd.pluto.tree+object’’ƒ¨elements’’’©-0.528042ªtext/plain’’ª0.00210616ªtext/plain¤type¥Tuple¨objectid°e1eed8cc1ecb8f34Ù!application/vnd.pluto.tree+object’’ƒ¨elements’’’©-0.524902ªtext/plain’’ª0.00313948ªtext/plain¤type¥Tuple¨objectid°f1a26da07d248f21Ù!application/vnd.pluto.tree+object’’ƒ¨elements’’’©-0.520753ªtext/plain’’ª0.00414925ªtext/plain¤type¥Tuple¨objectid°9493e20af903dd19Ù!application/vnd.pluto.tree+object’’ƒ¨elements’’’©-0.515625ªtext/plain’’ª0.00512791ªtext/plain¤type¥Tuple¨objectid°104e7d3122527a14Ù!application/vnd.pluto.tree+object’’ƒ¨elements’’’©-0.509557ªtext/plain’’ª0.00606812ªtext/plain¤type¥Tuple¨objectid°878f21ed8c91c756Ù!application/vnd.pluto.tree+object’’ƒ¨elements’’’©-0.502594ªtext/plain’’ª0.00696283ªtext/plain¤type¥Tuple¨objectid°5b9ea3da43f891d3Ù!application/vnd.pluto.tree+object’ ’ƒ¨elements’’’©-0.494789ªtext/plain’’©0.0078054ªtext/plain¤type¥Tuple¨objectid°82934ae51d4b1cc4Ù!application/vnd.pluto.tree+object¤more’ÌŒ’ƒ¨elements’’’¨0.495647ªtext/plain’’©0.0136242ªtext/plain¤type¥Tuple¨objectid°85c3ada8880376aeÙ!application/vnd.pluto.tree+object¤type¥Array¬prefix_short ¨objectid°7b308864cd041b41Ù!application/vnd.pluto.tree+object’’…¦prefix¥Int64¨elements›’’¡3ªtext/plain’’¡3ªtext/plain’’¡3ªtext/plain’’¡3ªtext/plain’’¡3ªtext/plain’’¡3ªtext/plain’’¡3ªtext/plain’’¡3ªtext/plain’ ’¡3ªtext/plain¤more’ÌŒ’¡2ªtext/plain¤type¥Array¬prefix_short ¨objectid°9a11c7e08ac6a3fcÙ!application/vnd.pluto.tree+object’’…¦prefix§Float32¨elements›’’¤-1.0ªtext/plain’’¤-1.0ªtext/plain’’¤-1.0ªtext/plain’’¤-1.0ªtext/plain’’¤-1.0ªtext/plain’’¤-1.0ªtext/plain’’¤-1.0ªtext/plain’’¤-1.0ªtext/plain’ ’¤-1.0ªtext/plain¤more’ÌŒ’¤-1.0ªtext/plain¤type¥Array¬prefix_short 
¨objectid¯8d82f25e85371d1Ù!application/vnd.pluto.tree+object’’ƒ¨elements’’’£0.5ªtext/plain’’©0.0134148ªtext/plain¤type¥Tuple¨objectid°32d698e8ae81b727Ù!application/vnd.pluto.tree+object’’£140ªtext/plain¤type¥Tuple¨objectid°498dfdf2d9f4ea29¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÙ)const mountaincar_continuing_test_episode²last_run_timestampËAÚ•=kÂ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6¹depends_on_disabled_cellsÂ§runtimeÎåL>µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$7afb6fb0-248a-4518-b94f-9876f81eca64Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙDcorridor_continuing_parameter_study (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•+Ïä~°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$7afb6fb0-248a-4518-b94f-9876f81eca64¹depends_on_disabled_cellsÂ§runtimeÎ8?µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$37a273b6-b104-46f0-987a-401dc1c97327Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ©¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ• ñ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$37a273b6-b104-46f0-987a-401dc1c97327¹depends_on_disabled_cellsÂ§runtimeÎöÐUµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$7a6f3f79-ea06-4994-8b62-90b2056e4034Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ@make_squashed_gaussian_sampler (generic function with 2 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•#>Úì°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$7a6f3f79-ea06-4994-8b62-90b2056e4034¹depends_on_disabled_cellsÂ§runtimeÎd¼‰µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$f2ed56c9-c2b7-42cb-a083-e12aeaa126efŠ¦queuedÂ¤logs§runningÂ¦output†¤body…¦prefix§Float32¨elements’’’¨0.423691ªtext/plain’’¨0.576308ªtext/plain¤type¥Array¬prefix_short 
¨objectid°c8377fe263c620b2¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÀ²last_run_timestampËAÚ•$Þã°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$f2ed56c9-c2b7-42cb-a083-e12aeaa126ef¹depends_on_disabled_cellsÂ§runtimeÎò—Mµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$cbea5840-49d2-4e91-be9c-f5f15666d78aŠ¦queuedÂ¤logs§runningÂ¦output†¤body…¦prefix§Float32¨elements’’’¨0.389351ªtext/plain’’¨0.610649ªtext/plain¤type¥Array¬prefix_short ¨objectid°fe5c659bbdbdad22¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÀ²last_run_timestampËAÚ•%OÕþ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$cbea5840-49d2-4e91-be9c-f5f15666d78a¹depends_on_disabled_cellsÂ§runtimeÎ³zúµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$1f041cb3-618c-4380-a1ec-d7bbe4a80f62Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙMactor_critic_binary_episodic_parameter_study (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•+/uÀ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$1f041cb3-618c-4380-a1ec-d7bbe4a80f62¹depends_on_disabled_cellsÂ§runtimeÎŽµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$96506201-6b66-49e6-8179-06952e2394e1Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ>setup_binary_policy_arguments (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•Ö¹9°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$96506201-6b66-49e6-8179-06952e2394e1¹depends_on_disabled_cellsÂ§runtimeÎ‰µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$76b03e72-da04-4530-8534-6d6468268cbdŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ–

$$\sum_{s \in \mathcal{S}} \sum_{k = 0}^\infty \Pr \{ s_0 \rightarrow s, k, \pi \} = \sum_{k = 0}^\infty \left [ 1 - \Pr \{s_0 \rightarrow S_T, k, \pi \} \right ] = \eta$$

where $\eta$ is the average length of an episode. The quantity inside the brackets is the probability that an episode has not terminated by step $k$; it follows from the fact that the sum over states in $\mathcal{S}$ covers only the non-terminal states. If the sum were over $\mathcal{S}^+$ instead, it would be infinite, since the inner sum over states would equal 1 for every $k$. Normally, to calculate $\eta$ we would take the expected value using the probability of an episode lasting exactly $k$ steps, but the probability we have access to here is the cumulative distribution function, not the probability mass function. That is, $\Pr \{s_0 \rightarrow S_T, k, \pi \} = \sum_{t = 0}^k \Pr \{ T = t \} = \Pr \{ T \leq k \}$, where $T$ is the length of an episode. Using these probabilities, we can write $\eta = \mathbb{E}_\pi [T] = \sum_{k = 0}^\infty k \Pr \{ T = k \} = \Pr \{T = 1 \} + 2 \Pr \{T = 2 \} + \cdots$.

Earlier we had the expression $\eta = \sum_{k = 0}^\infty \left [ 1 - \Pr \{s_0 \rightarrow S_T, k, \pi \} \right ] = \sum_{k = 0}^\infty \Pr \{T \gt k \} = \sum_{k = 0}^\infty \sum_{t = k + 1}^\infty \Pr \{T = t \}$.

We can stack up the terms of this double sum, one row per value of $k$, to see that it is equivalent to the expected value calculation from before:

$$\begin{flalign} \Pr \{ T = 1 \} + \Pr \{ T = 2 \} + &\Pr \{ T = 3 \} +\cdots \\ \Pr \{ T = 2 \} + &\Pr \{ T = 3 \} + \cdots \\ &\Pr \{ T = 3 \} + \cdots \\ \vdots \end{flalign}$$

If we count terms down each column of this stack, we see that $\Pr \{ T = k \}$ appears exactly $k$ times (once in each of the first $k$ rows), matching the expected value calculation.
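The equivalence of the two sums can also be checked numerically. The following is a minimal sketch, assuming a hypothetical episode whose length $T$ is geometric with per-step termination probability `p`; the names `pmf`, `cdf`, `K`, and the value of `p` are illustrative choices and do not come from the notebook's own code.

```julia
# Hypothetical example: T is geometric, so Pr{T = t} = (1 - p)^(t - 1) * p for t = 1, 2, ...
p = 0.2
K = 500                      # truncation point for the infinite sums (assumed large enough)

pmf(t) = (1 - p)^(t - 1) * p # Pr{T = t}
cdf(k) = 1 - (1 - p)^k       # Pr{T <= k}, playing the role of Pr{s_0 -> S_T, k, π}

# Expected value computed directly from the probability mass function
η_expected = sum(t * pmf(t) for t in 1:K)

# The same quantity computed from the tail probabilities Pr{T > k}
η_tail = sum(1 - cdf(k) for k in 0:K)

println((η_expected, η_tail))
```

For a geometric episode length, both sums should agree with the known mean $1/p$ up to the truncation error.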

What if we wanted to calculate the bivariate distribution over states and steps, ignoring the terminal states, $\mu_\pi(s, k)$ such that $\sum_{s \in \mathcal{S}} \sum_k \mu_\pi(s, k) = 1$? This probability represents the chance of drawing a particular step and state together from an unbiased sample of the non-terminal states visited in an episode. Luckily, we can break this probability down into two components: 1) the probability of reaching step $k$ without terminating, and 2) the probability of being in state $s$ on step $k$ given that the episode has not terminated. We saw already that 1) is just $\sum_{s \in \mathcal{S}} \Pr \{ s_0 \rightarrow s, k, \pi \}$, and we can calculate 2) by normalizing those probabilities over only the non-terminal states: $\frac{\Pr \{ s_0 \rightarrow s, k, \pi \}}{\sum_{x \in \mathcal{S}} \Pr \{ s_0 \rightarrow x, k, \pi \} }$. Multiplying these two together, we see that the result is just the original probability $\Pr \{ s_0 \rightarrow s, k, \pi \}$, with its domain restricted to $s \in \mathcal{S}$ and all possible steps $k$. Therefore, we can transform this into a normalized bivariate distribution by dividing by its sum over those two sets:

$$\mu_\pi(s, k) = \frac{\Pr \{ s_0 \rightarrow s, k, \pi \}}{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}}$$
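As an illustration of this normalization, here is a minimal sketch assuming a hypothetical two-state episodic chain under a fixed policy. The matrix `P`, the function name `step_state_distribution`, and the truncation length `K` are assumptions made for this example only; they are not part of the notebook's implementation.

```julia
using LinearAlgebra

# P[i, j] = probability of moving from non-terminal state i to non-terminal state j under π.
# Each row sums to less than 1; the remainder is the probability of terminating.
P = [0.5 0.3;
     0.2 0.4]

# Returns (μ, η), where μ[k + 1, s] approximates μ_π(s, k) for k = 0, ..., K
# and η is the normalizing constant Σ_x Σ_t Pr{s0 -> x, t, π}.
function step_state_distribution(P, s0, K)
    n = size(P, 1)
    visit = zeros(K + 1, n)   # visit[k + 1, s] = Pr{s0 -> s, k, π}
    d = zeros(n)
    d[s0] = 1.0               # at step 0 the chain is in s0 with probability 1
    for k in 0:K
        visit[k + 1, :] .= d
        d = P' * d            # propagate the state distribution one step
    end
    η = sum(visit)
    return visit ./ η, η
end

μ, η = step_state_distribution(P, 1, 400)

# Closed form for the normalizer: the s0 entry of (I - P)⁻¹ applied to a vector of ones
η_exact = ((I - P) \ ones(2))[1]

println((sum(μ), η, η_exact))
```

In this sketch the normalizer `η` is the same quantity as the average episode length discussed above, so the truncated sum and the closed form should agree closely when `K` is large.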

Now that we have established the relationship between the on-policy distribution function and the probability expression we have, we can use it to complete the proof below.

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô†ô°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$76b03e72-da04-4530-8534-6d6468268cbd¹depends_on_disabled_cellsÂ§runtimeÎ £Âµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$fd89433e-643c-474b-b3c4-a997678421a6Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙÿ

Linear Features

This version of REINFORCE uses linear feature vectors. One must specify the total number of features along with a function that updates the values in a feature vector given a state.

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôŠ S°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$fd89433e-643c-474b-b3c4-a997678421a6¹depends_on_disabled_cellsÂ§runtimeÎ©½µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$87feff3e-e510-4916-91a9-db3a2cd12225Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚe‚

$\lambda_\theta$: 0.75

$\lambda_\mathbf{w}$: 0.25

$\alpha_{\overline{r}}$:

hidden layer size: , num layers:

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•8 f°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$87feff3e-e510-4916-91a9-db3a2cd12225¹depends_on_disabled_cellsÂ§runtimeÎ j‚µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$5261651e-a51e-4e80-8e23-83a4c10e5259Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙEupdate_gaussian_eligibility_vector! (generic function with 2 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•#E²Õ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$5261651e-a51e-4e80-8e23-83a4c10e5259¹depends_on_disabled_cellsÂ§runtimeÎw»þµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$dddc4a2f-34b2-41dc-85b3-55aba4880fa6Š¦queuedÂ¤logs§runningÂ¦output†¤body‚£msgÙ¦UndefVarError: `reinforce_test` not defined in `Main.var"workspace#8"` Suggestion: add an appropriate import or assignment. This global was declared but not assigned.ªstacktrace‘Œªcall_short¯top-level scope§inlinedÂ£urlÀ¤pathÙØ/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-13/Chapter_13_Policy_Gradient_Methods.jl#==#dddc4a2f-34b2-41dc-85b3-55aba4880fa6®source_packageÀ¤call¯top-level scopeªlinfo_typeCore.CodeInfo¤line¤fileÙMChapter_13_Policy_Gradient_Methods.jl#==#dddc4a2f-34b2-41dc-85b3-55aba4880fa6¤func¯top-level 
scopeparent_moduleÀ¦from_cÂ¤mimeÙ'application/vnd.pluto.stacktrace+object¬rootassigneeÀ²last_run_timestampËAÚ•0üMŒ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$dddc4a2f-34b2-41dc-85b3-55aba4880fa6¹depends_on_disabled_cellsÂ§runtimeÀµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÃÙ$54fff14b-cf53-47b0-9cfa-8b9ee33df54eŠ¦queuedÂ¤logs§runningÂ¦output†¤body»BinaryBetaEligibilityVector¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•!“°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$54fff14b-cf53-47b0-9cfa-8b9ee33df54e¹depends_on_disabled_cellsÂ§runtimeÎKCÇµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$023f67b8-8f38-470a-9766-ac60a75678aaŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements“’®feature_vector’…¦prefix§Float32¨elements’’’£0.0ªtext/plain’’£0.0ªtext/plain¤type¥Array¬prefix_short ¨objectid°96151aba818dc130Ù!application/vnd.pluto.tree+object’¬num_features’¡2ªtext/plain’¶update_feature_vector!’¶update_feature_vector!ªtext/plain¤typeªNamedTuple¨objectid°534e5799b20b5f5b¤mimeÙ!application/vnd.pluto.tree+object¬rootassignee½const mountaincar_fcann_setup²last_run_timestampËAÚ•:¡j°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$023f67b8-8f38-470a-9766-ac60a75678aa¹depends_on_disabled_cellsÂ§runtimeÎS~µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$1558cec1-c4fd-4bc0-85ed-ae22c6067d41Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ:

We can also repeat this derivation for the alternative linear parameterization where we only have state feature vectors and a parameter matrix with components $\boldsymbol{\theta}_{i, j}$:

$$\begin{flalign} \mathbf{h} &= \boldsymbol{\theta}^\top \mathbf{x}(s) \\ h_a &= \mathbf{h}_a \\ \mathbf{\pi}(s) &= \sigma(\mathbf{h}) \\ \pi_a &= \sigma(\mathbf{h})_a \\ \nabla(\pi_a)_{i, j} &= \pi_a \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases} \end{flalign}$$

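We already know how to apply the chain rule to the natural logarithm, $\nabla \ln \pi_a = \frac{\nabla \pi_a}{\pi_a}$, so our final gradient is:

$$\nabla \left ( \ln \pi_a \right )_{i, j} = \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases}$$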

Applying this to the above expression yields:

which is the per component version of the desired vector expression.
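The per-component gradient for the matrix parameterization can be written compactly as an outer product. The sketch below is one way to compute it in Julia, assuming a softmax policy over $\mathbf{h} = \boldsymbol{\theta}^\top \mathbf{x}(s)$ as defined above; the names `softmax` and `log_policy_gradient` are illustrative and not necessarily those used elsewhere in this notebook.

```julia
# Softmax over a vector of action preferences, shifted by the maximum for numerical stability
function softmax(h)
    e = exp.(h .- maximum(h))
    return e ./ sum(e)
end

# Gradient of ln π_a with respect to the d × num_actions matrix θ, for h = θᵀ x(s).
# Entry (i, j) equals x[i] * (1 - π_j) when j == a and -π_j * x[i] otherwise.
function log_policy_gradient(θ, x, a)
    π_s = softmax(θ' * x)
    onehot = zeros(length(π_s))
    onehot[a] = 1.0
    return x * (onehot .- π_s)'
end
```

Each row of the result sums to zero because the action probabilities sum to 1, which is a quick sanity check on any implementation of this gradient.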

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôˆŒÖ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$1558cec1-c4fd-4bc0-85ed-ae22c6067d41¹depends_on_disabled_cellsÂ§runtimeÎ{ˆµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$da8d0bca-105b-4d0b-a73d-ee5c9059aeafŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚø

Notice now that all of the parameters associated with the state-value estimate are irrelevant since they always cancel out in the parameter update. Even though we have added a parameter, this method effectively removes two from the analysis. Also, we seem to actually benefit from an intermediate value of $\lambda_{\boldsymbol{\theta}}$, unlike in the episodic case where using the Monte Carlo method was always the best.

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôŒìÀ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$da8d0bca-105b-4d0b-a73d-ee5c9059aeaf¹depends_on_disabled_cellsÂ§runtimeÎZµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$3e7cecec-eb77-4862-8e3c-b510422e06dbŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚÊ ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•!Žaò°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$3e7cecec-eb77-4862-8e3c-b510422e06db¹depends_on_disabled_cellsÂ§runtimeÎøYÿµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$0284f0d7-b8a9-4ae6-add0-ac1078571d9bŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ*

$$\begin{flalign} J(\boldsymbol{\theta}) \doteq r(\pi) &\doteq \lim_{h \rightarrow \infty} \frac{1}{h} \sum_{t=1}^h \mathbb{E} [R_t \mid S_0, A_{0:t-1} \sim \pi] \tag{13.15} \\ &= \lim_{t \rightarrow \infty} \mathbb{E}[R_t \vert S_0,A_{0:t-1} \sim \pi] \\ &= \sum_s \mu(s) \sum_a \pi(a \vert s) \sum_{s^\prime, r} p(s^\prime, r \vert s, a) r \end{flalign}$$

where $\mu$ is the steady-state distribution under $\pi$, $\mu(s) \doteq \lim_{t \rightarrow \infty} \Pr \{ S_t = s \vert A_{0:t} \sim \pi \}$, which is assumed to exist and to be independent of $S_0$ (an ergodicity assumption). Remember that this is the special distribution under which, if you select actions according to $\pi$, you remain in the same distribution:

$$\sum_s \mu(s) \sum_a \pi(a \vert s, \boldsymbol{\theta})p(s^\prime \vert s, a) = \mu(s^\prime), \: \forall s^\prime \in \mathcal{S}$$
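The fixed-point property can be checked numerically. The sketch below uses a hypothetical three-state chain, not one from this notebook, whose matrix `P` stands for the policy-averaged transition probabilities $\sum_a \pi(a \vert s) p(s^\prime \vert s, a)$. Power iteration is assumed to converge, which holds for this chain because it is ergodic.

```julia
# Hypothetical policy-averaged transition matrix: P[s, s2] = Σ_a π(a|s) p(s2|s, a)
P = [0.5  0.5 0.0;
     0.25 0.5 0.25;
     0.0  0.5 0.5]

# Power iteration: push a distribution through the chain until it stops changing
function stationary_distribution(P; iters = 10_000)
    n = size(P, 1)
    μ = fill(1 / n, n)
    for _ in 1:iters
        μ = P' * μ
    end
    return μ
end

μ = stationary_distribution(P)
# The steady-state condition in vector form: Pᵀ μ = μ
residual = maximum(abs.(P' * μ .- μ))
```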

Naturally, in the continuing case, we define values, $v_\pi(s) \doteq \mathbb{E}_\pi [G_t \vert S_t = s]$ and $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \vert S_t = s, A_t = a]$, with respect to the differential return:

$$G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots \tag{13.17}$$

With these alternate definitions, the policy gradient theorem as given for the episodic case (13.5) remains true for the continuing case. See proof below:

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôŒ3Î°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$0284f0d7-b8a9-4ae6-add0-ac1078571d9b¹depends_on_disabled_cellsÂ§runtimeÎ«ûµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$b94fc99c-f439-4df2-8da3-c01718a136c4Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚè

Repeating this process for state 2 yields:

$$\begin{flalign} v_2 &= -\frac{2+p}{p(1-p)} \\ \frac{\partial v_2}{\partial p} &= -\frac{p(1-p) - (2+p)(1 - 2p)}{p^2(1-p)^2} \end{flalign}$$

Setting this equal to 0 implies

$$\begin{flalign} p - p^2 &= 2 - 4p + p - 2p^2 \\ p^2 + 4p - 2 &= 0 \\ \end{flalign}$$

Using the quadratic formula and taking only the positive solution yields:

$$p = \frac{-4 + \sqrt{16 + 8}}{2} = \frac{-4 + \sqrt{24}}{2} = -2 + \sqrt{6} \approx 0.4495$$

So, in order to maximize the value at state 2, we have $p_{\text{left}} \approx 0.4495$ and $p_{\text{right}} \approx 0.5505$, which is different from the value we got for state 1. So there is a different optimal policy depending on the starting state. It should be obvious, for example, that starting in the third state results in an optimal policy of choosing the right action every time. The value functions for each state are plotted below. The behavior of $v_3$ is not well defined at $p=0$ because for any finite $v_2$ it should be 0 but the limit approaching from the right side is -3. This is because for $p=0$ both $v_1$ and $v_2$ are not finite and the episode never terminates.

The value of the state at this probability is: $v_2 = - \frac{2+p}{p(1-p)} = -\frac{\sqrt{6}}{(\sqrt{6}-2)(3 - \sqrt{6})} = - \frac{\sqrt{6}}{3 \sqrt{6} - 6 - 6 + 2 \sqrt{6}} = - \frac{\sqrt{6}}{5 \sqrt{6} - 12} \approx -9.9$
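As a quick numerical check of the optimum, the sketch below evaluates $v_2$ on a fine grid of $p$ and compares the maximizer to the closed form $-2 + \sqrt{6}$. The grid spacing of $10^{-4}$ is an arbitrary choice made for this illustration.

```julia
# Value of state 2 as a function of p, taken from the expression above
v2(p) = -(2 + p) / (p * (1 - p))

ps = range(0.001, 0.999, length = 9981)   # grid with spacing 1e-4, avoiding the poles at 0 and 1
p_grid = ps[argmax(v2.(ps))]              # grid point with the largest value
p_exact = -2 + sqrt(6)                    # positive root of p^2 + 4p - 2 = 0
v2_exact = v2(p_exact)
```

Rationalizing the denominator of the closed-form value gives $v_2 = -(5 + 2\sqrt{6})$, which agrees with the approximate value of $-9.9$ quoted above.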

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô…\S°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$b94fc99c-f439-4df2-8da3-c01718a136c4¹depends_on_disabled_cellsÂ§runtimeÎ[Nµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$b8532822-179b-4cd5-a279-4b71dafb544aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements™’¬step_rewards’…¦prefix§Float32¨elements¤type¥Array¬prefix_short ¨objectid°1719e83743ccf99bÙ!application/vnd.pluto.tree+object’episode_steps’…¦prefix¥Int64¨elements›’’¦255500ªtext/plain’’¦258973ªtext/plain’’¦271063ªtext/plain’’¦282869ªtext/plain’’¦292280ªtext/plain’’¦295131ªtext/plain’’¦298359ªtext/plain’’¦302972ªtext/plain’ ’¦306408ªtext/plain¤more’ÍÆ’¦999862ªtext/plain¤type¥Array¬prefix_short ¨objectid¯7dd28e625537ba9Ù!application/vnd.pluto.tree+object’¯episode_rewards’…¦prefix§Float32¨elements›’’©-255499.0ªtext/plain’’§-3473.0ªtext/plain’’¨-12090.0ªtext/plain’’¨-11806.0ªtext/plain’’§-9411.0ªtext/plain’’§-2851.0ªtext/plain’’§-3228.0ªtext/plain’’§-4613.0ªtext/plain’ ’§-3436.0ªtext/plain¤more’ÍÆ’¦-161.0ªtext/plain¤type¥Array¬prefix_short ¨objectid°31815ca7f89be3d0Ù!application/vnd.pluto.tree+object’±policy_parameters’Ú?1452Ã—2 Matrix{Float32}: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 â‹® 0.0766672 -0.267737 0.0786335 -0.219924 -0.0847249 -0.0164793 -4.25479f-5 0.000205706 0.0 0.0 0.0384359 -0.00628892ªtext/plain’°value_parameters’…¦prefix§Float32¨elements›’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’ ’£0.0ªtext/plain¤more’Í¬’ª-0.0174626ªtext/plain¤type¥Array¬prefix_short ¨objectid°1587a1bc73b13b35Ù!application/vnd.pluto.tree+object’¯policy_function’¢Ï€ªtext/plain’´policy_sample_action’©Ï€_sampleªtext/plain’´estimate_state_value’´estimate_state_valueªtext/plain’°policy_and_value’°policy_and_valueªtext/plain¤typeªNamedTuple¨objectid°c5001092224d61de¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÙ'const 
mountaincar_continuous_test_train²last_run_timestampËAÚ•=úÀ8°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$b8532822-179b-4cd5-a279-4b71dafb544a¹depends_on_disabled_cellsÂ§runtimeÎD¡†µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$07ba9fe4-aaa7-4123-9865-cbfa79d0d44aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÛŠ ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•7ûO°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$07ba9fe4-aaa7-4123-9865-cbfa79d0d44a¹depends_on_disabled_cellsÂ§runtimeÎ¿;µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7dŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ·¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•!7?°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7d¹depends_on_disabled_cellsÂ§runtimeÎ x8µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$5c4a383f-fcf2-4f2b-819f-6d84471dda00Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ=update_fcann_value_gradient! (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•ñ,²°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$5c4a383f-fcf2-4f2b-819f-6d84471dda00¹depends_on_disabled_cellsÂ§runtimeÎ!Bµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$135f205a-f87e-4691-8e87-d317d6312c84Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ [

The plots below visualize these distributions for the corridor problem starting with the normalized distributions per step which include the terminal states. If we continued to create these plots for larger values of $k$, then the distribution would collapse to a value of 1 for being in a terminal state. In order to calculate other distributions such as the stationary state distribution, it is necessary to renormalize these probabilities by excluding the terminal states:

On-policy Distributions

$$\begin{flalign} &\mu_{k, \pi}(s) = \Pr\{S_k = s \mid \pi \} \; \forall s \in \mathcal{S}^+ \tag{state visits per step}\\ &\Pr \{ T \leq k \vert \pi \} = 1 - \sum_{s \in \mathcal{S}} \Pr\{S_k = s \mid \pi \} \; \forall k \tag{Chance of terminating already (distribution function not density)}\\ &\mu_\pi(s) = \frac{\sum_k \Pr \{ S_k = s \mid \pi \}}{\sum_{k} \sum_{s \in \mathcal{S}} \Pr \{ S_k = s \mid \pi \}} \; \forall s \in \mathcal{S} \tag{non-terminal state visits}\\ &\mu_\pi(s, k) = \frac{\Pr \{ S_k = s \mid \pi \}}{\sum_{k} \sum_{s \in \mathcal{S}} \Pr \{ S_k = s \mid \pi \}} \; \forall s \in \mathcal{S} \tag{non-terminal state and step visits}\\ \end{flalign}$$

Note that the final two distributions are only defined for non-terminal states. If we tried to include terminal states we would be unable to normalize the distribution since $\lim_{k \rightarrow \infty} \Pr \{ S_k = S_T \mid \pi \} = 1$ and we would have a diverging sum in the denominator. The only reason these calculations are possible is that the probabilities reach zero quickly enough at higher $k$ for the non-terminal states.

The plots below visualize the four expressions above. The second expression notably is not a probability density but a cumulative distribution function since it includes a sum of all probabilities that meet the condition.

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô† J°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$135f205a-f87e-4691-8e87-d317d6312c84¹depends_on_disabled_cellsÂ§runtimeÎ@µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$4a39f9a7-72d4-44ad-895a-742cd1291f92Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚø 0.5 ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•|±°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$4a39f9a7-72d4-44ad-895a-742cd1291f92¹depends_on_disabled_cellsÂ§runtimeÎŠtµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$ee72af8d-3cb8-4314-82df-580f068e1252Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ_

One common form of linear feature vector is one that selects active features per state. Tile coding is an example of this, where a state is assigned a tile in each tiling used and the number of tilings controls how many active features a given state will have. Because the only possible feature vector values are 1 or 0, this style of encoding need not be as complex as other methods. The form of the gradients suggests an abbreviated algorithm that need not compute the eligibility vector explicitly.

We can define a binary feature encoding by the function $\mathcal{F}(s)$ which returns the indices of active features for a state $s$ as well as the knowledge of how many total features there are, $d$. All of the values of $\mathbf{x}(s)$ are zero except for the indices in $\mathcal{F}(s)$ whose values are 1. That simplifies the expression we have before for the linear feature eligibility vector:

$$\begin{flalign} \nabla \left ( \ln \pi_a \right )_{i, j} &= \frac{\nabla \left ( \pi_a \right )_{i, j}}{\pi_a} \\ &= \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases} \\ &= \begin{cases} (1 - \pi_j), & \text{ if } j = a \text{, } i \in \mathcal{F}(s) \\ -\pi_j, & \text{ if } j \neq a \text{, } i \in \mathcal{F}(s) \\ 0, & \text{ otherwise} \end{cases} \end{flalign}$$

We can see from this form of the eligibility vector that it need not be computed explicitly and we do not need to instantiate a feature vector either. Rather, we can simply go through the active feature indices and subtract the policy output for the column index at each row and then add 1 to the column corresponding to the selected action:

Loop for each step of the episode $t = 0, 1, \cdots, T-1$

$$G \leftarrow \sum_{k=t+1}^T \gamma^{k-t-1}R_k$$

$$c \leftarrow \alpha \times \gamma^t \times G$$

Compute $\pi(a_j, S_t, \boldsymbol{\theta})$ for every action index $j$ before changing any parameters

Loop for each action index $j$

Loop for each active feature $i \in \mathcal{F}(S_t)$

$$\theta_{i, j} \leftarrow \theta_{i, j} - c \times \pi(a_j, S_t, \boldsymbol{\theta})$$

Define $j_a$ as the column index corresponding to action $A_t$

Loop for each active feature $i \in \mathcal{F}(S_t)$

$$\theta_{i, j_a} \leftarrow \theta_{i, j_a} + c$$

Specialized versions of REINFORCE that use binary features and linear features can be found below as well as the general case that works for any type of parameterized function approximation.
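The inner loops of the binary-feature update can be sketched as follows. This is a minimal illustration, not the notebook's own implementation: the function name `reinforce_binary_update!` and its argument layout are assumptions, and the policy is assumed to be a softmax over the column sums of the active rows of $\boldsymbol{\theta}$.

```julia
function softmax(h)
    e = exp.(h .- maximum(h))
    return e ./ sum(e)
end

# One REINFORCE parameter update using binary features.
# θ: d × num_actions parameter matrix (modified in place)
# active: indices of the active features F(S_t)
# a: column index of the selected action A_t
# c: the scalar α * γ^t * G
function reinforce_binary_update!(θ, active, a, c)
    # With binary features the preference for action j is the sum of θ[i, j] over active i.
    # The probabilities are computed once, before any parameter is changed.
    h = [sum(θ[i, j] for i in active) for j in axes(θ, 2)]
    π_s = softmax(h)
    for j in axes(θ, 2), i in active
        θ[i, j] -= c * π_s[j]     # the -π_j term for every action column
    end
    for i in active
        θ[i, a] += c              # the +1 term for the selected action
    end
    return θ
end
```

Only the rows listed in `active` are touched, so the cost per step scales with the number of active features rather than with the total number of features $d$.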

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô‰*°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$ee72af8d-3cb8-4314-82df-580f068e1252¹depends_on_disabled_cellsÂ§runtimeÎ Å‡µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$e524f8cc-ab69-4f8b-a59f-28156696a104Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ½¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•>J®°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$e524f8cc-ab69-4f8b-a59f-28156696a104¹depends_on_disabled_cellsÂ§runtimeÎ’¾µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÀ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampË°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09¹depends_on_disabled_cellsÃ§runtimeÀµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$f3bc47b5-03fc-4bd9-a890-26f9608a730bŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚS

Continuing Corridor Gridworld Example

Note that if we try to apply this algorithm to the short corridor gridworld it fails because a terminal state is encountered. This condition is checked inside the algorithm because nothing in the definition of an MDP tells you in advance whether it is a continuing task. In the tabular case you can always check whether a terminal state exists since every state is available, but in the non-tabular case all we can do is report the problem if a terminal state is encountered.

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôŒ··°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$f3bc47b5-03fc-4bd9-a890-26f9608a730b¹depends_on_disabled_cellsÂ§runtimeÎ€Rµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$4915b1ed-ad53-4ece-9b00-bc136d47d8dcŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚí

It is implicit in all expressions below that $\pi$ is a function of $\boldsymbol{\theta}$ and that the gradients are with respect to $\boldsymbol{\theta}$. The performance measure for the continuing case is $J(\boldsymbol{\theta}) = r(\boldsymbol{\theta})$ (13.15) and all value functions use the definition of the differential return. We begin by expressing the gradient of the state value function in terms of the state-action value function, the policy, the average reward and gradients thereof:

$$\begin{flalign} \nabla v_\pi(s) &= \nabla \left [ \sum_a \pi(a \vert s) q_\pi (s, a) \right ], \: \forall s \in \mathcal{S} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla q_\pi(s, a) \right ] \tag{product rule} \\ &=\sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla \sum_{s^\prime, r} p(s^\prime, r \vert s, a)\left (r - r(\boldsymbol{\theta}) + v_\pi(s^\prime) \right ) \right ] \tag{differential return definitions} \\ &=\sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \left [ -\nabla r(\boldsymbol{\theta}) + \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] \right ] \tag{distributing gradient}\\ \end{flalign}$$

The purpose of this expression is to isolate the term which is the gradient of the average reward since this is the performance metric gradient we originally sought. Note that if we separate the terms inside the sum, the one with the gradient of $r$ is $\sum_a \pi(a\vert s) [- \nabla r(\boldsymbol{\theta})] = -\nabla r(\boldsymbol{\theta}) \sum_a \pi(a \vert s)$. But the policy function is a probability distribution so its sum over actions is just 1. Therefore, this term simplifies to just $-\nabla r(\boldsymbol{\theta})$, which we can simply move to the other side of the expression, swapping its place with the gradient of the state value function:

$$\begin{flalign} \nabla v_\pi(s)&=-\nabla r(\boldsymbol{\theta}) + \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] \\ \nabla r(\boldsymbol{\theta}) &=-\nabla v_\pi(s) + \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] \end{flalign}$$

Now the left hand side is $\nabla J(\boldsymbol{\theta})$ and does not depend on $s$. As such, the right hand side as a whole must be independent of $s$ as well, so we are free to take a weighted sum of it over any probability distribution on $s$ since the weights sum to 1. That is, if $f$ is independent of $s$, then $\sum_s \mu(s) f = f \sum_s \mu(s) = f \times 1 = f$:

$$\begin{flalign} \nabla J(\boldsymbol{\theta}) &= \sum_s \mu(s) \left ( \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] - \nabla v_\pi(s) \right ) \\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) + \sum_s \mu(s) \sum_a \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) - \sum_s \mu(s) \nabla v_\pi(s) \tag{separating sum terms}\\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) + \sum_{s^\prime} \sum_s \mu(s) \sum_a \pi(a \vert s) p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) - \sum_s \mu(s) \nabla v_\pi(s) \tag{swapping sum order in second term}\\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) + \sum_{s^\prime} \mu(s^\prime) \nabla v_\pi(s^\prime) - \sum_s \mu(s) \nabla v_\pi(s) \tag{stationary state distribution definition}\\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) \tag{cancelling equivalent sum terms}\\ &= \mathbb{E}_\pi \left [ \sum_a \nabla \pi(a \vert S_t) q_\pi(S_t, a) \right ] \tag{expected value definition}\\ &= \mathbb{E}_\pi \left [ \sum_a \pi(a \vert S_t) \frac{\nabla \pi(a \vert S_t)}{\pi(a \vert S_t)} q_\pi(S_t, a) \right ] \tag{multiplying and dividing by the policy}\\ &= \mathbb{E}_\pi \left [\frac{\nabla \pi(A_t \vert S_t)}{\pi(A_t \vert S_t)} q_\pi(S_t, A_t) \right ] \tag{expected value definition}\\ &= \mathbb{E}_\pi \left [\frac{\nabla \pi(A_t \vert S_t)}{\pi(A_t \vert S_t)} G_t \right ] \tag{differential return definition}\\ &= \mathbb{E}_\pi \left [G_t \nabla \ln \pi(A_t \vert S_t) \right ] \tag{chain rule}\\ \end{flalign}$$

The expression inside the expected value can be sampled on every time step and the gradient is only in terms of the policy function which we have selected as something differentiable with respect to the parameters. Since this method will only be used for continuing problems, we cannot rely on Monte Carlo sampling for the differential return. Instead, our only option is to use a bootstrap value estimate in combination with a running estimate of the average reward and the immediate sample reward: $R - \overline{R} + \hat v^\prime$ where $\hat v^\prime$ is the differential value function estimate at the transition state and $\overline{R}$ is an estimate of the average reward. We can apply the existing actor-critic algorithms to these continuing problems as long as we track that additional information and use an additional step size parameter to update the average reward estimate. This step size parameter replaces the discount rate. See a full implementation below:
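To make the bootstrapped target concrete, here is a minimal sketch of a single update of a one-step actor-critic for the continuing case with linear function approximation. It is an illustration under stated assumptions, not the implementation referred to above: the function name, the argument names (`r_avg` for $\overline{R}$, `x_next` for the features of the transition state) and the softmax policy over $\boldsymbol{\theta}^\top \mathbf{x}$ are all choices made for this example.

```julia
function softmax(h)
    e = exp.(h .- maximum(h))
    return e ./ sum(e)
end

# w: value weights (length d, modified in place)
# θ: policy weights (d × num_actions, modified in place)
# r_avg: current estimate of the average reward
# x, x_next: feature vectors of the current state and the transition state
# a: index of the action taken, r: reward received
function differential_actor_critic_step!(w, θ, r_avg, x, a, r, x_next; α_w, α_θ, α_r)
    δ = r - r_avg + w' * x_next - w' * x      # differential TD error: R - R̄ + v̂(S') - v̂(S)
    r_avg += α_r * δ                          # update the running average-reward estimate
    π_s = softmax(θ' * x)                     # policy evaluated before θ is changed
    onehot = zeros(length(π_s))
    onehot[a] = 1.0
    w .+= α_w * δ .* x                        # critic: the gradient of a linear v̂ is x
    θ .+= α_θ * δ .* (x * (onehot .- π_s)')   # actor: δ times the eligibility matrix
    return r_avg, δ
end
```

The step size `α_r` plays the role described above: it sets how quickly the average-reward estimate tracks the observed rewards, and no discount rate appears anywhere in the update.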

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôŒwW°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$4915b1ed-ad53-4ece-9b00-bc136d47d8dc¹depends_on_disabled_cellsÂ§runtimeÎ 3ƒµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$f924eb30-d1cc-4941-8fb5-ff70ad425ab9Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚE

13.3 REINFORCE: Monte Carlo Policy Gradient

If we replace the true action-value function in (13.5) with a learned approximation $\hat q_\pi$, then we have a method called the all-actions method because the update involves the sum over all actions. For the REINFORCE algorithm, we instead sample this value using the actual return and the policy distribution.

We can re-write (13.5) using an expected value under the policy and continue from there:

$$\begin{flalign} \nabla J(\boldsymbol{\theta}) & \propto \mathbb{E}_\pi \left [ \gamma^t \sum_a q_\pi (S_t, a) \nabla \pi(a|S_t, \boldsymbol{\theta}) \right ] \tag{13.6}\\ &= \mathbb{E}_\pi \left [\gamma^t \sum_a \pi(a|S_t, \boldsymbol{\theta}) q_\pi (S_t, a) \frac{\nabla \pi(a|S_t, \boldsymbol{\theta})}{\pi(a|S_t, \boldsymbol{\theta})} \right ] \tag{multiply and divide by policy} \\ &= \mathbb{E}_\pi \left [ \gamma^t q_\pi (S_t, A_t) \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})} \right ] \tag{replace a with sample under policy} \\ &= \mathbb{E}_\pi \left [ \gamma^t G_t \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})} \right ] \tag{replace value with sample return} \\ \end{flalign}$$

Using the expression in the brackets we can write down an update rule for the parameters that can be sampled on each time step. This is the REINFORCE update:

$$\begin{align} \boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t + \alpha \gamma^t G_t \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta}_t)}{\pi(A_t|S_t, \boldsymbol{\theta}_t)} \tag{13.8} \end{align}$$

Because it uses all future rewards after step $t$, REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case. For implementation purposes we can replace $\frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta}_t)}{\pi(A_t|S_t, \boldsymbol{\theta}_t)}$ with $\nabla \ln \pi(A_t|S_t, \boldsymbol{\theta}_t)$, which is usually referred to as the eligibility vector.

With the alternative parameterization, the eligibility vector is $\nabla \ln \pi(S_t, \theta_t)_{A_t}$ where $\pi$ is a vector and the $A_t$ subscript takes the value of that vector at the index corresponding to the action $A_t$.
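The scalar part of the update (13.8), $\alpha \gamma^t G_t$, depends only on the rewards of a completed episode. The sketch below computes it with a single backward pass; the function names are hypothetical, and it assumes the rewards are stored so that `rewards[i]` holds $R_i$, the reward received after acting at step $t = i - 1$.

```julia
# Discounted returns for every step of one episode.
# G[i] holds G_{i-1} = R_i + γ R_{i+1} + γ^2 R_{i+2} + ... up to the final reward.
function discounted_returns(rewards, γ)
    G = zeros(length(rewards))
    g = 0.0
    for i in length(rewards):-1:1
        g = rewards[i] + γ * g    # recursion G_t = R_{t+1} + γ G_{t+1}
        G[i] = g
    end
    return G
end

# Coefficients α γ^t G_t that multiply the eligibility vector at each step t = i - 1
function reinforce_coefficients(rewards, γ, α)
    G = discounted_returns(rewards, γ)
    return [α * γ^(i - 1) * G[i] for i in eachindex(G)]
end
```

For example, with rewards of $-1$ on each of three steps and $\gamma = 0.5$, the returns are $-1.75$, $-1.5$ and $-1$.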

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôˆ+ñ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$f924eb30-d1cc-4941-8fb5-ff70ad425ab9¹depends_on_disabled_cellsÂ§runtimeÎêµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$d83dc659-dce7-41dd-a8e7-2933ab39d15cŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ©

REINFORCE with Baseline Implementation

These functions use two sets of parameters, one to calculate the policy function and another to calculate the state value function. The state representation vector is shared between the two functions, but the policy function will return a distribution of preferences over actions while the value function will return a single value. If linear approximation is used to estimate both functions, then the policy parameters $\boldsymbol{\theta}$ will be a $d \times N_a$ matrix where $d$ is the length of the state feature vector representation and the value function parameters $\mathbf{w}$ will be a length $d$ vector. It is also possible to mix linear and non-linear approximation with this method.
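For the fully linear case, the parameter shapes described above can be illustrated with a short sketch. This is an assumed form for illustration only and may differ from the notebook's own functions: it takes a shared feature vector `x` of length $d$, a $d \times N_a$ matrix `θ`, and a length-$d$ vector `w`, and uses a softmax to turn the preferences into probabilities.

```julia
# Returns the action probabilities and the state-value estimate from one shared feature vector
function linear_policy_and_value(θ, w, x)
    h = θ' * x                         # one preference per action, length N_a
    e = exp.(h .- maximum(h))          # softmax, shifted for numerical stability
    return (action_probabilities = e ./ sum(e), state_value_estimate = w' * x)
end

d, num_actions = 3, 2
θ = zeros(d, num_actions)              # policy parameters: d × N_a matrix
w = zeros(d)                           # value parameters: length d vector
result = linear_policy_and_value(θ, w, [1.0, 0.0, 1.0])
```

With all parameters at zero, every action is equally likely and the value estimate is zero, which is a common starting point for these methods.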

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô‰Éâ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$d83dc659-dce7-41dd-a8e7-2933ab39d15c¹depends_on_disabled_cellsÂ§runtimeÎYrµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fceŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚXh ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•3!M'°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fce¹depends_on_disabled_cellsÂ§runtimeÎÞ˜µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$83ca0577-15d7-4448-b597-c77810b812bfŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ1figure_13_2_test (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•%Wj8°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$83ca0577-15d7-4448-b597-c77810b812bf¹depends_on_disabled_cellsÂ§runtimeÎ§ _µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ\reinforce_with_baseline_monte_carlo_control_binary_features (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•%Hó°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb¹depends_on_disabled_cellsÂ§runtimeÎ@Ù_µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$a7dcc8cd-04ec-48f2-a387-116330eaffb2Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÛó5 ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•&„õÃ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$a7dcc8cd-04ec-48f2-a387-116330eaffb2¹depends_on_disabled_cellsÂ§runtimeÏ GÁñµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$0ab70fc3-6188-42eb-aba2-d808f319be9fŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ2

Dependencies

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô–ñA°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$0ab70fc3-6188-42eb-aba2-d808f319be9f¹depends_on_disabled_cellsÂ§runtimeÎÈ µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$047656d1-2921-40f2-b75b-ce4a87098007Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙI

Switched Corridor Parameter Studies

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôŠt6°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$047656d1-2921-40f2-b75b-ce4a87098007¹depends_on_disabled_cellsÂ§runtimeÎÐíµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$5d434c83-c9ca-499f-8695-c7733031c2deŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ9cartpole_continuing_step (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ• ÑÑ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$5d434c83-c9ca-499f-8695-c7733031c2de¹depends_on_disabled_cellsÂ§runtimeÎjµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$3a37b53d-9174-4faa-9404-74a40c385b0aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÛ*¼Total Reward: -1000.0

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•A+°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$3a37b53d-9174-4faa-9404-74a40c385b0a¹depends_on_disabled_cellsÂ§runtimeÎ·Lµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$820752af-8966-4ee8-82f7-a40934522de5Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÀ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampË°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$820752af-8966-4ee8-82f7-a40934522de5¹depends_on_disabled_cellsÃ§runtimeÀµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$6acb549a-5d90-4457-a347-d22448ad8071Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚÙ1¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•3(ˆn°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$6acb549a-5d90-4457-a347-d22448ad8071¹depends_on_disabled_cellsÂ§runtimeÎ~Wµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙJcartpole_fcann_continuing_parameter_study (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•2¢ð°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62¹depends_on_disabled_cellsÂ§runtimeÎâÖµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728Š¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements’’´action_probabilities’…¦prefix§Float32¨elements’’’«0.000339307ªtext/plain’’¨0.999661ªtext/plain¤type¥Array¬prefix_short ¨objectid°6ea9ffc26d27fa73Ù!application/vnd.pluto.tree+object’´state_value_estimate’¨-91.9871ªtext/plain¤typeªNamedTuple¨objectid°6155308e34755079¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÀ²last_run_timestampËAÚ•+$-°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728¹depends_on_disabled_cellsÂ§runtimeÎÅ?µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$ae0f5a96-7a4b-47f9-be1e-e803a238a071Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ_

MDP Types and Transitions for Continuous Actions

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô“«“°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$ae0f5a96-7a4b-47f9-be1e-e803a238a071¹depends_on_disabled_cellsÂ§runtimeÎvðµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$41d62de1-2c92-41ee-9430-b9ca3007afd9Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ¶

The above matrix represents an estimate of $\Pr \{ S_k = s \mid \pi \}$; note, however, that the terminal states are excluded from the rows. This corridor problem only has three non-terminal states. If we sum across each row, then we have the probability of reaching that step prior to terminating. The vector defined below measures the probability of an episode terminating prior to each step. Notably, this probability is 0 for the first three steps since no policy starting from the left can terminate that quickly. As expected, the probability of terminating under the random policy grows with time, approaching 1.

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô…åX°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$41d62de1-2c92-41ee-9430-b9ca3007afd9¹depends_on_disabled_cellsÂ§runtimeÎ€›µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$8eb42403-1234-4e59-993e-057cc3a6d5c9Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙB

Waiting to run parameter study

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•@/Ì°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$8eb42403-1234-4e59-993e-057cc3a6d5c9¹depends_on_disabled_cellsÂ§runtimeÎÎøµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$bbc8864a-1545-433f-bc7c-0ddf6e907138Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ?plot_mountaincar_policy_values (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•@Â-0°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$bbc8864a-1545-433f-bc7c-0ddf6e907138¹depends_on_disabled_cellsÂ§runtimeÎiŒôµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$a12b92d1-e045-4f92-b8cd-eee5d56fa67dŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements˜’¯episode_rewards’…¦prefix§Float32¨elements›’’¤-6.0ªtext/plain’’¤-5.0ªtext/plain’’¤-9.0ªtext/plain’’¤-7.0ªtext/plain’’¤-4.0ªtext/plain’’¥-22.0ªtext/plain’’¥-34.0ªtext/plain’’¤-6.0ªtext/plain’ ’¥-27.0ªtext/plain¤more’d’¥-12.0ªtext/plain¤type¥Array¬prefix_short ¨objectid°561caa86399f4303Ù!application/vnd.pluto.tree+object’episode_steps’…¦prefix¥Int64¨elements›’’¡6ªtext/plain’’¡5ªtext/plain’’¡9ªtext/plain’’¡7ªtext/plain’’¡4ªtext/plain’’¢22ªtext/plain’’¢34ªtext/plain’’¡6ªtext/plain’ ’¢27ªtext/plain¤more’d’¢12ªtext/plain¤type¥Array¬prefix_short ¨objectid°74a251baf6aa9110Ù!application/vnd.pluto.tree+object’¯policy_function’£Ï€2ªtext/plain’´policy_sample_action’ªÏ€_sample2ªtext/plain’±policy_parameters’Ù*1Ã—2 Matrix{Float32}: -0.199834 0.199834ªtext/plain’´estimate_state_value’´estimate_state_valueªtext/plain’°value_parameters’…¦prefix§Float32¨elements‘’’¨-9.63535ªtext/plain¤type¥Array¬prefix_short ¨objectid°35fd187379601766Ù!application/vnd.pluto.tree+object’°policy_and_value’°policy_and_valueªtext/plain¤typeªNamedTuple¨objectid°9c86eb7ef1a9a622¤mimeÙ!application/vnd.pluto.tree+object¬rootassignee¶const 
best_mc_corridor²last_run_timestampËAÚ•&´°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$a12b92d1-e045-4f92-b8cd-eee5d56fa67d¹depends_on_disabled_cellsÂ§runtimeÎGEáµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$ce33f710-fd9d-4dfa-acda-40204e54d518Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚx

13.5 Actor-Critic Methods

Here we also use the value function estimator to calculate the return estimate using the one-step bootstrap return. When the state-value function is used in this way we call it the critic. In general we can use this function with n-step returns and eligibility traces. Recall from the subject of TD learning of value functions that the one-step return is often superior to the actual return regarding variance and ease of computation, although it does introduce bias to the estimate. With the use of eligibility traces we can smoothly vary the return estimate arbitrarily close to the Monte Carlo return. Note that the bias in the gradient estimate is not due to bootstrapping as such; the actor would be biased even if the critic was learned by a Monte Carlo method.

The one-step actor-critic method is the analog of one-step methods such as TD$(0)$, Sarsa$(0)$, and Q-learning. These methods replace the full return of REINFORCE with the one-step return as follows:

$$\begin{flalign} \boldsymbol{\theta}_{t+1} &\doteq \boldsymbol{\theta}_t + \alpha(G_{t:t+1} - \hat v(S_t, \mathbf{w}))\nabla\ln\pi(A_t|S_t, \boldsymbol{\theta}_t) \tag{13.12} \\ & = \boldsymbol{\theta}_t + \alpha(R_{t+1} + \gamma \hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w}))\nabla\ln\pi(A_t|S_t, \boldsymbol{\theta}_t) \tag{13.13} \\ & = \boldsymbol{\theta}_t + \alpha\delta_t\nabla\ln\pi(A_t|S_t, \boldsymbol{\theta}_t) \tag{13.14} \\ \end{flalign}$$

This can be implemented as a fully online algorithm because we do not have to wait until the end of an episode to calculate return estimates. The natural state-value-function learning method to pair with this is semi-gradient TD(0). See a full implementation below.
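As a rough illustration of how equation (13.14) pairs with semi-gradient TD(0), here is a minimal Python sketch of a single update step. It is not the notebook's own implementation. The function name, the plain-list parameter layout, and the linear critic $\hat v(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ are assumptions made for the example, and the gradient of $\ln \pi$ is assumed to be supplied by the caller. The extra $\gamma^t$ factor used in the book's episodic pseudocode is omitted to match the equations as written here.

```python
def one_step_actor_critic_update(theta, w, x_s, x_next, reward, grad_ln_pi,
                                 alpha_theta, alpha_w, gamma=1.0, terminal=False):
    """Apply one actor-critic update in place and return the TD error.

    theta, w    : lists holding policy and value parameters (hypothetical layout)
    x_s, x_next : feature vectors of S_t and S_{t+1} for an assumed linear critic
    grad_ln_pi  : gradient of ln pi(A_t | S_t, theta) with respect to theta
    """
    v_s = sum(wi * xi for wi, xi in zip(w, x_s))
    # The value of a terminal state is defined to be zero.
    v_next = 0.0 if terminal else sum(wi * xi for wi, xi in zip(w, x_next))
    delta = reward + gamma * v_next - v_s  # one-step TD error
    # Critic: semi-gradient TD(0); the gradient of a linear v-hat is x(S_t).
    for i, xi in enumerate(x_s):
        w[i] += alpha_w * delta * xi
    # Actor: theta <- theta + alpha * delta * grad ln pi, as in (13.14).
    for i, gi in enumerate(grad_ln_pi):
        theta[i] += alpha_theta * delta * gi
    return delta
```

Because each call needs only the current transition, the update can run after every environment step without waiting for the episode to end.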

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôŠÿ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$ce33f710-fd9d-4dfa-acda-40204e54d518¹depends_on_disabled_cellsÂ§runtimeÎãæµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$339b4d2b-2237-46a3-9867-ecc3332856c1Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ•

This expression repeats terms of the form $\nabla \pi(a \vert s) q_\pi(s, a)$ summed over different probabilities. The first appearance of this term is just a sum over all actions at the state $s$, which is the state we are using for the gradient expression. The next appearance of the expression is a sum over actions at state $s^\prime$. Let's define a new expression:

$$\begin{flalign} f(s) &\doteq \sum_a \nabla \pi(a \vert s) q_\pi(s, a) \\ \end{flalign}$$

Then we can rewrite the second term as follows:

$$\gamma \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) f(s^\prime) \right ] = \gamma \sum_{s^\prime} f(s^\prime) \sum_a \left [ \pi(a \vert s) p(s^\prime \vert s, a) \right ] = \gamma \mathbb{E}_\pi [f(s^\prime) \vert s] = \gamma \sum_{s^\prime} f(s ^\prime) \Pr \{ S_1 = s^\prime \mid S_0 = s, A_0 \sim \pi(s) \}$$

Define a new term $g(s) = \sum_{s^\prime} f(s^\prime) \Pr \{ S_1 = s^\prime \vert S_0 = s, A_0 \sim \pi(s) \} = \sum_{s^\prime} f(s^\prime) \sum_a [\pi(a \vert s) p(s^\prime \vert s, a)]$

So the second term can be written as $\gamma g(s)$

where the final expression uses the probability that the agent transitions from state $s$ to $s^\prime$ in one step under the policy $\pi$. Using this same logic, we can rewrite the third expression as well.

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô†•°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$339b4d2b-2237-46a3-9867-ecc3332856c1¹depends_on_disabled_cellsÂ§runtimeÎñ!µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$a8349352-3242-46d5-b0d5-1b6eb5d77e90Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ0¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•:Ø¾°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$a8349352-3242-46d5-b0d5-1b6eb5d77e90¹depends_on_disabled_cellsÂ§runtimeÎU£µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$7d63b960-3998-4f7b-8cbb-ccd49db9aeacŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements’’´action_probabilities’…¦prefix§Float32¨elements’’’«0.000138965ªtext/plain’’¨0.999861ªtext/plain¤type¥Array¬prefix_short ¨objectid°d09c2324e4b17111Ù!application/vnd.pluto.tree+object’´state_value_estimate’¨-96.5535ªtext/plain¤typeªNamedTuple¨objectid°26386ed69de54735¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÀ²last_run_timestampËAÚ•'b~°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$7d63b960-3998-4f7b-8cbb-ccd49db9aeac¹depends_on_disabled_cellsÂ§runtimeÎŒñ7µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$65d2add6-fd6f-456c-92ed-3cd9d1862ef6Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ=update_binary_policy_params! (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•Ð@’°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$65d2add6-fd6f-456c-92ed-3cd9d1862ef6¹depends_on_disabled_cellsÂ§runtimeÎ!*µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$f55afa58-962d-4551-8d95-a5b467d61adfŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ?update_params_with_gradient! 
(generic function with 10 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•#U¿°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$f55afa58-962d-4551-8d95-a5b467d61adf¹depends_on_disabled_cellsÂ§runtimeÎLjÀµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$d9d11d69-bc16-400a-8f46-f9a8ecb8516aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙNactor_critic_binary_episodic_parameter_study (generic function with 2 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•=l °persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$d9d11d69-bc16-400a-8f46-f9a8ecb8516a¹depends_on_disabled_cellsÂ§runtimeÎ2€xµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$ed93259c-7b8b-46d7-97fb-f194e0e04b3aŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙCsetup_binary_beta_policy_arguments (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•'-Å¬°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$ed93259c-7b8b-46d7-97fb-f194e0e04b3a¹depends_on_disabled_cellsÂ§runtimeÎÉµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$d1ed25e6-60c6-411f-a541-99986e5da2c5Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ\reinforce_with_baseline_monte_carlo_control_linear_features (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•&’×8°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$d1ed25e6-60c6-411f-a541-99986e5da2c5¹depends_on_disabled_cellsÂ§runtimeÎ9–‘µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$b966b248-fb4d-457d-90f6-114370846242Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ7bad_continuous_action (generic function with 3 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•!œ‚Ç°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$b966b248-fb4d-457d-90f6-114370846242¹depends_on_disabled_cellsÂ§runtimeÎbµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$4156d955-9daf-4429-b152-e8332980fb9eŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements™’¬step_rewards’…¦prefix§Float32¨elements¤type¥Array¬prefix_short 
¨objectid°3c0520dc1d635d04Ù!application/vnd.pluto.tree+object’episode_steps’…¦prefix¥Int64¨elements›’’¥30971ªtext/plain’’¥33158ªtext/plain’’¥36744ªtext/plain’’¥39697ªtext/plain’’¥42025ªtext/plain’’¥44282ªtext/plain’’¥45403ªtext/plain’’¥47954ªtext/plain’ ’¥49838ªtext/plain¤more’E’¥99724ªtext/plain¤type¥Array¬prefix_short ¨objectid°402d7f32a5b41291Ù!application/vnd.pluto.tree+object’¯episode_rewards’…¦prefix§Float32¨elements›’’¨-30970.0ªtext/plain’’§-2187.0ªtext/plain’’§-3586.0ªtext/plain’’§-2953.0ªtext/plain’’§-2328.0ªtext/plain’’§-2257.0ªtext/plain’’§-1121.0ªtext/plain’’§-2551.0ªtext/plain’ ’§-1884.0ªtext/plain¤more’E’¦-532.0ªtext/plain¤type¥Array¬prefix_short ¨objectid°68896aa3766bea26Ù!application/vnd.pluto.tree+object’±policy_parameters’Ú$1452Ã—2 Matrix{Float32}: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 â‹® 0.0437175 -0.0569757 0.0326978 -0.0276905 0.00138512 0.0006306 0.0 0.0 0.0 0.0 0.0 0.0ªtext/plain’°value_parameters’…¦prefix§Float32¨elements›’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’’£0.0ªtext/plain’ ’£0.0ªtext/plain¤more’Í¬’£0.0ªtext/plain¤type¥Array¬prefix_short ¨objectid°f01afe6ea747a0b0Ù!application/vnd.pluto.tree+object’¯policy_function’¢Ï€ªtext/plain’´policy_sample_action’©Ï€_sampleªtext/plain’´estimate_state_value’´estimate_state_valueªtext/plain’°policy_and_value’°policy_and_valueªtext/plain¤typeªNamedTuple¨objectid°7aefd39e1e4696e7¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÙ,const mountaincar_continuous_test_train_beta²last_run_timestampËAÚ•>ªXâ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$4156d955-9daf-4429-b152-e8332980fb9e¹depends_on_disabled_cellsÂ§runtimeÎ*<Åµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$b09e1e48-494e-4967-826a-6e70199acad4Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙC

Squashed Gaussian Alternative

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô“E•°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$b09e1e48-494e-4967-826a-6e70199acad4¹depends_on_disabled_cellsÂ§runtimeÎå(µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$734573e5-547b-4dcc-89bb-412aa6cc42d6Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙEactor_critic_linear_parameter_study (generic function with 2 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•+È{ö°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$734573e5-547b-4dcc-89bb-412aa6cc42d6¹depends_on_disabled_cellsÂ§runtimeÎx",µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙKactor_critic_with_eligibility_traces_fcann (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•/°Bÿ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54¹depends_on_disabled_cellsÂ§runtimeÎEm$µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$692c1043-4eaf-491e-b8fe-368618867f99Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ“

The soft-max distribution is:

$$\sigma(a|s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}}$$

We only have two possible actions in each state so the policy for action 1 would be given by:

$$\pi(1|S_t, \theta_t) = \frac{e^{h(S_t, 1, \theta_t)}}{e^{h(S_t, 0, \theta_t)} + e^{h(S_t, 1, \theta_t)}}$$

Simplify this expression by dividing the numerator and denominator by $e^{h(S_t, 1, \theta_t)}$, which results in:

$$\pi(1|S_t, \theta_t) = \frac{1}{e^{h(S_t, 0, \theta_t) - h(S_t, 1, \theta_t)} + 1}$$

Given the assumption that $h(s, 1, \theta)-h(s, 0, \theta) = \theta^\top\mathbf{x}(s)$, we replace the expression in the exponent resulting in the final expression of:

$$\pi(1|S_t, \theta_t) = \frac{1}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1}$$

Using the notation $f(x) = 1/(1+e^{-x})$ we can write $\pi(1|S_t, \theta_t) = f(\theta_t^\top \mathbf{x}(S_t))$ where $f$ is the logistic function. Consider this notation for the rest of the exercises.

The REINFORCE update is given by: $\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla\pi(A_t|S_t, \theta_t)}{\pi(A_t|S_t, \theta_t)}$, so we need to compute the gradient of the policy in terms of the parameters for this action selection: $\nabla \pi(1|S_t, \theta_t)$. Luckily, the derivative of the logistic function is simply given by: $f(x)(1-f(x))$ where $f(x)$ is the logistic function itself. In our case $x = \theta_t^\top \mathbf{x}_t$ so after applying the chain rule we have:

$$\nabla\pi(1|S_t, \theta_t) = f(x)(1-f(x))\nabla x = f(x)(1-f(x)) \mathbf{x}_t$$

since $x$ is just a linear function of the parameters. So for the parameter update step we have:

$$\frac{\nabla\pi(1|S_t, \theta_t)}{\pi(1|S_t, \theta_t)} = \frac{f(x)(1-f(x))\mathbf{x}_t}{f(x)} = (1 - f(x))\mathbf{x}_t$$

Also note that:

$$1 - f(x) = 1 - \frac{1}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1} = \frac{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1 - 1}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1} = \frac{e^{-\theta_t^\top\mathbf{x}(S_t)}}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1}$$

For $A_t = 1$, the REINFORCE update will then be:

$$\theta_{t+1} = \theta_t + \alpha G_t \left ( \frac{e^{-\theta_t^\top\mathbf{x}(S_t)}}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1} \right ) \mathbf{x}_t$$

For the general case, we want to calculate $\frac{\nabla\pi(a|s, \theta)}{\pi(a|s, \theta)}$. We already know this expression for $a = 1$.

$$\nabla {\pi(1|s, \mathbf{\theta})} = f(x)(1 - f(x))\mathbf{x}(s) = \pi(1|s, \mathbf{\theta})(1 - \pi(1|s, \mathbf{\theta}))\mathbf{x}(s)$$

Since $\pi(a|s, \theta)$ is a probability distribution across actions, we also know that

$$\pi(0|s, \theta) = 1 - \pi(1|s, \theta)$$

which implies that

$$\nabla \pi(0|s, \theta) = -\nabla \pi(1|s, \theta) = -\pi(1|s, \mathbf{\theta})(1 - \pi(1|s, \mathbf{\theta}))\mathbf{x}(s)$$

We can express this in terms of $\pi(0|s, \theta)$ completely:

$$\nabla \pi(0|s, \theta) = (\pi(0|s, \mathbf{\theta}) - 1)\pi(0|s, \theta)\mathbf{x}(s) = -\pi(0|s, \theta)(1 - \pi(0|s, \mathbf{\theta}))\mathbf{x}(s)$$

Let's now compare the two expressions for the policy gradient at each action:

$$\begin{align} \nabla {\pi(1|s, \mathbf{\theta})} &= \pi(1|s, \mathbf{\theta})(1 - \pi(1|s, \mathbf{\theta}))\mathbf{x}(s) \\ \nabla \pi(0|s, \theta) &= -\pi(0|s, \theta)(1 - \pi(0|s, \mathbf{\theta}))\mathbf{x}(s) \\ \therefore \\ \nabla \pi(a|s, \theta) &= \chi (a) \pi(a|s, \theta)(1 - \pi(a|s, \mathbf{\theta}))\mathbf{x}(s) \\ \end{align}$$

Where $\chi (a)$ is a function that returns 1 for $a=1$ and -1 for $a=0$. There are many ways to achieve this but the following expression is simple and works: $\chi(a) = 2a - 1$. Dividing by the policy yields a unified expression for the eligibility vector:

$$\nabla \ln{\pi(a|s,\theta)} = (2a - 1) (1 - \pi(a|s, \mathbf{\theta}))\mathbf{x}(s)$$
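The closed form above can be sanity-checked numerically. The Python sketch below implements the logistic policy $\pi(1|s, \theta) = f(\theta^\top \mathbf{x}(s))$ and the eligibility vector $(2a - 1)(1 - \pi(a|s, \theta))\mathbf{x}(s)$, then compares the latter with a central finite-difference gradient of $\ln \pi$. The function names and the plain-list representation are illustrative assumptions and are not taken from the notebook.

```python
import math

def pi_binary(a, theta, x):
    """Probability of action a in {0, 1} under the logistic policy."""
    z = sum(t * xi for t, xi in zip(theta, x))
    p1 = 1.0 / (1.0 + math.exp(-z))  # pi(1 | s, theta) = f(theta^T x)
    return p1 if a == 1 else 1.0 - p1

def eligibility(a, theta, x):
    """Closed form: (2a - 1) * (1 - pi(a | s, theta)) * x(s)."""
    scale = (2 * a - 1) * (1.0 - pi_binary(a, theta, x))
    return [scale * xi for xi in x]

def numeric_grad_ln_pi(a, theta, x, h=1e-6):
    """Central finite-difference gradient of ln pi(a | s, theta)."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += h
        down = list(theta)
        down[i] -= h
        diff = math.log(pi_binary(a, up, x)) - math.log(pi_binary(a, down, x))
        grad.append(diff / (2 * h))
    return grad
```

For any test point, `eligibility(a, theta, x)` and `numeric_grad_ln_pi(a, theta, x)` should agree to within finite-difference error for both `a = 0` and `a = 1`.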

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô’Ý°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$692c1043-4eaf-491e-b8fe-368618867f99¹depends_on_disabled_cellsÂ§runtimeÎÝ+µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$2c5d221a-2469-49e1-9249-dfdc2457f2faŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ®¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ• ûÜ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$2c5d221a-2469-49e1-9249-dfdc2457f2fa¹depends_on_disabled_cellsÂ§runtimeÎaµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$7c592385-e8d3-4efe-962c-d39debb64405Š¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements’’¬num_features’¤1452ªtext/plain’³get_active_features’¡fªtext/plain¤typeªNamedTuple¨objectid°349d5e4a8f483b41¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeÙ"const mountaincar_tilecoding_setup²last_run_timestampËAÚ•=,…Q°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$7c592385-e8d3-4efe-962c-d39debb64405¹depends_on_disabled_cellsÂ§runtimeÎZ^µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebŠ¦queuedÂ¤logs§runningÂ¦output†¤body ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•>³ù©°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb¹depends_on_disabled_cellsÂ§runtimeÎ)Žµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$8eab55a5-41b7-4f5e-a02f-4c19388bc9eaŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ>update_binary_feature_vector! (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•Ÿó°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$8eab55a5-41b7-4f5e-a02f-4c19388bc9ea¹depends_on_disabled_cellsÂ§runtimeÎ¤µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$0ac7ea44-14f6-4e80-80f9-d6df8059bb38Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ?reinforce_monte_carlo_control! 
(generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•#n$l°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$0ac7ea44-14f6-4e80-80f9-d6df8059bb38¹depends_on_disabled_cellsÂ§runtimeÎ/±Eµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$5ffc271f-c73f-494a-9727-8d7516af2191Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚd=

$\lambda_\theta$: 0.8

$\lambda_\mathbf{w}$: 0.15

$\alpha_{\overline{r}}$:

$\log_2 \alpha_\theta$ min:

$\log_2 \alpha_{\mathbf{w}}$ min:

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ• üÔö°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$5ffc271f-c73f-494a-9727-8d7516af2191¹depends_on_disabled_cellsÂ§runtimeÎ¶¥Èµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$c5a2879c-e89b-47f7-bbd6-48200d7e89e3Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ`actor_critic_binary_episodic_squashed_gaussian_parameter_study (generic function with 3 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•/õ™°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$c5a2879c-e89b-47f7-bbd6-48200d7e89e3¹depends_on_disabled_cellsÂ§runtimeÎ%Ùlµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$537270ba-122b-4f2b-880b-31d086766295Š¦queuedÂ¤logs§runningÂ¦output†¤bodyContinuousMDP¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•!€^°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$537270ba-122b-4f2b-880b-31d086766295¹depends_on_disabled_cellsÂ§runtimeÎr€gµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$dc2efc6c-8da8-425b-aa5f-290949109565Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÛE>

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•@ÔI°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$dc2efc6c-8da8-425b-aa5f-290949109565¹depends_on_disabled_cellsÂ§runtimeÎX1¨µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$a019925a-460a-410e-a54b-50a4cfe0e90eŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ6# ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•!ô®À°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$a019925a-460a-410e-a54b-50a4cfe0e90e¹depends_on_disabled_cellsÂ§runtimeÎ€Jµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$f92bb265-4b19-4f0e-a698-d7547bb6dd41Š¦queuedÂ¤logs§runningÂ¦output†¤body ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•˜®.°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$f92bb265-4b19-4f0e-a698-d7547bb6dd41¹depends_on_disabled_cellsÂ§runtimeÎ½µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$ac9c8845-284d-4c21-b05d-d930f86598a3Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ´¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•=£^Ì°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$ac9c8845-284d-4c21-b05d-d930f86598a3¹depends_on_disabled_cellsÂ§runtimeÎ“#µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$192cc1cf-9ea1-492d-baa7-f2e197abecd4Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ¨¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•=rÒ•°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$192cc1cf-9ea1-492d-baa7-f2e197abecd4¹depends_on_disabled_cellsÂ§runtimeÎ×Üµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547dŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ†1¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•7ûÝÊ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547d¹depends_on_disabled_cellsÂ§runtimeÎêAµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$c8b47eac-2d45-419a-bec6-2ae0cdc59393Š¦queuedÂ¤logs§runningÂ¦output†¤body 
¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•!Žú‹°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$c8b47eac-2d45-419a-bec6-2ae0cdc59393¹depends_on_disabled_cellsÂ§runtimeÎ

Chapter 13 Policy Gradient Methods Introduction

Instead of selecting actions based on action-value estimates, we learn a parameterized policy with parameters $\boldsymbol{\theta}$. $\pi(a|s, \boldsymbol{\theta}) = \text{Pr}\{A_t=a|S_t=s, \boldsymbol{\theta}_t=\boldsymbol{\theta}\}$ denotes the probability that action $a$ is taken at time $t$ given that the environment is in state $s$ at time $t$ with parameter $\boldsymbol{\theta}$.

We consider methods that improve the policy parameter using the gradient of some scalar performance measure $J(\boldsymbol{\theta})$ with respect to the policy parameters. We follow gradient ascent since we are trying to maximize this value. Methods that use this approach are called policy gradient methods. Methods that learn approximations to both policy and value functions are often called actor-critic methods, where 'actor' is a reference to the learned policy, and 'critic' refers to the learned value function, usually a state-value function.

13.1 Policy Approximation and its Advantages

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô‚0°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$36a6e43f-6bcf-4c27-bfbb-047760e77ada¹depends_on_disabled_cellsÂ§runtimeÎµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$436c52d2-280b-4ca4-9360-d6587b8254c7Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ˜

In order to test this algorithm we need a continuing task, which lacks a terminal state. We could simply modify the corridor MDP to be a continuing task by altering the reward structure so that a reward of 1 is received upon moving to the right from state 3, after which the state is reset to 1. See below for a version of this MDP updated to be a continuing problem.

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ôŒÑú°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$436c52d2-280b-4ca4-9360-d6587b8254c7¹depends_on_disabled_cellsÂ§runtimeÎ2rµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$e96d592d-1e54-486d-8ad9-b857f85476e8Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙDactor_critic_linear_parameter_study (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ• e|1°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$e96d592d-1e54-486d-8ad9-b857f85476e8¹depends_on_disabled_cellsÂ§runtimeÎ2µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6cŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ/b ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•*Ü1 °persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6c¹depends_on_disabled_cellsÂ§runtimeÎÑèQµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$4da20fd7-b897-4f26-bf2a-f08d66ddf90fŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙGactor_critic_with_eligibility_traces! 
(generic function with 4 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•+ ÔN°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$4da20fd7-b897-4f26-bf2a-f08d66ddf90f¹depends_on_disabled_cellsÂ§runtimeÎ€}lµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$11ea640c-3981-404d-87c6-4d3d0708a2b8Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙMactor_critic_linear_episodic_parameter_study (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•+Ç–ž°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$11ea640c-3981-404d-87c6-4d3d0708a2b8¹depends_on_disabled_cellsÂ§runtimeÎ€E.µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$281360af-46bf-4c73-bf11-3cb1153ad3e2Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÀ¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampË°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$281360af-46bf-4c73-bf11-3cb1153ad3e2¹depends_on_disabled_cellsÃ§runtimeÀµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690cŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙNupdate_squashed_gaussian_eligibility_vector! (generic function with 4 methods)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•#^Ñå°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690c¹depends_on_disabled_cellsÂ§runtimeÎ#¤µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$da3cb392-78f2-48b2-b0dc-5f016664798cŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ—%Total Reward: -142.0

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•AN©°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$da3cb392-78f2-48b2-b0dc-5f016664798c¹depends_on_disabled_cellsÂ§runtimeÎ Vóµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$dca2f8e2-76af-4679-bf81-3824c15fc76dŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyƒ¨elements™’¬step_rewards’…¦prefix§Float32¨elements¤type¥Array¬prefix_short ¨objectid°9f06ece3690faeddÙ!application/vnd.pluto.tree+object’episode_steps’…¦prefix¥Int64¨elements›’’¢19ªtext/plain’’¢34ªtext/plain’’¢61ªtext/plain’’¢75ªtext/plain’’¢90ªtext/plain’’£117ªtext/plain’’£135ªtext/plain’’£178ªtext/plain’ ’£204ªtext/plain¤more’Í’¥99994ªtext/plain¤type¥Array¬prefix_short ¨objectid°245ae93c73b7a831Ù!application/vnd.pluto.tree+object’¯episode_rewards’…¦prefix§Float32¨elements›’’¤18.0ªtext/plain’’¤15.0ªtext/plain’’¤27.0ªtext/plain’’¤14.0ªtext/plain’’¤15.0ªtext/plain’’¤27.0ªtext/plain’’¤18.0ªtext/plain’’¤43.0ªtext/plain’ ’¤26.0ªtext/plain¤more’Í’¤11.0ªtext/plain¤type¥Array¬prefix_short ¨objectid¯1b64855ae6e6478Ù!application/vnd.pluto.tree+object’±policy_parameters’Ùê52488Ã—3 Matrix{Float32}: NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN â‹® NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaNªtext/plain’°value_parameters’…¦prefix§Float32¨elements›’’£NaNªtext/plain’’£NaNªtext/plain’’£NaNªtext/plain’’£NaNªtext/plain’’£NaNªtext/plain’’£NaNªtext/plain’’£NaNªtext/plain’’£NaNªtext/plain’ ’£NaNªtext/plain¤more’ÍÍ’£NaNªtext/plain¤type¥Array¬prefix_short ¨objectid°d26d759cd1b52e3fÙ!application/vnd.pluto.tree+object’¯policy_function’¢Ï€ªtext/plain’´policy_sample_action’©Ï€_sampleªtext/plain’´estimate_state_value’´estimate_state_valueªtext/plain’°policy_and_value’°policy_and_valueªtext/plain¤typeªNamedTuple¨objectid°e492fba9453c9690¤mimeÙ!application/vnd.pluto.tree+object¬rootassigneeµconst 
reinforce_test3²last_run_timestampËAÚ•2k;Å°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$dca2f8e2-76af-4679-bf81-3824c15fc76d¹depends_on_disabled_cellsÂ§runtimeÏG>Åµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$8019bec9-1228-407b-9199-2fe29f26a981Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ à

Exercise 13.1

Use your knowledge of the gridworld and its dynamics to determine an exact symbolic expression for the optimal probability of selecting the right action in Example 13.1

Example 13.1 is a gridworld with 3 non-terminal states and a terminal state at the far right. The reward is -1 per step. States 1 and 3 have actions left/right that move in the expected directions but state 2 reverses the directions. We use a performance measure $J(\mathbf{\theta}) = v_{\pi_\theta}(S)$. Given our feature representations of $\mathbf{x}(s, \text{right}) = [1, 0]^{\top}$ and $\mathbf{x}(s, \text{left}) = [0, 1]^{\top}$, we can only learn policies that are stochastic in terms of left/right action selection but do not vary between states. Also observe that due to probability constraints $p_{\text{right}} = 1 - p_{\text{left}}$. For simplicity, we will use the notation $p \doteq p_{\text{left}}$ and the following for the three state values: $v_1, v_2, v_3$.

$$\begin{flalign} v_1 &= p \times v_1 + (1-p) \times v_2 - 1 \tag{1} \\ v_1 (1-p) &= v_2 (1-p) - 1 \\ v_1 &= v_2 - \frac{1}{1-p} \tag{1'}\\ v_2 &= p \times v_3 + (1-p) \times v_1 - 1 \tag{2} \\ v_3 &= p \times v_2 - 1 \tag{3}\\ v_2 &= p \times [p\times v_2 - 1] +(1-p) \times v_1 - 1 \tag{substituting 3 into 2} \\ v_2(1 - p^2) &= -p +(1-p) \times v_1 - 1 \\ v_2 &= \frac{(1-p) v_1 - (1+p)}{(1+p)(1-p)} \tag{collecting terms} \\ &= \frac{(1-p) v_2 - 1 - (1+p)}{(1+p)(1-p)} \tag{using 1'} \\ &= \frac{v_2}{1+p} - \frac{2 + p}{(1+p)(1-p)} \\ v_2 \left [1 - \frac{1}{1+p} \right ] &= - \frac{2 + p}{(1+p)(1-p)} \\ v_2 \frac{1+p-1}{1+p} &= - \frac{2 + p}{(1+p)(1-p)} \\ v_2 &= - \frac{2 + p}{(1-p)p} \\ v_1 &= - \frac{2 + p}{(1-p)p} - \frac{1}{1-p} \\ &= \frac{-2 - p - p}{(1-p)p} \\ &= -\frac{2 + 2p}{(1-p)p} \\ v_3 &= -\frac{2 + p}{1-p} - 1\\ &= \frac{-2 - p - 1 + p}{1-p}\\ &= -\frac{3}{1-p}\\ \end{flalign}$$

To summarize all the state values:

$$\begin{flalign} v_1 &= -\frac{2 + 2p}{(1-p)p} \\ v_2 &= - \frac{2 + p}{(1-p)p} \\ v_3 &= -\frac{3}{1-p} \end{flalign}$$
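These closed forms can be checked numerically. The Python sketch below (function names are illustrative assumptions, not from the notebook) evaluates the three state values for a given $p = p_{\text{left}}$, computes the residuals of equations (1) to (3), and runs a coarse grid search for the $p$ that maximizes $v_1$, the value of the start state. Setting $\mathrm{d}v_1/\mathrm{d}p = 0$ by hand gives $p^2 + 2p - 1 = 0$, so the search should land near $p = \sqrt{2} - 1$, which corresponds to $p_{\text{right}} = 2 - \sqrt{2}$.

```python
import math

def corridor_values(p):
    """State values (v1, v2, v3) when 'left' is chosen with probability p."""
    v1 = -(2 + 2 * p) / ((1 - p) * p)
    v2 = -(2 + p) / ((1 - p) * p)
    v3 = -3 / (1 - p)
    return v1, v2, v3

def bellman_residuals(p):
    """Residuals of equations (1), (2), (3); each should be close to zero."""
    v1, v2, v3 = corridor_values(p)
    return (p * v1 + (1 - p) * v2 - 1 - v1,
            p * v3 + (1 - p) * v1 - 1 - v2,
            p * v2 - 1 - v3)

# Coarse grid search over p in (0, 1) for the maximum start-state value.
grid = [i / 10000 for i in range(1, 10000)]
best_p_left = max(grid, key=lambda p: corridor_values(p)[0])
best_p_right = 1 - best_p_left
```

The grid search is only a cross-check on the algebra. The exact optimum comes from the quadratic above.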

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô…°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$8019bec9-1228-407b-9199-2fe29f26a981¹depends_on_disabled_cellsÂ§runtimeÎÁ(µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$fd964539-2baf-4ff1-b286-5a0bb1b222c4Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÚ

The beta distribution has two parameters like the normal distribution but is only defined from 0 to 1. The two parameters $\alpha$ and $\beta$ are positive real numbers and control the shape of the distribution. The density function is given below:

$$f(x; \alpha, \beta) = \frac{x^{\alpha-1} (1-x)^{\beta - 1}}{\text{B}(\alpha, \beta)}$$

where $\text{B}(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}$ and $\Gamma(z) = \int_0^\infty t^{z-1}e^{-t} \text{d} t$

We saw earlier from the treatment of the Gaussian distribution that we need the gradient of a function of each distribution parameter with respect to the parameters of the function approximation. Luckily, maximum likelihood estimation already uses the gradient we are interested in for this distribution. Note that the log-likelihood function for a single sample $X$ of a random variable that follows the beta distribution is given by $\mathcal{L}(\alpha, \beta \vert X) = \ln(f(X; \alpha, \beta))$, and the partial derivative of this function with respect to each parameter $\alpha$ and $\beta$ is given by:

$$\frac{\partial \mathcal{L}(\alpha, \beta, \vert X)}{\partial \alpha} = \ln X - \frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \alpha}$$

$$\frac{\partial \mathcal{L}(\alpha, \beta, \vert X)}{\partial \beta} = \ln (1-X) - \frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \beta}$$

where $\frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \alpha} = -\psi(\alpha + \beta) + \psi(\alpha)$ and $\frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \beta} = -\psi(\alpha + \beta) + \psi(\beta)$ and $\phi(\alpha)$ is the digamma function which is just the derivative of the logarithm of the gamma function.

Since both $\alpha$ and $\beta$ must be greater than zero, we can estimate each one by applying the exponential function to a dot product of a parameter vector with the feature vector: $\alpha(s, \boldsymbol{\theta}) \doteq \exp \left (\boldsymbol{\theta}_\alpha^\top \mathbf{x}(s) \right )$ and $\beta(s, \boldsymbol{\theta}) \doteq \exp \left (\boldsymbol{\theta}_\beta^\top \mathbf{x}(s) \right )$.

The eligibility vector for this distribution is then:

$$\nabla \ln f(a \vert \alpha(s, \boldsymbol{\theta}_\alpha), \beta(s, \boldsymbol{\theta}_\beta))$$

where $\alpha$ is a function of its own parameter vector $\boldsymbol{\theta}_\alpha$ and $\beta$ is a function of the other parameter vector $\boldsymbol{\theta}_\beta$. The gradient components corresponding to each vector depend only on the partial derivative of the log-density with respect to $\alpha$ or $\beta$, respectively. That is, since $\frac{\partial \alpha}{\partial \theta_{\beta_i}} = 0 \; \forall i$ and vice versa, we can treat each part of the gradient separately.

$$\begin{flalign} \nabla_{\boldsymbol{\theta}_\alpha} \ln f(a \vert \alpha, \beta) &= \frac{\partial \ln f(a \vert \alpha, \beta)}{\partial \alpha} \nabla_{\boldsymbol{\theta}_\alpha}\alpha \\ &= \left ( \ln a - \psi(\alpha) + \psi(\alpha + \beta) \right ) \nabla_{\boldsymbol{\theta}_\alpha} \alpha \\ &= \left ( \ln a - \psi(\alpha) + \psi(\alpha + \beta) \right ) \nabla_{\boldsymbol{\theta}_\alpha} \exp \left ( \boldsymbol{\theta}_\alpha^\top \mathbf{x}(s) \right ) \\ &= \left ( \ln a - \psi(\alpha) + \psi(\alpha + \beta) \right ) \alpha \mathbf{x}(s)\\ \end{flalign}$$

$$\begin{flalign} \nabla_{\boldsymbol{\theta}_\beta} \ln f(a \vert \alpha, \beta) &= \frac{\partial \ln f(a \vert \alpha, \beta)}{\partial \beta} \nabla_{\boldsymbol{\theta}_\beta}\beta \\ &= \left ( \ln (1-a) - \psi(\beta) + \psi(\alpha + \beta) \right ) \nabla_{\boldsymbol{\theta}_\beta} \beta \\ &= \left ( \ln (1-a) - \psi(\beta) + \psi(\alpha + \beta) \right ) \nabla_{\boldsymbol{\theta}_\beta} \exp \left ( \boldsymbol{\theta}_\beta^\top \mathbf{x}(s) \right ) \\ &= \left ( \ln (1-a) - \psi(\beta) + \psi(\alpha + \beta) \right ) \beta \mathbf{x}(s)\\ \end{flalign}$$
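A minimal sketch of this eligibility vector follows, built from the partial derivatives of the log-likelihood stated earlier: $\ln a - \psi(\alpha) + \psi(\alpha + \beta)$ for $\alpha$ and $\ln(1-a) - \psi(\beta) + \psi(\alpha + \beta)$ for $\beta$. The names `digamma_approx` and `beta_eligibility` are hypothetical and not taken from the notebook. The digamma function is hand-rolled from its recurrence and asymptotic series only so that the example needs no packages; a special-functions library would normally supply it.

```julia
# Approximate digamma ψ(z) for z > 0: shift z upward with the recurrence
# ψ(z) = ψ(z + 1) - 1/z, then apply the asymptotic expansion
# ψ(x) ≈ ln x - 1/(2x) - 1/(12x²) + 1/(120x⁴) - 1/(252x⁶).
function digamma_approx(z::Real)
    z > 0 || throw(DomainError(z, "digamma_approx expects z > 0"))
    acc = 0.0
    x = float(z)
    while x < 6
        acc -= 1 / x
        x += 1
    end
    inv2 = 1 / x^2
    return acc + log(x) - 1 / (2 * x) - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
end

# Eligibility vector of a beta policy with α = exp(θα' x) and β = exp(θβ' x).
# The gradient blocks for θα and θβ are returned separately.
function beta_eligibility(a, x, θα, θβ)
    α = exp(sum(θα .* x))
    β = exp(sum(θβ .* x))
    ψ_sum = digamma_approx(α + β)
    score_α = log(a) - digamma_approx(α) + ψ_sum       # ∂ ln f / ∂α
    score_β = log(1 - a) - digamma_approx(β) + ψ_sum   # ∂ ln f / ∂β
    return (grad_alpha = (score_α * α) .* x, grad_beta = (score_β * β) .* x)
end
```

With both parameter vectors at zero, $\alpha = \beta = 1$ and the policy is uniform on the unit interval. In that case the $\alpha$ score reduces to $\ln a + 1$, so it vanishes at $a = e^{-1}$, which is a convenient spot check.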


Continuing Case Actor-Critic Implementation

Note that this function has the same name as the episodic version. The only difference other than keyword arguments is that the max_episodes argument is missing. Since we already defined the versions of the algorithm for linear and non-linear cases in a generic manner, we only need to define the core version of this algorithm and the other functions will dispatch to it if they are called without the max_episodes argument.
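The dispatch pattern described here can be illustrated with a toy example. The function names and argument lists below are hypothetical stand-ins rather than the notebook's actual signatures; the point is only that two methods sharing a name are distinguished by whether the `max_episodes` positional argument is present, and that a generic wrapper forwarding its arguments reaches the right one.

```julia
# Episodic core method: takes a positional max_episodes argument.
run_learner(env, max_episodes::Integer, max_steps::Integer) =
    (:episodic, max_episodes, max_steps)

# Continuing core method: same name, but without max_episodes.
run_learner(env, max_steps::Integer) = (:continuing, max_steps)

# A generic wrapper (standing in for the linear or non-linear front ends)
# forwards whatever positional arguments it receives, so Julia's dispatch
# selects the episodic or continuing core automatically.
run_linear_learner(env, args...; kwargs...) = run_learner(env, args...; kwargs...)
```

Calling `run_linear_learner(:env, 10, 100)` reaches the episodic method, while `run_linear_learner(:env, 100)` reaches the continuing one.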


Test REINFORCE With and Without Baseline

The following function calls execute the REINFORCE algorithm on Example 13.1. The output displayed is the policy function acting on the single state representation for the problem. The two values represent the probability of taking the left and right action respectively. If converged properly, the right action probability should be higher, approaching a value of about 60%.


In policy gradient methods, the policy can be parameterized in any way, as long as $\pi(a \vert s, \boldsymbol{\theta})$ is differentiable with respect to its parameters, that is, as long as $\nabla \pi(a \vert s, \boldsymbol{\theta})$ exists and is finite for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$, and $\boldsymbol{\theta} \in \mathbb{R}^{d^\prime}$ where $d^\prime$ is the number of parameters.

If the action space is discrete and not too large, then we can have numerical preferences $h(s, a, \boldsymbol{\theta})$ for each state/action pair, parameterized by $\boldsymbol{\theta}$. The corresponding policy can then select actions according to the probability distribution generated by the soft-max: $\pi(a|s, \boldsymbol{\theta}) \doteq \frac{\exp{h(s, a, \boldsymbol{\theta})}}{\sum_b \exp{h(s, b, \boldsymbol{\theta})}}$. One advantage of using the soft-max is that the learned policy can remain stochastic when the optimal policy is stochastic, or it can approach a deterministic policy that selects the action with the highest probability. If we include a temperature parameter in the soft-max, then we can vary the same policy to be more or less stochastic as needed.

If we calculate preferences with linear features, then we would have feature vectors $\mathbf{x}(s, a) \in \mathbb{R}^{d^\prime}$ to match the parameter vector $\boldsymbol{\theta} \in \mathbb{R}^{d^\prime}$. The preferences would then be calculated as:

$$h(s, a, \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s, a)$$

Another advantage is that for some problems the policy may be easier to approximate than the action-value function. We can also inject some prior knowledge of the environment into how the policy is parametrized.
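The soft-max over linear preferences described above can be sketched as follows. The function name `softmax_policy`, the `temperature` keyword, and the layout of `features` (one column $\mathbf{x}(s, a)$ per action) are assumptions made for this illustration and do not come from the notebook.

```julia
# Soft-max policy over linear action preferences h(s, a, θ) = θ' x(s, a).
# `features` is a d′ × |A| matrix whose columns are the vectors x(s, a)
# for a fixed state s. Larger temperatures give a more uniform policy.
function softmax_policy(θ::AbstractVector, features::AbstractMatrix; temperature = 1.0)
    h = (features' * θ) ./ temperature   # one preference per action
    h = h .- maximum(h)                  # subtract the max for numerical stability
    exp_h = exp.(h)
    return exp_h ./ sum(exp_h)
end
```

With two actions, one-hot features and $\boldsymbol{\theta} = (\ln 3, 0)$, the preferences are $\ln 3$ and $0$, so the action probabilities are $3/4$ and $1/4$. Raising the temperature moves both toward $1/2$.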

Output of mountaincar_continuous_test_train2 (truncated): episode_steps = 8703, 9879, 10695, 11518, 12342, 13106, 13404, 14030, 14749, …, 99984; episode_rewards = -8702.0, -1176.0, -816.0, -823.0, -824.0, -764.0, -298.0, -626.0, -719.0, …, -66.0.


Exercise 13.4

For the Gaussian policy parameterization, derive the formula for the eligibility vector $\nabla \ln{\pi(a|s, \mathbf{\theta})}$

Starting with our expression for the policy, we can calculate the gradient:

$$\nabla \pi(a|s, \mathbf{\theta}) = \nabla \left ( \frac{1}{\sigma(s, \mathbf{\theta}) \sqrt{2\pi}} \exp \left ( - \frac{(a-\mu(s, \mathbf{\theta}))^2}{2\sigma(s, \mathbf{\theta})^2} \right ) \right )$$

We will eventually need $\nabla \mu$ and $\nabla \sigma$ so let's calculate them now.

$$\nabla (\sigma(s, \mathbf{\theta})) = \nabla \exp{( \mathbf{\theta}_\sigma ^ \top \mathbf{x}_\sigma (s))} = \sigma(s, \mathbf{\theta})\mathbf{x}_\sigma (s)$$

$$\nabla(\mu(s, \mathbf{\theta})) = \nabla ( \mathbf{\theta}_\mu ^\top \mathbf{x}_\mu(s)) = \mathbf{x}_\mu (s)$$

The first application of the quotient rule is straightforward. I will omit the input arguments to $\mu$ and $\sigma$, keeping in mind that both are functions of the parameters. Also let $\left ( - \frac{(a-\mu)^2}{2\sigma^2} \right ) = f(\mu, \sigma)$, which results in $\pi(a|s, \mathbf{\theta}) = \frac{1}{\sigma \sqrt{2\pi}} \exp{(f(\mu, \sigma))}$. Therefore:

$$\begin{flalign} \nabla \pi(a|s, \mathbf{\theta}) \sqrt{2\pi} &= \frac{1}{\sigma ^2} \left (- \exp{(f(\mu, \sigma))} \nabla \sigma + \sigma \exp{(f(\mu, \sigma))}\nabla f(\mu, \sigma) \right ) \\ &= \frac{1}{\sigma ^2} \left ( -\exp{(f(\mu, \sigma))} \sigma\mathbf{x}_\sigma + \sigma \exp{(f(\mu, \sigma))}\nabla f(\mu, \sigma) \right ) \\ &=\frac{\exp{(f(\mu, \sigma))}}{\sigma} \left (-\mathbf{x}_\sigma + \nabla f(\mu, \sigma) \right ) \\ \end{flalign}$$

Now we need only calculate the gradient of $f$:

$$\begin{flalign} \nabla f(\mu, \sigma) &= \frac{-1}{2} \nabla \left [ \frac{(a-\mu)^2}{\sigma^2} \right ] \\ & = \frac{-1}{2\sigma^4} \left [-2 \sigma^2 (a - \mu) \nabla \mu - (a - \mu)^2 2\sigma \nabla \sigma \right ] \\ & = \frac{-1}{\sigma^3} \left [ -\sigma (a - \mu) \nabla \mu - (a - \mu)^2 \nabla \sigma \right ] \\ & = \frac{-1}{\sigma^3} \left [ -\sigma (a - \mu) \mathbf{x}_\mu (s) - (a - \mu)^2 \sigma \mathbf{x}_\sigma \right ] \tag{substituting gradients}\\ & = \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu) \tag{simplifying}\\ \end{flalign}$$

Now substitute this back into the policy gradient:

$$\nabla \pi(a|s, \mathbf{\theta}) \sqrt{2\pi} = \frac{\exp{(f(\mu, \sigma))}}{\sigma} \left (-\mathbf{x}_\sigma + \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu) \right )$$

Furthermore, observe that $\pi(a|s, \mathbf{\theta}) = \frac{1}{\sigma\sqrt{2\pi}} \exp(f(\mu, \sigma))$

So our expression for the policy gradient is:

$$\nabla \pi(a|s, \mathbf{\theta}) = \pi(a|s, \mathbf{\theta}) \left (-\mathbf{x}_\sigma + \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu) \right )$$

To get the eligibility vector we must divide this by the policy which is conveniently already in the expression:

$$\begin{flalign} \frac{\nabla \pi(a|s, \mathbf{\theta})}{\pi(a|s, \mathbf{\theta})} &= -\mathbf{x}_\sigma + \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu)\\ &= \mathbf{x}_\mu \left [ \frac{(a - \mu)}{\sigma^2} \right ] + \mathbf{x}_\sigma \left [\frac{(a-\mu)^2}{\sigma^2} -1 \right ] \\ \end{flalign}$$

There are two components to the sum, one for $\mu$ and one for $\sigma$. If we think of the parameters and feature vectors as concatenated, then this sum is an element-by-element sum where $\mathbf{x}_\mu$ has a zero value for all the feature indices corresponding to $\sigma$ and vice versa. Carrying out the sum this way forms one complete vector that has gradient components for all the parameters $\mathbf{\theta}_\mu$ and $\mathbf{\theta}_\sigma$. Alternatively, the two terms can be kept apart, so that the gradient for $\mathbf{\theta}_\mu$ and the gradient for $\mathbf{\theta}_\sigma$ are each computed and applied separately throughout the calculation.
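The result can be checked numerically. The sketch below keeps the two gradient blocks separate, as in the second option just described, and compares them against a central finite difference of the log-density. All function names are illustrative and not from the notebook; the parameterization $\mu = \mathbf{\theta}_\mu^\top \mathbf{x}_\mu$ and $\sigma = \exp(\mathbf{\theta}_\sigma^\top \mathbf{x}_\sigma)$ is the one used in the derivation.

```julia
# Log-density of the Gaussian policy with μ = θμ' xμ and σ = exp(θσ' xσ).
function gaussian_logpdf(a, θμ, θσ, xμ, xσ)
    μ = sum(θμ .* xμ)
    σ = exp(sum(θσ .* xσ))
    return -log(σ) - log(2π) / 2 - (a - μ)^2 / (2 * σ^2)
end

# Eligibility vector from the derivation, returned as two separate blocks.
function gaussian_eligibility(a, θμ, θσ, xμ, xσ)
    μ = sum(θμ .* xμ)
    σ = exp(sum(θσ .* xσ))
    grad_mu = ((a - μ) / σ^2) .* xμ
    grad_sigma = ((a - μ)^2 / σ^2 - 1) .* xσ
    return (grad_mu = grad_mu, grad_sigma = grad_sigma)
end

# Central finite difference of f with respect to the i-th entry of θ.
function finite_difference(f, θ, i; h = 1e-6)
    e = zeros(length(θ))
    e[i] = h
    return (f(θ .+ e) - f(θ .- e)) / (2 * h)
end
```

For any action and parameter values, each entry of `grad_mu` and `grad_sigma` should agree with the corresponding finite difference of `gaussian_logpdf` up to numerical error.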


REINFORCE Implementation

¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ”ô‰Fþ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$89901156-b874-416b-89c1-6dc434a4eb17¹depends_on_disabled_cellsÂ§runtimeÎÜµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$ff76ef94-fdf5-41f3-a31a-21c4629efabeŠ¦queuedÂ¤logs§runningÂ¦output†¤body…¦prefixÙ—StateMDP{Float32, Int64, Symbol, StateMDPTransitionSampler{Float32, Int64, var"#step#1179"}, var"#1177#1180", var"#1178#1181", TabularRL.var"#164#169"}¨elements–’§actions’…¦prefix¦Symbol¨elements’’’¥:leftªtext/plain’’¦:rightªtext/plain¤type¥Array¬prefix_short ¨objectid¯5c1f673d599d276Ù!application/vnd.pluto.tree+object’£ptf’…¦prefixÙ:StateMDPTransitionSampler{Float32, Int64, var"#step#1179"}¨elements‘’¤step’ÙJ(::Main.var"workspace#8".var"#step#1179") (generic function with 1 method)ªtext/plain¤type¦struct¬prefix_short¹StateMDPTransitionSampler¨objectid°ffffffff142bed64Ù!application/vnd.pluto.tree+object’°initialize_state’Ùҙ (generic function with 1 method)ªtext/plain’¦isterm’ÙҚ (generic function with 1 method)ªtext/plain’¯is_valid_action’Ù%#164 (generic function with 1 method)ªtext/plain’¬action_index’…¦prefix³Dict{Symbol, Int64}¨elements’’’¥:leftªtext/plain’¡1ªtext/plain’’¦:rightªtext/plain’¡2ªtext/plain¤type¤Dict¬prefix_short¤Dict¨objectid°9ee985827f4f600dÙ!application/vnd.pluto.tree+object¤type¦struct¬prefix_short¨StateMDP¨objectid°4a9335336ab44967¤mimeÙ!application/vnd.pluto.tree+object¬rootassignee²const corridor_mdp²last_run_timestampËAÚ•³¢¥°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$ff76ef94-fdf5-41f3-a31a-21c4629efabe¹depends_on_disabled_cellsÂ§runtimeÎ ðµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$581f7e9b-a5c2-4841-9605-85f9585b0274Š¦queuedÂ¤logs§runningÂ¦output†¤bodyÙBupdate_linear_action_preferences! 
(generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•’§]°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$581f7e9b-a5c2-4841-9605-85f9585b0274¹depends_on_disabled_cellsÂ§runtimeÎ GVµpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$8aa16866-bfda-48df-9cf1-cf3d2e203ccbŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙYcartpole_tilecoding_reinforce_continuous_parameter_study (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•1CÌ°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$8aa16866-bfda-48df-9cf1-cf3d2e203ccb¹depends_on_disabled_cellsÂ§runtimeÎZµcµpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$04b5929a-2058-49c9-963a-96c752a1d67dŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÚó ¤mime©text/html¬rootassigneeÀ²last_run_timestampËAÚ•3¿ì°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$04b5929a-2058-49c9-963a-96c752a1d67d¹depends_on_disabled_cellsÂ§runtimeÎ µpublished_object_keys¸depends_on_skipped_cellsÃ§erroredÂÙ$f0104778-81a6-417b-8501-f916e5e7f3afŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ=make_corridor_continuing_mdp (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ• I_¦°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$f0104778-81a6-417b-8501-f916e5e7f3af¹depends_on_disabled_cellsÂ§runtimeÎ!ó±µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$3e3c5897-809f-46e3-bb58-f115b082443eŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙbactor_critic_with_eligibility_traces_binary_features_beta_actions (generic function with 1 method)¤mimeªtext/plain¬rootassigneeÀ²last_run_timestampËAÚ•/Æ‰Ê°persist_js_stateÂ·has_pluto_hook_featuresÂ§cell_idÙ$3e3c5897-809f-46e3-bb58-f115b082443e¹depends_on_disabled_cellsÂ§runtimeÎAJ4µpublished_object_keys¸depends_on_skipped_cellsÂ§erroredÂÙ$a9db3f85-ff56-4bbc-be87-47b893ef3b7bŠ¦queuedÂ¤logs§runningÂ¦output†¤bodyÙ // We start by putting all the variable interpolation here at the beginning // Publish the plot object to JS let plot_obj = {"layout": 
{"template": {"layout": {"coloraxis": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "xaxis": {"gridcolor": "white", "zerolinewidth": 2, "title": {"standoff": 15}, "ticks": "", "zerolinecolor": "white", "automargin": true, "linecolor": "white"}, "hovermode": "closest", "paper_bgcolor": "white", "geo": {"showlakes": true, "showland": true, "landcolor": "#E5ECF6", "bgcolor": "white", "subunitcolor": "white", "lakecolor": "white"}, "colorscale": {"sequential": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]], "diverging": [[0, "#8e0152"], [0.1, "#c51b7d"], [0.2, "#de77ae"], [0.3, "#f1b6da"], [0.4, "#fde0ef"], [0.5, "#f7f7f7"], [0.6, "#e6f5d0"], [0.7, "#b8e186"], [0.8, "#7fbc41"], [0.9, "#4d9221"], [1, "#276419"]], "sequentialminus": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}, "yaxis": {"gridcolor": "white", "zerolinewidth": 2, "title": {"standoff": 15}, "ticks": "", "zerolinecolor": "white", "automargin": true, "linecolor": "white"}, "shapedefaults": {"line": {"color": "#2a3f5f"}}, "hoverlabel": {"align": "left"}, "mapbox": {"style": "light"}, "polar": {"angularaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "bgcolor": "#E5ECF6", "radialaxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}}, "autotypenumbers": "strict", "font": {"color": "#2a3f5f"}, "ternary": {"baxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "bgcolor": "#E5ECF6", "caxis": {"gridcolor": "white", "ticks": "", "linecolor": "white"}, "aaxis": 
{"gridcolor": "white", "ticks": "", "linecolor": "white"}}, "annotationdefaults": {"arrowhead": 0, "arrowwidth": 1, "arrowcolor": "#2a3f5f"}, "plot_bgcolor": "#E5ECF6", "title": {"x": 0.05}, "scene": {"xaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}, "zaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}, "yaxis": {"gridcolor": "white", "gridwidth": 2, "backgroundcolor": "#E5ECF6", "ticks": "", "showbackground": true, "zerolinecolor": "white", "linecolor": "white"}}, "colorway": ["#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52"]}, "data": {"barpolar": [{"type": "barpolar", "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}}], "carpet": [{"aaxis": {"gridcolor": "white", "endlinecolor": "#2a3f5f", "minorgridcolor": "white", "startlinecolor": "#2a3f5f", "linecolor": "white"}, "type": "carpet", "baxis": {"gridcolor": "white", "endlinecolor": "#2a3f5f", "minorgridcolor": "white", "startlinecolor": "#2a3f5f", "linecolor": "white"}}], "scatterpolar": [{"type": "scatterpolar", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "parcoords": [{"line": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "type": "parcoords"}], "scatter": [{"type": "scatter", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram2dcontour": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "histogram2dcontour", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "contour": [{"colorbar": {"ticks": "", "outlinewidth": 0}, 
"type": "contour", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scattercarpet": [{"type": "scattercarpet", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "mesh3d": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "mesh3d"}], "surface": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "surface", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scattermapbox": [{"type": "scattermapbox", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scattergeo": [{"type": "scattergeo", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram": [{"type": "histogram", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "pie": [{"type": "pie", "automargin": true}], "choropleth": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "choropleth"}], "heatmapgl": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "heatmapgl", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "bar": [{"type": "bar", "error_y": {"color": "#2a3f5f"}, "error_x": {"color": "#2a3f5f"}, "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}}], "heatmap": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "heatmap", "colorscale": [[0.0, 
"#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "contourcarpet": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "contourcarpet"}], "table": [{"type": "table", "header": {"line": {"color": "white"}, "fill": {"color": "#C8D4E3"}}, "cells": {"line": {"color": "white"}, "fill": {"color": "#EBF0F8"}}}], "scatter3d": [{"line": {"colorbar": {"ticks": "", "outlinewidth": 0}}, "type": "scatter3d", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scattergl": [{"type": "scattergl", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "histogram2d": [{"colorbar": {"ticks": "", "outlinewidth": 0}, "type": "histogram2d", "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}], "scatterternary": [{"type": "scatterternary", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}], "scatterpolargl": [{"type": "scatterpolargl", "marker": {"colorbar": {"ticks": "", "outlinewidth": 0}}}]}}, "margin": {"l": 50, "b": 50, "r": 50, "t": 60}}, "config": {"showLink": false, "editable": false, "responsive": true, "staticPlot": false, "scrollZoom": true}, "frames": [], "data": [{"y": [Infinity, 4.667032721489843, 2.352080488429527, 1.5759894085474497, 1.1865767661252582, 0.952329542347748, 0.7958488403711905, 0.6838903357678036, 0.5998022307539005, 0.5343196607298939, 0.481876506359228, 0.43892664782610735, 0.4031035909389137, 0.3727674794521093, 0.3467459986828037, 0.3241787831808724, 0.3044200943416014, 0.28697580091461466, 0.27146133736832384, 
0.25757292023624623, 0.24506738751259943, 0.23374779010782812, 0.2234529073255359, 0.21404949351780772, 0.20542646028339206, 0.19749045291363473, 0.19016244616993805, 0.18337509544622463, 0.17707065470347907, 0.1711993245392484, 0.16571793015256217, 0.16058885480517326, 0.15577917295968144, 0.1512599407923877, 0.14700561172131324, 0.1429935519781872, 0.13920363679625566, 0.13561791198181308, 0.13222030884055902, 0.1289964028946112, 0.12593320873670621, 0.12301900485981125, 0.12024318347272628, 0.11759612123947868, 0.11506906761806092, 0.11265404806440946, 0.11034377984248032, 0.10813159856537688, 0.10601139390463747, 0.10397755315966713, 0.10202491158834057, 0.10014870857199591, 0.09834454883045635, 0.09660836802097619, 0.0949364021535767, 0.09332516033769218, 0.09177140044426524, 0.09027210732572798, 0.08882447328556972, 0.08742588053094563, 0.08607388537727942, 0.0847662040040796, 0.08350069958706603, 0.08227537065388697, 0.08108834052977568, 0.07993784775592509, 0.07882223737755151, 0.07773995301090596, 0.07668952960915504, 0.07566958685632605, 0.07467882312659452, 0.0737160099532511, 0.07277998695786166, 0.07186965719555248, 0.07098398287710962, 0.0701219814327709, 0.06928272188628098, 0.06846532151104168, 0.06766894274307564, 0.06689279032807831, 0.06613610868210289, 0.06539817944744149, 0.06467831922706141, 0.06397587748255844, 0.06329023458201889, 0.06262079998546165, 0.06196701055667709, 0.06132832899130668, 0.060704242351929394, 0.06009426070175016, 0.05949791582923129, 0.0589147600566812, 0.05834436512642086, 0.05778632115869659, 0.05724023567600365, 0.056705732688934064, 0.05618245183906824, 0.0556700475947998, 0.05516818849631744, 0.05467655644627305, 0.0541948460429423, 0.053722763952936625, 0.05326002832075602, 0.05280636821268242, 0.05236152309270511, 0.05192524232834598, 0.05149728472441248, 0.051077418082853834, 0.05066541878703102, 0.050261071408834686, 0.04986416833719888, 0.04947450942666355, 0.04909190166473398, 0.048716158856874775, 
0.04834710132805668, 0.04798455563985037, 0.047628354322129966, 0.04727833561851367, 0.04693434324472758, 0.046596226159133586, 0.046263838344712566, 0.045937038601841465, 0.04561569035124531, 0.04529966144654626, 0.04498882399586848, 0.04468305419199243, 0.04438223215058426, 0.04408624175605587, 0.043794970514638834, 0.04350830941428131, 0.043226152791001074, 0.042948398201350095, 0.04267494630066698, 0.042405700726813295, 0.042140567989107335, 0.04187945736218653, 0.04162228078454498, 0.0413689527615075, 0.04111939027241576, 0.040873512681814356, 0.04063124165443721, 0.040392501073805896, 0.04015721696426216, 0.03992531741626653, 0.039696732514804794, 0.03947139427075216, 0.039249236555053725, 0.039030195035587335, 0.03881420711658188, 0.03860121188047153, 0.03839115003207204, 0.03818396384497177, 0.037979597110035596, 0.03777799508592522, 0.037579104451544026, 0.03738287326032021, 0.037189250896245384, 0.036998188031590606, 0.036809636586225755, 0.036623549688471604, 0.036439881637417695, 0.03625858786664232, 0.036079624909274216, 0.03590295036433832, 0.03572852286433089, 0.03555630204397195, 0.035386248510085434, 0.03521832381255977, 0.035052490416344145, 0.03488871167443736, 0.03472695180182868, 0.03456717585035154, 0.03440934968441326, 0.034253439957565204, 0.034099414089879584, 0.033947240246100865, 0.03379688731454087, 0.03364832488668817, 0.03350152323750392, 0.0333564533063771, 0.03321308667871366, 0.033071395568135147, 0.03293135279926322, 0.03279293179106793, 0.03265610654075808, 0.0325208516081934, 0.032387142100798794, 0.0322549536589619, 0.032124262441896145, 0.03199504511395181, 0.03186727883135887, 0.031740941229385686, 0.03161601040989839, 0.031492464929306616, 0.031370283786881344, 0.031249446413431824, 0.03112993266032874, 0.031011722788861003, 0.030894797459914896, 0.03077913772396381, 0.030664725011357967, 0.030551541122903637, 0.030439568220721764, 0.030328788819376507, 0.030219185777264318, 0.030110742288254746, 0.030003441873574455, 
0.029897268373926113, 0.02979220594183445, 0.029688239034211677, 0.02958535240513517, 0.029483531098830215, 0.02938276044285117, 0.02928302604145444, 0.029184313769157155, 0.029086609764475243, 0.028989900423835417, 0.02889417239565525, 0.028799412574585966, 0.028705608095912896, 0.02861274633010843, 0.028520814877532743, 0.02842980156327758, 0.028339694432148663, 0.028250481743782377, 0.02816215196789255, 0.028074693779643334, 0.027988096055144258, 0.027902347867063783, 0.02781743848035765, 0.027733357348108598, 0.0276500941074741, 0.02756763857573873, 0.02748598074646827, 0.027405110785762198, 0.027325019028601922, 0.02724569597529179, 0.027167132287990087, 0.02708931878732754, 0.027012246449110586, 0.026935906401107084, 0.026860289919912017, 0.026785388427890865, 0.02671119349019843, 0.026637696811870965, 0.02656489023498947, 0.026492765735912242, 0.026421315422574573, 0.026350531531853794, 0.026280406426997836, 0.02621093259511549, 0.026142102644726693, 0.026073909303371102, 0.02600634541527347, 0.02593940393906413, 0.025873077945553147, 0.025807360615556708, 0.025742245237774207, 0.025677725206714783, 0.02561379402067193, 0.025550445279744836, 0.025487672683905303, 0.025425470031108913, 0.02536383121544939, 0.025302750225354928, 0.02524222114182535, 0.025182238136709202, 0.0251227954710195, 0.025063887493287296, 0.02500550863795202, 0.02494765342378762, 0.024890316452363646, 0.02483349240654034, 0.024777176048996825, 0.02472136222079166, 0.024666045839954766, 0.02461122190011018, 0.024556885469128556, 0.024503031687808954, 0.024449655768588964, 0.024396752994282588, 0.02434431871684517, 0.02429234835616463, 0.02424083739887848, 0.024189781397215858, 0.02413917596786403, 0.02408901679085884, 0.02403929960849835, 0.023990020224279235, 0.023941174501855397, 0.023892758364018152, 0.023844767791697623, 0.023797198822984644, 0.023750047552172915, 0.023703310128820727, 0.02365698275683186, 0.023611061693555327, 0.023565543248903267, 0.023520423784486814, 
0.023475699712769398, 0.02343136749623704, 0.023387423646585335, 0.023343864723922744, 0.0233006873359897, 0.023257888137393292, 0.023215463828857124, 0.023173411156485996, 0.02313172691104504, 0.023090407927253108, 0.02304945108308994, 0.02300885329911689, 0.022968611537810872, 0.022928722802911243, 0.022889184138779408, 0.02284999262977068, 0.022811145399618316, 0.02277263961082937, 0.0227344724640921, 0.02269664119769473, 0.02265914308695525, 0.022621975443662064, 0.02258513561552524, 0.022548620985638142, 0.02251242897194914, 0.022476557026743386, 0.02244100263613417, 0.022405763319563853, 0.022370836629314127, 0.02233622015002532, 0.022301911498224733, 0.022267908321863584, 0.022234208299862578, 0.022200809141665882, 0.02216770858680317, 0.022134904404459876, 0.02210239439305514, 0.02207017637982758, 0.022038248220428557, 0.0220066077985228, 0.021975253025396355, 0.021944181839571537, 0.021913392206428895, 0.021882882117835973, 0.02185264959178267, 0.021822692672023296, 0.021793009427724875, 0.02176359795312182, 0.021734456367176736, 0.021705582813247262, 0.021676975458758855, 0.021648632494883305, 0.021620552136222996, 0.02159273262050078, 0.021565172208255167, 0.021537869182541077, 0.02151082184863568, 0.021484028533749495, 0.021457487586742523, 0.021431197377845285, 0.021405156298384874, 0.021379362760515584, 0.02135381519695438, 0.021328512060720946, 0.021303451824882122, 0.02127863298230095, 0.021254054045389922, 0.02122971354586855, 0.021205610034525122, 0.02118174208098257, 0.02115810827346834, 0.021134707218588265, 0.021111537541104276, 0.021088597883715993, 0.02106588690684599, 0.021043403288428814, 0.02102114572370358, 0.020999112925010086, 0.020977303621588516, 0.020955716559382432, 0.020934350500845265, 0.020913204224750022, 0.020892276526002253, 0.02087156621545629, 0.020851072119734474, 0.02083079308104963, 0.02081072795703043, 0.020790875620549855, 0.020771234959556482, 0.020751804876908687, 0.020732584290211725, 0.020713572131657518, 
0.020694767347867193, 0.020676168899736367, 0.02065777576228298, 0.02063958692449783, 0.020621601389197598, 0.020603818172880383, 0.020586236305583806, 0.020568854830745397, 0.02055167280506554, 0.02053468929837266, 0.02051790339349078, 0.02050131418610934, 0.02048492078465529, 0.020468722310167376, 0.020452717896172617, 0.020436906688564883, 0.02042128784548568, 0.020405860537206878, 0.02039062394601562, 0.020375577266101175, 0.02036071970344374, 0.02034605047570535, 0.020331568812122485, 0.020317273953400803, 0.02030316515161161, 0.02028924167009018, 0.02027550278333597, 0.02026194777691453, 0.020248575947361246, 0.02023538660208679, 0.02022237905928431, 0.020209552647838264, 0.020196906707234993, 0.020184440587474886, 0.02017215364898616, 0.02016004526254031, 0.020148114809169074, 0.02013636168008294, 0.020124785276591273, 0.02011338501002391, 0.020102160301654216, 0.020091110582623722, 0.02008023529386807, 0.020069533886044603, 0.020059005819461167, 0.020048650564006475, 0.020038467599081785, 0.020028456413533974, 0.02001861650558995, 0.020008947382792436, 0.01999944856193704, 0.019990119569010673, 0.019980959939131217, 0.019971969216488514, 0.019963146954286602, 0.019954492714687185, 0.019946006068754386, 0.01993768659640067, 0.019929533886334033, 0.019921547536006338, 0.01991372715156286, 0.019906072347793048, 0.01989858274808236, 0.019891257984365338, 0.01988409769707977, 0.01987710153512198, 0.0198702691558033, 0.01986360022480756, 0.01985709441614975, 0.01985075141213575, 0.0198445709033231, 0.019838552588482928, 0.019832696174562844, 0.01982700137665095, 0.019821467917940903, 0.019816095529697923, 0.01981088395122598, 0.01980583292983587, 0.01980094222081435, 0.019796211587394332, 0.019791640800725974, 0.019787229639848893, 0.01978297789166522, 0.019778885350913777, 0.019774951820145124, 0.01977117710969763, 0.019767561037674516, 0.019764103429921808, 0.019760804120007282, 0.01975766294920036, 0.019754679766452937, 0.019751854428381153, 
0.019749186799248076, 0.019746676750947406, 0.019744324162987985, 0.0197421289224793, 0.019740090924117947, 0.019738210070174914, 0.019736486270483855, 0.019734919442430287, 0.019733509510941594, 0.019732256408478102, 0.019731160075024907, 0.019730220458084716, 0.019729437512671526, 0.019728811201305235, 0.019728341494007158, 0.019728028368296416, 0.019727871809187256, 0.019727871809187256, 0.019728028368296416, 0.019728341494007158, 0.019728811201305235, 0.019729437512671526, 0.019730220458084716, 0.019731160075024907, 0.019732256408478102, 0.019733509510941594, 0.019734919442430283, 0.019736486270483862, 0.019738210070174914, 0.019740090924117947, 0.0197421289224793, 0.019744324162987985, 0.019746676750947406, 0.019749186799248076, 0.019751854428381153, 0.01975467976645294, 0.01975766294920036, 0.019760804120007282, 0.019764103429921808, 0.019767561037674516, 0.01977117710969763, 0.01977495182014512, 0.019778885350913777, 0.01978297789166522, 0.019787229639848893, 0.019791640800725978, 0.019796211587394325, 0.01980094222081435, 0.01980583292983587, 0.01981088395122598, 0.01981609552969792, 0.019821467917940896, 0.019827001376650954, 0.019832696174562844, 0.019838552588482928, 0.019844570903323103, 0.01985075141213575, 0.019857094416149752, 0.01986360022480756, 0.0198702691558033, 0.019877101535121986, 0.01988409769707977, 0.019891257984365338, 0.01989858274808236, 0.019906072347793048, 0.01991372715156286, 0.01992154753600633, 0.019929533886334033, 0.01993768659640067, 0.019946006068754386, 0.01995449271468718, 0.0199631469542866, 0.019971969216488514, 0.019980959939131217, 0.019990119569010673, 0.01999944856193704, 0.020008947382792436, 0.02001861650558995, 0.020028456413533974, 0.020038467599081785, 0.020048650564006475, 0.02005900581946117, 0.020069533886044603, 0.02008023529386807, 0.020091110582623722, 0.020102160301654223, 0.020113385010023913, 0.020124785276591273, 0.02013636168008294, 0.020148114809169074, 0.020160045262540314, 0.020172153648986158, 
0.020184440587474886, 0.020196906707234993, 0.020209552647838264, 0.02022237905928431, 0.020235386602086798, 0.020248575947361246, 0.02026194777691453, 0.02027550278333597, 0.02028924167009018, 0.020303165151611607, 0.020317273953400803, 0.020331568812122485, 0.020346050475705348, 0.020360719703443747, 0.020375577266101168, 0.02039062394601562, 0.020405860537206878, 0.02042128784548568, 0.020436906688564883, 0.020452717896172617, 0.020468722310167376, 0.02048492078465529, 0.02050131418610934, 0.02051790339349078, 0.02053468929837266, 0.02055167280506554, 0.020568854830745397, 0.020586236305583802, 0.020603818172880383, 0.020621601389197598, 0.02063958692449783, 0.02065777576228298, 0.020676168899736364, 0.020694767347867196, 0.020713572131657518, 0.020732584290211725, 0.020751804876908687, 0.020771234959556475, 0.020790875620549855, 0.02081072795703043, 0.02083079308104963, 0.020851072119734474, 0.02087156621545629, 0.020892276526002257, 0.020913204224750022, 0.020934350500845265, 0.020955716559382432, 0.020977303621588516, 0.02099911292501009, 0.02102114572370358, 0.021043403288428814, 0.02106588690684599, 0.02108859788371599, 0.021111537541104276, 0.021134707218588265, 0.02115810827346834, 0.02118174208098257, 0.021205610034525122, 0.021229713545868552, 0.021254054045389922, 0.02127863298230095, 0.021303451824882122, 0.021328512060720946, 0.021353815196954385, 0.021379362760515584, 0.021405156298384874, 0.021431197377845285, 0.021457487586742516, 0.021484028533749498, 0.02151082184863568, 0.021537869182541077, 0.021565172208255167, 0.021592732620500776, 0.021620552136223, 0.021648632494883305, 0.021676975458758855, 0.021705582813247266, 0.021734456367176736, 0.021763597953121824, 0.021793009427724875, 0.021822692672023296, 0.02185264959178267, 0.021882882117835973, 0.021913392206428906, 0.021944181839571537, 0.021975253025396355, 0.0220066077985228, 0.022038248220428557, 0.022070176379827583, 0.02210239439305514, 0.022134904404459876, 0.022167708586803173, 
0.022200809141665882, 0.022234208299862585, 0.022267908321863584, 0.022301911498224733, 0.02233622015002532, 0.022370836629314127, 0.022405763319563853, 0.02244100263613417, 0.022476557026743386, 0.02251242897194914, 0.022548620985638142, 0.02258513561552524, 0.022621975443662064, 0.02265914308695525, 0.02269664119769473, 0.0227344724640921, 0.02277263961082937, 0.022811145399618316, 0.02284999262977068, 0.022889184138779408, 0.022928722802911243, 0.022968611537810872, 0.02300885329911689, 0.02304945108308994, 0.023090407927253108, 0.02313172691104504, 0.023173411156485996, 0.023215463828857124, 0.023257888137393292, 0.0233006873359897, 0.023343864723922744, 0.023387423646585335, 0.02343136749623704, 0.023475699712769398, 0.023520423784486814, 0.023565543248903267, 0.023611061693555327, 0.02365698275683186, 0.02370331012882072, 0.023750047552172915, 0.023797198822984644, 0.023844767791697623, 0.023892758364018152, 0.023941174501855393, 0.02399002022427923, 0.02403929960849835, 0.02408901679085884, 0.02413917596786403, 0.02418978139721585, 0.024240837398878477, 0.02429234835616463, 0.02434431871684517, 0.024396752994282588, 0.024449655768588964, 0.024503031687808957, 0.02455688546912856, 0.02461122190011018, 0.024666045839954766, 0.024721362220791653, 0.02477717604899683, 0.024833492406540342, 0.024890316452363646, 0.02494765342378762, 0.02500550863795202, 0.025063887493287303, 0.0251227954710195, 0.025182238136709202, 0.02524222114182535, 0.02530275022535492, 0.025363831215449394, 0.025425470031108913, 0.025487672683905303, 0.025550445279744836, 0.025613794020671925, 0.025677725206714783, 0.025742245237774207, 0.025807360615556708, 0.025873077945553147, 0.025939403939064118, 0.026006345415273475, 0.026073909303371102, 0.026142102644726693, 0.02621093259511549, 0.026280406426997832, 0.026350531531853797, 0.026421315422574573, 0.026492765735912242, 0.02656489023498947, 0.02663769681187096, 0.026711193490198435, 0.026785388427890865, 0.026860289919912017, 
0.026935906401107084, 0.02701224644911058, 0.027089318787327545, 0.027167132287990094, 0.02724569597529179, 0.02732501902860192, 0.027405110785762195, 0.02748598074646827, 0.02756763857573873, 0.0276500941074741, 0.027733357348108598, 0.027817438480357646, 0.027902347867063783, 0.027988096055144258, 0.028074693779643334, 0.028162151967892547, 0.028250481743782373, 0.028339694432148663, 0.02842980156327758, 0.028520814877532743, 0.02861274633010843, 0.028705608095912893, 0.028799412574585972, 0.02889417239565525, 0.028989900423835417, 0.029086609764475236, 0.02918431376915716, 0.029283026041454448, 0.02938276044285117, 0.029483531098830215, 0.029585352405135164, 0.029688239034211684, 0.02979220594183445, 0.029897268373926113, 0.03000344187357445, 0.03011074228825474, 0.03021918577726432, 0.03032878881937651, 0.030439568220721764, 0.030551541122903633, 0.03066472501135796, 0.030779137723963818, 0.030894797459914903, 0.031011722788861003, 0.031129932660328735, 0.031249446413431824, 0.031370283786881344, 0.03149246492930662, 0.03161601040989839, 0.03174094122938568, 0.03186727883135887, 0.03199504511395182, 0.03212426244189615, 0.0322549536589619, 0.032387142100798794, 0.032520851608193395, 0.03265610654075809, 0.03279293179106793, 0.03293135279926322, 0.03307139556813514, 0.033213086678713664, 0.033356453306377105, 0.03350152323750393, 0.03364832488668817, 0.03379688731454087, 0.03394724024610086, 0.03409941408987959, 0.034253439957565204, 0.03440934968441326, 0.03456717585035153, 0.03472695180182867, 0.03488871167443738, 0.03505249041634415, 0.03521832381255977, 0.03538624851008543, 0.03555630204397195, 0.0357285228643309, 0.03590295036433832, 0.036079624909274216, 0.036258587866642315, 0.03643988163741768, 0.03662354968847162, 0.036809636586225755, 0.036998188031590606, 0.03718925089624538, 0.03738287326032022, 0.037579104451544026, 0.03777799508592522, 0.037979597110035596, 0.03818396384497176, 0.03839115003207205, 0.03860121188047153, 0.03881420711658188, 
0.03903019503558733, 0.039249236555053725, 0.03947139427075218, 0.03969673251480481, 0.03992531741626653, 0.04015721696426215, 0.04039250107380589, 0.04063124165443722, 0.04087351268181437, 0.04111939027241576, 0.04136895276150748, 0.04162228078454497, 0.04187945736218656, 0.04214056798910734, 0.042405700726813295, 0.042674946300666976, 0.04294839820135009, 0.0432261527910011, 0.04350830941428132, 0.043794970514638834, 0.04408624175605586, 0.04438223215058425, 0.04468305419199244, 0.04498882399586849, 0.04529966144654626, 0.0456156903512453, 0.04593703860184145, 0.046263838344712586, 0.04659622615913359, 0.04693434324472758, 0.04727833561851366, 0.047628354322129945, 0.0479845556398504, 0.04834710132805669, 0.048716158856874775, 0.04909190166473397, 0.04947450942666354, 0.0498641683371989, 0.05026107140883469, 0.05066541878703102, 0.05107741808285381, 0.05149728472441245, 0.05192524232834599, 0.052361523092705115, 0.05280636821268241, 0.053260028320756006, 0.0537227639529366, 0.054194846042942314, 0.05467655644627305, 0.05516818849631743, 0.055670047594799786, 0.056182451839068205, 0.05670573268893408, 0.057240235676003656, 0.05778632115869659, 0.05834436512642085, 0.05891476005668123, 0.059497915829231314, 0.060094260701750175, 0.06070424235192939, 0.06132832899130665, 0.06196701055667712, 0.06262079998546166, 0.0632902345820189, 0.06397587748255842, 0.06467831922706137, 0.06539817944744152, 0.06613610868210291, 0.06689279032807831, 0.06766894274307562, 0.06846532151104165, 0.06928272188628103, 0.07012198143277093, 0.07098398287710962, 0.07186965719555247, 0.07277998695786163, 0.07371600995325114, 0.07467882312659455, 0.07566958685632605, 0.07668952960915502, 0.0777399530109059, 0.07882223737755155, 0.07993784775592512, 0.08108834052977568, 0.08227537065388693, 0.08350069958706596, 0.08476620400407965, 0.08607388537727945, 0.08742588053094563, 0.0888244732855697, 0.09027210732572791, 0.0917714004442653, 0.09332516033769221, 0.0949364021535767, 0.09660836802097614, 
0.09834454883045624, 0.100148708571996, 0.10202491158834061, 0.1039775531596671, 0.10601139390463742, 0.10813159856537677, 0.11034377984248044, 0.11265404806440948, 0.1150690676180609, 0.11759612123947859, 0.12024318347272615, 0.12301900485981138, 0.12593320873670627, 0.12899640289461117, 0.13222030884055894, 0.1356179119818129, 0.1392036367962558, 0.14299355197818725, 0.1470056117213132, 0.15125994079238753, 0.15577917295968122, 0.16058885480517346, 0.16571793015256223, 0.17119932453924833, 0.17707065470347888, 0.18337509544622496, 0.19016244616993824, 0.19749045291363485, 0.20542646028339195, 0.21404949351780742, 0.2234529073255364, 0.23374779010782848, 0.24506738751259952, 0.25757292023624606, 0.2714613373683233, 0.28697580091461555, 0.3044200943416019, 0.32417878318087257, 0.34674599868280326, 0.37276747945210814, 0.40310359093891523, 0.43892664782610835, 0.4818765063592282, 0.5343196607298928, 0.5998022307538973, 0.6838903357678081, 0.7958488403711937, 0.9523295423477482, 1.186576766125252, 1.5759894085474269, 2.3520804884295794, 4.667032721489947, Infinity], "type": "scatter", "x": [0.0, 0.001001001001001001, 0.002002002002002002, 0.003003003003003003, 0.004004004004004004, 0.005005005005005005, 0.006006006006006006, 0.007007007007007007, 0.008008008008008008, 0.009009009009009009, 0.01001001001001001, 0.011011011011011011, 0.012012012012012012, 0.013013013013013013, 0.014014014014014014, 0.015015015015015015, 0.016016016016016016, 0.01701701701701702, 0.018018018018018018, 0.01901901901901902, 0.02002002002002002, 0.021021021021021023, 0.022022022022022022, 0.023023023023023025, 0.024024024024024024, 0.025025025025025027, 0.026026026026026026, 0.02702702702702703, 0.028028028028028028, 0.02902902902902903, 0.03003003003003003, 0.031031031031031032, 0.03203203203203203, 0.03303303303303303, 0.03403403403403404, 0.035035035035035036, 0.036036036036036036, 0.037037037037037035, 0.03803803803803804, 0.03903903903903904, 0.04004004004004004, 0.04104104104104104, 
0.042042042042042045, 0.043043043043043044, 0.044044044044044044, 0.04504504504504504, 0.04604604604604605, 0.04704704704704705, 0.04804804804804805, 0.04904904904904905, 0.05005005005005005, 0.05105105105105105, 0.05205205205205205, 0.05305305305305305, 0.05405405405405406, 0.055055055055055056, 0.056056056056056056, 0.057057057057057055, 0.05805805805805806, 0.05905905905905906, 0.06006006006006006, 0.06106106106106106, 0.062062062062062065, 0.06306306306306306, 0.06406406406406406, 0.06506506506506507, 0.06606606606606606, 0.06706706706706707, 0.06806806806806807, 0.06906906906906907, 0.07007007007007007, 0.07107107107107107, 0.07207207207207207, 0.07307307307307308, 0.07407407407407407, 0.07507507507507508, 0.07607607607607608, 0.07707707707707707, 0.07807807807807808, 0.07907907907907907, 0.08008008008008008, 0.08108108108108109, 0.08208208208208208, 0.08308308308308308, 0.08408408408408409, 0.08508508508508508, 0.08608608608608609, 0.08708708708708708, 0.08808808808808809, 0.0890890890890891, 0.09009009009009009, 0.09109109109109109, 0.0920920920920921, 0.09309309309309309, 0.0940940940940941, 0.09509509509509509, 0.0960960960960961, 0.0970970970970971, 0.0980980980980981, 0.0990990990990991, 0.1001001001001001, 0.1011011011011011, 0.1021021021021021, 0.1031031031031031, 0.1041041041041041, 0.10510510510510511, 0.1061061061061061, 0.10710710710710711, 0.10810810810810811, 0.1091091091091091, 0.11011011011011011, 0.1111111111111111, 0.11211211211211211, 0.11311311311311312, 0.11411411411411411, 0.11511511511511512, 0.11611611611611612, 0.11711711711711711, 0.11811811811811812, 0.11911911911911911, 0.12012012012012012, 0.12112112112112113, 0.12212212212212212, 0.12312312312312312, 0.12412412412412413, 0.12512512512512514, 0.12612612612612611, 0.12712712712712712, 0.12812812812812813, 0.12912912912912913, 0.13013013013013014, 0.13113113113113112, 0.13213213213213212, 0.13313313313313313, 0.13413413413413414, 0.13513513513513514, 0.13613613613613615, 
0.13713713713713713, 0.13813813813813813, 0.13913913913913914, 0.14014014014014015, 0.14114114114114115, 0.14214214214214213, 0.14314314314314314, 0.14414414414414414, 0.14514514514514515, 0.14614614614614616, 0.14714714714714713, 0.14814814814814814, 0.14914914914914915, 0.15015015015015015, 0.15115115115115116, 0.15215215215215216, 0.15315315315315314, 0.15415415415415415, 0.15515515515515516, 0.15615615615615616, 0.15715715715715717, 0.15815815815815815, 0.15915915915915915, 0.16016016016016016, 0.16116116116116116, 0.16216216216216217, 0.16316316316316315, 0.16416416416416416, 0.16516516516516516, 0.16616616616616617, 0.16716716716716717, 0.16816816816816818, 0.16916916916916916, 0.17017017017017017, 0.17117117117117117, 0.17217217217217218, 0.17317317317317318, 0.17417417417417416, 0.17517517517517517, 0.17617617617617617, 0.17717717717717718, 0.1781781781781782, 0.17917917917917917, 0.18018018018018017, 0.18118118118118118, 0.18218218218218218, 0.1831831831831832, 0.1841841841841842, 0.18518518518518517, 0.18618618618618618, 0.1871871871871872, 0.1881881881881882, 0.1891891891891892, 0.19019019019019018, 0.19119119119119118, 0.1921921921921922, 0.1931931931931932, 0.1941941941941942, 0.19519519519519518, 0.1961961961961962, 0.1971971971971972, 0.1981981981981982, 0.1991991991991992, 0.2002002002002002, 0.2012012012012012, 0.2022022022022022, 0.2032032032032032, 0.2042042042042042, 0.20520520520520522, 0.2062062062062062, 0.2072072072072072, 0.2082082082082082, 0.2092092092092092, 0.21021021021021022, 0.21121121121121122, 0.2122122122122122, 0.2132132132132132, 0.21421421421421422, 0.21521521521521522, 0.21621621621621623, 0.2172172172172172, 0.2182182182182182, 0.21921921921921922, 0.22022022022022023, 0.22122122122122123, 0.2222222222222222, 0.22322322322322322, 0.22422422422422422, 0.22522522522522523, 0.22622622622622623, 0.22722722722722724, 0.22822822822822822, 0.22922922922922923, 0.23023023023023023, 0.23123123123123124, 0.23223223223223224, 
0.23323323323323322, 0.23423423423423423, 0.23523523523523523, 0.23623623623623624, 0.23723723723723725, 0.23823823823823823, 0.23923923923923923, 0.24024024024024024, 0.24124124124124124, 0.24224224224224225, 0.24324324324324326, 0.24424424424424424, 0.24524524524524524, 0.24624624624624625, 0.24724724724724725, 0.24824824824824826, 0.24924924924924924, 0.2502502502502503, 0.25125125125125125, 0.25225225225225223, 0.25325325325325326, 0.25425425425425424, 0.2552552552552553, 0.25625625625625625, 0.25725725725725723, 0.25825825825825827, 0.25925925925925924, 0.2602602602602603, 0.26126126126126126, 0.26226226226226224, 0.26326326326326327, 0.26426426426426425, 0.2652652652652653, 0.26626626626626626, 0.2672672672672673, 0.2682682682682683, 0.26926926926926925, 0.2702702702702703, 0.27127127127127126, 0.2722722722722723, 0.2732732732732733, 0.27427427427427425, 0.2752752752752753, 0.27627627627627627, 0.2772772772772773, 0.2782782782782783, 0.27927927927927926, 0.2802802802802803, 0.28128128128128127, 0.2822822822822823, 0.2832832832832833, 0.28428428428428426, 0.2852852852852853, 0.2862862862862863, 0.2872872872872873, 0.2882882882882883, 0.28928928928928926, 0.2902902902902903, 0.2912912912912913, 0.2922922922922923, 0.2932932932932933, 0.29429429429429427, 0.2952952952952953, 0.2962962962962963, 0.2972972972972973, 0.2982982982982983, 0.2992992992992993, 0.3003003003003003, 0.3013013013013013, 0.3023023023023023, 0.3033033033033033, 0.30430430430430433, 0.3053053053053053, 0.3063063063063063, 0.3073073073073073, 0.3083083083083083, 0.30930930930930933, 0.3103103103103103, 0.3113113113113113, 0.3123123123123123, 0.3133133133133133, 0.31431431431431434, 0.3153153153153153, 0.3163163163163163, 0.3173173173173173, 0.3183183183183183, 0.31931931931931934, 0.3203203203203203, 0.3213213213213213, 0.32232232232232233, 0.3233233233233233, 0.32432432432432434, 0.3253253253253253, 0.3263263263263263, 0.32732732732732733, 0.3283283283283283, 0.32932932932932935, 
0.3303303303303303, 0.33133133133133136, 0.33233233233233234, 0.3333333333333333, 0.33433433433433435, 0.3353353353353353, 0.33633633633633636, 0.33733733733733734, 0.3383383383383383, 0.33933933933933935, 0.34034034034034033, 0.34134134134134136, 0.34234234234234234, 0.3433433433433433, 0.34434434434434436, 0.34534534534534533, 0.34634634634634637, 0.34734734734734735, 0.3483483483483483, 0.34934934934934936, 0.35035035035035034, 0.35135135135135137, 0.35235235235235235, 0.3533533533533533, 0.35435435435435436, 0.35535535535535534, 0.3563563563563564, 0.35735735735735735, 0.35835835835835833, 0.35935935935935936, 0.36036036036036034, 0.3613613613613614, 0.36236236236236236, 0.3633633633633634, 0.36436436436436437, 0.36536536536536535, 0.3663663663663664, 0.36736736736736736, 0.3683683683683684, 0.36936936936936937, 0.37037037037037035, 0.3713713713713714, 0.37237237237237236, 0.3733733733733734, 0.3743743743743744, 0.37537537537537535, 0.3763763763763764, 0.37737737737737737, 0.3783783783783784, 0.3793793793793794, 0.38038038038038036, 0.3813813813813814, 0.38238238238238237, 0.3833833833833834, 0.3843843843843844, 0.38538538538538536, 0.3863863863863864, 0.38738738738738737, 0.3883883883883884, 0.3893893893893894, 0.39039039039039036, 0.3913913913913914, 0.3923923923923924, 0.3933933933933934, 0.3943943943943944, 0.3953953953953954, 0.3963963963963964, 0.3973973973973974, 0.3983983983983984, 0.3993993993993994, 0.4004004004004004, 0.4014014014014014, 0.4024024024024024, 0.4034034034034034, 0.4044044044044044, 0.40540540540540543, 0.4064064064064064, 0.4074074074074074, 0.4084084084084084, 0.4094094094094094, 0.41041041041041043, 0.4114114114114114, 0.4124124124124124, 0.4134134134134134, 0.4144144144144144, 0.41541541541541543, 0.4164164164164164, 0.4174174174174174, 0.4184184184184184, 0.4194194194194194, 0.42042042042042044, 0.4214214214214214, 0.42242242242242245, 0.42342342342342343, 0.4244244244244244, 0.42542542542542544, 0.4264264264264264, 
0.42742742742742745, 0.42842842842842843, 0.4294294294294294, 0.43043043043043044, 0.4314314314314314, 0.43243243243243246, 0.43343343343343343, 0.4344344344344344, 0.43543543543543545, 0.4364364364364364, 0.43743743743743746, 0.43843843843843844, 0.4394394394394394, 0.44044044044044045, 0.44144144144144143, 0.44244244244244246, 0.44344344344344344, 0.4444444444444444, 0.44544544544544545, 0.44644644644644643, 0.44744744744744747, 0.44844844844844844, 0.4494494494494494, 0.45045045045045046, 0.45145145145145144, 0.45245245245245247, 0.45345345345345345, 0.4544544544544545, 0.45545545545545546, 0.45645645645645644, 0.4574574574574575, 0.45845845845845845, 0.4594594594594595, 0.46046046046046046, 0.46146146146146144, 0.4624624624624625, 0.46346346346346345, 0.4644644644644645, 0.46546546546546547, 0.46646646646646645, 0.4674674674674675, 0.46846846846846846, 0.4694694694694695, 0.47047047047047047, 0.47147147147147145, 0.4724724724724725, 0.47347347347347346, 0.4744744744744745, 0.4754754754754755, 0.47647647647647645, 0.4774774774774775, 0.47847847847847846, 0.4794794794794795, 0.4804804804804805, 0.48148148148148145, 0.4824824824824825, 0.48348348348348347, 0.4844844844844845, 0.4854854854854855, 0.4864864864864865, 0.4874874874874875, 0.48848848848848847, 0.4894894894894895, 0.4904904904904905, 0.4914914914914915, 0.4924924924924925, 0.4934934934934935, 0.4944944944944945, 0.4954954954954955, 0.4964964964964965, 0.4974974974974975, 0.4984984984984985, 0.4994994994994995, 0.5005005005005005, 0.5015015015015015, 0.5025025025025025, 0.5035035035035035, 0.5045045045045045, 0.5055055055055055, 0.5065065065065065, 0.5075075075075075, 0.5085085085085085, 0.5095095095095095, 0.5105105105105106, 0.5115115115115115, 0.5125125125125125, 0.5135135135135135, 0.5145145145145145, 0.5155155155155156, 0.5165165165165165, 0.5175175175175175, 0.5185185185185185, 0.5195195195195195, 0.5205205205205206, 0.5215215215215215, 0.5225225225225225, 0.5235235235235235, 0.5245245245245245, 

Continuous Action Space

Now that we have verified that policy gradient methods succeed on this problem, we can consider a continuous action space in which the policy outputs a distribution over throttle values. In the original problem the maximum throttle magnitude is 1, but the car's speed is already capped at 0.07. We can therefore check whether a policy learns to use much larger throttle values to end the episode faster, even though the resulting physics is unrealistic. Observing this behavior would confirm that continuous actions are working when the throttle is an unbounded continuous value. The optimal policy would likely apply the highest throttle available in order to reach the maximum speed in either direction as quickly as possible. We could also add friction to the problem so that the car would actually slip if it attempts to accelerate too quickly.
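
To make the effect of the speed cap concrete, here is a minimal sketch in Python. It assumes the standard mountain car dynamics (position in [-1.2, 0.5], a velocity update of 0.001 times the throttle minus 0.0025 cos(3x)), which may differ from the exact step function used in this notebook. The names `mountaincar_step` and `steps_to_goal` are hypothetical, and the controller is a fixed hand-written rule that pushes in the direction of motion, not a learned policy.

```python
import math

X_MIN, X_MAX = -1.2, 0.5   # assumed standard position bounds
V_MAX = 0.07               # speed cap stated in the text


def mountaincar_step(x, v, throttle):
    """One transition with an unbounded continuous throttle (assumed dynamics)."""
    v_new = v + 0.001 * throttle - 0.0025 * math.cos(3.0 * x)
    # The speed cap applies no matter how large the throttle is.
    v_new = max(-V_MAX, min(V_MAX, v_new))
    x_new = x + v_new
    if x_new <= X_MIN:
        # Hitting the left wall stops the car.
        x_new, v_new = X_MIN, 0.0
    done = x_new >= X_MAX
    return x_new, v_new, done


def steps_to_goal(throttle_scale, max_steps=10_000):
    """Count steps to the goal when always pushing in the direction of motion."""
    x, v = -0.5, 0.0
    for t in range(1, max_steps + 1):
        action = throttle_scale if v >= 0 else -throttle_scale
        x, v, done = mountaincar_step(x, v, action)
        if done:
            return t
    return max_steps


if __name__ == "__main__":
    for scale in (1.0, 10.0, 100.0):
        print(scale, steps_to_goal(scale))
```

Under these assumptions a large enough throttle saturates the speed cap on the first step, so the car drives straight up the hill instead of having to rock back and forth. That is the behavior a learned policy with unbounded throttle would be expected to exploit.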

Policy Probability for Left Action is 0.5 and Average Episode Length is 12.0009365

State Distribution Per Step Including Terminal State
§cell_idÙ$25be5dcf-be63-46c4-b6de-6cf79fa28fd0´downstream_cells_map¼update_traces_with_gradient!”Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90f²upstream_cells_mapÞ¤zero¦isless©@inbounds«FCANNParams£one·BinaryEligibilityVector‘Ù$41dc149d-c6f3-4b0d-a856-06f3aaae3049§nothing¦Vector¡<¯Base.simd_index©eachindex¤Real¥@simd¦Matrix§Float32¡:¥first³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41®julia.simdloop¤BaseµBase.simd_outer_range¡-¶Base.simd_inner_length¡+¡*¥ArrayÙ$38acd032-1d18-4760-9111-67c9cdd2e892„´precedence_heuristic §cell_idÙ$38acd032-1d18-4760-9111-67c9cdd2e892´downstream_cells_map€²upstream_cells_map€Ù$cecc2a35-3850-4f66-84e8-e29da4f3d4b0„´precedence_heuristic §cell_idÙ$cecc2a35-3850-4f66-84e8-e29da4f3d4b0´downstream_cells_mapºget_corridor_episode_stats”Ù$a019925a-460a-410e-a54b-50a4cfe0e90eÙ$e1493cea-19c4-475d-98a0-86d27fb04af1Ù$573878bb-020d-40f6-9329-3d5f91843010Ù$553b0ceb-f2ca-41ee-99bc-9f53a4487b49²upstream_cells_map‹¡:¥first¢|>¨Function¦length¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabe¡/¦foldxt¡+£Mapªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$4c4e643b-d4b9-44f0-8d30-dc521bcc55ac„´precedence_heuristic §cell_idÙ$4c4e643b-d4b9-44f0-8d30-dc521bcc55ac´downstream_cells_map·cartpole_continuing_mdp”Ù$1b102220-6d78-480d-a77f-0e57bad23dcaÙ$3c89209c-9202-4d5d-841c-ea34be369616Ù$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62Ù$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27²upstream_cells_map…Ù#cartpole_functions.initialize_state¹StateMDPTransitionSampler‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84²cartpole_functions‘Ù$f27f2bcd-05b6-44fe-bf9e-a3e51556db7c¸cartpole_continuing_step‘Ù$5d434c83-c9ca-499f-8695-c7733031c2de¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$738ada7f-edc7-4ed3-a15e-e92113468738„´precedence_heuristic 
§cell_idÙ$738ada7f-edc7-4ed3-a15e-e92113468738´downstream_cells_map€²upstream_cells_map€Ù$cacaaca6-6e01-464f-a2ee-cbf62737a426„´precedence_heuristic §cell_idÙ$cacaaca6-6e01-464f-a2ee-cbf62737a426´downstream_cells_map€²upstream_cells_map„¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeÙ;reinforce_with_baseline_monte_carlo_control_linear_features‘Ù$d1ed25e6-60c6-411f-a541-99986e5da2c5¡^¹update_corridor_features!‘Ù$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811Ù$daf35bfe-8f9c-4f55-971d-4d443be8f8bf„´precedence_heuristic §cell_idÙ$daf35bfe-8f9c-4f55-971d-4d443be8f8bf´downstream_cells_map€²upstream_cells_map…®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6¸display_cartpole_episode‘Ù$822e4d69-2582-4956-858e-06ecb091e76a¢|>¯reinforce_test5‘Ù$82e0e9a0-9662-429a-87e3-e6bdae02709aªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$8e096fae-9941-49d8-ae87-c68b02f68da5„´precedence_heuristic §cell_idÙ$8e096fae-9941-49d8-ae87-c68b02f68da5´downstream_cells_map¿mountaincar_continuous_beta_mdp’Ù$4156d955-9daf-4429-b152-e8332980fb9eÙ$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8f²upstream_cells_mapÙ)create_continuous_action_mountaincar_beta‘Ù$d2729657-d0bf-4d39-8ec7-f242a1ad48d6Ù$666a4e89-306b-4fb2-bdc4-3dda2c63153f„´precedence_heuristic§cell_idÙ$666a4e89-306b-4fb2-bdc4-3dda2c63153f´downstream_cells_map°SpecialFunctions²upstream_cells_map€Ù$5d35e515-e2d3-443e-becf-eb28c25db346„´precedence_heuristic §cell_idÙ$5d35e515-e2d3-443e-becf-eb28c25db346´downstream_cells_mapÙ#mountaincar_continuing_fcann_params‘Ù$cb70d400-3e9c-441c-b17c-e727e8c928f3²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunnerÙ(create_actor_critic_continuing_params_UI‘Ù$5b15d91e-7119-4f85-a54a-7d4f1fdaf097¯Core.applicable¥@bind¨Base.getÙ$4c34640f-efa2-4e1d-8a70-0acd2ce45428„´precedence_heuristic §cell_idÙ$4c34640f-efa2-4e1d-8a70-0acd2ce45428´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$e7566274-5518-4e28-8738-d4b1747d0cfb„´precedence_heuristic 
§cell_idÙ$e7566274-5518-4e28-8738-d4b1747d0cfb´downstream_cells_map¹form_state_value_function”Ù$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$e7e49ff8-32df-48a4-afb2-462859592e92Ù$5b868eba-c1af-49f6-8f93-79b78c319a6fÙ$11b9beea-b0cd-45eb-84c6-151728894df0²upstream_cells_map¨FunctionÙ$6bf5ad39-1400-4e1f-a843-a1934b8aaa48„´precedence_heuristic §cell_idÙ$6bf5ad39-1400-4e1f-a843-a1934b8aaa48´downstream_cells_mapÙ,update_squashed_gaussian_eligibility_vector!‘Ù$05f120be-9695-4824-82fd-142a0df13098²upstream_cells_mapÞ¤zero¦isless©@inbounds£one§nothing¦Vector¡<¯Base.simd_index©eachindex¤Real¥@simd¡/¡^¦NTuple¦Matrix¤last£exp¡:¥first®julia.simdloop¤BaseµBase.simd_outer_range¡-¶Base.simd_inner_length¥atanh¡+¡*Ù$17d07ef4-7c0a-47cc-a701-32c60336571b„´precedence_heuristic §cell_idÙ$17d07ef4-7c0a-47cc-a701-32c60336571b´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$76fd79a2-2bc8-45f8-a243-48415118898a„´precedence_heuristic §cell_idÙ$76fd79a2-2bc8-45f8-a243-48415118898a´downstream_cells_mapÙ'BinarySquashedGaussianEligibilityVector”Ù$f55afa58-962d-4551-8d95-a5b467d61adfÙ$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690cÙ$4e29c621-223e-4859-8e96-db04b967815aÙ$056a8adc-92f4-4b33-90d9-4b3b4026bbbc²upstream_cells_map‹¤zero¥zeros³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¤ones£one¦Vector¤Real¡N¦NTuple¥Union¤amaxÙ$0b01ba67-3921-4f3f-a7e8-235190bc84eb„´precedence_heuristic §cell_idÙ$0b01ba67-3921-4f3f-a7e8-235190bc84eb´downstream_cells_map®make_beta_dist‘Ù$ad0009af-2cfc-4820-bd4a-698ad391f459²upstream_cells_map…¡-¡^¡/¡*¤betaÙ$9acdbf38-2e10-45ec-85a0-d0db8453a599„´precedence_heuristic §cell_idÙ$9acdbf38-2e10-45ec-85a0-d0db8453a599´downstream_cells_mapºfcann_feature_vector_setup“Ù$f0962801-0dfa-421f-8ffc-e64068e49913Ù$61650a97-b353-4a85-b50b-93fee296ac7bÙ$023f67b8-8f38-470a-9766-ac60a75678aa²upstream_cells_map‹¡:¥Tuple¦Vector«scale_state¤Real²make_sample_vector‘Ù$76d54520-baa3-44bf-b303-4cdcb8b87080¦length¡-¦NTuple¥Union¢==Ù$d4e87ac4-6008-43b2-aa06-e232ec2b2b5b„´precedence_heuristic 
§cell_idÙ$d4e87ac4-6008-43b2-aa06-e232ec2b2b5b´downstream_cells_map€²upstream_cells_map†CartPoleState§Float32¡x‘Ù$a8349352-3242-46d5-b0d5-1b6eb5d77e90¯reinforce_test5‘Ù$82e0e9a0-9662-429a-87e3-e6bdae02709a£áº‹‘Ù$2e7c737c-c798-4442-a7e1-d74ccfd73119´plot_cartpole_policy‘Ù$602a07dd-8928-4b44-97e5-01c5cbf38351Ù$05f120be-9695-4824-82fd-142a0df13098„´precedence_heuristic §cell_idÙ$05f120be-9695-4824-82fd-142a0df13098´downstream_cells_mapÙNactor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions“Ù$08505e88-9c23-4e95-91e3-d18bf5133dbcÙ$87482ea5-5265-4e02-92c0-1a8bb44ff0f4Ù$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7d²upstream_cells_mapÞµbinary_value_function‘Ù$a540814a-57a1-4b98-9443-59e401425444Ù/setup_binary_squashed_gaussian_policy_arguments‘Ù$4e29c621-223e-4859-8e96-db04b967815a¿make_n_param_dist_policy_params‘Ù$ba41f521-4ee2-42a6-bf18-078bfa4b875e¥zeros³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¤rand§Integer¦VectorContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¤Real¨FunctionÙ,update_squashed_gaussian_eligibility_vector!’Ù$6bf5ad39-1400-4e1f-a843-a1934b8aaa48Ù$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690c½update_binary_value_gradient!‘Ù$03a218cb-aa83-4000-85b5-c6f247087053¦NTuple¥Union¾make_squashed_gaussian_sampler‘Ù$7a6f3f79-ea06-4994-8b62-90b2056e4034¦MatrixÙ%actor_critic_with_eligibility_traces!”Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90fÙ!update_binary_action_preferences!‘Ù$a361f4c9-47ce-42ad-899c-87b611c0d471Ù$b2539398-fdbc-42a2-a8f3-d327358f3643„´precedence_heuristic 
§cell_idÙ$b2539398-fdbc-42a2-a8f3-d327358f3643´downstream_cells_map€²upstream_cells_mapˆÙ'cartpole_continuing_binary_study_params‘Ù$8e742d32-c074-4981-b35b-b596b64c869b§@md_str¡<Ù,start_cartpole_continuing_binary_param_study‘Ù$37a273b6-b104-46f0-987a-401dc1c97327¡>Ù*cartpole_binary_continuing_parameter_study‘Ù$1b102220-6d78-480d-a77f-0e57bad23dca¦isless¨getindexÙ$c5dd7e99-57e0-4bc7-97d2-2c780b23bcff„´precedence_heuristic §cell_idÙ$c5dd7e99-57e0-4bc7-97d2-2c780b23bcff´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9„´precedence_heuristic §cell_idÙ$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9´downstream_cells_mapÙ!mountaincar_continuing_fcann_test“Ù$10ee7709-0816-48d2-abe0-9be3dd04700fÙ$c0876a48-ea18-494d-8bfc-e2bceb73b417Ù$3a37b53d-9174-4faa-9404-74a40c385b0a²upstream_cells_mapƒÙ*actor_critic_with_eligibility_traces_fcann‘Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54·mountaincar_fcann_setup‘Ù$023f67b8-8f38-470a-9766-ac60a75678aaºmountaincar_continuing_mdp‘Ù$46fea69b-599e-46ab-8455-d2da865d9a8eÙ$042fbafe-2401-4fb7-ac13-4531e0782c79„´precedence_heuristic §cell_idÙ$042fbafe-2401-4fb7-ac13-4531e0782c79´downstream_cells_mapÙ!update_binary_eligibility_vector!”Ù$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290Ù$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbÙ$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2Ù$05bfd818-bf4e-4bda-baa9-5ba647867097²upstream_cells_mapˆ¤Real¦MatrixÙ!update_binary_action_preferences!‘Ù$a361f4c9-47ce-42ad-899c-87b611c0d471³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41©soft_max!‘Ù$33c99850-67cd-4754-94b9-6df97b238e27·BinaryEligibilityVector‘Ù$41dc149d-c6f3-4b0d-a856-06f3aaae3049§Integer¦VectorÙ$d57375a5-b9e0-4742-b5f7-6a7da891604a„´precedence_heuristic 
§cell_idÙ$d57375a5-b9e0-4742-b5f7-6a7da891604a´downstream_cells_mapÙ-mountaincar_binary_continuing_parameter_study‘Ù$04f42c09-8ab5-4233-b196-51c4aa2dcedb²upstream_cells_mapƒ¼mountaincar_tilecoding_setup‘Ù$7c592385-e8d3-4efe-962c-d39debb64405ºmountaincar_continuing_mdp‘Ù$46fea69b-599e-46ab-8455-d2da865d9a8eÙ#actor_critic_linear_parameter_study“Ù$734573e5-547b-4dcc-89bb-412aa6cc42d6Ù$e96d592d-1e54-486d-8ad9-b857f85476e8Ù$ff4f977e-48df-4c12-845c-c245b4d39d6dÙ$07ad517a-c2ac-4377-99fb-adb13d0f1d0c„´precedence_heuristic §cell_idÙ$07ad517a-c2ac-4377-99fb-adb13d0f1d0c´downstream_cells_map€²upstream_cells_map„Ù#reinforce_monte_carlo_control_fcann‘Ù$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabe¡^¹update_corridor_features!‘Ù$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811Ù$71a5fce8-6d9a-4625-bad1-a951d61bff28„´precedence_heuristic §cell_idÙ$71a5fce8-6d9a-4625-bad1-a951d61bff28´downstream_cells_mapÙ$mountaincar_binary_continuous_params‘Ù$b53dba81-a9e9-41da-8fc2-7736bf25f2dc²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunner¯Core.applicable¥@bind¨Base.get½create_actor_critic_params_UI‘Ù$a8b40b8f-051a-4e6f-a079-ece4f32873deÙ$77906355-08f8-4b08-b051-84697199b519„´precedence_heuristic §cell_idÙ$77906355-08f8-4b08-b051-84697199b519´downstream_cells_map´mountaincar_max_vals’Ù$023f67b8-8f38-470a-9766-ac60a75678aaÙ$7c592385-e8d3-4efe-962c-d39debb64405²upstream_cells_map€Ù$5207308e-f636-4d47-b135-036a6e7b8ecd„´precedence_heuristic §cell_idÙ$5207308e-f636-4d47-b135-036a6e7b8ecd´downstream_cells_map€²upstream_cells_map‚Ù&show_mountaincar_continuous_trajectory‘Ù$b5319d8b-0420-4ebf-b603-ea0b93365ac1Ù"mountaincar_continuous_test_train3‘Ù$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7dÙ$16113560-e911-47b4-abc4-641bbd246454„´precedence_heuristic 
§cell_idÙ$16113560-e911-47b4-abc4-641bbd246454´downstream_cells_map€²upstream_cells_mapƒÙ&mountaincar_continuous_test_train_beta‘Ù$4156d955-9daf-4429-b152-e8332980fb9e¤plot¦LayoutÙ$b7f77935-bcab-4ef1-8e1b-a7d059784ff3„´precedence_heuristic §cell_idÙ$b7f77935-bcab-4ef1-8e1b-a7d059784ff3´downstream_cells_map¶test_mountaincar_state‘Ù$f8215517-b18f-4a03-9421-8edab4ca8089²upstream_cells_map§@md_str¤Core¡:§PlutoUI‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70¨Base.get¥@bind¦Slider¤Base«PlutoRunner·PlutoRunner.create_bond¯Core.applicable¯PlutoUI.combine¨getindexÙ$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5„´precedence_heuristic §cell_idÙ$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5´downstream_cells_map€²upstream_cells_map„¿reinforce_test.policy_and_valuecartpole_mdps‘Ù$024dcd1a-8eaa-4a95-8037-2f578828309c®reinforce_test‘Ù$24fa139c-ad4b-49db-ac8f-23c476ed8608Ù2cartpole_mdps.episodic.continuous.initialize_stateÙ$00bd2835-b006-4244-9877-bc7e031e3ef8„´precedence_heuristic §cell_idÙ$00bd2835-b006-4244-9877-bc7e031e3ef8´downstream_cells_map¶plot_squashed_gaussian’Ù$3e7cecec-eb77-4862-8e3c-b510422e06dbÙ$ff3009eb-23f9-44fe-8e56-85dbc7b463d0²upstream_cells_mapˆµsquashed_gaussian_pdf‘Ù$b16899b7-36bf-4a5e-8e2f-4496b8450687¡-¤plot¨LinRange¡*£one¦Layout¤RealÙ$50ae94c4-70f3-4215-82bd-eb2227c2badf„´precedence_heuristic §cell_idÙ$50ae94c4-70f3-4215-82bd-eb2227c2badf´downstream_cells_map€²upstream_cells_map‰§@md_str¡<Ù(cartpole_continuing_fcann_network_params‘Ù$42d4600a-bf3c-45ac-b7f5-d23917713ff5Ù)cartpole_fcann_continuing_parameter_study‘Ù$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62¡>Ù+start_cartpole_continuing_fcann_param_study‘Ù$2c5d221a-2469-49e1-9249-dfdc2457f2fa¦islessÙ&cartpole_continuing_fcann_study_params‘Ù$5ffc271f-c73f-494a-9727-8d7516af2191¨getindexÙ$cc3ac95e-a398-438a-ba3d-62b6733f6342„´precedence_heuristic §cell_idÙ$cc3ac95e-a398-438a-ba3d-62b6733f6342´downstream_cells_mapÙ 
update_fcann_action_preferences!‘Ù$0e9de19e-bcd4-40ac-9831-afb6cad38422²upstream_cells_mapˆ¹FCANN.forwardNOGRAD_base!£end°FCANNActivations‘Ù$5c11a92d-7496-4aba-af15-2537eac49dd7§Float32¥FCANN‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84«FCANNParams§Integer¦VectorÙ$c926b6df-c40b-4c4c-8a95-ce9e41feb100„´precedence_heuristic §cell_idÙ$c926b6df-c40b-4c4c-8a95-ce9e41feb100´downstream_cells_map€²upstream_cells_map€Ù$740a3f41-9302-481d-b373-762c0dea8eff„´precedence_heuristic §cell_idÙ$740a3f41-9302-481d-b373-762c0dea8eff´downstream_cells_mapÙ#update_gaussian_eligibility_vector!“Ù$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00Ù$d5020a8d-1dd7-403c-9d1f-665b95543943Ù$20776e09-7d9b-4db8-a060-7bceeec65b47²upstream_cells_mapŒ£exp¡:¡k¥first³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¦Vector¤Real¿BinaryGaussianEligibilityVector‘Ù$10cdd16e-a337-4421-a7a0-6de4e4b60c0f¦Matrix¡+¦NTuple¤lastÙ$ba642a22-6623-482a-ab4a-81585b83e457„´precedence_heuristic §cell_idÙ$ba642a22-6623-482a-ab4a-81585b83e457´downstream_cells_map‚Ù$##average_continuing_runs_unmemoized·average_continuing_runs”Ù$734573e5-547b-4dcc-89bb-412aa6cc42d6Ù$ff4f977e-48df-4c12-845c-c245b4d39d6dÙ$8bc280db-e57d-4e40-be46-1790f4f7d9e7Ù$11063fff-4d36-46d5-828f-dbed0f46b9cf²upstream_cells_mapŽ¡:¨@memoize¦Random‘Ù$df7f84e8-b42a-4001-9dbf-6bc3ced94207¢|>¦empty!§Integer¤Real¨deepcopy¡/¦foldxt¡+£Map¤get!¬Random.seed!Ù$d17a4bd0-5992-4247-912d-73d51758d2f3„´precedence_heuristic §cell_idÙ$d17a4bd0-5992-4247-912d-73d51758d2f3´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7ef„´precedence_heuristic 
§cell_idÙ$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7ef´downstream_cells_map€²upstream_cells_map„Ù&cartpole_fcann_continuing_test_episode‘Ù$64b38d1f-ecf9-4843-89a1-4c8953048265´plot_cartpole_policy‘Ù$602a07dd-8928-4b44-97e5-01c5cbf38351Ù-cartpole_fcann_continuing_episode_step_select‘Ù$6acb549a-5d90-4457-a347-d22448ad8071¾cartpole_continuing_fcann_test‘Ù$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27Ù$5ee4ce72-7740-4297-8d84-619e0708e4ac„´precedence_heuristic §cell_idÙ$5ee4ce72-7740-4297-8d84-619e0708e4ac´downstream_cells_mapÙ)cartpole_continuing_fcann_parameter_study²upstream_cells_mapŽ®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6¡:Ù*actor_critic_with_eligibility_traces_fcann‘Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54¢|>¶setup_cartpole_problem§scatter¤plot¡/¦foldxt¡+¼cartpole_fcann_feature_setup‘Ù$61650a97-b353-4a85-b50b-93fee296ac7b£MapÙ3cartpole_fcann_feature_setup.update_feature_vector!¦LayoutÙ$645e93e7-e92e-49c4-9757-8294fabf4e9b„´precedence_heuristic §cell_idÙ$645e93e7-e92e-49c4-9757-8294fabf4e9b´downstream_cells_map€²upstream_cells_map‚¸cartpole_continuing_test‘Ù$3c89209c-9202-4d5d-841c-ea34be369616¼plot_continuing_step_rewards‘Ù$0964133c-3a5b-433b-a8c4-a97813c37583Ù$0c56b341-24eb-4c78-844e-182f44a7221a„´precedence_heuristic §cell_idÙ$0c56b341-24eb-4c78-844e-182f44a7221a´downstream_cells_map€²upstream_cells_map‚¡^«figure_13_1‘Ù$d037ea92-915c-4dc7-97c6-d006d92e088aÙ$d34d22ad-89c2-423e-91dd-bfb895dc6540„´precedence_heuristic §cell_idÙ$d34d22ad-89c2-423e-91dd-bfb895dc6540´downstream_cells_map¾cartpole_fcann_parameter_study²upstream_cells_map„Ù+actor_critic_fcann_episodic_parameter_study‘Ù$f8614042-7c94-4d47-a1b6-4e96676b4e8b®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6·cartpole_vector_update!‘Ù$192b9f82-8d3a-408f-91c2-829cfcd32572¼cartpole_fcann_feature_setup‘Ù$61650a97-b353-4a85-b50b-93fee296ac7bÙ$20776e09-7d9b-4db8-a060-7bceeec65b47„´precedence_heuristic 
§cell_idÙ$20776e09-7d9b-4db8-a060-7bceeec65b47´downstream_cells_mapÙEactor_critic_with_eligibility_traces_binary_features_gaussian_actions”Ù$55ba8725-0ddf-4196-a41d-3f3c490a8d84Ù$61949faa-8174-4b7b-8fbc-01d5f850b419Ù$b8532822-179b-4cd5-a279-4b71dafb544aÙ$fee14dfe-c5ca-4126-a830-cc9d7eda5433²upstream_cells_mapÞµbinary_value_function‘Ù$a540814a-57a1-4b98-9443-59e401425444¿make_n_param_dist_policy_params‘Ù$ba41f521-4ee2-42a6-bf18-078bfa4b875e¥zeros³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¤rand§Integerµmake_gaussian_sampler‘Ù$bba13634-ff0e-47f7-a23b-8d56098f4ac6¦VectorContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295Ù#update_gaussian_eligibility_vector!’Ù$5261651e-a51e-4e80-8e23-83a4c10e5259Ù$740a3f41-9302-481d-b373-762c0dea8eff¤Real¨FunctionÙ&setup_binary_gaussian_policy_arguments‘Ù$ba5d6311-daee-4abc-b2fb-fae2184ef3eb½update_binary_value_gradient!‘Ù$03a218cb-aa83-4000-85b5-c6f247087053¦Matrix¦NTuple¥UnionÙ!update_binary_action_preferences!‘Ù$a361f4c9-47ce-42ad-899c-87b611c0d471Ù%actor_critic_with_eligibility_traces!”Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90fÙ$7856b8a0-565d-4c86-9b3c-4424ff9b86dd„´precedence_heuristic §cell_idÙ$7856b8a0-565d-4c86-9b3c-4424ff9b86dd´downstream_cells_map€²upstream_cells_map€Ù$735b548a-88f5-4a30-ab8f-dfb3d6401b2b„´precedence_heuristic §cell_idÙ$735b548a-88f5-4a30-ab8f-dfb3d6401b2b´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$7cf26604-9c2b-4a77-9674-7d4dac2f99f0„´precedence_heuristic§cell_idÙ$7cf26604-9c2b-4a77-9674-7d4dac2f99f0´downstream_cells_map€²upstream_cells_mapƒ¨joinpath¨@__DIR__§includeÙ$87ee21f3-16ca-4c8c-a0b9-f9e2fd258a91„´precedence_heuristic §cell_idÙ$87ee21f3-16ca-4c8c-a0b9-f9e2fd258a91´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$54f1546d-87ae-49d2-92ed-6fcc9b66e027„´precedence_heuristic 
§cell_idÙ$54f1546d-87ae-49d2-92ed-6fcc9b66e027´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$63fbf8f4-e4e2-4893-be09-67450e92dbd7„´precedence_heuristic §cell_idÙ$63fbf8f4-e4e2-4893-be09-67450e92dbd7´downstream_cells_map©plot_cart“Ù$fad02876-efba-46a7-9cb7-43820528779fÙ$374af774-3a97-49b5-a3bb-bc3f7f63a3faÙ$1ce4bc6c-7cde-48e9-8ff1-7281697fd121²upstream_cells_mapCartPoleState·HypertextLiteral.Bypass©indicator¸HypertextLiteral.content¤@htl‘Ù$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb¥Int64¡-§scatter¤plot·HypertextLiteral.Result°HypertextLiteral‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70¤attr£cos¦Layout£sinÙ$d5020a8d-1dd7-403c-9d1f-665b95543943„´precedence_heuristic §cell_idÙ$d5020a8d-1dd7-403c-9d1f-665b95543943´downstream_cells_mapÙLreinforce_with_baseline_monte_carlo_control_linear_features_gaussian_actions²upstream_cells_mapÞÙ,reinforce_with_baseline_monte_carlo_control!’Ù$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$5b868eba-c1af-49f6-8f93-79b78c319a6f¿make_n_param_dist_policy_params‘Ù$ba41f521-4ee2-42a6-bf18-078bfa4b875e¥zeros¤rand§Integerµmake_gaussian_sampler‘Ù$bba13634-ff0e-47f7-a23b-8d56098f4ac6½update_linear_value_gradient!‘Ù$1753b5ed-c00b-4b60-b492-822180778e8c¤copyÙ#update_gaussian_eligibility_vector!’Ù$5261651e-a51e-4e80-8e23-83a4c10e5259Ù$740a3f41-9302-481d-b373-762c0dea8eff¦VectorContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¤Real¨Function¦MatrixÙ!update_linear_action_preferences!‘Ù$581f7e9b-a5c2-4841-9605-85f9585b0274¦NTuple¥Unionµlinear_value_function‘Ù$0bf3b988-b3fb-49d5-8dde-b25766596363´make_gaussian_paramsÙ$37a8ef7e-e859-4ef0-81e2-76c02a324031„´precedence_heuristic §cell_idÙ$37a8ef7e-e859-4ef0-81e2-76c02a324031´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$98229733-a71e-44ca-a52a-b7229cf8b422„´precedence_heuristic §cell_idÙ$98229733-a71e-44ca-a52a-b7229cf8b422´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$42775fd1-5b27-48e0-abf1-9b22bb775e6d„´precedence_heuristic 
§cell_idÙ$42775fd1-5b27-48e0-abf1-9b22bb775e6d´downstream_cells_map€²upstream_cells_map‚Ù#corridor_continuing_parameter_study‘Ù$7afb6fb0-248a-4518-b94f-9876f81eca64·continuing_study_params‘Ù$7d94922e-dc9f-4953-b539-24aaa2c85b12Ù$7dbb42a3-aa8c-47e5-b668-18e6325d4038„´precedence_heuristic §cell_idÙ$7dbb42a3-aa8c-47e5-b668-18e6325d4038´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$192b9f82-8d3a-408f-91c2-829cfcd32572„´precedence_heuristic §cell_idÙ$192b9f82-8d3a-408f-91c2-829cfcd32572´downstream_cells_map·cartpole_vector_update!“Ù$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62Ù$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27Ù$d34d22ad-89c2-423e-91dd-bfb895dc6540²upstream_cells_map…¤RealCartPoleState¼cartpole_fcann_feature_setup‘Ù$61650a97-b353-4a85-b50b-93fee296ac7bÙ3cartpole_fcann_feature_setup.update_feature_vector!¦VectorÙ$b5319d8b-0420-4ebf-b603-ea0b93365ac1„´precedence_heuristic §cell_idÙ$b5319d8b-0420-4ebf-b603-ea0b93365ac1´downstream_cells_mapÙ&show_mountaincar_continuous_trajectory”Ù$c87dba8c-9a96-41b3-9dc7-a6c088ec1eafÙ$cd9c9eeb-c90d-4499-9503-7773d5250f47Ù$5207308e-f636-4d47-b135-036a6e7b8ecdÙ$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8f²upstream_cells_map£sum·HypertextLiteral.Bypass¸HypertextLiteral.content¤@htl‘Ù$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb§Integer¨Functionºmountaincar_continuous_mdp‘Ù$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2§scatter¤plot·HypertextLiteral.Result°HypertextLiteral‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70¦Layoutªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$4cbdb082-22ba-49e9-a6ed-4380917625ac„´precedence_heuristic §cell_idÙ$4cbdb082-22ba-49e9-a6ed-4380917625ac´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$cc80848a-6834-4272-9152-e17b45448814„´precedence_heuristic 
§cell_idÙ$cc80848a-6834-4272-9152-e17b45448814´downstream_cells_map«wind_speeds²upstream_cells_map‰¡:·HypertextLiteral.Bypass·HypertextLiteral.Result°HypertextLiteral‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70¯PlutoUI.combine§PlutoUI‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70¸HypertextLiteral.content¤@htl‘Ù$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb¦SliderÙ$05bfd818-bf4e-4bda-baa9-5ba647867097„´precedence_heuristic §cell_idÙ$05bfd818-bf4e-4bda-baa9-5ba647867097´downstream_cells_mapÙ4actor_critic_with_eligibility_traces_binary_features›Ù$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728Ù$396e0047-d848-462f-a769-0cc2829abc78Ù$1f041cb3-618c-4380-a1ec-d7bbe4a80f62Ù$72273f27-d0b9-4645-a609-cb65cc9332eeÙ$734573e5-547b-4dcc-89bb-412aa6cc42d6Ù$ff4f977e-48df-4c12-845c-c245b4d39d6dÙ$8b35661b-5075-4d63-bc31-044407f99acfÙ$3c89209c-9202-4d5d-841c-ea34be369616Ù$b02ba928-5b9f-4695-b980-07988c788bb9Ù$dca2f8e2-76af-4679-bf81-3824c15fc76dÙ$6d0925d3-af96-4b94-8e2e-4941cce39e51²upstream_cells_mapµbinary_value_function‘Ù$a540814a-57a1-4b98-9443-59e401425444½setup_binary_policy_arguments‘Ù$96506201-6b66-49e6-8179-06952e2394e1¥zeros³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦Vector¤Real¨FunctionÙ!update_binary_eligibility_vector!‘Ù$042fbafe-2401-4fb7-ac13-4531e0782c79¦length½update_binary_value_gradient!‘Ù$03a218cb-aa83-4000-85b5-c6f247087053¦MatrixÙ!update_binary_action_preferences!‘Ù$a361f4c9-47ce-42ad-899c-87b611c0d471Ù%actor_critic_with_eligibility_traces!”Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90fÙ$f0962801-0dfa-421f-8ffc-e64068e49913„´precedence_heuristic 
§cell_idÙ$f0962801-0dfa-421f-8ffc-e64068e49913´downstream_cells_map¿mountaincar_fcann_feature_setup‘Ù$c251a630-7114-4188-9323-8d8feb5c32e0²upstream_cells_mapºfcann_feature_vector_setup‘Ù$9acdbf38-2e10-45ec-85a0-d0db8453a599Ù$11a55af7-5301-4507-bb26-88e1e11236db„´precedence_heuristic §cell_idÙ$11a55af7-5301-4507-bb26-88e1e11236db´downstream_cells_map€²upstream_cells_map…®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6¸display_cartpole_episode‘Ù$822e4d69-2582-4956-858e-06ecb091e76a¢|>¯reinforce_test3‘Ù$dca2f8e2-76af-4679-bf81-3824c15fc76dªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$ddbca73f-c692-46f2-95f3-a7dd849d33f7„´precedence_heuristic §cell_idÙ$ddbca73f-c692-46f2-95f3-a7dd849d33f7´downstream_cells_map€²upstream_cells_map‚»show_mountaincar_trajectory‘Ù$ba645f6b-143f-4e83-9003-707770ae308d¶mountaincar_test_train‘Ù$6d0925d3-af96-4b94-8e2e-4941cce39e51Ù$b4875f2b-5487-429f-80a3-d1032bbccfc1„´precedence_heuristic §cell_idÙ$b4875f2b-5487-429f-80a3-d1032bbccfc1´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$0cd96c44-cae6-421f-9fae-26141600bef4„´precedence_heuristic §cell_idÙ$0cd96c44-cae6-421f-9fae-26141600bef4´downstream_cells_map€²upstream_cells_map…®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6¸display_cartpole_episode‘Ù$822e4d69-2582-4956-858e-06ecb091e76a¢|>¸cartpole_continuing_test‘Ù$3c89209c-9202-4d5d-841c-ea34be369616ªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$51d6337d-c0bd-40a9-9129-7d88e41e4093„´precedence_heuristic §cell_idÙ$51d6337d-c0bd-40a9-9129-7d88e41e4093´downstream_cells_map€²upstream_cells_map€Ù$5859ca11-90f8-4fd6-88ed-c56efe796fe8„´precedence_heuristic 
§cell_idÙ$5859ca11-90f8-4fd6-88ed-c56efe796fe8´downstream_cells_map€²upstream_cells_map…®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6¸display_cartpole_episode‘Ù$822e4d69-2582-4956-858e-06ecb091e76a¢|>ªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¯reinforce_test2‘Ù$d3b56fca-5b79-4465-8987-8d0005f854d8Ù$3ea08816-705e-4be7-a175-dbd3f3e4c17d„´precedence_heuristic §cell_idÙ$3ea08816-705e-4be7-a175-dbd3f3e4c17d´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$f3e2db06-9cb7-464a-96b8-938175efd26b„´precedence_heuristic §cell_idÙ$f3e2db06-9cb7-464a-96b8-938175efd26b´downstream_cells_map»setup_fcann_value_arguments‘Ù$e1aec891-d95a-47d1-97d7-d2a4cfb16e64²upstream_cells_mapÞ´fcann_value_function‘Ù$635abb34-2c97-4f04-a74c-22fbec32f408£one³scale_fcann_params!‘Ù$77cf3a74-899f-4ade-99f2-5aaf7a98c02d¹FCANN.makeorthonormalrand¦Vector¦lengthªNamedTuple¤Real©eachindex¨deepcopy¡/¡^¤last¢==¡:¥FCANN‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¥zeros¤Bool§Integer£end¥Int64¶FCANN.form_activations¡+¡*¼update_fcann_value_gradient!‘Ù$5c4a383f-fcf2-4f2b-819f-6d84471dda00Ù$b2082ab0-73a4-45a6-8772-a2e6e22b519a„´precedence_heuristic §cell_idÙ$b2082ab0-73a4-45a6-8772-a2e6e22b519a´downstream_cells_mapƒ³make_beta_n_sampler±make_beta_sampler‘Ù$3e3c5897-809f-46e3-bb58-f115b082443e³beta_action_sampler²upstream_cells_map¤zero£exp£max¥isnan¦isless¤rand§Integer¦Vector£eps¤Real¤Beta£Val¡+¦NTuple¦ntupleÙ$a361f4c9-47ce-42ad-899c-87b611c0d471„´precedence_heuristic 
§cell_idÙ$a361f4c9-47ce-42ad-899c-87b611c0d471´downstream_cells_mapÙ!update_binary_action_preferences!™Ù$042fbafe-2401-4fb7-ac13-4531e0782c79Ù$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290Ù$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbÙ$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2Ù$05bfd818-bf4e-4bda-baa9-5ba647867097Ù$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00Ù$20776e09-7d9b-4db8-a060-7bceeec65b47Ù$3e3c5897-809f-46e3-bb58-f115b082443eÙ$05f120be-9695-4824-82fd-142a0df13098²upstream_cells_mapÞ¤zero¡:¦isless³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41®julia.simdloop©@inbounds§nothing¦Vector¡<µBase.simd_outer_range©eachindex¤Real¶Base.simd_inner_length¥@simd¤Base¯Base.simd_index¡+¦MatrixÙ$46fea69b-599e-46ab-8455-d2da865d9a8e„´precedence_heuristic §cell_idÙ$46fea69b-599e-46ab-8455-d2da865d9a8e´downstream_cells_mapºmountaincar_continuing_mdp”Ù$d57375a5-b9e0-4742-b5f7-6a7da891604aÙ$b02ba928-5b9f-4695-b980-07988c788bb9Ù$c251a630-7114-4188-9323-8d8feb5c32e0Ù$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9²upstream_cells_mapÙ!create_mountaincar_continuing_mdp‘Ù$00152954-dc98-4120-b94b-2ea4d987832bÙ$bfe7e41d-6318-4bd4-b892-287831876abc„´precedence_heuristic §cell_idÙ$bfe7e41d-6318-4bd4-b892-287831876abc´downstream_cells_map¿update_beta_eligibility_vector!‘Ù$3e3c5897-809f-46e3-bb58-f115b082443e²upstream_cells_mapÞ¤zero¦isless§digamma©@inbounds£one§nothing¦Vector¡<¯Base.simd_index©eachindex¤Real¥@simd¦Matrix¦NTuple¤last£exp¡:¥first®julia.simdloop¤BaseµBase.simd_outer_range¡-£log¶Base.simd_inner_length¡+¡*Ù$c251a630-7114-4188-9323-8d8feb5c32e0„´precedence_heuristic 
§cell_idÙ$c251a630-7114-4188-9323-8d8feb5c32e0´downstream_cells_mapÙ,mountaincar_fcann_continuing_parameter_study‘Ù$cb70d400-3e9c-441c-b17c-e727e8c928f3²upstream_cells_map…ºmountaincar_continuing_mdp‘Ù$46fea69b-599e-46ab-8455-d2da865d9a8e¤fill¿mountaincar_fcann_feature_setup‘Ù$f0962801-0dfa-421f-8ffc-e64068e49913§IntegerÙ"actor_critic_fcann_parameter_study“Ù$8bc280db-e57d-4e40-be46-1790f4f7d9e7Ù$5aba4f96-e877-457e-8e95-18737348f99fÙ$11063fff-4d36-46d5-828f-dbed0f46b9cfÙ$af144759-fe66-4ad0-b378-e9eb4e859db4„´precedence_heuristic §cell_idÙ$af144759-fe66-4ad0-b378-e9eb4e859db4´downstream_cells_map€²upstream_cells_map„¯reinforce_test4‘Ù$407a0724-4bb6-4c83-ab2d-17a0e19c4072¢ep‘Ù$e1274f57-75cb-4659-a82f-e5870c5367e2§ep_step‘Ù$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547d´plot_cartpole_policy‘Ù$602a07dd-8928-4b44-97e5-01c5cbf38351Ù$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2„´precedence_heuristic §cell_idÙ$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2´downstream_cells_mapºmountaincar_continuous_mdp•Ù$b53dba81-a9e9-41da-8fc2-7736bf25f2dcÙ$b8532822-179b-4cd5-a279-4b71dafb544aÙ$0d93132d-5819-47dc-8cf2-462d480d9c3dÙ$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7dÙ$b5319d8b-0420-4ebf-b603-ea0b93365ac1²upstream_cells_mapÙ$create_continuous_action_mountaincar‘Ù$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbfÙ$fb8904a9-ae64-41cc-93b6-5a25855edad0„´precedence_heuristic §cell_idÙ$fb8904a9-ae64-41cc-93b6-5a25855edad0´downstream_cells_mapºget_corridor_episode_stats”Ù$a019925a-460a-410e-a54b-50a4cfe0e90eÙ$e1493cea-19c4-475d-98a0-86d27fb04af1Ù$573878bb-020d-40f6-9329-3d5f91843010Ù$553b0ceb-f2ca-41ee-99bc-9f53a4487b49²upstream_cells_mapŽ¡:¥first¦isless¢|>¤rand¤Real¦length¡<¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabe¡/¦foldxt¡+£Mapªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$a5b002c9-5e11-462a-9da0-6e060c7963f8„´precedence_heuristic 
§cell_idÙ$a5b002c9-5e11-462a-9da0-6e060c7963f8´downstream_cells_map£ep2“Ù$9bce6fdb-2cbc-4758-9a8b-794e490c973dÙ$1ce4bc6c-7cde-48e9-8ff1-7281697fd121Ù$bb1ef180-39ac-475f-beea-ef573e71a3bf²upstream_cells_map„CartPoleState®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6¯reinforce_test5‘Ù$82e0e9a0-9662-429a-87e3-e6bdae02709aªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1„´precedence_heuristic §cell_idÙ$83640f5b-fe13-4ec1-98a0-67a56c189ba1´downstream_cells_mapÙ%actor_critic_with_eligibility_traces!–Ù$05bfd818-bf4e-4bda-baa9-5ba647867097Ù$68806899-9972-460a-9f11-daa708a9d610Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54Ù$20776e09-7d9b-4db8-a060-7bceeec65b47Ù$3e3c5897-809f-46e3-bb58-f115b082443eÙ$05f120be-9695-4824-82fd-142a0df13098²upstream_cells_mapÞ¤zero¬zero_params!‘Ù$e6cf9550-2e69-4b82-92cf-5e07a35490aa¼update_traces_with_gradient!’Ù$25be5dcf-be63-46c4-b6de-6cf79fa28fd0Ù$056a8adc-92f4-4b33-90d9-4b3b4026bbbc£one¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦lengthsample_action‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦Vector¤Real¨deepcopy¡/¼update_params_with_gradient!“Ù$b0a66a19-ee76-463b-a704-8fcee85444d0Ù$a893a87b-2d07-4db5-9d1a-9da8646216f4Ù$f55afa58-962d-4551-8d95-a5b467d61adf¥error¥zeros©soft_max!‘Ù$33c99850-67cd-4754-94b9-6df97b238e27Ù&form_state_and_policy_function_outputs’Ù$e7e49ff8-32df-48a4-afb2-462859592e92Ù$11b9beea-b0cd-45eb-84c6-151728894df0§Integer¨Function¢<=¥push!¡-¡+¡*Ù$61650a97-b353-4a85-b50b-93fee296ac7b„´precedence_heuristic 
§cell_idÙ$61650a97-b353-4a85-b50b-93fee296ac7b´downstream_cells_map¼cartpole_fcann_feature_setup—Ù$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62Ù$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27Ù$192b9f82-8d3a-408f-91c2-829cfcd32572Ù$d34d22ad-89c2-423e-91dd-bfb895dc6540Ù$407a0724-4bb6-4c83-ab2d-17a0e19c4072Ù$5ee4ce72-7740-4297-8d84-619e0708e4acÙ$82e0e9a0-9662-429a-87e3-e6bdae02709a²upstream_cells_map‚®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6ºfcann_feature_vector_setup‘Ù$9acdbf38-2e10-45ec-85a0-d0db8453a599Ù$602a07dd-8928-4b44-97e5-01c5cbf38351„´precedence_heuristic §cell_idÙ$602a07dd-8928-4b44-97e5-01c5cbf38351´downstream_cells_map´plot_cartpole_policy•Ù$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2Ù$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7efÙ$fd58402f-da65-44cf-b81a-e21192fd0e63Ù$af144759-fe66-4ad0-b378-e9eb4e859db4Ù$d4e87ac4-6008-43b2-aa06-e232ec2b2b5b²upstream_cells_mapÞCartPoleState¡:¤vcat·HypertextLiteral.Bypass¨LinRange¥zeros¸HypertextLiteral.content¤@htl‘Ù$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb¨Function£bar§scatter§Float32¤plot°HypertextLiteral‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70§heatmap·HypertextLiteral.Result¦LayoutÙ$f7433324-acc3-49a5-b5b3-ada0c8f09d52„´precedence_heuristic §cell_idÙ$f7433324-acc3-49a5-b5b3-ada0c8f09d52´downstream_cells_map€²upstream_cells_map‚¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$0c9986bb-54c0-4b08-9c29-4bfb0b68b54e„´precedence_heuristic §cell_idÙ$0c9986bb-54c0-4b08-9c29-4bfb0b68b54e´downstream_cells_map»collect_state_distributions’Ù$54f559b6-8a62-4a42-894d-c56e41d5ebefÙ$16fcc2d0-9f2f-4226-9dcc-6d86248cab26²upstream_cells_mapÞ ¤zero¡>¦isless©@inbounds£one¤view§nothing¤copy¡<¯Base.simd_index¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabe©eachindex¤Real¥@simd¡/¢==ªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¡:¢|>¥zeros®julia.simdloop¤size¤rand§Integer¤Base¢<=µBase.simd_outer_range¶Base.simd_inner_length¦foldxt¡+¥Array£MapÙ$6d0925d3-af96-4b94-8e2e-4941cce39e51„´precedence_heuristic 
§cell_idÙ$6d0925d3-af96-4b94-8e2e-4941cce39e51´downstream_cells_map¶mountaincar_test_train’Ù$dc2efc6c-8da8-425b-aa5f-290949109565Ù$ddbca73f-c692-46f2-95f3-a7dd849d33f7²upstream_cells_map…¥Int64Ù4actor_critic_with_eligibility_traces_binary_features‘Ù$05bfd818-bf4e-4bda-baa9-5ba647867097¼mountaincar_tilecoding_setup‘Ù$7c592385-e8d3-4efe-962c-d39debb64405¯MountainCarTask§typemaxÙ$6bb0263e-368e-462a-948c-baf9cfa82512„´precedence_heuristic §cell_idÙ$6bb0263e-368e-462a-948c-baf9cfa82512´downstream_cells_mapµget_corridor_featuresÜÙ$f2f2dd1d-180c-4d36-b515-5079d129f93aÙ$e1493cea-19c4-475d-98a0-86d27fb04af1Ù$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72cÙ$d037ea92-915c-4dc7-97c6-d006d92e088aÙ$f2ed56c9-c2b7-42cb-a083-e12aeaa126efÙ$cbea5840-49d2-4e91-be9c-f5f15666d78aÙ$83ca0577-15d7-4448-b597-c77810b812bfÙ$e5c1aca8-7575-4835-8273-e69ca0a55fe8Ù$7d63b960-3998-4f7b-8cbb-ccd49db9aeacÙ$646bc853-b7fc-49fa-a201-ff98e8f952d4Ù$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728Ù$396e0047-d848-462f-a769-0cc2829abc78Ù$bc8a399b-8864-4473-89d2-e3b0a03d15b5Ù$72273f27-d0b9-4645-a609-cb65cc9332eeÙ$7afb6fb0-248a-4518-b94f-9876f81eca64Ù$8b35661b-5075-4d63-bc31-044407f99acf²upstream_cells_map¡:Ù$72273f27-d0b9-4645-a609-cb65cc9332ee„´precedence_heuristic §cell_idÙ$72273f27-d0b9-4645-a609-cb65cc9332ee´downstream_cells_map€²upstream_cells_map„¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeµget_corridor_features‘Ù$6bb0263e-368e-462a-948c-baf9cfa82512¡^Ù4actor_critic_with_eligibility_traces_binary_features‘Ù$05bfd818-bf4e-4bda-baa9-5ba647867097Ù$87482ea5-5265-4e02-92c0-1a8bb44ff0f4„´precedence_heuristic 
§cell_idÙ$87482ea5-5265-4e02-92c0-1a8bb44ff0f4´downstream_cells_mapÙ@actor_critic_binary_continuing_squashed_gaussian_parameter_study²upstream_cells_mapÞ¦Random‘Ù$df7f84e8-b42a-4001-9dbf-6bc3ced94207ContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¤copy¦Vector¤Real§scatter¡/¦MatrixÙNactor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions’Ù$05f120be-9695-4824-82fd-142a0df13098Ù$717e4c69-59d5-4929-923f-dd35a97fb160¡:®AbstractVector¢|>¿make_n_param_dist_policy_params‘Ù$ba41f521-4ee2-42a6-bf18-078bfa4b875e¥zeros¤rand§Integer¨Function¦UInt64¤plot¦foldxt¡+£Map¦Layout¬Random.seed!Ù$3bafd7df-9bc0-4d13-874d-739590cf3ad9„´precedence_heuristic §cell_idÙ$3bafd7df-9bc0-4d13-874d-739590cf3ad9´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$f27f2bcd-05b6-44fe-bf9e-a3e51556db7c„´precedence_heuristic §cell_idÙ$f27f2bcd-05b6-44fe-bf9e-a3e51556db7c´downstream_cells_map²cartpole_functions“Ù$5d434c83-c9ca-499f-8695-c7733031c2deÙ$4c4e643b-d4b9-44f0-8d30-dc521bcc55acÙ$de3cba34-9842-44d1-9b79-47126c0a0751²upstream_cells_map¹create_cartpole_functions‘Ù$352d2952-cb83-47d3-9078-2b2ef9927443Ù$41dc149d-c6f3-4b0d-a856-06f3aaae3049„´precedence_heuristic §cell_idÙ$41dc149d-c6f3-4b0d-a856-06f3aaae3049´downstream_cells_map·BinaryEligibilityVector”Ù$b0a66a19-ee76-463b-a704-8fcee85444d0Ù$042fbafe-2401-4fb7-ac13-4531e0782c79Ù$96506201-6b66-49e6-8179-06952e2394e1Ù$25be5dcf-be63-46c4-b6de-6cf79fa28fd0²upstream_cells_mapƒ¥Int64³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¦VectorÙ$38e5d800-4d43-40d2-87ea-f7d4b4283dab„´precedence_heuristic §cell_idÙ$38e5d800-4d43-40d2-87ea-f7d4b4283dab´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2„´precedence_heuristic 
§cell_idÙ$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2´downstream_cells_mapÙ%one_step_actor_critic_binary_features’Ù$7d63b960-3998-4f7b-8cbb-ccd49db9aeacÙ$646bc853-b7fc-49fa-a201-ff98e8f952d4²upstream_cells_mapµbinary_value_function‘Ù$a540814a-57a1-4b98-9443-59e401425444½setup_binary_policy_arguments‘Ù$96506201-6b66-49e6-8179-06952e2394e1¥zeros³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦Vector¤Real¶one_step_actor_critic!‘Ù$4d4ae57b-afc3-44f9-b6fc-892f59f82921¨FunctionÙ!update_binary_eligibility_vector!‘Ù$042fbafe-2401-4fb7-ac13-4531e0782c79¦length½update_binary_value_gradient!‘Ù$03a218cb-aa83-4000-85b5-c6f247087053¦MatrixÙ!update_binary_action_preferences!‘Ù$a361f4c9-47ce-42ad-899c-87b611c0d471Ù$73b90260-d57a-449a-8db6-47f91e6a4e4f„´precedence_heuristic §cell_idÙ$73b90260-d57a-449a-8db6-47f91e6a4e4f´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$5aba4f96-e877-457e-8e95-18737348f99f„´precedence_heuristic §cell_idÙ$5aba4f96-e877-457e-8e95-18737348f99f´downstream_cells_mapÙ"actor_critic_fcann_parameter_study’Ù$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62Ù$c251a630-7114-4188-9323-8d8feb5c32e0²upstream_cells_mapŒ«@NamedTuple¡:§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦Vector¥Int64¤Real¨Function¡-¤Base¡^¡+Ù$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486„´precedence_heuristic §cell_idÙ$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486´downstream_cells_mapÙ$mountaincar_continuing_binary_params‘Ù$04f42c09-8ab5-4233-b196-51c4aa2dcedb²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunnerÙ(create_actor_critic_continuing_params_UI‘Ù$5b15d91e-7119-4f85-a54a-7d4f1fdaf097¯Core.applicable¥@bind¨Base.getÙ$27487ad0-4779-42ce-8def-e660ef04bee0„´precedence_heuristic §cell_idÙ$27487ad0-4779-42ce-8def-e660ef04bee0´downstream_cells_map€²upstream_cells_map„¯reinforce_test4‘Ù$407a0724-4bb6-4c83-ab2d-17a0e19c4072®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6Ù 
reinforce_test4.policy_and_valueÙ6cartpole_setup.mdps.episodic.discrete.initialize_stateÙ$0d93132d-5819-47dc-8cf2-462d480d9c3d„´precedence_heuristic §cell_idÙ$0d93132d-5819-47dc-8cf2-462d480d9c3d´downstream_cells_map€²upstream_cells_mapŠºmountaincar_continuous_mdp‘Ù$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2§@md_str¡<¡>¦isless¼mountaincar_tilecoding_setup‘Ù$7c592385-e8d3-4efe-962c-d39debb64405Ù%mountaincar_binary_continuous_params2‘Ù$0d45ae72-572f-4d17-83cf-9814f2854131Ù>actor_critic_binary_episodic_squashed_gaussian_parameter_study“Ù$08505e88-9c23-4e95-91e3-d18bf5133dbcÙ$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19fÙ$c5a2879c-e89b-47f7-bbd6-48200d7e89e3Ù8run_mountaincar_binary_episodic_countinuous_param_study2‘Ù$e524f8cc-ab69-4f8b-a59f-28156696a104¨getindexÙ$9978d537-49ff-4014-a971-b42704c50a6b„´precedence_heuristic §cell_idÙ$9978d537-49ff-4014-a971-b42704c50a6b´downstream_cells_map»fcann_cartpole_study_params²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunner¯Core.applicable¥@bind¨Base.getÙ#create_actor_critic_fcann_params_UI‘Ù$5eebf3da-bfe7-46eb-81a3-f87f334ee270Ù$f8215517-b18f-4a03-9421-8edab4ca8089„´precedence_heuristic §cell_idÙ$f8215517-b18f-4a03-9421-8edab4ca8089´downstream_cells_map€²upstream_cells_mapƒ´show_squashed_policy‘Ù$ff3009eb-23f9-44fe-8e56-85dbc7b463d0¶test_mountaincar_state‘Ù$b7f77935-bcab-4ef1-8e1b-a7d059784ff3Ù"mountaincar_continuous_test_train3‘Ù$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7dÙ$1ac9296f-047b-4051-ba5c-0c23d5f9cde9„´precedence_heuristic §cell_idÙ$1ac9296f-047b-4051-ba5c-0c23d5f9cde9´downstream_cells_map·corridor_continuing_mdp’Ù$7afb6fb0-248a-4518-b94f-9876f81eca64Ù$8b35661b-5075-4d63-bc31-044407f99acf²upstream_cells_map¼make_corridor_continuing_mdp‘Ù$f0104778-81a6-417b-8501-f916e5e7f3afÙ$c87dba8c-9a96-41b3-9dc7-a6c088ec1eaf„´precedence_heuristic 
§cell_idÙ$c87dba8c-9a96-41b3-9dc7-a6c088ec1eaf´downstream_cells_map€²upstream_cells_map‚Ù!mountaincar_continuous_test_train‘Ù$b8532822-179b-4cd5-a279-4b71dafb544aÙ&show_mountaincar_continuous_trajectory‘Ù$b5319d8b-0420-4ebf-b603-ea0b93365ac1Ù$5cc4d12d-b537-47e2-8109-4e7a234fdf25„´precedence_heuristic §cell_idÙ$5cc4d12d-b537-47e2-8109-4e7a234fdf25´downstream_cells_map±make_corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabe²upstream_cells_mapŠ£max¹StateMDPTransitionSampler‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¡-¦isless¡+¡*¦iseven¢==§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$5334064b-5a16-4135-afa0-86a48291725b„´precedence_heuristic §cell_idÙ$5334064b-5a16-4135-afa0-86a48291725b´downstream_cells_map€²upstream_cells_map‚½corridor_train.value_function®corridor_train‘Ù$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72cÙ$9c342958-1971-48ec-b919-5dfdcbc915a4„´precedence_heuristic §cell_idÙ$9c342958-1971-48ec-b919-5dfdcbc915a4´downstream_cells_map§bgcolor‘Ù$e5faaa1b-88cb-43e2-8d04-8972b58b4bda²upstream_cells_mapŠ§@md_str¤Base±ColorStringPicker·PlutoRunner.create_bond«PlutoRunner¤Core¯Core.applicable¨Base.get¥@bind¨getindexÙ$966ef17c-23be-49dc-bc37-4cb52b34c049„´precedence_heuristic §cell_idÙ$966ef17c-23be-49dc-bc37-4cb52b34c049´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$e7e49ff8-32df-48a4-afb2-462859592e92„´precedence_heuristic §cell_idÙ$e7e49ff8-32df-48a4-afb2-462859592e92´downstream_cells_mapÙ&form_state_and_policy_function_outputs•Ù$4d4ae57b-afc3-44f9-b6fc-892f59f82921Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90f²upstream_cells_map‡¦Vectorsample_action‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¨deepcopyºform_state_policy_function‘Ù$37ec6802-d4c2-4470-ad69-439d5a732f77¹form_state_value_function‘Ù$e7566274-5518-4e28-8738-d4b1747d0cfb¨Function¤copyÙ$78c83673-2117-4542-b4d8-1c243e8f610b„´precedence_heuristic 
§cell_idÙ$78c83673-2117-4542-b4d8-1c243e8f610b´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8f„´precedence_heuristic §cell_idÙ$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8f´downstream_cells_map€²upstream_cells_mapƒÙ&mountaincar_continuous_test_train_beta‘Ù$4156d955-9daf-4429-b152-e8332980fb9eÙ&show_mountaincar_continuous_trajectory‘Ù$b5319d8b-0420-4ebf-b603-ea0b93365ac1¿mountaincar_continuous_beta_mdp‘Ù$8e096fae-9941-49d8-ae87-c68b02f68da5Ù$396e0047-d848-462f-a769-0cc2829abc78„´precedence_heuristic §cell_idÙ$396e0047-d848-462f-a769-0cc2829abc78´downstream_cells_map€²upstream_cells_map†¥Int64¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeµget_corridor_features‘Ù$6bb0263e-368e-462a-948c-baf9cfa82512¡^Ù4actor_critic_with_eligibility_traces_binary_features‘Ù$05bfd818-bf4e-4bda-baa9-5ba647867097§typemaxÙ$ff4f977e-48df-4c12-845c-c245b4d39d6d„´precedence_heuristic §cell_idÙ$ff4f977e-48df-4c12-845c-c245b4d39d6d´downstream_cells_mapÙ#actor_critic_linear_parameter_study“Ù$7afb6fb0-248a-4518-b94f-9876f81eca64Ù$1b102220-6d78-480d-a77f-0e57bad23dcaÙ$d57375a5-b9e0-4742-b5f7-6a7da891604a²upstream_cells_map¡:®AbstractVector¥zeros¤rand§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦UInt64¤Real¨Function¦lengthÙ4actor_critic_with_eligibility_traces_binary_features‘Ù$05bfd818-bf4e-4bda-baa9-5ba647867097¦Matrix·average_continuing_runs‘Ù$ba642a22-6623-482a-ab4a-81585b83e457©DataFrameÙ4actor_critic_with_eligibility_traces_linear_features‘Ù$68806899-9972-460a-9f11-daa708a9d610Ù$aa450da4-fe84-4eea-b6c4-9820b7982437„´precedence_heuristic §cell_idÙ$aa450da4-fe84-4eea-b6c4-9820b7982437´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$bb1ef180-39ac-475f-beea-ef573e71a3bf„´precedence_heuristic 
§cell_idÙ$bb1ef180-39ac-475f-beea-ef573e71a3bf´downstream_cells_map€²upstream_cells_mapƒ¸display_cartpole_episode‘Ù$822e4d69-2582-4956-858e-06ecb091e76a¢|>£ep2‘Ù$a5b002c9-5e11-462a-9da0-6e060c7963f8Ù$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27„´precedence_heuristic §cell_idÙ$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27´downstream_cells_map¾cartpole_continuing_fcann_test”Ù$04b5929a-2058-49c9-963a-96c752a1d67dÙ$64b38d1f-ecf9-4843-89a1-4c8953048265Ù$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7efÙ$fd58402f-da65-44cf-b81a-e21192fd0e63²upstream_cells_map„Ù*actor_critic_with_eligibility_traces_fcann‘Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54·cartpole_vector_update!‘Ù$192b9f82-8d3a-408f-91c2-829cfcd32572·cartpole_continuing_mdp‘Ù$4c4e643b-d4b9-44f0-8d30-dc521bcc55ac¼cartpole_fcann_feature_setup‘Ù$61650a97-b353-4a85-b50b-93fee296ac7bÙ$5b868eba-c1af-49f6-8f93-79b78c319a6f„´precedence_heuristic §cell_idÙ$5b868eba-c1af-49f6-8f93-79b78c319a6f´downstream_cells_mapÙ,reinforce_with_baseline_monte_carlo_control!–Ù$0ac7ea44-14f6-4e80-80f9-d6df8059bb38Ù$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbÙ$d1ed25e6-60c6-411f-a541-99986e5da2c5Ù$697b2310-9d96-4f7f-be62-c3bd6bf736f3Ù$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00Ù$d5020a8d-1dd7-403c-9d1f-665b95543943²upstream_cells_mapÞ¤zero£oneContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¤copy¦VectorÙ%form_state_continuous_policy_function‘Ù$f545c800-0bf3-491f-9d7d-42341cfdb573¤Real©eachindex«runepisode!‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¨deepcopy¡/¡^¼update_params_with_gradient!“Ù$b0a66a19-ee76-463b-a704-8fcee85444d0Ù$a893a87b-2d07-4db5-9d1a-9da8646216f4Ù$f55afa58-962d-4551-8d95-a5b467d61adfªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¡:¥zeros§Integer¨Function¥Int64¡-¡+¡*¹form_state_value_function‘Ù$e7566274-5518-4e28-8738-d4b1747d0cfbÙ$68469a40-7976-48b7-b7a1-eaa4c5f33a18„´precedence_heuristic 
§cell_idÙ$68469a40-7976-48b7-b7a1-eaa4c5f33a18´downstream_cells_mapÙ"plot_mountaincar_continuous_values“Ù$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580aÙ$b695ef21-a1ac-4d1f-a0e1-71cd81cede18Ù$a0ca7a5e-0089-4a45-9278-c0f27cd096a0²upstream_cells_map·HypertextLiteral.Bypass¨LinRange¥zeros¸HypertextLiteral.content¤@htl‘Ù$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb¨Function©enumerate§Float32¤plot°HypertextLiteral‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70§heatmap·HypertextLiteral.Result¦LayoutÙ$2a586e46-66e4-461a-85c8-5817e4d1aa43„´precedence_heuristic §cell_idÙ$2a586e46-66e4-461a-85c8-5817e4d1aa43´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$a206c759-3f6e-4003-8cba-5f6ce6742646„´precedence_heuristic §cell_idÙ$a206c759-3f6e-4003-8cba-5f6ce6742646´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1d„´precedence_heuristic §cell_idÙ$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1d´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5„´precedence_heuristic §cell_idÙ$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$31db0f58-28e4-454f-9394-25565687266f„´precedence_heuristic §cell_idÙ$31db0f58-28e4-454f-9394-25565687266f´downstream_cells_map€²upstream_cells_map†¥randn¸display_cartpole_episode‘Ù$822e4d69-2582-4956-858e-06ecb091e76a§Float32¢|>cartpole_mdps‘Ù$024dcd1a-8eaa-4a95-8037-2f578828309cªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$822e4d69-2582-4956-858e-06ecb091e76a„´precedence_heuristic 
§cell_idÙ$822e4d69-2582-4956-858e-06ecb091e76a´downstream_cells_map¸display_cartpole_episode™Ù$0cd96c44-cae6-421f-9fae-26141600bef4Ù$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fceÙ$31db0f58-28e4-454f-9394-25565687266fÙ$dddc4a2f-34b2-41dc-85b3-55aba4880fa6Ù$5859ca11-90f8-4fd6-88ed-c56efe796fe8Ù$11a55af7-5301-4507-bb26-88e1e11236dbÙ$07ba9fe4-aaa7-4123-9865-cbfa79d0d44aÙ$daf35bfe-8f9c-4f55-971d-4d443be8f8bfÙ$bb1ef180-39ac-475f-beea-ef573e71a3bf²upstream_cells_mapˆ¨getfieldCartPoleState©enumerate§scatter¤plot¤attr¦Layout¦VectorÙ$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580a„´precedence_heuristic §cell_idÙ$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580a´downstream_cells_map€²upstream_cells_map‚Ù"plot_mountaincar_continuous_values‘Ù$68469a40-7976-48b7-b7a1-eaa4c5f33a18Ù!mountaincar_continuous_test_train‘Ù$b8532822-179b-4cd5-a279-4b71dafb544aÙ$05b0fcad-628b-48d2-aa24-f6f562dbb660„´precedence_heuristic §cell_idÙ$05b0fcad-628b-48d2-aa24-f6f562dbb660´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$d2729657-d0bf-4d39-8ec7-f242a1ad48d6„´precedence_heuristic §cell_idÙ$d2729657-d0bf-4d39-8ec7-f242a1ad48d6´downstream_cells_mapÙ)create_continuous_action_mountaincar_beta‘Ù$8e096fae-9941-49d8-ae87-c68b02f68da5²upstream_cells_map…´MountainCarTask.step¡-¡*ContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¯MountainCarTaskÙ$5c11a92d-7496-4aba-af15-2537eac49dd7„´precedence_heuristic §cell_idÙ$5c11a92d-7496-4aba-af15-2537eac49dd7´downstream_cells_map°FCANNActivations“Ù$cc3ac95e-a398-438a-ba3d-62b6733f6342Ù$5c4a383f-fcf2-4f2b-819f-6d84471dda00Ù$635abb34-2c97-4f04-a74c-22fbec32f408²upstream_cells_map‚¤Real¦VectorÙ$1753b5ed-c00b-4b60-b492-822180778e8c„´precedence_heuristic 
§cell_idÙ$1753b5ed-c00b-4b60-b492-822180778e8c´downstream_cells_map½update_linear_value_gradient!”Ù$d1ed25e6-60c6-411f-a541-99986e5da2c5Ù$57e5e12a-b722-4ea3-ab3b-e5711029e640Ù$68806899-9972-460a-9f11-daa708a9d610Ù$d5020a8d-1dd7-403c-9d1f-665b95543943²upstream_cells_map‚¤Real¦VectorÙ$f7ede764-5ad8-426b-a805-cc21b622d977„´precedence_heuristic §cell_idÙ$f7ede764-5ad8-426b-a805-cc21b622d977´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$36d514fa-b27a-4c6b-8399-9d108377b9b5„´precedence_heuristic §cell_idÙ$36d514fa-b27a-4c6b-8399-9d108377b9b5´downstream_cells_map¬study_params‘Ù$c52c4cec-0ea8-4af3-831a-d284f0e086ee²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunner¯Core.applicable¥@bind¨Base.get½create_actor_critic_params_UI‘Ù$a8b40b8f-051a-4e6f-a079-ece4f32873deÙ$6b1acb57-159a-4b7f-99fe-5f996522243b„´precedence_heuristic §cell_idÙ$6b1acb57-159a-4b7f-99fe-5f996522243b´downstream_cells_map€²upstream_cells_map€Ù$45f0a385-6465-4acc-8637-1b007a0fe215„´precedence_heuristic §cell_idÙ$45f0a385-6465-4acc-8637-1b007a0fe215´downstream_cells_mapÙ update_fcann_eligibility_vector!‘Ù$0e9de19e-bcd4-40ac-9831-afb6cad38422²upstream_cells_map‹¡:°CrossEntropyLoss‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¥FCANN‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84©@inbounds«FCANNParams§Integer¦Vector©eachindex§Float32¡*´FCANN.nnCostFunctionÙ$c52c4cec-0ea8-4af3-831a-d284f0e086ee„´precedence_heuristic §cell_idÙ$c52c4cec-0ea8-4af3-831a-d284f0e086ee´downstream_cells_map€²upstream_cells_map…¡:¡^¡+¬study_params‘Ù$36d514fa-b27a-4c6b-8399-9d108377b9b5¸corridor_parameter_study‘Ù$bc8a399b-8864-4473-89d2-e3b0a03d15b5Ù$f8614042-7c94-4d47-a1b6-4e96676b4e8b„´precedence_heuristic 
§cell_idÙ$f8614042-7c94-4d47-a1b6-4e96676b4e8b´downstream_cells_mapÙ+actor_critic_fcann_episodic_parameter_study‘Ù$d34d22ad-89c2-423e-91dd-bfb895dc6540²upstream_cells_mapÞ¡!¦Random‘Ù$df7f84e8-b42a-4001-9dbf-6bc3ced94207¦Filter¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦Vector¤Real§scatter§isempty¤mean§missing¡:®AbstractVectorÙ*actor_critic_with_eligibility_traces_fcann‘Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54¢|>©ismissing¤rand§Integer¨Function¦UInt64¨tcollect¥Int64¤plot£Map¦Layout¬Random.seed!Ù$76eb6743-cac0-4174-9ba3-a0691c200b54„´precedence_heuristic §cell_idÙ$76eb6743-cac0-4174-9ba3-a0691c200b54´downstream_cells_map¸make_n_param_dist_params“Ù$ba5d6311-daee-4abc-b2fb-fae2184ef3ebÙ$ed93259c-7b8b-46d7-97fb-f194e0e04b3aÙ$4e29c621-223e-4859-8e96-db04b967815a²upstream_cells_map…¥zeros¦NTuple¡*§Integer¤RealÙ$94517664-6988-44dc-a297-e9d5873ee540„´precedence_heuristic §cell_idÙ$94517664-6988-44dc-a297-e9d5873ee540´downstream_cells_map½squashed_gaussian_plot_params‘Ù$3e7cecec-eb77-4862-8e3c-b510422e06db²upstream_cells_map§@md_str¤Core¡:§PlutoUI‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70¨Base.get¥@bind¦Slider¤Base«PlutoRunner·PlutoRunner.create_bond¯Core.applicable¯PlutoUI.combine¨getindexÙ$d037ea92-915c-4dc7-97c6-d006d92e088a„´precedence_heuristic §cell_idÙ$d037ea92-915c-4dc7-97c6-d006d92e088a´downstream_cells_map«figure_13_1‘Ù$0c56b341-24eb-4c78-844e-182f44a7221a²upstream_cells_mapÞ¦foldxt¦Layout¡:¡*¤sqrt¦Random‘Ù$df7f84e8-b42a-4001-9dbf-6bc3ced94207¢|>¥Int64¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeµget_corridor_features‘Ù$6bb0263e-368e-462a-948c-baf9cfa82512§scatter¡-¡/¤plot¡+¤log2Ù-reinforce_monte_carlo_control_binary_features‘Ù$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290¤fill£Map¥round¬Random.seed!Ù$24fa139c-ad4b-49db-ac8f-23c476ed8608„´precedence_heuristic 
§cell_idÙ$24fa139c-ad4b-49db-ac8f-23c476ed8608´downstream_cells_map®reinforce_test’Ù$dddc4a2f-34b2-41dc-85b3-55aba4880fa6Ù$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5²upstream_cells_mapƒ®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6¡^ÙLreinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions‘Ù$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00Ù$2025ff38-f2ec-4224-b771-ff72ffe1af28„´precedence_heuristic §cell_idÙ$2025ff38-f2ec-4224-b771-ff72ffe1af28´downstream_cells_map´mountaincar_min_vals’Ù$023f67b8-8f38-470a-9766-ac60a75678aaÙ$7c592385-e8d3-4efe-962c-d39debb64405²upstream_cells_map€Ù$cb70d400-3e9c-441c-b17c-e727e8c928f3„´precedence_heuristic §cell_idÙ$cb70d400-3e9c-441c-b17c-e727e8c928f3´downstream_cells_map€²upstream_cells_mapˆ§@md_str¡<Ù.start_mountaincar_continuing_fcann_param_study‘Ù$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7d¡>Ù#mountaincar_continuing_fcann_params‘Ù$5d35e515-e2d3-443e-becf-eb28c25db346Ù,mountaincar_fcann_continuing_parameter_study‘Ù$c251a630-7114-4188-9323-8d8feb5c32e0¦isless¨getindexÙ$e034b9cb-f4ee-46f4-bea6-72c93c75d966„´precedence_heuristic§cell_idÙ$e034b9cb-f4ee-46f4-bea6-72c93c75d966´downstream_cells_mapªDataFrames²upstream_cells_map€Ù$e6cf9550-2e69-4b82-92cf-5e07a35490aa„´precedence_heuristic §cell_idÙ$e6cf9550-2e69-4b82-92cf-5e07a35490aa´downstream_cells_map¬zero_params!”Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90f²upstream_cells_map†¤zero©eachindex¡:¥Array«FCANNParams¤RealÙ$717e4c69-59d5-4929-923f-dd35a97fb160„´precedence_heuristic 
§cell_idÙ$717e4c69-59d5-4929-923f-dd35a97fb160´downstream_cells_mapÙNactor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions“Ù$08505e88-9c23-4e95-91e3-d18bf5133dbcÙ$87482ea5-5265-4e02-92c0-1a8bb44ff0f4Ù$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7d²upstream_cells_map‡¤Real¨Function¦NTuple¥Union£one§IntegerContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295Ù$1386ffdb-940d-4f1b-a872-4e38647b5335„´precedence_heuristic §cell_idÙ$1386ffdb-940d-4f1b-a872-4e38647b5335´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$a893a87b-2d07-4db5-9d1a-9da8646216f4„´precedence_heuristic §cell_idÙ$a893a87b-2d07-4db5-9d1a-9da8646216f4´downstream_cells_map¼update_params_with_gradient!—Ù$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$4d4ae57b-afc3-44f9-b6fc-892f59f82921Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$5b868eba-c1af-49f6-8f93-79b78c319a6fÙ$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90f²upstream_cells_mapÞ¤zero¡:¦isless³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41®julia.simdloop©@inbounds§nothing¦Vector¡<µBase.simd_outer_range¤Real¤Base¶Base.simd_inner_length¥@simd¯Base.simd_index¡+Ù$2cbc972b-c685-4c1c-8a8d-9d58b197ad90„´precedence_heuristic §cell_idÙ$2cbc972b-c685-4c1c-8a8d-9d58b197ad90´downstream_cells_map»update_binary_value_params!²upstream_cells_map…¤Real®BinaryFeatures‘Ù$da2d3186-a778-41cc-9b49-759bf1e9b8fa¡+©@inbounds¦VectorÙ$37ec6802-d4c2-4470-ad69-439d5a732f77„´precedence_heuristic §cell_idÙ$37ec6802-d4c2-4470-ad69-439d5a732f77´downstream_cells_mapºform_state_policy_function’Ù$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$e7e49ff8-32df-48a4-afb2-462859592e92²upstream_cells_map‚©soft_max!‘Ù$33c99850-67cd-4754-94b9-6df97b238e27¨FunctionÙ$98222fcd-b456-477c-90dd-844df36877e5„´precedence_heuristic §cell_idÙ$98222fcd-b456-477c-90dd-844df36877e5´downstream_cells_map€²upstream_cells_map‚Ù 
mountaincar_continuing_tile_test‘Ù$b02ba928-5b9f-4695-b980-07988c788bb9¼plot_continuing_step_rewards‘Ù$0964133c-3a5b-433b-a8c4-a97813c37583Ù$f7f58fd2-facc-4b87-9172-5e911677c8f4„´precedence_heuristic §cell_idÙ$f7f58fd2-facc-4b87-9172-5e911677c8f4´downstream_cells_map€²upstream_cells_map€Ù$58403c8e-0ee4-4466-ba25-ee0c86fb0b47„´precedence_heuristic §cell_idÙ$58403c8e-0ee4-4466-ba25-ee0c86fb0b47´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$e1aec891-d95a-47d1-97d7-d2a4cfb16e64„´precedence_heuristic §cell_idÙ$e1aec891-d95a-47d1-97d7-d2a4cfb16e64´downstream_cells_mapÙ&setup_fcann_policy_and_value_arguments“Ù$697b2310-9d96-4f7f-be62-c3bd6bf736f3Ù$57bbdb10-bed8-459d-8f67-9ea637cf12baÙ$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54²upstream_cells_mapˆ¤Real¥Int64¼setup_fcann_policy_arguments‘Ù$0e9de19e-bcd4-40ac-9831-afb6cad38422»setup_fcann_value_arguments‘Ù$f3e2db06-9cb7-464a-96b8-938175efd26b¤Bool«FCANNParams§Integer¦VectorÙ$3d065608-eef2-4caa-b17d-ec60714e3d58„´precedence_heuristic §cell_idÙ$3d065608-eef2-4caa-b17d-ec60714e3d58´downstream_cells_mapÙ1actor_critic_binary_episodic_beta_parameter_study²upstream_cells_map‹«@NamedTuple¡:§IntegerContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¤Real¥Int64¨Function¤Base¡-¡^¡+Ù$b87ff1a9-abff-40f7-a1d8-f751a1c8b060„´precedence_heuristic §cell_idÙ$b87ff1a9-abff-40f7-a1d8-f751a1c8b060´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$e89bdc84-dbb5-4c73-a39c-6392e5f79704„´precedence_heuristic §cell_idÙ$e89bdc84-dbb5-4c73-a39c-6392e5f79704´downstream_cells_map€²upstream_cells_map‚Ù mountaincar_continuing_tile_test‘Ù$b02ba928-5b9f-4695-b980-07988c788bb9·plot_mountaincar_values‘Ù$f9facbba-39d4-483e-9066-275603156db0Ù$d3b56fca-5b79-4465-8987-8d0005f854d8„´precedence_heuristic 
§cell_idÙ$d3b56fca-5b79-4465-8987-8d0005f854d8´downstream_cells_map¯reinforce_test2‘Ù$5859ca11-90f8-4fd6-88ed-c56efe796fe8²upstream_cells_mapƒ®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6Ù;reinforce_with_baseline_monte_carlo_control_binary_features‘Ù$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb¡^Ù$d21617aa-6f38-4a90-8586-4b32022497ad„´precedence_heuristic §cell_idÙ$d21617aa-6f38-4a90-8586-4b32022497ad´downstream_cells_map€²upstream_cells_map®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6Ù$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2„´precedence_heuristic §cell_idÙ$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2´downstream_cells_map€²upstream_cells_map„´sref_cartpole_binary‘Ù$19dfabda-7049-4050-8662-0385529c0c5aCartPoleState¸cartpole_continuing_test‘Ù$3c89209c-9202-4d5d-841c-ea34be369616´plot_cartpole_policy‘Ù$602a07dd-8928-4b44-97e5-01c5cbf38351Ù$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7d„´precedence_heuristic §cell_idÙ$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7d´downstream_cells_mapÙ"mountaincar_continuous_test_train3“Ù$a0ca7a5e-0089-4a45-9278-c0f27cd096a0Ù$5207308e-f636-4d47-b135-036a6e7b8ecdÙ$f8215517-b18f-4a03-9421-8edab4ca8089²upstream_cells_map…ºmountaincar_continuous_mdp‘Ù$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2¥Int64ÙNactor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions’Ù$05f120be-9695-4824-82fd-142a0df13098Ù$717e4c69-59d5-4929-923f-dd35a97fb160¼mountaincar_tilecoding_setup‘Ù$7c592385-e8d3-4efe-962c-d39debb64405§typemaxÙ$d82e7ab8-c372-4462-afb5-1617560cdb56„´precedence_heuristic §cell_idÙ$d82e7ab8-c372-4462-afb5-1617560cdb56´downstream_cells_map€²upstream_cells_map‚Ù&mountaincar_continuous_test_train_beta‘Ù$4156d955-9daf-4429-b152-e8332980fb9e·plot_mountaincar_values‘Ù$f9facbba-39d4-483e-9066-275603156db0Ù$3c89209c-9202-4d5d-841c-ea34be369616„´precedence_heuristic 
§cell_idÙ$3c89209c-9202-4d5d-841c-ea34be369616´downstream_cells_map¸cartpole_continuing_test“Ù$645e93e7-e92e-49c4-9757-8294fabf4e9bÙ$0cd96c44-cae6-421f-9fae-26141600bef4Ù$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2²upstream_cells_map„¹cartpole_tilecoding_setup‘Ù$de3cba34-9842-44d1-9b79-47126c0a0751Ù-cartpole_tilecoding_setup.get_active_featuresÙ4actor_critic_with_eligibility_traces_binary_features‘Ù$05bfd818-bf4e-4bda-baa9-5ba647867097·cartpole_continuing_mdp‘Ù$4c4e643b-d4b9-44f0-8d30-dc521bcc55acÙ$635abb34-2c97-4f04-a74c-22fbec32f408„´precedence_heuristic §cell_idÙ$635abb34-2c97-4f04-a74c-22fbec32f408´downstream_cells_map´fcann_value_function‘Ù$f3e2db06-9cb7-464a-96b8-938175efd26b²upstream_cells_map‰¹FCANN.forwardNOGRAD_base!°FCANNActivations‘Ù$5c11a92d-7496-4aba-af15-2537eac49dd7§Float32¥FCANN‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¥first¤last«FCANNParams§Integer¦VectorÙ$0bf3b988-b3fb-49d5-8dde-b25766596363„´precedence_heuristic §cell_idÙ$0bf3b988-b3fb-49d5-8dde-b25766596363´downstream_cells_mapµlinear_value_function”Ù$d1ed25e6-60c6-411f-a541-99986e5da2c5Ù$57e5e12a-b722-4ea3-ab3b-e5711029e640Ù$68806899-9972-460a-9f11-daa708a9d610Ù$d5020a8d-1dd7-403c-9d1f-665b95543943²upstream_cells_mapƒ¤Real£dot¦VectorÙ$d8222abf-139c-4220-8e92-cc987ec6900c„´precedence_heuristic §cell_idÙ$d8222abf-139c-4220-8e92-cc987ec6900c´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$68e6f17e-8c87-40f0-a673-1115ecd1b71d„´precedence_heuristic §cell_idÙ$68e6f17e-8c87-40f0-a673-1115ecd1b71d´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$cf1859d6-f889-4923-8c87-2d7c039f26c3„´precedence_heuristic §cell_idÙ$cf1859d6-f889-4923-8c87-2d7c039f26c3´downstream_cells_map€²upstream_cells_map„¥randn§Float32cartpole_mdps‘Ù$024dcd1a-8eaa-4a95-8037-2f578828309cªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$5500fd8e-64cb-4af7-808d-230440746319„´precedence_heuristic 
§cell_idÙ$5500fd8e-64cb-4af7-808d-230440746319´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$76d54520-baa3-44bf-b303-4cdcb8b87080„´precedence_heuristic §cell_idÙ$76d54520-baa3-44bf-b303-4cdcb8b87080´downstream_cells_map²make_sample_vector‘Ù$9acdbf38-2e10-45ec-85a0-d0db8453a599²upstream_cells_mapƒ¥zeros¦NTuple¤RealÙ$27441783-d3c6-40be-9c36-4941613e6ae9„´precedence_heuristic §cell_idÙ$27441783-d3c6-40be-9c36-4941613e6ae9´downstream_cells_map€²upstream_cells_map‰¦length¥Int64¨LinRange¡/¢|>¯reinforce_test5‘Ù$82e0e9a0-9662-429a-87e3-e6bdae02709a¤plot¦cumsum¥roundÙ$fac138d9-3c5d-44b0-a87c-b13872f19450„´precedence_heuristic§cell_idÙ$fac138d9-3c5d-44b0-a87c-b13872f19450´downstream_cells_map§Memoize²upstream_cells_map€Ù$82e0e9a0-9662-429a-87e3-e6bdae02709a„´precedence_heuristic §cell_idÙ$82e0e9a0-9662-429a-87e3-e6bdae02709a´downstream_cells_map¯reinforce_test5•Ù$27441783-d3c6-40be-9c36-4941613e6ae9Ù$daf35bfe-8f9c-4f55-971d-4d443be8f8bfÙ$a5b002c9-5e11-462a-9da0-6e060c7963f8Ù$d4e87ac4-6008-43b2-aa06-e232ec2b2b5bÙ$700dcbc4-c94c-4287-8cf0-0b2c7a320a3a²upstream_cells_map„®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6Ù*actor_critic_with_eligibility_traces_fcann‘Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54¼cartpole_fcann_feature_setup‘Ù$61650a97-b353-4a85-b50b-93fee296ac7bÙ3cartpole_fcann_feature_setup.update_feature_vector!Ù$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62„´precedence_heuristic §cell_idÙ$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62´downstream_cells_mapÙ(start_mountaincar_continuing_param_study‘Ù$04f42c09-8ab5-4233-b196-51c4aa2dcedb²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunnerCounterButton¯Core.applicable¥@bind¨Base.getÙ$fad02876-efba-46a7-9cb7-43820528779f„´precedence_heuristic 
§cell_idÙ$fad02876-efba-46a7-9cb7-43820528779f´downstream_cells_map€²upstream_cells_mapƒÙ&cartpole_fcann_continuing_test_episode‘Ù$64b38d1f-ecf9-4843-89a1-4c8953048265©plot_cart‘Ù$63fbf8f4-e4e2-4893-be09-67450e92dbd7Ù-cartpole_fcann_continuing_episode_step_select‘Ù$6acb549a-5d90-4457-a347-d22448ad8071Ù$1ce4bc6c-7cde-48e9-8ff1-7281697fd121„´precedence_heuristic §cell_idÙ$1ce4bc6c-7cde-48e9-8ff1-7281697fd121´downstream_cells_map€²upstream_cells_mapƒ¨ep2_step‘Ù$9bce6fdb-2cbc-4758-9a8b-794e490c973d£ep2‘Ù$a5b002c9-5e11-462a-9da0-6e060c7963f8©plot_cart‘Ù$63fbf8f4-e4e2-4893-be09-67450e92dbd7Ù$024dcd1a-8eaa-4a95-8037-2f578828309c„´precedence_heuristic §cell_idÙ$024dcd1a-8eaa-4a95-8037-2f578828309c´downstream_cells_mapcartpole_mdps“Ù$cf1859d6-f889-4923-8c87-2d7c039f26c3Ù$31db0f58-28e4-454f-9394-25565687266fÙ$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5²upstream_cells_map´create_cartpole_mdps‘Ù$3c316495-bb6c-41e2-a38f-ba867a319fbbÙ$e1274f57-75cb-4659-a82f-e5870c5367e2„´precedence_heuristic §cell_idÙ$e1274f57-75cb-4659-a82f-e5870c5367e2´downstream_cells_map¢ep“Ù$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547dÙ$374af774-3a97-49b5-a3bb-bc3f7f63a3faÙ$af144759-fe66-4ad0-b378-e9eb4e859db4²upstream_cells_mapƒ¯reinforce_test4‘Ù$407a0724-4bb6-4c83-ab2d-17a0e19c4072®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6ªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$fdd3f4fd-4706-4d6b-b150-6ee6b4b370cb„´precedence_heuristic §cell_idÙ$fdd3f4fd-4706-4d6b-b150-6ee6b4b370cb´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$b02ba928-5b9f-4695-b980-07988c788bb9„´precedence_heuristic §cell_idÙ$b02ba928-5b9f-4695-b980-07988c788bb9´downstream_cells_mapÙ 
mountaincar_continuing_tile_test”Ù$98222fcd-b456-477c-90dd-844df36877e5Ù$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6Ù$e89bdc84-dbb5-4c73-a39c-6392e5f79704Ù$da3cb392-78f2-48b2-b0dc-5f016664798c²upstream_cells_mapƒÙ4actor_critic_with_eligibility_traces_binary_features‘Ù$05bfd818-bf4e-4bda-baa9-5ba647867097¼mountaincar_tilecoding_setup‘Ù$7c592385-e8d3-4efe-962c-d39debb64405ºmountaincar_continuing_mdp‘Ù$46fea69b-599e-46ab-8455-d2da865d9a8eÙ$f946c886-6246-4f98-a96f-f06984691ad8„´precedence_heuristic §cell_idÙ$f946c886-6246-4f98-a96f-f06984691ad8´downstream_cells_map‚¾ApproximationUtils.runepisode!½ApproximationUtils.runepisode²upstream_cells_mapÞ$§@assert¥Tuple¡>¦isless¡!ContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295²Base.CoreLogging.!¦length¡<¦Vector§Returns¤RealÙ'Base.CoreLogging.Base.fixup_stdlib_path¥@info±Base.invokelatest¦NTuple¢==³Base.AssertionError½Base.CoreLogging.invokelatest´Base.CoreLogging.===ªBase.throw²ApproximationUtils‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84º#___this_pluto_module_name¤rand¨Function§typemax¤Base¢<=¥Int64¥push!´Base.CoreLogging.isa¡-µbad_continuous_action‘Ù$b966b248-fb4d-457d-90f6-114370846242¡+¥Union³Base.CoreLogging.>=Ù$3c316495-bb6c-41e2-a38f-ba867a319fbb„´precedence_heuristic §cell_idÙ$3c316495-bb6c-41e2-a38f-ba867a319fbb´downstream_cells_map´create_cartpole_mdps’Ù$024dcd1a-8eaa-4a95-8037-2f578828309cÙ$fddef10c-7695-4596-9e16-987fd45a57e6²upstream_cells_mapÞ¤zero©TabularRL‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¡>¦isless£oneContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¤Real¡<§Returns§Float32¡/¢Ï€¥Inf32¥clamp¯CartPoleVehicle£abs²TabularRL.StateMDPCartPoleState¤rand¨Function¹cartpole_runge_kutta_step¾ContinuousMDPTransitionSampler‘Ù$c8b47eac-2d45-419a-bec6-2ae0cdc59393¡-¹StateMDPTransitionSampler‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$6c5e9bb2-4c38-4613-9652-dec99e97b512„´precedence_heuristic 
§cell_idÙ$6c5e9bb2-4c38-4613-9652-dec99e97b512´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$b0a66a19-ee76-463b-a704-8fcee85444d0„´precedence_heuristic §cell_idÙ$b0a66a19-ee76-463b-a704-8fcee85444d0´downstream_cells_map¼update_params_with_gradient!—Ù$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$4d4ae57b-afc3-44f9-b6fc-892f59f82921Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$5b868eba-c1af-49f6-8f93-79b78c319a6fÙ$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90f²upstream_cells_mapÞ¤zero¦isless©@inbounds«FCANNParams·BinaryEligibilityVector‘Ù$41dc149d-c6f3-4b0d-a856-06f3aaae3049§nothing¤Real¡<¯Base.simd_index©eachindex¥@simd¦Matrix§Float32¡:§Nothing¥first³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41®julia.simdloop¤BaseµBase.simd_outer_range¡-¶Base.simd_inner_length¡+¡*¥ArrayÙ$13ebc12f-ff6f-4266-88d3-28d6df5fcf59„´precedence_heuristic §cell_idÙ$13ebc12f-ff6f-4266-88d3-28d6df5fcf59´downstream_cells_mapÙ5actor_critic_binary_episodic_gaussian_parameter_study‘Ù$b53dba81-a9e9-41da-8fc2-7736bf25f2dc²upstream_cells_map‹«@NamedTuple¡:§IntegerContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¤Real¥Int64¨Function¤Base¡-¡^¡+Ù$7a6fb1f0-fc3c-4c29-a6d9-769d32ca98a9„´precedence_heuristic §cell_idÙ$7a6fb1f0-fc3c-4c29-a6d9-769d32ca98a9´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$f2f2dd1d-180c-4d36-b515-5079d129f93a„´precedence_heuristic §cell_idÙ$f2f2dd1d-180c-4d36-b515-5079d129f93a´downstream_cells_map€²upstream_cells_mapŠ¦length¥Int64¡:¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeµget_corridor_features‘Ù$6bb0263e-368e-462a-948c-baf9cfa82512¤plot¡/¢|>§typemax¨sarsa_Î»Ù$553b0ceb-f2ca-41ee-99bc-9f53a4487b49„´precedence_heuristic 
§cell_idÙ$553b0ceb-f2ca-41ee-99bc-9f53a4487b49´downstream_cells_map€²upstream_cells_map‚°best_mc_corridor‘Ù$a12b92d1-e045-4f92-b8cd-eee5d56fa67dºget_corridor_episode_stats’Ù$fb8904a9-ae64-41cc-93b6-5a25855edad0Ù$cecc2a35-3850-4f66-84e8-e29da4f3d4b0Ù$f9facbba-39d4-483e-9066-275603156db0„´precedence_heuristic §cell_idÙ$f9facbba-39d4-483e-9066-275603156db0´downstream_cells_map·plot_mountaincar_values“Ù$e89bdc84-dbb5-4c73-a39c-6392e5f79704Ù$c0876a48-ea18-494d-8bfc-e2bceb73b417Ù$d82e7ab8-c372-4462-afb5-1617560cdb56²upstream_cells_mapŒ·HypertextLiteral.Bypass¨LinRange¥zeros¸HypertextLiteral.content¤@htl‘Ù$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb©enumerate§Float32¤plot°HypertextLiteral‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70§heatmap·HypertextLiteral.Result¦LayoutÙ$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0„´precedence_heuristic §cell_idÙ$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0´downstream_cells_map€²upstream_cells_map†»one_step_actor_critic_fcann‘Ù$57bbdb10-bed8-459d-8f67-9ea637cf12ba¥Int64¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabe¡^¹update_corridor_features!‘Ù$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811§typemaxÙ$d41f1dd1-45fe-4456-9a01-ed47fd6704a7„´precedence_heuristic §cell_idÙ$d41f1dd1-45fe-4456-9a01-ed47fd6704a7´downstream_cells_map¿update_beta_eligibility_vector!‘Ù$3e3c5897-809f-46e3-bb58-f115b082443e²upstream_cells_mapŒ£exp¡:¡k¥first³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¦Vector¤Real»BinaryBetaEligibilityVector‘Ù$54fff14b-cf53-47b0-9cfa-8b9ee33df54e¦Matrix¡+¦NTuple¤lastÙ$ba5d6311-daee-4abc-b2fb-fae2184ef3eb„´precedence_heuristic 
§cell_idÙ$ba5d6311-daee-4abc-b2fb-fae2184ef3eb´downstream_cells_mapÙ&setup_binary_gaussian_policy_arguments’Ù$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00Ù$20776e09-7d9b-4db8-a060-7bceeec65b47²upstream_cells_map‹³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¤rand§IntegerContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¤Real¨Function¿BinaryGaussianEligibilityVector‘Ù$10cdd16e-a337-4421-a7a0-6de4e4b60c0f½update_binary_feature_vector!‘Ù$8eab55a5-41b7-4f5e-a02f-4c19388bc9ea¸make_n_param_dist_params‘Ù$76eb6743-cac0-4174-9ba3-a0691c200b54¥Union¦NTupleÙ$8e742d32-c074-4981-b35b-b596b64c869b„´precedence_heuristic §cell_idÙ$8e742d32-c074-4981-b35b-b596b64c869b´downstream_cells_mapÙ'cartpole_continuing_binary_study_params‘Ù$b2539398-fdbc-42a2-a8f3-d327358f3643²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunnerÙ(create_actor_critic_continuing_params_UI‘Ù$5b15d91e-7119-4f85-a54a-7d4f1fdaf097¯Core.applicable¥@bind¨Base.getÙ$03a218cb-aa83-4000-85b5-c6f247087053„´precedence_heuristic §cell_idÙ$03a218cb-aa83-4000-85b5-c6f247087053´downstream_cells_map½update_binary_value_gradient!—Ù$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbÙ$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2Ù$05bfd818-bf4e-4bda-baa9-5ba647867097Ù$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00Ù$20776e09-7d9b-4db8-a060-7bceeec65b47Ù$3e3c5897-809f-46e3-bb58-f115b082443eÙ$05f120be-9695-4824-82fd-142a0df13098²upstream_cells_mapƒ¤Real³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¦VectorÙ$1ec1acf1-f833-4478-9b3c-88029340a629„´precedence_heuristic §cell_idÙ$1ec1acf1-f833-4478-9b3c-88029340a629´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$de3cba34-9842-44d1-9b79-47126c0a0751„´precedence_heuristic 
§cell_idÙ$de3cba34-9842-44d1-9b79-47126c0a0751´downstream_cells_map¹cartpole_tilecoding_setup’Ù$1b102220-6d78-480d-a77f-0e57bad23dcaÙ$3c89209c-9202-4d5d-841c-ea34be369616²upstream_cells_mapƒ±tile_coding_setup¡/²cartpole_functions‘Ù$f27f2bcd-05b6-44fe-bf9e-a3e51556db7cÙ$04f42c09-8ab5-4233-b196-51c4aa2dcedb„´precedence_heuristic §cell_idÙ$04f42c09-8ab5-4233-b196-51c4aa2dcedb´downstream_cells_map€²upstream_cells_mapˆ§@md_str¡<Ù$mountaincar_continuing_binary_params‘Ù$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486¡>¦islessÙ-mountaincar_binary_continuing_parameter_study‘Ù$d57375a5-b9e0-4742-b5f7-6a7da891604aÙ(start_mountaincar_continuing_param_study‘Ù$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62¨getindexÙ$54ff46a2-489a-4dd2-bc30-df70c780cc42„´precedence_heuristic §cell_idÙ$54ff46a2-489a-4dd2-bc30-df70c780cc42´downstream_cells_map€²upstream_cells_map€Ù$7126aefd-b847-497a-9545-514e9b9afa71„´precedence_heuristic §cell_idÙ$7126aefd-b847-497a-9545-514e9b9afa71´downstream_cells_map€²upstream_cells_map€Ù$48dcd2d0-a940-41da-a097-90c780f2ec4d„´precedence_heuristic §cell_idÙ$48dcd2d0-a940-41da-a097-90c780f2ec4d´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$e1493cea-19c4-475d-98a0-86d27fb04af1„´precedence_heuristic §cell_idÙ$e1493cea-19c4-475d-98a0-86d27fb04af1´downstream_cells_map€²upstream_cells_map‡¨sarsa_Î»¥Int64¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeµget_corridor_features‘Ù$6bb0263e-368e-462a-948c-baf9cfa82512¢|>§typemaxºget_corridor_episode_stats’Ù$fb8904a9-ae64-41cc-93b6-5a25855edad0Ù$cecc2a35-3850-4f66-84e8-e29da4f3d4b0Ù$511a847f-234c-465e-8f4a-688e79d9b975„´precedence_heuristic §cell_idÙ$511a847f-234c-465e-8f4a-688e79d9b975´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$697b2310-9d96-4f7f-be62-c3bd6bf736f3„´precedence_heuristic 
§cell_idÙ$697b2310-9d96-4f7f-be62-c3bd6bf736f3´downstream_cells_mapÙ1reinforce_with_baseline_monte_carlo_control_fcann‘Ù$aa69e4ea-91e0-496a-a7be-529e67f4dbec²upstream_cells_mapÙ,reinforce_with_baseline_monte_carlo_control!’Ù$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$5b868eba-c1af-49f6-8f93-79b78c319a6f¥FCANN‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84«FCANNParams§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84»FCANN.initializeparams_saxe¥Int64¦Vector¤Real¨FunctionÙ&setup_fcann_policy_and_value_arguments‘Ù$e1aec891-d95a-47d1-97d7-d2a4cfb16e64¦length¤fillÙ$056a8adc-92f4-4b33-90d9-4b3b4026bbbc„´precedence_heuristic §cell_idÙ$056a8adc-92f4-4b33-90d9-4b3b4026bbbc´downstream_cells_map¼update_traces_with_gradient!”Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90f²upstream_cells_mapÞ*¤zeroÙ'BinarySquashedGaussianEligibilityVector‘Ù$76fd79a2-2bc8-45f8-a243-48415118898a¥isnan¦isless§digamma©@inbounds£oneÙ'Base.CoreLogging.Base.fixup_stdlib_path²Base.CoreLogging.!§nothing¡<¯Base.simd_index¦Vector¤Real¥isinf¥@simd¡/¡^¦NTuple¦Matrix¥@info±Base.invokelatest½Base.CoreLogging.invokelatest´Base.CoreLogging.===¡:³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41®julia.simdloopº#___this_pluto_module_name¤size¤BaseµBase.simd_outer_range¡-£log¥atanh»BinaryBetaEligibilityVector‘Ù$54fff14b-cf53-47b0-9cfa-8b9ee33df54e´Base.CoreLogging.isa¡+¡*³Base.CoreLogging.>=¶Base.simd_inner_length¿BinaryGaussianEligibilityVector‘Ù$10cdd16e-a337-4421-a7a0-6de4e4b60c0f¢Î¸Ù$bc8a399b-8864-4473-89d2-e3b0a03d15b5„´precedence_heuristic 
§cell_idÙ$bc8a399b-8864-4473-89d2-e3b0a03d15b5´downstream_cells_map¸corridor_parameter_study‘Ù$c52c4cec-0ea8-4af3-831a-d284f0e086ee²upstream_cells_mapƒ¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeµget_corridor_features‘Ù$6bb0263e-368e-462a-948c-baf9cfa82512Ù,actor_critic_binary_episodic_parameter_study’Ù$1f041cb3-618c-4380-a1ec-d7bbe4a80f62Ù$d9d11d69-bc16-400a-8f46-f9a8ecb8516aÙ$bba13634-ff0e-47f7-a23b-8d56098f4ac6„´precedence_heuristic §cell_idÙ$bba13634-ff0e-47f7-a23b-8d56098f4ac6´downstream_cells_mapƒ·gaussian_action_sampler·make_gaussian_n_samplerµmake_gaussian_sampler“Ù$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00Ù$d5020a8d-1dd7-403c-9d1f-665b95543943Ù$20776e09-7d9b-4db8-a060-7bceeec65b47²upstream_cells_mapŽ¨isapprox¤zero£exp¥isnan¤rand§Integer¦Vector¤Real¥isinf£Val¡+¦NTuple¦ntuple¦NormalÙ$407a0724-4bb6-4c83-ab2d-17a0e19c4072„´precedence_heuristic §cell_idÙ$407a0724-4bb6-4c83-ab2d-17a0e19c4072´downstream_cells_map¯reinforce_test4•Ù$27487ad0-4779-42ce-8def-e660ef04bee0Ù$9d264543-33ab-498a-90f5-5f913c252484Ù$07ba9fe4-aaa7-4123-9865-cbfa79d0d44aÙ$af144759-fe66-4ad0-b378-e9eb4e859db4Ù$e1274f57-75cb-4659-a82f-e5870c5367e2²upstream_cells_map†¥Int64®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6Ù*actor_critic_with_eligibility_traces_fcann‘Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54¼cartpole_fcann_feature_setup‘Ù$61650a97-b353-4a85-b50b-93fee296ac7bÙ3cartpole_fcann_feature_setup.update_feature_vector!§typemaxÙ$77cf3a74-899f-4ade-99f2-5aaf7a98c02d„´precedence_heuristic §cell_idÙ$77cf3a74-899f-4ade-99f2-5aaf7a98c02d´downstream_cells_map³scale_fcann_params!‘Ù$f3e2db06-9cb7-464a-96b8-938175efd26b²upstream_cells_map‡¤Real¡:©eachindex¡/©@inbounds«FCANNParams¦VectorÙ$28ce6e60-59cf-408a-8081-b978507b3c72„´precedence_heuristic 
§cell_idÙ$28ce6e60-59cf-408a-8081-b978507b3c72´downstream_cells_mapÙ$cartpole_fcann_continuing_test_state‘Ù$fd58402f-da65-44cf-b81a-e21192fd0e63²upstream_cells_mapÞ§@md_str¤Core¡:§deg2rad¨LinRange§PlutoUI‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70¢|>¨Base.get¥@bind¦Slider¤Base«PlutoRunner¡-·PlutoRunner.create_bond§confirm¯Core.applicable¯PlutoUI.combine¨getindexÙ$7ccadf01-fbba-4dfd-a5ad-770dab9946f9„´precedence_heuristic §cell_idÙ$7ccadf01-fbba-4dfd-a5ad-770dab9946f9´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$b72e030f-7d52-481f-b4f7-2b16b227e547„´precedence_heuristic §cell_idÙ$b72e030f-7d52-481f-b4f7-2b16b227e547´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$4c5cb75e-79b5-4502-b1eb-6246e002feaf„´precedence_heuristic §cell_idÙ$4c5cb75e-79b5-4502-b1eb-6246e002feaf´downstream_cells_map¹mountaincar_binary_params‘Ù$8eb42403-1234-4e59-993e-057cc3a6d5c9²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunner¯Core.applicable¥@bind¨Base.get½create_actor_critic_params_UI‘Ù$a8b40b8f-051a-4e6f-a079-ece4f32873deÙ$48b342f2-e48f-457a-9bd3-b3504a79f3a6„´precedence_heuristic §cell_idÙ$48b342f2-e48f-457a-9bd3-b3504a79f3a6´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$5d50a5d0-8fe2-4c6e-b76c-d5614e4fd884„´precedence_heuristic §cell_idÙ$5d50a5d0-8fe2-4c6e-b76c-d5614e4fd884´downstream_cells_map³show_or_lookup_plot²upstream_cells_mapŠ¤DictªNamedTuple§@md_str¨Function®AbstractString¥Tuple¦haskey¢==§Integer¨getindexÙ$ba645f6b-143f-4e83-9003-707770ae308d„´precedence_heuristic 
§cell_idÙ$ba645f6b-143f-4e83-9003-707770ae308d´downstream_cells_map»show_mountaincar_trajectory“Ù$da3cb392-78f2-48b2-b0dc-5f016664798cÙ$3a37b53d-9174-4faa-9404-74a40c385b0aÙ$ddbca73f-c692-46f2-95f3-a7dd849d33f7²upstream_cells_map£sum·HypertextLiteral.Bypass¸HypertextLiteral.content¤@htl‘Ù$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb§Integer¨Function§scatter¤plot·HypertextLiteral.Result°HypertextLiteral‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70¦Layoutªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¯MountainCarTaskÙ$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811„´precedence_heuristic §cell_idÙ$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811´downstream_cells_map¹update_corridor_features!—Ù$5720e942-d3f8-4329-83a8-8bcedf078b6aÙ$cacaaca6-6e01-464f-a2ee-cbf62737a426Ù$07ad517a-c2ac-4377-99fb-adb13d0f1d0cÙ$aa69e4ea-91e0-496a-a7be-529e67f4dbecÙ$a12b92d1-e045-4f92-b8cd-eee5d56fa67dÙ$9db9ff71-bee9-4bea-a45b-748f8517fed1Ù$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0²upstream_cells_mapƒ¤Real£one¦VectorÙ$8f1b2db4-ed35-44fc-a3d5-e06deae16d48„´precedence_heuristic §cell_idÙ$8f1b2db4-ed35-44fc-a3d5-e06deae16d48´downstream_cells_map€²upstream_cells_map€Ù$57bbdb10-bed8-459d-8f67-9ea637cf12ba„´precedence_heuristic §cell_idÙ$57bbdb10-bed8-459d-8f67-9ea637cf12ba´downstream_cells_map»one_step_actor_critic_fcann‘Ù$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0²upstream_cells_map¥FCANN‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84«FCANNParams§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84»FCANN.initializeparams_saxe¥Int64¦Vector¤Real¶one_step_actor_critic!‘Ù$4d4ae57b-afc3-44f9-b6fc-892f59f82921¨FunctionÙ&setup_fcann_policy_and_value_arguments‘Ù$e1aec891-d95a-47d1-97d7-d2a4cfb16e64¦length¤fillÙ$ca360680-afc9-4dd9-9351-493643f91575„´precedence_heuristic §cell_idÙ$ca360680-afc9-4dd9-9351-493643f91575´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$d95f75b5-21d8-4862-baa7-50b58d9725b8„´precedence_heuristic 
§cell_idÙ$d95f75b5-21d8-4862-baa7-50b58d9725b8´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$65be0e58-24be-4932-92a9-9e4825b14144„´precedence_heuristic §cell_idÙ$65be0e58-24be-4932-92a9-9e4825b14144´downstream_cells_mapÙ@actor_critic_binary_continuing_squashed_gaussian_parameter_study²upstream_cells_map…¤Real¨Function£one§IntegerContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295Ù$60c21e9c-e42d-4f0b-a910-3b318440fbc8„´precedence_heuristic §cell_idÙ$60c21e9c-e42d-4f0b-a910-3b318440fbc8´downstream_cells_map´gaussian_plot_params‘Ù$09dd1440-5d09-421f-addc-b1ede43ff517²upstream_cells_map§@md_str¤Core¡:§PlutoUI‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70¨Base.get¥@bind¦Slider¤Base«PlutoRunner·PlutoRunner.create_bond¯Core.applicable¯PlutoUI.combine¨getindexÙ$da2d3186-a778-41cc-9b49-759bf1e9b8fa„´precedence_heuristic §cell_idÙ$da2d3186-a778-41cc-9b49-759bf1e9b8fa´downstream_cells_map®BinaryFeatures’Ù$65d2add6-fd6f-456c-92ed-3cd9d1862ef6Ù$2cbc972b-c685-4c1c-8a8d-9d58b197ad90²upstream_cells_mapŠ¢C2¢C1¢C3¡T®AbstractVector¡N¤Base¦NTuple¥Union§IntegerÙ$b695ef21-a1ac-4d1f-a0e1-71cd81cede18„´precedence_heuristic §cell_idÙ$b695ef21-a1ac-4d1f-a0e1-71cd81cede18´downstream_cells_map€²upstream_cells_map‚Ù"plot_mountaincar_continuous_values‘Ù$68469a40-7976-48b7-b7a1-eaa4c5f33a18Ù"mountaincar_continuous_test_train2‘Ù$fee14dfe-c5ca-4126-a830-cc9d7eda5433Ù$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00„´precedence_heuristic 
§cell_idÙ$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00´downstream_cells_mapÙLreinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions’Ù$24fa139c-ad4b-49db-ac8f-23c476ed8608Ù$8aa16866-bfda-48df-9cf1-cf3d2e203ccb²upstream_cells_mapÞÙ,reinforce_with_baseline_monte_carlo_control!’Ù$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$5b868eba-c1af-49f6-8f93-79b78c319a6fµbinary_value_function‘Ù$a540814a-57a1-4b98-9443-59e401425444¿make_n_param_dist_policy_params‘Ù$ba41f521-4ee2-42a6-bf18-078bfa4b875e¥zeros³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¤rand§Integerµmake_gaussian_sampler‘Ù$bba13634-ff0e-47f7-a23b-8d56098f4ac6¦VectorContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295Ù#update_gaussian_eligibility_vector!’Ù$5261651e-a51e-4e80-8e23-83a4c10e5259Ù$740a3f41-9302-481d-b373-762c0dea8eff¤Real¨FunctionÙ&setup_binary_gaussian_policy_arguments‘Ù$ba5d6311-daee-4abc-b2fb-fae2184ef3eb½update_binary_value_gradient!‘Ù$03a218cb-aa83-4000-85b5-c6f247087053¦Matrix¦NTuple¥UnionÙ!update_binary_action_preferences!‘Ù$a361f4c9-47ce-42ad-899c-87b611c0d471Ù$dcb306ae-a1b1-43d6-ba6e-e38668838689„´precedence_heuristic §cell_idÙ$dcb306ae-a1b1-43d6-ba6e-e38668838689´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$54f559b6-8a62-4a42-894d-c56e41d5ebef„´precedence_heuristic §cell_idÙ$54f559b6-8a62-4a42-894d-c56e41d5ebef´downstream_cells_mapµcorridor_state_counts‘Ù$62e677ac-2070-4f6b-9df2-90849d89fa9f²upstream_cells_map»collect_state_distributions‘Ù$0c9986bb-54c0-4b08-9c29-4bfb0b68b54eÙ$f545c800-0bf3-491f-9d7d-42341cfdb573„´precedence_heuristic §cell_idÙ$f545c800-0bf3-491f-9d7d-42341cfdb573´downstream_cells_mapÙ%form_state_continuous_policy_function’Ù$5b868eba-c1af-49f6-8f93-79b78c319a6fÙ$11b9beea-b0cd-45eb-84c6-151728894df0²upstream_cells_map¨FunctionÙ$8b35661b-5075-4d63-bc31-044407f99acf„´precedence_heuristic 
§cell_idÙ$8b35661b-5075-4d63-bc31-044407f99acf´downstream_cells_map€²upstream_cells_mapƒµget_corridor_features‘Ù$6bb0263e-368e-462a-948c-baf9cfa82512Ù4actor_critic_with_eligibility_traces_binary_features‘Ù$05bfd818-bf4e-4bda-baa9-5ba647867097·corridor_continuing_mdp‘Ù$1ac9296f-047b-4051-ba5c-0c23d5f9cde9Ù$09dd1440-5d09-421f-addc-b1ede43ff517„´precedence_heuristic §cell_idÙ$09dd1440-5d09-421f-addc-b1ede43ff517´downstream_cells_map€²upstream_cells_map‡¦Normal£pdf§scatter¤plot¨LinRange¦Layout´gaussian_plot_params‘Ù$60c21e9c-e42d-4f0b-a910-3b318440fbc8Ù$a0ca7a5e-0089-4a45-9278-c0f27cd096a0„´precedence_heuristic §cell_idÙ$a0ca7a5e-0089-4a45-9278-c0f27cd096a0´downstream_cells_map€²upstream_cells_map‚Ù"plot_mountaincar_continuous_values‘Ù$68469a40-7976-48b7-b7a1-eaa4c5f33a18Ù"mountaincar_continuous_test_train3‘Ù$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7dÙ$64b38d1f-ecf9-4843-89a1-4c8953048265„´precedence_heuristic §cell_idÙ$64b38d1f-ecf9-4843-89a1-4c8953048265´downstream_cells_mapÙ&cartpole_fcann_continuing_test_episode”Ù$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fceÙ$6acb549a-5d90-4457-a347-d22448ad8071Ù$fad02876-efba-46a7-9cb7-43820528779fÙ$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7ef²upstream_cells_mapƒ®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6ªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¾cartpole_continuing_fcann_test‘Ù$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84„´precedence_heuristic§cell_idÙ$d963ff6d-f1b6-4799-aa0e-1ae100310d84´downstream_cells_mapÞž§spzeros¨droptol!´requestCostFunctionsÙ#monte_carlo_off_policy_prediction_qµdouble_expected_sarsa·AbstractAveragingMethod²policy_iteration_v§TailRec®sample_rollout®autoTuneParamsªq_learningªgetBackend©fullTrain«sparse_vcat»monte_carlo_control_Ïµ_softreadBinParams³make_greedy_policy!±policy_iteration!ªsmartTuneRÙ$monte_carlo_control_exploring_starts¦sparse½initialize_state_action_valueªmultiTrain¾StateMRPTransitionDistributionÙ!monte_carlo_off_policy_prediction¬rook_actions·td0_policy_prediction_v·t
d0_policy_prediction_q¥FCANNœÙ$cc3ac95e-a398-438a-ba3d-62b6733f6342Ù$45f0a385-6465-4acc-8637-1b007a0fe215Ù$0e9de19e-bcd4-40ac-9831-afb6cad38422Ù$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091Ù$5c4a383f-fcf2-4f2b-819f-6d84471dda00Ù$635abb34-2c97-4f04-a74c-22fbec32f408Ù$f3e2db06-9cb7-464a-96b8-938175efd26bÙ$697b2310-9d96-4f7f-be62-c3bd6bf736f3Ù$57bbdb10-bed8-459d-8f67-9ea637cf12baÙ$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54Ù$8bc280db-e57d-4e40-be46-1790f4f7d9e7Ù$11063fff-4d36-46d5-828f-dbed0f46b9cf¨@tailrec©simulate!»TabularMRPTransitionSampler±multiTrainAutoReg¹get_cuda_toolkit_versions¹StateMRPTransitionSampler©autoTuneRªwriteArray´AbstractSparseMatrix¹@using_nvidialib_settings³bellman_state_valueªdropoutReg¯smartEvalLayers´distribution_rollout±calcfeatureimpact¯SparseMatrixCSC°ADAMAXTrainNNCPU´AbstractSparseVector«AbstractMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295³policy_evaluation_q½monte_carlo_policy_predictionsample_action•Ù$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$e7e49ff8-32df-48a4-afb2-462859592e92Ù$4d4ae57b-afc3-44f9-b6fc-892f59f82921Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1«traintrials«runepisode!’Ù$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$5b868eba-c1af-49f6-8f93-79b78c319a6f±double_q_learningªmaxNormReg¿monte_carlo_policy_prediction_v¯NVIDIALibrariesµAbstractAfterstateMDP¬sparse_hvcatªAbstractMP²policy_evaluation!²ApproximationUtils‘Ù$f946c886-6246-4f98-a96f-f06984691ad8«sparse_hcat±value_iteration_q°initializeParams¨issparse·monte_carlo_tree_searchµ_Previous_Controller_¨archEval¶monte_carlo_prediction³AbstractSparseArray³benchmarkCPUThreads¨nonzeros²generalized_sarsa!³policy_evaluation_vªevalLayers«ftranspose!¨StateMRP§nzrange¹AbstractTabularTransition©TabularRL‘Ù$3c316495-bb6c-41e2-a38f-ba867a319fbbµConstantStepAveraging®archEvalSample©tuneAlpha¹make_stochastic_gridworld«writeParams¸bellman_afterstate_value¨StateMDPÜ 
Ù$5cc4d12d-b537-47e2-8109-4e7a234fdf25Ù$0ac7ea44-14f6-4e80-80f9-d6df8059bb38Ù$96506201-6b66-49e6-8179-06952e2394e1Ù$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290Ù$8e39bd15-862e-4941-88f9-2794b861a523Ù$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091Ù$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbÙ$d1ed25e6-60c6-411f-a541-99986e5da2c5Ù$697b2310-9d96-4f7f-be62-c3bd6bf736f3Ù$4d4ae57b-afc3-44f9-b6fc-892f59f82921Ù$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2Ù$57e5e12a-b722-4ea3-ab3b-e5711029e640Ù$57bbdb10-bed8-459d-8f67-9ea637cf12baÙ$266d2234-26c8-43f1-9e75-49440a230ed6Ù$05bfd818-bf4e-4bda-baa9-5ba647867097Ù$68806899-9972-460a-9f11-daa708a9d610Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54Ù$1f041cb3-618c-4380-a1ec-d7bbe4a80f62Ù$11ea640c-3981-404d-87c6-4d3d0708a2b8Ù$f8614042-7c94-4d47-a1b6-4e96676b4e8bÙ$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$f0104778-81a6-417b-8501-f916e5e7f3afÙ$734573e5-547b-4dcc-89bb-412aa6cc42d6Ù$e96d592d-1e54-486d-8ad9-b857f85476e8Ù$ff4f977e-48df-4c12-845c-c245b4d39d6dÙ$8bc280db-e57d-4e40-be46-1790f4f7d9e7Ù$5aba4f96-e877-457e-8e95-18737348f99fÙ$11063fff-4d36-46d5-828f-dbed0f46b9cfÙ$4c4e643b-d4b9-44f0-8d30-dc521bcc55acÙ$00152954-dc98-4120-b94b-2ea4d987832bÙ$d9d11d69-bc16-400a-8f46-f9a8ecb8516a²make_random_policy§spdiagm©dropzeros¬preptraining«backendList§sprandn¨LossType®mrp_evaluation·AbstractStateTransition®expected_sarsa©evalMulti¯SampleAveraging¾monte_carlo_off_policy_control¥tuneRªsetBackendªTabularMRP£uctªmakelookupªapply_uct!¥L2Reg¯GridworldAction®td0_prediction©sparsevec¾TabularDeterministicTransition°value_iteration!©blockdiag´TabularAfterstateMDPµtd0_policy_prediction®GridworldState¶make_Ïµ_greedy_policy!‘Ù$5981f52b-d829-4c7d-b47b-33310f7d64a2±value_iteration_vªTabularMDP²AbstractTransition‘Ù$c8b47eac-2d45-419a-bec6-2ae0cdc59393¯mrp_evaluation!«AbstractMRP¦sprand´find_terminal_states¥sarsa»TabularMDPTransitionSamplerswitch_deviceªdropzeros!½TabularTransitionDistribution«OutputIndex‘Ù$5c4a383f-fcf2-4f2b-819f-6d84471dda00³monte_carlo_control¾StateMDPTran
sitionDistribution±policy_evaluation¬readBinInput¦fkeep!¶bellman_policy_update!¬SparseVector¯value_iteration©testTrainªrunepisodeÜÙ$f7433324-acc3-49a5-b5b3-ada0c8f09d52Ù$fb8904a9-ae64-41cc-93b6-5a25855edad0Ù$cecc2a35-3850-4f66-84e8-e29da4f3d4b0Ù$0c9986bb-54c0-4b08-9c29-4bfb0b68b54eÙ$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$0cd96c44-cae6-421f-9fae-26141600bef4Ù$64b38d1f-ecf9-4843-89a1-4c8953048265Ù$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6Ù$5b868eba-c1af-49f6-8f93-79b78c319a6fÙ$cf1859d6-f889-4923-8c87-2d7c039f26c3Ù$31db0f58-28e4-454f-9394-25565687266fÙ$dddc4a2f-34b2-41dc-85b3-55aba4880fa6Ù$5859ca11-90f8-4fd6-88ed-c56efe796fe8Ù$11a55af7-5301-4507-bb26-88e1e11236dbÙ$07ba9fe4-aaa7-4123-9865-cbfa79d0d44aÙ$e1274f57-75cb-4659-a82f-e5870c5367e2Ù$daf35bfe-8f9c-4f55-971d-4d443be8f8bfÙ$a5b002c9-5e11-462a-9da0-6e060c7963f8Ù$ba645f6b-143f-4e83-9003-707770ae308dÙ$b5319d8b-0420-4ebf-b603-ea0b93365ac1¼make_deterministic_gridworld°CrossEntropyLoss‘Ù$45f0a385-6465-4acc-8637-1b007a0fe215¦findnz¬SparseArrays¬checkNumGrad»TabularStochasticTransition»initialize_afterstate_value£nnzºbellman_state_action_value¹StateMDPTransitionSampler•Ù$5cc4d12d-b537-47e2-8109-4e7a234fdf25Ù$f0104778-81a6-417b-8501-f916e5e7f3afÙ$4c4e643b-d4b9-44f0-8d30-dc521bcc55acÙ$00152954-dc98-4120-b94b-2ea4d987832bÙ$3c316495-bb6c-41e2-a38f-ba867a319fbb¯benchmarkDevice¶find_available_actions¶initialize_state_value¿monte_carlo_policy_prediction_q°policy_iteration§permute§rowvals²upstream_cells_map½Base.CoreLogging.invokelatest´Base.CoreLogging.===¨@raw_strº#___this_pluto_module_name§rethrow²Base.CoreLogging.!Ù'Base.CoreLogging.Base.fixup_stdlib_path¤Base´Base.CoreLogging.isa±Base.invokelatest³Base.CoreLogging.>=¨@__DIR__»PlutoDevMacros.@frompackageÙ$b16899b7-36bf-4a5e-8e2f-4496b8450687„´precedence_heuristic 
§cell_idÙ$b16899b7-36bf-4a5e-8e2f-4496b8450687´downstream_cells_mapµsquashed_gaussian_pdf‘Ù$00bd2835-b006-4244-9877-bc7e031e3ef8²upstream_cells_map£exp¤sqrtAbstractArray¤Real¡-¥atanh¡/¡^¢Ï€¥Union¡*£abs£invÙ$10cdd16e-a337-4421-a7a0-6de4e4b60c0f„´precedence_heuristic §cell_idÙ$10cdd16e-a337-4421-a7a0-6de4e4b60c0f´downstream_cells_map¿BinaryGaussianEligibilityVector”Ù$f55afa58-962d-4551-8d95-a5b467d61adfÙ$740a3f41-9302-481d-b373-762c0dea8effÙ$ba5d6311-daee-4abc-b2fb-fae2184ef3ebÙ$056a8adc-92f4-4b33-90d9-4b3b4026bbbc²upstream_cells_mapŠ¤Real¤zero¡N¥zeros³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¥Union¦NTuple¤ones£one¦VectorÙ$a8b40b8f-051a-4e6f-a079-ece4f32873de„´precedence_heuristic §cell_idÙ$a8b40b8f-051a-4e6f-a079-ece4f32873de´downstream_cells_map½create_actor_critic_params_UI”Ù$36d514fa-b27a-4c6b-8399-9d108377b9b5Ù$4c5cb75e-79b5-4502-b1eb-6246e002feafÙ$71a5fce8-6d9a-4625-bad1-a951d61bff28Ù$0d45ae72-572f-4d17-83cf-9814f2854131²upstream_cells_mapŽ§@md_strBase.getindex¡:§PlutoUI‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70¢|>¸HypertextLiteral.content¦Slider¤@htl‘Ù$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb¤Base«NumberField·HypertextLiteral.Result°HypertextLiteral‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70¯PlutoUI.combine§confirmÙ$5eebf3da-bfe7-46eb-81a3-f87f334ee270„´precedence_heuristic §cell_idÙ$5eebf3da-bfe7-46eb-81a3-f87f334ee270´downstream_cells_mapÙ#create_actor_critic_fcann_params_UI‘Ù$9978d537-49ff-4014-a971-b42704c50a6b²upstream_cells_map‰§@md_str¡:§PlutoUI‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70«NumberField¢|>¯PlutoUI.combine§confirm¦Slider¨getindexÙ$9bce6fdb-2cbc-4758-9a8b-794e490c973d„´precedence_heuristic §cell_idÙ$9bce6fdb-2cbc-4758-9a8b-794e490c973d´downstream_cells_map¨ep2_step‘Ù$1ce4bc6c-7cde-48e9-8ff1-7281697fd121²upstream_cells_map‹¤Core¡:¨Base.get¥@bind£ep2‘Ù$a5b002c9-5e11-462a-9da0-6e060c7963f8¦Slider¦length¤Base«PlutoRunner·PlutoRunner.create_bond¯Core.applicableÙ$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbf„´precedence_heuristic 
§cell_idÙ$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbf´downstream_cells_mapÙ$create_continuous_action_mountaincar’Ù$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2Ù$349631b2-4686-49a9-9f3a-1e4ad588b568²upstream_cells_map‰£abs¡<´MountainCarTask.step¤sign¡>¦isless¡*ContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¯MountainCarTaskÙ$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6„´precedence_heuristic §cell_idÙ$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6´downstream_cells_mapÙ#mountaincar_continuing_test_episode²upstream_cells_mapƒÙ mountaincar_continuing_tile_test‘Ù$b02ba928-5b9f-4695-b980-07988c788bb9ªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¯MountainCarTaskÙ$7afb6fb0-248a-4518-b94f-9876f81eca64„´precedence_heuristic §cell_idÙ$7afb6fb0-248a-4518-b94f-9876f81eca64´downstream_cells_mapÙ#corridor_continuing_parameter_study‘Ù$42775fd1-5b27-48e0-abf1-9b22bb775e6d²upstream_cells_mapƒµget_corridor_features‘Ù$6bb0263e-368e-462a-948c-baf9cfa82512·corridor_continuing_mdp‘Ù$1ac9296f-047b-4051-ba5c-0c23d5f9cde9Ù#actor_critic_linear_parameter_study“Ù$734573e5-547b-4dcc-89bb-412aa6cc42d6Ù$e96d592d-1e54-486d-8ad9-b857f85476e8Ù$ff4f977e-48df-4c12-845c-c245b4d39d6dÙ$37a273b6-b104-46f0-987a-401dc1c97327„´precedence_heuristic §cell_idÙ$37a273b6-b104-46f0-987a-401dc1c97327´downstream_cells_mapÙ,start_cartpole_continuing_binary_param_study‘Ù$b2539398-fdbc-42a2-a8f3-d327358f3643²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunnerCounterButton¯Core.applicable¥@bind¨Base.getÙ$7a6f3f79-ea06-4994-8b62-90b2056e4034„´precedence_heuristic §cell_idÙ$7a6f3f79-ea06-4994-8b62-90b2056e4034´downstream_cells_mapƒÙ squashed_gaussian_action_samplerÙ make_squashed_gaussian_n_sampler¾make_squashed_gaussian_sampler‘Ù$05f120be-9695-4824-82fd-142a0df13098²upstream_cells_mapÞ¨isapprox¤zero£exp¥isnan¤rand§Integer¦Vector¤Real¥isinf¤sign£Val¡+¦NTuple¦ntuple¤tanh¡*¦NormalÙ$f2ed56c9-c2b7-42cb-a083-e12aeaa126ef„´precedence_heuristic 
§cell_idÙ$f2ed56c9-c2b7-42cb-a083-e12aeaa126ef´downstream_cells_map€²upstream_cells_map„¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeµget_corridor_features‘Ù$6bb0263e-368e-462a-948c-baf9cfa82512¡^Ù-reinforce_monte_carlo_control_binary_features‘Ù$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290Ù$cbea5840-49d2-4e91-be9c-f5f15666d78a„´precedence_heuristic §cell_idÙ$cbea5840-49d2-4e91-be9c-f5f15666d78a´downstream_cells_map€²upstream_cells_map„¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeµget_corridor_features‘Ù$6bb0263e-368e-462a-948c-baf9cfa82512Ù;reinforce_with_baseline_monte_carlo_control_binary_features‘Ù$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb¡^Ù$1f041cb3-618c-4380-a1ec-d7bbe4a80f62„´precedence_heuristic §cell_idÙ$1f041cb3-618c-4380-a1ec-d7bbe4a80f62´downstream_cells_mapÙ,actor_critic_binary_episodic_parameter_study’Ù$bc8a399b-8864-4473-89d2-e3b0a03d15b5Ù$8eb42403-1234-4e59-993e-057cc3a6d5c9²upstream_cells_mapÞ¦Random‘Ù$df7f84e8-b42a-4001-9dbf-6bc3ced94207¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦length¤copy¦Vector¤Real§scatter¡/¦MatrixÙ4actor_critic_with_eligibility_traces_binary_features‘Ù$05bfd818-bf4e-4bda-baa9-5ba647867097§isempty¤mean¡:®AbstractVector¢|>£Inf¥zeros¤rand§Integer¨Function¦UInt64¡-¤plot¦foldxt¡+£Map¦Layout¬Random.seed!Ù$96506201-6b66-49e6-8179-06952e2394e1„´precedence_heuristic §cell_idÙ$96506201-6b66-49e6-8179-06952e2394e1´downstream_cells_map½setup_binary_policy_arguments”Ù$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290Ù$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbÙ$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2Ù$05bfd818-bf4e-4bda-baa9-5ba647867097²upstream_cells_mapŠ¤copy¤Real¨Function¦length½update_binary_feature_vector!‘Ù$8eab55a5-41b7-4f5e-a02f-4c19388bc9ea¥zeros³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41·BinaryEligibilityVector‘Ù$41dc149d-c6f3-4b0d-a856-06f3aaae3049§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$76b03e72-da04-4530-8534-6d6468268cbd„´precedence_heuristic 
§cell_idÙ$76b03e72-da04-4530-8534-6d6468268cbd´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$fd89433e-643c-474b-b3c4-a997678421a6„´precedence_heuristic §cell_idÙ$fd89433e-643c-474b-b3c4-a997678421a6´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$87feff3e-e510-4916-91a9-db3a2cd12225„´precedence_heuristic §cell_idÙ$87feff3e-e510-4916-91a9-db3a2cd12225´downstream_cells_mapÙ&fcann_continuing_cartpole_study_params²upstream_cells_mapÞ§@md_str¤Core¡:§PlutoUI‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70¢|>¨Base.get¥@bind¦Slider¤Base«PlutoRunner·PlutoRunner.create_bond«NumberField§confirm¯Core.applicable¯PlutoUI.combine¨getindexÙ$5261651e-a51e-4e80-8e23-83a4c10e5259„´precedence_heuristic §cell_idÙ$5261651e-a51e-4e80-8e23-83a4c10e5259´downstream_cells_mapÙ#update_gaussian_eligibility_vector!“Ù$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00Ù$d5020a8d-1dd7-403c-9d1f-665b95543943Ù$20776e09-7d9b-4db8-a060-7bceeec65b47²upstream_cells_mapÞ¤zero¦isless©@inbounds£one§nothing¦Vector¡<¯Base.simd_index©eachindex¤Real¥@simd¡^¦Matrix¦NTuple¤last£exp¡:¥first®julia.simdloop¤BaseµBase.simd_outer_range¡-¶Base.simd_inner_length¡+¡*Ù$dddc4a2f-34b2-41dc-85b3-55aba4880fa6„´precedence_heuristic §cell_idÙ$dddc4a2f-34b2-41dc-85b3-55aba4880fa6´downstream_cells_map€²upstream_cells_map…®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6¸display_cartpole_episode‘Ù$822e4d69-2582-4956-858e-06ecb091e76a¢|>®reinforce_test‘Ù$24fa139c-ad4b-49db-ac8f-23c476ed8608ªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$54fff14b-cf53-47b0-9cfa-8b9ee33df54e„´precedence_heuristic §cell_idÙ$54fff14b-cf53-47b0-9cfa-8b9ee33df54e´downstream_cells_map»BinaryBetaEligibilityVector”Ù$f55afa58-962d-4551-8d95-a5b467d61adfÙ$d41f1dd1-45fe-4456-9a01-ed47fd6704a7Ù$ed93259c-7b8b-46d7-97fb-f194e0e04b3aÙ$056a8adc-92f4-4b33-90d9-4b3b4026bbbc²upstream_cells_mapˆ¤Real¡N³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¥Union¦NTuple¤ones£one¦VectorÙ$023f67b8-8f38-470a-9766-ac60a75678aa„´precedence_heuristic 
§cell_idÙ$023f67b8-8f38-470a-9766-ac60a75678aa´downstream_cells_map·mountaincar_fcann_setup‘Ù$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9²upstream_cells_mapƒºfcann_feature_vector_setup‘Ù$9acdbf38-2e10-45ec-85a0-d0db8453a599´mountaincar_min_vals‘Ù$2025ff38-f2ec-4224-b771-ff72ffe1af28´mountaincar_max_vals‘Ù$77906355-08f8-4b08-b051-84697199b519Ù$1558cec1-c4fd-4bc0-85ed-ae22c6067d41„´precedence_heuristic §cell_idÙ$1558cec1-c4fd-4bc0-85ed-ae22c6067d41´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$da8d0bca-105b-4d0b-a73d-ee5c9059aeaf„´precedence_heuristic §cell_idÙ$da8d0bca-105b-4d0b-a73d-ee5c9059aeaf´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$3e7cecec-eb77-4862-8e3c-b510422e06db„´precedence_heuristic §cell_idÙ$3e7cecec-eb77-4862-8e3c-b510422e06db´downstream_cells_map€²upstream_cells_map‚½squashed_gaussian_plot_params‘Ù$94517664-6988-44dc-a297-e9d5873ee540¶plot_squashed_gaussian‘Ù$00bd2835-b006-4244-9877-bc7e031e3ef8Ù$0284f0d7-b8a9-4ae6-add0-ac1078571d9b„´precedence_heuristic §cell_idÙ$0284f0d7-b8a9-4ae6-add0-ac1078571d9b´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$b94fc99c-f439-4df2-8da3-c01718a136c4„´precedence_heuristic §cell_idÙ$b94fc99c-f439-4df2-8da3-c01718a136c4´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$b8532822-179b-4cd5-a279-4b71dafb544a„´precedence_heuristic §cell_idÙ$b8532822-179b-4cd5-a279-4b71dafb544a´downstream_cells_mapÙ!mountaincar_continuous_test_train’Ù$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580aÙ$c87dba8c-9a96-41b3-9dc7-a6c088ec1eaf²upstream_cells_map…ºmountaincar_continuous_mdp‘Ù$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2¥Int64¼mountaincar_tilecoding_setup‘Ù$7c592385-e8d3-4efe-962c-d39debb64405ÙEactor_critic_with_eligibility_traces_binary_features_gaussian_actions‘Ù$20776e09-7d9b-4db8-a060-7bceeec65b47§typemaxÙ$07ba9fe4-aaa7-4123-9865-cbfa79d0d44a„´precedence_heuristic 
§cell_idÙ$07ba9fe4-aaa7-4123-9865-cbfa79d0d44a´downstream_cells_map€²upstream_cells_map…¯reinforce_test4‘Ù$407a0724-4bb6-4c83-ab2d-17a0e19c4072®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6¸display_cartpole_episode‘Ù$822e4d69-2582-4956-858e-06ecb091e76a¢|>ªrunepisode‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7d„´precedence_heuristic §cell_idÙ$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7d´downstream_cells_mapÙ.start_mountaincar_continuing_fcann_param_study‘Ù$cb70d400-3e9c-441c-b17c-e727e8c928f3²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunnerCounterButton¯Core.applicable¥@bind¨Base.getÙ$5c4a383f-fcf2-4f2b-819f-6d84471dda00„´precedence_heuristic §cell_idÙ$5c4a383f-fcf2-4f2b-819f-6d84471dda00´downstream_cells_map¼update_fcann_value_gradient!‘Ù$f3e2db06-9cb7-464a-96b8-938175efd26b²upstream_cells_map¡:®AbstractVector¥FCANN‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¤Bool«FCANNParams©@inbounds«OutputIndex‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84§Integer¦Vector¥Int64°FCANNActivations‘Ù$5c11a92d-7496-4aba-af15-2537eac49dd7©eachindex§Float32¡*´FCANN.nnCostFunctionÙ$135f205a-f87e-4691-8e87-d317d6312c84„´precedence_heuristic §cell_idÙ$135f205a-f87e-4691-8e87-d317d6312c84´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$4a39f9a7-72d4-44ad-895a-742cd1291f92„´precedence_heuristic §cell_idÙ$4a39f9a7-72d4-44ad-895a-742cd1291f92´downstream_cells_map«dist_plot_p‘Ù$9cf3dc5f-8a25-479f-93db-06e34f0d37a0²upstream_cells_map‹¤Core¡:¢|>¨Base.get¥@bind¦Slider¤Base«PlutoRunner·PlutoRunner.create_bond§confirm¯Core.applicableÙ$ee72af8d-3cb8-4314-82df-580f068e1252„´precedence_heuristic §cell_idÙ$ee72af8d-3cb8-4314-82df-580f068e1252´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$e524f8cc-ab69-4f8b-a59f-28156696a104„´precedence_heuristic 
§cell_idÙ$e524f8cc-ab69-4f8b-a59f-28156696a104´downstream_cells_mapÙ8run_mountaincar_binary_episodic_countinuous_param_study2‘Ù$0d93132d-5819-47dc-8cf2-462d480d9c3d²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunnerCounterButton¯Core.applicable¥@bind¨Base.getÙ$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09„´precedence_heuristic §cell_idÙ$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09´downstream_cells_map€²upstream_cells_map€Ù$f3bc47b5-03fc-4bd9-a890-26f9608a730b„´precedence_heuristic §cell_idÙ$f3bc47b5-03fc-4bd9-a890-26f9608a730b´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$4915b1ed-ad53-4ece-9b00-bc136d47d8dc„´precedence_heuristic §cell_idÙ$4915b1ed-ad53-4ece-9b00-bc136d47d8dc´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$f924eb30-d1cc-4941-8fb5-ff70ad425ab9„´precedence_heuristic §cell_idÙ$f924eb30-d1cc-4941-8fb5-ff70ad425ab9´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$d83dc659-dce7-41dd-a8e7-2933ab39d15c„´precedence_heuristic §cell_idÙ$d83dc659-dce7-41dd-a8e7-2933ab39d15c´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fce„´precedence_heuristic §cell_idÙ$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fce´downstream_cells_map€²upstream_cells_map‚¸display_cartpole_episode‘Ù$822e4d69-2582-4956-858e-06ecb091e76aÙ&cartpole_fcann_continuing_test_episode‘Ù$64b38d1f-ecf9-4843-89a1-4c8953048265Ù$83ca0577-15d7-4448-b597-c77810b812bf„´precedence_heuristic 
§cell_idÙ$83ca0577-15d7-4448-b597-c77810b812bf´downstream_cells_map°figure_13_2_test‘Ù$a7dcc8cd-04ec-48f2-a387-116330eaffb2²upstream_cells_mapÞ¤sqrt¦Random‘Ù$df7f84e8-b42a-4001-9dbf-6bc3ced94207¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabe§scatter¡/¤fill¥round¡:¢|>¥Int64µget_corridor_features‘Ù$6bb0263e-368e-462a-948c-baf9cfa82512¡-¤log2¦foldxt¤plot¡+¡*Ù-reinforce_monte_carlo_control_binary_features‘Ù$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290£MapÙ;reinforce_with_baseline_monte_carlo_control_binary_features‘Ù$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb¦Layout¬Random.seed!Ù$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb„´precedence_heuristic §cell_idÙ$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb´downstream_cells_mapÙ;reinforce_with_baseline_monte_carlo_control_binary_features•Ù$cbea5840-49d2-4e91-be9c-f5f15666d78aÙ$83ca0577-15d7-4448-b597-c77810b812bfÙ$e5c1aca8-7575-4835-8273-e69ca0a55fe8Ù$d3b56fca-5b79-4465-8987-8d0005f854d8Ù$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03²upstream_cells_mapÙ,reinforce_with_baseline_monte_carlo_control!’Ù$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$5b868eba-c1af-49f6-8f93-79b78c319a6fµbinary_value_function‘Ù$a540814a-57a1-4b98-9443-59e401425444½setup_binary_policy_arguments‘Ù$96506201-6b66-49e6-8179-06952e2394e1¥zeros³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦Vector¤Real¨FunctionÙ!update_binary_eligibility_vector!‘Ù$042fbafe-2401-4fb7-ac13-4531e0782c79¦length½update_binary_value_gradient!‘Ù$03a218cb-aa83-4000-85b5-c6f247087053¦MatrixÙ!update_binary_action_preferences!‘Ù$a361f4c9-47ce-42ad-899c-87b611c0d471Ù$a7dcc8cd-04ec-48f2-a387-116330eaffb2„´precedence_heuristic §cell_idÙ$a7dcc8cd-04ec-48f2-a387-116330eaffb2´downstream_cells_map€²upstream_cells_map„¡:°figure_13_2_test‘Ù$83ca0577-15d7-4448-b597-c77810b812bf¤vcat¡^Ù$0ab70fc3-6188-42eb-aba2-d808f319be9f„´precedence_heuristic 
§cell_idÙ$0ab70fc3-6188-42eb-aba2-d808f319be9f´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$047656d1-2921-40f2-b75b-ce4a87098007„´precedence_heuristic §cell_idÙ$047656d1-2921-40f2-b75b-ce4a87098007´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$5d434c83-c9ca-499f-8695-c7733031c2de„´precedence_heuristic §cell_idÙ$5d434c83-c9ca-499f-8695-c7733031c2de´downstream_cells_map¸cartpole_continuing_step‘Ù$4c4e643b-d4b9-44f0-8d30-dc521bcc55ac²upstream_cells_map‡CartPoleState·cartpole_functions.stepÙ#cartpole_functions.initialize_stateºcartpole_functions.failure¡+²cartpole_functions‘Ù$f27f2bcd-05b6-44fe-bf9e-a3e51556db7c§IntegerÙ$3a37b53d-9174-4faa-9404-74a40c385b0a„´precedence_heuristic §cell_idÙ$3a37b53d-9174-4faa-9404-74a40c385b0a´downstream_cells_map€²upstream_cells_map‚»show_mountaincar_trajectory‘Ù$ba645f6b-143f-4e83-9003-707770ae308dÙ!mountaincar_continuing_fcann_test‘Ù$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9Ù$820752af-8966-4ee8-82f7-a40934522de5„´precedence_heuristic §cell_idÙ$820752af-8966-4ee8-82f7-a40934522de5´downstream_cells_map€²upstream_cells_map€Ù$6acb549a-5d90-4457-a347-d22448ad8071„´precedence_heuristic §cell_idÙ$6acb549a-5d90-4457-a347-d22448ad8071´downstream_cells_mapÙ-cartpole_fcann_continuing_episode_step_select’Ù$fad02876-efba-46a7-9cb7-43820528779fÙ$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7ef²upstream_cells_map‹¤Core¡:¨Base.get¥@bind¦Slider¦length¤Base«PlutoRunner·PlutoRunner.create_bondÙ&cartpole_fcann_continuing_test_episode‘Ù$64b38d1f-ecf9-4843-89a1-4c8953048265¯Core.applicableÙ$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62„´precedence_heuristic 
§cell_idÙ$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62´downstream_cells_mapÙ)cartpole_fcann_continuing_parameter_study‘Ù$50ae94c4-70f3-4215-82bd-eb2227c2badf²upstream_cells_map†·cartpole_vector_update!‘Ù$192b9f82-8d3a-408f-91c2-829cfcd32572·cartpole_continuing_mdp‘Ù$4c4e643b-d4b9-44f0-8d30-dc521bcc55ac¼cartpole_fcann_feature_setup‘Ù$61650a97-b353-4a85-b50b-93fee296ac7b¤fillÙ"actor_critic_fcann_parameter_study“Ù$8bc280db-e57d-4e40-be46-1790f4f7d9e7Ù$5aba4f96-e877-457e-8e95-18737348f99fÙ$11063fff-4d36-46d5-828f-dbed0f46b9cf§IntegerÙ$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728„´precedence_heuristic §cell_idÙ$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728´downstream_cells_map€²upstream_cells_map†¥Int64¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeµget_corridor_features‘Ù$6bb0263e-368e-462a-948c-baf9cfa82512¡^Ù4actor_critic_with_eligibility_traces_binary_features‘Ù$05bfd818-bf4e-4bda-baa9-5ba647867097§typemaxÙ$ae0f5a96-7a4b-47f9-be1e-e803a238a071„´precedence_heuristic §cell_idÙ$ae0f5a96-7a4b-47f9-be1e-e803a238a071´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$41d62de1-2c92-41ee-9430-b9ca3007afd9„´precedence_heuristic §cell_idÙ$41d62de1-2c92-41ee-9430-b9ca3007afd9´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$8eb42403-1234-4e59-993e-057cc3a6d5c9„´precedence_heuristic §cell_idÙ$8eb42403-1234-4e59-993e-057cc3a6d5c9´downstream_cells_map€²upstream_cells_mapŠ§@md_str¡<Ù,actor_critic_binary_episodic_parameter_study’Ù$1f041cb3-618c-4380-a1ec-d7bbe4a80f62Ù$d9d11d69-bc16-400a-8f46-f9a8ecb8516a¡>¹mountaincar_binary_params‘Ù$4c5cb75e-79b5-4502-b1eb-6246e002feaf¦isless¼mountaincar_tilecoding_setup‘Ù$7c592385-e8d3-4efe-962c-d39debb64405Ù+run_mountaincar_binary_episodic_param_study‘Ù$192cc1cf-9ea1-492d-baa7-f2e197abecd4¯MountainCarTask¨getindexÙ$bbc8864a-1545-433f-bc7c-0ddf6e907138„´precedence_heuristic 
§cell_idÙ$bbc8864a-1545-433f-bc7c-0ddf6e907138´downstream_cells_map¾plot_mountaincar_policy_values‘Ù$dc2efc6c-8da8-425b-aa5f-290949109565²upstream_cells_mapŽ¡:·HypertextLiteral.Bypass¨LinRange¥zeros¸HypertextLiteral.content¤@htl‘Ù$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb¨Function©enumerate§Float32¤plot°HypertextLiteral‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70§heatmap·HypertextLiteral.Result¦LayoutÙ$a12b92d1-e045-4f92-b8cd-eee5d56fa67d„´precedence_heuristic §cell_idÙ$a12b92d1-e045-4f92-b8cd-eee5d56fa67d´downstream_cells_map°best_mc_corridor’Ù$44b32cc0-36a8-41fd-89bc-ce894536926cÙ$553b0ceb-f2ca-41ee-99bc-9f53a4487b49²upstream_cells_map„¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeÙ;reinforce_with_baseline_monte_carlo_control_linear_features‘Ù$d1ed25e6-60c6-411f-a541-99986e5da2c5¡^¹update_corridor_features!‘Ù$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811Ù$ce33f710-fd9d-4dfa-acda-40204e54d518„´precedence_heuristic §cell_idÙ$ce33f710-fd9d-4dfa-acda-40204e54d518´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$339b4d2b-2237-46a3-9867-ecc3332856c1„´precedence_heuristic §cell_idÙ$339b4d2b-2237-46a3-9867-ecc3332856c1´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$a8349352-3242-46d5-b0d5-1b6eb5d77e90„´precedence_heuristic §cell_idÙ$a8349352-3242-46d5-b0d5-1b6eb5d77e90´downstream_cells_map¡x‘Ù$d4e87ac4-6008-43b2-aa06-e232ec2b2b5b²upstream_cells_map‰¤Core¤Base¡:·PlutoRunner.create_bond«PlutoRunner¯Core.applicable¥@bind¨Base.get¦SliderÙ$7d63b960-3998-4f7b-8cbb-ccd49db9aeac„´precedence_heuristic §cell_idÙ$7d63b960-3998-4f7b-8cbb-ccd49db9aeac´downstream_cells_map€²upstream_cells_map†¥Int64Ù%one_step_actor_critic_binary_features‘Ù$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeµget_corridor_features‘Ù$6bb0263e-368e-462a-948c-baf9cfa82512¡^§typemaxÙ$65d2add6-fd6f-456c-92ed-3cd9d1862ef6„´precedence_heuristic 
§cell_idÙ$65d2add6-fd6f-456c-92ed-3cd9d1862ef6´downstream_cells_map¼update_binary_policy_params!²upstream_cells_mapŠ¤Real©eachindex¡-®BinaryFeatures‘Ù$da2d3186-a778-41cc-9b49-759bf1e9b8fa¦Matrix¡+©@inbounds¡*§Integer¦VectorÙ$f55afa58-962d-4551-8d95-a5b467d61adf„´precedence_heuristic §cell_idÙ$f55afa58-962d-4551-8d95-a5b467d61adf´downstream_cells_map¼update_params_with_gradient!—Ù$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$4d4ae57b-afc3-44f9-b6fc-892f59f82921Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$5b868eba-c1af-49f6-8f93-79b78c319a6fÙ$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90f²upstream_cells_mapÞ¤zeroÙ'BinarySquashedGaussianEligibilityVector‘Ù$76fd79a2-2bc8-45f8-a243-48415118898a¦isless§digamma©@inbounds£one§nothing¦Vector¡<¯Base.simd_index¤Real¥@simd¡/¡^¦NTuple¦Matrix¡:³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41®julia.simdloop¤size¤BaseµBase.simd_outer_range¡-£log¥atanh»BinaryBetaEligibilityVector‘Ù$54fff14b-cf53-47b0-9cfa-8b9ee33df54e¶Base.simd_inner_length¡+¡*¿BinaryGaussianEligibilityVector‘Ù$10cdd16e-a337-4421-a7a0-6de4e4b60c0fÙ$d9d11d69-bc16-400a-8f46-f9a8ecb8516a„´precedence_heuristic §cell_idÙ$d9d11d69-bc16-400a-8f46-f9a8ecb8516a´downstream_cells_mapÙ,actor_critic_binary_episodic_parameter_study’Ù$bc8a399b-8864-4473-89d2-e3b0a03d15b5Ù$8eb42403-1234-4e59-993e-057cc3a6d5c9²upstream_cells_map‹«@NamedTuple¡:§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¤Real¥Int64¨Function¤Base¡-¡^¡+Ù$ed93259c-7b8b-46d7-97fb-f194e0e04b3a„´precedence_heuristic 
§cell_idÙ$ed93259c-7b8b-46d7-97fb-f194e0e04b3a´downstream_cells_mapÙ"setup_binary_beta_policy_arguments‘Ù$3e3c5897-809f-46e3-bb58-f115b082443e²upstream_cells_map‹³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¤rand§IntegerContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¤Real¨Function»BinaryBetaEligibilityVector‘Ù$54fff14b-cf53-47b0-9cfa-8b9ee33df54e½update_binary_feature_vector!‘Ù$8eab55a5-41b7-4f5e-a02f-4c19388bc9ea¸make_n_param_dist_params‘Ù$76eb6743-cac0-4174-9ba3-a0691c200b54¥Union¦NTupleÙ$d1ed25e6-60c6-411f-a541-99986e5da2c5„´precedence_heuristic §cell_idÙ$d1ed25e6-60c6-411f-a541-99986e5da2c5´downstream_cells_mapÙ;reinforce_with_baseline_monte_carlo_control_linear_features’Ù$cacaaca6-6e01-464f-a2ee-cbf62737a426Ù$a12b92d1-e045-4f92-b8cd-eee5d56fa67d²upstream_cells_mapŽÙ,reinforce_with_baseline_monte_carlo_control!’Ù$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$5b868eba-c1af-49f6-8f93-79b78c319a6f¥zeros§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84½update_linear_value_gradient!‘Ù$1753b5ed-c00b-4b60-b492-822180778e8c¤copy¦VectorÙ!update_linear_eligibility_vector!‘Ù$4634267b-5dea-4164-8bb2-1eb2fd4d7954¤Real¨FunctionÙ!update_linear_action_preferences!‘Ù$581f7e9b-a5c2-4841-9605-85f9585b0274¦Matrix¦lengthµlinear_value_function‘Ù$0bf3b988-b3fb-49d5-8dde-b25766596363Ù$b966b248-fb4d-457d-90f6-114370846242„´precedence_heuristic §cell_idÙ$b966b248-fb4d-457d-90f6-114370846242´downstream_cells_mapµbad_continuous_action“Ù$f946c886-6246-4f98-a96f-f06984691ad8Ù$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90f²upstream_cells_map„¥isnan£any¦NTuple¤RealÙ$4156d955-9daf-4429-b152-e8332980fb9e„´precedence_heuristic 
§cell_idÙ$4156d955-9daf-4429-b152-e8332980fb9e´downstream_cells_mapÙ&mountaincar_continuous_test_train_beta“Ù$d82e7ab8-c372-4462-afb5-1617560cdb56Ù$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8fÙ$16113560-e911-47b4-abc4-641bbd246454²upstream_cells_map…ÙAactor_critic_with_eligibility_traces_binary_features_beta_actions‘Ù$3e3c5897-809f-46e3-bb58-f115b082443e¥Int64¼mountaincar_tilecoding_setup‘Ù$7c592385-e8d3-4efe-962c-d39debb64405¿mountaincar_continuous_beta_mdp‘Ù$8e096fae-9941-49d8-ae87-c68b02f68da5§typemaxÙ$b09e1e48-494e-4967-826a-6e70199acad4„´precedence_heuristic §cell_idÙ$b09e1e48-494e-4967-826a-6e70199acad4´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$734573e5-547b-4dcc-89bb-412aa6cc42d6„´precedence_heuristic §cell_idÙ$734573e5-547b-4dcc-89bb-412aa6cc42d6´downstream_cells_mapÙ#actor_critic_linear_parameter_study“Ù$7afb6fb0-248a-4518-b94f-9876f81eca64Ù$1b102220-6d78-480d-a77f-0e57bad23dcaÙ$d57375a5-b9e0-4742-b5f7-6a7da891604a²upstream_cells_mapÞ®AbstractVector¥zeros¤rand§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦UInt64¤Real¨Function¦length§scatter¤plotÙ4actor_critic_with_eligibility_traces_binary_features‘Ù$05bfd818-bf4e-4bda-baa9-5ba647867097¦Matrix·average_continuing_runs‘Ù$ba642a22-6623-482a-ab4a-81585b83e457¦LayoutÙ4actor_critic_with_eligibility_traces_linear_features‘Ù$68806899-9972-460a-9f11-daa708a9d610Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54„´precedence_heuristic 
§cell_idÙ$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54´downstream_cells_mapÙ*actor_critic_with_eligibility_traces_fcann˜Ù$f8614042-7c94-4d47-a1b6-4e96676b4e8bÙ$8bc280db-e57d-4e40-be46-1790f4f7d9e7Ù$11063fff-4d36-46d5-828f-dbed0f46b9cfÙ$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27Ù$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9Ù$407a0724-4bb6-4c83-ab2d-17a0e19c4072Ù$5ee4ce72-7740-4297-8d84-619e0708e4acÙ$82e0e9a0-9662-429a-87e3-e6bdae02709a²upstream_cells_map¥FCANN‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84«FCANNParams§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84»FCANN.initializeparams_saxe¥Int64¦Vector¤Real¨FunctionÙ&setup_fcann_policy_and_value_arguments‘Ù$e1aec891-d95a-47d1-97d7-d2a4cfb16e64¦length¤fillÙ%actor_critic_with_eligibility_traces!”Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90fÙ$692c1043-4eaf-491e-b8fe-368618867f99„´precedence_heuristic §cell_idÙ$692c1043-4eaf-491e-b8fe-368618867f99´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$2c5d221a-2469-49e1-9249-dfdc2457f2fa„´precedence_heuristic §cell_idÙ$2c5d221a-2469-49e1-9249-dfdc2457f2fa´downstream_cells_mapÙ+start_cartpole_continuing_fcann_param_study‘Ù$50ae94c4-70f3-4215-82bd-eb2227c2badf²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunnerCounterButton¯Core.applicable¥@bind¨Base.getÙ$7c592385-e8d3-4efe-962c-d39debb64405„´precedence_heuristic 
§cell_idÙ$7c592385-e8d3-4efe-962c-d39debb64405´downstream_cells_map¼mountaincar_tilecoding_setupšÙ$d57375a5-b9e0-4742-b5f7-6a7da891604aÙ$b02ba928-5b9f-4695-b980-07988c788bb9Ù$8eb42403-1234-4e59-993e-057cc3a6d5c9Ù$6d0925d3-af96-4b94-8e2e-4941cce39e51Ù$b53dba81-a9e9-41da-8fc2-7736bf25f2dcÙ$b8532822-179b-4cd5-a279-4b71dafb544aÙ$fee14dfe-c5ca-4126-a830-cc9d7eda5433Ù$0d93132d-5819-47dc-8cf2-462d480d9c3dÙ$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7dÙ$4156d955-9daf-4429-b152-e8332980fb9e²upstream_cells_mapƒ±tile_coding_setup´mountaincar_min_vals‘Ù$2025ff38-f2ec-4224-b771-ff72ffe1af28´mountaincar_max_vals‘Ù$77906355-08f8-4b08-b051-84697199b519Ù$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb„´precedence_heuristic §cell_idÙ$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb´downstream_cells_map¤@htlšÙ$16fcc2d0-9f2f-4226-9dcc-6d86248cab26Ù$cc80848a-6834-4272-9152-e17b45448814Ù$a8b40b8f-051a-4e6f-a079-ece4f32873deÙ$602a07dd-8928-4b44-97e5-01c5cbf38351Ù$63fbf8f4-e4e2-4893-be09-67450e92dbd7Ù$f9facbba-39d4-483e-9066-275603156db0Ù$bbc8864a-1545-433f-bc7c-0ddf6e907138Ù$68469a40-7976-48b7-b7a1-eaa4c5f33a18Ù$ba645f6b-143f-4e83-9003-707770ae308dÙ$b5319d8b-0420-4ebf-b603-ea0b93365ac1²upstream_cells_map€Ù$8eab55a5-41b7-4f5e-a02f-4c19388bc9ea„´precedence_heuristic §cell_idÙ$8eab55a5-41b7-4f5e-a02f-4c19388bc9ea´downstream_cells_map½update_binary_feature_vector!”Ù$96506201-6b66-49e6-8179-06952e2394e1Ù$ba5d6311-daee-4abc-b2fb-fae2184ef3ebÙ$ed93259c-7b8b-46d7-97fb-f194e0e04b3aÙ$4e29c621-223e-4859-8e96-db04b967815a²upstream_cells_map‰¦length¡<¥push!©enumerate¡>¦isless¡+³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¨FunctionÙ$0ac7ea44-14f6-4e80-80f9-d6df8059bb38„´precedence_heuristic 
§cell_idÙ$0ac7ea44-14f6-4e80-80f9-d6df8059bb38´downstream_cells_map¾reinforce_monte_carlo_control!“Ù$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290Ù$8e39bd15-862e-4941-88f9-2794b861a523Ù$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091²upstream_cells_mapŠ¤Real§nothing¨FunctionÙ,reinforce_with_baseline_monte_carlo_control!’Ù$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$5b868eba-c1af-49f6-8f93-79b78c319a6f§Returns¤zero¡/£one§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$5ffc271f-c73f-494a-9727-8d7516af2191„´precedence_heuristic §cell_idÙ$5ffc271f-c73f-494a-9727-8d7516af2191´downstream_cells_mapÙ&cartpole_continuing_fcann_study_params‘Ù$50ae94c4-70f3-4215-82bd-eb2227c2badf²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunnerÙ(create_actor_critic_continuing_params_UI‘Ù$5b15d91e-7119-4f85-a54a-7d4f1fdaf097¯Core.applicable¥@bind¨Base.getÙ$c5a2879c-e89b-47f7-bbd6-48200d7e89e3„´precedence_heuristic §cell_idÙ$c5a2879c-e89b-47f7-bbd6-48200d7e89e3´downstream_cells_mapÙ>actor_critic_binary_episodic_squashed_gaussian_parameter_study‘Ù$0d93132d-5819-47dc-8cf2-462d480d9c3d²upstream_cells_map…¤Real¨Function£one§IntegerContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295Ù$537270ba-122b-4f2b-880b-31d086766295„´precedence_heuristic 
§cell_idÙ$537270ba-122b-4f2b-880b-31d086766295´downstream_cells_mapContinuousMDPÜÙ$f946c886-6246-4f98-a96f-f06984691ad8Ù$5b868eba-c1af-49f6-8f93-79b78c319a6fÙ$ba5d6311-daee-4abc-b2fb-fae2184ef3ebÙ$ed93259c-7b8b-46d7-97fb-f194e0e04b3aÙ$4e29c621-223e-4859-8e96-db04b967815aÙ$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00Ù$d5020a8d-1dd7-403c-9d1f-665b95543943Ù$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90fÙ$20776e09-7d9b-4db8-a060-7bceeec65b47Ù$3e3c5897-809f-46e3-bb58-f115b082443eÙ$05f120be-9695-4824-82fd-142a0df13098Ù$717e4c69-59d5-4929-923f-dd35a97fb160Ù$55ba8725-0ddf-4196-a41d-3f3c490a8d84Ù$61949faa-8174-4b7b-8fbc-01d5f850b419Ù$dd8e8cd2-7b41-46c4-8530-adefb7aea684Ù$08505e88-9c23-4e95-91e3-d18bf5133dbcÙ$87482ea5-5265-4e02-92c0-1a8bb44ff0f4Ù$13ebc12f-ff6f-4266-88d3-28d6df5fcf59Ù$3d065608-eef2-4caa-b17d-ec60714e3d58Ù$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19fÙ$c5a2879c-e89b-47f7-bbd6-48200d7e89e3Ù$65be0e58-24be-4932-92a9-9e4825b14144Ù$3c316495-bb6c-41e2-a38f-ba867a319fbbÙ$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbfÙ$d2729657-d0bf-4d39-8ec7-f242a1ad48d6²upstream_cells_mapˆ¨Function£new¾ContinuousMDPTransitionSampler‘Ù$c8b47eac-2d45-419a-bec6-2ae0cdc59393§Returns¼AbstractContinuousTransition‘Ù$c8b47eac-2d45-419a-bec6-2ae0cdc59393«AbstractMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¡F¤RealÙ$dc2efc6c-8da8-425b-aa5f-290949109565„´precedence_heuristic §cell_idÙ$dc2efc6c-8da8-425b-aa5f-290949109565´downstream_cells_map€²upstream_cells_map‚¾plot_mountaincar_policy_values‘Ù$bbc8864a-1545-433f-bc7c-0ddf6e907138¶mountaincar_test_train‘Ù$6d0925d3-af96-4b94-8e2e-4941cce39e51Ù$a019925a-460a-410e-a54b-50a4cfe0e90e„´precedence_heuristic §cell_idÙ$a019925a-460a-410e-a54b-50a4cfe0e90e´downstream_cells_map€²upstream_cells_map†¡-§scatter¤plot¨LinRange¦Layoutºget_corridor_episode_stats’Ù$fb8904a9-ae64-41cc-93b6-5a25855edad0Ù$cecc2a35-3850-4f66-84e8-e29da4f3d4b0Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41„´precedence_heuristic 
§cell_idÙ$f92bb265-4b19-4f0e-a698-d7547bb6dd41´downstream_cells_map³BinaryFeatureVectorÜÙ$b0a66a19-ee76-463b-a704-8fcee85444d0Ù$8eab55a5-41b7-4f5e-a02f-4c19388bc9eaÙ$a361f4c9-47ce-42ad-899c-87b611c0d471Ù$41dc149d-c6f3-4b0d-a856-06f3aaae3049Ù$042fbafe-2401-4fb7-ac13-4531e0782c79Ù$96506201-6b66-49e6-8179-06952e2394e1Ù$03a218cb-aa83-4000-85b5-c6f247087053Ù$a893a87b-2d07-4db5-9d1a-9da8646216f4Ù$a540814a-57a1-4b98-9443-59e401425444Ù$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbÙ$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2Ù$25be5dcf-be63-46c4-b6de-6cf79fa28fd0Ù$05bfd818-bf4e-4bda-baa9-5ba647867097Ù$10cdd16e-a337-4421-a7a0-6de4e4b60c0fÙ$54fff14b-cf53-47b0-9cfa-8b9ee33df54eÙ$76fd79a2-2bc8-45f8-a243-48415118898aÙ$f55afa58-962d-4551-8d95-a5b467d61adfÙ$740a3f41-9302-481d-b373-762c0dea8effÙ$d41f1dd1-45fe-4456-9a01-ed47fd6704a7Ù$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690cÙ$ba5d6311-daee-4abc-b2fb-fae2184ef3ebÙ$ed93259c-7b8b-46d7-97fb-f194e0e04b3aÙ$4e29c621-223e-4859-8e96-db04b967815aÙ$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00Ù$056a8adc-92f4-4b33-90d9-4b3b4026bbbcÙ$20776e09-7d9b-4db8-a060-7bceeec65b47Ù$3e3c5897-809f-46e3-bb58-f115b082443eÙ$05f120be-9695-4824-82fd-142a0df13098²upstream_cells_map„¥Int64£new§Integer¦VectorÙ$ac9c8845-284d-4c21-b05d-d930f86598a3„´precedence_heuristic §cell_idÙ$ac9c8845-284d-4c21-b05d-d930f86598a3´downstream_cells_mapÙ7run_mountaincar_binary_episodic_countinuous_param_study‘Ù$b53dba81-a9e9-41da-8fc2-7736bf25f2dc²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunnerCounterButton¯Core.applicable¥@bind¨Base.getÙ$192cc1cf-9ea1-492d-baa7-f2e197abecd4„´precedence_heuristic §cell_idÙ$192cc1cf-9ea1-492d-baa7-f2e197abecd4´downstream_cells_mapÙ+run_mountaincar_binary_episodic_param_study‘Ù$8eb42403-1234-4e59-993e-057cc3a6d5c9²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunnerCounterButton¯Core.applicable¥@bind¨Base.getÙ$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547d„´precedence_heuristic 
§cell_idÙ$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547d´downstream_cells_map§ep_step’Ù$374af774-3a97-49b5-a3bb-bc3f7f63a3faÙ$af144759-fe66-4ad0-b378-e9eb4e859db4²upstream_cells_map‹¤Core¡:¨Base.get¥@bind¦Slider¦length¢ep‘Ù$e1274f57-75cb-4659-a82f-e5870c5367e2¤Base«PlutoRunner·PlutoRunner.create_bond¯Core.applicableÙ$c8b47eac-2d45-419a-bec6-2ae0cdc59393„´precedence_heuristic §cell_idÙ$c8b47eac-2d45-419a-bec6-2ae0cdc59393´downstream_cells_map‚¾ContinuousMDPTransitionSampler“Ù$c8b47eac-2d45-419a-bec6-2ae0cdc59393Ù$537270ba-122b-4f2b-880b-31d086766295Ù$3c316495-bb6c-41e2-a38f-ba867a319fbb¼AbstractContinuousTransition’Ù$c8b47eac-2d45-419a-bec6-2ae0cdc59393Ù$537270ba-122b-4f2b-880b-31d086766295²upstream_cells_mapÞ²AbstractTransition‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84ºMain.Base.inferencebarrier§@assert¼AbstractContinuousTransition‘Ù$c8b47eac-2d45-419a-bec6-2ae0cdc59393¦typeof¬promote_type£Any¨Function¤Real¤Main£new¾ContinuousMDPTransitionSampler‘Ù$c8b47eac-2d45-419a-bec6-2ae0cdc59393¥throw®AssertionError¢!=¢==Ù$36a6e43f-6bcf-4c27-bfbb-047760e77ada„´precedence_heuristic §cell_idÙ$36a6e43f-6bcf-4c27-bfbb-047760e77ada´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$436c52d2-280b-4ca4-9360-d6587b8254c7„´precedence_heuristic §cell_idÙ$436c52d2-280b-4ca4-9360-d6587b8254c7´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$e96d592d-1e54-486d-8ad9-b857f85476e8„´precedence_heuristic §cell_idÙ$e96d592d-1e54-486d-8ad9-b857f85476e8´downstream_cells_mapÙ#actor_critic_linear_parameter_study“Ù$7afb6fb0-248a-4518-b94f-9876f81eca64Ù$1b102220-6d78-480d-a77f-0e57bad23dcaÙ$d57375a5-b9e0-4742-b5f7-6a7da891604a²upstream_cells_map‹«@NamedTuple¡:§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¤Real¥Int64¨Function¤Base¡-¡^¡+Ù$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6c„´precedence_heuristic 
§cell_idÙ$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6c´downstream_cells_map€²upstream_cells_mapƒ¡:ºcorridor_parameter_studies’Ù$e5c1aca8-7575-4835-8273-e69ca0a55fe8Ù$646bc853-b7fc-49fa-a201-ff98e8f952d4¡^Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90f„´precedence_heuristic §cell_idÙ$4da20fd7-b897-4f26-bf2a-f08d66ddf90f´downstream_cells_mapÙ%actor_critic_with_eligibility_traces!–Ù$05bfd818-bf4e-4bda-baa9-5ba647867097Ù$68806899-9972-460a-9f11-daa708a9d610Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54Ù$20776e09-7d9b-4db8-a060-7bceeec65b47Ù$3e3c5897-809f-46e3-bb58-f115b082443eÙ$05f120be-9695-4824-82fd-142a0df13098²upstream_cells_mapÞ"¤zero§typemin¬zero_params!‘Ù$e6cf9550-2e69-4b82-92cf-5e07a35490aa¼update_traces_with_gradient!’Ù$25be5dcf-be63-46c4-b6de-6cf79fa28fd0Ù$056a8adc-92f4-4b33-90d9-4b3b4026bbbc£oneContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295²Base.CoreLogging.!¦Vector¤RealÙ'Base.CoreLogging.Base.fixup_stdlib_pathepisode_steps¨deepcopy¡/¥@info±Base.invokelatest¼update_params_with_gradient!“Ù$b0a66a19-ee76-463b-a704-8fcee85444d0Ù$a893a87b-2d07-4db5-9d1a-9da8646216f4Ù$f55afa58-962d-4551-8d95-a5b467d61adf½Base.CoreLogging.invokelatest´Base.CoreLogging.===¥errorÙ&form_state_and_policy_function_outputs’Ù$e7e49ff8-32df-48a4-afb2-462859592e92Ù$11b9beea-b0cd-45eb-84c6-151728894df0º#___this_pluto_module_name§Integer¨Function¤Base¢<=¢Î³¥push!´Base.CoreLogging.isa¡-µbad_continuous_action‘Ù$b966b248-fb4d-457d-90f6-114370846242¡+¡*³Base.CoreLogging.>=¯episode_rewardsÙ$11ea640c-3981-404d-87c6-4d3d0708a2b8„´precedence_heuristic 
§cell_idÙ$11ea640c-3981-404d-87c6-4d3d0708a2b8´downstream_cells_mapÙ,actor_critic_linear_episodic_parameter_study²upstream_cells_mapÞ£sum¦Random‘Ù$df7f84e8-b42a-4001-9dbf-6bc3ced94207¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦length¤Real§scatter¡/§isemptyÙ4actor_critic_with_eligibility_traces_linear_features‘Ù$68806899-9972-460a-9f11-daa708a9d610¡:®AbstractVector¢|>£Inf¤rand§Integer¨Function¦UInt64¡-¤plot¦foldxt¡+£Map¦Layout¬Random.seed!Ù$281360af-46bf-4c73-bf11-3cb1153ad3e2„´precedence_heuristic §cell_idÙ$281360af-46bf-4c73-bf11-3cb1153ad3e2´downstream_cells_map€²upstream_cells_map€Ù$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690c„´precedence_heuristic §cell_idÙ$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690c´downstream_cells_mapÙ,update_squashed_gaussian_eligibility_vector!‘Ù$05f120be-9695-4824-82fd-142a0df13098²upstream_cells_mapŒ£exp¡:Ù'BinarySquashedGaussianEligibilityVector‘Ù$76fd79a2-2bc8-45f8-a243-48415118898a¡k¥first³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¦Vector¤Real¦Matrix¡+¦NTuple¤lastÙ$da3cb392-78f2-48b2-b0dc-5f016664798c„´precedence_heuristic §cell_idÙ$da3cb392-78f2-48b2-b0dc-5f016664798c´downstream_cells_map€²upstream_cells_map‚Ù mountaincar_continuing_tile_test‘Ù$b02ba928-5b9f-4695-b980-07988c788bb9»show_mountaincar_trajectory‘Ù$ba645f6b-143f-4e83-9003-707770ae308dÙ$dca2f8e2-76af-4679-bf81-3824c15fc76d„´precedence_heuristic §cell_idÙ$dca2f8e2-76af-4679-bf81-3824c15fc76d´downstream_cells_map¯reinforce_test3‘Ù$11a55af7-5301-4507-bb26-88e1e11236db²upstream_cells_map…¥Int64®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6¡^Ù4actor_critic_with_eligibility_traces_binary_features‘Ù$05bfd818-bf4e-4bda-baa9-5ba647867097§typemaxÙ$8019bec9-1228-407b-9199-2fe29f26a981„´precedence_heuristic §cell_idÙ$8019bec9-1228-407b-9199-2fe29f26a981´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$fd964539-2baf-4ff1-b286-5a0bb1b222c4„´precedence_heuristic 
§cell_idÙ$fd964539-2baf-4ff1-b286-5a0bb1b222c4´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$5720e942-d3f8-4329-83a8-8bcedf078b6a„´precedence_heuristic §cell_idÙ$5720e942-d3f8-4329-83a8-8bcedf078b6a´downstream_cells_map€²upstream_cells_map„¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeÙ-reinforce_monte_carlo_control_linear_features‘Ù$8e39bd15-862e-4941-88f9-2794b861a523¡^¹update_corridor_features!‘Ù$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811Ù$62e677ac-2070-4f6b-9df2-90849d89fa9f„´precedence_heuristic §cell_idÙ$62e677ac-2070-4f6b-9df2-90849d89fa9f´downstream_cells_map¿corridor_terminal_probabilities²upstream_cells_mapƒµcorridor_state_counts‘Ù$54f559b6-8a62-4a42-894d-c56e41d5ebef£sum¡-Ù$11b9beea-b0cd-45eb-84c6-151728894df0„´precedence_heuristic §cell_idÙ$11b9beea-b0cd-45eb-84c6-151728894df0´downstream_cells_mapÙ&form_state_and_policy_function_outputs•Ù$4d4ae57b-afc3-44f9-b6fc-892f59f82921Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90f²upstream_cells_map‡¦VectorÙ%form_state_continuous_policy_function‘Ù$f545c800-0bf3-491f-9d7d-42341cfdb573¤Real¨deepcopy¹form_state_value_function‘Ù$e7566274-5518-4e28-8738-d4b1747d0cfb¨Function¤copyÙ$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290„´precedence_heuristic 
§cell_idÙ$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290´downstream_cells_mapÙ-reinforce_monte_carlo_control_binary_features”Ù$d037ea92-915c-4dc7-97c6-d006d92e088aÙ$f2ed56c9-c2b7-42cb-a083-e12aeaa126efÙ$83ca0577-15d7-4448-b597-c77810b812bfÙ$e5c1aca8-7575-4835-8273-e69ca0a55fe8²upstream_cells_map‹½setup_binary_policy_arguments‘Ù$96506201-6b66-49e6-8179-06952e2394e1¥zeros§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¤Real¨Function¦lengthÙ!update_binary_eligibility_vector!‘Ù$042fbafe-2401-4fb7-ac13-4531e0782c79¦MatrixÙ!update_binary_action_preferences!‘Ù$a361f4c9-47ce-42ad-899c-87b611c0d471¾reinforce_monte_carlo_control!‘Ù$0ac7ea44-14f6-4e80-80f9-d6df8059bb38Ù$55ba8725-0ddf-4196-a41d-3f3c490a8d84„´precedence_heuristic §cell_idÙ$55ba8725-0ddf-4196-a41d-3f3c490a8d84´downstream_cells_mapÙ5actor_critic_binary_episodic_gaussian_parameter_study‘Ù$b53dba81-a9e9-41da-8fc2-7736bf25f2dc²upstream_cells_mapÞ¦Random‘Ù$df7f84e8-b42a-4001-9dbf-6bc3ced94207ContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¤copy¦Vector¤Real§scatter¡/¦Matrix§isempty¤mean¡:®AbstractVector¢|>£Inf¿make_n_param_dist_policy_params‘Ù$ba41f521-4ee2-42a6-bf18-078bfa4b875e¥zeros¤rand§Integer¨Function¦UInt64¡-¤plot¦foldxt¡+£MapÙEactor_critic_with_eligibility_traces_binary_features_gaussian_actions‘Ù$20776e09-7d9b-4db8-a060-7bceeec65b47¦Layout¬Random.seed!Ù$a540814a-57a1-4b98-9443-59e401425444„´precedence_heuristic 
§cell_idÙ$a540814a-57a1-4b98-9443-59e401425444´downstream_cells_mapµbinary_value_function—Ù$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbÙ$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2Ù$05bfd818-bf4e-4bda-baa9-5ba647867097Ù$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00Ù$20776e09-7d9b-4db8-a060-7bceeec65b47Ù$3e3c5897-809f-46e3-bb58-f115b082443eÙ$05f120be-9695-4824-82fd-142a0df13098²upstream_cells_mapÞ¤zero¡:¦isless³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41®julia.simdloop©@inbounds§nothing¦Vector¡<µBase.simd_outer_range¤Real¤Base¶Base.simd_inner_length¥@simd¯Base.simd_index¡+Ù$1b102220-6d78-480d-a77f-0e57bad23dca„´precedence_heuristic §cell_idÙ$1b102220-6d78-480d-a77f-0e57bad23dca´downstream_cells_mapÙ*cartpole_binary_continuing_parameter_study‘Ù$b2539398-fdbc-42a2-a8f3-d327358f3643²upstream_cells_map„¹cartpole_tilecoding_setup‘Ù$de3cba34-9842-44d1-9b79-47126c0a0751Ù-cartpole_tilecoding_setup.get_active_features·cartpole_continuing_mdp‘Ù$4c4e643b-d4b9-44f0-8d30-dc521bcc55acÙ#actor_critic_linear_parameter_study“Ù$734573e5-547b-4dcc-89bb-412aa6cc42d6Ù$e96d592d-1e54-486d-8ad9-b857f85476e8Ù$ff4f977e-48df-4c12-845c-c245b4d39d6dÙ$4d4ae57b-afc3-44f9-b6fc-892f59f82921„´precedence_heuristic §cell_idÙ$4d4ae57b-afc3-44f9-b6fc-892f59f82921´downstream_cells_map¶one_step_actor_critic!“Ù$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2Ù$57e5e12a-b722-4ea3-ab3b-e5711029e640Ù$57bbdb10-bed8-459d-8f67-9ea637cf12ba²upstream_cells_mapÞ¤zero¥zerosÙ&form_state_and_policy_function_outputs’Ù$e7e49ff8-32df-48a4-afb2-462859592e92Ù$11b9beea-b0cd-45eb-84c6-151728894df0©soft_max!‘Ù$33c99850-67cd-4754-94b9-6df97b238e27£one§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦Vector¢<=sample_action‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¥Int64¡-¥push!¤Real¡/¨Function¡+¡*¦length¼update_params_with_gradient!“Ù$b0a66a19-ee76-463b-a704-8fcee85444d0Ù$a893a87b-2d07-4db5-9d1a-9da8646216f4Ù$f55afa58-962d-4551-8d95-a5b467d61adfÙ$61949faa-8174-4b7b-8fbc-01d5f850b419„´precedence_heuristic 
§cell_idÙ$61949faa-8174-4b7b-8fbc-01d5f850b419´downstream_cells_mapÙ7actor_critic_binary_continuing_gaussian_parameter_study²upstream_cells_mapÞ¦Random‘Ù$df7f84e8-b42a-4001-9dbf-6bc3ced94207ContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¤copy¦Vector¤Real§scatter¡/¦Matrix¡:®AbstractVector¢|>¿make_n_param_dist_policy_params‘Ù$ba41f521-4ee2-42a6-bf18-078bfa4b875e¥zeros¤rand§Integer¨Function¦UInt64¤plot¦foldxt¡+£MapÙEactor_critic_with_eligibility_traces_binary_features_gaussian_actions‘Ù$20776e09-7d9b-4db8-a060-7bceeec65b47¦Layout¬Random.seed!Ù$5b15f5c9-80bf-47f0-898a-f8dead5b927c„´precedence_heuristic §cell_idÙ$5b15f5c9-80bf-47f0-898a-f8dead5b927c´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$266d2234-26c8-43f1-9e75-49440a230ed6„´precedence_heuristic §cell_idÙ$266d2234-26c8-43f1-9e75-49440a230ed6´downstream_cells_mapÙ%actor_critic_with_eligibility_traces!–Ù$05bfd818-bf4e-4bda-baa9-5ba647867097Ù$68806899-9972-460a-9f11-daa708a9d610Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54Ù$20776e09-7d9b-4db8-a060-7bceeec65b47Ù$3e3c5897-809f-46e3-bb58-f115b082443eÙ$05f120be-9695-4824-82fd-142a0df13098²upstream_cells_mapÞ¤zero¬zero_params!‘Ù$e6cf9550-2e69-4b82-92cf-5e07a35490aa¼update_traces_with_gradient!’Ù$25be5dcf-be63-46c4-b6de-6cf79fa28fd0Ù$056a8adc-92f4-4b33-90d9-4b3b4026bbbc£one¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦lengthsample_action‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦Vector¤Real¨deepcopy¡/¼update_params_with_gradient!“Ù$b0a66a19-ee76-463b-a704-8fcee85444d0Ù$a893a87b-2d07-4db5-9d1a-9da8646216f4Ù$f55afa58-962d-4551-8d95-a5b467d61adf¥zeros©soft_max!‘Ù$33c99850-67cd-4754-94b9-6df97b238e27Ù&form_state_and_policy_function_outputs’Ù$e7e49ff8-32df-48a4-afb2-462859592e92Ù$11b9beea-b0cd-45eb-84c6-151728894df0§Integer¨Function¢<=¥Int64¥push!¡-¡+¡*Ù$aa69e4ea-91e0-496a-a7be-529e67f4dbec„´precedence_heuristic 
§cell_idÙ$aa69e4ea-91e0-496a-a7be-529e67f4dbec´downstream_cells_map€²upstream_cells_map„Ù1reinforce_with_baseline_monte_carlo_control_fcann‘Ù$697b2310-9d96-4f7f-be62-c3bd6bf736f3¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabe¡^¹update_corridor_features!‘Ù$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811Ù$10ee7709-0816-48d2-abe0-9be3dd04700f„´precedence_heuristic §cell_idÙ$10ee7709-0816-48d2-abe0-9be3dd04700f´downstream_cells_map€²upstream_cells_map‚¼plot_continuing_step_rewards‘Ù$0964133c-3a5b-433b-a8c4-a97813c37583Ù!mountaincar_continuing_fcann_test‘Ù$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9Ù$7d94922e-dc9f-4953-b539-24aaa2c85b12„´precedence_heuristic §cell_idÙ$7d94922e-dc9f-4953-b539-24aaa2c85b12´downstream_cells_map·continuing_study_params‘Ù$42775fd1-5b27-48e0-abf1-9b22bb775e6d²upstream_cells_mapˆ¤Core¤Base·PlutoRunner.create_bond«PlutoRunnerÙ(create_actor_critic_continuing_params_UI‘Ù$5b15d91e-7119-4f85-a54a-7d4f1fdaf097¯Core.applicable¥@bind¨Base.getÙ$df7f84e8-b42a-4001-9dbf-6bc3ced94207„´precedence_heuristic§cell_idÙ$df7f84e8-b42a-4001-9dbf-6bc3ced94207´downstream_cells_map‰ªStatistics¬StaticArrays©StatsBase«TransducersLinearAlgebraDistributions®PlutoDevMacros¦RandomžÙ$d037ea92-915c-4dc7-97c6-d006d92e088aÙ$83ca0577-15d7-4448-b597-c77810b812bfÙ$e5c1aca8-7575-4835-8273-e69ca0a55fe8Ù$1f041cb3-618c-4380-a1ec-d7bbe4a80f62Ù$11ea640c-3981-404d-87c6-4d3d0708a2b8Ù$f8614042-7c94-4d47-a1b6-4e96676b4e8bÙ$ba642a22-6623-482a-ab4a-81585b83e457Ù$8bc280db-e57d-4e40-be46-1790f4f7d9e7Ù$11063fff-4d36-46d5-828f-dbed0f46b9cfÙ$55ba8725-0ddf-4196-a41d-3f3c490a8d84Ù$61949faa-8174-4b7b-8fbc-01d5f850b419Ù$dd8e8cd2-7b41-46c4-8530-adefb7aea684Ù$08505e88-9c23-4e95-91e3-d18bf5133dbcÙ$87482ea5-5265-4e02-92c0-1a8bb44ff0f4§Threads²upstream_cells_map€Ù$352d2952-cb83-47d3-9078-2b2ef9927443„´precedence_heuristic 
§cell_idÙ$352d2952-cb83-47d3-9078-2b2ef9927443´downstream_cells_map¹create_cartpole_functions‘Ù$f27f2bcd-05b6-44fe-bf9e-a3e51556db7c²upstream_cells_mapCartPoleState¤zero§deg2rad¡>¦isless¤rand¹cartpole_runge_kutta_step¤Real¨Function¡<¡-§Float32¥clamp¯CartPoleVehicle£absÙ$0964133c-3a5b-433b-a8c4-a97813c37583„´precedence_heuristic §cell_idÙ$0964133c-3a5b-433b-a8c4-a97813c37583´downstream_cells_map¼plot_continuing_step_rewards”Ù$645e93e7-e92e-49c4-9757-8294fabf4e9bÙ$04b5929a-2058-49c9-963a-96c752a1d67dÙ$98222fcd-b456-477c-90dd-844df36877e5Ù$10ee7709-0816-48d2-abe0-9be3dd04700f²upstream_cells_mapŒ¡:¨LinRange¦cumsum¦Vector¤Real¥Int64¦length§scatter¤plot¡/¦Layout¥roundÙ$349631b2-4686-49a9-9f3a-1e4ad588b568„´precedence_heuristic §cell_idÙ$349631b2-4686-49a9-9f3a-1e4ad588b568´downstream_cells_map»mountaincar_continuous_mdp2’Ù$fee14dfe-c5ca-4126-a830-cc9d7eda5433Ù$cd9c9eeb-c90d-4499-9503-7773d5250f47²upstream_cells_mapÙ$create_continuous_action_mountaincar‘Ù$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbfÙ$8544eddb-2095-4a3c-82e0-920123a88e6d„´precedence_heuristic §cell_idÙ$8544eddb-2095-4a3c-82e0-920123a88e6d´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$31f7e903-30b6-4193-9174-88093e004de4„´precedence_heuristic §cell_idÙ$31f7e903-30b6-4193-9174-88093e004de4´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$fee14dfe-c5ca-4126-a830-cc9d7eda5433„´precedence_heuristic §cell_idÙ$fee14dfe-c5ca-4126-a830-cc9d7eda5433´downstream_cells_mapÙ"mountaincar_continuous_test_train2’Ù$cd9c9eeb-c90d-4499-9503-7773d5250f47Ù$b695ef21-a1ac-4d1f-a0e1-71cd81cede18²upstream_cells_map…¥Int64»mountaincar_continuous_mdp2‘Ù$349631b2-4686-49a9-9f3a-1e4ad588b568¼mountaincar_tilecoding_setup‘Ù$7c592385-e8d3-4efe-962c-d39debb64405ÙEactor_critic_with_eligibility_traces_binary_features_gaussian_actions‘Ù$20776e09-7d9b-4db8-a060-7bceeec65b47§typemaxÙ$b53dba81-a9e9-41da-8fc2-7736bf25f2dc„´precedence_heuristic 
§cell_idÙ$b53dba81-a9e9-41da-8fc2-7736bf25f2dc´downstream_cells_map€²upstream_cells_mapŠºmountaincar_continuous_mdp‘Ù$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2§@md_strÙ$mountaincar_binary_continuous_params‘Ù$71a5fce8-6d9a-4625-bad1-a951d61bff28¡<¡>¦islessÙ7run_mountaincar_binary_episodic_countinuous_param_study‘Ù$ac9c8845-284d-4c21-b05d-d930f86598a3¼mountaincar_tilecoding_setup‘Ù$7c592385-e8d3-4efe-962c-d39debb64405Ù5actor_critic_binary_episodic_gaussian_parameter_study’Ù$55ba8725-0ddf-4196-a41d-3f3c490a8d84Ù$13ebc12f-ff6f-4266-88d3-28d6df5fcf59¨getindexÙ$beb01fb8-c77d-4b5c-a66d-3812415e04a3„´precedence_heuristic §cell_idÙ$beb01fb8-c77d-4b5c-a66d-3812415e04a3´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$8bc280db-e57d-4e40-be46-1790f4f7d9e7„´precedence_heuristic §cell_idÙ$8bc280db-e57d-4e40-be46-1790f4f7d9e7´downstream_cells_mapÙ"actor_critic_fcann_parameter_study’Ù$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62Ù$c251a630-7114-4188-9323-8d8feb5c32e0²upstream_cells_mapÞ®AbstractVectorÙ*actor_critic_with_eligibility_traces_fcann‘Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54¥FCANN‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦Random‘Ù$df7f84e8-b42a-4001-9dbf-6bc3ced94207¤rand§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84»FCANN.initializeparams_saxe¥Int64¦Vector¤Real¨Function§scatter¤plot¦UInt64¦length·average_continuing_runs‘Ù$ba642a22-6623-482a-ab4a-81585b83e457¦Layout¬Random.seed!Ù$89901156-b874-416b-89c1-6dc434a4eb17„´precedence_heuristic §cell_idÙ$89901156-b874-416b-89c1-6dc434a4eb17´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$ff76ef94-fdf5-41f3-a31a-21c4629efabe„´precedence_heuristic 
§cell_idÙ$ff76ef94-fdf5-41f3-a31a-21c4629efabe´downstream_cells_map¬corridor_mdpÜÙ$f7433324-acc3-49a5-b5b3-ada0c8f09d52Ù$fb8904a9-ae64-41cc-93b6-5a25855edad0Ù$cecc2a35-3850-4f66-84e8-e29da4f3d4b0Ù$f2f2dd1d-180c-4d36-b515-5079d129f93aÙ$e1493cea-19c4-475d-98a0-86d27fb04af1Ù$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72cÙ$0c9986bb-54c0-4b08-9c29-4bfb0b68b54eÙ$d037ea92-915c-4dc7-97c6-d006d92e088aÙ$f2ed56c9-c2b7-42cb-a083-e12aeaa126efÙ$cbea5840-49d2-4e91-be9c-f5f15666d78aÙ$5720e942-d3f8-4329-83a8-8bcedf078b6aÙ$cacaaca6-6e01-464f-a2ee-cbf62737a426Ù$07ad517a-c2ac-4377-99fb-adb13d0f1d0cÙ$aa69e4ea-91e0-496a-a7be-529e67f4dbecÙ$83ca0577-15d7-4448-b597-c77810b812bfÙ$a12b92d1-e045-4f92-b8cd-eee5d56fa67dÙ$e5c1aca8-7575-4835-8273-e69ca0a55fe8Ù$7d63b960-3998-4f7b-8cbb-ccd49db9aeacÙ$9db9ff71-bee9-4bea-a45b-748f8517fed1Ù$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0Ù$646bc853-b7fc-49fa-a201-ff98e8f952d4Ù$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728Ù$396e0047-d848-462f-a769-0cc2829abc78Ù$bc8a399b-8864-4473-89d2-e3b0a03d15b5Ù$72273f27-d0b9-4645-a609-cb65cc9332ee²upstream_cells_map±make_corridor_mdp‘Ù$5cc4d12d-b537-47e2-8109-4e7a234fdf25Ù$581f7e9b-a5c2-4841-9605-85f9585b0274„´precedence_heuristic §cell_idÙ$581f7e9b-a5c2-4841-9605-85f9585b0274´downstream_cells_mapÙ!update_linear_action_preferences!–Ù$4634267b-5dea-4164-8bb2-1eb2fd4d7954Ù$8e39bd15-862e-4941-88f9-2794b861a523Ù$d1ed25e6-60c6-411f-a541-99986e5da2c5Ù$57e5e12a-b722-4ea3-ab3b-e5711029e640Ù$68806899-9972-460a-9f11-daa708a9d610Ù$d5020a8d-1dd7-403c-9d1f-665b95543943²upstream_cells_map‡¤zero¤BLASªBLAS.gemv!¦Matrix£oneAbstractFloat¦VectorÙ$8aa16866-bfda-48df-9cf1-cf3d2e203ccb„´precedence_heuristic 
§cell_idÙ$8aa16866-bfda-48df-9cf1-cf3d2e203ccb´downstream_cells_mapÙ8cartpole_tilecoding_reinforce_continuous_parameter_study²upstream_cells_mapŽ®cartpole_setup‘Ù$26880577-d267-4950-8725-7afe0d0402b6¡:¢|>¶setup_cartpole_problem§scatter¤plot¡/ÙLreinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions‘Ù$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00¡+¦foldxt§isempty£Map¤mean¦LayoutÙ$04b5929a-2058-49c9-963a-96c752a1d67d„´precedence_heuristic §cell_idÙ$04b5929a-2058-49c9-963a-96c752a1d67d´downstream_cells_map€²upstream_cells_map‚¼plot_continuing_step_rewards‘Ù$0964133c-3a5b-433b-a8c4-a97813c37583¾cartpole_continuing_fcann_test‘Ù$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27Ù$f0104778-81a6-417b-8501-f916e5e7f3af„´precedence_heuristic §cell_idÙ$f0104778-81a6-417b-8501-f916e5e7f3af´downstream_cells_map¼make_corridor_continuing_mdp‘Ù$1ac9296f-047b-4051-ba5c-0c23d5f9cde9²upstream_cells_map‹¦ifelse§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84§Returns¡-¹StateMDPTransitionSampler‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84§Float32¡+¡*¦iseven¢==Ù$3e3c5897-809f-46e3-bb58-f115b082443e„´precedence_heuristic 
§cell_idÙ$3e3c5897-809f-46e3-bb58-f115b082443e´downstream_cells_mapÙAactor_critic_with_eligibility_traces_binary_features_beta_actions’Ù$dd8e8cd2-7b41-46c4-8530-adefb7aea684Ù$4156d955-9daf-4429-b152-e8332980fb9e²upstream_cells_mapÞ¿update_beta_eligibility_vector!’Ù$bfe7e41d-6318-4bd4-b892-287831876abcÙ$d41f1dd1-45fe-4456-9a01-ed47fd6704a7µbinary_value_function‘Ù$a540814a-57a1-4b98-9443-59e401425444±make_beta_sampler‘Ù$b2082ab0-73a4-45a6-8772-a2e6e22b519a¿make_n_param_dist_policy_params‘Ù$ba41f521-4ee2-42a6-bf18-078bfa4b875e¥zeros³BinaryFeatureVector‘Ù$f92bb265-4b19-4f0e-a698-d7547bb6dd41¤rand§Integer¦VectorContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¤Real¨Function½update_binary_value_gradient!‘Ù$03a218cb-aa83-4000-85b5-c6f247087053Ù"setup_binary_beta_policy_arguments‘Ù$ed93259c-7b8b-46d7-97fb-f194e0e04b3a¦NTuple¥Union¦MatrixÙ!update_binary_action_preferences!‘Ù$a361f4c9-47ce-42ad-899c-87b611c0d471Ù%actor_critic_with_eligibility_traces!”Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90fÙ$a9db3f85-ff56-4bbc-be87-47b893ef3b7b„´precedence_heuristic §cell_idÙ$a9db3f85-ff56-4bbc-be87-47b893ef3b7b´downstream_cells_map»mountaincar_continuing_step‘Ù$00152954-dc98-4120-b94b-2ea4d987832b²upstream_cells_map…´MountainCarTask.stepÙ MountainCarTask.initialize_state¢==§Integer¯MountainCarTaskÙ$08505e88-9c23-4e95-91e3-d18bf5133dbc„´precedence_heuristic 
§cell_idÙ$08505e88-9c23-4e95-91e3-d18bf5133dbc´downstream_cells_mapÙ>actor_critic_binary_episodic_squashed_gaussian_parameter_study‘Ù$0d93132d-5819-47dc-8cf2-462d480d9c3d²upstream_cells_mapÞ¦Random‘Ù$df7f84e8-b42a-4001-9dbf-6bc3ced94207ContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¤copy¦Vector¤Real§scatter¡/¦Matrix§isempty¤meanÙNactor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions’Ù$05f120be-9695-4824-82fd-142a0df13098Ù$717e4c69-59d5-4929-923f-dd35a97fb160¡:®AbstractVector¢|>£Inf¿make_n_param_dist_policy_params‘Ù$ba41f521-4ee2-42a6-bf18-078bfa4b875e¥zeros¤rand§Integer¨Function¦UInt64¡-¤plot¦foldxt¡+£Map¦Layout¬Random.seed!Ù$ad0009af-2cfc-4820-bd4a-698ad391f459„´precedence_heuristic §cell_idÙ$ad0009af-2cfc-4820-bd4a-698ad391f459´downstream_cells_map€²upstream_cells_map…«beta_params‘Ù$7bf209c8-ef0a-46d1-937e-b1a6e45dc62e§scatter¤plot¨LinRange®make_beta_dist‘Ù$0b01ba67-3921-4f3f-a7e8-235190bc84ebÙ$16fcc2d0-9f2f-4226-9dcc-6d86248cab26„´precedence_heuristic §cell_idÙ$16fcc2d0-9f2f-4226-9dcc-6d86248cab26´downstream_cells_map¸plot_state_distributions‘Ù$9cf3dc5f-8a25-479f-93db-06e34f0d37a0²upstream_cells_mapÞ¡:£sum¤vcat·HypertextLiteral.Bypass¸HypertextLiteral.content¤size§adjoint¤@htl‘Ù$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb£bar¡-¤plot¡/°HypertextLiteral‘Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70¤attr§heatmap·HypertextLiteral.Result¦Layout»collect_state_distributions‘Ù$0c9986bb-54c0-4b08-9c29-4bfb0b68b54e¤conjÙ$11063fff-4d36-46d5-828f-dbed0f46b9cf„´precedence_heuristic 
§cell_idÙ$11063fff-4d36-46d5-828f-dbed0f46b9cf´downstream_cells_mapÙ"actor_critic_fcann_parameter_study’Ù$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62Ù$c251a630-7114-4188-9323-8d8feb5c32e0²upstream_cells_mapÞ¡:®AbstractVectorÙ*actor_critic_with_eligibility_traces_fcann‘Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54¥FCANN‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84¦Random‘Ù$df7f84e8-b42a-4001-9dbf-6bc3ced94207¤rand§Integer¨StateMDP‘Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84»FCANN.initializeparams_saxe¥Int64¦Vector¤Real¨Function¦UInt64¦length·average_continuing_runs‘Ù$ba642a22-6623-482a-ab4a-81585b83e457©DataFrame¬Random.seed!Ù$8fcdca63-01a0-4d4b-933c-06a7621d980a„´precedence_heuristic §cell_idÙ$8fcdca63-01a0-4d4b-933c-06a7621d980a´downstream_cells_map€²upstream_cells_map€Ù$33c99850-67cd-4754-94b9-6df97b238e27„´precedence_heuristic §cell_idÙ$33c99850-67cd-4754-94b9-6df97b238e27´downstream_cells_map©soft_max!–Ù$4634267b-5dea-4164-8bb2-1eb2fd4d7954Ù$042fbafe-2401-4fb7-ac13-4531e0782c79Ù$37ec6802-d4c2-4470-ad69-439d5a732f77Ù$4d4ae57b-afc3-44f9-b6fc-892f59f82921Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1²upstream_cells_mapÞ¤zero¦isless©@inbounds£one§nothing¦length¡<¯Base.simd_index©eachindex¤Real¥@simd¡/¢==§extrema£exp®AbstractVector®julia.simdloop¤BaseµBase.simd_outer_range¡-¶Base.simd_inner_length¡+Ù$786a5385-b648-4fc3-8e19-bf6582828136„´precedence_heuristic §cell_idÙ$786a5385-b648-4fc3-8e19-bf6582828136´downstream_cells_map€²upstream_cells_map‚§@md_str¨getindexÙ$573878bb-020d-40f6-9329-3d5f91843010„´precedence_heuristic §cell_idÙ$573878bb-020d-40f6-9329-3d5f91843010´downstream_cells_map€²upstream_cells_map‚®corridor_train‘Ù$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72cºget_corridor_episode_stats’Ù$fb8904a9-ae64-41cc-93b6-5a25855edad0Ù$cecc2a35-3850-4f66-84e8-e29da4f3d4b0Ù$2e7c737c-c798-4442-a7e1-d74ccfd73119„´precedence_heuristic 
§cell_idÙ$2e7c737c-c798-4442-a7e1-d74ccfd73119´downstream_cells_map£áº‹‘Ù$d4e87ac4-6008-43b2-aa06-e232ec2b2b5b²upstream_cells_map‰¤Core¤Base¡:·PlutoRunner.create_bond«PlutoRunner¯Core.applicable¥@bind¨Base.get¦SliderÙ$9d264543-33ab-498a-90f5-5f913c252484„´precedence_heuristic §cell_idÙ$9d264543-33ab-498a-90f5-5f913c252484´downstream_cells_map€²upstream_cells_map†¯reinforce_test4‘Ù$407a0724-4bb6-4c83-ab2d-17a0e19c4072¦length¡:¤plot¡/£endÙ$9cf3dc5f-8a25-479f-93db-06e34f0d37a0„´precedence_heuristic §cell_idÙ$9cf3dc5f-8a25-479f-93db-06e34f0d37a0´downstream_cells_map€²upstream_cells_map‚¸plot_state_distributions‘Ù$16fcc2d0-9f2f-4226-9dcc-6d86248cab26«dist_plot_p‘Ù$4a39f9a7-72d4-44ad-895a-742cd1291f92Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70„´precedence_heuristic§cell_idÙ$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70´downstream_cells_map‡¯ProgressLogging§PlutoUIœÙ$cc80848a-6834-4272-9152-e17b45448814Ù$a8b40b8f-051a-4e6f-a079-ece4f32873deÙ$5b15d91e-7119-4f85-a54a-7d4f1fdaf097Ù$19dfabda-7049-4050-8662-0385529c0c5aÙ$42d4600a-bf3c-45ac-b7f5-d23917713ff5Ù$28ce6e60-59cf-408a-8081-b978507b3c72Ù$60c21e9c-e42d-4f0b-a910-3b318440fbc8Ù$7bf209c8-ef0a-46d1-937e-b1a6e45dc62eÙ$94517664-6988-44dc-a297-e9d5873ee540Ù$5eebf3da-bfe7-46eb-81a3-f87f334ee270Ù$87feff3e-e510-4916-91a9-db3a2cd12225Ù$b7f77935-bcab-4ef1-8e1b-a7d059784ff3°HypertextLiteralšÙ$16fcc2d0-9f2f-4226-9dcc-6d86248cab26Ù$cc80848a-6834-4272-9152-e17b45448814Ù$a8b40b8f-051a-4e6f-a079-ece4f32873deÙ$602a07dd-8928-4b44-97e5-01c5cbf38351Ù$63fbf8f4-e4e2-4893-be09-67450e92dbd7Ù$f9facbba-39d4-483e-9066-275603156db0Ù$bbc8864a-1545-433f-bc7c-0ddf6e907138Ù$68469a40-7976-48b7-b7a1-eaa4c5f33a18Ù$ba645f6b-143f-4e83-9003-707770ae308dÙ$b5319d8b-0420-4ebf-b603-ea0b93365ac1¬PlutoProfile®BenchmarkTools«PlutoPlotly¬LaTeXStrings²upstream_cells_map¯TableOfContentsÙ$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19f„´precedence_heuristic 
§cell_idÙ$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19f´downstream_cells_mapÙ>actor_critic_binary_episodic_squashed_gaussian_parameter_study‘Ù$0d93132d-5819-47dc-8cf2-462d480d9c3d²upstream_cells_map‹«@NamedTuple¡:§IntegerContinuousMDP‘Ù$537270ba-122b-4f2b-880b-31d086766295¤Real¥Int64¨Function¤Base¡-¡^¡+Ù$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72c„´precedence_heuristic §cell_idÙ$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72c´downstream_cells_map®corridor_train“Ù$5334064b-5a16-4135-afa0-86a48291725bÙ$5981f52b-d829-4c7d-b47b-33310f7d64a2Ù$573878bb-020d-40f6-9329-3d5f91843010²upstream_cells_map…¥Int64¬corridor_mdp‘Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeµget_corridor_features‘Ù$6bb0263e-368e-462a-948c-baf9cfa82512§typemax¨sarsa_Î»´cell_execution_orderÜ§Ù$fac138d9-3c5d-44b0-a87c-b13872f19450Ù$e034b9cb-f4ee-46f4-bea6-72c93c75d966Ù$666a4e89-306b-4fb2-bdc4-3dda2c63153fÙ$df7f84e8-b42a-4001-9dbf-6bc3ced94207Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70Ù$7cf26604-9c2b-4a77-9674-7d4dac2f99f0Ù$36a6e43f-6bcf-4c27-bfbb-047760e77adaÙ$31f7e903-30b6-4193-9174-88093e004de4Ù$48dcd2d0-a940-41da-a097-90c780f2ec4dÙ$d95f75b5-21d8-4862-baa7-50b58d9725b8Ù$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1dÙ$dcb306ae-a1b1-43d6-ba6e-e38668838689Ù$33c99850-67cd-4754-94b9-6df97b238e27Ù$7a6fb1f0-fc3c-4c29-a6d9-769d32ca98a9Ù$5cc4d12d-b537-47e2-8109-4e7a234fdf25Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeÙ$6bb0263e-368e-462a-948c-baf9cfa82512Ù$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811Ù$f2f2dd1d-180c-4d36-b515-5079d129f93aÙ$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72cÙ$5334064b-5a16-4135-afa0-86a48291725bÙ$5981f52b-d829-4c7d-b47b-33310f7d64a2Ù$8019bec9-1228-407b-9199-2fe29f26a981Ù$38e5d800-4d43-40d2-87ea-f7d4b4283dabÙ$b94fc99c-f439-4df2-8da3-c01718a136c4Ù$9c342958-1971-48ec-b919-5dfdcbc915a4Ù$e5faaa1b-88cb-43e2-8d04-8972b58b4bdaÙ$406638af-1e08-44d2-9ee4-97aa9294a94bÙ$aa450da4-fe84-4eea-b6c4-9820b7982437Ù$fdd3f4fd-4706-4d6b-b150-6ee6b4b370cbÙ$41d62de1-2c92-41ee-9430-b9ca3007afd9Ù$135f205a-f87e-4691-8e87-d317d6312c84Ù$ca360680-afc
9-4dd9-9351-493643f91575Ù$4a39f9a7-72d4-44ad-895a-742cd1291f92Ù$98229733-a71e-44ca-a52a-b7229cf8b422Ù$37a8ef7e-e859-4ef0-81e2-76c02a324031Ù$339b4d2b-2237-46a3-9867-ecc3332856c1Ù$05b0fcad-628b-48d2-aa24-f6f562dbb660Ù$17d07ef4-7c0a-47cc-a701-32c60336571bÙ$76b03e72-da04-4530-8534-6d6468268cbdÙ$2a586e46-66e4-461a-85c8-5817e4d1aa43Ù$90d3b96b-ad2b-405c-951b-f48ec7ccf24aÙ$f924eb30-d1cc-4941-8fb5-ff70ad425ab9Ù$189798b3-ec6b-48b9-918c-ee0f65935ab3Ù$70096b14-beab-4f71-9886-6355c749bb8aÙ$1558cec1-c4fd-4bc0-85ed-ae22c6067d41Ù$e3a2fb12-37ce-4c23-ad93-5fc89991aabbÙ$58403c8e-0ee4-4466-ba25-ee0c86fb0b47Ù$73b90260-d57a-449a-8db6-47f91e6a4e4fÙ$ee72af8d-3cb8-4314-82df-580f068e1252Ù$89901156-b874-416b-89c1-6dc434a4eb17Ù$5c11a92d-7496-4aba-af15-2537eac49dd7Ù$581f7e9b-a5c2-4841-9605-85f9585b0274Ù$da2d3186-a778-41cc-9b49-759bf1e9b8faÙ$f92bb265-4b19-4f0e-a698-d7547bb6dd41Ù$8eab55a5-41b7-4f5e-a02f-4c19388bc9eaÙ$a361f4c9-47ce-42ad-899c-87b611c0d471Ù$cc3ac95e-a398-438a-ba3d-62b6733f6342Ù$4634267b-5dea-4164-8bb2-1eb2fd4d7954Ù$45f0a385-6465-4acc-8637-1b007a0fe215Ù$41dc149d-c6f3-4b0d-a856-06f3aaae3049Ù$b0a66a19-ee76-463b-a704-8fcee85444d0Ù$042fbafe-2401-4fb7-ac13-4531e0782c79Ù$65d2add6-fd6f-456c-92ed-3cd9d1862ef6Ù$96506201-6b66-49e6-8179-06952e2394e1Ù$0e9de19e-bcd4-40ac-9831-afb6cad38422Ù$a206c759-3f6e-4003-8cba-5f6ce6742646Ù$3bafd7df-9bc0-4d13-874d-739590cf3ad9Ù$cc45091e-b889-4d5a-9eef-84d80f792046Ù$d83dc659-dce7-41dd-a8e7-2933ab39d15cÙ$1753b5ed-c00b-4b60-b492-822180778e8cÙ$03a218cb-aa83-4000-85b5-c6f247087053Ù$a893a87b-2d07-4db5-9d1a-9da8646216f4Ù$5c4a383f-fcf2-4f2b-819f-6d84471dda00Ù$2cbc972b-c685-4c1c-8a8d-9d58b197ad90Ù$77cf3a74-899f-4ade-99f2-5aaf7a98c02dÙ$0bf3b988-b3fb-49d5-8dde-b25766596363Ù$a540814a-57a1-4b98-9443-59e401425444Ù$635abb34-2c97-4f04-a74c-22fbec32f408Ù$37ec6802-d4c2-4470-ad69-439d5a732f77Ù$e7566274-5518-4e28-8738-d4b1747d0cfbÙ$f3e2db06-9cb7-464a-96b8-938175efd26bÙ$e1aec891-d95a-47d1-97d7-d2a4cfb16e64Ù$8544eddb-2095-4a3c-82e0-920123a88e6dÙ$48b342f2-e48f-457a-9bd3-b3504a79f3a6
Ù$fd89433e-643c-474b-b3c4-a997678421a6Ù$1ec1acf1-f833-4478-9b3c-88029340a629Ù$b72e030f-7d52-481f-b4f7-2b16b227e547Ù$047656d1-2921-40f2-b75b-ce4a87098007Ù$738ada7f-edc7-4ed3-a15e-e92113468738Ù$ce33f710-fd9d-4dfa-acda-40204e54d518Ù$f4b6f10b-4cd0-4be6-98ec-4d4ffb696392Ù$e7e49ff8-32df-48a4-afb2-462859592e92Ù$1386ffdb-940d-4f1b-a872-4e38647b5335Ù$e2b09af1-0f22-4f7f-b806-54fa522adb20Ù$4cbdb082-22ba-49e9-a6ed-4380917625acÙ$e6cf9550-2e69-4b82-92cf-5e07a35490aaÙ$25be5dcf-be63-46c4-b6de-6cf79fa28fd0Ù$4fea7232-f286-4a8b-93f8-a0702818ab31Ù$d8222abf-139c-4220-8e92-cc987ec6900cÙ$511a847f-234c-465e-8f4a-688e79d9b975Ù$0284f0d7-b8a9-4ae6-add0-ac1078571d9bÙ$b4875f2b-5487-429f-80a3-d1032bbccfc1Ù$4915b1ed-ad53-4ece-9b00-bc136d47d8dcÙ$5b15f5c9-80bf-47f0-898a-f8dead5b927cÙ$f3bc47b5-03fc-4bd9-a890-26f9608a730bÙ$436c52d2-280b-4ca4-9360-d6587b8254c7Ù$f0104778-81a6-417b-8501-f916e5e7f3afÙ$1ac9296f-047b-4051-ba5c-0c23d5f9cde9Ù$ba642a22-6623-482a-ab4a-81585b83e457Ù$e96d592d-1e54-486d-8ad9-b857f85476e8Ù$5aba4f96-e877-457e-8e95-18737348f99fÙ$5b15d91e-7119-4f85-a54a-7d4f1fdaf097Ù$7d94922e-dc9f-4953-b539-24aaa2c85b12Ù$da8d0bca-105b-4d0b-a73d-ee5c9059aeafÙ$d17a4bd0-5992-4247-912d-73d51758d2f3Ù$352d2952-cb83-47d3-9078-2b2ef9927443Ù$f27f2bcd-05b6-44fe-bf9e-a3e51556db7cÙ$b87ff1a9-abff-40f7-a1d8-f751a1c8b060Ù$5d434c83-c9ca-499f-8695-c7733031c2deÙ$4c4e643b-d4b9-44f0-8d30-dc521bcc55acÙ$7dbb42a3-aa8c-47e5-b668-18e6325d4038Ù$de3cba34-9842-44d1-9b79-47126c0a0751Ù$37a273b6-b104-46f0-987a-401dc1c97327Ù$8e742d32-c074-4981-b35b-b596b64c869bÙ$64900586-ef92-48e4-839e-ff952a46671bÙ$19dfabda-7049-4050-8662-0385529c0c5aÙ$966ef17c-23be-49dc-bc37-4cb52b34c049Ù$2c5d221a-2469-49e1-9249-dfdc2457f2faÙ$5ffc271f-c73f-494a-9727-8d7516af2191Ù$42d4600a-bf3c-45ac-b7f5-d23917713ff5Ù$820752af-8966-4ee8-82f7-a40934522de5Ù$0964133c-3a5b-433b-a8c4-a97813c37583Ù$28ce6e60-59cf-408a-8081-b978507b3c72Ù$5500fd8e-64cb-4af7-808d-230440746319Ù$a9db3f85-ff56-4bbc-be87-47b893ef3b7bÙ$00152954-dc98-4120-b94b-2ea4d987832bÙ$46fea69b-599e-46ab-845
5-d2da865d9a8eÙ$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62Ù$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486Ù$c926b6df-c40b-4c4c-8a95-ce9e41feb100Ù$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7dÙ$5d35e515-e2d3-443e-becf-eb28c25db346Ù$735b548a-88f5-4a30-ab8f-dfb3d6401b2bÙ$60c21e9c-e42d-4f0b-a910-3b318440fbc8Ù$09dd1440-5d09-421f-addc-b1ede43ff517Ù$7ccadf01-fbba-4dfd-a5ad-770dab9946f9Ù$beb01fb8-c77d-4b5c-a66d-3812415e04a3Ù$68e6f17e-8c87-40f0-a673-1115ecd1b71dÙ$692c1043-4eaf-491e-b8fe-368618867f99Ù$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5Ù$fd964539-2baf-4ff1-b286-5a0bb1b222c4Ù$0b01ba67-3921-4f3f-a7e8-235190bc84ebÙ$7bf209c8-ef0a-46d1-937e-b1a6e45dc62eÙ$ad0009af-2cfc-4820-bd4a-698ad391f459Ù$b09e1e48-494e-4967-826a-6e70199acad4Ù$5864a5a3-a5a5-43c2-9cb4-7d13b2d20bedÙ$94517664-6988-44dc-a297-e9d5873ee540Ù$b16899b7-36bf-4a5e-8e2f-4496b8450687Ù$00bd2835-b006-4244-9877-bc7e031e3ef8Ù$3e7cecec-eb77-4862-8e3c-b510422e06dbÙ$78c83673-2117-4542-b4d8-1c243e8f610bÙ$ae0f5a96-7a4b-47f9-be1e-e803a238a071Ù$c8b47eac-2d45-419a-bec6-2ae0cdc59393Ù$537270ba-122b-4f2b-880b-31d086766295Ù$10cdd16e-a337-4421-a7a0-6de4e4b60c0fÙ$54fff14b-cf53-47b0-9cfa-8b9ee33df54eÙ$76fd79a2-2bc8-45f8-a243-48415118898aÙ$87ee21f3-16ca-4c8c-a0b9-f9e2fd258a91Ù$b966b248-fb4d-457d-90f6-114370846242Ù$f946c886-6246-4f98-a96f-f06984691ad8Ù$f7433324-acc3-49a5-b5b3-ada0c8f09d52Ù$fb8904a9-ae64-41cc-93b6-5a25855edad0Ù$cecc2a35-3850-4f66-84e8-e29da4f3d4b0Ù$a019925a-460a-410e-a54b-50a4cfe0e90eÙ$e1493cea-19c4-475d-98a0-86d27fb04af1Ù$573878bb-020d-40f6-9329-3d5f91843010Ù$0c9986bb-54c0-4b08-9c29-4bfb0b68b54eÙ$54f559b6-8a62-4a42-894d-c56e41d5ebefÙ$62e677ac-2070-4f6b-9df2-90849d89fa9fÙ$bba13634-ff0e-47f7-a23b-8d56098f4ac6Ù$b2082ab0-73a4-45a6-8772-a2e6e22b519aÙ$7a6f3f79-ea06-4994-8b62-90b2056e4034Ù$5261651e-a51e-4e80-8e23-83a4c10e5259Ù$bfe7e41d-6318-4bd4-b892-287831876abcÙ$6bf5ad39-1400-4e1f-a843-a1934b8aaa48Ù$f55afa58-962d-4551-8d95-a5b467d61adfÙ$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$740a3f41-9302-481d-b373-762c0dea8effÙ$d41f1dd1-45fe-4456-9a01-ed47fd6704a7Ù$9ae58dd6
-3cde-4943-9ac1-bd9d4f7d690cÙ$f545c800-0bf3-491f-9d7d-42341cfdb573Ù$5b868eba-c1af-49f6-8f93-79b78c319a6fÙ$0ac7ea44-14f6-4e80-80f9-d6df8059bb38Ù$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290Ù$d037ea92-915c-4dc7-97c6-d006d92e088aÙ$0c56b341-24eb-4c78-844e-182f44a7221aÙ$f2ed56c9-c2b7-42cb-a083-e12aeaa126efÙ$8e39bd15-862e-4941-88f9-2794b861a523Ù$5720e942-d3f8-4329-83a8-8bcedf078b6aÙ$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091Ù$07ad517a-c2ac-4377-99fb-adb13d0f1d0cÙ$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbÙ$cbea5840-49d2-4e91-be9c-f5f15666d78aÙ$83ca0577-15d7-4448-b597-c77810b812bfÙ$a7dcc8cd-04ec-48f2-a387-116330eaffb2Ù$e5c1aca8-7575-4835-8273-e69ca0a55fe8Ù$d1ed25e6-60c6-411f-a541-99986e5da2c5Ù$cacaaca6-6e01-464f-a2ee-cbf62737a426Ù$a12b92d1-e045-4f92-b8cd-eee5d56fa67dÙ$44b32cc0-36a8-41fd-89bc-ce894536926cÙ$553b0ceb-f2ca-41ee-99bc-9f53a4487b49Ù$697b2310-9d96-4f7f-be62-c3bd6bf736f3Ù$aa69e4ea-91e0-496a-a7be-529e67f4dbecÙ$76eb6743-cac0-4174-9ba3-a0691c200b54Ù$ba41f521-4ee2-42a6-bf18-078bfa4b875eÙ$ba5d6311-daee-4abc-b2fb-fae2184ef3ebÙ$ed93259c-7b8b-46d7-97fb-f194e0e04b3aÙ$4e29c621-223e-4859-8e96-db04b967815aÙ$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00Ù$d5020a8d-1dd7-403c-9d1f-665b95543943Ù$2be8a812-4f21-4fe8-a2de-50497db0345aÙ$056a8adc-92f4-4b33-90d9-4b3b4026bbbcÙ$11b9beea-b0cd-45eb-84c6-151728894df0Ù$4d4ae57b-afc3-44f9-b6fc-892f59f82921Ù$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2Ù$7d63b960-3998-4f7b-8cbb-ccd49db9aeacÙ$646bc853-b7fc-49fa-a201-ff98e8f952d4Ù$94354552-9920-4b90-98d9-f75286d1f53eÙ$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6cÙ$57e5e12a-b722-4ea3-ab3b-e5711029e640Ù$9db9ff71-bee9-4bea-a45b-748f8517fed1Ù$57bbdb10-bed8-459d-8f67-9ea637cf12baÙ$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90fÙ$05bfd818-bf4e-4bda-baa9-5ba647867097Ù$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728Ù$396e0047-d848-462f-a769-0cc2829abc78Ù$1f041cb3-618c-4380-a1ec-d7bbe4a80f62Ù$72273f27-d0b9-4645-a609-cb65cc93
32eeÙ$8b35661b-5075-4d63-bc31-044407f99acfÙ$3c89209c-9202-4d5d-841c-ea34be369616Ù$645e93e7-e92e-49c4-9757-8294fabf4e9bÙ$68806899-9972-460a-9f11-daa708a9d610Ù$11ea640c-3981-404d-87c6-4d3d0708a2b8Ù$734573e5-547b-4dcc-89bb-412aa6cc42d6Ù$ff4f977e-48df-4c12-845c-c245b4d39d6dÙ$7afb6fb0-248a-4518-b94f-9876f81eca64Ù$42775fd1-5b27-48e0-abf1-9b22bb775e6dÙ$1b102220-6d78-480d-a77f-0e57bad23dcaÙ$b2539398-fdbc-42a2-a8f3-d327358f3643Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54Ù$f8614042-7c94-4d47-a1b6-4e96676b4e8bÙ$8bc280db-e57d-4e40-be46-1790f4f7d9e7Ù$11063fff-4d36-46d5-828f-dbed0f46b9cfÙ$20776e09-7d9b-4db8-a060-7bceeec65b47Ù$3e3c5897-809f-46e3-bb58-f115b082443eÙ$05f120be-9695-4824-82fd-142a0df13098Ù$717e4c69-59d5-4929-923f-dd35a97fb160Ù$55ba8725-0ddf-4196-a41d-3f3c490a8d84Ù$61949faa-8174-4b7b-8fbc-01d5f850b419Ù$dd8e8cd2-7b41-46c4-8530-adefb7aea684Ù$08505e88-9c23-4e95-91e3-d18bf5133dbcÙ$87482ea5-5265-4e02-92c0-1a8bb44ff0f4Ù$13ebc12f-ff6f-4266-88d3-28d6df5fcf59Ù$3d065608-eef2-4caa-b17d-ec60714e3d58Ù$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19fÙ$c5a2879c-e89b-47f7-bbd6-48200d7e89e3Ù$65be0e58-24be-4932-92a9-9e4825b14144Ù$3c695d54-c30f-4f04-bd40-f5da53be2a95Ù$3c316495-bb6c-41e2-a38f-ba867a319fbbÙ$024dcd1a-8eaa-4a95-8037-2f578828309cÙ$822e4d69-2582-4956-858e-06ecb091e76aÙ$cf1859d6-f889-4923-8c87-2d7c039f26c3Ù$31db0f58-28e4-454f-9394-25565687266fÙ$fddef10c-7695-4596-9e16-987fd45a57e6Ù$26880577-d267-4950-8725-7afe0d0402b6Ù$0cd96c44-cae6-421f-9fae-26141600bef4Ù$24fa139c-ad4b-49db-ac8f-23c476ed8608Ù$dddc4a2f-34b2-41dc-85b3-55aba4880fa6Ù$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5Ù$d3b56fca-5b79-4465-8987-8d0005f854d8Ù$5859ca11-90f8-4fd6-88ed-c56efe796fe8Ù$281360af-46bf-4c73-bf11-3cb1153ad3e2Ù$8f1b2db4-ed35-44fc-a3d5-e06deae16d48Ù$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03Ù$8aa16866-bfda-48df-9cf1-cf3d2e203ccbÙ$dca2f8e2-76af-4679-bf81-3824c15fc76dÙ$11a55af7-5301-4507-bb26-88e1e11236dbÙ$7856b8a0-565d-4c86-9b3c-4424ff9b86ddÙ$8fcdca63-01a0-4d4b-933c-06a7621d980aÙ$76d54520-baa3-44bf-b303-4cdcb8b87080Ù$9acdbf38-2e10-45ec
-85a0-d0db8453a599Ù$f0962801-0dfa-421f-8ffc-e64068e49913Ù$c251a630-7114-4188-9323-8d8feb5c32e0Ù$cb70d400-3e9c-441c-b17c-e727e8c928f3Ù$61650a97-b353-4a85-b50b-93fee296ac7bÙ$192b9f82-8d3a-408f-91c2-829cfcd32572Ù$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62Ù$50ae94c4-70f3-4215-82bd-eb2227c2badfÙ$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27Ù$04b5929a-2058-49c9-963a-96c752a1d67dÙ$64b38d1f-ecf9-4843-89a1-4c8953048265Ù$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fceÙ$6acb549a-5d90-4457-a347-d22448ad8071Ù$d34d22ad-89c2-423e-91dd-bfb895dc6540Ù$5eebf3da-bfe7-46eb-81a3-f87f334ee270Ù$9978d537-49ff-4014-a971-b42704c50a6bÙ$54ff46a2-489a-4dd2-bc30-df70c780cc42Ù$407a0724-4bb6-4c83-ab2d-17a0e19c4072Ù$27487ad0-4779-42ce-8def-e660ef04bee0Ù$9d264543-33ab-498a-90f5-5f913c252484Ù$07ba9fe4-aaa7-4123-9865-cbfa79d0d44aÙ$e1274f57-75cb-4659-a82f-e5870c5367e2Ù$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547dÙ$5ee4ce72-7740-4297-8d84-619e0708e4acÙ$87feff3e-e510-4916-91a9-db3a2cd12225Ù$6b1acb57-159a-4b7f-99fe-5f996522243bÙ$82e0e9a0-9662-429a-87e3-e6bdae02709aÙ$27441783-d3c6-40be-9c36-4941613e6ae9Ù$daf35bfe-8f9c-4f55-971d-4d443be8f8bfÙ$51d6337d-c0bd-40a9-9129-7d88e41e4093Ù$a5b002c9-5e11-462a-9da0-6e060c7963f8Ù$9bce6fdb-2cbc-4758-9a8b-794e490c973dÙ$bb1ef180-39ac-475f-beea-ef573e71a3bfÙ$a8349352-3242-46d5-b0d5-1b6eb5d77e90Ù$2e7c737c-c798-4442-a7e1-d74ccfd73119Ù$f7f58fd2-facc-4b87-9172-5e911677c8f4Ù$d21617aa-6f38-4a90-8586-4b32022497adÙ$700dcbc4-c94c-4287-8cf0-0b2c7a320a3aÙ$4f96be72-ef3e-4e08-ac4c-be4271dcd14cÙ$54f1546d-87ae-49d2-92ed-6fcc9b66e027Ù$c5dd7e99-57e0-4bc7-97d2-2c780b23bcffÙ$2025ff38-f2ec-4224-b771-ff72ffe1af28Ù$77906355-08f8-4b08-b051-84697199b519Ù$023f67b8-8f38-470a-9766-ac60a75678aaÙ$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9Ù$10ee7709-0816-48d2-abe0-9be3dd04700fÙ$7c592385-e8d3-4efe-962c-d39debb64405Ù$d57375a5-b9e0-4742-b5f7-6a7da891604aÙ$04f42c09-8ab5-4233-b196-51c4aa2dcedbÙ$b02ba928-5b9f-4695-b980-07988c788bb9Ù$98222fcd-b456-477c-90dd-844df36877e5Ù$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6Ù$d9d11d69-bc16-400a-8f46-f9a8ecb8516aÙ$bc8a
399b-8864-4473-89d2-e3b0a03d15b5Ù$192cc1cf-9ea1-492d-baa7-f2e197abecd4Ù$6d0925d3-af96-4b94-8e2e-4941cce39e51Ù$786a5385-b648-4fc3-8e19-bf6582828136Ù$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbfÙ$38acd032-1d18-4760-9111-67c9cdd2e892Ù$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2Ù$349631b2-4686-49a9-9f3a-1e4ad588b568Ù$ac9c8845-284d-4c21-b05d-d930f86598a3Ù$b8532822-179b-4cd5-a279-4b71dafb544aÙ$fee14dfe-c5ca-4126-a830-cc9d7eda5433Ù$e524f8cc-ab69-4f8b-a59f-28156696a104Ù$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7dÙ$ff3009eb-23f9-44fe-8e56-85dbc7b463d0Ù$b7f77935-bcab-4ef1-8e1b-a7d059784ff3Ù$6c5e9bb2-4c38-4613-9652-dec99e97b512Ù$f8215517-b18f-4a03-9421-8edab4ca8089Ù$d2729657-d0bf-4d39-8ec7-f242a1ad48d6Ù$8e096fae-9941-49d8-ae87-c68b02f68da5Ù$44f14d4f-7414-4c6f-883a-042ca261a403Ù$6c5f51bb-a6be-447e-b73d-4f9c2885e809Ù$4156d955-9daf-4429-b152-e8332980fb9eÙ$16113560-e911-47b4-abc4-641bbd246454Ù$a7891c63-18d6-4c1f-ba67-adf7c547d334Ù$7126aefd-b847-497a-9545-514e9b9afa71Ù$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09Ù$4c34640f-efa2-4e1d-8a70-0acd2ce45428Ù$f7ede764-5ad8-426b-a805-cc21b622d977Ù$3ea08816-705e-4be7-a175-dbd3f3e4c17dÙ$5d50a5d0-8fe2-4c6e-b76c-d5614e4fd884Ù$0ab70fc3-6188-42eb-aba2-d808f319be9fÙ$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebÙ$16fcc2d0-9f2f-4226-9dcc-6d86248cab26Ù$9cf3dc5f-8a25-479f-93db-06e34f0d37a0Ù$cc80848a-6834-4272-9152-e17b45448814Ù$a8b40b8f-051a-4e6f-a079-ece4f32873deÙ$36d514fa-b27a-4c6b-8399-9d108377b9b5Ù$c52c4cec-0ea8-4af3-831a-d284f0e086eeÙ$4c5cb75e-79b5-4502-b1eb-6246e002feafÙ$8eb42403-1234-4e59-993e-057cc3a6d5c9Ù$71a5fce8-6d9a-4625-bad1-a951d61bff28Ù$b53dba81-a9e9-41da-8fc2-7736bf25f2dcÙ$0d45ae72-572f-4d17-83cf-9814f2854131Ù$0d93132d-5819-47dc-8cf2-462d480d9c3dÙ$602a07dd-8928-4b44-97e5-01c5cbf38351Ù$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2Ù$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7efÙ$fd58402f-da65-44cf-b81a-e21192fd0e63Ù$af144759-fe66-4ad0-b378-e9eb4e859db4Ù$d4e87ac4-6008-43b2-aa06-e232ec2b2b5bÙ$63fbf8f4-e4e2-4893-be09-67450e92dbd7Ù$fad02876-efba-46a7-9cb7-43820528779fÙ$374af774-3a97-49b5-a3bb-bc3f
7f63a3faÙ$1ce4bc6c-7cde-48e9-8ff1-7281697fd121Ù$f9facbba-39d4-483e-9066-275603156db0Ù$e89bdc84-dbb5-4c73-a39c-6392e5f79704Ù$c0876a48-ea18-494d-8bfc-e2bceb73b417Ù$d82e7ab8-c372-4462-afb5-1617560cdb56Ù$bbc8864a-1545-433f-bc7c-0ddf6e907138Ù$dc2efc6c-8da8-425b-aa5f-290949109565Ù$68469a40-7976-48b7-b7a1-eaa4c5f33a18Ù$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580aÙ$b695ef21-a1ac-4d1f-a0e1-71cd81cede18Ù$a0ca7a5e-0089-4a45-9278-c0f27cd096a0Ù$ba645f6b-143f-4e83-9003-707770ae308dÙ$da3cb392-78f2-48b2-b0dc-5f016664798cÙ$3a37b53d-9174-4faa-9404-74a40c385b0aÙ$ddbca73f-c692-46f2-95f3-a7dd849d33f7Ù$b5319d8b-0420-4ebf-b603-ea0b93365ac1Ù$c87dba8c-9a96-41b3-9dc7-a6c088ec1eafÙ$cd9c9eeb-c90d-4499-9503-7773d5250f47Ù$5207308e-f636-4d47-b135-036a6e7b8ecdÙ$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8fÙ$f59a5dcd-9f4a-4336-a391-e64af35ef799´last_hot_reload_timeË©shortpathÙ%Chapter_13_Policy_Gradient_Methods.jl®process_status¥ready¤pathÙ°/home/runner/work/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Reinforcement-Learning-Sutton-Barto-Exercise-Solutions/Chapter-13/Chapter_13_Policy_Gradient_Methods.jlpluto_version§v0.20.8®last_save_timeËAÚ”ô 
yªcell_orderÜ§Ù$36a6e43f-6bcf-4c27-bfbb-047760e77adaÙ$31f7e903-30b6-4193-9174-88093e004de4Ù$48dcd2d0-a940-41da-a097-90c780f2ec4dÙ$d95f75b5-21d8-4862-baa7-50b58d9725b8Ù$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1dÙ$dcb306ae-a1b1-43d6-ba6e-e38668838689Ù$33c99850-67cd-4754-94b9-6df97b238e27Ù$7a6fb1f0-fc3c-4c29-a6d9-769d32ca98a9Ù$5cc4d12d-b537-47e2-8109-4e7a234fdf25Ù$ff76ef94-fdf5-41f3-a31a-21c4629efabeÙ$f7433324-acc3-49a5-b5b3-ada0c8f09d52Ù$fb8904a9-ae64-41cc-93b6-5a25855edad0Ù$cecc2a35-3850-4f66-84e8-e29da4f3d4b0Ù$6bb0263e-368e-462a-948c-baf9cfa82512Ù$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811Ù$a019925a-460a-410e-a54b-50a4cfe0e90eÙ$f2f2dd1d-180c-4d36-b515-5079d129f93aÙ$e1493cea-19c4-475d-98a0-86d27fb04af1Ù$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72cÙ$5334064b-5a16-4135-afa0-86a48291725bÙ$5981f52b-d829-4c7d-b47b-33310f7d64a2Ù$573878bb-020d-40f6-9329-3d5f91843010Ù$8019bec9-1228-407b-9199-2fe29f26a981Ù$38e5d800-4d43-40d2-87ea-f7d4b4283dabÙ$b94fc99c-f439-4df2-8da3-c01718a136c4Ù$9c342958-1971-48ec-b919-5dfdcbc915a4Ù$e5faaa1b-88cb-43e2-8d04-8972b58b4bdaÙ$406638af-1e08-44d2-9ee4-97aa9294a94bÙ$aa450da4-fe84-4eea-b6c4-9820b7982437Ù$fdd3f4fd-4706-4d6b-b150-6ee6b4b370cbÙ$0c9986bb-54c0-4b08-9c29-4bfb0b68b54eÙ$54f559b6-8a62-4a42-894d-c56e41d5ebefÙ$41d62de1-2c92-41ee-9430-b9ca3007afd9Ù$62e677ac-2070-4f6b-9df2-90849d89fa9fÙ$135f205a-f87e-4691-8e87-d317d6312c84Ù$ca360680-afc9-4dd9-9351-493643f91575Ù$4a39f9a7-72d4-44ad-895a-742cd1291f92Ù$9cf3dc5f-8a25-479f-93db-06e34f0d37a0Ù$16fcc2d0-9f2f-4226-9dcc-6d86248cab26Ù$98229733-a71e-44ca-a52a-b7229cf8b422Ù$37a8ef7e-e859-4ef0-81e2-76c02a324031Ù$339b4d2b-2237-46a3-9867-ecc3332856c1Ù$05b0fcad-628b-48d2-aa24-f6f562dbb660Ù$17d07ef4-7c0a-47cc-a701-32c60336571bÙ$76b03e72-da04-4530-8534-6d6468268cbdÙ$2a586e46-66e4-461a-85c8-5817e4d1aa43Ù$90d3b96b-ad2b-405c-951b-f48ec7ccf24aÙ$f924eb30-d1cc-4941-8fb5-ff70ad425ab9Ù$189798b3-ec6b-48b9-918c-ee0f65935ab3Ù$70096b14-beab-4f71-9886-6355c749bb8aÙ$1558cec1-c4fd-4bc0-85ed-ae22c6067d41Ù$e3a2fb12-37ce-4c23-ad93-5fc89991aabbÙ$58403c8e
-0ee4-4466-ba25-ee0c86fb0b47Ù$73b90260-d57a-449a-8db6-47f91e6a4e4fÙ$ee72af8d-3cb8-4314-82df-580f068e1252Ù$89901156-b874-416b-89c1-6dc434a4eb17Ù$5c11a92d-7496-4aba-af15-2537eac49dd7Ù$b0a66a19-ee76-463b-a704-8fcee85444d0Ù$581f7e9b-a5c2-4841-9605-85f9585b0274Ù$da2d3186-a778-41cc-9b49-759bf1e9b8faÙ$f92bb265-4b19-4f0e-a698-d7547bb6dd41Ù$8eab55a5-41b7-4f5e-a02f-4c19388bc9eaÙ$a361f4c9-47ce-42ad-899c-87b611c0d471Ù$cc3ac95e-a398-438a-ba3d-62b6733f6342Ù$4634267b-5dea-4164-8bb2-1eb2fd4d7954Ù$45f0a385-6465-4acc-8637-1b007a0fe215Ù$41dc149d-c6f3-4b0d-a856-06f3aaae3049Ù$042fbafe-2401-4fb7-ac13-4531e0782c79Ù$65d2add6-fd6f-456c-92ed-3cd9d1862ef6Ù$0ac7ea44-14f6-4e80-80f9-d6df8059bb38Ù$96506201-6b66-49e6-8179-06952e2394e1Ù$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290Ù$8e39bd15-862e-4941-88f9-2794b861a523Ù$0e9de19e-bcd4-40ac-9831-afb6cad38422Ù$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091Ù$d037ea92-915c-4dc7-97c6-d006d92e088aÙ$a206c759-3f6e-4003-8cba-5f6ce6742646Ù$0c56b341-24eb-4c78-844e-182f44a7221aÙ$3bafd7df-9bc0-4d13-874d-739590cf3ad9Ù$cc45091e-b889-4d5a-9eef-84d80f792046Ù$d83dc659-dce7-41dd-a8e7-2933ab39d15cÙ$1753b5ed-c00b-4b60-b492-822180778e8cÙ$03a218cb-aa83-4000-85b5-c6f247087053Ù$a893a87b-2d07-4db5-9d1a-9da8646216f4Ù$5c4a383f-fcf2-4f2b-819f-6d84471dda00Ù$2cbc972b-c685-4c1c-8a8d-9d58b197ad90Ù$77cf3a74-899f-4ade-99f2-5aaf7a98c02dÙ$0bf3b988-b3fb-49d5-8dde-b25766596363Ù$a540814a-57a1-4b98-9443-59e401425444Ù$635abb34-2c97-4f04-a74c-22fbec32f408Ù$37ec6802-d4c2-4470-ad69-439d5a732f77Ù$e7566274-5518-4e28-8738-d4b1747d0cfbÙ$4fb83451-b6f8-4e6e-a131-1accc8e10b08Ù$a7c9ae69-f4b8-471c-ab97-90642f3c2bdbÙ$d1ed25e6-60c6-411f-a541-99986e5da2c5Ù$f3e2db06-9cb7-464a-96b8-938175efd26bÙ$e1aec891-d95a-47d1-97d7-d2a4cfb16e64Ù$697b2310-9d96-4f7f-be62-c3bd6bf736f3Ù$8544eddb-2095-4a3c-82e0-920123a88e6dÙ$48b342f2-e48f-457a-9bd3-b3504a79f3a6Ù$f2ed56c9-c2b7-42cb-a083-e12aeaa126efÙ$cbea5840-49d2-4e91-be9c-f5f15666d78aÙ$fd89433e-643c-474b-b3c4-a997678421a6Ù$5720e942-d3f8-4329-83a8-8bcedf078b6aÙ$cacaaca6-6e01-464f-a2ee-cbf62737
a426Ù$1ec1acf1-f833-4478-9b3c-88029340a629Ù$07ad517a-c2ac-4377-99fb-adb13d0f1d0cÙ$aa69e4ea-91e0-496a-a7be-529e67f4dbecÙ$83ca0577-15d7-4448-b597-c77810b812bfÙ$b72e030f-7d52-481f-b4f7-2b16b227e547Ù$a7dcc8cd-04ec-48f2-a387-116330eaffb2Ù$047656d1-2921-40f2-b75b-ce4a87098007Ù$94354552-9920-4b90-98d9-f75286d1f53eÙ$a12b92d1-e045-4f92-b8cd-eee5d56fa67dÙ$44b32cc0-36a8-41fd-89bc-ce894536926cÙ$553b0ceb-f2ca-41ee-99bc-9f53a4487b49Ù$738ada7f-edc7-4ed3-a15e-e92113468738Ù$e5c1aca8-7575-4835-8273-e69ca0a55fe8Ù$ce33f710-fd9d-4dfa-acda-40204e54d518Ù$f4b6f10b-4cd0-4be6-98ec-4d4ffb696392Ù$e7e49ff8-32df-48a4-afb2-462859592e92Ù$4d4ae57b-afc3-44f9-b6fc-892f59f82921Ù$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2Ù$57e5e12a-b722-4ea3-ab3b-e5711029e640Ù$57bbdb10-bed8-459d-8f67-9ea637cf12baÙ$1386ffdb-940d-4f1b-a872-4e38647b5335Ù$7d63b960-3998-4f7b-8cbb-ccd49db9aeacÙ$9db9ff71-bee9-4bea-a45b-748f8517fed1Ù$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0Ù$646bc853-b7fc-49fa-a201-ff98e8f952d4Ù$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6cÙ$e2b09af1-0f22-4f7f-b806-54fa522adb20Ù$4cbdb082-22ba-49e9-a6ed-4380917625acÙ$e6cf9550-2e69-4b82-92cf-5e07a35490aaÙ$25be5dcf-be63-46c4-b6de-6cf79fa28fd0Ù$266d2234-26c8-43f1-9e75-49440a230ed6Ù$05bfd818-bf4e-4bda-baa9-5ba647867097Ù$68806899-9972-460a-9f11-daa708a9d610Ù$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54Ù$4fea7232-f286-4a8b-93f8-a0702818ab31Ù$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728Ù$396e0047-d848-462f-a769-0cc2829abc78Ù$1f041cb3-618c-4380-a1ec-d7bbe4a80f62Ù$11ea640c-3981-404d-87c6-4d3d0708a2b8Ù$f8614042-7c94-4d47-a1b6-4e96676b4e8bÙ$bc8a399b-8864-4473-89d2-e3b0a03d15b5Ù$cc80848a-6834-4272-9152-e17b45448814Ù$a8b40b8f-051a-4e6f-a079-ece4f32873deÙ$36d514fa-b27a-4c6b-8399-9d108377b9b5Ù$c52c4cec-0ea8-4af3-831a-d284f0e086eeÙ$d8222abf-139c-4220-8e92-cc987ec6900cÙ$511a847f-234c-465e-8f4a-688e79d9b975Ù$0284f0d7-b8a9-4ae6-add0-ac1078571d9bÙ$b4875f2b-5487-429f-80a3-d1032bbccfc1Ù$4915b1ed-ad53-4ece-9b00-bc136d47d8dcÙ$5b15f5c9-80bf-47f0-898a-f8dead5b927cÙ$83640f5b-fe13-4ec1-98a0-67a56c189ba1Ù$f3bc47b5-03fc-4bd9
-a890-26f9608a730bÙ$72273f27-d0b9-4645-a609-cb65cc9332eeÙ$436c52d2-280b-4ca4-9360-d6587b8254c7Ù$f0104778-81a6-417b-8501-f916e5e7f3afÙ$1ac9296f-047b-4051-ba5c-0c23d5f9cde9Ù$fac138d9-3c5d-44b0-a87c-b13872f19450Ù$ba642a22-6623-482a-ab4a-81585b83e457Ù$734573e5-547b-4dcc-89bb-412aa6cc42d6Ù$e96d592d-1e54-486d-8ad9-b857f85476e8Ù$ff4f977e-48df-4c12-845c-c245b4d39d6dÙ$8bc280db-e57d-4e40-be46-1790f4f7d9e7Ù$5aba4f96-e877-457e-8e95-18737348f99fÙ$11063fff-4d36-46d5-828f-dbed0f46b9cfÙ$7afb6fb0-248a-4518-b94f-9876f81eca64Ù$5b15d91e-7119-4f85-a54a-7d4f1fdaf097Ù$7d94922e-dc9f-4953-b539-24aaa2c85b12Ù$42775fd1-5b27-48e0-abf1-9b22bb775e6dÙ$da8d0bca-105b-4d0b-a73d-ee5c9059aeafÙ$8b35661b-5075-4d63-bc31-044407f99acfÙ$d17a4bd0-5992-4247-912d-73d51758d2f3Ù$352d2952-cb83-47d3-9078-2b2ef9927443Ù$f27f2bcd-05b6-44fe-bf9e-a3e51556db7cÙ$b87ff1a9-abff-40f7-a1d8-f751a1c8b060Ù$5d434c83-c9ca-499f-8695-c7733031c2deÙ$4c4e643b-d4b9-44f0-8d30-dc521bcc55acÙ$602a07dd-8928-4b44-97e5-01c5cbf38351Ù$7dbb42a3-aa8c-47e5-b668-18e6325d4038Ù$de3cba34-9842-44d1-9b79-47126c0a0751Ù$1b102220-6d78-480d-a77f-0e57bad23dcaÙ$37a273b6-b104-46f0-987a-401dc1c97327Ù$8e742d32-c074-4981-b35b-b596b64c869bÙ$b2539398-fdbc-42a2-a8f3-d327358f3643Ù$e034b9cb-f4ee-46f4-bea6-72c93c75d966Ù$64900586-ef92-48e4-839e-ff952a46671bÙ$3c89209c-9202-4d5d-841c-ea34be369616Ù$645e93e7-e92e-49c4-9757-8294fabf4e9bÙ$0cd96c44-cae6-421f-9fae-26141600bef4Ù$19dfabda-7049-4050-8662-0385529c0c5aÙ$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2Ù$966ef17c-23be-49dc-bc37-4cb52b34c049Ù$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62Ù$2c5d221a-2469-49e1-9249-dfdc2457f2faÙ$5ffc271f-c73f-494a-9727-8d7516af2191Ù$42d4600a-bf3c-45ac-b7f5-d23917713ff5Ù$50ae94c4-70f3-4215-82bd-eb2227c2badfÙ$820752af-8966-4ee8-82f7-a40934522de5Ù$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27Ù$0964133c-3a5b-433b-a8c4-a97813c37583Ù$04b5929a-2058-49c9-963a-96c752a1d67dÙ$64b38d1f-ecf9-4843-89a1-4c8953048265Ù$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fceÙ$6acb549a-5d90-4457-a347-d22448ad8071Ù$fad02876-efba-46a7-9cb7-43820528779fÙ$db6e
d0ea-c26b-4ea1-b4a1-7641f0f9c7efÙ$28ce6e60-59cf-408a-8081-b978507b3c72Ù$fd58402f-da65-44cf-b81a-e21192fd0e63Ù$5500fd8e-64cb-4af7-808d-230440746319Ù$a9db3f85-ff56-4bbc-be87-47b893ef3b7bÙ$00152954-dc98-4120-b94b-2ea4d987832bÙ$46fea69b-599e-46ab-8455-d2da865d9a8eÙ$d57375a5-b9e0-4742-b5f7-6a7da891604aÙ$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62Ù$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486Ù$04f42c09-8ab5-4233-b196-51c4aa2dcedbÙ$b02ba928-5b9f-4695-b980-07988c788bb9Ù$98222fcd-b456-477c-90dd-844df36877e5Ù$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6Ù$e89bdc84-dbb5-4c73-a39c-6392e5f79704Ù$da3cb392-78f2-48b2-b0dc-5f016664798cÙ$f0962801-0dfa-421f-8ffc-e64068e49913Ù$c251a630-7114-4188-9323-8d8feb5c32e0Ù$c926b6df-c40b-4c4c-8a95-ce9e41feb100Ù$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7dÙ$5d35e515-e2d3-443e-becf-eb28c25db346Ù$cb70d400-3e9c-441c-b17c-e727e8c928f3Ù$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9Ù$10ee7709-0816-48d2-abe0-9be3dd04700fÙ$c0876a48-ea18-494d-8bfc-e2bceb73b417Ù$3a37b53d-9174-4faa-9404-74a40c385b0aÙ$735b548a-88f5-4a30-ab8f-dfb3d6401b2bÙ$60c21e9c-e42d-4f0b-a910-3b318440fbc8Ù$09dd1440-5d09-421f-addc-b1ede43ff517Ù$7ccadf01-fbba-4dfd-a5ad-770dab9946f9Ù$beb01fb8-c77d-4b5c-a66d-3812415e04a3Ù$68e6f17e-8c87-40f0-a673-1115ecd1b71dÙ$692c1043-4eaf-491e-b8fe-368618867f99Ù$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5Ù$fd964539-2baf-4ff1-b286-5a0bb1b222c4Ù$666a4e89-306b-4fb2-bdc4-3dda2c63153fÙ$0b01ba67-3921-4f3f-a7e8-235190bc84ebÙ$7bf209c8-ef0a-46d1-937e-b1a6e45dc62eÙ$ad0009af-2cfc-4820-bd4a-698ad391f459Ù$b09e1e48-494e-4967-826a-6e70199acad4Ù$5864a5a3-a5a5-43c2-9cb4-7d13b2d20bedÙ$94517664-6988-44dc-a297-e9d5873ee540Ù$b16899b7-36bf-4a5e-8e2f-4496b8450687Ù$00bd2835-b006-4244-9877-bc7e031e3ef8Ù$3e7cecec-eb77-4862-8e3c-b510422e06dbÙ$78c83673-2117-4542-b4d8-1c243e8f610bÙ$ae0f5a96-7a4b-47f9-be1e-e803a238a071Ù$c8b47eac-2d45-419a-bec6-2ae0cdc59393Ù$537270ba-122b-4f2b-880b-31d086766295Ù$10cdd16e-a337-4421-a7a0-6de4e4b60c0fÙ$54fff14b-cf53-47b0-9cfa-8b9ee33df54eÙ$76fd79a2-2bc8-45f8-a243-48415118898aÙ$87ee21f3-16ca-4c8c-a0b9-f9e2
fd258a91Ù$b966b248-fb4d-457d-90f6-114370846242Ù$f946c886-6246-4f98-a96f-f06984691ad8Ù$bba13634-ff0e-47f7-a23b-8d56098f4ac6Ù$b2082ab0-73a4-45a6-8772-a2e6e22b519aÙ$7a6f3f79-ea06-4994-8b62-90b2056e4034Ù$5261651e-a51e-4e80-8e23-83a4c10e5259Ù$bfe7e41d-6318-4bd4-b892-287831876abcÙ$6bf5ad39-1400-4e1f-a843-a1934b8aaa48Ù$f55afa58-962d-4551-8d95-a5b467d61adfÙ$740a3f41-9302-481d-b373-762c0dea8effÙ$d41f1dd1-45fe-4456-9a01-ed47fd6704a7Ù$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690cÙ$f545c800-0bf3-491f-9d7d-42341cfdb573Ù$5b868eba-c1af-49f6-8f93-79b78c319a6fÙ$76eb6743-cac0-4174-9ba3-a0691c200b54Ù$ba41f521-4ee2-42a6-bf18-078bfa4b875eÙ$ba5d6311-daee-4abc-b2fb-fae2184ef3ebÙ$ed93259c-7b8b-46d7-97fb-f194e0e04b3aÙ$4e29c621-223e-4859-8e96-db04b967815aÙ$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00Ù$d5020a8d-1dd7-403c-9d1f-665b95543943Ù$2be8a812-4f21-4fe8-a2de-50497db0345aÙ$056a8adc-92f4-4b33-90d9-4b3b4026bbbcÙ$11b9beea-b0cd-45eb-84c6-151728894df0Ù$b71145a4-2614-4f62-bfd2-7d5d1fecec56Ù$4da20fd7-b897-4f26-bf2a-f08d66ddf90fÙ$20776e09-7d9b-4db8-a060-7bceeec65b47Ù$3e3c5897-809f-46e3-bb58-f115b082443eÙ$05f120be-9695-4824-82fd-142a0df13098Ù$717e4c69-59d5-4929-923f-dd35a97fb160Ù$55ba8725-0ddf-4196-a41d-3f3c490a8d84Ù$61949faa-8174-4b7b-8fbc-01d5f850b419Ù$dd8e8cd2-7b41-46c4-8530-adefb7aea684Ù$08505e88-9c23-4e95-91e3-d18bf5133dbcÙ$87482ea5-5265-4e02-92c0-1a8bb44ff0f4Ù$13ebc12f-ff6f-4266-88d3-28d6df5fcf59Ù$3d065608-eef2-4caa-b17d-ec60714e3d58Ù$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19fÙ$c5a2879c-e89b-47f7-bbd6-48200d7e89e3Ù$65be0e58-24be-4932-92a9-9e4825b14144Ù$3c695d54-c30f-4f04-bd40-f5da53be2a95Ù$3c316495-bb6c-41e2-a38f-ba867a319fbbÙ$024dcd1a-8eaa-4a95-8037-2f578828309cÙ$822e4d69-2582-4956-858e-06ecb091e76aÙ$cf1859d6-f889-4923-8c87-2d7c039f26c3Ù$31db0f58-28e4-454f-9394-25565687266fÙ$fddef10c-7695-4596-9e16-987fd45a57e6Ù$26880577-d267-4950-8725-7afe0d0402b6Ù$24fa139c-ad4b-49db-ac8f-23c476ed8608Ù$dddc4a2f-34b2-41dc-85b3-55aba4880fa6Ù$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5Ù$d3b56fca-5b79-4465-8987-8d0005f854d8Ù$5859ca11-90f8-
4fd6-88ed-c56efe796fe8Ù$281360af-46bf-4c73-bf11-3cb1153ad3e2Ù$8f1b2db4-ed35-44fc-a3d5-e06deae16d48Ù$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03Ù$8aa16866-bfda-48df-9cf1-cf3d2e203ccbÙ$dca2f8e2-76af-4679-bf81-3824c15fc76dÙ$11a55af7-5301-4507-bb26-88e1e11236dbÙ$7856b8a0-565d-4c86-9b3c-4424ff9b86ddÙ$8fcdca63-01a0-4d4b-933c-06a7621d980aÙ$76d54520-baa3-44bf-b303-4cdcb8b87080Ù$9acdbf38-2e10-45ec-85a0-d0db8453a599Ù$61650a97-b353-4a85-b50b-93fee296ac7bÙ$192b9f82-8d3a-408f-91c2-829cfcd32572Ù$d34d22ad-89c2-423e-91dd-bfb895dc6540Ù$5eebf3da-bfe7-46eb-81a3-f87f334ee270Ù$9978d537-49ff-4014-a971-b42704c50a6bÙ$54ff46a2-489a-4dd2-bc30-df70c780cc42Ù$407a0724-4bb6-4c83-ab2d-17a0e19c4072Ù$27487ad0-4779-42ce-8def-e660ef04bee0Ù$9d264543-33ab-498a-90f5-5f913c252484Ù$07ba9fe4-aaa7-4123-9865-cbfa79d0d44aÙ$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547dÙ$374af774-3a97-49b5-a3bb-bc3f7f63a3faÙ$af144759-fe66-4ad0-b378-e9eb4e859db4Ù$e1274f57-75cb-4659-a82f-e5870c5367e2Ù$63fbf8f4-e4e2-4893-be09-67450e92dbd7Ù$5ee4ce72-7740-4297-8d84-619e0708e4acÙ$87feff3e-e510-4916-91a9-db3a2cd12225Ù$6b1acb57-159a-4b7f-99fe-5f996522243bÙ$82e0e9a0-9662-429a-87e3-e6bdae02709aÙ$27441783-d3c6-40be-9c36-4941613e6ae9Ù$daf35bfe-8f9c-4f55-971d-4d443be8f8bfÙ$51d6337d-c0bd-40a9-9129-7d88e41e4093Ù$a5b002c9-5e11-462a-9da0-6e060c7963f8Ù$9bce6fdb-2cbc-4758-9a8b-794e490c973dÙ$1ce4bc6c-7cde-48e9-8ff1-7281697fd121Ù$bb1ef180-39ac-475f-beea-ef573e71a3bfÙ$a8349352-3242-46d5-b0d5-1b6eb5d77e90Ù$2e7c737c-c798-4442-a7e1-d74ccfd73119Ù$d4e87ac4-6008-43b2-aa06-e232ec2b2b5bÙ$f7f58fd2-facc-4b87-9172-5e911677c8f4Ù$d21617aa-6f38-4a90-8586-4b32022497adÙ$700dcbc4-c94c-4287-8cf0-0b2c7a320a3aÙ$4f96be72-ef3e-4e08-ac4c-be4271dcd14cÙ$54f1546d-87ae-49d2-92ed-6fcc9b66e027Ù$c5dd7e99-57e0-4bc7-97d2-2c780b23bcffÙ$2025ff38-f2ec-4224-b771-ff72ffe1af28Ù$77906355-08f8-4b08-b051-84697199b519Ù$023f67b8-8f38-470a-9766-ac60a75678aaÙ$7c592385-e8d3-4efe-962c-d39debb64405Ù$d9d11d69-bc16-400a-8f46-f9a8ecb8516aÙ$192cc1cf-9ea1-492d-baa7-f2e197abecd4Ù$4c5cb75e-79b5-4502-b1eb-6246e002feafÙ$
8eb42403-1234-4e59-993e-057cc3a6d5c9Ù$6d0925d3-af96-4b94-8e2e-4941cce39e51Ù$dc2efc6c-8da8-425b-aa5f-290949109565Ù$ddbca73f-c692-46f2-95f3-a7dd849d33f7Ù$786a5385-b648-4fc3-8e19-bf6582828136Ù$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbfÙ$38acd032-1d18-4760-9111-67c9cdd2e892Ù$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2Ù$349631b2-4686-49a9-9f3a-1e4ad588b568Ù$ac9c8845-284d-4c21-b05d-d930f86598a3Ù$71a5fce8-6d9a-4625-bad1-a951d61bff28Ù$b53dba81-a9e9-41da-8fc2-7736bf25f2dcÙ$b8532822-179b-4cd5-a279-4b71dafb544aÙ$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580aÙ$c87dba8c-9a96-41b3-9dc7-a6c088ec1eafÙ$fee14dfe-c5ca-4126-a830-cc9d7eda5433Ù$cd9c9eeb-c90d-4499-9503-7773d5250f47Ù$b695ef21-a1ac-4d1f-a0e1-71cd81cede18Ù$e524f8cc-ab69-4f8b-a59f-28156696a104Ù$0d45ae72-572f-4d17-83cf-9814f2854131Ù$0d93132d-5819-47dc-8cf2-462d480d9c3dÙ$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7dÙ$a0ca7a5e-0089-4a45-9278-c0f27cd096a0Ù$5207308e-f636-4d47-b135-036a6e7b8ecdÙ$ff3009eb-23f9-44fe-8e56-85dbc7b463d0Ù$b7f77935-bcab-4ef1-8e1b-a7d059784ff3Ù$6c5e9bb2-4c38-4613-9652-dec99e97b512Ù$f8215517-b18f-4a03-9421-8edab4ca8089Ù$d2729657-d0bf-4d39-8ec7-f242a1ad48d6Ù$8e096fae-9941-49d8-ae87-c68b02f68da5Ù$44f14d4f-7414-4c6f-883a-042ca261a403Ù$6c5f51bb-a6be-447e-b73d-4f9c2885e809Ù$4156d955-9daf-4429-b152-e8332980fb9eÙ$d82e7ab8-c372-4462-afb5-1617560cdb56Ù$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8fÙ$16113560-e911-47b4-abc4-641bbd246454Ù$a7891c63-18d6-4c1f-ba67-adf7c547d334Ù$7126aefd-b847-497a-9545-514e9b9afa71Ù$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09Ù$f9facbba-39d4-483e-9066-275603156db0Ù$bbc8864a-1545-433f-bc7c-0ddf6e907138Ù$68469a40-7976-48b7-b7a1-eaa4c5f33a18Ù$ba645f6b-143f-4e83-9003-707770ae308dÙ$b5319d8b-0420-4ebf-b603-ea0b93365ac1Ù$4c34640f-efa2-4e1d-8a70-0acd2ce45428Ù$f7ede764-5ad8-426b-a805-cc21b622d977Ù$3ea08816-705e-4be7-a175-dbd3f3e4c17dÙ$5d50a5d0-8fe2-4c6e-b76c-d5614e4fd884Ù$0ab70fc3-6188-42eb-aba2-d808f319be9fÙ$df7f84e8-b42a-4001-9dbf-6bc3ced94207Ù$d963ff6d-f1b6-4799-aa0e-1ae100310d84Ù$7cf26604-9c2b-4a77-9674-7d4dac2f99f0Ù$d04d4234-d97f-11ed-2ea3-
85ee0fc3bd70Ù$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaebÙ$f59a5dcd-9f4a-4336-a391-e64af35ef799±published_objects€¥nbpkgŠ¯install_time_nsÏà¨¬instantiatedÃ²installed_versionsÞªStatistics¦stdlib«Transducers¦0.4.84©StatsBase§0.33.21§Memoize¥0.4.4Distributions¨0.25.117§PlutoUI¦0.7.61¯ProgressLogging¥0.1.4¦Random¦stdlib®BenchmarkTools¥1.3.2¬StaticArrays¦1.5.26LinearAlgebra¦stdlib®PlutoDevMacros¥0.9.0ªDataFrames¥1.7.0¬PlutoProfile¥0.4.0°HypertextLiteral¥0.9.5°SpecialFunctions¥2.5.0¬LaTeXStrings¥1.3.1«PlutoPlotly¥0.3.9°terminal_outputsÞªStatisticsÚó [0m[1mResolving...[22m [90m===[39m [32m[1m Installed[22m[39m PDMats â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.11.32 Installed Crayons â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v4.1.1 [32m[1m[22m[39m Installed HypergeometricFunctions â”€ v0.3.27 [32m[1m[22m[39m Installed Accessors â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.1.41 [32m[1m[22m[39m Installed PlotlyBase â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.8.20 [32m[1m[22m[39m Installed SpecialFunctions â”€â”€â”€â”€â”€â”€â”€â”€ v2.5.0 [32m[1m[22m[39m Installed PrettyTables â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v2.4.0 [32m[1m[22m[39m Installed MIMEs â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.0.0 [32m[1m[22m[39m Installed JuliaInterpreter â”€â”€â”€â”€â”€â”€â”€â”€ v0.9.41 [32m[1m[22m[39m Installed InvertedIndices â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.3.1 [32m[1m[22m[39m Installed PlutoUI â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.7.61 [32m[1m[22m[39m Installed StaticArrays â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.5.26 [32m[1m[22m[39m Installed DataFrames â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.7.0 [32m[1m[22m[39m Installed Memoize â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.4.4 [32m[1m[22m[39m Installed Distributions â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.25.117 [32m[1m[22m[39m Installed StringManipulation â”€â”€â”€â”€â”€â”€ v0.4.1 [32m[1m[22m[39m Installed MacroTools â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.5.15 
[32m[1m[22m[39m No Changes to `/tmp/jl_E3c5X0/Project.toml` [32m[1m[22m[39m [32m[1m No Changes[22m[39m to `/tmp/jl_E3c5X0/Manifest.toml` [0m[1mInstantiating...[22m [90m===[39m [0m[1mPrecompiling...[22m [90m===[39m [32m[1m Activating[22m[39m project at `/tmp/jl_E3c5X0` [92m[1mPrecompiling[22m[39m project... 47 dependencies successfully precompiled in 90 seconds. 108 already precompiled. [33m1[39m dependency had output during precompilation:[33m â”Œ [39mPlutoPlotly[33m â”‚ [39mâ”Œ Warning: You are trying to show a PlutoPlot outside of Pluto, this is not the intended behavior and you should use either PlotlyBase or PlotlyJS directly.[33m â”‚ [39mâ”‚ NOTE: If you receive this warning during pre-compilation or sysimage creation, you can ignore this warning.[33m â”‚ [39mâ”” @ PlutoPlotly ~/.julia/packages/PlutoPlotly/5DpMg/notebooks/wrapper.jl:43[33m â”” [39m«TransducersÚó [0m[1mResolving...[22m [90m===[39m [32m[1m Installed[22m[39m PDMats â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.11.32 Installed Crayons â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v4.1.1 [32m[1m[22m[39m Installed HypergeometricFunctions â”€ v0.3.27 [32m[1m[22m[39m Installed Accessors â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.1.41 [32m[1m[22m[39m Installed PlotlyBase â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.8.20 [32m[1m[22m[39m Installed SpecialFunctions â”€â”€â”€â”€â”€â”€â”€â”€ v2.5.0 [32m[1m[22m[39m Installed PrettyTables â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v2.4.0 [32m[1m[22m[39m Installed MIMEs â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.0.0 [32m[1m[22m[39m Installed JuliaInterpreter â”€â”€â”€â”€â”€â”€â”€â”€ v0.9.41 [32m[1m[22m[39m Installed InvertedIndices â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.3.1 [32m[1m[22m[39m Installed PlutoUI â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.7.61 [32m[1m[22m[39m Installed StaticArrays â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.5.26 [32m[1m[22m[39m Installed DataFrames â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ 
v1.7.0 [32m[1m[22m[39m Installed Memoize â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.4.4 [32m[1m[22m[39m Installed Distributions â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.25.117 [32m[1m[22m[39m Installed StringManipulation â”€â”€â”€â”€â”€â”€ v0.4.1 [32m[1m[22m[39m Installed MacroTools â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.5.15 [32m[1m[22m[39m No Changes to `/tmp/jl_E3c5X0/Project.toml` [32m[1m[22m[39m [32m[1m No Changes[22m[39m to `/tmp/jl_E3c5X0/Manifest.toml` [0m[1mInstantiating...[22m [90m===[39m [0m[1mPrecompiling...[22m [90m===[39m [32m[1m Activating[22m[39m project at `/tmp/jl_E3c5X0` [92m[1mPrecompiling[22m[39m project... 47 dependencies successfully precompiled in 90 seconds. 108 already precompiled. [33m1[39m dependency had output during precompilation:[33m â”Œ [39mPlutoPlotly[33m â”‚ [39mâ”Œ Warning: You are trying to show a PlutoPlot outside of Pluto, this is not the intended behavior and you should use either PlotlyBase or PlotlyJS directly.[33m â”‚ [39mâ”‚ NOTE: If you receive this warning during pre-compilation or sysimage creation, you can ignore this warning.[33m â”‚ [39mâ”” @ PlutoPlotly ~/.julia/packages/PlutoPlotly/5DpMg/notebooks/wrapper.jl:43[33m â”” [39m§MemoizeÚó [0m[1mResolving...[22m [90m===[39m [32m[1m Installed[22m[39m PDMats â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.11.32 Installed Crayons â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v4.1.1 [32m[1m[22m[39m Installed HypergeometricFunctions â”€ v0.3.27 [32m[1m[22m[39m Installed Accessors â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.1.41 [32m[1m[22m[39m Installed PlotlyBase â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.8.20 [32m[1m[22m[39m Installed SpecialFunctions â”€â”€â”€â”€â”€â”€â”€â”€ v2.5.0 [32m[1m[22m[39m Installed PrettyTables â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v2.4.0 [32m[1m[22m[39m Installed MIMEs â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.0.0 [32m[1m[22m[39m Installed JuliaInterpreter â”€â”€â”€â”€â”€â”€â”€â”€ v0.9.41 
[32m[1m[22m[39m Installed InvertedIndices â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.3.1 [32m[1m[22m[39m Installed PlutoUI â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.7.61 [32m[1m[22m[39m Installed StaticArrays â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.5.26 [32m[1m[22m[39m Installed DataFrames â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.7.0 [32m[1m[22m[39m Installed Memoize â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.4.4 [32m[1m[22m[39m Installed Distributions â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.25.117 [32m[1m[22m[39m Installed StringManipulation â”€â”€â”€â”€â”€â”€ v0.4.1 [32m[1m[22m[39m Installed MacroTools â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.5.15 [32m[1m[22m[39m No Changes to `/tmp/jl_E3c5X0/Project.toml` [32m[1m[22m[39m [32m[1m No Changes[22m[39m to `/tmp/jl_E3c5X0/Manifest.toml` [0m[1mInstantiating...[22m [90m===[39m [0m[1mPrecompiling...[22m [90m===[39m [32m[1m Activating[22m[39m project at `/tmp/jl_E3c5X0` [92m[1mPrecompiling[22m[39m project... 47 dependencies successfully precompiled in 90 seconds. 108 already precompiled. 
[33m1[39m dependency had output during precompilation:[33m â”Œ [39mPlutoPlotly[33m â”‚ [39mâ”Œ Warning: You are trying to show a PlutoPlot outside of Pluto, this is not the intended behavior and you should use either PlotlyBase or PlotlyJS directly.[33m â”‚ [39mâ”‚ NOTE: If you receive this warning during pre-compilation or sysimage creation, you can ignore this warning.[33m â”‚ [39mâ”” @ PlutoPlotly ~/.julia/packages/PlutoPlotly/5DpMg/notebooks/wrapper.jl:43[33m â”” [39m¯ProgressLoggingÚó [0m[1mResolving...[22m [90m===[39m [32m[1m Installed[22m[39m PDMats â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.11.32 Installed Crayons â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v4.1.1 [32m[1m[22m[39m Installed HypergeometricFunctions â”€ v0.3.27 [32m[1m[22m[39m Installed Accessors â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.1.41 [32m[1m[22m[39m Installed PlotlyBase â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.8.20 [32m[1m[22m[39m Installed SpecialFunctions â”€â”€â”€â”€â”€â”€â”€â”€ v2.5.0 [32m[1m[22m[39m Installed PrettyTables â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v2.4.0 [32m[1m[22m[39m Installed MIMEs â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.0.0 [32m[1m[22m[39m Installed JuliaInterpreter â”€â”€â”€â”€â”€â”€â”€â”€ v0.9.41 [32m[1m[22m[39m Installed InvertedIndices â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.3.1 [32m[1m[22m[39m Installed PlutoUI â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.7.61 [32m[1m[22m[39m Installed StaticArrays â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.5.26 [32m[1m[22m[39m Installed DataFrames â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.7.0 [32m[1m[22m[39m Installed Memoize â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.4.4 [32m[1m[22m[39m Installed Distributions â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.25.117 [32m[1m[22m[39m Installed StringManipulation â”€â”€â”€â”€â”€â”€ v0.4.1 [32m[1m[22m[39m Installed MacroTools â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.5.15 [32m[1m[22m[39m No Changes to 
`/tmp/jl_E3c5X0/Project.toml` [32m[1m[22m[39m [32m[1m No Changes[22m[39m to `/tmp/jl_E3c5X0/Manifest.toml` [0m[1mInstantiating...[22m [90m===[39m [0m[1mPrecompiling...[22m [90m===[39m [32m[1m Activating[22m[39m project at `/tmp/jl_E3c5X0` [92m[1mPrecompiling[22m[39m project... 47 dependencies successfully precompiled in 90 seconds. 108 already precompiled. [33m1[39m dependency had output during precompilation:[33m â”Œ [39mPlutoPlotly[33m â”‚ [39mâ”Œ Warning: You are trying to show a PlutoPlot outside of Pluto, this is not the intended behavior and you should use either PlotlyBase or PlotlyJS directly.[33m â”‚ [39mâ”‚ NOTE: If you receive this warning during pre-compilation or sysimage creation, you can ignore this warning.[33m â”‚ [39mâ”” @ PlutoPlotly ~/.julia/packages/PlutoPlotly/5DpMg/notebooks/wrapper.jl:43[33m â”” [39mDistributionsÚó [0m[1mResolving...[22m [90m===[39m [32m[1m Installed[22m[39m PDMats â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.11.32 Installed Crayons â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v4.1.1 [32m[1m[22m[39m Installed HypergeometricFunctions â”€ v0.3.27 [32m[1m[22m[39m Installed Accessors â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.1.41 [32m[1m[22m[39m Installed PlotlyBase â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.8.20 [32m[1m[22m[39m Installed SpecialFunctions â”€â”€â”€â”€â”€â”€â”€â”€ v2.5.0 [32m[1m[22m[39m Installed PrettyTables â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v2.4.0 [32m[1m[22m[39m Installed MIMEs â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.0.0 [32m[1m[22m[39m Installed JuliaInterpreter â”€â”€â”€â”€â”€â”€â”€â”€ v0.9.41 [32m[1m[22m[39m Installed InvertedIndices â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.3.1 [32m[1m[22m[39m Installed PlutoUI â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.7.61 [32m[1m[22m[39m Installed StaticArrays â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.5.26 [32m[1m[22m[39m Installed DataFrames â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.7.0 [32m[1m[22m[39m Installed 
Memoize â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.4.4 [32m[1m[22m[39m Installed Distributions â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.25.117 [32m[1m[22m[39m Installed StringManipulation â”€â”€â”€â”€â”€â”€ v0.4.1 [32m[1m[22m[39m Installed MacroTools â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.5.15 [32m[1m[22m[39m No Changes to `/tmp/jl_E3c5X0/Project.toml` [32m[1m[22m[39m [32m[1m No Changes[22m[39m to `/tmp/jl_E3c5X0/Manifest.toml` [0m[1mInstantiating...[22m [90m===[39m [0m[1mPrecompiling...[22m [90m===[39m [32m[1m Activating[22m[39m project at `/tmp/jl_E3c5X0` [92m[1mPrecompiling[22m[39m project... 47 dependencies successfully precompiled in 90 seconds. 108 already precompiled. [33m1[39m dependency had output during precompilation:[33m â”Œ [39mPlutoPlotly[33m â”‚ [39mâ”Œ Warning: You are trying to show a PlutoPlot outside of Pluto, this is not the intended behavior and you should use either PlotlyBase or PlotlyJS directly.[33m â”‚ [39mâ”‚ NOTE: If you receive this warning during pre-compilation or sysimage creation, you can ignore this warning.[33m â”‚ [39mâ”” @ PlutoPlotly ~/.julia/packages/PlutoPlotly/5DpMg/notebooks/wrapper.jl:43[33m â”” [39m«PlutoPlotlyÚó [0m[1mResolving...[22m [90m===[39m [32m[1m Installed[22m[39m PDMats â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.11.32 Installed Crayons â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v4.1.1 [32m[1m[22m[39m Installed HypergeometricFunctions â”€ v0.3.27 [32m[1m[22m[39m Installed Accessors â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.1.41 [32m[1m[22m[39m Installed PlotlyBase â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.8.20 [32m[1m[22m[39m Installed SpecialFunctions â”€â”€â”€â”€â”€â”€â”€â”€ v2.5.0 [32m[1m[22m[39m Installed PrettyTables â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v2.4.0 [32m[1m[22m[39m Installed MIMEs â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.0.0 [32m[1m[22m[39m Installed JuliaInterpreter â”€â”€â”€â”€â”€â”€â”€â”€ v0.9.41 [32m[1m[22m[39m Installed 
InvertedIndices â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.3.1 [32m[1m[22m[39m Installed PlutoUI â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.7.61 [32m[1m[22m[39m Installed StaticArrays â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.5.26 [32m[1m[22m[39m Installed DataFrames â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v1.7.0 [32m[1m[22m[39m Installed Memoize â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.4.4 [32m[1m[22m[39m Installed Distributions â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.25.117 [32m[1m[22m[39m Installed StringManipulation â”€â”€â”€â”€â”€â”€ v0.4.1 [32m[1m[22m[39m Installed MacroTools â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€ v0.5.15 [32m[1m[22m[39m No Changes to `/tmp/jl_E3c5X0/Project.toml` [32m[1m[22m[39m [32m[1m No Changes[22m[39m to `/tmp/jl_E3c5X0/Manifest.toml` [0m[1mInstantiating...[22m [90m===[39m [0m[1mPrecompiling...[22m [90m===[39m [32m[1m Activating[22m[39m project at `/tmp/jl_E3c5X0` [92m[1mPrecompiling[22m[39m project... 47 dependencies successfully precompiled in 90 seconds. 108 already precompiled. 
[33m1[39m dependency had output during precompilation:[33m â”Œ [39mPlutoPlotly[33m â”‚ [39mâ”Œ Warning: You are trying to show a PlutoPlot outside of Pluto, this is not the intended behavior and you should use either PlotlyBase or PlotlyJS directly.[33m â”‚ [39mâ”‚ NOTE: If you receive this warning during pre-compilation or sysimage creation, you can ignore this warning.[33m â”‚ [39mâ”” @ PlutoPlotly ~/.julia/packages/PlutoPlotly/5DpMg/notebooks/wrapper.jl:43[33m â”” [39m§enabledÃ·restart_recommended_msgÀ´restart_required_msgÀbusy_packages¶waiting_for_permissionÂÙ,waiting_for_permission_but_probably_disabledÂ«cell_inputsÞ§Ù$4f96be72-ef3e-4e08-ac4c-be4271dcd14c„§cell_idÙ$4f96be72-ef3e-4e08-ac4c-be4271dcd14c¤code ¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$19dfabda-7049-4050-8662-0385529c0c5a„§cell_idÙ$19dfabda-7049-4050-8662-0385529c0c5a¤codeÙî@bind sref_cartpole_binary PlutoUI.combine() do Child md""" x position: $(Child(:x, Slider(-50f0:50f0, default = 0f0, show_value=true))) x velocity: $(Child(:xÌ‡, Slider(-50f0:50f0, default = 0f0, show_value=true))) """ end |> confirm¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$b71145a4-2614-4f62-bfd2-7d5d1fecec56„§cell_idÙ$b71145a4-2614-4f62-bfd2-7d5d1fecec56¤codeÚ š#version of reinforce for general function approximation function actor_critic_with_eligibility_traces!(policy_params::P1, âˆ‡lnÏ€, value_params::P2, âˆ‡vÌ‚, mdp::ContinuousMDP{T, S, A, PTF, F1, F2, F3}, Î»_Î¸::T, Î»_w::T, update_action_distribution!::Function, action_dist_params::Vector{T}, action_sampler::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, value_function::Function, update_value_gradient!::Function, max_episodes::Integer, max_steps::Integer; Î±_w::T = one(T)/10, Î±_Î¸::T = one(T)/10, Î³::T = one(T), z_Î¸::P1 = deepcopy(policy_params), z_w::P2 = deepcopy(value_params), save_step_rewards = false) where {P1, P2, T<:Real, S, A, PTF, F1, F2, F3} step_rewards = Vector{T}() 
episode_steps = Vector{Int64}() episode_rewards = Vector{T}() #initialize variables ep = 1 step = 1 rtot = zero(T) c = one(T) zero_params!(z_Î¸) zero_params!(z_w) s = mdp.initialize_state() update_feature_vector!(x, s) while (ep <= max_episodes) && (step <= max_steps) update_value_gradient!(âˆ‡vÌ‚, x, value_params) vÌ‚ = value_function(x, value_params) update_action_distribution!(action_dist_params, x, policy_params) a = action_sampler(action_dist_params) if bad_continuous_action(a) @info "terminating after $step steps and episode $ep due to invalid continuous action $a taken in state $s with action distribution parameters $action_dist_params" push!(episode_steps, max_steps) push!(episode_rewards, typemin(T)) break end update_eligibility_vector!(âˆ‡lnÏ€, action_dist_params, x, a, policy_params) (r, sâ€²) = mdp.ptf(s, a) rtot += r save_step_rewards && push!(step_rewards, r) step += 1 if mdp.isterm(sâ€²) push!(episode_steps, step) push!(episode_rewards, rtot) vÌ‚â€² = zero(T) rtot = zero(T) zero_params!(z_Î¸) zero_params!(z_w) ep += 1 c = one(T) s = mdp.initialize_state() update_feature_vector!(x, s) else update_feature_vector!(x, sâ€²) vÌ‚â€² = value_function(x, value_params) s = sâ€² c *= Î³ end Î´ = r + Î³*vÌ‚â€² - vÌ‚ update_traces_with_gradient!(Î³*Î»_w, z_w, âˆ‡vÌ‚) update_traces_with_gradient!(Î³*Î»_Î¸, z_Î¸, c, âˆ‡lnÏ€) update_params_with_gradient!(value_params, Î±_w*Î´, z_w) update_params_with_gradient!(policy_params, Î±_Î¸*c*Î´, z_Î¸) end function_outputs = form_state_and_policy_function_outputs(update_feature_vector!, update_action_distribution!, action_dist_params, action_sampler, value_function, x, policy_params, value_params) return (;step_rewards = step_rewards, episode_steps = episode_steps, episode_rewards = episode_rewards, policy_parameters = policy_params, value_parameters = value_params, function_outputs...) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$c0876a48-ea18-494d-8bfc-e2bceb73b417„§cell_idÙ$c0876a48-ea18-494d-8bfc-e2bceb73b417¤codeÙ‡plot_mountaincar_values(mountaincar_continuing_fcann_test.estimate_state_value, mountaincar_continuing_fcann_test.policy_sample_action)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091„§cell_idÙ$1d36ae81-d3da-45c0-bbcf-0b6e0e80b091¤codeÚfunction reinforce_monte_carlo_control_fcann(mdp::StateMDP{T, S, A, P, F1, F2, F3}, input_length::Integer, hidden_layers::Vector{Int64}, update_feature_vector!::Function,max_episodes::Integer; params::FCANNParams = FCANN.initializeparams_saxe(input_length, hidden_layers, length(mdp.actions)), reslayers = 0, l2 = 0f0, dropout = 0f0, use_Î¼P = true, activation_list = fill(true, length(hidden_layers)), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} setup = setup_fcann_policy_arguments(params, input_length, hidden_layers, reslayers, l2, dropout, use_Î¼P, activation_list) reinforce_monte_carlo_control!(params, setup.eligibility_vector, mdp, setup.update_action_preferences!, setup.update_eligibility_vector!, setup.feature_vector, update_feature_vector!, max_episodes; kwargs...) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$f4b6f10b-4cd0-4be6-98ec-4d4ffb696392„§cell_idÙ$f4b6f10b-4cd0-4be6-98ec-4d4ffb696392¤codeÙ4md""" ### *One-step Actor-Critic Implementation* """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$9db9ff71-bee9-4bea-a45b-748f8517fed1„§cell_idÙ$9db9ff71-bee9-4bea-a45b-748f8517fed1¤codeÙºone_step_actor_critic_linear_features(corridor_mdp, update_corridor_features!, 1, typemax(Int64), 100_000, Î±_Î¸ = 2f0^-8, Î±_w = 2f0^-8, policy_params = [0f0 3.7f0]).policy_and_value(1)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$4634267b-5dea-4164-8bb2-1eb2fd4d7954„§cell_idÙ$4634267b-5dea-4164-8bb2-1eb2fd4d7954¤codeÚ–function update_linear_eligibility_vector!(âˆ‡lnÏ€::Matrix{T}, action_preferences::Vector{T}, x::Vector{T}, i_a::Integer, params::Matrix{T}) where T<:AbstractFloat update_linear_action_preferences!(action_preferences, x, params) soft_max!(action_preferences) BLAS.gemm!('N', 'T', -one(T), x, action_preferences, zero(T), âˆ‡lnÏ€) @inbounds @simd for i in eachindex(x) âˆ‡lnÏ€[i, i_a] += x[i] end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$6c5f51bb-a6be-447e-b73d-4f9c2885e809„§cell_idÙ$6c5f51bb-a6be-447e-b73d-4f9c2885e809¤codeÙóactor_critic_binary_episodic_beta_parameter_study(mountaincar_continuous_mdp, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, mountaincar_binary_continuous_params2, 5, 3, 10000; max_steps = 100_000)¨metadataƒ©show_logsÃ¨disabledÃ®skip_as_scriptÂ«code_foldedÂÙ$cc45091e-b889-4d5a-9eef-84d80f792046„§cell_idÙ$cc45091e-b889-4d5a-9eef-84d80f792046¤codeÚmd""" ## 13.4 REINFORCE with Baseline The policy gradient theorem (13.5) can be generalized to include a comparison of the action value to an arbitrary *baseline* b(s): $\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s)\sum_a\left( q_\pi(s,a)-b(s) \right ) \nabla\pi(a|s,\boldsymbol{\theta}) \tag{13.10}$ The baseline can be any function, even a random 
variable, as long as it does not vary with $a$; the equation remains valid because the subtracted quantity is zero: $\sum_ab(s)\nabla\pi(a|s,\boldsymbol{\theta})=b(s)\nabla\sum_a\pi(a|s,\boldsymbol{\theta})=b(s)\nabla1=0$ The policy gradient theorem with baseline (13.10) can be used to derive an update rule using similar steps as in the previous section. The update rule that we end up with is a new version of REINFORCE that includes a general baseline: $\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t+\alpha(G_t-b(S_t))\frac{\nabla\pi(A_t|S_t,\boldsymbol{\theta}_t)}{\pi(A_t|S_t,\boldsymbol{\theta}_t)} \tag{13.11}$ Since the baseline could be uniformly zero, this is a strict generalization of REINFORCE. To have an effective baseline that depends on state we can use a state value estimate that is also updated with gradient steps: $\hat v(S_t, \mathbf{w})$. Using such an estimate we can revise the previous REINFORCE algorithm. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$5b15d91e-7119-4f85-a54a-7d4f1fdaf097„§cell_idÙ$5b15d91e-7119-4f85-a54a-7d4f1fdaf097¤codeÚ¯function create_actor_critic_continuing_params_UI(;Î»_Î¸ = 0.5f0, Î»_w = 0.5f0, log2Î±_Î¸ = -10, log2Î±_w = -10, Î±_rÌ„ = 0.005f0) PlutoUI.combine() do Child md""" $$\lambda_\theta$$: $(Child(:Î»_Î¸, Slider(0.00f0:0.001f0:.999f0, default = Î»_Î¸, show_value=true))) $$\lambda_\mathbf{w}$$: $(Child(:Î»_w, Slider(0.00f0:0.001f0:.999f0, default = Î»_w, show_value=true))) $$\alpha_{\overline{r}}$$: $(Child(:Î±_rÌ„, NumberField(0.00f0:0.001f0:1f0, default = Î±_rÌ„))) $$\log_2 \alpha_\theta$$ min: $(Child(:Î±_Î¸_min, NumberField(-100:0, default = log2Î±_Î¸))) $$\log_2 \alpha_{\mathbf{w}}$$ min: $(Child(:Î±_w_min, NumberField(-100:0, default = log2Î±_w))) """ end |> confirm end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$ba41f521-4ee2-42a6-bf18-078bfa4b875e„§cell_idÙ$ba41f521-4ee2-42a6-bf18-078bfa4b875e¤codeÚbegin make_n_param_dist_policy_params(n::Integer, num_features::Integer, 
::T) where T<:Real = zeros(T, num_features, n) make_n_param_dist_policy_params(n::Integer, num_features::Integer, ::NTuple{N, T}) where {N, T<:Real} = zeros(T, num_features, n*N) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03„§cell_idÙ$d41f0dc4-15ac-4f8f-acb5-a7ccd8d48f03¤codeÚYfunction cartpole_tilecoding_reinforce_parameter_study(Î±1_list, Î±2_list, max_episodes; num_trials = 100, kwargs...) setup = setup_cartpole_problem(;kwargs...) traces = [begin steps = [begin 1:num_trials |> Map() do i solution = reinforce_with_baseline_monte_carlo_control_binary_features(cartpole_setup.mdps.episodic.discrete, cartpole_setup.get_active_features, cartpole_setup.num_features, max_episodes; Î±_Î¸ = Î±1, Î±_w = Î±2) steps = solution.episode_steps isempty(steps) && return max_steps mean(steps) end |> foldxt(+) |> x -> x / num_trials end for Î±1 in Î±1_list] scatter(x = Î±1_list, y = steps, name = "Î±_w = $Î±2") end for Î±2 in Î±2_list] plot(traces, Layout(xaxis_title = "Policy Learning Rate Î±_Î¸", yaxis_title = "Average Episode Duration Over First $max_episodes Episodes", xaxis_type = "log")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$3c695d54-c30f-4f04-bd40-f5da53be2a95„§cell_idÙ$3c695d54-c30f-4f04-bd40-f5da53be2a95¤codeÙ/md""" ### *Cart Pole Continuous Action MDP* """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$0d45ae72-572f-4d17-83cf-9814f2854131„§cell_idÙ$0d45ae72-572f-4d17-83cf-9814f2854131¤codeÙg@bind mountaincar_binary_continuous_params2 create_actor_critic_params_UI(Î»_Î¸ = 0.05f0, Î»_w = 0.8f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$cd9c9eeb-c90d-4499-9503-7773d5250f47„§cell_idÙ$cd9c9eeb-c90d-4499-9503-7773d5250f47¤codeÙ‰show_mountaincar_continuous_trajectory(mountaincar_continuous_test_train2.policy_sample_action, 1_000; mdp = 
mountaincar_continuous_mdp2)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$fd58402f-da65-44cf-b81a-e21192fd0e63„§cell_idÙ$fd58402f-da65-44cf-b81a-e21192fd0e63¤codeÙ…plot_cartpole_policy(cartpole_continuing_fcann_test.policy_and_value; s_ref = CartPoleState(cartpole_fcann_continuing_test_state...))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$8e39bd15-862e-4941-88f9-2794b861a523„§cell_idÙ$8e39bd15-862e-4941-88f9-2794b861a523¤codeÚreinforce_monte_carlo_control_linear_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, max_episodes::Integer; params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), x::Vector{T} = zeros(T, num_features), âˆ‡lnÏ€::Matrix{T} = copy(params), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} = reinforce_monte_carlo_control!(params, âˆ‡lnÏ€, mdp, update_linear_action_preferences!, update_linear_eligibility_vector!, x, update_feature_vector!, max_episodes; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$64900586-ef92-48e4-839e-ff952a46671b„§cell_idÙ$64900586-ef92-48e4-839e-ff952a46671b¤codeÚ•test_study = actor_critic_linear_parameter_study(cartpole_continuing_mdp, s -> cartpole_tilecoding_setup.get_active_features((s.x, s.Î¸, s.xÌ‡, s.Î¸Ì‡)), cartpole_tilecoding_setup.num_features, LinRange(.5f0, .95f0, 10), LinRange(0.5f0, .95f0, 10), [0.005f0, 0.01f0, 0.05f0], 2f0 .^ (-5:-1), 2f0 .^ (-10:-5), 100, 10_000; nruns = 40, seed = 45, binary_features = true) |> df -> sort(df, :output; rev=true)¨metadataƒ©show_logsÃ¨disabledÃ®skip_as_scriptÂ«code_foldedÂÙ$fddef10c-7695-4596-9e16-987fd45a57e6„§cell_idÙ$fddef10c-7695-4596-9e16-987fd45a57e6¤codeÚfunction setup_cartpole_continuous_problem(;h = 4f-2, f = 300f0, x_max = 50f0, Î¸_max = deg2rad(70f0), xÌ‡_max = 50f0, Î¸Ì‡_max = 10f0, num_tiles = (8, 8, 8, 8), num_tilings = 8, kwargs...) 
tile_size = Tuple(1f0 / n for n in num_tiles) min_vals = (-x_max, -Î¸_max, -xÌ‡_max, -Î¸Ì‡_max) max_vals = (x_max, Î¸_max, xÌ‡_max, Î¸Ì‡_max) setup = tile_coding_setup(min_vals, max_vals, tile_size, num_tilings, (1, 3, 5, 7)) init_Î¸() = rand([-0.05f0, 0.05f0]) mdps = create_cartpole_mdps(h = h, fmax = f, x_max = x_max, Î¸_max = Î¸_max, xÌ‡_max = xÌ‡_max, Î¸Ì‡_max = Î¸Ì‡_max, init_Î¸ = init_Î¸, kwargs...) (mdps = mdps, get_active_features = s -> setup.get_active_features((s.x, s.Î¸, s.xÌ‡, s.Î¸Ì‡)), num_features = setup.num_features, min_vals = min_vals, max_vals = max_vals) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$e2b09af1-0f22-4f7f-b806-54fa522adb20„§cell_idÙ$e2b09af1-0f22-4f7f-b806-54fa522adb20¤codeÚ#note that due to the feature construction in this problem, the bootstrapping estimate is worthless so we'd expect this to do poorly at this task. Imagine that the value function starts out perfectly accurate for the initial policy with bad parameterization which we know finishes episodes in 90 to 100 steps. Then the Î´ on each step is just -1. 
Given the policy initialization, this is very poor progress since the improvement is only towards something still with worse performance than the completely random policy¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$2be8a812-4f21-4fe8-a2de-50497db0345a„§cell_idÙ$2be8a812-4f21-4fe8-a2de-50497db0345a¤codeÙHmd""" ### *Actor-Critic Implementation for Continuous Action Spaces* """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$68806899-9972-460a-9f11-daa708a9d610„§cell_idÙ$68806899-9972-460a-9f11-daa708a9d610¤codeÚÕactor_critic_with_eligibility_traces_linear_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, Î»_Î¸::T, Î»_w::T, update_feature_vector!::Function, num_features::Integer, args...; policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), value_params::Vector{T} = zeros(T, num_features), x = zeros(T, num_features), action_preferences = zeros(T, length(mdp.actions)), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_with_eligibility_traces!(policy_params, copy(policy_params), value_params, copy(value_params), mdp, Î»_Î¸, Î»_w, update_linear_action_preferences!, update_linear_eligibility_vector!, x, update_feature_vector!, linear_value_function, update_linear_value_gradient!, args...; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$189798b3-ec6b-48b9-918c-ee0f65935ab3„§cell_idÙ$189798b3-ec6b-48b9-918c-ee0f65935ab3¤codeÚÈmd""" > ### *Exercise 13.3* > In Section 13.1 we considered policy parameterizations using the soft-max in action preferences (13.2) with linear action preferences (13.3). For this parameterization, prove that the eligibility vector is > $\begin{flalign} > \nabla \ln \pi(a|s, \boldsymbol{\theta}) = \mathbf{x}(s, a) - \sum_b \pi(b|s, \boldsymbol{\theta}) \mathbf{x}(s, b) \tag{13.9} > \end{flalign}$ > using the definitions and elementary calculus. 
"""¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$00152954-dc98-4120-b94b-2ea4d987832b„§cell_idÙ$00152954-dc98-4120-b94b-2ea4d987832b¤codeÙÄfunction create_mountaincar_continuing_mdp() ptf = StateMDPTransitionSampler(mountaincar_continuing_step, (0f0, 0f0)) StateMDP(MountainCarTask.actions, ptf, MountainCarTask.initialize_state) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$42d4600a-bf3c-45ac-b7f5-d23917713ff5„§cell_idÙ$42d4600a-bf3c-45ac-b7f5-d23917713ff5¤codeÙÑ@bind cartpole_continuing_fcann_network_params PlutoUI.combine() do Child md""" Layer Size: $(Child(NumberField(1:128, default = 4))) Num Layers: $(Child(NumberField(1:10, default = 2))) """ end |> confirm¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$4e29c621-223e-4859-8e96-db04b967815a„§cell_idÙ$4e29c621-223e-4859-8e96-db04b967815a¤codeÚ´function setup_binary_squashed_gaussian_policy_arguments(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, amax::A, get_active_features::Function, num_features::Integer) where {T<:Real, S, N, A<:Union{T, NTuple{N, T}}, P, F1, F2, F3} x = BinaryFeatureVector() update_feature_vector!(x::BinaryFeatureVector, s) = update_binary_feature_vector!(x, s, get_active_features) sample_action = rand(A) action_dist_params = make_n_param_dist_params(2, sample_action) âˆ‡lnÏ€ = BinarySquashedGaussianEligibilityVector(sample_action, amax) return (feature_vector = x, update_feature_vector! 
= update_feature_vector!, action_distribution_parameters = action_dist_params, eligibility_vector = âˆ‡lnÏ€) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$5981f52b-d829-4c7d-b47b-33310f7d64a2„§cell_idÙ$5981f52b-d829-4c7d-b47b-33310f7d64a2¤codeÙRmake_Ïµ_greedy_policy!(corridor_train.value_function(1).action_values; Ïµ = 0.0f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$0e9de19e-bcd4-40ac-9831-afb6cad38422„§cell_idÙ$0e9de19e-bcd4-40ac-9831-afb6cad38422¤codeÚÊfunction setup_fcann_policy_arguments(params::FCANNParams{T}, input_length::Integer, hidden_layers::Vector{Int64}, reslayers::Integer, l2::T, dropout::T, use_Î¼P::Bool, activation_list) where {T<:Real} x = zeros(T, input_length) activations = FCANN.form_activations(params[1]) tanh_grad_z = deepcopy(activations) deltas = deepcopy(activations) scales = fill(one(T), length(params[1])) if use_Î¼P for i in eachindex(hidden_layers) iâ€² = i + 1 scales[iâ€²] /= size(params[1][iâ€²], 2) end end âˆ‡lnÏ€ = deepcopy(params) update_eligibility_vector!(âˆ‡lnÏ€::FCANNParams, action_preferences::Vector{T}, x, i_a, params::FCANNParams) = update_fcann_eligibility_vector!(âˆ‡lnÏ€, action_preferences, x, i_a, params, hidden_layers, l2, tanh_grad_z, activations, deltas, dropout, reslayers, activation_list, scales) update_action_preferences!(action_preferences::Vector{T}, x::Vector{T}, params::FCANNParams) = update_fcann_action_preferences!(action_preferences, x, params, activations, reslayers) return (feature_vector = x, params = params, eligibility_vector = âˆ‡lnÏ€, update_eligibility_vector! = update_eligibility_vector!, update_action_preferences! 
= update_action_preferences!, scales = scales) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$ff3009eb-23f9-44fe-8e56-85dbc7b463d0„§cell_idÙ$ff3009eb-23f9-44fe-8e56-85dbc7b463d0¤codeÙufunction show_squashed_policy(Ï€::Function, s) pdist=Ï€(s) plot_squashed_gaussian(pdist[1], exp(pdist[2]), 1f0) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$4fb83451-b6f8-4e6e-a131-1accc8e10b08„§cell_idÙ$4fb83451-b6f8-4e6e-a131-1accc8e10b08¤codeÚ B#version of reinforce for general function approximation function reinforce_with_baseline_monte_carlo_control!(policy_params, âˆ‡lnÏ€, value_params, âˆ‡vÌ‚, mdp::StateMDP{T, S, A, PTF, F1, F2, F3}, update_action_preferences!::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, value_function::Function, update_value_gradient!::Function, max_episodes::Integer; Î±_w::T = one(T)/10, Î±_Î¸::T = one(T)/10, Î³::T = one(T), action_preferences = zeros(T, length(mdp.actions)), epkwargs...) where {T<:Real, S, A, PTF, F1, F2, F3} rewards = zeros(T, max_episodes) steps = zeros(Int64, max_episodes) Ï€! = form_state_policy_function(update_feature_vector!, update_action_preferences!) Ï€(s) = Ï€!(x, action_preferences, s, policy_params) Ï€_sample(s) = sample_action(Ï€(s)) v! = form_state_value_function(update_feature_vector!, value_function) estimate_state_value(s) = v!(x, s, value_params) state_history, action_history, reward_history, _, _ = runepisode(mdp; Ï€ = Ï€_sample, max_steps = 0) #initialize variables to update episodes for i in eachindex(rewards) # @info "On episode $i of $max_episodes" state_history, action_history, reward_history, sterm, nsteps = runepisode!((state_history, action_history, reward_history), mdp; Ï€ = Ï€_sample, epkwargs...) 
g = zero(T) rtotal = zero(T) #iterate through episode beginning at the end for i in nsteps:-1:1 g = (Î³ * g) + reward_history[i] update_feature_vector!(x, state_history[i]) vÌ‚ = value_function(x, value_params) Î´ = g - vÌ‚ update_value_gradient!(âˆ‡vÌ‚, x, value_params) c = Î±_w*Î´ update_params_with_gradient!(value_params, c, âˆ‡vÌ‚) update_eligibility_vector!(âˆ‡lnÏ€, action_preferences, x, action_history[i], policy_params) c = Î±_Î¸ * Î³^(i-1) * Î´ update_params_with_gradient!(policy_params, c, âˆ‡lnÏ€) rtotal += reward_history[i] end rewards[i] = rtotal steps[i] = nsteps end Ï€2(s; feature_vector = deepcopy(x), action_preferences = copy(action_preferences)) = Ï€!(feature_vector, action_preferences, s, policy_params) Ï€_sample2(s; kwargs...) = sample_action(Ï€2(s; kwargs...)) function policy_and_value(s::S) Ï€!(x, action_preferences, s, policy_params) vÌ‚ = value_function(x, value_params) return (action_probabilities = action_preferences, state_value_estimate = vÌ‚) end return (episode_rewards = rewards, episode_steps = steps, policy_function = Ï€2, policy_sample_action = Ï€_sample2, policy_parameters = policy_params, estimate_state_value = estimate_state_value, value_parameters = value_params, policy_and_value = policy_and_value) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$406638af-1e08-44d2-9ee4-97aa9294a94b„§cell_idÙ$406638af-1e08-44d2-9ee4-97aa9294a94b¤codeÙ-md""" ## 13.2 The Policy Gradient Theorem """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$57e5e12a-b722-4ea3-ab3b-e5711029e640„§cell_idÙ$57e5e12a-b722-4ea3-ab3b-e5711029e640¤codeÚÉone_step_actor_critic_linear_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, max_episodes::Integer, max_steps::Integer; policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), value_params::Vector{T} = zeros(T, num_features), x = zeros(T, num_features), action_preferences = zeros(T, length(mdp.actions)), kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = one_step_actor_critic!(policy_params, copy(policy_params), value_params, copy(value_params), mdp, update_linear_action_preferences!, update_linear_eligibility_vector!, x, update_feature_vector!, linear_value_function, update_linear_value_gradient!, max_episodes, max_steps; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$374af774-3a97-49b5-a3bb-bc3f7f63a3fa„§cell_idÙ$374af774-3a97-49b5-a3bb-bc3f7f63a3fa¤codeÙ)plot_cart(ep[1][ep_step], ep[2][ep_step])¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$7bf209c8-ef0a-46d1-937e-b1a6e45dc62e„§cell_idÙ$7bf209c8-ef0a-46d1-937e-b1a6e45dc62e¤codeÙ¨@bind beta_params PlutoUI.combine() do Child md""" Î±: $(Child(Slider(0.01:0.1:100; show_value=true))) Î²: $(Child(Slider(0.01:0.1:100, show_value=true))) """ end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$dd8e8cd2-7b41-46c4-8530-adefb7aea684„§cell_idÙ$dd8e8cd2-7b41-46c4-8530-adefb7aea684¤codeÚgfunction actor_critic_binary_episodic_beta_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, Î»_Î¸::T, Î»_w::T, Î±_Î¸_list::AbstractVector{T}, Î±_w_list::AbstractVector{T}, max_episodes::Integer; nruns::Integer = 100, max_steps::Integer = 10_000, seed = rand(UInt64), init_policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), init_value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) function average_runs(Î±_Î¸, Î±_w) 1:nruns |> Map(_ -> actor_critic_with_eligibility_traces_binary_features_beta_actions(mdp, Î»_Î¸, Î»_w, get_active_features, num_features, max_episodes, max_steps; Î±_Î¸ = Î±_Î¸, Î±_w = Î±_w, policy_params = copy(init_policy_params), value_params = copy(init_value_params), kwargs...) |> x -> isempty(x.episode_rewards) ? 
-T(Inf) : mean(x.episode_rewards)) |> foldxt(+) |> x -> x / nruns end traces = [begin scatter(x = Î±_Î¸_list, y = average_runs.(Î±_Î¸_list, Î±_w), name = "Î±_w = $Î±_w") end for Î±_w in Î±_w_list] plot(traces, Layout(xaxis_title = "Î±_Î¸", yaxis_title = "Average Reward Per Episode in the First
$max_episodes Episodes Averaged Over $nruns Runs", xaxis_type = "log", title = "Binary Feature Encoding with $num_features Features, Î»_Î¸ = $Î»_Î¸, Î»_w = $Î»_w")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$4fea7232-f286-4a8b-93f8-a0702818ab31„§cell_idÙ$4fea7232-f286-4a8b-93f8-a0702818ab31¤codeÙ8md""" #### Test Actor-Critic with Eligibility Traces """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$26880577-d267-4950-8725-7afe0d0402b6„§cell_idÙ$26880577-d267-4950-8725-7afe0d0402b6¤codeÙ:const cartpole_setup = setup_cartpole_continuous_problem()¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$a7891c63-18d6-4c1f-ba67-adf7c547d334„§cell_idÙ$a7891c63-18d6-4c1f-ba67-adf7c547d334¤codeÙ@bind fcann_mountaincar_study_params create_actor_critic_fcann_params_UI(;Î»_Î¸ = 0.5f0, Î»_w = 0.5f0, h = 16, log2Î±_Î¸ = -10, log2Î±_w = -11)¨metadataƒ©show_logsÃ¨disabledÃ®skip_as_scriptÂ«code_foldedÂÙ$44f14d4f-7414-4c6f-883a-042ca261a403„§cell_idÙ$44f14d4f-7414-4c6f-883a-042ca261a403¤codeÙK@bind mountaincar_binary_continuous_params2 create_actor_critic_params_UI()¨metadataƒ©show_logsÃ¨disabledÃ®skip_as_scriptÂ«code_foldedÃÙ$94354552-9920-4b90-98d9-f75286d1f53e„§cell_idÙ$94354552-9920-4b90-98d9-f75286d1f53e¤codeÙ`corridor_parameter_studies(1.5f0 .^(-24:-20), 1.25f0 .^ (-27:-20), 2f0 .^(-3:-1); nruns = 1_000)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$e5faaa1b-88cb-43e2-8d04-8972b58b4bda„§cell_idÙ$e5faaa1b-88cb-43e2-8d04-8972b58b4bda¤codeÚÉbegin v1(p) = -2*(1+p)/((1-p)*p) v2(p) = -(p+2)/((1-p)*p) v3(p) = -3/(1-p) plist = 0.:0.001:1. 
traces = [scatter(x = plist, y = f.(1 .- plist), name = n) for (f, n) in zip([v1, v2, v3], ["V(S1)", "V(S2)", "V(S3)"])] plot(traces, Layout(font_color = "LightGray", plot_bgcolor = bgcolor, paper_bgcolor = "rgb(40, 40, 40)", yaxis_range = [-100, 0], xaxis_title = "probability of right action", yaxis_title = "State Value", width = 900, height = 600)) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$70096b14-beab-4f71-9886-6355c749bb8a„§cell_idÙ$70096b14-beab-4f71-9886-6355c749bb8a¤codeÚÔmd""" We previously derived an expression for the gradient of the policy itself in the case of linear action preferences: $\begin{flalign} h_a &= \boldsymbol{\theta}^\top \mathbf{x}(s, a) \\ \pi_a &= \frac{e^{h_a}}{\sum_k e^{h_k}} \\ \nabla(\pi_a)_i &= \pi_a \left ( \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \right) \end{flalign}$ Applying the chain rule to the natural logarithm produces: $\nabla \left ( \ln f(\theta) \right) = \frac{\nabla f(\theta)}{f(\theta)} \implies \nabla \left ( \ln f(\theta) \right )_i = \frac{\nabla \left ( f(\theta) \right )_i}{f(\theta)}$ Applying this to the above expression yields: $\begin{flalign} \nabla \left ( \ln \pi_a \right )_i &= \frac{\nabla \left ( \pi_a \right )_i}{\pi_a} \\ &= \frac{\pi_a \left ( \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \right)}{\pi_a} \\ &= \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \end{flalign}$ which is the per component version of the desired vector expression. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$90d3b96b-ad2b-405c-951b-f48ec7ccf24a„§cell_idÙ$90d3b96b-ad2b-405c-951b-f48ec7ccf24a¤codeÙÕmd""" The final expected value expression (13.5) can be sampled on a step by step basis during an episode since we would have access to both the step count and some unbiased sample of the state-action value. 
"""¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$700dcbc4-c94c-4287-8cf0-0b2c7a320a3a„§cell_idÙ$700dcbc4-c94c-4287-8cf0-0b2c7a320a3a¤codeÙ1reinforce_test5.policy_and_value(CartPoleState())¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$f59a5dcd-9f4a-4336-a391-e64af35ef799„§cell_idÙ$f59a5dcd-9f4a-4336-a391-e64af35ef799¤codeÙÉhtml""" """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÃ«code_foldedÂÙ$5864a5a3-a5a5-43c2-9cb4-7d13b2d20bed„§cell_idÙ$5864a5a3-a5a5-43c2-9cb4-7d13b2d20bed¤codeÚHmd""" Normal Distribution: $f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{(x - \mu)^2}{2 \sigma^2}}$ Consider a new random variable $Y \sim \tanh(X)$ where $X \sim N(0, 1)$. Using the change of variables theorem from probability theory we can compute the density function of $Y$: $f_Y(y) = f_X (g^{-1}(y)) \cdot \left \vert \frac{d}{dy} g^{-1}(y) \right \vert$ where $g(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ so $f_Y(y) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{\left (\tanh^{-1}(y) - \mu \right )^2}{2 \sigma^2}} \left \vert \frac{1}{1 - y^2} \right \vert$ """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$e3a2fb12-37ce-4c23-ad93-5fc89991aabb„§cell_idÙ$e3a2fb12-37ce-4c23-ad93-5fc89991aabb¤codeÙNmd""" ### Eligibility Vector for General Soft-Max and State Feature Vector """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$e5c1aca8-7575-4835-8273-e69ca0a55fe8„§cell_idÙ$e5c1aca8-7575-4835-8273-e69ca0a55fe8¤codeÚáfunction corridor_parameter_studies(Î±_list, Î±_Î¸_list, Î±_w_list; nruns = 100, num_episodes = 100, max_steps = 1_000) Random.seed!(45) function average_runs(Î±) 1:nruns |> Map(_ -> reinforce_monte_carlo_control_binary_features(corridor_mdp, get_corridor_features, 1, num_episodes, params = [0f0 3.7f0], Î± = Î±, max_steps = max_steps).episode_rewards |> sum) |> foldxt(+) |> x -> x / nruns / num_episodes end function average_runs(Î±_Î¸, Î±_w) 1:nruns |> Map(_ -> 
reinforce_with_baseline_monte_carlo_control_binary_features(corridor_mdp, get_corridor_features, 1, num_episodes, policy_params = [0f0 3.7f0], Î±_Î¸ = Î±_Î¸, Î±_w = Î±_w, max_steps = max_steps).episode_rewards |> sum) |> foldxt(+) |> x -> x / nruns / num_episodes end trace1 = scatter(x = Î±_list, y = average_runs.(Î±_list), name = "REINFORCE") with_baseline_traces = [begin scatter(x = Î±_Î¸_list, y = average_runs.(Î±_Î¸_list, Î±_w), name = "REINFORCE with Baseline: Î±_w = 2^$(round(Int64, log2(Î±_w)))") end for Î±_w in Î±_w_list] plot([trace1; with_baseline_traces], Layout(xaxis_title = "Policy Parameters Learning Rate", yaxis_title = "Average Reward Per Episode
Over First $num_episodes Episodes", xaxis_type = "log", yaxis_range = [-70, -10])) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$44b32cc0-36a8-41fd-89bc-ce894536926c„§cell_idÙ$44b32cc0-36a8-41fd-89bc-ce894536926c¤codeÙ$best_mc_corridor.policy_and_value(1)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$646bc853-b7fc-49fa-a201-ff98e8f952d4„§cell_idÙ$646bc853-b7fc-49fa-a201-ff98e8f952d4¤codeÚQfunction corridor_parameter_studies(Î±_Î¸_list, Î±_w_list; nruns = 100, max_episodes = 100, max_steps = 1_000_000) # Random.seed!(45) function average_runs(Î±_Î¸, Î±_w) 1:nruns |> Map(_ -> one_step_actor_critic_binary_features(corridor_mdp, get_corridor_features, 1, max_episodes, max_steps, policy_params = [0f0 3.7f0], Î±_Î¸ = Î±_Î¸, Î±_w = Î±_w) |> x -> isempty(x.episode_rewards) ? -Inf32 : (sum(x.episode_rewards) / length(x.episode_rewards))) |> foldxt(+) |> x -> x / nruns end traces = [begin scatter(x = Î±_Î¸_list, y = average_runs.(Î±_Î¸_list, Î±_w), name = "Î±_w = 2^$(round(Int64, log2(Î±_w)))") end for Î±_w in Î±_w_list] plot(traces, Layout(xaxis_title = "Policy Parameters Learning Rate", yaxis_title = "Average Reward Per Episode In First
$max_episodes Episodes Averaged over $nruns Runs", xaxis_type = "log")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$25be5dcf-be63-46c4-b6de-6cf79fa28fd0„§cell_idÙ$25be5dcf-be63-46c4-b6de-6cf79fa28fd0¤codeÚbegin function update_traces_with_gradient!(c1::T, z_Î¸::Matrix{T}, c2::T, âˆ‡Î¸::BinaryEligibilityVector{T, B}) where {T<:Real, B<:BinaryFeatureVector} z_Î¸ .*= c1 @inbounds for i in eachindex(âˆ‡Î¸.Ï€_dist) @simd for j in 1:âˆ‡Î¸.binary_features.num_features k = âˆ‡Î¸.binary_features.active_features[j] z_Î¸[k, i] -= c2*âˆ‡Î¸.Ï€_dist[i] end end @inbounds @simd for i in 1:âˆ‡Î¸.binary_features.num_features j = âˆ‡Î¸.binary_features.active_features[i] z_Î¸[j, âˆ‡Î¸.i_a] += c2 end return z_Î¸ end function update_traces_with_gradient!(c1::T, z_w::Vector{T}, âˆ‡w::BinaryFeatureVector) where {T<:Real} z_w .*= c1 @inbounds @simd for i in 1:âˆ‡w.num_features j = âˆ‡w.active_features[i] z_w[j] += one(T) end return z_w end function update_traces_with_gradient!(c1::T, z_Î¸::Array{T, N}, âˆ‡Î¸::Array{T, N}) where {T<:Real, N} z_Î¸ .= c1 .* z_Î¸ .+ âˆ‡Î¸ end function update_traces_with_gradient!(c1::T, z_Î¸::Array{T, N}, c2::T, âˆ‡Î¸::Array{T, N}) where {T<:Real, N} z_Î¸ .= c1 .* z_Î¸ .+ c2 .* âˆ‡Î¸ end function update_traces_with_gradient!(c1::Float32, z_Î¸::FCANNParams, âˆ‡Î¸::FCANNParams) for i in eachindex(first(z_Î¸)) for j in 1:2 update_traces_with_gradient!(c1, z_Î¸[j][i], âˆ‡Î¸[j][i]) end end end function update_traces_with_gradient!(c1::Float32, z_Î¸::FCANNParams, c2::Float32, âˆ‡Î¸::FCANNParams) for i in eachindex(first(z_Î¸)) for j in 1:2 update_traces_with_gradient!(c1, z_Î¸[j][i], c2, âˆ‡Î¸[j][i]) end end end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$38acd032-1d18-4760-9111-67c9cdd2e892„§cell_idÙ$38acd032-1d18-4760-9111-67c9cdd2e892¤codeÙq#without limiting the force in this way, the learned policy just applies so much force to go up the hill 
directly¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$cecc2a35-3850-4f66-84e8-e29da4f3d4b0„§cell_idÙ$cecc2a35-3850-4f66-84e8-e29da4f3d4b0¤codeÙÉfunction get_corridor_episode_stats(Ï€::Function; ntrials=10_000, kwargs...) 1:ntrials |> Map(_ -> runepisode(corridor_mdp; Ï€ = Ï€, kwargs...) |> first |> length) |> foldxt(+) |> a -> a / ntrials end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$4c4e643b-d4b9-44f0-8d30-dc521bcc55ac„§cell_idÙ$4c4e643b-d4b9-44f0-8d30-dc521bcc55ac¤codeÙÎconst cartpole_continuing_mdp = StateMDP(cartpole_functions.discrete_actions, StateMDPTransitionSampler(cartpole_continuing_step, cartpole_functions.initialize_state()), cartpole_functions.initialize_state)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$738ada7f-edc7-4ed3-a15e-e92113468738„§cell_idÙ$738ada7f-edc7-4ed3-a15e-e92113468738¤codeÙt#note that the random policy i.e. p = 0.5 has an expected episode length of 12 which is very close to ideal already.¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$cacaaca6-6e01-464f-a2ee-cbf62737a426„§cell_idÙ$cacaaca6-6e01-464f-a2ee-cbf62737a426¤codeÙµreinforce_with_baseline_monte_carlo_control_linear_features(corridor_mdp, update_corridor_features!, 1, 1_000; Î±_Î¸ = 2f0^-12, Î±_w = 2f0^-6, max_steps = 1_000).policy_function(1)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$daf35bfe-8f9c-4f55-971d-4d443be8f8bf„§cell_idÙ$daf35bfe-8f9c-4f55-971d-4d443be8f8bf¤codeÙ£display_cartpole_episode((runepisode(cartpole_setup.mdps.episodic.discrete; Ï€ = reinforce_test5.policy_sample_action, max_steps = 1_000) |> x -> (x[1], x[2]))...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$8e096fae-9941-49d8-ae87-c68b02f68da5„§cell_idÙ$8e096fae-9941-49d8-ae87-c68b02f68da5¤codeÙSconst mountaincar_continuous_beta_mdp = 
create_continuous_action_mountaincar_beta()¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$666a4e89-306b-4fb2-bdc4-3dda2c63153f„§cell_idÙ$666a4e89-306b-4fb2-bdc4-3dda2c63153f¤code¶using SpecialFunctions¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$5d35e515-e2d3-443e-becf-eb28c25db346„§cell_idÙ$5d35e515-e2d3-443e-becf-eb28c25db346¤codeÙs@bind mountaincar_continuing_fcann_params create_actor_critic_continuing_params_UI(; Î»_Î¸ = 0.85f0, Î»_w = 0.95f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$4c34640f-efa2-4e1d-8a70-0acd2ce45428„§cell_idÙ$4c34640f-efa2-4e1d-8a70-0acd2ce45428¤codeÚ ªmd""" # Bonus Problems: Comparing Techniques Consider the case of applying the techniques in this chapter to problems where we choose feature vectors and parameters to effectively compute the tabular case. That is we enumerate every state and state/action pair. Our parameters for each function will store a single value for each case. Let's consider the gradients for both the state-value estimate and the policy. We will use two sets of parameters: $\mathbf{w}$ and $\mathbf{\theta}$. $\mathbf{w}_s$ is the parameter for state s and $\mathbf{\theta}_{s, a}$ is the parameter for state/action pair $(s, a)$. Using this notation $\mathbf{w}$ is a vector and $\theta$ is a matrix. Starting with the state-value function: $\begin{align} \hat v(s, \mathbf{w}) &= \mathbf{w}_s \\ \nabla v(s, \mathbf{w}) &= \nabla \mathbf{w}_s \\ &= \mathbf{e}_s \end{align}$ where $\mathbf{e}_s$ is the one-hot vector for index s and length equal to the number of states. Now moving on to the policy, we will use a soft-max function to convert action preferences into probabilities. 
$\begin{align} \pi(a|s, \theta) &= \frac{\exp{\theta_{s, a}}}{\sum_{i = 1}^{n_A}{\exp{\theta_{s, i}}}} \\ \nabla \pi(a|s, \theta) &= \nabla \frac{\exp{\theta_{s, a}}}{\sum_{i = 1}^{n_A}{\exp{\theta_{s, i}}}} \\ \end{align}$ But we already calculated the gradient of the soft-max function of a vector $\mathbf{x}$: $\nabla\sigma(\mathbf{x})_{i, j} = \sigma(\mathbf{x})_i \left ( \delta_{i, j} - \sigma(\mathbf{x})_j \right )$ Comparing this to what we need, $\mathbf{x} = \mathbf{\theta}_s$, the parameter vector for state $s$, and $\sigma = \pi$. So we can immediately write down the components of this gradient: $\begin{align} \nabla \pi(a|\theta_s)_i &= \pi(a|\theta_s) \left (\delta_{a, i} - \pi(i|\theta_s) \right ) \\ \frac{\nabla \pi(a|\theta_s)_i}{\pi(a|\theta_s)} = \nabla \ln \pi(a|\theta_s)_i &= \left (\delta_{a, i} - \pi(i|\theta_s) \right ) \\ \end{align}$ $\begin{equation} \nabla \ln{\pi(a|\theta_s)}_i = \begin{cases} -\pi(i|\theta_s) & i \neq a \\ 1 - \pi(i|\theta_s) & i = a \end{cases} \end{equation}$ This gradient vector has one component for each entry of $\theta_s$, the vector of action preferences for state $s$. Every observed state/action pair produces its own update, but once $s$ and $a$ are fixed only the components belonging to $\theta_s$ can be nonzero, so the update is a vector with a length equal to the number of actions. 
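As a quick numerical check of the case expression above, consider a small worked example. The two-action state and the preference values $\theta_s = (0, \ln 3)$ are illustrative assumptions chosen to give round numbers, not values taken from any experiment in this notebook.

$\begin{align} \pi(1|\theta_s) &= \frac{e^{0}}{e^{0} + e^{\ln 3}} = \frac{1}{4}, \quad \pi(2|\theta_s) = \frac{3}{4} \\ \nabla \ln \pi(1|\theta_s) &= \left ( 1 - \frac{1}{4}, \; -\frac{3}{4} \right ) = \left ( \frac{3}{4}, \; -\frac{3}{4} \right ) \\ \nabla \ln \pi(2|\theta_s) &= \left ( -\frac{1}{4}, \; 1 - \frac{3}{4} \right ) = \left ( -\frac{1}{4}, \; \frac{1}{4} \right ) \end{align}$

In both cases the components sum to zero, as they must since $\sum_i \left ( \delta_{a, i} - \pi(i|\theta_s) \right ) = 1 - 1 = 0$.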
"""¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$e7566274-5518-4e28-8738-d4b1747d0cfb„§cell_idÙ$e7566274-5518-4e28-8738-d4b1747d0cfb¤codeÙÉfunction form_state_value_function(update_feature_vector!::Function, value_function::Function) function v!(x, s, value_params) update_feature_vector!(x, s) value_function(x, value_params) end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$6bf5ad39-1400-4e1f-a843-a1934b8aaa48„§cell_idÙ$6bf5ad39-1400-4e1f-a843-a1934b8aaa48¤codeÚöbegin function update_squashed_gaussian_eligibility_vector!(âˆ‡lnÏ€::Matrix{T}, action_dist_params::Vector{T}, x::Vector{T}, action::T, policy_params::Matrix{T}, amax::T) where T<:Real c1 = atanh(action/amax) - first(action_dist_params) Ïƒ = exp(last(action_dist_params)) c2 = Ïƒ^-2 c3 = c2*c1 c4 = c3*c1 - one(T) @inbounds @simd for i in eachindex(x) âˆ‡lnÏ€[i, 1] = x[i]*c3 end @inbounds @simd for i in eachindex(x) âˆ‡lnÏ€[i, 2] = x[i]*c4 end end function update_squashed_gaussian_eligibility_vector!(âˆ‡lnÏ€::Matrix{T}, action_dist_params::Vector{T}, x::Vector{T}, action::NTuple{N, T}, policy_params::Matrix{T}, amax::NTuple{N, T}) where {N, T <: Real} for k = 1:N c1 = atanh(action/amax[k]) - action_dist_params[k] Ïƒ = exp(action_dist_params[k+N]) c2 = Ïƒ^-2 c3 = c2*c1 c4 = c3*c1 - one(T) @inbounds @simd for i in eachindex(x) âˆ‡lnÏ€[i, k] = x[i]*c3 end @inbounds @simd for i in eachindex(x) âˆ‡lnÏ€[i, k+N] = x[i]*c4 end end end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$17d07ef4-7c0a-47cc-a701-32c60336571b„§cell_idÙ$17d07ef4-7c0a-47cc-a701-32c60336571b¤codeÚ€md""" Noticing this pattern, the kth term will be of the form $\gamma^k \sum_{x \in \mathcal{S}} \Pr(s \rightarrow x, k, \pi)f(x)$ and the total expression will just be a sum of all of these terms to infinity or the maximum length of an episode under the policy. Looking more closely at the probability term, we can equate it to some other probabilities regarding episode length. 
"""¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$76fd79a2-2bc8-45f8-a243-48415118898a„§cell_idÙ$76fd79a2-2bc8-45f8-a243-48415118898a¤codeÚ+begin mutable struct BinarySquashedGaussianEligibilityVector{T<:Real, A<:Union{T, NTuple{N, T} where N}, P<:Union{T, Vector{T}}, B <: BinaryFeatureVector} binary_features::B a::A Î¼::P Ïƒ::P amax::A end BinarySquashedGaussianEligibilityVector(a::T, amax::T) where T<:Real = BinarySquashedGaussianEligibilityVector(BinaryFeatureVector(), a, zero(T), one(T), amax) BinarySquashedGaussianEligibilityVector(a::NTuple{N, T}) where {T<:Real, N} = BinarySquashedGaussianEligibilityVector(BinaryFeatureVector(), a, zeros(T, N), ones(T, N), amax) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$0b01ba67-3921-4f3f-a7e8-235190bc84eb„§cell_idÙ$0b01ba67-3921-4f3f-a7e8-235190bc84eb¤codeÙRfunction make_beta_dist(Î±, Î²) f(x) = x^(Î±-1) * (1-x)^(Î²-1) / beta(Î±, Î²) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$9acdbf38-2e10-45ec-85a0-d0db8453a599„§cell_idÙ$9acdbf38-2e10-45ec-85a0-d0db8453a599¤codeÚ…#this version of tile coding setup just produces a function that returns the active indices as a generator rather than actually update the feature vector function fcann_feature_vector_setup(min_value::S, max_value::S) where {T<:Real, N, S <: Union{T, NTuple{N, T}}} #states must be tuples with k elements or some number value k = S == T ? 1 : N s_range = if k == 1 max_value - min_value else Tuple(max_value[i] - min_value[i] for i in 1:k) end sample_vector = make_sample_vector(min_value) function update_feature_vector!(x::Vector{T}, s::Real) x[1] = scale_state(s, min_value, s_range) return x end function update_feature_vector!(x::Vector{T}, s::NTuple{N, T}) for i in 1:N x[i] = scale_state(s[i], min_value[i], s_range[i]) end return x end (feature_vector = sample_vector, num_features = length(sample_vector), update_feature_vector! = update_feature_vector!) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d4e87ac4-6008-43b2-aa06-e232ec2b2b5b„§cell_idÙ$d4e87ac4-6008-43b2-aa06-e232ec2b2b5b¤codeÙqplot_cartpole_policy(reinforce_test5.policy_and_value; s_ref = CartPoleState(Float32(x), 0f0, Float32(xÌ‡), 0f0))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$05f120be-9695-4824-82fd-142a0df13098„§cell_idÙ$05f120be-9695-4824-82fd-142a0df13098¤codeÚÊfunction actor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, amax::A, Î»_Î¸::T, Î»_w::T, get_active_features::Function, num_features::Integer, args...; policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, N, A <: Union{T, NTuple{N, T}}, P, F1, F2, F3} setup = setup_binary_squashed_gaussian_policy_arguments(mdp, amax, get_active_features, num_features) actor_critic_with_eligibility_traces!(policy_params, setup.eligibility_vector, value_params, BinaryFeatureVector(), mdp, Î»_Î¸, Î»_w, update_binary_action_preferences!, setup.action_distribution_parameters, make_squashed_gaussian_sampler(rand(A), amax), update_squashed_gaussian_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, binary_value_function, update_binary_value_gradient!, args...; kwargs...) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$b2539398-fdbc-42a2-a8f3-d327358f3643„§cell_idÙ$b2539398-fdbc-42a2-a8f3-d327358f3643¤codeÙÖif start_cartpole_continuing_binary_param_study > 0 cartpole_binary_continuing_parameter_study(cartpole_continuing_binary_study_params, 5, 3, 10_000; seed = 45) else md""" Waiting to run parameter study """ end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$c5dd7e99-57e0-4bc7-97d2-2c780b23bcff„§cell_idÙ$c5dd7e99-57e0-4bc7-97d2-2c780b23bcff¤codeÚqmd""" #### Discrete Action Space As an initial test, consider the discrete action space originally used for the mountain car problem where there are three actions (-1, 0, 1) corresponding to full throttle reverse, idle, and full throttle forward. We can apply the same tile coding solution technique from before but with a policy gradient method instead of Sarsa. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9„§cell_idÙ$d5ab6d24-dd4e-4410-a50e-fe3584b21cf9¤codeÚ9const mountaincar_continuing_fcann_test = actor_critic_with_eligibility_traces_fcann(mountaincar_continuing_mdp, 0.85f0, 0.95f0, mountaincar_fcann_setup.num_features, [32, 32, 32], mountaincar_fcann_setup.update_feature_vector!, 1_000_000, Î±_Î¸ = 0.002f0, Î±_w = 0.002f0, Î±_rÌ„ = 0.01f0; save_step_rewards=true)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$042fbafe-2401-4fb7-ac13-4531e0782c79„§cell_idÙ$042fbafe-2401-4fb7-ac13-4531e0782c79¤codeÚ¶function update_binary_eligibility_vector!(âˆ‡lnÏ€::BinaryEligibilityVector{T, B}, action_preferences::Vector{T}, binary_features::B, i_a::Integer, params::Matrix{T}) where {T<:Real, B<:BinaryFeatureVector} update_binary_action_preferences!(action_preferences, binary_features, params) soft_max!(action_preferences) âˆ‡lnÏ€.binary_features = binary_features âˆ‡lnÏ€.i_a = i_a âˆ‡lnÏ€.Ï€_dist .= action_preferences return âˆ‡lnÏ€ 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d57375a5-b9e0-4742-b5f7-6a7da891604a„§cell_idÙ$d57375a5-b9e0-4742-b5f7-6a7da891604a¤codeÚ mountaincar_binary_continuing_parameter_study(args...; kwargs...) = actor_critic_linear_parameter_study(mountaincar_continuing_mdp, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, args...; binary_features=true, kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$07ad517a-c2ac-4377-99fb-adb13d0f1d0c„§cell_idÙ$07ad517a-c2ac-4377-99fb-adb13d0f1d0c¤codeÙ“reinforce_monte_carlo_control_fcann(corridor_mdp, 1, [10, 10], update_corridor_features!, 100; Î± = 2f0^-14, max_steps = 10_000).policy_function(1)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$71a5fce8-6d9a-4625-bad1-a951d61bff28„§cell_idÙ$71a5fce8-6d9a-4625-bad1-a951d61bff28¤codeÙf@bind mountaincar_binary_continuous_params create_actor_critic_params_UI(Î»_Î¸ = 0.05f0, Î»_w = 0.8f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$77906355-08f8-4b08-b051-84697199b519„§cell_idÙ$77906355-08f8-4b08-b051-84697199b519¤codeÙ,const mountaincar_max_vals = (0.5f0, 0.07f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$5207308e-f636-4d47-b135-036a6e7b8ecd„§cell_idÙ$5207308e-f636-4d47-b135-036a6e7b8ecd¤codeÙfshow_mountaincar_continuous_trajectory(mountaincar_continuous_test_train3.policy_sample_action, 1_000)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$16113560-e911-47b4-abc4-641bbd246454„§cell_idÙ$16113560-e911-47b4-abc4-641bbd246454¤codeÙ_plot(mountaincar_continuous_test_train_beta.episode_rewards, Layout(yaxis_range = [-10000, 0]))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$b7f77935-bcab-4ef1-8e1b-a7d059784ff3„§cell_idÙ$b7f77935-bcab-4ef1-8e1b-a7d059784ff3¤codeÚ@bind test_mountaincar_state PlutoUI.combine() do Child md""" #### Evaluation State for Policy Function x position: $(Child(Slider(-1.2f0:0.1f0:0.5f0, default = 0f0, 
show_value=true))) velocity: $(Child(Slider(-0.07f0:0.01f0:0.07f0, default = 0f0, show_value=true))) """ end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5„§cell_idÙ$f9ac1bf0-55ee-4c71-bdaa-a00f9d779bf5¤codeÙUreinforce_test.policy_and_value(cartpole_mdps.episodic.continuous.initialize_state())¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$00bd2835-b006-4244-9877-bc7e031e3ef8„§cell_idÙ$00bd2835-b006-4244-9877-bc7e031e3ef8¤codeÙâfunction plot_squashed_gaussian(Î¼::T, Ïƒ::T, xmax::T; npoints = 1000) where T<:Real x = LinRange(-one(T)*xmax, one(T)*xmax, npoints) y = squashed_gaussian_pdf.(x, Î¼, Ïƒ, xmax) plot(x, y, Layout(xaxis_range = [-2, 2])) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$50ae94c4-70f3-4215-82bd-eb2227c2badf„§cell_idÙ$50ae94c4-70f3-4215-82bd-eb2227c2badf¤codeÚif start_cartpole_continuing_fcann_param_study > 0 cartpole_fcann_continuing_parameter_study(cartpole_continuing_fcann_network_params..., cartpole_continuing_fcann_study_params, 4, 3, 100_000; seed = 45) else md""" Waiting to run parameter study """ end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$cc3ac95e-a398-438a-ba3d-62b6733f6342„§cell_idÙ$cc3ac95e-a398-438a-ba3d-62b6733f6342¤codeÚfunction update_fcann_action_preferences!(action_preferences::Vector{T}, x::Vector{T}, params::FCANNParams, activations::FCANNActivations{T}, reslayers::Integer) where T<:Float32 FCANN.forwardNOGRAD_base!(activations, params..., x, reslayers) action_preferences .= activations[end] end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$c926b6df-c40b-4c4c-8a95-ce9e41feb100„§cell_idÙ$c926b6df-c40b-4c4c-8a95-ce9e41feb100¤codeÚMactor_critic_fcann_parameter_study(mountaincar_continuing_mdp, mountaincar_fcann_feature_setup.update_feature_vector!, mountaincar_fcann_feature_setup.num_features, [4, 4], 0.0f0:0.05f0:0.95f0, 0.0f0:0.05f0:0.95f0, [0.01f0, 0.005f0], 2f0 .^ (-20:-1), 2f0 .^ (-20:-1), 1_000, 1_000_000; 
seed = 45) |> df -> sort(df, :output; rev=true)¨metadataƒ©show_logsÃ¨disabledÃ®skip_as_scriptÂ«code_foldedÂÙ$740a3f41-9302-481d-b373-762c0dea8eff„§cell_idÙ$740a3f41-9302-481d-b373-762c0dea8eff¤codeÚVbegin function update_gaussian_eligibility_vector!(âˆ‡lnÏ€::BinaryGaussianEligibilityVector{T, T, T, B}, dist_params::Vector{T}, x::B, action::T, policy_params::Matrix{T}) where {T<:Real, B<:BinaryFeatureVector} âˆ‡lnÏ€.binary_features = x âˆ‡lnÏ€.a = action âˆ‡lnÏ€.Î¼ = first(dist_params) âˆ‡lnÏ€.Ïƒ = exp(last(dist_params)) # isapprox(âˆ‡lnÏ€.Ïƒ, 0f0) && @info "with distribution params $dist_params having 0 result for Ïƒ of $âˆ‡lnÏ€.Ïƒ" # isinf(âˆ‡lnÏ€.Ïƒ) && @info "with distribution params $dist_params having inf result for Ïƒ of $âˆ‡lnÏ€.Ïƒ" # isnan(âˆ‡lnÏ€.Ïƒ) && @info "with distribution params $dist_params having nan result for Ïƒ of $âˆ‡lnÏ€.Ïƒ" return âˆ‡lnÏ€ end function update_gaussian_eligibility_vector!(âˆ‡lnÏ€::BinaryGaussianEligibilityVector{T, NTuple{N, T}, Vector{T}, B}, dist_params::Vector{T}, x::B, action::NTuple{N, T}, policy_params::Matrix{T}) where {T<:Real, N, B<:BinaryFeatureVector} âˆ‡lnÏ€.binary_features = x âˆ‡lnÏ€.a = action for i in 1:N âˆ‡lnÏ€.Î¼[k] = dist_params[k] âˆ‡lnÏ€.Ïƒ[k] = exp(dist_params[k+N]) end return âˆ‡lnÏ€ end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$ba642a22-6623-482a-ab4a-81585b83e457„§cell_idÙ$ba642a22-6623-482a-ab4a-81585b83e457¤codeÚ-@memoize Dict function average_continuing_runs(nruns::Integer, seed::Integer, Î±_Î¸::T, Î±_w::T, Î±_rÌ„::T, policy_params, algo, args...; kwargs...) where T<:Real # @info "Running trials for continuing actor critic with binary encoding: $nruns $seed $Î±_Î¸ $Î±_w $Î±_rÌ„ $mdp $Î»_Î¸ $Î»_w $get_active_features $num_features" Random.seed!(seed) 1:nruns |> Map() do _ x =algo(args...; Î±_Î¸ = Î±_Î¸, Î±_w = Î±_w, Î±_rÌ„ = Î±_rÌ„, policy_params = deepcopy(policy_params), kwargs...) 
x.total_reward / x.total_steps end |> foldxt(+) |> a -> a / nruns end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d17a4bd0-5992-4247-912d-73d51758d2f3„§cell_idÙ$d17a4bd0-5992-4247-912d-73d51758d2f3¤codeÙ+md""" ### *Continuing Cartpole Example* """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7ef„§cell_idÙ$db6ed0ea-c26b-4ea1-b4a1-7641f0f9c7ef¤codeÙ§plot_cartpole_policy(cartpole_continuing_fcann_test.policy_and_value; s_ref = cartpole_fcann_continuing_test_episode[1][cartpole_fcann_continuing_episode_step_select])¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$5ee4ce72-7740-4297-8d84-619e0708e4ac„§cell_idÙ$5ee4ce72-7740-4297-8d84-619e0708e4ac¤codeÚâfunction cartpole_continuing_fcann_parameter_study(Î±1_list, Î±2_list, Î±_rÌ„, Î»_Î¸, Î»_w, hidden_layers, max_steps; num_trials = 100, kwargs...) setup = setup_cartpole_problem(;kwargs...) traces = [begin steps = [begin 1:num_trials |> Map() do i solution = actor_critic_with_eligibility_traces_fcann(cartpole_setup.mdps.continuing.discrete, Î»_Î¸, Î»_w, cartpole_fcann_feature_setup.num_features, hidden_layers, (x, s) -> cartpole_fcann_feature_setup.update_feature_vector!(x, (s.x, s.Î¸, s.xÌ‡, s.Î¸Ì‡)), max_steps; Î±_Î¸ = Î±1, Î±_w = Î±2, Î±_rÌ„ = Î±_rÌ„) solution.total_reward / max_steps end |> foldxt(+) |> x -> x / num_trials end for Î±1 in Î±1_list] scatter(x = Î±1_list, y = steps, name = "Î±_w = $Î±2") end for Î±2 in Î±2_list] plot(traces, Layout(xaxis_title = "Policy Learning Rate Î±_Î¸", yaxis_title = "Average Failure Rate Over First $max_steps Steps", xaxis_type = "log", title = "Hiden Layers = $hidden_layers, Î»_Î¸ = $Î»_Î¸, Î»_w = $Î»_w")) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$645e93e7-e92e-49c4-9757-8294fabf4e9b„§cell_idÙ$645e93e7-e92e-49c4-9757-8294fabf4e9b¤codeÙCplot_continuing_step_rewards(cartpole_continuing_test.step_rewards)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$0c56b341-24eb-4c78-844e-182f44a7221a„§cell_idÙ$0c56b341-24eb-4c78-844e-182f44a7221a¤codeÚ#in the source code used to generate this for the book found here: http://incompleteideas.net/book/code/figure_13_1.py - graphs look as they do because of poor parameter initialization since the random policy is fairly close to ideal already figure_13_1(2f0 .^ [-12, -13, -14])¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$d34d22ad-89c2-423e-91dd-bfb895dc6540„§cell_idÙ$d34d22ad-89c2-423e-91dd-bfb895dc6540¤codeÙßcartpole_fcann_parameter_study(args...; kwargs...) = actor_critic_fcann_episodic_parameter_study(cartpole_setup.mdps.episodic.discrete, cartpole_vector_update!, cartpole_fcann_feature_setup.num_features, args...; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$20776e09-7d9b-4db8-a060-7bceeec65b47„§cell_idÙ$20776e09-7d9b-4db8-a060-7bceeec65b47¤codeÚ‘function actor_critic_with_eligibility_traces_binary_features_gaussian_actions(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, Î»_Î¸::T, Î»_w::T, get_active_features::Function, num_features::Integer, args...; policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), value_params::Vector{T} = zeros(T, num_features), kwargs...) 
where {T<:Real, S, N, A <: Union{T, NTuple{N, T}}, P, F1, F2, F3} setup = setup_binary_gaussian_policy_arguments(mdp, get_active_features, num_features) actor_critic_with_eligibility_traces!(policy_params, setup.eligibility_vector, value_params, BinaryFeatureVector(), mdp, Î»_Î¸, Î»_w, update_binary_action_preferences!, setup.action_distribution_parameters, make_gaussian_sampler(rand(A)), update_gaussian_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, binary_value_function, update_binary_value_gradient!, args...; kwargs...) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$7856b8a0-565d-4c86-9b3c-4424ff9b86dd„§cell_idÙ$7856b8a0-565d-4c86-9b3c-4424ff9b86dd¤codeÙb#add policy gradient example on cartpole without continuous actions and parameter studies for both¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$735b548a-88f5-4a30-ab8f-dfb3d6401b2b„§cell_idÙ$735b548a-88f5-4a30-ab8f-dfb3d6401b2b¤codeÚAmd""" ## 13.7 Policy Parameterization for Continuous Actions With a parameterized policy we are to learn statistics of the distribution that selects actions. 
As a foundation consider the normal distribution: $p(x) \doteq \frac{1}{\sigma \sqrt{2\pi}} \exp \left ( - \frac{(x-\mu)^2}{2\sigma^2} \right ) \tag{13.18}$ """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$7cf26604-9c2b-4a77-9674-7d4dac2f99f0„§cell_idÙ$7cf26604-9c2b-4a77-9674-7d4dac2f99f0¤codeÚ—begin include(joinpath(@__DIR__, "..", "Chapter-09", "Chapter_09_On-policy_Prediction_with_Approximation.jl")) include(joinpath(@__DIR__, "..", "Chapter-10", "Chapter_10_On_policy_Control_with_Approximation.jl")) include(joinpath(@__DIR__, "..", "Chapter-11", "Chapter_11_Off_policy_Methods_with_Approximation.jl")) include(joinpath(@__DIR__, "..", "Chapter-12", "Chapter_12_Eligibility_Traces.jl")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$87ee21f3-16ca-4c8c-a0b9-f9e2fd258a91„§cell_idÙ$87ee21f3-16ca-4c8c-a0b9-f9e2fd258a91¤codeÙEmd""" ### *REINFORCE Implementation for Continuous Action Spaces* """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$54f1546d-87ae-49d2-92ed-6fcc9b66e027„§cell_idÙ$54f1546d-87ae-49d2-92ed-6fcc9b66e027¤codeÙ md""" ### *Mountain Car MDP* """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$63fbf8f4-e4e2-4893-be09-67450e92dbd7„§cell_idÙ$63fbf8f4-e4e2-4893-be09-67450e92dbd7¤codeÚ,function plot_cart(s::CartPoleState, a::Int64; xmin = -50, xmax = 50, Î¸Ì‡_min = -10, Î¸Ì‡_max = 10) s.x s.Î¸ t1 = scatter(x = [0, sin(s.Î¸)], y = [0, cos(s.Î¸)], mode = "lines", color = "black") t2 = scatter(x = [sin(s.Î¸)], y = [cos(s.Î¸)], mode = "markers", color = "black") p1 = plot([t1, t2], Layout(yaxis_range = [-.1, 1.2], xaxis_range = [-1.2, 1.2], xaxis_scaleanchor = "y", width = 250, height = 230, showlegend = false, title = "Pole Angle")) p2 = plot(scatter(x = [s.x], y = [0]), Layout(xaxis_range = [xmin, xmax], width = 250, height = 230, title = "X Location")) p3 = plot(indicator(mode = "gauge+number+delta", value = s.Î¸Ì‡, title_text = "Angular Speed
in Radians per Second", delta_reference = 0, gauge_axis_range = [-10, 10]), Layout(width = 250, height = 230)) p4 = plot(indicator(mode = "gauge+number+delta", value = s.xÌ‡, title_text = "Horizontal Speed
in Meters per Second", delta_reference = 0, gauge_axis_range = [-50, 50]), Layout(width = 250, height = 230)) p5 = plot(indicator(mode = "gauge+number", gauge=attr(shape="bullet"), value = a - 2, title_text = "Action", delta_reference = 0, gauge_axis_range = [-1, 1]), Layout(width = 250, height = 230)) @htl("""

$p1 $p2 $p3 $p4 $p5

""") end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d5020a8d-1dd7-403c-9d1f-665b95543943„§cell_idÙ$d5020a8d-1dd7-403c-9d1f-665b95543943¤codeÚVreinforce_with_baseline_monte_carlo_control_linear_features_gaussian_actions(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, max_episodes::Integer; policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), value_params::Vector{T} = zeros(T, num_features), x = zeros(T, num_features), action_dist_params::Vector{T} = make_gaussian_params(rand(A)), kwargs...) where {T<:Real, S, N, A<:Union{T, NTuple{N, T}}, P, F1, F2, F3} = reinforce_with_baseline_monte_carlo_control!(policy_params, copy(policy_params), value_params, copy(value_params), mdp, update_linear_action_preferences!, action_dist_params, make_gaussian_sampler(rand(A)), update_gaussian_eligibility_vector!, x, update_feature_vector!, linear_value_function, update_linear_value_gradient!, max_episodes; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$37a8ef7e-e859-4ef0-81e2-76c02a324031„§cell_idÙ$37a8ef7e-e859-4ef0-81e2-76c02a324031¤codeÚmd""" ### Policy Gradient Theorem Proof In all cases below when a sum over states is taken, it is assumed to be over the set of non-terminal states: $\sum_s \implies \sum_{s \in \mathcal{S}}$ Note that for the case of the value function this is identical to the sum over $\mathcal{S}^+$ because the state-action values are always zero for terminal states. 
$\begin{flalign} \nabla v_\pi(s) &= \nabla \left [ \sum_a \pi(a \vert s) q_\pi(s, a) \right ] \text{, } \forall s \in \mathcal{S} \tag{definition of value functions and expected value} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla q_\pi(s, a) \right ] \tag{product rule} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla \sum_{s^\prime, r} p(s^\prime, r \vert s, a)(r + \gamma v_\pi(s^\prime)) \right ] \tag{relationship between action and state values} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \gamma \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] \tag{gradient independence}\\ \end{flalign}$ Note that the final term in the sum is the original expression evaluated at $s^\prime$ instead of $s$, so we have derived a recursive expression which can be applied repeatedly: $\begin{flalign} \nabla v_\pi(s) &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \gamma \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \left [ \nabla \pi(a^\prime \vert s^\prime) q_\pi(s^\prime, a^\prime) + \pi(a^\prime \vert s^\prime) \gamma \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \nabla v_\pi(s^{\prime \prime}) \right ] \right ] \tag{recur once}\\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) \right ] + \gamma \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \left [ \nabla \pi(a^\prime \vert s^\prime) q_\pi(s^\prime, a^\prime) \right ] \right ] + \\ &\hspace{50px} \gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \nabla v_\pi(s^{\prime \prime}) \right ] \tag{grouping terms}\\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) \right ] + \gamma \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \left [ \nabla 
\pi(a^\prime \vert s^\prime) q_\pi(s^\prime, a^\prime) \right ] \right ] + \\ &\hspace{50px} \gamma^2 \sum_a \left [ \pi(a \vert s)\sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \sum_{a^{\prime \prime}} [ \nabla \pi(a^{\prime \prime} \vert s^{\prime \prime}) q_\pi(s^{\prime \prime}, a^{\prime \prime})\right ] + \cdots \tag{extend recursion}\\ \end{flalign}$ """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$98229733-a71e-44ca-a52a-b7229cf8b422„§cell_idÙ$98229733-a71e-44ca-a52a-b7229cf8b422¤codeÚ1md""" The probability transition function is normalized over all possible transition states $\sum_{s^\prime \in \mathcal{S}^+} p(s^\prime \vert s, a) = 1$. If we only take the sum of $\mathcal{S}$ then we instead get the probability that after a single transition we have NOT reached a terminal state. Let's say we also have a policy function $\pi(a \vert s)$ which is normalized over actions: $\sum_a \pi(a \vert s) = 1$. Now if we combine the two, we can arrive at a new distribution over transition states: $p(s^\prime \vert s, \pi) = \sum_a \pi(a \vert s) p(s^\prime \vert s, a)$ which is the probability of transitioning from $s$ to $s^\prime$ under the policy. We can see that this distribution is normalized over the transition states as well as long as we include the terminal state: $\sum_{s^\prime \in \mathcal{S}^+} p(s^\prime \vert s, \pi) = \sum_{s^\prime \in \mathcal{S}^+, a} \pi(a \vert s) p(s^\prime \vert s, a) = \sum_a \pi(a \vert s) \sum_{s^\prime \in \mathcal{S}^+} p(s^\prime \vert s, a) = 1 \times 1 = 1$. If instead we take the sum over $\mathcal{S}$ we simply get the probability of NOT terminating in one step. What if we consider two steps into the future though? 
Now we have $\sum_{s^\prime}\sum_{a^\prime}\pi(a^\prime \vert s^\prime)p(s^{\prime \prime} \vert s^\prime, a^\prime)\sum_a \pi(a \vert s) p(s^\prime \vert s, a) = \sum_{s^\prime}p(s^{\prime \prime} \vert s^\prime, \pi) p(s^\prime \vert s, \pi)$. It would appear as though we can just put the two probabilities together and consider a new distribution over $s^{\prime \prime}$ which is $p(s^{\prime \prime} \vert s, \pi, 2)$ where instead of one step this now occurs over two steps, but how is this distribution normalized? In the case of the one-step transition, we saw that its sum over all transition states is 1 as expected. If we sum both transition states over only $\mathcal{S}$ rather than $\mathcal{S}^+$ what is the result? We already know that $\sum_{s^{\prime \prime} \in \mathcal{S}} p(s^{\prime \prime} \vert s^\prime , \pi) = \Pr \{ S_1 \neq S_T \ \vert S_0 = s^\prime, \pi \}$, that is, the probability that after transitioning out of $s^\prime$ under the policy $\pi$ we have not reached a terminal state. $\sum_{s^{\prime \prime} \in \mathcal{S}} \sum_{s^\prime \in \mathcal{S}} p(s^{\prime \prime} \vert s^\prime, \pi) p(s^\prime \vert s, \pi) = \sum_{s^\prime \in \mathcal{S}} p(s^\prime \vert s, \pi) \sum_{s^{\prime \prime} \in \mathcal{S}} p(s^{\prime \prime} \vert s^\prime, \pi) = \Pr \{ S_2 \neq S_T \vert S_0 = s, \pi \}$, which is to say the probability that after two transitions from $s$ we are not in a terminal state under the policy $\pi$. For the derivations that follow, we always take sums of these distributions over $\mathcal{S}$. For episodic problems, the on-policy distribution $\mu_\pi(s)$, which is the probability of being in a state $s$ during an episode, always excludes the terminal state. That is because if there is a non-zero probability of reaching a terminal state under a policy, then considering all possible episodes we may have an infinite number of visits to the terminal state. 
Technically the episodes have infinite length but we are only interested in the portion of the episode that preceeds the terminal state for the purpose of calculating probabilities. The more careful statement about the on policy distribution is that it measures the probability of being in a state during the non-terminal part of an episode. If we try to include the terminal states, then we cannot have a proper normalized definition of the on-policy distribution. Moreover, we have no need to measure the value of a terminal state accurately, since we always know it to be 0. The on policy distribution is used to formulate the value error objective function and it should only include states for which the value estimation is non-trivial. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$42775fd1-5b27-48e0-abf1-9b22bb775e6d„§cell_idÙ$42775fd1-5b27-48e0-abf1-9b22bb775e6d¤codeÙKcorridor_continuing_parameter_study(continuing_study_params, 5, 3, 100_000)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$7dbb42a3-aa8c-47e5-b668-18e6325d4038„§cell_idÙ$7dbb42a3-aa8c-47e5-b668-18e6325d4038¤codeÙ!md""" #### Tile Coding Method """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$192b9f82-8d3a-408f-91c2-829cfcd32572„§cell_idÙ$192b9f82-8d3a-408f-91c2-829cfcd32572¤codeÙcartpole_vector_update!(x::Vector{T}, s::CartPoleState{T}) where T<:Real = cartpole_fcann_feature_setup.update_feature_vector!(x, (s.x, s.Î¸, s.xÌ‡, s.Î¸Ì‡))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$b5319d8b-0420-4ebf-b603-ea0b93365ac1„§cell_idÙ$b5319d8b-0420-4ebf-b603-ea0b93365ac1¤codeÚnfunction show_mountaincar_continuous_trajectory(Ï€::Function, max_steps::Integer; mdp = mountaincar_continuous_mdp) states, actions, rewards, sterm, nsteps = runepisode(mdp, Ï€; max_steps = max_steps) positions = [s[1] for s in states] velocities = [s[2] for s in states] tr1 = scatter(x = positions, y = velocities, mode = "markers", showlegend = false) tr2 = scatter(y = 
positions, showlegend = false) tr3 = scatter(y = actions, showlegend = false) p1 = plot(tr1, Layout(xaxis_title = "position", yaxis_title = "velocity", xaxis_range = [-1.2, 0.5], yaxis_range = [-0.07, 0.07], height = 400)) p2 = plot(tr2, Layout(xaxis_title = "time", yaxis_title = "position", height = 400)) p3 = plot(tr3, Layout(xaxis_title = "time", yaxis_title = "action", height = 400)) @htl(""" Total Reward: $(sum(rewards))

$([p1 p2 p3])

""") end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$4cbdb082-22ba-49e9-a6ed-4380917625ac„§cell_idÙ$4cbdb082-22ba-49e9-a6ed-4380917625ac¤codeÙCmd""" ### *Actor-Critic with Eligibility Traces Implementation* """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$cc80848a-6834-4272-9152-e17b45448814„§cell_idÙ$cc80848a-6834-4272-9152-e17b45448814¤codeÙÝfunction wind_speeds(directions) PlutoUI.combine() do Child @htl("""

Wind speeds

$(name): $(Child(name, Slider(1:100)))

""") end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$05bfd818-bf4e-4bda-baa9-5ba647867097„§cell_idÙ$05bfd818-bf4e-4bda-baa9-5ba647867097¤codeÚ.function actor_critic_with_eligibility_traces_binary_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, Î»_Î¸::T, Î»_w::T, get_active_features::Function, num_features::Integer, args...; policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} setup = setup_binary_policy_arguments(mdp, get_active_features, num_features) actor_critic_with_eligibility_traces!(policy_params, setup.eligibility_vector, value_params, BinaryFeatureVector(), mdp, Î»_Î¸, Î»_w, update_binary_action_preferences!, update_binary_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, binary_value_function, update_binary_value_gradient!, args...; action_preferences = setup.action_preferences, kwargs...) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$f0962801-0dfa-421f-8ffc-e64068e49913„§cell_idÙ$f0962801-0dfa-421f-8ffc-e64068e49913¤codeÙfconst mountaincar_fcann_feature_setup = fcann_feature_vector_setup((-1.2f0, -0.07f0), (0.5f0, 0.07f0))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$11a55af7-5301-4507-bb26-88e1e11236db„§cell_idÙ$11a55af7-5301-4507-bb26-88e1e11236db¤codeÙ¥display_cartpole_episode((runepisode(cartpole_setup.mdps.episodic.discrete; Ï€ = reinforce_test3.policy_sample_action, max_steps = 100_000) |> x -> (x[1], x[2]))...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$ddbca73f-c692-46f2-95f3-a7dd849d33f7„§cell_idÙ$ddbca73f-c692-46f2-95f3-a7dd849d33f7¤codeÙOshow_mountaincar_trajectory(mountaincar_test_train.policy_sample_action, 1_000)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$b4875f2b-5487-429f-80a3-d1032bbccfc1„§cell_idÙ$b4875f2b-5487-429f-80a3-d1032bbccfc1¤codeÙCmd""" ### Policy Gradient Theorem Proof for Continuing Problems 
"""¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$0cd96c44-cae6-421f-9fae-26141600bef4„§cell_idÙ$0cd96c44-cae6-421f-9fae-26141600bef4¤codeÙ¬display_cartpole_episode((runepisode(cartpole_setup.mdps.episodic.discrete; Ï€ = cartpole_continuing_test.policy_sample_action, max_steps = 1_000) |> x -> (x[1], x[2]))...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$51d6337d-c0bd-40a9-9129-7d88e41e4093„§cell_idÙ$51d6337d-c0bd-40a9-9129-7d88e41e4093¤codeÙR#add plot under this to show the action selection or force being applied over time¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$5859ca11-90f8-4fd6-88ed-c56efe796fe8„§cell_idÙ$5859ca11-90f8-4fd6-88ed-c56efe796fe8¤codeÙ¥display_cartpole_episode((runepisode(cartpole_setup.mdps.episodic.discrete; Ï€ = reinforce_test2.policy_sample_action, max_steps = 100_000) |> x -> (x[1], x[2]))...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$3ea08816-705e-4be7-a175-dbd3f3e4c17d„§cell_idÙ$3ea08816-705e-4be7-a175-dbd3f3e4c17d¤codeÙ$md""" # Misc Utilities/Functions """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$f3e2db06-9cb7-464a-96b8-938175efd26b„§cell_idÙ$f3e2db06-9cb7-464a-96b8-938175efd26b¤codeÚ—function setup_fcann_value_arguments(policy_setup::NamedTuple, input_length::Integer, hidden_layers::Vector{Int64}, reslayers::Integer, l2::T, dropout::T, use_Î¼P::Bool, activation_list, scales) where {T<:Real} scale = (reslayers == 0) ? 1 : length(hidden_layers)/(reslayers + 1) + 1 c = scale*last(hidden_layers) f = use_Î¼P ? 
one(T) / c : c^T(-0.5) w_Î¸_out = T.(FCANN.makeorthonormalrand(1, last(hidden_layers)) .* f) w_Î²_out = zeros(T, 1) #value function shares its params with the policy function value_params = deepcopy(policy_setup.params) for i in eachindex(hidden_layers) for j in 1:2 value_params[j][i] = policy_setup.params[j][i] end end #replace the final layer of the value network with something that outputs a single value value_params[1][end] = w_Î¸_out value_params[2][end] = w_Î²_out # value_params = FCANN.initializeparams_saxe(input_length, hidden_layers, 1) #form activations for value network value_activations = FCANN.form_activations(value_params[1]) value_activations[end] = zeros(T, 1) value_tanh_grad_z = deepcopy(value_activations) value_deltas = deepcopy(value_activations) value_function(x, params) = fcann_value_function(x, params, value_activations, reslayers) function update_value_gradient!(âˆ‡vÌ‚, x, value_params) update_fcann_value_gradient!(âˆ‡vÌ‚, x, value_params, hidden_layers, l2, value_tanh_grad_z, value_activations, value_deltas, dropout, reslayers, activation_list, scales) use_Î¼P && scale_fcann_params!(âˆ‡vÌ‚, policy_setup.scales) end return (value_params = value_params, value_gradient = deepcopy(value_params), value_function = value_function, gradient_update = update_value_gradient!) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$b2082ab0-73a4-45a6-8772-a2e6e22b519a„§cell_idÙ$b2082ab0-73a4-45a6-8772-a2e6e22b519a¤codeÚábegin function beta_action_sampler(p1::T, p2::T) where T<:Real isnan(p1) && return p1 isnan(p2) && return p2 Ïµ = eps(zero(T)) Î± = max(Ïµ, exp(p1)) Î² = max(Ïµ, exp(p2)) T(rand(Beta(Î±, Î²))) end beta_action_sampler(params::Vector{T}) where T<:Real = beta_action_sampler(params[1], params[2]) make_beta_n_sampler(::Val{1}) = beta_action_sampler function make_beta_n_sampler(::Val{N}) where N function f(params::Vector{T}) where T<:Real ntuple(i -> beta_action_sampler(params[i], params[i+N]), N) end end make_beta_n_sampler(n::Integer) = make_beta_n_sampler(Val(n)) make_beta_sampler(::T) where T<:Real = beta_action_sampler make_beta_sampler(::NTuple{N, T}) where {N, T<:Real} = make_beta_n_sampler(N) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$a361f4c9-47ce-42ad-899c-87b611c0d471„§cell_idÙ$a361f4c9-47ce-42ad-899c-87b611c0d471¤codeÚ™function update_binary_action_preferences!(action_preferences::Vector{T}, binary_features::BinaryFeatureVector, params::Matrix{T}) where T<:Real @inbounds for i_a in eachindex(action_preferences) action_preferences[i_a] = zero(T) @simd for i in 1:binary_features.num_features j = binary_features.active_features[i] action_preferences[i_a] += params[j, i_a] end end return action_preferences end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$46fea69b-599e-46ab-8455-d2da865d9a8e„§cell_idÙ$46fea69b-599e-46ab-8455-d2da865d9a8e¤codeÙFconst mountaincar_continuing_mdp = create_mountaincar_continuing_mdp()¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÃ«code_foldedÂÙ$bfe7e41d-6318-4bd4-b892-287831876abc„§cell_idÙ$bfe7e41d-6318-4bd4-b892-287831876abc¤codeÚ7begin function update_beta_eligibility_vector!(âˆ‡lnÏ€::Matrix{T}, action_dist_params::Vector{T}, x::Vector{T}, action::T, policy_params::Matrix{T}) where T<:Real Î± = exp(first(action_dist_params)) Î² = 
exp(last(action_dist_params)) c1 = digamma(Î± + Î²) Î´1 = (log(action) + c1 - digamma(Î±))*Î± @inbounds @simd for i in eachindex(x) âˆ‡lnÏ€[i, 1] = x[i]*Î´1 end Î´2 = (log(one(T) - action) + c1 - digamma(Î²))*Î² @inbounds @simd for i in eachindex(x) âˆ‡lnÏ€[i, 2] = x[i]*Î´2 end end function update_beta_eligibility_vector!(âˆ‡lnÏ€::Matrix{T}, action_dist_params::Vector{T}, x::Vector{T}, action::NTuple{N, T}, policy_params::Matrix{T}) where {N, T <: Real} for k = 1:N Î± = exp(action_dist_params[k]) Î² = exp(action_dist_params[k+N]) c1 = digamma(Î± + Î²) Î´1 = (log(action[k]) + c1 - digamma(Î±))*Î± @inbounds @simd for i in eachindex(x) âˆ‡lnÏ€[i, k] = x[i]*Î´1 end Î´2 = (log(one(T) - action[k]) + c1 - digamma(Î²))*Î² @inbounds @simd for i in eachindex(x) âˆ‡lnÏ€[i, k+N] = x[i]*Î´2 end end end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$c251a630-7114-4188-9323-8d8feb5c32e0„§cell_idÙ$c251a630-7114-4188-9323-8d8feb5c32e0¤codeÚCmountaincar_fcann_continuing_parameter_study(layer_size::Integer, num_layers::Integer, args...; kwargs...) 
= actor_critic_fcann_parameter_study(mountaincar_continuing_mdp, mountaincar_fcann_feature_setup.update_feature_vector!, mountaincar_fcann_feature_setup.num_features, fill(layer_size, num_layers), args...; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$af144759-fe66-4ad0-b378-e9eb4e859db4„§cell_idÙ$af144759-fe66-4ad0-b378-e9eb4e859db4¤codeÙNplot_cartpole_policy(reinforce_test4.policy_and_value; s_ref = ep[1][ep_step])¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2„§cell_idÙ$d560b2a0-c571-4ad7-b1c9-83ec03fc8cc2¤codeÙIconst mountaincar_continuous_mdp = create_continuous_action_mountaincar()¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$fb8904a9-ae64-41cc-93b6-5a25855edad0„§cell_idÙ$fb8904a9-ae64-41cc-93b6-5a25855edad0¤codeÙÁfunction get_corridor_episode_stats(p::Real; ntrials=10_000) 1:ntrials |> Map(_ -> runepisode(corridor_mdp; Ï€ = s -> (rand() < p) + 1) |> first |> length) |> foldxt(+) |> a -> a / ntrials end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$a5b002c9-5e11-462a-9da0-6e060c7963f8„§cell_idÙ$a5b002c9-5e11-462a-9da0-6e060c7963f8¤codeÙ¦const ep2 = runepisode(cartpole_setup.mdps.episodic.discrete; Ï€ = reinforce_test5.policy_sample_action, max_steps = 1000, s0 = CartPoleState(30f0, 0.8f0, 0f0, -0f0))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$83640f5b-fe13-4ec1-98a0-67a56c189ba1„§cell_idÙ$83640f5b-fe13-4ec1-98a0-67a56c189ba1¤codeÚ0function actor_critic_with_eligibility_traces!(policy_params::P1, âˆ‡lnÏ€, value_params::P2, âˆ‡vÌ‚, mdp::StateMDP{T, S, A, PTF, F1, F2, F3}, Î»_Î¸::T, Î»_w::T, update_action_preferences!::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, value_function::Function, update_value_gradient!::Function, max_steps::Integer; Î±_w::T = one(T)/10, Î±_Î¸::T = one(T)/10, Î±_rÌ„ = one(T)/10, action_preferences = zeros(T, length(mdp.actions)), z_Î¸::P1 = deepcopy(policy_params), z_w::P2 = 
deepcopy(value_params), save_step_rewards = false) where {P1, P2, T<:Real, S, A, PTF, F1, F2, F3} step_rewards = Vector{T}() #initialize variables step = 1 rÌ„ = zero(T) zero_params!(z_Î¸) zero_params!(z_w) rtot = zero(T) s = mdp.initialize_state() update_feature_vector!(x, s) while step <= max_steps update_value_gradient!(âˆ‡vÌ‚, x, value_params) vÌ‚ = value_function(x, value_params) update_action_preferences!(action_preferences, x, policy_params) soft_max!(action_preferences) i_a = sample_action(action_preferences) update_eligibility_vector!(âˆ‡lnÏ€, action_preferences, x, i_a, policy_params) (r, sâ€²) = mdp.ptf(s, i_a) rtot += r save_step_rewards && push!(step_rewards, r) step += 1 mdp.isterm(sâ€²) && error("$sâ€² is a terminal state and this method only applies to continuing tasks") update_feature_vector!(x, sâ€²) vÌ‚â€² = value_function(x, value_params) Î´ = r - rÌ„ + vÌ‚â€² - vÌ‚ rÌ„ += Î±_rÌ„*Î´ update_traces_with_gradient!(Î»_w, z_w, âˆ‡vÌ‚) update_traces_with_gradient!(Î»_Î¸, z_Î¸, one(T), âˆ‡lnÏ€) update_params_with_gradient!(value_params, Î±_w*Î´, z_w) update_params_with_gradient!(policy_params, Î±_Î¸*Î´, z_Î¸) s = sâ€² end function_outputs = form_state_and_policy_function_outputs(update_feature_vector!, update_action_preferences!, value_function, x, action_preferences, policy_params, value_params) return (; step_rewards = step_rewards, total_reward = rtot, total_steps = step - 1, policy_parameters = policy_params, value_parameters = value_params, function_outputs...) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$61650a97-b353-4a85-b50b-93fee296ac7b„§cell_idÙ$61650a97-b353-4a85-b50b-93fee296ac7b¤codeÙqconst cartpole_fcann_feature_setup = fcann_feature_vector_setup(cartpole_setup.min_vals, cartpole_setup.max_vals)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$602a07dd-8928-4b44-97e5-01c5cbf38351„§cell_idÙ$602a07dd-8928-4b44-97e5-01c5cbf38351¤codeÚfunction plot_cartpole_policy(policy_and_value::Function; Î¸Ì‡_range = 1, npoints = 100, s_ref::CartPoleState = CartPoleState()) Î¸s = LinRange(-1.2f0, 1.2f0, npoints) Î¸Ì‡s = LinRange(-10f0, 10f0, npoints) value_output = zeros(Float32, npoints, npoints) policy_outputs = [zeros(Float32, npoints, npoints) for _ in 1:3] x = s_ref.x xÌ‡ = s_ref.xÌ‡ policy_output = policy_and_value(s_ref) policy_plot = plot(bar(x = [-1, 0, 1], y = policy_output.action_probabilities), Layout(height = 350, xaxis_title = "Policy Action", yaxis_title = "Action Probability", title = "Policy Distribution Function")) for i in 1:npoints for j in 1:npoints s = CartPoleState(x, Î¸s[i], xÌ‡, Î¸Ì‡s[j]) output = policy_and_value(s) value_output[i, j] = output.state_value_estimate for i_a in 1:3 policy_outputs[i_a][i, j] = output.action_probabilities[i_a] end end end reference_trace = scatter(x = [s_ref.Î¸], y = [s_ref.Î¸Ì‡], name = "reference state", marker_color = "black", marker_symbol = "x") value_plot = plot([heatmap(x = Î¸s, y = Î¸Ì‡s, z = value_output, name = "value function"), reference_trace], Layout(xaxis_title = "Pole Angle in Radians", yaxis_title = "Pole Angular Velocity", title = "Value Estimate for x = $x and xÌ‡ = $xÌ‡", height = 350)) policy_plots = [plot([heatmap(x = Î¸s, y = Î¸Ì‡s, z = policy_outputs[i_a], zmin = 0, zmax = 1), reference_trace], Layout(title = "Action $i_a", xaxis_title = "Pole Angle in Radians", yaxis_title = "Pole Angular Velocity", height = 350)) for i_a in 1:3] @htl("""

$(vcat(value_plot, policy_plot))

$policy_plots

""") # value_traces = [begin # states = [CartPoleState(0f0, Î¸, 0f0, Î¸Ì‡) for Î¸ in Î¸s] # output = [policy_and_value(s) for s in states] # scatter(x = Î¸s, y = [a.state_value_estimate for a in output], name = "Î¸Ì‡ = $Î¸Ì‡") # end # for Î¸Ì‡ in Î¸Ì‡s] # plot(value_traces, Layout(xaxis_title = "Pole Angle in Radians", yaxis_title = "State Value Estimate")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$f7433324-acc3-49a5-b5b3-ada0c8f09d52„§cell_idÙ$f7433324-acc3-49a5-b5b3-ada0c8f09d52¤code¸runepisode(corridor_mdp)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$0c9986bb-54c0-4b08-9c29-4bfb0b68b54e„§cell_idÙ$0c9986bb-54c0-4b08-9c29-4bfb0b68b54e¤codeÚ2function collect_state_distributions(;num_episodes::Integer = 1_000_000, p::T = 0.5) where T<:Real function add_vecs(x::Array{T, N}, y::Array{T, N}) where {T<:Real, N} l1 = size(x, 1) l2 = size(y, 1) (l1 == l2) && return x .+ y if l1 > l2 out = copy(x) for i in 1:l2 view(out, i, :) .+= view(y, i, :) end else out = copy(y) for i in 1:l1 view(out, i, :) .+= view(x, i, :) end end return out end function Ï€(s) rand(T) <= p && return 1 return 2 end counts = 1:num_episodes |> Map() do _ (states, actions, rewards, _, l) = runepisode(corridor_mdp; Ï€ = Ï€) state_visits = zeros(T, l, 3) @inbounds @simd for i in eachindex(states) s = states[i] state_visits[i, s] += one(T) end return state_visits end |> foldxt(add_vecs) counts ./ num_episodes end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$6d0925d3-af96-4b94-8e2e-4941cce39e51„§cell_idÙ$6d0925d3-af96-4b94-8e2e-4941cce39e51¤codeÚ const mountaincar_test_train = actor_critic_with_eligibility_traces_binary_features(MountainCarTask.mdp, 0.1f0, 0.9f0, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, typemax(Int64), 100_000; Î±_Î¸ = 0.008f0, Î±_w = 
0.004f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$6bb0263e-368e-462a-948c-baf9cfa82512„§cell_idÙ$6bb0263e-368e-462a-948c-baf9cfa82512¤code¾get_corridor_features(s) = 1:1¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$72273f27-d0b9-4645-a609-cb65cc9332ee„§cell_idÙ$72273f27-d0b9-4645-a609-cb65cc9332ee¤codeÙÄactor_critic_with_eligibility_traces_binary_features(corridor_mdp, 0f0, 0f0, get_corridor_features, 1, 100_000, Î±_Î¸ = 2f0 ^ -4, Î±_w = 2f0 ^ -10, policy_params = [0f0 3.7f0]).policy_and_value(1)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$87482ea5-5265-4e02-92c0-1a8bb44ff0f4„§cell_idÙ$87482ea5-5265-4e02-92c0-1a8bb44ff0f4¤codeÚUfunction actor_critic_binary_continuing_squashed_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, amax::A, get_active_features::Function, num_features::Integer, Î»_Î¸::T, Î»_w::T, Î±_Î¸_list::AbstractVector{T}, Î±_w_list::AbstractVector{T}, Î±_rÌ„::T, max_steps::Integer; nruns::Integer = 100, seed = rand(UInt64), init_policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), init_value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) function average_runs(Î±_Î¸, Î±_w) 1:nruns |> Map(_ -> actor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions(mdp, Î»_Î¸, Î»_w, get_active_features, num_features, max_steps; Î±_Î¸ = Î±_Î¸, Î±_w = Î±_w, Î±_rÌ„ = Î±_rÌ„, policy_params = copy(init_policy_params), value_params = copy(init_value_params), kwargs...).total_reward) |> foldxt(+) |> x -> x / nruns / max_steps end traces = [begin scatter(x = Î±_Î¸_list, y = average_runs.(Î±_Î¸_list, Î±_w), name = "Î±_w = $Î±_w") end for Î±_w in Î±_w_list] plot(traces, Layout(xaxis_title = "Î±_Î¸", yaxis_title = "Average Reward Per Step in the First
$max_steps Steps Averaged Over $nruns Runs", xaxis_type = "log", title = "Binary Feature Encoding with $num_features Features, Î»_Î¸ = $Î»_Î¸, Î»_w = $Î»_w, Î±_rÌ„ = $Î±_rÌ„")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$3bafd7df-9bc0-4d13-874d-739590cf3ad9„§cell_idÙ$3bafd7df-9bc0-4d13-874d-739590cf3ad9¤codeÚSmd""" > ### *Exercise 13.2* > Generalize the proof of the policy gradient theorem and the steps leading to the REINFORCE update equation (13.8), so that (13.8) ends up with a factor of $\gamma^t$ and thus aligns with the general algorithm given in the pseudocode. See proof above in the section on proving the policy gradient theorem. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$f27f2bcd-05b6-44fe-bf9e-a3e51556db7c„§cell_idÙ$f27f2bcd-05b6-44fe-bf9e-a3e51556db7c¤codeÙ6const cartpole_functions = create_cartpole_functions()¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÃ«code_foldedÂÙ$41dc149d-c6f3-4b0d-a856-06f3aaae3049„§cell_idÙ$41dc149d-c6f3-4b0d-a856-06f3aaae3049¤codeÙ{mutable struct BinaryEligibilityVector{T, B <: BinaryFeatureVector} binary_features::B i_a::Int64 Ï€_dist::Vector{T} end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$38e5d800-4d43-40d2-87ea-f7d4b4283dab„§cell_idÙ$38e5d800-4d43-40d2-87ea-f7d4b4283dab¤codeÚ†md""" In order to find the p that maximizes the expected value for state 1, we should differentiate by p and set the result to 0 $\frac{\partial v_1}{\partial p} = -\frac{2p(1-p) - 2(1+p)(1 - 2p)}{p^2(1-p)^2}$ Setting this equal to 0 implies $\begin{flalign} p-p^2 &= 1 - 2p + p - 2p^2\\ p^2 + 2p - 1 &= 0 \\ \end{flalign}$ Using the quadratic equation, there are two solutions but since we know p has to be positive we only take that one. $p = \frac{-2 \pm \sqrt{4 + 4}}{2} = \frac{-2 \pm 2\sqrt{2}}{2} = -1 \pm \sqrt{2} \implies p = \sqrt{2} - 1 \approx 0.41421$ So, in order to maximize the value at state 1, we have $p_{\text{left}} \approx 0.414$ and $p_{\text{right}} \approx 0.586$. 
That also implies that $v_1 = -2\frac{1+p}{p(1-p)} = -2\frac{\sqrt{2}}{(\sqrt{2}-1)(2 - \sqrt{2})}= \frac{-2\sqrt{2}}{2 \sqrt{2} - 2 - 2 + \sqrt{2}} = \frac{-2 \sqrt{2}}{3\sqrt{2} - 4} \approx -11.657$ """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2„§cell_idÙ$aa797ac6-5c79-4bc2-942f-7e2c6cdfaaa2¤codeÚ"function one_step_actor_critic_binary_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, max_episodes::Integer, max_steps::Integer; policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} setup = setup_binary_policy_arguments(mdp, get_active_features, num_features) one_step_actor_critic!(policy_params, setup.eligibility_vector, value_params, BinaryFeatureVector(), mdp, update_binary_action_preferences!, update_binary_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, binary_value_function, update_binary_value_gradient!, max_episodes, max_steps; action_preferences = setup.action_preferences, kwargs...) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$73b90260-d57a-449a-8db6-47f91e6a4e4f„§cell_idÙ$73b90260-d57a-449a-8db6-47f91e6a4e4f¤codeÙ5md""" ### Eligibility Vector with Binary Features """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$5aba4f96-e877-457e-8e95-18737348f99f„§cell_idÙ$5aba4f96-e877-457e-8e95-18737348f99f¤codeÚ_actor_critic_fcann_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, hidden_layers::Vector{Int64}, params::@NamedTuple{Î»_Î¸::T, Î»_w::T, Î±_rÌ„::T, Î±_Î¸_min::Int64, Î±_w_min::Int64}, num_Î¸::Integer, num_w::Integer, max_steps::Integer; kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_fcann_parameter_study(mdp, update_feature_vector!, num_features, hidden_layers, params.Î»_Î¸, params.Î»_w, params.Î±_rÌ„, 2f0 .^(params.Î±_Î¸_min:params.Î±_Î¸_min+num_Î¸-1), 2f0 .^(params.Î±_w_min:params.Î±_w_min+num_w-1), max_steps; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486„§cell_idÙ$fed4dc4c-0d1c-4ee3-9d0e-8ef2a7db7486¤codeÙ@bind mountaincar_continuing_binary_params create_actor_critic_continuing_params_UI(Î»_Î¸ = 0.1f0, Î»_w = 0.98f0, log2Î±_Î¸ = -5, log2Î±_w = -8)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$27487ad0-4779-42ce-8def-e660ef04bee0„§cell_idÙ$27487ad0-4779-42ce-8def-e660ef04bee0¤codeÙZreinforce_test4.policy_and_value(cartpole_setup.mdps.episodic.discrete.initialize_state())¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$0d93132d-5819-47dc-8cf2-462d480d9c3d„§cell_idÙ$0d93132d-5819-47dc-8cf2-462d480d9c3d¤codeÚ€if run_mountaincar_binary_episodic_countinuous_param_study2 > 0 actor_critic_binary_episodic_squashed_gaussian_parameter_study(mountaincar_continuous_mdp, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, mountaincar_binary_continuous_params2, 4, 3, 1000; max_steps = 100_000, seed = 45) else md""" Waiting to run parameter study """ end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$9978d537-49ff-4014-a971-b42704c50a6b„§cell_idÙ$9978d537-49ff-4014-a971-b42704c50a6b¤codeÙ@bind fcann_cartpole_study_params create_actor_critic_fcann_params_UI(;Î»_Î¸ = 0.95f0, Î»_w = 0.2f0, h = 16, log2Î±_Î¸ = -10, log2Î±_w = -11)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$f8215517-b18f-4a03-9421-8edab4ca8089„§cell_idÙ$f8215517-b18f-4a03-9421-8edab4ca8089¤codeÙ`show_squashed_policy(mountaincar_continuous_test_train3.policy_function, 
test_mountaincar_state)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$1ac9296f-047b-4051-ba5c-0c23d5f9cde9„§cell_idÙ$1ac9296f-047b-4051-ba5c-0c23d5f9cde9¤codeÙ>const corridor_continuing_mdp = make_corridor_continuing_mdp()¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$c87dba8c-9a96-41b3-9dc7-a6c088ec1eaf„§cell_idÙ$c87dba8c-9a96-41b3-9dc7-a6c088ec1eaf¤codeÙfshow_mountaincar_continuous_trajectory(mountaincar_continuous_test_train.policy_sample_action, 10_000)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$5cc4d12d-b537-47e2-8109-4e7a234fdf25„§cell_idÙ$5cc4d12d-b537-47e2-8109-4e7a234fdf25¤codeÚ×function make_corridor_mdp() function step(s::Integer, i_a::Integer) Î´ = 2*i_a - 3 #calculates the s change -1 for left (1) and 1 for right (2) switch = iseven(s) #returns true in state 2 which is where actions are switched, when switch is true, multiply Î´ by -1, otherwise by 1 c = 1 - 2*switch sâ€² = max(1, s + c*Î´) (-1f0, sâ€²) end actions = [:left, :right] ptf = StateMDPTransitionSampler(step, 1) StateMDP(actions, ptf, () -> 1, s -> s == 4) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$5334064b-5a16-4135-afa0-86a48291725b„§cell_idÙ$5334064b-5a16-4135-afa0-86a48291725b¤codeÙ corridor_train.value_function(1)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$9c342958-1971-48ec-b919-5dfdcbc915a4„§cell_idÙ$9c342958-1971-48ec-b919-5dfdcbc915a4¤codeÙcmd""" #### Change Plot Background Color $(@bind bgcolor ColorStringPicker(default = "#121212")) """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$966ef17c-23be-49dc-bc37-4cb52b34c049„§cell_idÙ$966ef17c-23be-49dc-bc37-4cb52b34c049¤codeÙ$md""" #### Neural Network Method """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$e7e49ff8-32df-48a4-afb2-462859592e92„§cell_idÙ$e7e49ff8-32df-48a4-afb2-462859592e92¤codeÚ1function form_state_and_policy_function_outputs(update_feature_vector!::Function, update_action_preferences!::Function, 
value_function::Function, feature_vector, action_preferences::Vector, policy_params, value_params) Ï€! = form_state_policy_function(update_feature_vector!, update_action_preferences!) Ï€(s; x = deepcopy(feature_vector), action_preferences = copy(action_preferences)) = Ï€!(x, action_preferences, s, policy_params) Ï€_sample(s; kwargs...) = sample_action(Ï€(s; kwargs...)) v! = form_state_value_function(update_feature_vector!, value_function) estimate_state_value(s; x = deepcopy(feature_vector)) = v!(x, s, value_params) function policy_and_value(s; x = deepcopy(feature_vector), action_preferences = copy(action_preferences)) Ï€!(x, action_preferences, s, policy_params) vÌ‚ = value_function(x, value_params) return (action_probabilities = action_preferences, state_value_estimate = vÌ‚) end (policy_function = Ï€, policy_sample_action = Ï€_sample, estimate_state_value = estimate_state_value, policy_and_value = policy_and_value) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$78c83673-2117-4542-b4d8-1c243e8f610b„§cell_idÙ$78c83673-2117-4542-b4d8-1c243e8f610b¤codeÚmd""" #### Eligibility Vector Recall for the gaussian case and linear approximation we had: $\begin{flalign} \pi(a \vert s, \boldsymbol{\theta}) &= \frac{1}{\sqrt{2 \pi \sigma(s, \boldsymbol{\theta})^2}} \exp \left ( - \frac{(a - \mu(s, \boldsymbol{\theta}))^2}{2 \sigma(s, \boldsymbol{\theta})^2} \right )\\ \mu(s, \boldsymbol{\theta}) & \doteq \boldsymbol{\theta}_\mu ^ \top \mathbf{x}_\mu(s) \\ \sigma(s, \boldsymbol{\theta}) & \doteq \exp \left ( \boldsymbol{\theta}_\sigma ^ \top \mathbf{x}_\sigma(s) \right ) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\mu) &= \frac{1}{\sigma(s, \boldsymbol{\theta})^2} \left ( a - \mu(s, \boldsymbol{\theta}) \right ) \mathbf{x}_\mu(s) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\sigma) &= \left (\frac{(a - \mu(s, \boldsymbol{\theta}))^2}{\sigma(s, \boldsymbol{\theta})^2} \right )\mathbf{x}_\sigma(s) \\ \end{flalign}$ For the squashed gaussian we can apply 
the previous results to the new pdf: $\begin{flalign} \pi(a \vert s, \boldsymbol{\theta}) &= \frac{1}{\sqrt{2 \pi \sigma(s, \boldsymbol{\theta})^2}} \exp \left ( - \frac{(\tanh^{-1}(a) - \mu(s, \boldsymbol{\theta}))^2}{2 \sigma(s, \boldsymbol{\theta})^2} \right ) \left \vert \frac{1}{1 - a^2} \right \vert\\ \mu(s, \boldsymbol{\theta}) & \doteq \boldsymbol{\theta}_\mu ^ \top \mathbf{x}_\mu(s) \\ \sigma(s, \boldsymbol{\theta}) & \doteq \exp \left ( \boldsymbol{\theta}_\sigma ^ \top \mathbf{x}_\sigma(s) \right ) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\mu) &= \frac{1}{\sigma(s, \boldsymbol{\theta})^2} \left ( \tanh^{-1}(a) - \mu(s, \boldsymbol{\theta}) \right ) \mathbf{x}_\mu(s) \\ \nabla \ln \pi(a \vert s, \boldsymbol{\theta}_\sigma) &= \left (\frac{(\tanh^{-1}(a) - \mu(s, \boldsymbol{\theta}))^2}{\sigma(s, \boldsymbol{\theta})^2} \right )\mathbf{x}_\sigma(s) \\ \end{flalign}$ """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8f„§cell_idÙ$a6be9a4c-d43b-4867-b7a2-07a46a9d0d8f¤codeÙ‘show_mountaincar_continuous_trajectory(mountaincar_continuous_test_train_beta.policy_sample_action, 1_000; mdp = mountaincar_continuous_beta_mdp)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$396e0047-d848-462f-a769-0cc2829abc78„§cell_idÙ$396e0047-d848-462f-a769-0cc2829abc78¤codeÙÖactor_critic_with_eligibility_traces_binary_features(corridor_mdp, .5f0, .5f0, get_corridor_features, 1, typemax(Int64), 100_000, Î±_Î¸ = 2f0 ^ -4, Î±_w = 2f0 ^ -10, policy_params = [0f0 3.7f0]).policy_and_value(1)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$ff4f977e-48df-4c12-845c-c245b4d39d6d„§cell_idÙ$ff4f977e-48df-4c12-845c-c245b4d39d6d¤codeÚífunction actor_critic_linear_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, feature_function::Function, num_features::Integer, Î»_Î¸_list::AbstractVector{T}, Î»_w_list::AbstractVector{T}, Î±_rÌ„_list::AbstractVector{T}, Î±_Î¸_list::AbstractVector{T}, 
Î±_w_list::AbstractVector{T}, num_tests::Integer, max_steps::Integer; nruns::Integer = 100, seed = rand(UInt64), init_policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), binary_features = false, kwargs...) where {T<:Real, S, A, P, F1, F2, F3} if binary_features algo = actor_critic_with_eligibility_traces_binary_features title_prefix = "Binary Feature Encoding" else algo = actor_critic_with_eligibility_traces_linear_features title_prefix = "Linear Encoding" end run_test(Î±_Î¸, Î±_w, Î±_rÌ„, Î»_Î¸, Î»_w) = average_continuing_runs(nruns, seed, Î±_Î¸, Î±_w, Î±_rÌ„, init_policy_params, algo, mdp, Î»_Î¸, Î»_w, feature_function, num_features, max_steps; kwargs...) test_params = [(Î±_Î¸ = rand(Î±_Î¸_list), Î±_w = rand(Î±_w_list), Î±_rÌ„ = rand(Î±_rÌ„_list), Î»_Î¸ = rand(Î»_Î¸_list), Î»_w = rand(Î»_w_list)) for _ in 1:num_tests] DataFrame([begin output = run_test(params...) (;params..., output = output) end for params in test_params]) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$aa450da4-fe84-4eea-b6c4-9820b7982437„§cell_idÙ$aa450da4-fe84-4eea-b6c4-9820b7982437¤codeÚ:md""" With continuous policy parametrization, we can smoothly very action selection probabilities by arbitrarily small amounts, something that was not possible with Ïµ-greedy action selection. Therefore stronger convergence guarantees are possible for policy-gradient methods than for action-value methods. In the episodic case, assuming some particular non-random starting state $s_0$, we define the performance of a policy parametrized by *Î¸* as: $\begin{align} J(\mathbf{\theta}) \doteq v_{\pi_\mathbf{\theta}}(s_0) \tag{13.4} \end{align}$ where $v_{\pi_\mathbf{\theta}}$ is the true value function for $\pi_\mathbf{\theta}$, the policy determined by $\mathbf{\theta}$. 
The *policy gradient theorem* provides an analytic expression for the gradient of performance with respect to the policy parameter that does *not* involve the derivative of the state distribution: $\begin{align} \nabla J(\mathbf{\theta}) \propto \sum_s \mu (s) \sum_a q_\pi (s, a) \nabla \pi (a|s,\mathbf{\theta}) \tag{13.5} \end{align}$ where the gradients are column vectors of partial derivatives with respect to the components of $\mathbf{\theta}$. In the episodic case, the constant of proportionality is the average length of an episode, and in the continuing case it is 1. The distribution here $\mu$ is the on-policy distribution under $\pi$. """ ¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$bb1ef180-39ac-475f-beea-ef573e71a3bf„§cell_idÙ$bb1ef180-39ac-475f-beea-ef573e71a3bf¤codeÙ7display_cartpole_episode((ep2 |> x -> (x[1], x[2]))...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27„§cell_idÙ$eae6493e-81b6-4d99-a9c6-6e75d3b3dc27¤codeÚconst cartpole_continuing_fcann_test = actor_critic_with_eligibility_traces_fcann(cartpole_continuing_mdp, 0.25f0, 0.1f0, cartpole_fcann_feature_setup.num_features, [4, 4], cartpole_vector_update!, 300_000, Î±_Î¸ = 0.015f0, Î±_w = 0.125f0, Î±_rÌ„ = 0.01f0; save_step_rewards=true)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$5b868eba-c1af-49f6-8f93-79b78c319a6f„§cell_idÙ$5b868eba-c1af-49f6-8f93-79b78c319a6f¤codeÚ ˜#version of reinforce for general function approximation function reinforce_with_baseline_monte_carlo_control!(policy_params, âˆ‡lnÏ€, value_params, âˆ‡vÌ‚, mdp::ContinuousMDP{T, S, A, PTF, F1, F2, F3}, update_action_distribution!::Function, action_dist_params::Vector{T}, action_sampler::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, value_function::Function, update_value_gradient!::Function, max_episodes::Integer; Î±_w::T = one(T)/10, Î±_Î¸::T = one(T)/10, Î³::T = one(T), epkwargs...) 
where {T<:Real, S, A, PTF, F1, F2, F3} rewards = zeros(T, max_episodes) steps = zeros(Int64, max_episodes) Ï€! = form_state_continuous_policy_function(update_feature_vector!, update_action_distribution!) Ï€(s) = Ï€!(x, action_dist_params, s, policy_params) Ï€_sample(s) = action_sampler(Ï€(s)) v! = form_state_value_function(update_feature_vector!, value_function) estimate_state_value(s) = v!(x, s, value_params) state_history, action_history, reward_history, _, _ = runepisode(mdp, Ï€_sample, max_steps = 0) #initialize variables to update episodes for i in eachindex(rewards) # @info "On episode $i of $max_episodes" state_history, action_history, reward_history, sterm, nsteps = runepisode!((state_history, action_history, reward_history), mdp, Ï€_sample, epkwargs...) g = zero(T) rtotal = zero(T) #iterate through episode beginning at the end for i in nsteps:-1:1 g = (Î³ * g) + reward_history[i] update_feature_vector!(x, state_history[i]) vÌ‚ = value_function(x, value_params) Î´ = g - vÌ‚ update_value_gradient!(âˆ‡vÌ‚, x, value_params) c = Î±_w*Î´ update_params_with_gradient!(value_params, c, âˆ‡vÌ‚) update_eligibility_vector!(âˆ‡lnÏ€, action_dist_params, x, action_history[i], policy_params) c = Î±_Î¸ * Î³^(i-1) * Î´ update_params_with_gradient!(policy_params, c, âˆ‡lnÏ€) rtotal += reward_history[i] end rewards[i] = rtotal steps[i] = nsteps end Ï€2(s; feature_vector = deepcopy(x), action_dist_params = copy(action_dist_params)) = Ï€!(feature_vector, action_dist_params, s, policy_params) Ï€_sample2(s; kwargs...) 
= action_sampler(Ï€2(s; kwargs...)) function policy_and_value(s::S) Ï€!(x, action_dist_params, s, policy_params) vÌ‚ = value_function(x, value_params) return (action_distribution_parameters = action_dist_params, sampler_function = () -> action_sampler(action_dist_params), state_value_estimate = vÌ‚) end return (episode_rewards = rewards, episode_steps = steps, policy_function = Ï€2, policy_sample_action = Ï€_sample2, policy_parameters = policy_params, estimate_state_value = estimate_state_value, value_parameters = value_params, policy_and_value = policy_and_value) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$68469a40-7976-48b7-b7a1-eaa4c5f33a18„§cell_idÙ$68469a40-7976-48b7-b7a1-eaa4c5f33a18¤codeÚ)function plot_mountaincar_continuous_values(policy_and_value::Function; n1 = 100, n2 = 100) xvals = LinRange(-1.2f0, 0.5f0, n1) vvals = LinRange(-0.07f0, 0.07f0, n2) values = zeros(Float32, n1, n2) action_p1 = zeros(Float32, n1, n2) action_p2 = zeros(Float32, n1, n2) for (i, x) in enumerate(xvals) for (j, v) in enumerate(vvals) dist, vÌ‚ = policy_and_value((x, v)) values[j, i] = vÌ‚ action_p1[j, i] = dist[1] action_p2[j, i] = dist[2] end end p1 = plot(heatmap(x = xvals, y = vvals, z = values), Layout(xaxis_title = "position", yaxis_title = "velocity", title = "Learned Value Function", height = 400)) p2 = plot(heatmap(x = xvals, y = vvals, z = action_p1, colorscale = "rb"), Layout(xaxis_title = "position", yaxis_title = "velocity", title = "Policy Parameter 1", height = 400)) p3 = plot(heatmap(x = xvals, y = vvals, z = action_p2, colorscale = "rb"), Layout(xaxis_title = "position", yaxis_title = "velocity", title = "Policy Parameter 2", height = 400)) @htl("""

$p1 $p2 $p3

""") end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$2a586e46-66e4-461a-85c8-5817e4d1aa43„§cell_idÙ$2a586e46-66e4-461a-85c8-5817e4d1aa43¤codeÚ³md""" $\begin{flalign} \nabla J(\boldsymbol{\theta}) &= \nabla v_\pi(s_0) \\ &= \sum_s \sum_k \gamma^k \Pr \{ s_0 \rightarrow s, k, \pi \} f(s) \\ &= \sum_s \sum_k \gamma^k \frac{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}}{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}} \Pr \{ s_0 \rightarrow s, k, \pi \} f(s) \tag{multiply by 1}\\ &= \eta \sum_s \sum_k \gamma^k \frac{\Pr \{ s_0 \rightarrow s, k, \pi \}}{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}} f(s) \tag{average episode length}\\ &= \eta \sum_s \sum_k \gamma^k \mu_\pi(s, k) f(s) \tag{on policy distribution over states and steps}\\ &= \eta \mathbb{E}_\pi[ \gamma^k f(s) \mid S_0 = s_0, S_k = s] \tag{definition of expected value}\\ &\propto \mathbb{E}_\pi \left [ \gamma^k \sum_a \nabla \pi(a \vert s) q_\pi(s, a) \mid S_0 = s_0, S_k = s \right ] \tag{13.5}\\ \end{flalign}$ """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$a206c759-3f6e-4003-8cba-5f6ce6742646„§cell_idÙ$a206c759-3f6e-4003-8cba-5f6ce6742646¤codeÙ¿md""" ### Figure 13.1 REINFORCE on short-corridor gridworld (Example 13.1). Performance varies with step size but can approach the ideal. Feature vector encodes every state identically. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1d„§cell_idÙ$fc3dcd26-c5cf-4141-bf6c-eaed5fc9bb1d¤codeÚmd""" Consider the linear parameterization proposed with $h_a = \boldsymbol{\theta}^\top \mathbf{x}(s, a)$: $\frac{\partial{h_a}}{\partial{\theta_i}} = \mathbf{x}(s, a)_i \implies \nabla(\pi(a \vert s, \boldsymbol{\theta}))_i = \pi_a \left ( \mathbf{x}(s, a)_i - \sum_k \pi_k \mathbf{x}(s, k)_i \right)$ Now consider $\mathbf{h} = \theta ^ \top \mathbf{x}$ with $h_a = \mathbf{h}_a$. 
Since the parameters are now represented as a matrix, we can also index the gradient partial derivatives such that $\nabla \left ( f(\theta) \right )_{i, j} = \frac{\partial f(\theta)}{\theta_{i, j}}$ $\frac{\partial{h_a}}{\partial{\theta_{i, j}}} = \begin{cases} \mathbf{x}(s)_i, & \text{ if } j = a \\ 0, & \text{ else } \end{cases} \implies \nabla(\pi(s, \boldsymbol{\theta})_a)_{i, j} = \pi_a \left ( \frac{\partial h_a}{\partial \theta_{i, j}} - \sum_k \pi_k \frac{\partial h_k}{\partial \theta_{i, j}} \right)=\pi_a \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases}$ """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5„§cell_idÙ$3cfd63ad-b1a2-4b99-ae97-2ff10351e4f5¤codeÙ+md""" ### Beta Distribution Alternative """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$31db0f58-28e4-454f-9394-25565687266f„§cell_idÙ$31db0f58-28e4-454f-9394-25565687266f¤codeÙxdisplay_cartpole_episode((runepisode(cartpole_mdps.episodic.continuous, s -> Float32(randn())) |> x -> (x[1], x[2]))...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$822e4d69-2582-4956-858e-06ecb091e76a„§cell_idÙ$822e4d69-2582-4956-858e-06ecb091e76a¤codeÚ^function display_cartpole_episode(states::Vector{S}, actions::Vector) where S<:CartPoleState fields = [:x, :Î¸, :xÌ‡, :Î¸Ì‡] names = ["x", "Î¸", "xÌ‡", "Î¸Ì‡"] yaxes = ["y", "y2", "y", "y2"] x = [s.t for s in states] #time history in seconds state_traces = [begin y = [getfield(s, f) for s in states] scatter(x = x, y = y, name = names[i], yaxis = yaxes[i]) end for (i, f) in enumerate(fields)] plot(state_traces, Layout(xaxis_title = "Time(s)", yaxis_title = "Horizontal Position", yaxis2 = attr(title = "Pole Angle (Radians)", overlaying = "y", side = "right"), legend_orientation = "h")) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580a„§cell_idÙ$d7f6ff79-3c0f-4f16-aa1c-3bc534ce580a¤codeÙVplot_mountaincar_continuous_values(mountaincar_continuous_test_train.policy_and_value)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$05b0fcad-628b-48d2-aa24-f6f562dbb660„§cell_idÙ$05b0fcad-628b-48d2-aa24-f6f562dbb660¤codeÚ_md""" $\begin{flalign} &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) \sum_{a^{\prime \prime}} [ \nabla \pi(a^{\prime \prime} \vert s^{\prime \prime}) q_\pi(s^{\prime \prime}, a^{\prime \prime})\right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) \sum_{s^{\prime \prime}} p(s^{\prime \prime} \vert s^\prime, a^\prime) f(s^{\prime \prime}) \right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \sum_{s^{\prime \prime}} f(s^{\prime \prime}) \sum_{a^\prime} \pi(a^\prime \vert s^\prime) p(s^{\prime \prime} \vert s^\prime, a^\prime) \right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \mathbb{E}_\pi[f(s^{\prime \prime}) \vert s^\prime] \right ] \\ &\gamma^2 \mathbb{E}_\pi[f(s^{\prime \prime}) \vert s] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) g(s^\prime) \right ] \\ &\gamma^2 \sum_a \left [ \pi(a \vert s) \mathbb{E}[g(s^\prime) \vert s, a] \right ] \\ &\gamma^2 \mathbb{E}_\pi[g(s^\prime) \vert s]\\ &\gamma^2 \sum_{s^{\prime \prime}} \Pr(s \rightarrow s^{\prime \prime}, 2, \pi) f(s^{\prime \prime}) \end{flalign}$ """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$d2729657-d0bf-4d39-8ec7-f242a1ad48d6„§cell_idÙ$d2729657-d0bf-4d39-8ec7-f242a1ad48d6¤codeÚfunction create_continuous_action_mountaincar_beta() #if we sample actions from a beta 
distribution then the action will always be bounded between 0 and 1. this step function rescales it to -1 to 1 mdp = MountainCarTask.mdp function step(s, a) f = 2f0*(a - 0.5f0) (-1f0, MountainCarTask.step(s, f)) end ContinuousMDP(step, mdp.initialize_state, 0f0; isterm = mdp.isterm) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$5c11a92d-7496-4aba-af15-2537eac49dd7„§cell_idÙ$5c11a92d-7496-4aba-af15-2537eac49dd7¤codeÙ Map(_ -> actor_critic_with_eligibility_traces_fcann(mdp, Î»_Î¸, Î»_w, num_features, hidden_layers, update_feature_vector!, max_episodes, max_steps; Î±_Î¸ = Î±_Î¸, Î±_w = Î±_w, kwargs...) |> x -> isempty(x.episode_rewards) ? missing : mean(x.episode_rewards)) |> Filter(!ismissing) |> tcollect |> x -> isempty(x) ? missing : mean(x) end traces = [begin scatter(x = Î±_Î¸_list, y = average_runs.(Î±_Î¸_list, Î±_w), name = "Î±_w = $Î±_w") end for Î±_w in Î±_w_list] plot(traces, Layout(xaxis_title = "Î±_Î¸", yaxis_title = "Average Reward Per Episode in the First
$max_episodes Episodes Averaged Over $nruns Runs", xaxis_type = "log", title = "$num_features Inputs, $hidden_layers Hidden Non Linear, Î»_Î¸ = $Î»_Î¸, Î»_w = $Î»_w")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$76eb6743-cac0-4174-9ba3-a0691c200b54„§cell_idÙ$76eb6743-cac0-4174-9ba3-a0691c200b54¤codeÙ©begin make_n_param_dist_params(n::Integer, ::T) where T<:Real = zeros(T, n) make_n_param_dist_params(n::Integer, ::NTuple{N, T}) where {N, T<:Real} = zeros(T, n*N) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$94517664-6988-44dc-a297-e9d5873ee540„§cell_idÙ$94517664-6988-44dc-a297-e9d5873ee540¤codeÚN@bind squashed_gaussian_plot_params PlutoUI.combine() do Child md""" ### Squashed Gaussian Plot Parameters $$\mu$$: $(Child(Slider(-4:.1:4, default = 0, show_value=true))) $$\sigma$$: $(Child(Slider(0.1:0.1:2, default = .5, show_value=true))) maximum value: $(Child(Slider(.1:0.1:2., default = 1, show_value=true))) """ end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$d037ea92-915c-4dc7-97c6-d006d92e088a„§cell_idÙ$d037ea92-915c-4dc7-97c6-d006d92e088a¤codeÚefunction figure_13_1(Î±_list; nruns = 100, num_episodes = 1_000, max_steps = 1_000) Random.seed!(45) function average_runs(Î±) 1:nruns |> Map(_ -> reinforce_monte_carlo_control_binary_features(corridor_mdp, get_corridor_features, 1, num_episodes; params = [0f0 3.7f0], Î± = Î±, max_steps = max_steps).episode_rewards) |> foldxt((a, b) -> a .+ b) |> v -> v ./ nruns end traces = [begin out = average_runs(Î±) scatter(x = 1:num_episodes, y = out, name = "Î± = 2^$(round(Int64, log2(Î±)))") end for Î± in Î±_list] baselinetrace = scatter(x = 1:num_episodes, y = fill(-2*sqrt(2) / (3*sqrt(2) - 4), num_episodes), name = "ideal value", line_dash = "dash", line_color = "gray") plot([baselinetrace; traces], Layout(yaxis_range = [-90, -10], yaxis_title = "Total reward on episode
(averaged over $nruns runs)", xaxis_title = "Episode", width = 800)) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$24fa139c-ad4b-49db-ac8f-23c476ed8608„§cell_idÙ$24fa139c-ad4b-49db-ac8f-23c476ed8608¤codeÙ÷const reinforce_test = reinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions(cartpole_setup.mdps.episodic.continuous, cartpole_setup.get_active_features, cartpole_setup.num_features, 10_000; Î±_Î¸ = 2f0 ^-14, Î±_w = 2f0 ^-6)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$2025ff38-f2ec-4224-b771-ff72ffe1af28„§cell_idÙ$2025ff38-f2ec-4224-b771-ff72ffe1af28¤codeÙ.const mountaincar_min_vals = (-1.2f0, -0.07f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$cb70d400-3e9c-441c-b17c-e727e8c928f3„§cell_idÙ$cb70d400-3e9c-441c-b17c-e727e8c928f3¤codeÙàif start_mountaincar_continuing_fcann_param_study > 0 mountaincar_fcann_continuing_parameter_study(32, 3, mountaincar_continuing_fcann_params, 5, 3, 1_000_000; seed = 45) else md""" Waiting to run parameter study """ end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$e034b9cb-f4ee-46f4-bea6-72c93c75d966„§cell_idÙ$e034b9cb-f4ee-46f4-bea6-72c93c75d966¤code°using DataFrames¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$e6cf9550-2e69-4b82-92cf-5e07a35490aa„§cell_idÙ$e6cf9550-2e69-4b82-92cf-5e07a35490aa¤codeÙàbegin zero_params!(params::Array{T, N}) where {N, T<:Real} = params .= zero(T) function zero_params!(params::FCANNParams) for i = 1:2 for j in eachindex(params[i]) zero_params!(params[i][j]) end end end end ¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$717e4c69-59d5-4929-923f-dd35a97fb160„§cell_idÙ$717e4c69-59d5-4929-923f-dd35a97fb160¤codeÚ»actor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, Î»_Î¸::T, Î»_w::T, get_active_features::Function, num_features::Integer, args...; kwargs...) 
where {T<:Real, S, N, A <: Union{T, NTuple{N, T}}, P, F1, F2, F3} = actor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions(mdp, one(T), Î»_Î¸, Î»_w, get_active_features, num_features, args...; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$1386ffdb-940d-4f1b-a872-4e38647b5335„§cell_idÙ$1386ffdb-940d-4f1b-a872-4e38647b5335¤codeÚ±md""" #### Test One-step Actor-Critic The following function calls execute the One-step Actor-Critic algorithm on Example 13.1. The output displayed is the policy function acting on the single state representation for the problem. The two values represent the probability of taking the left and right action respectively. If converged properly, the right action probability should be higher, approaching a value of about 60%. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$a893a87b-2d07-4db5-9d1a-9da8646216f4„§cell_idÙ$a893a87b-2d07-4db5-9d1a-9da8646216f4¤codeÙÑfunction update_params_with_gradient!(w::Vector{T}, Î±::T, âˆ‡w::BinaryFeatureVector) where {T<:Real} @inbounds @simd for i in 1:âˆ‡w.num_features j = âˆ‡w.active_features[i] w[j] += Î± end return w end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$2cbc972b-c685-4c1c-8a8d-9d58b197ad90„§cell_idÙ$2cbc972b-c685-4c1c-8a8d-9d58b197ad90¤codeÙ¹function update_binary_value_params!(params::Vector{T}, active_features::BinaryFeatures, c::T) where T<:Real @inbounds for i in active_features params[i] += c end return params end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$37ec6802-d4c2-4470-ad69-439d5a732f77„§cell_idÙ$37ec6802-d4c2-4470-ad69-439d5a732f77¤codeÚfunction form_state_policy_function(update_feature_vector!::Function, update_action_preferences!::Function) function Ï€!(x, action_preferences, s, params) update_feature_vector!(x, s) update_action_preferences!(action_preferences, x, params) soft_max!(action_preferences) end 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$98222fcd-b456-477c-90dd-844df36877e5„§cell_idÙ$98222fcd-b456-477c-90dd-844df36877e5¤codeÙKplot_continuing_step_rewards(mountaincar_continuing_tile_test.step_rewards)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$f7f58fd2-facc-4b87-9172-5e911677c8f4„§cell_idÙ$f7f58fd2-facc-4b87-9172-5e911677c8f4¤codeÙz#for an episode progressing, show the point in the state space that the cart exsits and use the value of x and xÌ‡ in that¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$58403c8e-0ee4-4466-ba25-ee0c86fb0b47„§cell_idÙ$58403c8e-0ee4-4466-ba25-ee0c86fb0b47¤codeÚòmd""" Consider $\mathbf{x}(s)$ and $\mathbf{h}(s, \boldsymbol{\theta})$ which produces a vector of action preferences. We would like to derive an expression for $\nabla \ln \pi (a \vert s, \boldsymbol{\theta})$ in the case of $\mathbf{\pi}(s, \boldsymbol{\theta}) = \sigma(\mathbf{h}(s, \boldsymbol{\theta}))$ where $\sigma(\mathbf{x})$ is the softmax function defined in section 13.1. Here I'm using the notation $\mathbf{\pi}(s, \boldsymbol{\theta})$ to refer to the vector of action probabilities at a given state. The subscript on the vector refers to selecting that element from the vector. 
To shorten expressions, the following terms are equivalent: $\begin{flalign} \mathbf{\pi} &\doteq \mathbf{\pi}(s, \boldsymbol{\theta}) \\ \mathbf{h} &\doteq \mathbf{h}(s, \boldsymbol{\theta}) \\ x_i &\doteq \mathbf{x}_i \text{ for all vectors} \\ \end{flalign}$ Using these conventions, we previously had an expression for the ith component of the gradient of the policy: $\nabla \left( \pi_a \right )_i = \pi_a \left ( \frac{\partial{h_a}}{\partial{\theta_i}} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \right )$ We can use this expression to derive the components of the eligibility vector in general: $\begin{flalign} \nabla \left( \ln \mathbf{\pi}_a \right)_i &= \frac{\nabla \left( \pi_a \right )_i}{\pi_a}\\ &=\frac{\partial{h_a}}{\partial{\theta_i}} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \\ \end{flalign}$ ### Connection to Cross-Entropy Loss Classification problems involve training a function to predict the class label of an input. The function returns a vector of class preferences which can be converted to a probability distribution by the soft-max function. The cross-entropy loss is a way of comparing this distribution with the desired output label to generate an error value. Let's denote $\mathbf{p}(s)$ as the vector of true probabilities for an example $s$ and keep our output function as $\pi(s,\theta) = \sigma(\mathbf{h}(s, \boldsymbol{\theta}))$. The cross entropy loss is defined as: $\mathcal{L}(\mathbf{p}, \mathbf{\pi}) = -\sum_i \mathbf{p}_i \ln \mathbf{\pi}_i$ omitting $s$ and $\boldsymbol{\theta}$. In a typical situation with a dataset, $\mathbf{p}(s)$ will be a one-hot vector representing the index of label of the example in the dataset. Let's call that index $a$ such that $p_a = 1$ and $p_i = 0 \: \forall i \neq a$. The loss then simplifies to $\mathcal{L}(a, \mathbf{\pi}) = -\ln \mathbf{\pi}_a$. 
When we train with gradient descent on such a dataset, we must compute the gradient of this loss with respect to the parameters or $-\nabla \ln \pi_a$ which is just negative one times the eligibility vector for general paramaterized approximation. So if we have a function that computes the gradient of the cross entropy loss of the soft-max output for a vector function and a label index, we can replace the label index of the dataset with the desired action index $a$ and then that gradient will match our desired gradient after multiplying by negative one. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$e1aec891-d95a-47d1-97d7-d2a4cfb16e64„§cell_idÙ$e1aec891-d95a-47d1-97d7-d2a4cfb16e64¤codeÚ”function setup_fcann_policy_and_value_arguments(policy_params::FCANNParams{T}, input_length::Integer, hidden_layers::Vector{Int64}, reslayers::Integer, l2::T, dropout::T, use_Î¼P::Bool, activation_list) where {T<:Real} policy_setup = setup_fcann_policy_arguments(policy_params::FCANNParams{T}, input_length::Integer, hidden_layers::Vector{Int64}, reslayers::Integer, l2::T, dropout::T, use_Î¼P::Bool, activation_list) value_setup = setup_fcann_value_arguments(policy_setup, input_length::Integer, hidden_layers::Vector{Int64}, reslayers::Integer, l2::T, dropout::T, use_Î¼P::Bool, activation_list, policy_setup.scales) (;policy_setup..., value_setup...) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$3d065608-eef2-4caa-b17d-ec60714e3d58„§cell_idÙ$3d065608-eef2-4caa-b17d-ec60714e3d58¤codeÚ;actor_critic_binary_episodic_beta_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, params::@NamedTuple{Î»_Î¸::T, Î»_w::T, Î±_Î¸_min::Int64, Î±_w_min::Int64}, num_Î¸::Integer, num_w::Integer, num_episodes::Integer; kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_binary_episodic_beta_parameter_study(mdp, get_active_features, num_features, params.Î»_Î¸, params.Î»_w, 2f0 .^(params.Î±_Î¸_min:params.Î±_Î¸_min+num_Î¸-1), 2f0 .^(params.Î±_w_min:params.Î±_w_min+num_w-1), num_episodes; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$b87ff1a9-abff-40f7-a1d8-f751a1c8b060„§cell_idÙ$b87ff1a9-abff-40f7-a1d8-f751a1c8b060¤codeÚ9md""" In the episodic case, we provided a reward of -1 per step and then considered an episode finished when a failure state was reached. In the continuing case, the step function will provide a reward of 0 unless a failure occurs in which case it will provide a reward of -1 and then initialize a new state. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$e89bdc84-dbb5-4c73-a39c-6392e5f79704„§cell_idÙ$e89bdc84-dbb5-4c73-a39c-6392e5f79704¤codeÙ…plot_mountaincar_values(mountaincar_continuing_tile_test.estimate_state_value, mountaincar_continuing_tile_test.policy_sample_action)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d3b56fca-5b79-4465-8987-8d0005f854d8„§cell_idÙ$d3b56fca-5b79-4465-8987-8d0005f854d8¤codeÙåconst reinforce_test2 = reinforce_with_baseline_monte_carlo_control_binary_features(cartpole_setup.mdps.episodic.discrete, cartpole_setup.get_active_features, cartpole_setup.num_features, 10_000; Î±_Î¸ = 2f0 ^-14, Î±_w = 2f0 ^-8)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d21617aa-6f38-4a90-8586-4b32022497ad„§cell_idÙ$d21617aa-6f38-4a90-8586-4b32022497ad¤codeÙ'cartpole_setup.mdps.continuing.discrete¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2„§cell_idÙ$0574f5a0-72e7-4aa2-80ac-f4ce4f0fe7c2¤codeÙ’plot_cartpole_policy(cartpole_continuing_test.policy_and_value; s_ref = CartPoleState(sref_cartpole_binary.x, 0f0, sref_cartpole_binary.xÌ‡, 
0f0))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7d„§cell_idÙ$5eb8d9f9-8512-4e00-8cb5-cec68d73cc7d¤codeÚ;const mountaincar_continuous_test_train3 = actor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions(mountaincar_continuous_mdp, 0.2f0, 0.99f0, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, typemax(Int64), 1_000_000; Î±_Î¸ = 1f-5, Î±_w = 0.0001f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d82e7ab8-c372-4462-afb5-1617560cdb56„§cell_idÙ$d82e7ab8-c372-4462-afb5-1617560cdb56¤codeÙ‘plot_mountaincar_values(mountaincar_continuous_test_train_beta.estimate_state_value, mountaincar_continuous_test_train_beta.policy_sample_action)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$3c89209c-9202-4d5d-841c-ea34be369616„§cell_idÙ$3c89209c-9202-4d5d-841c-ea34be369616¤codeÚHconst cartpole_continuing_test = actor_critic_with_eligibility_traces_binary_features(cartpole_continuing_mdp, 0.95f0, 0.8f0, s -> cartpole_tilecoding_setup.get_active_features((s.x, s.Î¸, s.xÌ‡, s.Î¸Ì‡)), cartpole_tilecoding_setup.num_features, 30_000, Î±_Î¸ = .125f0, Î±_w = 0.006f0, Î±_rÌ„ = 0.01f0, save_step_rewards = true)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$635abb34-2c97-4f04-a74c-22fbec32f408„§cell_idÙ$635abb34-2c97-4f04-a74c-22fbec32f408¤codeÙífunction fcann_value_function(x::Vector{T}, params::FCANNParams, activations::FCANNActivations{T}, reslayers::Integer) where T<:Float32 FCANN.forwardNOGRAD_base!(activations, params..., x, reslayers) return first(last(activations)) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$0bf3b988-b3fb-49d5-8dde-b25766596363„§cell_idÙ$0bf3b988-b3fb-49d5-8dde-b25766596363¤codeÙMlinear_value_function(x::Vector{T}, w::Vector{T}) where {T<:Real} = dot(x, 
w)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d8222abf-139c-4220-8e92-cc987ec6900c„§cell_idÙ$d8222abf-139c-4220-8e92-cc987ec6900c¤codeÙÜmd""" Note that for the corridor problem, the state-value learning rates have very little impact and learning is most effective when $\lambda_{\boldsymbol{\theta}}$ is close to 1, which mimics REINFORCE with baseline. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$68e6f17e-8c87-40f0-a673-1115ecd1b71d„§cell_idÙ$68e6f17e-8c87-40f0-a673-1115ecd1b71d¤codeÚpmd""" > ### *Exercise 13.5* > A *Bernoulli-logistic unit* is a stochastic neuron-like unit used in some ANNs. Its input at time *t* is a feature vector $\mathbf{x}(S_t)$; its output, $A_t$, is a random variable having two values, 0 and 1, with $\Pr \{A_t=1 \}=P_t$ and $\Pr\{A_t=0\}=1-P_t$ (the Bernoulli distribution). Let $h(s, 0, \mathbf{\theta})$ and $h(s, 1, \mathbf{\theta})$ be the preferences in state $s$ for the unit's two actions given by policy parameter $\mathbf{\theta}$. Assume that the difference between the action preferences is given by a weighted sum of the unit's input vector, that is, assume that $h(s, 1, \mathbf{\theta})-h(s,0, \mathbf{\theta}) = \mathbf{\theta}^\top \mathbf{x}(s)$, where $\mathbf{\theta}$ is the unit's weight vector. > 1. Show that if the exponential soft-max distribution (13.2) is used to convert action preferences to policies, then ${P_t = \pi(1|S_t, \theta_t)=1/(1+\exp(-\theta_t^\top\mathbf{x}(S_t)))}$ (the logistic function). > 2. What is the Monte-Carlo REINFORCE update of $\theta_t$ to $\theta_{t+1}$ upon receipt of return $G_t$? > 3. Express the eligibility $\nabla \ln \pi(a|s, \theta)$ for a Bernoulli-logistic unit, in terms of $a$, $\mathbf{x}(s)$, and $\pi(a|s, \theta)$ by calculating the gradient. > Hint for part 3: Define $P=\pi(1|s,\theta)$ and compute the derivative of the logarithm, for each action, using the chain rule on $P$. 
Combine the two results into one expression that depends on $a$ and $P$, and then use the chain rule again, this time on $\theta^\top\mathbf{x}(s)$, noting that the derivative of the logistic function $f(x)=1/(1+e^{-x})$ is $f(x)(1-f(x))$. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$cf1859d6-f889-4923-8c87-2d7c039f26c3„§cell_idÙ$cf1859d6-f889-4923-8c87-2d7c039f26c3¤codeÙDrunepisode(cartpole_mdps.episodic.continuous, s -> Float32(randn()))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$5500fd8e-64cb-4af7-808d-230440746319„§cell_idÙ$5500fd8e-64cb-4af7-808d-230440746319¤codeÙ/md""" ### *Continuing Mountain Car Example* """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$76d54520-baa3-44bf-b303-4cdcb8b87080„§cell_idÙ$76d54520-baa3-44bf-b303-4cdcb8b87080¤codeÙƒbegin make_sample_vector(::T) where T<:Real = zeros(T, 1) make_sample_vector(::NTuple{N, T}) where {N, T<:Real} = zeros(T, N) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$27441783-d3c6-40be-9c36-4941613e6ae9„§cell_idÙ$27441783-d3c6-40be-9c36-4941613e6ae9¤codeÙzplot(reinforce_test5.step_rewards |> cumsum |> x -> x ./ length(x) |> x -> x[round.(Int64, LinRange(1, length(x), 1000))])¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$fac138d9-3c5d-44b0-a87c-b13872f19450„§cell_idÙ$fac138d9-3c5d-44b0-a87c-b13872f19450¤codeusing Memoize¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$82e0e9a0-9662-429a-87e3-e6bdae02709a„§cell_idÙ$82e0e9a0-9662-429a-87e3-e6bdae02709a¤codeÚfconst reinforce_test5 = actor_critic_with_eligibility_traces_fcann(cartpole_setup.mdps.continuing.discrete, 0.90f0, 0.1f0, cartpole_fcann_feature_setup.num_features, [32, 32], (x, s) -> cartpole_fcann_feature_setup.update_feature_vector!(x, (s.x, s.Î¸, s.xÌ‡, s.Î¸Ì‡)), 1_000_000; Î±_Î¸ = 0.0625f0, Î±_w = 0.0625f0, Î±_rÌ„ = 0.01f0, save_step_rewards = 
true)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62„§cell_idÙ$d3c1379f-acd6-4e15-be7e-a5dbe46a4f62¤codeÙ_@bind start_mountaincar_continuing_param_study CounterButton("Run Mountaincar Parameter Study")¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$fad02876-efba-46a7-9cb7-43820528779f„§cell_idÙ$fad02876-efba-46a7-9cb7-43820528779f¤codeÙ½plot_cart(cartpole_fcann_continuing_test_episode[1][cartpole_fcann_continuing_episode_step_select], cartpole_fcann_continuing_test_episode[2][cartpole_fcann_continuing_episode_step_select])¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$1ce4bc6c-7cde-48e9-8ff1-7281697fd121„§cell_idÙ$1ce4bc6c-7cde-48e9-8ff1-7281697fd121¤codeÙ-plot_cart(ep2[1][ep2_step], ep2[2][ep2_step])¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$024dcd1a-8eaa-4a95-8037-2f578828309c„§cell_idÙ$024dcd1a-8eaa-4a95-8037-2f578828309c¤codeÙ-const cartpole_mdps = create_cartpole_mdps()¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$e1274f57-75cb-4659-a82f-e5870c5367e2„§cell_idÙ$e1274f57-75cb-4659-a82f-e5870c5367e2¤codeÙyconst ep = runepisode(cartpole_setup.mdps.episodic.discrete; Ï€ = reinforce_test4.policy_sample_action, max_steps = 1000)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$fdd3f4fd-4706-4d6b-b150-6ee6b4b370cb„§cell_idÙ$fdd3f4fd-4706-4d6b-b150-6ee6b4b370cb¤codeÚmd""" ### Notes on Probability Distributions In order to prove the policy gradient theorem, we must manipulate terms that are probability distributions over states and visit steps. To build intuition for these distributions, we can visualize how data is averaged in the short corridor example. The following function simulates many episodes in the environment with a stochastic policy that has some probability of moving left regardless of the state. The simulation keeps track of the visit count for a given state and the visit step. 
The result of the accumulation is a matrix whose columns contain the number of times each state was visited on every step of an episode across all of the simulated episodes. If we divide each count by the number of episodes simulated, then we have an unbiased estimate of the probability of visiting a state on each step $k$ of an episode: $\Pr \{ S_k = s \mid \pi \}$ such that $\sum_{s \in \mathcal{S}^+} \Pr \{ S_k = s \mid \pi \} = 1$. Note that this distribution is normalized only when summed over all states including terminal states, a set denoted in episodic problems by $\mathcal{S}^+$. The notation $\mathcal{S}$ excludes all terminal states, so if we sum the above probabilities over that set on a given step $k$ we calculate the probability that we are NOT in a terminal state by the time we reach step $k$: $\sum_{s \in \mathcal{S}} \Pr \{ S_k = s \mid \pi \} = \Pr \{ T \gt k \mid \pi \}$ where $T$ denotes the step of termination for a particular episode. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$b02ba928-5b9f-4695-b980-07988c788bb9„§cell_idÙ$b02ba928-5b9f-4695-b980-07988c788bb9¤codeÚ8const mountaincar_continuing_tile_test = actor_critic_with_eligibility_traces_binary_features(mountaincar_continuing_mdp, 0.1f0, 0.98f0, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, 200_000, Î±_Î¸ = 0.5f0, Î±_w = 0.0025f0, Î±_rÌ„ = 0.005f0; save_step_rewards=true)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$f946c886-6246-4f98-a96f-f06984691ad8„§cell_idÙ$f946c886-6246-4f98-a96f-f06984691ad8¤codeÚbegin function ApproximationUtils.runepisode!((states, actions, rewards)::Tuple{Vector{S}, Vector{A}, Vector{T}}, mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, Ï€::Function; s0::S = mdp.initialize_state(), a0::A = Ï€(s0), max_steps = typemax(Int64)) where {T<:Real, S, A, P, F1<:Function, F2<:Function, F3<:Function} s = s0 l = length(states) @assert l == length(actions) == 
length(rewards) function add_value!(v, x, i) if i > l push!(v, x) else v[i] = x end end add_value!(states, s, 1) a = a0 # @info "Selected action is $a" (r, sâ€²) = mdp.ptf(s, a0) add_value!(actions, a, 1) add_value!(rewards, r, 1) step = 2 sterm = s if mdp.isterm(sâ€²) sterm = sâ€² else sterm = s end s = sâ€² #note that the terminal state will not be added to the state list while !mdp.isterm(s) && (step <= max_steps) add_value!(states, s, step) a = Ï€(s) if bad_continuous_action(a) @info "Terminating episode after $step steps due to bad continuous action $a taken in state $s" step = 1 break end # @info "Selected action is $a" (r, sâ€²) = mdp.ptf(s, a) add_value!(actions, a, step) add_value!(rewards, r, step) s = sâ€² step += 1 if mdp.isterm(sâ€²) sterm = sâ€² end end return states, actions, rewards, sterm, step-1 end function ApproximationUtils.runepisode(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, Ï€::Function; kwargs...) where {T<:Real, S, A, P, F1, F2, F3} states = Vector{S}() actions = Vector{A}() rewards = Vector{T}() runepisode!((states, actions, rewards), mdp, Ï€; kwargs...) end ApproximationUtils.runepisode(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}; kwargs...) where {T<:Real, S, N, A <: Union{T, NTuple{N, T}}, P, F1, F2, F3} = runepisode(mdp, Returns(rand(A)); kwargs...) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$3c316495-bb6c-41e2-a38f-ba867a319fbb„§cell_idÙ$3c316495-bb6c-41e2-a38f-ba867a319fbb¤codeÚ à#create a cart pole MDP environment function create_cartpole_mdps(; m::T = 1f0, #mass at the end of the pole in kg m_c::T = 10f0, #mass of the cart in kg l::T = 1f0, #length of the pole in meters g::T = 9.8f0, #gravitational constant in meters per second squared h::T = 1f-3, #step size parameter of simulation in seconds k::T = 1f0, #inertial constant of pendulum, m_f::T = 0f0, #friction of the rotating pole Î¼_c::T = 0f0, #friction of the cart wheels against the track fmax::T = 100f0, #force applied by throttle x_max::T = Inf32, #maximum horizontal position Î¸_max::T = Ï€/2f0, #maximum pole angle xÌ‡_max::T = Inf32, Î¸Ì‡_max::T = Inf32, init_x::Function = () -> 0f0, #initialize each of the 4 state variables init_Î¸::Function = () -> Float32(rand([-Ï€/6, Ï€/6])), init_xÌ‡::Function = () -> 0f0, init_Î¸Ì‡::Function = () -> 0f0) where T<:Real #the action space is full throttle forward or backwards or idle in the discrete case actions = [-fmax, zero(T), fmax] #create a vehicle to use in simulation steps vehicle = CartPoleVehicle(m, m_c, l, k, m_f, Î¼_c) initialize_state(;t = 0f0) = CartPoleState(init_x(), init_Î¸(), init_xÌ‡(), init_Î¸Ì‡(), t) function failure(s::CartPoleState) (abs(s.x) > x_max) || (abs(s.Î¸) > Î¸_max) || (abs(s.xÌ‡) > xÌ‡_max) || (abs(s.Î¸Ì‡) > Î¸Ì‡_max) end step(s::CartPoleState{T}, f::T) = cartpole_runge_kutta_step(vehicle, s, g, clamp(f, -fmax, fmax), h) function episodic_step(s::CartPoleState{T}, f::T) sâ€² = step(s, f) return (one(T), sâ€²) end function continuing_step(s::CartPoleState{T}, f::T) sâ€² = step(s, f) failure(sâ€²) && return (-one(T), initialize_state(;t = sâ€².t)) return (zero(T), sâ€²) end s0 = initialize_state() ptf = StateMDPTransitionSampler((s, i_a) -> episodic_step(s, actions[i_a]), s0) episodic_mdp = TabularRL.StateMDP(actions, ptf, initialize_state, failure) ptf = 
ContinuousMDPTransitionSampler(episodic_step, s0, zero(T)) episodic_mdp_continuous = ContinuousMDP(ptf, initialize_state; isterm = failure) ptf = StateMDPTransitionSampler((s, i_a) -> continuing_step(s, actions[i_a]), s0) continuing_mdp = TabularRL.StateMDP(actions, ptf, initialize_state, Returns(false)) ptf = ContinuousMDPTransitionSampler(continuing_step, s0, zero(T)) continuing_mdp_continuous = ContinuousMDP(ptf, initialize_state) (episodic = (discrete = episodic_mdp, continuous = episodic_mdp_continuous), continuing = (discrete = continuing_mdp, continuous = continuing_mdp_continuous)) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$6c5e9bb2-4c38-4613-9652-dec99e97b512„§cell_idÙ$6c5e9bb2-4c38-4613-9652-dec99e97b512¤codeÙ%md""" #### Policy Function Output """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$b0a66a19-ee76-463b-a704-8fcee85444d0„§cell_idÙ$b0a66a19-ee76-463b-a704-8fcee85444d0¤codeÚlbegin function update_params_with_gradient!(Î¸::Array{T, N}, Î±::T, âˆ‡Î¸::Array{T, N}) where {T<:Real, N} Î¸ .+= Î± .* âˆ‡Î¸ end function update_params_with_gradient!(Î¸::Matrix{T}, Î±::T, âˆ‡Î¸::BinaryEligibilityVector{T, B}) where {T<:Real, B<:BinaryFeatureVector} @inbounds for i in eachindex(âˆ‡Î¸.Ï€_dist) @simd for j in 1:âˆ‡Î¸.binary_features.num_features k = âˆ‡Î¸.binary_features.active_features[j] Î¸[k, i] -= Î±*âˆ‡Î¸.Ï€_dist[i] end end @inbounds @simd for i in 1:âˆ‡Î¸.binary_features.num_features j = âˆ‡Î¸.binary_features.active_features[i] Î¸[j, âˆ‡Î¸.i_a] += Î± end return Î¸ end function update_params_with_gradient!(params::FCANNParams{T}, Î±::T, âˆ‡::FCANNParams{T}) where T<:Float32 for i in eachindex(first(params)) for j in 1:2 # @info "updating parameter $((j, i)) $(params[j][i]) with gradient $(âˆ‡[j][i]) and constant $Î±" update_params_with_gradient!(params[j][i], Î±, âˆ‡[j][i]) # @info "new parameter values are: $(params[j][i])" end end end update_params_with_gradient!(::Nothing, Î±::T, ::Nothing) where T<:Real = return 
nothing end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$13ebc12f-ff6f-4266-88d3-28d6df5fcf59„§cell_idÙ$13ebc12f-ff6f-4266-88d3-28d6df5fcf59¤codeÚCactor_critic_binary_episodic_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, params::@NamedTuple{Î»_Î¸::T, Î»_w::T, Î±_Î¸_min::Int64, Î±_w_min::Int64}, num_Î¸::Integer, num_w::Integer, num_episodes::Integer; kwargs...) where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_binary_episodic_gaussian_parameter_study(mdp, get_active_features, num_features, params.Î»_Î¸, params.Î»_w, 2f0 .^(params.Î±_Î¸_min:params.Î±_Î¸_min+num_Î¸-1), 2f0 .^(params.Î±_w_min:params.Î±_w_min+num_w-1), num_episodes; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$7a6fb1f0-fc3c-4c29-a6d9-769d32ca98a9„§cell_idÙ$7a6fb1f0-fc3c-4c29-a6d9-769d32ca98a9¤codeÙ3md""" ### Example 13.1 Short corridor gridworld """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$f2f2dd1d-180c-4d36-b515-5079d129f93a„§cell_idÙ$f2f2dd1d-180c-4d36-b515-5079d129f93a¤codeÙÉsarsa_Î»(corridor_mdp, 1f0, 0.9f0, typemax(Int64), 100_000, 1, get_corridor_features; Ïµ = 0.0001f0, Î± = 0.000001f0, save_episode_steps = true).history.episode_steps |> a -> a ./ (1:length(a)) |> plot¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$553b0ceb-f2ca-41ee-99bc-9f53a4487b49„§cell_idÙ$553b0ceb-f2ca-41ee-99bc-9f53a4487b49¤codeÙTget_corridor_episode_stats(best_mc_corridor.policy_sample_action; ntrials = 100_000)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$f9facbba-39d4-483e-9066-275603156db0„§cell_idÙ$f9facbba-39d4-483e-9066-275603156db0¤codeÚifunction plot_mountaincar_values(vÌ‚_mountain_car, Ï€; n1 = 100, n2 = 100) xvals = LinRange(-1.2f0, 0.5f0, n1) vvals = LinRange(-0.07f0, 0.07f0, n2) values = zeros(Float32, n1, n2) actions = zeros(Float32, n1, n2) for (i, x) in enumerate(xvals) for (j, v) in enumerate(vvals) vÌ‚ = vÌ‚_mountain_car((x, v)) 
values[j, i] = vÌ‚ actions[j, i] = Ï€((x, v)) end end p1 = plot(heatmap(x = xvals, y = vvals, z = values), Layout(xaxis_title = "position", yaxis_title = "velocity", title = "Learned Value Function", height = 400)) p2 = plot(heatmap(x = xvals, y = vvals, z = actions, colorscale = "rb", showscale = false), Layout(xaxis_title = "position", yaxis_title = "velocity", title = "Policy (blue = accelerate left,
red = accelerate right, gray = no acceleration)", height = 400)) @htl("""

$p1 $p2

""") end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0„§cell_idÙ$0fbf45c8-3e3c-47c1-b763-3b06bcdc60e0¤codeÙ˜one_step_actor_critic_fcann(corridor_mdp, 1, [1], update_corridor_features!, typemax(Int64), 100_000, Î±_Î¸ = 2f0^-4, Î±_w = 2f0^-20).policy_function(1)¨metadataƒ©show_logsÂ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d41f1dd1-45fe-4456-9a01-ed47fd6704a7„§cell_idÙ$d41f1dd1-45fe-4456-9a01-ed47fd6704a7¤codeÚebegin function update_beta_eligibility_vector!(âˆ‡lnÏ€::BinaryBetaEligibilityVector{T, T, T, B}, dist_params::Vector{T}, x::B, action::T, policy_params::Matrix{T}) where {T<:Real, B<:BinaryFeatureVector} # @info "Beta eligibility vector is $âˆ‡lnÏ€" âˆ‡lnÏ€.binary_features = x âˆ‡lnÏ€.a = action âˆ‡lnÏ€.Î± = exp(first(dist_params)) âˆ‡lnÏ€.Î² = exp(last(dist_params)) # @info "Beta eligibility vector updated to $âˆ‡lnÏ€" return âˆ‡lnÏ€ end function update_beta_eligibility_vector!(âˆ‡lnÏ€::BinaryBetaEligibilityVector{T, NTuple{N, T}, Vector{T}, B}, dist_params::Vector{T}, x::B, action::NTuple{N, T}, policy_params::Matrix{T}) where {T<:Real, N, B<:BinaryFeatureVector} âˆ‡lnÏ€.binary_features = x âˆ‡lnÏ€.a = action for k in 1:N âˆ‡lnÏ€.Î±[k] = exp(dist_params[k]) âˆ‡lnÏ€.Î²[k] = exp(dist_params[k+N]) end return âˆ‡lnÏ€ end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$ba5d6311-daee-4abc-b2fb-fae2184ef3eb„§cell_idÙ$ba5d6311-daee-4abc-b2fb-fae2184ef3eb¤codeÚ”function setup_binary_gaussian_policy_arguments(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer) where {T<:Real, S, N, A<:Union{T, NTuple{N, T}}, P, F1, F2, F3} x = BinaryFeatureVector() update_feature_vector!(x::BinaryFeatureVector, s) = update_binary_feature_vector!(x, s, get_active_features) sample_action = rand(A) action_dist_params = make_n_param_dist_params(2, sample_action) âˆ‡lnÏ€ = BinaryGaussianEligibilityVector(sample_action) return (feature_vector = x, update_feature_vector! 
= update_feature_vector!, action_distribution_parameters = action_dist_params, eligibility_vector = âˆ‡lnÏ€) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$8e742d32-c074-4981-b35b-b596b64c869b„§cell_idÙ$8e742d32-c074-4981-b35b-b596b64c869b¤codeÙ¨@bind cartpole_continuing_binary_study_params create_actor_critic_continuing_params_UI(;Î»_Î¸ = 0.95f0, Î»_w = 0.05f0, log2Î±_Î¸ = -4, log2Î±_w = -16, Î±_rÌ„ = 0.005f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$03a218cb-aa83-4000-85b5-c6f247087053„§cell_idÙ$03a218cb-aa83-4000-85b5-c6f247087053¤codeÚfunction update_binary_value_gradient!(âˆ‡vÌ‚::BinaryFeatureVector, binary_features::BinaryFeatureVector, value_params::Vector{T}) where T<:Real âˆ‡vÌ‚.active_features = binary_features.active_features âˆ‡vÌ‚.num_features = binary_features.num_features end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$1ec1acf1-f833-4478-9b3c-88029340a629„§cell_idÙ$1ec1acf1-f833-4478-9b3c-88029340a629¤codeÚ:md""" ##### Non-linear Features This version of REINFORCE uses non-linear features in a fully connected neural network. The number of parameters no longer matches the size of the input feature vector, but a mapping from state to feature vector is still required. One must specify the size of the feature vector, a function that updates the values in a feature vector given a state, and the size of each hidden layer in the neural network. Additional keyword arguments are available to change the construction of the neural network such as adding residual layers. 
"""¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$de3cba34-9842-44d1-9b79-47126c0a0751„§cell_idÙ$de3cba34-9842-44d1-9b79-47126c0a0751¤codeÙœconst cartpole_tilecoding_setup = tile_coding_setup(cartpole_functions.min_vals, cartpole_functions.max_vals, (1f0/4, 1f0/8, 1f0/8, 1f0/8), 8, (1, 3, 5, 7))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$04f42c09-8ab5-4233-b196-51c4aa2dcedb„§cell_idÙ$04f42c09-8ab5-4233-b196-51c4aa2dcedb¤codeÙÓif start_mountaincar_continuing_param_study > 0 mountaincar_binary_continuing_parameter_study(mountaincar_continuing_binary_params, 5, 3, 100_000; seed = 45) else md""" Waiting to run parameter study """ end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$54ff46a2-489a-4dd2-bc30-df70c780cc42„§cell_idÙ$54ff46a2-489a-4dd2-bc30-df70c780cc42¤codeÚ\cartpole_fcann_parameter_study(fill(fcann_cartpole_study_params.h, fcann_cartpole_study_params.l), fcann_cartpole_study_params.Î»_Î¸, fcann_cartpole_study_params.Î»_w, 2f0 .^(fcann_cartpole_study_params.Î±_Î¸_min:fcann_cartpole_study_params.Î±_Î¸_min+4), 2f0 .^ (fcann_cartpole_study_params.Î±_w_min:fcann_cartpole_study_params.Î±_w_min+2), 1_000)¨metadataƒ©show_logsÃ¨disabledÃ®skip_as_scriptÂ«code_foldedÂÙ$7126aefd-b847-497a-9545-514e9b9afa71„§cell_idÙ$7126aefd-b847-497a-9545-514e9b9afa71¤codeÚactor_critic_fcann_episodic_parameter_study(MountainCarTask.mdp, mountaincar_fcann_setup.update_feature_vector!, mountaincar_fcann_setup.num_features, fill(fcann_mountaincar_study_params.h, fcann_mountaincar_study_params.l), fcann_mountaincar_study_params.Î»_Î¸, fcann_mountaincar_study_params.Î»_w, 2f0 .^ (fcann_mountaincar_study_params.Î±_Î¸_min:fcann_mountaincar_study_params.Î±_Î¸_min+4), 2f0 .^ (fcann_mountaincar_study_params.Î±_w_min:fcann_mountaincar_study_params.Î±_w_min+2), 100_000; nruns = 100, max_steps = 
1_000)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$48dcd2d0-a940-41da-a097-90c780f2ec4d„§cell_idÙ$48dcd2d0-a940-41da-a097-90c780f2ec4d¤codeÚzmd""" ### Alternative Parameterization If the action space is small enough, then it may be convenient to create a function that simply outputs the preferences for all of the actions at a given state. Let $N_a$ be the number of available actions. We would then consider the vector function $\mathbf{h}(s, \boldsymbol{\theta}) \in \mathbb{R}^{N_a}$ and take its components $h_1, h_2, h_3, \dots, h_{N_a}$ to be the action preferences at each state. With this style of parameterization, we need only compute state feature vectors $\mathbf{x}(s) \in \mathbb{R}^d$. Similarly, the policy function would also be a vector function. In order to compute the soft-max, we must evaluate the denominator of (13.2) which requires knowing all of the action preferences. In practice, the soft-max is defined only as a function on vectors, so consider the following notation to simplify expressions where we use the symbol $\mathbf{\sigma}$ to denote the soft-max vector function. $\mathbf{\sigma}(\mathbf{x}) = \frac{e^{\mathbf{x}}}{\sum_j{e^{x_j}}} \text{ where we abuse the notation } e^{\mathbf{x}} = \begin{pmatrix} e^{x_1} \\ e^{x_2} \\ \vdots \\ e^{x_n} \end{pmatrix}$ Using this notation, we can write down the policy function under this new parameterization: $\mathbf{\pi}(s, \boldsymbol{\theta}) = \mathbf{\sigma}(\mathbf{h}(s, \boldsymbol{\theta}))$. What do linear preferences look like with this parameterization? Instead of a parameter vector $\boldsymbol{\theta} \in \mathbb{R}^{d^\prime}$, we have a parameter matrix $\boldsymbol{\theta} \in \mathbb{R}^{d \times N_a}$ and the vector of preferences is the result of a matrix-vector multiplication: $\mathbf{h}(s, \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s) \in \mathbb{R}^{N_a}$. 
Subscript notation refers to single preference values, so $\mathbf{h}_i$ is the $i$th component of $\mathbf{h}$, the preference for the $i$th action, equivalent to $h_i$. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$e1493cea-19c4-475d-98a0-86d27fb04af1„§cell_idÙ$e1493cea-19c4-475d-98a0-86d27fb04af1¤codeÙ sarsa_Î»(corridor_mdp, 1f0, 0.9f0, typemax(Int64), 100_000, 1, get_corridor_features; Ïµ = 0.001f0, Î± = 0.000001f0).greedy_policy |> get_corridor_episode_stats¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$511a847f-234c-465e-8f4a-688e79d9b975„§cell_idÙ$511a847f-234c-465e-8f4a-688e79d9b975¤codeÚSmd""" ## 13.6 Policy Gradient for Continuing Problems In the continuing case we need to define the average reward per time step as discussed in Section 10.3. In the update procedure, $\delta$ is calculated from the difference between the reward and this long-running average. The value functions in this case will also learn the reward difference from the average, which is assumed to have a well-defined expected value under the stationary state distribution for the policy. This shift in the value function will not affect performance since shifting the value function up and down by a constant does not affect the learned policy. To implement this we need a new learning rate $\alpha_{\overline{R}}$ which controls how quickly the reward average updates. This replaces $\gamma$ in a sense since we no longer discount rewards of future time steps. 
"""¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$697b2310-9d96-4f7f-be62-c3bd6bf736f3„§cell_idÙ$697b2310-9d96-4f7f-be62-c3bd6bf736f3¤codeÚ¨function reinforce_with_baseline_monte_carlo_control_fcann(mdp::StateMDP{T, S, A, P, F1, F2, F3}, input_length::Integer, hidden_layers::Vector{Int64}, update_feature_vector!::Function,max_episodes::Integer; policy_params::FCANNParams = FCANN.initializeparams_saxe(input_length, hidden_layers, length(mdp.actions)), reslayers = 0, l2 = 0f0, dropout = 0f0, use_Î¼P = true, activation_list = fill(true, length(hidden_layers)), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} setup = setup_fcann_policy_and_value_arguments(policy_params, input_length, hidden_layers, reslayers, l2, dropout, use_Î¼P, activation_list) reinforce_with_baseline_monte_carlo_control!(policy_params, setup.eligibility_vector, setup.value_params, setup.value_gradient, mdp, setup.update_action_preferences!, setup.update_eligibility_vector!, setup.feature_vector, update_feature_vector!, setup.value_function, setup.gradient_update, max_episodes; kwargs...) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$056a8adc-92f4-4b33-90d9-4b3b4026bbbc„§cell_idÙ$056a8adc-92f4-4b33-90d9-4b3b4026bbbc¤codeÚbegin function update_traces_with_gradient!(c::T, z_Î¸::Matrix{T}, âˆ‡Î¸::BinaryGaussianEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = âˆ‡Î¸.a - âˆ‡Î¸.Î¼ c2 = âˆ‡Î¸.Ïƒ^(-2) # isnan(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing nan results" # isinf(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) z_Î¸ .*= c @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] z_Î¸[i, 1] += c3 end @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] z_Î¸[i, 2] += c4 end return z_Î¸ end function update_traces_with_gradient!(c::T, z_Î¸::Matrix{T}, âˆ‡Î¸::BinaryBetaEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = digamma(âˆ‡Î¸.Î± + âˆ‡Î¸.Î²) Î´1 = âˆ‡Î¸.Î±*(log(âˆ‡Î¸.a) + c1 - digamma(âˆ‡Î¸.Î±)) z_Î¸ .*= c @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] z_Î¸[i, 1] += Î´1 end Î´2 = âˆ‡Î¸.Î²*(log(one(T) - âˆ‡Î¸.a) + c1 - digamma(âˆ‡Î¸.Î²)) @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] z_Î¸[i, 2] += Î´2 end return z_Î¸ end function update_traces_with_gradient!(c::T, z_Î¸::Matrix{T}, âˆ‡Î¸::BinarySquashedGaussianEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = atanh(âˆ‡Î¸.a / âˆ‡Î¸.amax) - âˆ‡Î¸.Î¼ c2 = âˆ‡Î¸.Ïƒ^(-2) # isnan(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing nan results" # isinf(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) z_Î¸ .*= c @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] z_Î¸[i, 1] += c3 end @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i 
= âˆ‡Î¸.binary_features.active_features[j] z_Î¸[i, 2] += c4 end return z_Î¸ end function update_traces_with_gradient!(a::T, z_Î¸::Matrix{T}, b::T, âˆ‡Î¸::BinaryGaussianEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = âˆ‡Î¸.a - âˆ‡Î¸.Î¼ c2 = âˆ‡Î¸.Ïƒ^(-2) # isnan(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing nan results" # isinf(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) z_Î¸ .*= a Î´1 = b*c3 @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] z_Î¸[i, 1] += Î´1 end Î´2 = b*c4 @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] z_Î¸[i, 2] += Î´2 end return z_Î¸ end function update_traces_with_gradient!(a::T, z_Î¸::Matrix{T}, b::T, âˆ‡Î¸::BinarySquashedGaussianEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = atanh(âˆ‡Î¸.a / âˆ‡Î¸.amax) - âˆ‡Î¸.Î¼ c2 = âˆ‡Î¸.Ïƒ^(-2) # isnan(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing nan results" # isinf(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) z_Î¸ .*= a Î´1 = b*c3 @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] z_Î¸[i, 1] += Î´1 end Î´2 = b*c4 @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] z_Î¸[i, 2] += Î´2 end return z_Î¸ end function update_traces_with_gradient!(a::T, z_Î¸::Matrix{T}, b::T, âˆ‡Î¸::BinaryBetaEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = digamma(âˆ‡Î¸.Î± + âˆ‡Î¸.Î²) z_Î¸ .*= a Î´1 = b*âˆ‡Î¸.Î±*(log(âˆ‡Î¸.a) + c1 - digamma(âˆ‡Î¸.Î±)) @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] z_Î¸[i, 1] += Î´1 end Î´2 = b*âˆ‡Î¸.Î²*(log(one(T) - âˆ‡Î¸.a) + c1 - digamma(âˆ‡Î¸.Î²)) @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = 
âˆ‡Î¸.binary_features.active_features[j] z_Î¸[i, 2] += Î´2 end return z_Î¸ end function update_traces_with_gradient!(a::T, z_Î¸::Matrix{T}, âˆ‡Î¸::BinaryGaussianEligibilityVector{T, NTuple{N, T}, Vector{T}, B}) where {T<:Real, N, B<:BinaryFeatureVector} z_Î¸ .*= a for k in 1:N c1 = âˆ‡Î¸.a[k] - âˆ‡Î¸.Î¼[k] c2 = âˆ‡Î¸.Ïƒ[k] ^-2 # isnan(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing nan results" # isinf(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) @inbounds @simd for i in 1:size(Î¸, 1) Î¸[i, k] += c3 end @inbounds @simd for i in 1:size(Î¸, 1) Î¸[i, k+N] += c4 end end return Î¸ end function update_traces_with_gradient!(a::T, z_Î¸::Matrix{T}, âˆ‡Î¸::BinaryBetaEligibilityVector{T, NTuple{N, T}, Vector{T}, B}) where {T<:Real, N, B<:BinaryFeatureVector} z_Î¸ .*= a for k in 1:N c1 = digamma(âˆ‡Î¸.Î±[k] + âˆ‡Î¸.Î²[k]) Î´1 = âˆ‡Î¸.Î±[k]*(log(âˆ‡Î¸.a[k]) + c1 - digamma(âˆ‡Î¸.Î±[k])) @inbounds @simd for i in 1:size(Î¸, 1) Î¸[i, k] += Î´1 end Î´2 = âˆ‡Î¸.Î²[k]*(log(one(T) - âˆ‡Î¸.a[k]) + c1 - digamma(âˆ‡Î¸.Î²[k])) @inbounds @simd for i in 1:size(Î¸, 1) Î¸[i, k+N] += Î´2 end end return Î¸ end function update_traces_with_gradient!(a::T, z_Î¸::Matrix{T}, b::T, âˆ‡Î¸::BinaryGaussianEligibilityVector{T, NTuple{N, T}, Vector{T}, B}) where {T<:Real, N, B<:BinaryFeatureVector} z_Î¸ .*= a for k in 1:N c1 = âˆ‡Î¸.a[k] - âˆ‡Î¸.Î¼[k] c2 = âˆ‡Î¸.Ïƒ[k] ^-2 isnan(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing nan results" isinf(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) Î´1 = b*c3 @inbounds @simd for i in 1:size(Î¸, 1) Î¸[i, k] += Î´1 end Î´2 = b*c4 @inbounds @simd for i in 1:size(Î¸, 1) Î¸[i, k+N] += Î´2 end end return Î¸ end function update_traces_with_gradient!(a::T, z_Î¸::Matrix{T}, b::T, âˆ‡Î¸::BinaryBetaEligibilityVector{T, NTuple{N, T}, Vector{T}, B}) where {T<:Real, N, B<:BinaryFeatureVector} z_Î¸ .*= a for k in 1:N c1 = digamma(âˆ‡Î¸.Î±[k] + âˆ‡Î¸.Î²[k]) Î´1 = 
b*âˆ‡Î¸.Î±[k]*(log(âˆ‡Î¸.a[k]) + c1 - digamma(âˆ‡Î¸.Î±[k])) @inbounds @simd for i in 1:size(Î¸, 1) Î¸[i, k] += Î´1 end Î´2 = b*âˆ‡Î¸.Î²[k]*(log(one(T) - âˆ‡Î¸.a[k]) + c1 - digamma(âˆ‡Î¸.Î²[k])) @inbounds @simd for i in 1:size(Î¸, 1) Î¸[i, k+N] += Î´2 end end return Î¸ end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$bc8a399b-8864-4473-89d2-e3b0a03d15b5„§cell_idÙ$bc8a399b-8864-4473-89d2-e3b0a03d15b5¤codeÙ¹corridor_parameter_study(args...; kwargs...) = actor_critic_binary_episodic_parameter_study(corridor_mdp, get_corridor_features, 1, args...; init_policy_params = [0f0 3.7f0], kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$bba13634-ff0e-47f7-a23b-8d56098f4ac6„§cell_idÙ$bba13634-ff0e-47f7-a23b-8d56098f4ac6¤codeÚ²begin function gaussian_action_sampler(params::Vector{T}) where T<:Real Ïƒ = exp(params[2]) Î¼ = params[1] isinf(Î¼) && return Î¼ isapprox(Ïƒ, zero(T)) && return Î¼ isnan(Ïƒ) && return Î¼ rand(Normal(Î¼, Ïƒ)) end make_gaussian_n_sampler(::Val{1}) = gaussian_action_sampler function make_gaussian_n_sampler(::Val{N}) where N function f(params::Vector{T}) where T<:Real ntuple(i -> rand(Normal(params[i], exp(params[i+N]))), N) end end make_gaussian_n_sampler(n::Integer) = make_gaussian_n_sampler(Val(n)) make_gaussian_sampler(::T) where T<:Real = gaussian_action_sampler make_gaussian_sampler(::NTuple{N, T}) where {N, T<:Real} = make_gaussian_n_sampler(N) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$407a0724-4bb6-4c83-ab2d-17a0e19c4072„§cell_idÙ$407a0724-4bb6-4c83-ab2d-17a0e19c4072¤codeÚKconst reinforce_test4 = actor_critic_with_eligibility_traces_fcann(cartpole_setup.mdps.episodic.discrete, 0.95f0, 0.2f0, cartpole_fcann_feature_setup.num_features, [64, 64], (x, s) -> cartpole_fcann_feature_setup.update_feature_vector!(x, (s.x, s.Î¸, s.xÌ‡, s.Î¸Ì‡)), typemax(Int64), 1_000_000; Î±_Î¸ = 4f-4, Î±_w = 2f-5, Î³ = 
1f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$77cf3a74-899f-4ade-99f2-5aaf7a98c02d„§cell_idÙ$77cf3a74-899f-4ade-99f2-5aaf7a98c02d¤codeÙ´function scale_fcann_params!(params::FCANNParams, scales::Vector{T}) where T<:Real @inbounds for i in eachindex(scales) for j in 1:2 params[j][i] ./= scales[i] end end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$28ce6e60-59cf-408a-8081-b978507b3c72„§cell_idÙ$28ce6e60-59cf-408a-8081-b978507b3c72¤codeÚ¾@bind cartpole_fcann_continuing_test_state PlutoUI.combine() do Child md""" x position: $(Child(Slider(-50f0:50f0, default = 0, show_value=true))) pole angle: $(Child(Slider(LinRange(-deg2rad(70f0), deg2rad(70f0), 1000), default = 0, show_value=true))) x velocity: $(Child(Slider(-50f0:50f0, default = 0, show_value=true))) pole angular velocity: $(Child(Slider(-10f0:10f0, default = 0, show_value=true))) """ end |> confirm¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$7ccadf01-fbba-4dfd-a5ad-770dab9946f9„§cell_idÙ$7ccadf01-fbba-4dfd-a5ad-770dab9946f9¤codeÚTmd""" We can define our policy as a normal distribution function over actions for a given state and parameter vector. $\pi(a|s, \mathbf{\theta}) \doteq \frac{1}{\sigma(s, \mathbf{\theta}) \sqrt{2\pi}} \exp \left ( - \frac{(a-\mu(s, \mathbf{\theta}))^2}{2\sigma(s, \mathbf{\theta})^2} \right ) \tag{13.19}$ This policy requires Î¼ and Ïƒ to be parameterized by the parameter vector. To make a linear model for both parameters we can use the following formulas: $\mu(s, \mathbf{\theta}) \doteq \mathbf{\theta}_\mu ^\top \mathbf{x}_\mu(s) \text{ and } \sigma(s, \mathbf{\theta}) \doteq \exp{( \mathbf{\theta}_\sigma ^ \top \mathbf{x}_\sigma (s))} \tag{13.20}$ where $\mathbf{x}_\mu(s)$ and $\mathbf{x}_\sigma(s)$ are state feature vectors. With these formulas we can apply the previous algorithms to solve environments with real-valued actions. 
"""¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$b72e030f-7d52-481f-b4f7-2b16b227e547„§cell_idÙ$b72e030f-7d52-481f-b4f7-2b16b227e547¤codeÚ\md""" ### Figure 13.2 Adding a baseline to REINFORCE can make it learn much faster as illustrated here on the short-corridor gridworld (Example 13.1). Here the approximate state-value function used in the baseline is $\hat v(s, \mathbf{w}) = w$. There is only one component of the feature vector and the state value approximation parameters. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$4c5cb75e-79b5-4502-b1eb-6246e002feaf„§cell_idÙ$4c5cb75e-79b5-4502-b1eb-6246e002feaf¤codeÙZ@bind mountaincar_binary_params create_actor_critic_params_UI(Î»_Î¸ = 0.1f0, Î»_w = 0.9f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$48b342f2-e48f-457a-9bd3-b3504a79f3a6„§cell_idÙ$48b342f2-e48f-457a-9bd3-b3504a79f3a6¤codeÙÛmd""" #### Binary Features This version of REINFORCE uses binary feature vectors for which one needs to specify the total number of features as well as a function that returns the active features for a given state. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$5d50a5d0-8fe2-4c6e-b76c-d5614e4fd884„§cell_idÙ$5d50a5d0-8fe2-4c6e-b76c-d5614e4fd884¤codeÚm#for displaying plots that do not load by default when the notebook first runs. Displays a placeholder markdown and then if the counter is more than 0 runs the function f with the provided arguments and caches the result in the appropriate dictionary function show_or_lookup_plot(buttoncounter::Integer, args::Tuple, kwargs::NamedTuple, dict::Dict, f::Function, name::AbstractString) buttoncounter == 0 && return md""" #### Placeholder for $name plot. Click above button to run """ haskey(dict, (args, kwargs)) && return dict[(args, kwargs)] p = f(args...; kwargs...) 
dict[(args, kwargs)] = p end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$ba645f6b-143f-4e83-9003-707770ae308d„§cell_idÙ$ba645f6b-143f-4e83-9003-707770ae308d¤codeÚ|function show_mountaincar_trajectory(Ï€::Function, max_steps::Integer) states, actions, rewards, sterm, nsteps = runepisode(MountainCarTask.mdp; Ï€ = Ï€, max_steps = max_steps) positions = [s[1] for s in states] velocities = [s[2] for s in states] tr1 = scatter(x = positions, y = velocities, mode = "markers", showlegend = false) tr2 = scatter(y = positions, showlegend = false) tr3 = scatter(y = [MountainCarTask.actions[i] for i in actions], showlegend = false) p1 = plot(tr1, Layout(xaxis_title = "position", yaxis_title = "velocity", xaxis_range = [-1.2, 0.5], yaxis_range = [-0.07, 0.07], height = 400)) p2 = plot(tr2, Layout(xaxis_title = "time", yaxis_title = "position", height = 400)) p3 = plot(tr3, Layout(xaxis_title = "time", yaxis_title = "action", height = 400)) @htl(""" Total Reward: $(sum(rewards))

$([p1 p2 p3])

""") end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811„§cell_idÙ$1acc0d86-fd5b-4f2e-acb2-dc9a96d3b811¤codeÙHupdate_corridor_features!(x::Vector{T}, s) where T<:Real = x[1] = one(T)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$8f1b2db4-ed35-44fc-a3d5-e06deae16d48„§cell_idÙ$8f1b2db4-ed35-44fc-a3d5-e06deae16d48¤codeÙ`cartpole_tilecoding_reinforce_continuous_parameter_study(2f0 .^ (-18:-15), 2f0 .^ (-6:-4), 1000)¨metadataƒ©show_logsÃ¨disabledÃ®skip_as_scriptÂ«code_foldedÂÙ$57bbdb10-bed8-459d-8f67-9ea637cf12ba„§cell_idÙ$57bbdb10-bed8-459d-8f67-9ea637cf12ba¤codeÚœfunction one_step_actor_critic_fcann(mdp::StateMDP{T, S, A, P, F1, F2, F3}, input_length::Integer, hidden_layers::Vector{Int64}, update_feature_vector!::Function, max_episodes::Integer, max_steps::Integer; policy_params::FCANNParams = FCANN.initializeparams_saxe(input_length, hidden_layers, length(mdp.actions)), reslayers = 0, l2 = 0f0, dropout = 0f0, use_Î¼P = true, activation_list = fill(true, length(hidden_layers)), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} setup = setup_fcann_policy_and_value_arguments(policy_params, input_length, hidden_layers, reslayers, l2, dropout, use_Î¼P, activation_list) one_step_actor_critic!(policy_params, setup.eligibility_vector, setup.value_params, setup.value_gradient, mdp, setup.update_action_preferences!, setup.update_eligibility_vector!, setup.feature_vector, update_feature_vector!, setup.value_function, setup.gradient_update, max_episodes, max_steps; kwargs...) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$ca360680-afc9-4dd9-9351-493643f91575„§cell_idÙ$ca360680-afc9-4dd9-9351-493643f91575¤codeÙ|md""" #### Probability distributions for short corridor gridworld example with probability of left action selected below """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$d95f75b5-21d8-4862-baa7-50b58d9725b8„§cell_idÙ$d95f75b5-21d8-4862-baa7-50b58d9725b8¤codeÚ ªmd""" ### Soft-max notation and gradients To use policy gradient methods, we must be able to take the gradient of the policy function for every state-action pair. Using the above notation and treating the policy as a vector function, we must know the gradient of the soft-max applied to a vector function at a particular index. Each gradient is a column vector of length $d$ where $d$ is the number of parameters. There is a separate gradient available for every index in the vector output which is one for each action or a total of $N_a$. To simplify expressions, $\mathbf{h}(s, \boldsymbol{\theta})$ will we written as $\mathbf{h}$ and $\mathbf{\pi} = \mathbf{\sigma}(\mathbf{h})$. Our desired gradient is with respect to a particular component of $\mathbf{\sigma}(\mathbf{h})$ denoted $\mathbf{\sigma}(\mathbf{h})_a$ where $a$ represents the action index. The gradient itself is the vector of partial derivatives with respect to the parameters $\theta$. The $ith$ component of the gradient $\nabla(f(\theta))_i = \frac{\partial f(\theta)}{\partial \theta_i}$. When we compute the gradient we need all the components whose expression is derived below. 
$\begin{align} \nabla \left ( \sigma(\mathbf{h})_a \right )_i &= \frac{\partial}{\partial \theta_i} \left ( \frac{e^{h_a}}{\sum_k{e^{h_k}}} \right ) \\ &=\left ( \frac{1}{{\sum_k{e^{h_k}}}} \right )^2 \left ( e^{h_a} \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{e^{h_k}} - e^{h_a} \sum_k{e^{h_k} \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \\ &=\left ( \frac{1}{{\sum_k{e^{h_k}}}} \right )^2 e^{h_a} \left ( \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{e^{h_k}} - \sum_k{e^{h_k} \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \tag{factoring out exponenential term}\\ &=\left ( \frac{e^{h_a}}{{\sum_k{e^{h_k}}}} \right ) \left ( \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{\frac{e^{h_k}}{\sum_l e^{h_l}}} - \sum_k{\frac{e^{h_k}}{\sum_l e^{h_l}} \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \tag{distributing squared fraction}\\ &=\pi_a \left ( \frac{\partial{h_a}}{\partial{\theta_i}} \sum_k{\pi_k} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \tag{definition of policy function}\\ &=\pi_a \left ( \frac{\partial{h_a}}{\partial{\theta_i}} - \sum_k{\pi_k \frac{\partial{h_k}}{\partial{\theta_i}}} \right ) \end{align}$ The final step results form the fact that the policy function is a probability distribution so the sum over it is always 1. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$65be0e58-24be-4932-92a9-9e4825b14144„§cell_idÙ$65be0e58-24be-4932-92a9-9e4825b14144¤codeÚbactor_critic_binary_continuing_squashed_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, args...; kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_binary_continuing_squashed_gaussian_parameter_study(mdp, one(T), get_active_features, num_features, args...; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$60c21e9c-e42d-4f0b-a910-3b318440fbc8„§cell_idÙ$60c21e9c-e42d-4f0b-a910-3b318440fbc8¤codeÙø@bind gaussian_plot_params PlutoUI.combine() do Child md""" ### Normal Distribution Plot with $$\mu$$: $(Child(Slider(-4:.01:4, default = 0, show_value=true))) $$\sigma$$: $(Child(Slider(0.01:0.01:5, default = 1, show_value=true))) """ end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$da2d3186-a778-41cc-9b49-759bf1e9b8fa„§cell_idÙ$da2d3186-a778-41cc-9b49-759bf1e9b8fa¤codeÙŸconst BinaryFeatures{I} = Union{C1, C2, C3} where {I <: Integer, C1 <: AbstractVector{I}, N, C2 <: NTuple{N, I}, T<:AbstractVector{I}, C3 <: Base.Generator{T}}¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$b695ef21-a1ac-4d1f-a0e1-71cd81cede18„§cell_idÙ$b695ef21-a1ac-4d1f-a0e1-71cd81cede18¤codeÙWplot_mountaincar_continuous_values(mountaincar_continuous_test_train2.policy_and_value)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00„§cell_idÙ$7d5c5e78-cdb9-4c1f-8b6d-53591f47ff00¤codeÚ’function reinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, max_episodes::Integer; policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), value_params::Vector{T} = zeros(T, num_features), kwargs...) 
where {T<:Real, S, N, A <: Union{T, NTuple{N, T}}, P, F1, F2, F3} setup = setup_binary_gaussian_policy_arguments(mdp, get_active_features, num_features) reinforce_with_baseline_monte_carlo_control!(policy_params, setup.eligibility_vector, value_params, BinaryFeatureVector(), mdp, update_binary_action_preferences!, setup.action_distribution_parameters, make_gaussian_sampler(rand(A)), update_gaussian_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, binary_value_function, update_binary_value_gradient!, max_episodes; kwargs...) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$dcb306ae-a1b1-43d6-ba6e-e38668838689„§cell_idÙ$dcb306ae-a1b1-43d6-ba6e-e38668838689¤codeÙ'md""" ### *Soft-max Implementation* """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$54f559b6-8a62-4a42-894d-c56e41d5ebef„§cell_idÙ$54f559b6-8a62-4a42-894d-c56e41d5ebef¤codeÙ;const corridor_state_counts = collect_state_distributions()¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$f545c800-0bf3-491f-9d7d-42341cfdb573„§cell_idÙ$f545c800-0bf3-491f-9d7d-42341cfdb573¤codeÚ‘function form_state_continuous_policy_function(update_feature_vector!::Function, update_action_preferences!::Function) function Ï€!(x, action_preferences, s, params) # @info "Updating feature vector with state $(s)" update_feature_vector!(x, s) update_action_preferences!(action_preferences, x, params) # @info "Action distribution is $action_preferences" return action_preferences end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$8b35661b-5075-4d63-bc31-044407f99acf„§cell_idÙ$8b35661b-5075-4d63-bc31-044407f99acf¤codeÚactor_critic_with_eligibility_traces_binary_features(corridor_continuing_mdp, 0.75f0, 0.25f0, get_corridor_features, 1, 1_000_000, Î±_Î¸ = 0.00625f0, Î±_w = 0.0004f0, Î±_rÌ„ = 0.004f0, policy_params = [0f0 3.7f0]; save_step_rewards = 
true).policy_and_value(1)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$09dd1440-5d09-421f-addc-b1ede43ff517„§cell_idÙ$09dd1440-5d09-421f-addc-b1ede43ff517¤codeÙolet x = LinRange(-5, 5, 1000) plot(scatter(x = x, y = pdf.(Normal(gaussian_plot_params...), x)), Layout()) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$a0ca7a5e-0089-4a45-9278-c0f27cd096a0„§cell_idÙ$a0ca7a5e-0089-4a45-9278-c0f27cd096a0¤codeÙWplot_mountaincar_continuous_values(mountaincar_continuous_test_train3.policy_and_value)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$64b38d1f-ecf9-4843-89a1-4c8953048265„§cell_idÙ$64b38d1f-ecf9-4843-89a1-4c8953048265¤codeÙconst cartpole_fcann_continuing_test_episode = runepisode(cartpole_setup.mdps.episodic.discrete; Ï€ = cartpole_continuing_fcann_test.policy_sample_action, max_steps = 1_000)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d963ff6d-f1b6-4799-aa0e-1ae100310d84„§cell_idÙ$d963ff6d-f1b6-4799-aa0e-1ae100310d84¤codeÙpPlutoDevMacros.@frompackage @raw_str(joinpath(@__DIR__, "..", "ApproximationUtils.jl")) using ApproximationUtils¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$b16899b7-36bf-4a5e-8e2f-4496b8450687„§cell_idÙ$b16899b7-36bf-4a5e-8e2f-4496b8450687¤codeÙÄsquashed_gaussian_pdf(x::Union{T, AbstractArray{N, T}}, Î¼::T, Ïƒ::T, xmax::T) where {N, T<:Real} = inv(Ïƒ*sqrt(T(2)*Ï€)) * exp(-inv(T(2))*((atanh(x/xmax) - Î¼)/Ïƒ)^2) / abs(xmax*(1 - (x/xmax)^2))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$10cdd16e-a337-4421-a7a0-6de4e4b60c0f„§cell_idÙ$10cdd16e-a337-4421-a7a0-6de4e4b60c0f¤codeÚäbegin mutable struct BinaryGaussianEligibilityVector{T<:Real, A<:Union{T, NTuple{N, T} where N}, P<:Union{T, Vector{T}}, B <: BinaryFeatureVector} binary_features::B a::A Î¼::P Ïƒ::P end BinaryGaussianEligibilityVector(a::T) where T<:Real = BinaryGaussianEligibilityVector(BinaryFeatureVector(), a, zero(T), one(T)) BinaryGaussianEligibilityVector(a::NTuple{N, T}) where {T<:Real, N} = 
BinaryGaussianEligibilityVector(BinaryFeatureVector(), a, zeros(T, N), ones(T, N)) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$a8b40b8f-051a-4e6f-a079-ece4f32873de„§cell_idÙ$a8b40b8f-051a-4e6f-a079-ece4f32873de¤codeÚVfunction create_actor_critic_params_UI(;Î»_Î¸ = 0.5f0, Î»_w = 0.5f0, log2Î±_Î¸ = -10, log2Î±_w = -10) PlutoUI.combine() do Child @htl(""" $(md""" $$\lambda_\theta$$: $(Child(:Î»_Î¸, Slider(0.00f0:0.001f0:.999f0, default = Î»_Î¸, show_value=true))) $$\lambda_\mathbf{w}$$: $(Child(:Î»_w, Slider(0.00f0:0.001f0:.999f0, default = Î»_w, show_value=true))) $$\log_2 \alpha_\theta$$ min: $(Child(:Î±_Î¸_min, NumberField(-100:0, default = log2Î±_Î¸))) $$\log_2 \alpha_{\mathbf{w}}$$ min: $(Child(:Î±_w_min, NumberField(-100:0, default = log2Î±_w))) """)""") end |> confirm end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$5eebf3da-bfe7-46eb-81a3-f87f334ee270„§cell_idÙ$5eebf3da-bfe7-46eb-81a3-f87f334ee270¤codeÚÀfunction create_actor_critic_fcann_params_UI(;Î»_Î¸ = 0.5f0, Î»_w = 0.5f0, h = 8, l = 2, log2Î±_Î¸ = -10, log2Î±_w = -10) PlutoUI.combine() do Child md""" $$\lambda_\theta$$: $(Child(:Î»_Î¸, Slider(0.00f0:0.001f0:.999f0, default = Î»_Î¸, show_value=true))) $$\lambda_\mathbf{w}$$: $(Child(:Î»_w, Slider(0.00f0:0.001f0:.999f0, default = Î»_w, show_value=true))) hidden layer size: $(Child(:h, NumberField(1:128, default = h))), num layers: $(Child(:l, NumberField(1:5, default = l))) $$\log_2 \alpha_\theta$$ min: $(Child(:Î±_Î¸_min, NumberField(-100:0, default = log2Î±_Î¸))) $$\log_2 \alpha_{\mathbf{w}}$$ min: $(Child(:Î±_w_min, NumberField(-100:0, default = log2Î±_w))) """ end |> confirm end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$9bce6fdb-2cbc-4758-9a8b-794e490c973d„§cell_idÙ$9bce6fdb-2cbc-4758-9a8b-794e490c973d¤codeÙ8@bind ep2_step Slider(1:length(ep2[1]), 
show_value=true)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbf„§cell_idÙ$b86ee9d3-b6b5-4ea0-8f55-1927571cdfbf¤codeÚ"function create_continuous_action_mountaincar(;slipforce = 1f0) mdp = MountainCarTask.mdp function step(s, a) f = if abs(a) > slipforce sign(a)*0.1f0 else a end (-1f0, MountainCarTask.step(s, f)) end ContinuousMDP(step, mdp.initialize_state, 0f0; isterm = mdp.isterm) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6„§cell_idÙ$0ce66c9d-6d1c-4c2d-8178-5bcdfa247cd6¤codeÙšconst mountaincar_continuing_test_episode = runepisode(MountainCarTask.mdp, Ï€ = mountaincar_continuing_tile_test.policy_sample_action, max_steps = 1_000)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$7afb6fb0-248a-4518-b94f-9876f81eca64„§cell_idÙ$7afb6fb0-248a-4518-b94f-9876f81eca64¤codeÙçcorridor_continuing_parameter_study(args...; kwargs...) = actor_critic_linear_parameter_study(corridor_continuing_mdp, get_corridor_features, 1, args...; init_policy_params = [0f0 3.7f0], seed = 45, binary_features=true, kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$37a273b6-b104-46f0-987a-401dc1c97327„§cell_idÙ$37a273b6-b104-46f0-987a-401dc1c97327¤codeÙW@bind start_cartpole_continuing_binary_param_study CounterButton("Run Parameter Study")¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$7a6f3f79-ea06-4994-8b62-90b2056e4034„§cell_idÙ$7a6f3f79-ea06-4994-8b62-90b2056e4034¤codeÚõbegin function squashed_gaussian_action_sampler(params::Vector{T}, amax::T) where T<:Real Ïƒ = exp(params[2]) Î¼ = params[1] isinf(Î¼) && return amax*sign(Î¼) isapprox(Ïƒ, zero(T)) && return amax*tanh(Î¼) isnan(Ïƒ) && return amax*tanh(Î¼) amax*tanh(rand(Normal(Î¼, Ïƒ))) end make_squashed_gaussian_n_sampler(::Val{1}, amax::T) where T<:Real = params -> squashed_gaussian_action_sampler(params, amax) function make_squashed_gaussian_n_sampler(::Val{N}, amax::NTuple{N, T}) where 
{N, T<:Real} function f(params::Vector{T}) where T<:Real ntuple(i -> amax[i]*tanh(rand(Normal(params[i], exp(params[i+N])))), N) end end make_squashed_gaussian_n_sampler(n::Integer, amax::T) where T<:Real = make_squashed_gaussian_n_sampler(Val(n), amax) make_squashed_gaussian_sampler(::T, amax::T) where T<:Real = params -> squashed_gaussian_action_sampler(params, amax) make_squashed_gaussian_sampler(::NTuple{N, T}, amax::NTuple{N, T}) where {N, T<:Real} = make_squashed_gaussian_n_sampler(N, amax) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$f2ed56c9-c2b7-42cb-a083-e12aeaa126ef„§cell_idÙ$f2ed56c9-c2b7-42cb-a083-e12aeaa126ef¤codeÙreinforce_monte_carlo_control_binary_features(corridor_mdp, get_corridor_features, 1, 1_000, Î± = 2f0^-13, max_steps = 1_000).policy_function(1)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$cbea5840-49d2-4e91-be9c-f5f15666d78a„§cell_idÙ$cbea5840-49d2-4e91-be9c-f5f15666d78a¤codeÙ°reinforce_with_baseline_monte_carlo_control_binary_features(corridor_mdp, get_corridor_features, 1, 1_000, Î±_Î¸ = 2f0^-12, Î±_w = 2f0^-6, max_steps = 1_000).policy_function(1)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$1f041cb3-618c-4380-a1ec-d7bbe4a80f62„§cell_idÙ$1f041cb3-618c-4380-a1ec-d7bbe4a80f62¤codeÚCfunction actor_critic_binary_episodic_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, Î»_Î¸::T, Î»_w::T, Î±_Î¸_list::AbstractVector{T}, Î±_w_list::AbstractVector{T}, max_episodes::Integer; nruns::Integer = 100, max_steps::Integer = 10_000, seed = rand(UInt64), init_policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), init_value_params::Vector{T} = zeros(T, num_features), kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) function average_runs(Î±_Î¸, Î±_w) 1:nruns |> Map(_ -> actor_critic_with_eligibility_traces_binary_features(mdp, Î»_Î¸, Î»_w, get_active_features, num_features, max_episodes, max_steps; Î±_Î¸ = Î±_Î¸, Î±_w = Î±_w, policy_params = copy(init_policy_params), value_params = copy(init_value_params), kwargs...) |> x -> isempty(x.episode_rewards) ? -T(Inf) : mean(x.episode_rewards)) |> foldxt(+) |> x -> x / nruns end traces = [begin scatter(x = Î±_Î¸_list, y = average_runs.(Î±_Î¸_list, Î±_w), name = "Î±_w = $Î±_w") end for Î±_w in Î±_w_list] plot(traces, Layout(xaxis_title = "Î±_Î¸", yaxis_title = "Average Reward Per Episode in the First
$max_episodes Episodes Averaged Over $nruns Runs", xaxis_type = "log", title = "Binary Feature Encoding with $num_features Features, Î»_Î¸ = $Î»_Î¸, Î»_w = $Î»_w")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$96506201-6b66-49e6-8179-06952e2394e1„§cell_idÙ$96506201-6b66-49e6-8179-06952e2394e1¤codeÚBfunction setup_binary_policy_arguments(mdp::StateMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer) where {T<:Real, S, A, P, F1, F2, F3} x = BinaryFeatureVector() update_feature_vector!(x::BinaryFeatureVector, s) = update_binary_feature_vector!(x, s, get_active_features) action_preferences = zeros(T, length(mdp.actions)) âˆ‡lnÏ€ = BinaryEligibilityVector(x, 1, copy(action_preferences)) return (feature_vector = x, update_feature_vector! = update_feature_vector!, action_preferences = action_preferences, eligibility_vector = âˆ‡lnÏ€) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$76b03e72-da04-4530-8534-6d6468268cbd„§cell_idÙ$76b03e72-da04-4530-8534-6d6468268cbd¤codeÚ‡md""" $\sum_{s \in \mathcal{S}} \sum_{k = 0}^\infty \Pr \{ s_0 \rightarrow s, k, \pi \} = \sum_{k = 0}^\infty \left [ 1 - \Pr \{s_0 \rightarrow S_T, k, \pi \} \right ] = \eta$ where $\eta$ is the average length of an episode. The quantity inside the brackets is the probability that an episode has not terminated by step k and follows from the fact that the sum over states in $\mathcal{S}$ is over the set of non-terminal states. If the sum was over $\mathcal{S}^+$ instead then it would be infinite since the first sum term would be 1 for every k. Normally to calculate $\eta$, we would use the expected value with the probability of an episode lasting exactly $k$ steps, but the probability we have access to here is actually the distribution function, not the density function. That is $\Pr \{s_0 \rightarrow S_T, k, \pi \} = \sum_{t = 0}^k \Pr \{ T = t \} = \Pr \{ T \leq k \}$ where $T$ is the length of an episode. 
Using these probabilities, we can write $\eta = \mathbb{E}_\pi [T] = \sum_{k = 0}^\infty k \Pr \{ T = k \} = \Pr \{T = 1 \} + 2 \Pr \{T = 2 \} + \cdots$. Earlier we had the expression $\eta = \sum_{k = 0}^\infty \left [ 1 - \Pr \{s_0 \rightarrow S_T, k, \pi \} \right ] = \sum_{k = 0}^\infty \Pr \{T \gt k \} = \sum_{k = 0}^\infty \sum_{t = k + 1}^\infty \Pr \{T = t \}$ We can stack up the terms of this double sum to see that it is equivalent to the expected value calcuation from before: $\begin{flalign} \Pr \{ T = 1 \} + \Pr \{ T = 2 \} + &\Pr \{ T = 3 \} +\cdots \\ \Pr \{ T = 2 \} + &\Pr \{ T = 3 \} + \cdots \\ &\Pr \{ T = 3 \} + \cdots \\ \vdots \end{flalign}$ If we count terms along the diagonal, we see that each value of $k$ has exactly $k$ terms, matching the expected value calculation. What if we wanted to calculate the bivariate distribution over states and steps where we ignore the terminal states $\mu_\pi(s, k)$ such that $\sum_{s \in \mathcal{S}} \sum_k \mu_\pi(s, k) = 1$. This probability represents the chance of sampling a particular step and state simultaneously from a unbiased sample of non-terminal states in an episode. Luckily we can break down this probability into two components: 1) the probability of reaching a step k without terminating 2) the probability of being in a non-terminal state on step k. We saw already that 1) is just $\sum_{s \in \mathcal{S}} \Pr \{ s_0 \rightarrow s, k, \pi \}$ and 2) we can calculate by normalizing those probabilities over only the non-terminal states: $\frac{\Pr \{ s_0 \rightarrow s, k, \pi \}}{\sum_{s \in \mathcal{S}} \Pr \{ s_0 \rightarrow s, k, \pi \} }$. By multiplying these two together we see that the probability is just the original distribution but where the domain of possible input values is $s \in \mathcal{S}$ and all possible steps $k$. 
Therefore, we can transform this into a normalized bivariate distribution by dividing by its sum over those two sets: $\mu_\pi(s, k) = \frac{\Pr \{ s_0 \rightarrow s, k, \pi \}}{\sum_{x \in \mathcal{S}} \sum_{t = 0}^\infty \Pr \{ s_0 \rightarrow x, t, \pi \}}$ Now that we have established the relationship between the on-policy distribution function and the probability expression we have, we can use it to complete the proof below. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$fd89433e-643c-474b-b3c4-a997678421a6„§cell_idÙ$fd89433e-643c-474b-b3c4-a997678421a6¤codeÙâmd""" #### Linear Features This version of REINFORCE uses linear feature vectors for which one needs to specify the total number of features as well as a function that updates the values in a feature vector given a state. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$87feff3e-e510-4916-91a9-db3a2cd12225„§cell_idÙ$87feff3e-e510-4916-91a9-db3a2cd12225¤codeÚ×@bind fcann_continuing_cartpole_study_params PlutoUI.combine() do Child md""" $$\lambda_\theta$$: $(Child(:Î»_Î¸, Slider(0.00f0:0.001f0:.999f0, default = 0.75f0, show_value=true))) $$\lambda_\mathbf{w}$$: $(Child(:Î»_w, Slider(0.00f0:0.001f0:.999f0, default = 0.25f0, show_value=true))) $$\alpha_{\overline{r}}$$: $(Child(:Î±_rÌ„, NumberField(0.00f0:0.001f0:.999f0, default = 0.1f0))) hidden layer size: $(Child(:h, NumberField(1:128, default = 8))), num layers: $(Child(:l, NumberField(1:5, default = 3))) $$\log_2 \alpha_\theta$$ min: $(Child(:Î±_Î¸_min, NumberField(-100:0, default = -11))) $$\log_2 \alpha_{\mathbf{w}}$$ min: $(Child(:Î±_w_min, NumberField(-100:0, default = -10))) """ end |> confirm¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$5261651e-a51e-4e80-8e23-83a4c10e5259„§cell_idÙ$5261651e-a51e-4e80-8e23-83a4c10e5259¤codeÚ¬begin function update_gaussian_eligibility_vector!(âˆ‡lnÏ€::Matrix{T}, action_dist_params::Vector{T}, x::Vector{T}, action::T, policy_params::Matrix{T}) where T<:Real c1 = 
action - first(action_dist_params) Ïƒ = exp(last(action_dist_params)) c2 = Ïƒ^-2 c3 = c2*c1 c4 = c3*c1 - one(T) @inbounds @simd for i in eachindex(x) âˆ‡lnÏ€[i, 1] = x[i]*c3 end @inbounds @simd for i in eachindex(x) âˆ‡lnÏ€[i, 2] = x[i]*c4 end end function update_gaussian_eligibility_vector!(âˆ‡lnÏ€::Matrix{T}, action_dist_params::Vector{T}, x::Vector{T}, action::NTuple{N, T}, policy_params::Matrix{T}) where {N, T <: Real} for k = 1:N c1 = action - action_dist_params[k] Ïƒ = exp(action_dist_params[k+N]) c2 = Ïƒ^-2 c3 = c2*c1 c4 = c3*c1 - one(T) @inbounds @simd for i in eachindex(x) âˆ‡lnÏ€[i, k] = x[i]*c3 end @inbounds @simd for i in eachindex(x) âˆ‡lnÏ€[i, k+N] = x[i]*c4 end end end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$dddc4a2f-34b2-41dc-85b3-55aba4880fa6„§cell_idÙ$dddc4a2f-34b2-41dc-85b3-55aba4880fa6¤codeÙŒdisplay_cartpole_episode((runepisode(cartpole_setup.mdps.episodic.continuous, reinforce_test.policy_sample_action) |> x -> (x[1], x[2]))...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$54fff14b-cf53-47b0-9cfa-8b9ee33df54e„§cell_idÙ$54fff14b-cf53-47b0-9cfa-8b9ee33df54e¤codeÚÎbegin mutable struct BinaryBetaEligibilityVector{T<:Real, A<:Union{T, NTuple{N, T} where N}, P<:Union{T, Vector{T}}, B <: BinaryFeatureVector} binary_features::B a::A Î±::P Î²::P end BinaryBetaEligibilityVector(a::T) where T<:Real = BinaryBetaEligibilityVector(BinaryFeatureVector(), a, one(T), one(T)) BinaryBetaEligibilityVector(a::NTuple{N, T}) where {T<:Real, N} = BinaryBetaEligibilityVector(BinaryFeatureVector(), a, ones(T, N), ones(T, N)) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$023f67b8-8f38-470a-9766-ac60a75678aa„§cell_idÙ$023f67b8-8f38-470a-9766-ac60a75678aa¤codeÙfconst mountaincar_fcann_setup = fcann_feature_vector_setup(mountaincar_min_vals, 
mountaincar_max_vals)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$1558cec1-c4fd-4bc0-85ed-ae22c6067d41„§cell_idÙ$1558cec1-c4fd-4bc0-85ed-ae22c6067d41¤codeÚêmd""" We can also repeat this derivation for the alternative linear parameterization where we only have state feature vectors and a parameter matrix with components $\boldsymbol{\theta}_{i, j}$: $\begin{flalign} \mathbf{h} &= \boldsymbol{\theta}^\top \mathbf{x}(s) \\ h_a &= \mathbf{h}_a \\ \mathbf{\pi}(s) &= \sigma(\mathbf{h}) \\ \pi_a &= \sigma(\mathbf{h})_a \\ \nabla(\pi_a)_{i, j} &= \pi_a \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases} \end{flalign}$ We already know how to apply the chain rule to the natural logarithm so our final gradient is: Applying this to the above expression yields: $\begin{flalign} \nabla \left ( \ln \pi_a \right )_{i, j} &= \frac{\nabla \left ( \pi_a \right )_{i, j}}{\pi_a} \\ &= \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases} \end{flalign}$ which is the per component version of the desired vector expression. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$da8d0bca-105b-4d0b-a73d-ee5c9059aeaf„§cell_idÙ$da8d0bca-105b-4d0b-a73d-ee5c9059aeaf¤codeÚ±md""" Notice now that all of the parameters associated with the state-value estimate are irrelevent since they always cancel out in the parameter update. Even though we have added a parameter, this method effectively removes two from the analysis. Also, we seem to actually benefit from an intermediate value of $\lambda_{\boldsymbol{\theta}}$ unlike in the episodic case where using the Monte Carlo method was always the best. 
"""¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$3e7cecec-eb77-4862-8e3c-b510422e06db„§cell_idÙ$3e7cecec-eb77-4862-8e3c-b510422e06db¤codeÙ8plot_squashed_gaussian(squashed_gaussian_plot_params...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$0284f0d7-b8a9-4ae6-add0-ac1078571d9b„§cell_idÙ$0284f0d7-b8a9-4ae6-add0-ac1078571d9b¤codeÚ3md""" $\begin{flalign} J(\boldsymbol{\theta}) \doteq r(\pi) &\doteq \lim_{h \rightarrow \infty} \frac{1}{h} \sum_{t=1}^h \mathbb{E} [R_t \mid S_0, A_{0:t-1} \sim \pi] \tag{13.15} \\ &= \lim_{t \rightarrow \infty} \mathbb{E}[R_t \vert S_0,A_{0:t-1} \sim \pi] \\ &= \sum_s \mu(s) \sum_a \pi(a \vert s) \sum_{s^\prime, r} p(s^\prime, r \vert s, a) r \end{flalign}$ where $\mu$ is the steady-state distribution under $\pi$, $\mu(s) \doteq \lim_{t \rightarrow \infty} \Pr \{ S_t = s \vert A_{0:t} \sim \pi \}$, which is assumed to exist and to be independent of $S_0$ (an ergodicity assumption). Remember that this is the special distribution under which, if you select actions according to $\pi$, you remain the same distribution: $\sum_s \mu(s) \sum_a \pi(a \vert s, \boldsymbol{\theta})p(s^\prime \vert s, a) = \mu(s^\prime), \: \forall s^\prime \in \mathcal{S}$ Naturally, in the continuing case, we define values, $v_\pi(s) \doteq \mathbb{E}_\pi [G_t \vert S_t = s]$ and $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \vert S_t = s, A_t = a]$, with respect to the differential return: $G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \cdots \tag{13.17}$ With these alternate definitions, the policy gradient theorem as given for the episodic case (13.5) remains true for the continuing case. 
See proof below: """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$b94fc99c-f439-4df2-8da3-c01718a136c4„§cell_idÙ$b94fc99c-f439-4df2-8da3-c01718a136c4¤codeÚ{md""" Repeating this process for state 2 yields: $\begin{flalign} v_2 &= -\frac{2+p}{p(1-p)} \\ \frac{\partial v_2}{\partial p} &= -\frac{p(1-p) - (2+p)(1 - 2p)}{p^2(1-p)^2} \end{flalign}$ Setting this equal to 0 implies $\begin{flalign} p - p^2 &= 2 - 4p + p - 2p^2 \\ p^2 + 4p - 2 &= 0 \\ \end{flalign}$ Using the quadratic formula and taking only the positive solution yields: $p = \frac{-4 + \sqrt{16 + 8}}{2} = \frac{-4 + \sqrt{24}}{2} = -2 + \sqrt{6} \approx 0.4495$ So, in order to maximize the value at state 2, we have $p_{\text{left}} \approx 0.4495$ and $p_{\text{right}} \approx 0.5505$, which is different from the value we got for state 1. So there is a different optimal policy depending on the starting state. It should be obvious, for example, that starting in the third state results in an optimal policy of choosing the right action every time. The value functions for each state are plotted below. The behavior of $v_3$ is not well defined at $p=0$ because for any finite $v_2$ it should be 0 but the limit approaching from the right side is -3. This is because for $p=0$ both $v_1$ and $v_2$ are not finite and the episode never terminates. 
The value of the state at this probability is: $v_2 = - \frac{2+p}{p(1-p)} = -\frac{\sqrt{6}}{(\sqrt{6}-2)(3 - \sqrt{6})} = - \frac{\sqrt{6}}{3 \sqrt{6} - 6 - 6 + 2 \sqrt{6}} = - \frac{\sqrt{6}}{5 \sqrt{6} - 12} \approx -9.9$ """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$b8532822-179b-4cd5-a279-4b71dafb544a„§cell_idÙ$b8532822-179b-4cd5-a279-4b71dafb544a¤codeÚ2const mountaincar_continuous_test_train = actor_critic_with_eligibility_traces_binary_features_gaussian_actions(mountaincar_continuous_mdp, 0.05f0, 0.8f0, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, typemax(Int64), 1_000_000; Î±_Î¸ = 5f-5, Î±_w = 0.00008f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$07ba9fe4-aaa7-4123-9865-cbfa79d0d44a„§cell_idÙ$07ba9fe4-aaa7-4123-9865-cbfa79d0d44a¤codeÙ£display_cartpole_episode((runepisode(cartpole_setup.mdps.episodic.discrete; Ï€ = reinforce_test4.policy_sample_action, max_steps = 1_000) |> x -> (x[1], x[2]))...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7d„§cell_idÙ$f487f2dd-ad09-48ac-ae34-bf50cfa6ac7d¤codeÙe@bind start_mountaincar_continuing_fcann_param_study CounterButton("Run Mountaincar Parameter Study")¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$5c4a383f-fcf2-4f2b-819f-6d84471dda00„§cell_idÙ$5c4a383f-fcf2-4f2b-819f-6d84471dda00¤codeÚufunction update_fcann_value_gradient!(âˆ‡vÌ‚::FCANNParams, x::Vector{T}, params::FCANNParams, hidden_layers::Vector{Int64}, l2::T, tanh_grad_z::FCANNActivations{T}, activations::FCANNActivations{T}, deltas::FCANNActivations{T}, dropout::T, reslayers::Integer, activation_list::AbstractVector{B}, scales) where {T<:Float32, B<:Bool} FCANN.nnCostFunction(params..., hidden_layers, x, 1, l2, âˆ‡vÌ‚..., tanh_grad_z, activations, deltas, dropout; resLayers = reslayers, loss_type = OutputIndex(), activation_list = activation_list) @inbounds for i in eachindex(params[1]) for j in 1:2 
âˆ‡vÌ‚[j][i] .*= scales[i] end end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$135f205a-f87e-4691-8e87-d317d6312c84„§cell_idÙ$135f205a-f87e-4691-8e87-d317d6312c84¤codeÚ:md""" The plots below visualize these distributions for the corridor problem, starting with the normalized distributions per step, which include the terminal states. If we continued to create these plots for larger values of $k$, then the distribution would collapse to a value of 1 for being in a terminal state. In order to calculate other distributions such as the stationary state distribution, it is necessary to renormalize these probabilities by excluding the terminal states: #### On-policy Distributions $$\begin{flalign} &\mu_{k, \pi}(s) = \Pr\{S_k = s \mid \pi \} \; \forall s \in \mathcal{S}^+ \tag{state visits per step}\\ &\Pr \{ T \leq k \vert \pi \} = 1 - \sum_{s \in \mathcal{S}} \Pr\{S_k = s \mid \pi \} \; \forall k \tag{Chance of terminating already (distribution function not density)}\\ &\mu_\pi(s) = \frac{\sum_k \Pr \{ S_k = s \mid \pi \}}{\sum_{k} \sum_{s \in \mathcal{S}} \Pr \{ S_k = s \mid \pi \}} \; \forall s \in \mathcal{S} \tag{non-terminal state visits}\\ &\mu_\pi(s, k) = \frac{\Pr \{ S_k = s \mid \pi \}}{\sum_{k} \sum_{s \in \mathcal{S}} \Pr \{ S_k = s \mid \pi \}} \; \forall s \in \mathcal{S} \tag{non-terminal state and step visits}\\ \end{flalign}$$ Note that the final two distributions are only defined for non-terminal states. If we tried to include terminal states we would be unable to normalize the distribution since $\lim_{k \rightarrow \infty} \Pr \{ S_k = S_T \mid \pi \} = 1$ and we would have a diverging sum in the denominator. The only reason these calculations are possible is that the probabilities reach zero quickly enough at higher $k$ for the non-terminal states. The plots below visualize the four expressions above. 
The second expression notably is not a probability density but a cumulative distribution function since it includes a sum of all probabilities that meet the condition. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$4a39f9a7-72d4-44ad-895a-742cd1291f92„§cell_idÙ$4a39f9a7-72d4-44ad-895a-742cd1291f92¤codeÙW@bind dist_plot_p Slider(0.1f0:0.1f0:.9f0; default = 0.5f0, show_value=true) |> confirm¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$ee72af8d-3cb8-4314-82df-580f068e1252„§cell_idÙ$ee72af8d-3cb8-4314-82df-580f068e1252¤codeÚ md""" One common form of linear feature vector is one that selects active features per state. Tile coding is an example of this where a state is assigned a tile in each tiling used and the number of tilings controls how many active features a given state will have. Because the only possible feature vector values are 1 or 0, this style of encoding need not be as complex as other methods. From the form of the gradients we can derive an abbreviated algorithm that need not compute the eligibility vector explicitly. We can define a binary feature encoding by the function $\mathcal{F}(s)$ which returns the indices of active features for a state $s$ as well as the knowledge of how many total features there are, $d$. All of the values of $\mathbf{x}(s)$ are zero except for the indices in $\mathcal{F}(s)$ whose values are 1. 
That simplifies the expression we had before for the linear feature eligibility vector: $\begin{flalign} \nabla \left ( \ln \pi_a \right )_{i, j} &= \frac{\nabla \left ( \pi_a \right )_{i, j}}{\pi_a} \\ &= \begin{cases} \mathbf{x}(s)_i (1 - \pi_j), & \text{ if } j = a \\ -\pi_j \mathbf{x}(s)_i, & \text{ else }\\ \end{cases} \\ &= \begin{cases} (1 - \pi_j), & \text{ if } j = a \text{, } i \in \mathcal{F}(s) \\ -\pi_j, & \text{ if } j \neq a \text{, } i \in \mathcal{F}(s) \\ 0, & \text{ otherwise} \end{cases} \end{flalign}$ We can see from this form of the eligibility vector that it need not be computed explicitly and we do not need to instantiate a feature vector either. Rather we can simply go through the active feature indices and subtract the policy output for the column index at each row and then add 1 to the column corresponding to the selected action, with both changes scaled by the step constant $c$: Loop for each step of the episode $t = 0, 1, \cdots, T-1$ $G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1}R_k$ $c = \alpha \times \gamma^t \times G$ Loop for each action index j Loop for each active feature $i \in \mathcal{F}(S_t)$ $\theta_{i, j} \leftarrow \theta_{i, j} - c \times \pi(a_j \vert S_t, \boldsymbol{\theta})$ Define $j_a$ as the column index corresponding to action $A_t$ Loop for each active feature $i \in \mathcal{F}(S_t)$ $\theta_{i, j_a} \leftarrow \theta_{i, j_a} + c$ Specialized versions of REINFORCE that use binary features and linear features can be found below as well as the general case that works for any type of parameterized function approximation. 
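The binary-feature update described above can be sketched as a small standalone function. This is only an illustration under stated assumptions: `binary_reinforce_step!` and its argument names are hypothetical, the parameter matrix is assumed to be d × N_a, the active feature indices and the policy probabilities at the current state are assumed to be already computed, and the scalar `c` is assumed to equal α γ^t G.

```julia
# Hypothetical helper, not part of the notebook's own API.
# θ      : d × num_actions parameter matrix (modified in place)
# active : indices of the active binary features for the current state
# a      : column index of the action that was taken
# π_s    : policy probabilities at the current state, one per action
# c      : scalar step, assumed to be α * γ^t * G
function binary_reinforce_step!(θ::AbstractMatrix, active, a::Integer,
                                π_s::AbstractVector, c::Real)
    # Subtract c * π_j from every active row of every action column.
    for j in eachindex(π_s), i in active
        θ[i, j] -= c * π_s[j]
    end
    # Add c to the active rows of the selected action's column.
    for i in active
        θ[i, a] += c
    end
    return θ
end

# Example call: 4 features, 2 actions, features 1 and 3 active, action 2 taken.
θ_example = zeros(4, 2)
binary_reinforce_step!(θ_example, [1, 3], 2, [0.25, 0.75], 2.0)
```

Rows of inactive features are never touched, which is why neither the feature vector nor the eligibility vector has to be materialized.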
"""¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$e524f8cc-ab69-4f8b-a59f-28156696a104„§cell_idÙ$e524f8cc-ab69-4f8b-a59f-28156696a104¤codeÙc@bind run_mountaincar_binary_episodic_countinuous_param_study2 CounterButton("Run Parameter Study")¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09„§cell_idÙ$1894ae1a-bb68-4de0-a4d2-ac5d02c49f09¤codeÙ,plot(mountaincar_test_train.episode_rewards)¨metadataƒ©show_logsÃ¨disabledÃ®skip_as_scriptÂ«code_foldedÂÙ$f3bc47b5-03fc-4bd9-a890-26f9608a730b„§cell_idÙ$f3bc47b5-03fc-4bd9-a890-26f9608a730b¤codeÚ.md""" ### *Continuing Corridor Gridworld Example* Note that if we try to apply this algorithm to the short corridor gridworld it fails because a terminal state is encountered. This condition is checked inside the algorithm because there is nothing about an MDP the way it is defined which tells you in advance if it is a continuing task or not. In the tabular case you can always check to see if a terminal state exists since every state is available, but for the non-tabular case, all we can do is note the problem if a terminal state is encountered. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$4915b1ed-ad53-4ece-9b00-bc136d47d8dc„§cell_idÙ$4915b1ed-ad53-4ece-9b00-bc136d47d8dc¤codeÚšmd""" It is implicit in all expressions below that $\pi$ is a function of $\boldsymbol{\theta}$ and that the gradients are with respect to $\boldsymbol{\theta}$. The performance measure for the continuing case is $J(\boldsymbol{\theta}) = r(\boldsymbol{\theta})$ (13.15) and all value functions use the definition of the differential return. 
We begin by expressing the gradient of the state value function in terms of the state-action value function, the policy, the average return and gradients thereof: $\begin{flalign} \nabla v_\pi(s) &= \nabla \left [ \sum_a \pi(a \vert s) q_\pi (s, a) \right ], \: \forall s \in \mathcal{S} \\ &= \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla q_\pi(s, a) \right ] \tag{product rule} \\ &=\sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \nabla \sum_{s^\prime, r} p(s^\prime, r, \vert s, a)\left (r - r(\boldsymbol{\theta}) + v_\pi(s^\prime) \right ) \right ] \tag{differential return definitions} \\ &=\sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) [ -\nabla r(\boldsymbol{\theta}) + \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) ] \right ] \tag{distributing gradient}\\ \end{flalign}$ The purpose of this expression is to isolate the term which is the gradient of the average return since this is the performance metric gradient we originally sought. Note that if we separate the terms inside the sum, the one with the gradient of $r$ is $\sum_a \pi(a\vert s) [- \nabla r(\boldsymbol{\theta})] = -\nabla r(\boldsymbol{\theta}) \sum_a \pi(a \vert s)$. But the policy function is a probability distribution so its sum over actions is just 1. Therefore, this term simplifies to just $-\nabla r(\boldsymbol{\theta})$ which we can simply move to the other side of the expression swapping its place with the state value function: $\begin{flalign} \nabla v_\pi(s)&=-\nabla r(\boldsymbol{\theta}) + \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] \\ \nabla r(\boldsymbol{\theta}) &=-\nabla v_\pi(s) + \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] \end{flalign}$ Now the left hand side is $\nabla J(\boldsymbol{\theta})$ and does not depend on $s$. 
As such, the right hand side as a whole must be independent of $s$ as well so we are free to take a weighted sum of it over some probability distribution on $s$ since all the terms sum to 1. That is, if $f$ is independent of $s$, then $f = \sum_s \mu(s) f = f \sum_s \mu(s) = f \times 1 = f$: $\begin{flalign} \nabla J(\boldsymbol{\theta}) &= \sum_s \mu(s) \left ( \sum_a \left [ \nabla \pi(a \vert s) q_\pi(s, a) + \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) \right ] - \nabla v_\pi(s) \right ) \\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) + \sum_s \mu(s) \sum_a \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) - \sum_s \mu(s) \nabla v_\pi(s) \tag{separating sum terms}\\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) + \sum_{s^\prime} \sum_s \mu(s) \sum_a \pi(a \vert s) p(s^\prime \vert s, a) \nabla v_\pi(s^\prime) - \sum_s \mu(s) \nabla v_\pi(s) \tag{swapping sum order in second term}\\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) + \sum_{s^\prime} \mu(s^\prime) \nabla v_\pi(s^\prime) - \sum_s \mu(s) \nabla v_\pi(s) \tag{stationary state distribution definition}\\ &= \sum_s \mu(s) \sum_a \nabla \pi(a \vert s) q_\pi(s, a) \tag{cancelling equivalent sum terms}\\ &= \mathbb{E}_\pi \left [ \sum_a \nabla \pi(a \vert S_t) q_\pi(S_t, a) \right ] \tag{expected value definition}\\ &= \mathbb{E}_\pi \left [ \sum_a \pi(a \vert S_t) \frac{\nabla \pi(a \vert S_t)}{\pi(a \vert S_t)} q_\pi(S_t, a) \right ] \tag{multiplying and dividing by the policy}\\ &= \mathbb{E}_\pi \left [\frac{\nabla \pi(A_t \vert S_t)}{\pi(A_t \vert S_t)} q_\pi(S_t, A_t) \right ] \tag{expected value definition}\\ &= \mathbb{E}_\pi \left [\frac{\nabla \pi(A_t \vert S_t)}{\pi(A_t \vert S_t)} G_t \right ] \tag{differential return definition}\\ &= \mathbb{E}_\pi \left [G_t \nabla \ln \pi(A_t \vert S_t) \right ] \tag{chain rule}\\ \end{flalign}$ The expression inside the expected value can be sampled on every time 
step and the gradient is only in terms of the policy function which we have selected as something differentiable with respect to the parameters. Since this method will only be used for continuing problems, we cannot rely on Monte Carlo sampling for the differential return. Instead, our only option is to use a bootstrap value estimate in combination with a running estimate of the average reward and the immediate sample reward: $R - \overline{R} + \hat v^\prime$ where $\hat v^\prime$ is the differential value function estimate at the transition state and $\overline{R}$ is an estimate of the average reward. We can apply the existing actor-critic algorithms to these continuing problems as long as we track that additional information and use an additional step size parameter to update the average reward estimate. This step size parameter replaces the discount rate. See a full implementation below: """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$f924eb30-d1cc-4941-8fb5-ff70ad425ab9„§cell_idÙ$f924eb30-d1cc-4941-8fb5-ff70ad425ab9¤codeÚžmd""" ## 13.3 REINFORCE: Monte Carlo Policy Gradient If we replace the true action-value function in (13.5) with a learned approximation $\hat q_\pi$, then we have a method called the *all-actions* method because the update involves the sum over all actions. For the REINFORCE algorithm, we instead sample this value using the actual return and the policy distribution. 
We can re-write (13.5) using an expected value under the policy and continue from there: $\begin{flalign} \nabla J(\boldsymbol{\theta}) & \propto \mathbb{E}_\pi \left [ \gamma^t \sum_a q_\pi (S_t, a) \nabla \pi(a|S_t, \boldsymbol{\theta}) \right ] \tag{13.6}\\ &= \mathbb{E}_\pi \left [\gamma^t \sum_a \pi(a|S_t, \boldsymbol{\theta}) q_\pi (S_t, a) \frac{\nabla \pi(a|S_t, \boldsymbol{\theta})}{\pi(a|S_t, \boldsymbol{\theta})} \right ] \tag{multiply and divide by policy} \\ &= \mathbb{E}_\pi \left [ \gamma^t q_\pi (S_t, A_t) \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})} \right ] \tag{replace a with sample under policy} \\ &= \mathbb{E}_\pi \left [ \gamma^t G_t \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})} \right ] \tag{replace value with sample return} \\ \end{flalign}$ Using the expression in the brackets we can write down an update rule for the parameters that can be sampled on each time step. This is the **REINFORCE update**: $\begin{align} \boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t + \alpha \gamma^t G_t \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta}_t)}{\pi(A_t|S_t, \boldsymbol{\theta}_t)} \tag{13.8} \end{align}$ Because it uses all future rewards after step $t$, REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case. For implementation purposes we can replace $\frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta}_t)}{\pi(A_t|S_t, \boldsymbol{\theta}_t)}$ with $\nabla \ln \pi(A_t|S_t, \boldsymbol{\theta}_t)$ which is usually referred to as the *eligibility vector*. With the alternative parameterization, the eligibility vector is $\nabla \ln \pi(S_t, \boldsymbol{\theta}_t)_{A_t}$ where $\pi$ is a vector and the $A_t$ subscript takes the value of that vector at the index corresponding to the action $A_t$. 
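As a minimal sketch of the REINFORCE update (13.8) with the eligibility vector, the code below assumes a linear soft-max policy with a d × N_a parameter matrix and a dense state feature vector. The names `softmax_probs`, `eligibility` and `reinforce_update!` are hypothetical and chosen only for this illustration; they are not the notebook's own functions.

```julia
# Hypothetical helpers for illustration only.

# Numerically stable soft-max over a vector of action preferences.
function softmax_probs(h::AbstractVector)
    e = exp.(h .- maximum(h))
    return e ./ sum(e)
end

# Eligibility vector ∇ ln π(a | s, θ) for preferences h = θ' * x.
# Column j equals x * (1 - π_j) when j == a and -π_j * x otherwise.
function eligibility(θ::AbstractMatrix, x::AbstractVector, a::Integer)
    π_s = softmax_probs(θ' * x)
    ∇lnπ = -x * π_s'          # outer product, size d × num_actions
    ∇lnπ[:, a] .+= x
    return ∇lnπ
end

# One REINFORCE step: θ ← θ + α γ^t G ∇ ln π(A_t | S_t, θ).
function reinforce_update!(θ::AbstractMatrix, x::AbstractVector, a::Integer,
                           G::Real, t::Integer; α = 0.1, γ = 1.0)
    θ .+= (α * γ^t * G) .* eligibility(θ, x, a)
    return θ
end
```

Each row of the eligibility matrix sums to zero because the policy probabilities sum to 1, which gives a quick sanity check on any implementation of this gradient.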
"""¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$d83dc659-dce7-41dd-a8e7-2933ab39d15c„§cell_idÙ$d83dc659-dce7-41dd-a8e7-2933ab39d15c¤codeÚómd""" ### *REINFORCE with Baseline Implementation* These functions use two sets of parameters, one to calculate the policy function and another to calculate the state value function. The state representation vector is shared between the two functions, but the policy function will return a distribution of preferences over actions while the value function will return a single value. If linear approximation is used to estimate both functions, the the policy parameters $\boldsymbol{\theta}$ will be a $d \times N_a$ matrix where $d$ is the length of the state feature vector representation and the value function parameters $\mathbf{w}$ will be a length $d$ vector. It is also possible to mix linear and non-linear approximation with this method. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fce„§cell_idÙ$7f77d574-8f65-4e1e-8f5f-6f1bcccc3fce¤codeÙndisplay_cartpole_episode(cartpole_fcann_continuing_test_episode[1], cartpole_fcann_continuing_test_episode[2])¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$83ca0577-15d7-4448-b597-c77810b812bf„§cell_idÙ$83ca0577-15d7-4448-b597-c77810b812bf¤codeÚ¦function figure_13_2_test(Î±_list, Î±_pair_list; nruns = 100, num_episodes = 1_000, max_steps = 1_000) Random.seed!(45) function average_runs(Î±) 1:nruns |> Map(_ -> reinforce_monte_carlo_control_binary_features(corridor_mdp, get_corridor_features, 1, num_episodes, params = [0f0 3.7f0], Î± = Î±, max_steps = max_steps).episode_rewards) |> foldxt((a, b) -> a .+ b) |> v -> v ./ nruns end function average_runs(Î±_Î¸, Î±_w) 1:nruns |> Map(_ -> reinforce_with_baseline_monte_carlo_control_binary_features(corridor_mdp, get_corridor_features, 1, num_episodes, policy_params = [0f0 3.7f0], Î±_Î¸ = Î±_Î¸, Î±_w = Î±_w, max_steps = max_steps).episode_rewards) |> foldxt((a, b) -> a .+ 
b) |> v -> v ./ nruns end traces1 = [begin out = average_runs(Î±) scatter(x = 1:num_episodes, y = out, name = name = "Î± = 2^$(round(Int64, log2(Î±)))") end for Î± in Î±_list] traces2 = [begin out = average_runs(Î±s...) scatter(x = 1:num_episodes, y = out, name = name = "Î±_Î¸ = 2^$(round(Int64, log2(Î±s[1]))) and Î±_w = 2^$(round(Int64, log2(Î±s[2])))") end for Î±s in Î±_pair_list] baselinetrace = scatter(x = 1:num_episodes, y = fill(-2*sqrt(2) / (3*sqrt(2) - 4), num_episodes), name = "ideal value", line_dash = "dash", line_color = "gray") plot([baselinetrace; traces1; traces2], Layout(yaxis_range = [-90, -10], yaxis_title = "Total reward on episode
(averaged over $nruns runs)", xaxis_title = "Episode", width = 800)) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb„§cell_idÙ$a7c9ae69-f4b8-471c-ab97-90642f3c2bdb¤codeÚ/function reinforce_with_baseline_monte_carlo_control_binary_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, max_episodes::Integer; policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} setup = setup_binary_policy_arguments(mdp, get_active_features, num_features) reinforce_with_baseline_monte_carlo_control!(policy_params, setup.eligibility_vector, value_params, BinaryFeatureVector(), mdp, update_binary_action_preferences!, update_binary_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, binary_value_function, update_binary_value_gradient!, max_episodes; action_preferences = setup.action_preferences, kwargs...) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$a7dcc8cd-04ec-48f2-a387-116330eaffb2„§cell_idÙ$a7dcc8cd-04ec-48f2-a387-116330eaffb2¤codeÙgfigure_13_2_test([2f0^-13], vcat([(2f0^n, 2f0^-4) for n in -12:-10], [(2f0^n, 2f0^-2) for n in -8:-6]))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$0ab70fc3-6188-42eb-aba2-d808f319be9f„§cell_idÙ$0ab70fc3-6188-42eb-aba2-d808f319be9f¤code¸md""" # Dependencies """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$047656d1-2921-40f2-b75b-ce4a87098007„§cell_idÙ$047656d1-2921-40f2-b75b-ce4a87098007¤codeÙ1md""" ### Switched Corridor Parameter Studies """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$5d434c83-c9ca-499f-8695-c7733031c2de„§cell_idÙ$5d434c83-c9ca-499f-8695-c7733031c2de¤codeÚffunction cartpole_continuing_step(s::CartPoleState, i_a::Integer) sâ€² = cartpole_functions.step(s, cartpole_functions.discrete_actions[i_a]) if cartpole_functions.failure(sâ€²) sâ€² = cartpole_functions.initialize_state() sâ€² = CartPoleState(sâ€².x, sâ€².Î¸, sâ€².xÌ‡, sâ€².Î¸Ì‡, s.t+cartpole_functions.h) (-1f0, sâ€²) else (0f0, sâ€²) end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$3a37b53d-9174-4faa-9404-74a40c385b0a„§cell_idÙ$3a37b53d-9174-4faa-9404-74a40c385b0a¤codeÙYshow_mountaincar_trajectory(mountaincar_continuing_fcann_test.policy_sample_action, 1000)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$820752af-8966-4ee8-82f7-a40934522de5„§cell_idÙ$820752af-8966-4ee8-82f7-a40934522de5¤codeÚPtest_study2 = actor_critic_fcann_parameter_study(cartpole_continuing_mdp, cartpole_vector_update!, cartpole_fcann_feature_setup.num_features, [4, 4], LinRange(0f0, .95f0, 20), LinRange(0.0f0, .95f0, 20), [0.005f0, 0.01f0, 0.05f0], 2f0 .^ (-8:-2), 2f0 .^ (-8:-2), 100, 100_000; nruns = 40, seed = 45) |> df -> sort(df, :output; 
rev=true)¨metadataƒ©show_logsÃ¨disabledÃ®skip_as_scriptÂ«code_foldedÂÙ$6acb549a-5d90-4457-a347-d22448ad8071„§cell_idÙ$6acb549a-5d90-4457-a347-d22448ad8071¤codeÙ‚@bind cartpole_fcann_continuing_episode_step_select Slider(1:length(cartpole_fcann_continuing_test_episode[1]); show_value = true)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62„§cell_idÙ$f52fc4a9-f6dd-422d-aeae-6c327d1a7b62¤codeÚcartpole_fcann_continuing_parameter_study(layer_size::Integer, num_layers::Integer, args...; kwargs...) = actor_critic_fcann_parameter_study(cartpole_continuing_mdp, cartpole_vector_update!, cartpole_fcann_feature_setup.num_features, fill(layer_size, num_layers), args...; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728„§cell_idÙ$3bccf6fc-6e5e-4f62-ad40-1ff0a3740728¤codeÙÔactor_critic_with_eligibility_traces_binary_features(corridor_mdp, 0f0, 0f0, get_corridor_features, 1, typemax(Int64), 100_000, α_θ = 2f0 ^ -4, α_w = 2f0 ^ -10, policy_params = [0f0 3.7f0]).policy_and_value(1)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$ae0f5a96-7a4b-47f9-be1e-e803a238a071„§cell_idÙ$ae0f5a96-7a4b-47f9-be1e-e803a238a071¤codeÙ@md""" ### *MDP Types and Transitions for Continuous Actions* """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$41d62de1-2c92-41ee-9430-b9ca3007afd9„§cell_idÙ$41d62de1-2c92-41ee-9430-b9ca3007afd9¤codeÚumd""" The above matrix represents an estimate of $\Pr \{ S_k = s \mid \pi \}$; however, note that the terminal states are excluded from the rows. This corridor problem only has three non-terminal states. If we sum across each row, then we have the probability of reaching that step prior to terminating. The vector defined below measures the probability of an episode terminating prior to each step. Notably, this probability is 0 for the first three steps since no policy starting from the left can terminate that quickly. 
As expected, the probability of terminating under the random policy grows with time approaching 1. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$8eb42403-1234-4e59-993e-057cc3a6d5c9„§cell_idÙ$8eb42403-1234-4e59-993e-057cc3a6d5c9¤codeÚCif run_mountaincar_binary_episodic_param_study > 0 actor_critic_binary_episodic_parameter_study(MountainCarTask.mdp, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, mountaincar_binary_params, 5, 3, 1000; max_steps = 100_000) else md""" Waiting to run parameter study """ end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$bbc8864a-1545-433f-bc7c-0ddf6e907138„§cell_idÙ$bbc8864a-1545-433f-bc7c-0ddf6e907138¤codeÚ¥function plot_mountaincar_policy_values(policy_and_value::Function; n1 = 100, n2 = 100) xvals = LinRange(-1.2f0, 0.5f0, n1) vvals = LinRange(-0.07f0, 0.07f0, n2) values = zeros(Float32, n1, n2) action_dists = [zeros(Float32, n1, n2) for i in 1:3] for (i, x) in enumerate(xvals) for (j, v) in enumerate(vvals) Ï€, vÌ‚ = policy_and_value((x, v)) values[j, i] = vÌ‚ for k in 1:3 action_dists[k][j, i] = Ï€[k] end end end p1 = plot(heatmap(x = xvals, y = vvals, z = values), Layout(xaxis_title = "position", yaxis_title = "velocity", title = "Learned Value Function", height = 400, width = 600)) p2 = [plot(heatmap(x = xvals, y = vvals, z = action_dists[k], colorscale = "rb"), Layout(xaxis_title = "position", yaxis_title = "velocity", title = "Policy Probability for Action $k", height = 400, width = 600)) for k in 1:3] @htl("""

$p1 $p2

""") end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$a12b92d1-e045-4f92-b8cd-eee5d56fa67d„§cell_idÙ$a12b92d1-e045-4f92-b8cd-eee5d56fa67d¤codeÙ¹const best_mc_corridor = reinforce_with_baseline_monte_carlo_control_linear_features(corridor_mdp, update_corridor_features!, 1, 100; α_θ = 0.006f0, α_w = 2f0^-2, max_steps = 1_000)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$ce33f710-fd9d-4dfa-acda-40204e54d518„§cell_idÙ$ce33f710-fd9d-4dfa-acda-40204e54d518¤codeÚ¶md""" ## 13.5 Actor-Critic Methods Here we also use the value function estimator to calculate the return estimate using the one-step bootstrap return. When the state value function is used in this way we call it the *critic*. In general we can use this function with n-step returns and eligibility traces. Recall from the subject of TD learning of value functions that the one-step return is often superior to the actual return regarding variance and ease of computation, although it does introduce bias to the estimate. With the use of eligibility traces we can smoothly vary arbitrarily close to the Monte Carlo return. Note that the bias in the gradient estimate is not due to bootstrapping as such; the actor would be biased even if the critic was learned by a Monte Carlo method. The one-step actor-critic method is the analog of the one-step methods such as TD$(0)$, Sarsa$(0)$, and Q-learning. 
These methods replace the full return of REINFORCE with the one-step return as follows: $\begin{flalign} \boldsymbol{\theta}_{t+1} &\doteq \boldsymbol{\theta}_t + \alpha(G_{t:t+1} - \hat v(S_t, \mathbf{w}))\nabla\ln\pi(A_t|S_t, \boldsymbol{\theta}_t) \tag{13.12} \\ & = \boldsymbol{\theta}_t + \alpha(R_{t+1} + \gamma \hat v(S_{t+1}, \mathbf{w}) - \hat v(S_t, \mathbf{w}))\nabla\ln\pi(A_t|S_t, \boldsymbol{\theta}_t) \tag{13.13} \\ & = \boldsymbol{\theta}_t + \alpha\delta_t\nabla\ln\pi(A_t|S_t, \boldsymbol{\theta}_t) \tag{13.14} \\ \end{flalign}$ This can be implemented as a fully online algorithm because we do not have to wait until the end of an episode to calculate return estimates. The natural state-value-function learning method to pair with this is semi-gradient TD(0). See a full implementation below. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$339b4d2b-2237-46a3-9867-ecc3332856c1„§cell_idÙ$339b4d2b-2237-46a3-9867-ecc3332856c1¤codeÚ!md""" This expression repeats terms of the form $\nabla \pi(a \vert s) q_\pi(s, a)$ summed over different probabilities. The first appearance of this term is just a sum over all actions at the state $s$ which is the state we are using for the gradient expression. The next appearance of the expression is a sum over actions at state $s^\prime$. 
Let's define a new expression: $\begin{flalign} f(s) &\doteq \sum_a \nabla \pi(a \vert s) q_\pi(s, a) \\ \end{flalign}$ Then we can rewrite the second term as follows: $\gamma \sum_a \left [ \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s, a) f(s^\prime) \right ] = \gamma \sum_{s^\prime} f(s^\prime) \sum_a \left [ \pi(a \vert s) p(s^\prime \vert s, a) \right ] = \gamma \mathbb{E}_\pi [f(s^\prime) \vert s] = \gamma \sum_{s^\prime} f(s ^\prime) \Pr \{ S_1 = s^\prime \mid S_0 = s, A_0 \sim \pi(s) \}$ Define a new term $g(s) = \sum_{s^\prime} f(s^\prime) \Pr \{ S_1 = s^\prime \vert S_0 = s, A_0 \sim \pi(s) \} = \sum_{s^\prime} f(s^\prime) \sum_a [\pi(a \vert s) p(s^\prime \vert s, a)]$ So the second term can be written as $\gamma g(s)$ where the final expression uses the probability that the agent transitions from state $s$ to $s^\prime$ in one step under the policy $\pi$. Using this same logic, we can rewrite the third expression as well. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$a8349352-3242-46d5-b0d5-1b6eb5d77e90„§cell_idÙ$a8349352-3242-46d5-b0d5-1b6eb5d77e90¤codeÙ4@bind x Slider(-50:50; default = 0, show_value=true)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$7d63b960-3998-4f7b-8cbb-ccd49db9aeac„§cell_idÙ$7d63b960-3998-4f7b-8cbb-ccd49db9aeac¤codeÙ»one_step_actor_critic_binary_features(corridor_mdp, get_corridor_features, 1, typemax(Int64), 100_000, α_θ = 2f0 ^ -3, α_w = 2f0 ^ -10, policy_params = [0f0 3.7f0]).policy_and_value(1)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$65d2add6-fd6f-456c-92ed-3cd9d1862ef6„§cell_idÙ$65d2add6-fd6f-456c-92ed-3cd9d1862ef6¤codeÚPfunction update_binary_policy_params!(params::Matrix{T}, active_features::BinaryFeatures, i_a::Integer, π_dist::Vector{T}, c::T) where T<:Real @inbounds for i in eachindex(π_dist) for j in active_features params[j, i] -= c*π_dist[i] end end @inbounds for j in active_features params[j, i_a] += c end return params 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$f55afa58-962d-4551-8d95-a5b467d61adf„§cell_idÙ$f55afa58-962d-4551-8d95-a5b467d61adf¤codeÚmbegin function update_params_with_gradient!(Î¸::Matrix{T}, Î±::T, âˆ‡Î¸::BinaryGaussianEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = âˆ‡Î¸.a - âˆ‡Î¸.Î¼ c2 = âˆ‡Î¸.Ïƒ^(-2) # isnan(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing nan results" # isinf(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) Î´1 = Î±*c3 @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] Î¸[i, 1] += Î´1 end Î´2 = Î±*c4 @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] Î¸[i, 2] += Î´2 end return Î¸ end function update_params_with_gradient!(Î¸::Matrix{T}, Î±::T, âˆ‡Î¸::BinaryBetaEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = digamma(âˆ‡Î¸.Î± + âˆ‡Î¸.Î²) Î´1 = Î±*(log(âˆ‡Î¸.a) + c1 - digamma(âˆ‡Î¸.Î±)) @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] Î¸[i, 1] += Î´1 end Î´2 = Î±*(log(one(T) - âˆ‡Î¸.a) + c1 - digamma(âˆ‡Î¸.Î²)) @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] Î¸[i, 2] += Î´2 end return Î¸ end function update_params_with_gradient!(Î¸::Matrix{T}, Î±::T, âˆ‡Î¸::BinarySquashedGaussianEligibilityVector{T, T, T, B}) where {T<:Real, B<:BinaryFeatureVector} c1 = atanh(âˆ‡Î¸.a/âˆ‡Î¸.amax) - âˆ‡Î¸.Î¼ c2 = âˆ‡Î¸.Ïƒ^(-2) # isnan(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing nan results" # isinf(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) Î´1 = Î±*c3 @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = âˆ‡Î¸.binary_features.active_features[j] Î¸[i, 1] += Î´1 end Î´2 = Î±*c4 @inbounds @simd for j in 1:âˆ‡Î¸.binary_features.num_features i = 
âˆ‡Î¸.binary_features.active_features[j] Î¸[i, 2] += Î´2 end return Î¸ end function update_params_with_gradient!(Î¸::Matrix{T}, Î±::T, âˆ‡Î¸::BinaryGaussianEligibilityVector{T, NTuple{N, T}, Vector{T}, B}) where {T<:Real, N, B<:BinaryFeatureVector} for k in 1:N c1 = âˆ‡Î¸.a[k] - âˆ‡Î¸.Î¼[k] c2 = âˆ‡Î¸.Ïƒ[k] ^-2 # isnan(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing nan results" # isinf(c2) && @info "warning Ïƒ of $âˆ‡Î¸.Ïƒ is causing inf results" c3 = c1 * c2 c4 = c3*c1 - one(T) Î´1 = Î±*c3 @inbounds @simd for i in 1:size(Î¸, 1) Î¸[i, k] += Î´1 end Î´2 = Î±*c4 @inbounds @simd for i in 1:size(Î¸, 1) Î¸[i, k+N] += Î´2 end end return Î¸ end function update_params_with_gradient!(Î¸::Matrix{T}, Î±::T, âˆ‡Î¸::BinaryBetaEligibilityVector{T, NTuple{N, T}, Vector{T}, B}) where {T<:Real, N, B<:BinaryFeatureVector} for k in 1:N c1 = digamma(âˆ‡Î¸.Î±[k] + âˆ‡Î¸.Î²[k]) Î´1 = Î±*(log(âˆ‡Î¸.a[k]) + c1 - digamma(âˆ‡Î¸.Î±[k])) @inbounds @simd for i in 1:size(Î¸, 1) Î¸[i, k] += Î´1 end Î´2 = Î±*(log(one(T) - âˆ‡Î¸.a[k]) + c1 - digamma(âˆ‡Î¸.Î²[k])) @inbounds @simd for i in 1:size(Î¸, 1) Î¸[i, k+N] += Î´2 end end return Î¸ end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d9d11d69-bc16-400a-8f46-f9a8ecb8516a„§cell_idÙ$d9d11d69-bc16-400a-8f46-f9a8ecb8516a¤codeÚ,actor_critic_binary_episodic_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, params::@NamedTuple{Î»_Î¸::T, Î»_w::T, Î±_Î¸_min::Int64, Î±_w_min::Int64}, num_Î¸::Integer, num_w::Integer, num_episodes::Integer; kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_binary_episodic_parameter_study(mdp, get_active_features, num_features, params.Î»_Î¸, params.Î»_w, 2f0 .^(params.Î±_Î¸_min:params.Î±_Î¸_min+num_Î¸-1), 2f0 .^(params.Î±_w_min:params.Î±_w_min+num_w-1), num_episodes; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$ed93259c-7b8b-46d7-97fb-f194e0e04b3a„§cell_idÙ$ed93259c-7b8b-46d7-97fb-f194e0e04b3a¤codeÚŒfunction setup_binary_beta_policy_arguments(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer) where {T<:Real, S, N, A<:Union{T, NTuple{N, T}}, P, F1, F2, F3} x = BinaryFeatureVector() update_feature_vector!(x::BinaryFeatureVector, s) = update_binary_feature_vector!(x, s, get_active_features) sample_action = rand(A) action_dist_params = make_n_param_dist_params(2, sample_action) âˆ‡lnÏ€ = BinaryBetaEligibilityVector(sample_action) return (feature_vector = x, update_feature_vector! = update_feature_vector!, action_distribution_parameters = action_dist_params, eligibility_vector = âˆ‡lnÏ€) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$d1ed25e6-60c6-411f-a541-99986e5da2c5„§cell_idÙ$d1ed25e6-60c6-411f-a541-99986e5da2c5¤codeÚÿreinforce_with_baseline_monte_carlo_control_linear_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, max_episodes::Integer; policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), value_params::Vector{T} = zeros(T, num_features), x = zeros(T, num_features), action_preferences = zeros(T, length(mdp.actions)), kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = reinforce_with_baseline_monte_carlo_control!(policy_params, copy(policy_params), value_params, copy(value_params), mdp, update_linear_action_preferences!, update_linear_eligibility_vector!, x, update_feature_vector!, linear_value_function, update_linear_value_gradient!, max_episodes; action_preferences = action_preferences, kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$b966b248-fb4d-457d-90f6-114370846242„§cell_idÙ$b966b248-fb4d-457d-90f6-114370846242¤codeÙ±begin bad_continuous_action(a) = false bad_continuous_action(a::Real) = isnan(a) bad_continuous_action(a::NTuple{N, T}) where {N, T<:Real} = any(bad_continuous_action, a) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$4156d955-9daf-4429-b152-e8332980fb9e„§cell_idÙ$4156d955-9daf-4429-b152-e8332980fb9e¤codeÚ7const mountaincar_continuous_test_train_beta = actor_critic_with_eligibility_traces_binary_features_beta_actions(mountaincar_continuous_beta_mdp, 0.01f0, 0.99f0, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, typemax(Int64), 100_000; Î±_Î¸ = 1f-4, Î±_w = 0.00002f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$b09e1e48-494e-4967-826a-6e70199acad4„§cell_idÙ$b09e1e48-494e-4967-826a-6e70199acad4¤codeÙ+md""" ### Squashed Gaussian Alternative """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$734573e5-547b-4dcc-89bb-412aa6cc42d6„§cell_idÙ$734573e5-547b-4dcc-89bb-412aa6cc42d6¤codeÚfunction actor_critic_linear_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, feature_function::Function, num_features::Integer, Î»_Î¸::T, Î»_w::T, Î±_rÌ„::T, Î±_Î¸_list::AbstractVector{T}, Î±_w_list::AbstractVector{T}, max_steps::Integer; nruns::Integer = 100, seed = rand(UInt64), init_policy_params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), binary_features = false, kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} if binary_features algo = actor_critic_with_eligibility_traces_binary_features title_prefix = "Binary Feature Encoding" else algo = actor_critic_with_eligibility_traces_linear_features title_prefix = "Linear Encoding" end make_trace_data(Î±_Î¸_list, Î±_w) = [average_continuing_runs(nruns, seed, Î±_Î¸, Î±_w, Î±_rÌ„, init_policy_params, algo, mdp, Î»_Î¸, Î»_w, feature_function, num_features, max_steps; kwargs...) for Î±_Î¸ in Î±_Î¸_list] traces = [begin scatter(x = Î±_Î¸_list, y = make_trace_data(Î±_Î¸_list, Î±_w), name = "Î±_w = $Î±_w") end for Î±_w in Î±_w_list] plot(traces, Layout(xaxis_title = "Î±_Î¸", yaxis_title = "Average Reward Per Step in the First
$max_steps Steps Averaged Over $nruns Runs", xaxis_type = "log", title = "$title_prefix with $num_features Features, Î»_Î¸ = $Î»_Î¸, Î»_w = $Î»_w")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54„§cell_idÙ$97b7ce3f-6d1e-41bc-ba07-50e8516a2d54¤codeÚ¨function actor_critic_with_eligibility_traces_fcann(mdp::StateMDP{T, S, A, P, F1, F2, F3}, Î»_Î¸::T, Î»_w::T, input_length::Integer, hidden_layers::Vector{Int64}, update_feature_vector!::Function, args...; policy_params::FCANNParams = FCANN.initializeparams_saxe(input_length, hidden_layers, length(mdp.actions)), reslayers = 0, l2 = 0f0, dropout = 0f0, use_Î¼P = true, activation_list = fill(true, length(hidden_layers)), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} setup = setup_fcann_policy_and_value_arguments(policy_params, input_length, hidden_layers, reslayers, l2, dropout, use_Î¼P, activation_list) actor_critic_with_eligibility_traces!(policy_params, setup.eligibility_vector, setup.value_params, setup.value_gradient, mdp, Î»_Î¸, Î»_w, setup.update_action_preferences!, setup.update_eligibility_vector!, setup.feature_vector, update_feature_vector!, setup.value_function, setup.gradient_update, args...; kwargs...) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$692c1043-4eaf-491e-b8fe-368618867f99„§cell_idÙ$692c1043-4eaf-491e-b8fe-368618867f99¤codeÚ¡md""" 1. 
The soft-max distribution is: $\pi(a|s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}}$ We only have two possible actions in each state so the policy for action 1 would be given by: $\pi(1|S_t, \theta_t) = \frac{e^{h(S_t, 1, \theta_t)}}{e^{h(S_t, 0, \theta_t)} + e^{h(S_t, 1, \theta_t)}}$ Simplify this expression by dividing the numerator and denominator by $e^{h(S_t, 1, \theta_t)}$ which results in: $\pi(1|S_t, \theta_t) = \frac{1}{e^{h(S_t, 0, \theta_t) - h(S_t, 1, \theta_t)} + 1}$ Given the assumption that $h(s, 1, \theta)-h(s, 0, \theta) = \theta^\top\mathbf{x}(s)$, we replace the expression in the exponent resulting in the final expression of: $\pi(1|S_t, \theta_t) = \frac{1}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1}$ Using the notation $f(x) = 1/(1+e^{-x})$ we can write $\pi(1|S_t, \theta_t) = f(\theta_t^\top \mathbf{x}(S_t))$ where $f$ is the logistic function. We use this notation for the rest of the exercise. 2. The REINFORCE update is given by: $\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla\pi(A_t|S_t, \theta_t)}{\pi(A_t|S_t, \theta_t)}$, so we need to compute the gradient of the policy with respect to the parameters for this action selection: $\nabla \pi(1|S_t, \theta_t)$. Luckily, the derivative of the logistic function is simply given by: $f(x)(1-f(x))$ where $f(x)$ is the logistic function itself. In our case $x = \theta_t^\top \mathbf{x}_t$, where $\mathbf{x}_t \doteq \mathbf{x}(S_t)$, so after applying the chain rule we have: $\nabla\pi(1|S_t, \theta_t) = f(x)(1-f(x))\nabla x = f(x)(1-f(x)) \mathbf{x}_t$ since $x$ is just a linear function of the parameters. 
So for the parameter update step we have: $\frac{\nabla\pi(1|S_t, \theta_t)}{\pi(1|S_t, \theta_t)} = \frac{f(x)(1-f(x))\mathbf{x}_t}{f(x)} = (1 - f(x))\mathbf{x}_t$ Also note that: $1 - f(x) = 1 - \frac{1}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1} = \frac{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1 - 1}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1} = \frac{e^{-\theta_t^\top\mathbf{x}(S_t)}}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1}$ For $A_t = 1$, the REINFORCE update will then be: $\theta_{t+1} = \theta_t + \alpha G_t \left ( \frac{e^{-\theta_t^\top\mathbf{x}(S_t)}}{e^{-\theta_t^\top\mathbf{x}(S_t)} + 1} \right ) \mathbf{x}_t$ 3. For the general case, we want to calculate $\frac{\nabla\pi(a|s, \theta)}{\pi(a|s, \theta)}$. We already know this expression for $a = 1$. $\nabla {\pi(1|s, \mathbf{\theta})} = f(x)(1 - f(x))\mathbf{x}(s) = \pi(1|s, \mathbf{\theta})(1 - \pi(1|s, \mathbf{\theta}))\mathbf{x}(s)$ Since $\pi(a|s, \theta)$ is a probability distribution across actions, we also know that $\pi(0|s, \theta) = 1 - \pi(1|s, \theta)$ which implies that $\nabla \pi(0|s, \theta) = -\nabla \pi(1|s, \theta) = -\pi(1|s, \mathbf{\theta})(1 - \pi(1|s, \mathbf{\theta}))\mathbf{x}(s)$ We can express this in terms of $\pi(0|s, \theta)$ completely: $\nabla \pi(0|s, \theta) = (\pi(0|s, \mathbf{\theta}) - 1)\pi(0|s, \theta)\mathbf{x}(s) = -\pi(0|s, \theta)(1 - \pi(0|s, \mathbf{\theta}))\mathbf{x}(s)$ Let's now compare the two expressions for the policy gradient at each action: $\begin{align} \nabla {\pi(1|s, \mathbf{\theta})} &= \pi(1|s, \mathbf{\theta})(1 - \pi(1|s, \mathbf{\theta}))\mathbf{x}(s) \\ \nabla \pi(0|s, \theta) &= -\pi(0|s, \theta)(1 - \pi(0|s, \mathbf{\theta}))\mathbf{x}(s) \\ \therefore \\ \nabla \pi(a|s, \theta) &= \chi (a) \pi(a|s, \theta)(1 - \pi(a|s, \mathbf{\theta}))\mathbf{x}(s) \\ \end{align}$ Where $\chi (a)$ is a function that returns 1 for $a=1$ and -1 for $a=0$. There are many ways to achieve this but the following expression is simple and works: $\chi(a) = 2a - 1$. 
Dividing by the policy yields a unified expression for the eligibility vector: $\nabla \ln{\pi(a|s,\theta)} = (2a - 1) (1 - \pi(a|s, \mathbf{\theta}))\mathbf{x}(s)$ """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$2c5d221a-2469-49e1-9249-dfdc2457f2fa„§cell_idÙ$2c5d221a-2469-49e1-9249-dfdc2457f2fa¤codeÙ\@bind start_cartpole_continuing_fcann_param_study CounterButton("Run FCANN Parameter Study")¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$7c592385-e8d3-4efe-962c-d39debb64405„§cell_idÙ$7c592385-e8d3-4efe-962c-d39debb64405¤codeÙ~const mountaincar_tilecoding_setup = tile_coding_setup(mountaincar_min_vals, mountaincar_max_vals, (0.1f0, 0.1f0), 12, (1, 3))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb„§cell_idÙ$16ae3aa6-8f28-4cb0-a15f-7a96c01cdaeb¤code¼import HypertextLiteral.@htl¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÃ«code_foldedÂÙ$8eab55a5-41b7-4f5e-a02f-4c19388bc9ea„§cell_idÙ$8eab55a5-41b7-4f5e-a02f-4c19388bc9ea¤codeÚifunction update_binary_feature_vector!(x::BinaryFeatureVector, s::S, get_active_features::Function) where S active_features = get_active_features(s) l = length(x.active_features) n = 0 for (i, f) in enumerate(active_features) if i > l push!(x.active_features, f) else x.active_features[i] = f end n += 1 end x.num_features = n return x end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$0ac7ea44-14f6-4e80-80f9-d6df8059bb38„§cell_idÙ$0ac7ea44-14f6-4e80-80f9-d6df8059bb38¤codeÚfunction reinforce_monte_carlo_control!(policy_params, âˆ‡lnÏ€, mdp::StateMDP{T, S, A, PTF, F1, F2, F3}, update_action_preferences!::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, max_episodes::Integer; Î± = one(T)/10, kwargs...) 
where {T<:Real, S, A, PTF, F1, F2, F3} out = reinforce_with_baseline_monte_carlo_control!(policy_params, âˆ‡lnÏ€, nothing, nothing, mdp, update_action_preferences!, update_eligibility_vector!, x, update_feature_vector!, Returns(zero(T)), Returns(nothing), max_episodes; Î±_Î¸ = Î±, kwargs...) return (episode_rewards = out.episode_rewards, episode_steps = out.episode_steps, policy_function = out.policy_function, policy_sample_action = out.policy_sample_action, parameters = out.policy_parameters) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$5ffc271f-c73f-494a-9727-8d7516af2191„§cell_idÙ$5ffc271f-c73f-494a-9727-8d7516af2191¤codeÙ£@bind cartpole_continuing_fcann_study_params create_actor_critic_continuing_params_UI(;Î»_Î¸= 0.8f0, Î»_w = 0.15f0, Î±_rÌ„ = 0.05f0, log2Î±_Î¸ = -6, log2Î±_w = -5)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$c5a2879c-e89b-47f7-bbd6-48200d7e89e3„§cell_idÙ$c5a2879c-e89b-47f7-bbd6-48200d7e89e3¤codeÚ^actor_critic_binary_episodic_squashed_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, args...; kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_binary_episodic_squashed_gaussian_parameter_study(mdp, one(T), get_active_features, num_features, args...; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$537270ba-122b-4f2b-880b-31d086766295„§cell_idÙ$537270ba-122b-4f2b-880b-31d086766295¤codeÚ·begin #the following struct represents a problem for which both the state and action space can take arbitrary values struct ContinuousMDP{T<:Real, S, A, P<:AbstractContinuousTransition{T, S, A, F} where F<:Function, StateInit<:Function, IsTerm<:Function, ValidAction <: Function} <: AbstractMDP{T, S, A, P, StateInit} ptf::P initialize_state::StateInit #function which provides an initial state index isterm::IsTerm #function that returns true if a state is terminal and false otherwise is_valid_action::ValidAction #is_valid_action(s, a) returns true if the action a is valid to take from state. by default every action is assumed to be available ContinuousMDP(ptf::P, initialize_state::F1, isterm::F2, is_valid_action::F3) where {T<:Real, S, A, F<:Function, P<:AbstractContinuousTransition{T, S, A, F}, F1<:Function, F2<:Function, F3<:Function} = new{T, S, A, P, F1, F2, F3}(ptf, initialize_state, isterm, is_valid_action) end ContinuousMDP(ptf::AbstractContinuousTransition{T, S, A, F}, initialize_state::StateInit; isterm::Function = Returns(false), is_valid_action::Function = Returns(true)) where {T<:Real, S, A, F<:Function, StateInit<:Function} = ContinuousMDP(ptf, initialize_state, isterm, is_valid_action) function ContinuousMDP(step::Function, initialize_state::Function, a::A; kwargs...) where A s0 = initialize_state() ptf = ContinuousMDPTransitionSampler(step, s0, a) ContinuousMDP(ptf, initialize_state; kwargs...) 
end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$dc2efc6c-8da8-425b-aa5f-290949109565„§cell_idÙ$dc2efc6c-8da8-425b-aa5f-290949109565¤codeÙGplot_mountaincar_policy_values(mountaincar_test_train.policy_and_value)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$a019925a-460a-410e-a54b-50a4cfe0e90e„§cell_idÙ$a019925a-460a-410e-a54b-50a4cfe0e90e¤codeÚplot(scatter(x = 1 .- LinRange(0.01, 0.99, 100), y = -[get_corridor_episode_stats(p) for p in 1 .- LinRange(0.01, 0.99, 100)]), Layout(xaxis_title = "probability of right action", yaxis_title = "sample mean value of starting state", width = 800, yaxis_range = [-60, -10]))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$f92bb265-4b19-4f0e-a698-d7547bb6dd41„§cell_idÙ$f92bb265-4b19-4f0e-a698-d7547bb6dd41¤codeÙ§mutable struct BinaryFeatureVector{I <: Integer} active_features::Vector{I} num_features::I function BinaryFeatureVector() new{Int64}(Vector{Int64}(), 0) end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$ac9c8845-284d-4c21-b05d-d930f86598a3„§cell_idÙ$ac9c8845-284d-4c21-b05d-d930f86598a3¤codeÙb@bind run_mountaincar_binary_episodic_countinuous_param_study CounterButton("Run Parameter Study")¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$192cc1cf-9ea1-492d-baa7-f2e197abecd4„§cell_idÙ$192cc1cf-9ea1-492d-baa7-f2e197abecd4¤codeÙV@bind run_mountaincar_binary_episodic_param_study CounterButton("Run Parameter Study")¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547d„§cell_idÙ$a4eec4d3-5a75-4b52-ab9c-9d9e83d5547d¤codeÙ6@bind ep_step Slider(1:length(ep[1]), show_value=true)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$c8b47eac-2d45-419a-bec6-2ae0cdc59393„§cell_idÙ$c8b47eac-2d45-419a-bec6-2ae0cdc59393¤codeÚhbegin #represents a transition where the state must be referenced directly instead of through a tabular index abstract type AbstractContinuousTransition{T<:Real, S, A, F<:Function} <: 
AbstractTransition{T, 2} end struct ContinuousMDPTransitionSampler{T <: Real, S, A, F <: Function} <: AbstractContinuousTransition{T, S, A, F} step::F function ContinuousMDPTransitionSampler(step::F, s::S, a::A) where {F<:Function, S, A} (r, sâ€²) = step(s, a) @assert promote_type(S, typeof(sâ€²)) != Any "There is no common type between the provided state $s and the transition state $sâ€²" new{typeof(r), promote_type(S, typeof(sâ€²)), A, F}(step) end end #when used as a functor just apply the step function to the state action pair indices (ptf::ContinuousMDPTransitionSampler{T, S, A, F})(s::S, a::A) where {T<:Real, S, A, F<:Function} = ptf.step(s, a) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$36a6e43f-6bcf-4c27-bfbb-047760e77ada„§cell_idÙ$36a6e43f-6bcf-4c27-bfbb-047760e77ada¤codeÚ md""" # Chapter 13 Policy Gradient Methods Introduction Instead of selecting actions based on *action-value estimates* we learn a *parameterized policy* with parameters $\boldsymbol{\theta}$. $\pi(a|s, \boldsymbol{\theta}) = \text{Pr}\{A_t=a|S_t=s, \boldsymbol{\theta}_t=\boldsymbol{\theta}\}$ denotes the probability that action *a* is taken at time *t* given that the environment is in state *s* at time *t* with parameter $\boldsymbol{\theta}$. We consider methods that improve the policy parameter using the gradient of some scalar performance measure $J(\boldsymbol{\theta})$ with respect to the policy parameters. We follow gradient ascent since we are trying to maximize this value, and methods that use this approach are called *policy gradient methods*. Methods that learn approximations to both policy and value functions are often called *actor-critic methods*, where 'actor' is a reference to the learned policy, and 'critic' refers to the learned value function, usually a state-value function. 
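As a minimal sketch of what a parameterized policy and one gradient-ascent step can look like, the block below assumes a soft-max policy with linear action preferences stored in a `num_features × num_actions` matrix. The names `softmax_policy` and `softmax_eligibility`, the step size, and the return value of 1 are hypothetical and chosen only for illustration; they are not functions defined in this notebook.

```julia
# Hypothetical helpers for illustration only; names are not from this notebook.

# Soft-max policy with linear preferences h(s, a) = theta[:, a]' * x(s)
function softmax_policy(theta::AbstractMatrix, x::AbstractVector)
    h = theta' * x              # one preference per action
    h = h .- maximum(h)         # shift preferences for numerical stability
    p = exp.(h)
    return p ./ sum(p)
end

# Gradient of ln pi(a|s, theta) with respect to theta for the policy above
function softmax_eligibility(theta::AbstractMatrix, x::AbstractVector, a::Integer)
    p = softmax_policy(theta, x)
    g = -x * p'                 # column b receives -pi(b|s) * x
    g[:, a] .+= x               # the selected action receives an additional +x
    return g
end

theta = zeros(2, 2)             # assumed: 2 features, 2 actions
x = [1.0, 0.0]                  # assumed feature vector for some state
p_before = softmax_policy(theta, x)

alpha = 0.1                     # assumed step size
G = 1.0                         # assumed positive return after taking action 1
theta .+= alpha .* G .* softmax_eligibility(theta, x, 1)
p_after = softmax_policy(theta, x)
```

With a positive return, the ascent step raises the probability of the action that was taken, which is the behavior the policy gradient update is meant to produce.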
## 13.1 Policy Approximation and its Advantages """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$436c52d2-280b-4ca4-9360-d6587b8254c7„§cell_idÙ$436c52d2-280b-4ca4-9360-d6587b8254c7¤codeÚ~md""" In order to test this algorithm we need to use a continuing task, that is, one that lacks a terminal state. We could simply modify the corridor MDP to be a continuing task by altering the reward structure so a reward of 1 is received upon moving to the right from state 3 after which the state is reset to 1. See below for a version of this MDP updated to be a continuing problem. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$e96d592d-1e54-486d-8ad9-b857f85476e8„§cell_idÙ$e96d592d-1e54-486d-8ad9-b857f85476e8¤codeÚ.actor_critic_linear_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, params::@NamedTuple{Î»_Î¸::T, Î»_w::T, Î±_rÌ„::T, Î±_Î¸_min::Int64, Î±_w_min::Int64}, num_Î¸::Integer, num_w::Integer, max_steps::Integer; kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_linear_parameter_study(mdp, get_active_features, num_features, params.Î»_Î¸, params.Î»_w, params.Î±_rÌ„, 2f0 .^(params.Î±_Î¸_min:params.Î±_Î¸_min+num_Î¸-1), 2f0 .^(params.Î±_w_min:params.Î±_w_min+num_w-1), max_steps; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6c„§cell_idÙ$5583ae6d-f6fa-47ba-aab4-cb6a4f32cb6c¤codeÙLcorridor_parameter_studies(2f0 .^ (-15:-8), 2f0 .^ (-35:5:-15); nruns = 100)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$4da20fd7-b897-4f26-bf2a-f08d66ddf90f„§cell_idÙ$4da20fd7-b897-4f26-bf2a-f08d66ddf90f¤codeÚ b#version of reinforce for general function approximation function actor_critic_with_eligibility_traces!(policy_params::P1, âˆ‡lnÏ€, value_params::P2, âˆ‡vÌ‚, mdp::ContinuousMDP{T, S, A, PTF, F1, F2, F3}, Î»_Î¸::T, Î»_w::T, update_action_distribution!::Function, action_dist_params::Vector{T}, action_sampler::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, value_function::Function, update_value_gradient!::Function, max_steps::Integer; Î±_w::T = one(T)/10, Î±_Î¸::T = one(T)/10, Î±_rÌ„ = one(T)/10, z_Î¸::P1 = deepcopy(policy_params), z_w::P2 = deepcopy(value_params), save_step_rewards = false) where {P1, P2, T<:Real, S, A, PTF, F1, F2, F3} step_rewards = Vector{T}() #initialize variables step = 1 rtot = zero(T) rÌ„ = zero(T) c = one(T) zero_params!(z_Î¸) zero_params!(z_w) s = mdp.initialize_state() update_feature_vector!(x, s) while step <= max_steps update_value_gradient!(âˆ‡vÌ‚, x, value_params) vÌ‚ = value_function(x, value_params) update_action_distribution!(action_dist_params, x, policy_params) a = action_sampler(action_dist_params) if bad_continuous_action(a) @info "terminating after $step steps due to invalid continuous action $a taken in state $s with action distribution parameters $action_dist_params" push!(episode_steps, max_steps) push!(episode_rewards, typemin(T)) break end 
update_eligibility_vector!(âˆ‡lnÏ€, action_dist_params, x, a, policy_params) (r, sâ€²) = mdp.ptf(s, a) rtot += r save_step_rewards && push!(step_rewards, r) step += 1 mdp.isterm(sâ€²) && error("$sâ€² is a terminal state and this method only applies to continuing tasks") update_feature_vector!(x, sâ€²) vÌ‚â€² = value_function(x, value_params) Î´ = r - rÌ„ + vÌ‚â€² - vÌ‚ rÌ„ += Î±_rÌ„*Î´ update_traces_with_gradient!(Î³*Î»_w, z_w, âˆ‡vÌ‚) update_traces_with_gradient!(Î³*Î»_Î¸, z_Î¸, c, âˆ‡lnÏ€) update_params_with_gradient!(value_params, Î±_w*Î´, z_w) update_params_with_gradient!(policy_params, Î±_Î¸*c*Î´, z_Î¸) s = sâ€² end function_outputs = form_state_and_policy_function_outputs(update_feature_vector!, update_action_distribution!, action_dist_params, action_sampler, value_function, x, policy_params, value_params) return (;step_rewards = step_rewards, total_reward = rtot, policy_parameters = policy_params, value_parameters = value_params, function_outputs...) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$11ea640c-3981-404d-87c6-4d3d0708a2b8„§cell_idÙ$11ea640c-3981-404d-87c6-4d3d0708a2b8¤codeÚ†function actor_critic_linear_episodic_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, Î»_Î¸::T, Î»_w::T, Î±_Î¸_list::AbstractVector{T}, Î±_w_list::AbstractVector{T}, max_episodes::Integer; nruns = 100, max_steps::Integer = 10_000, seed = rand(UInt64), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) function average_runs(Î±_Î¸, Î±_w) 1:nruns |> Map(_ -> actor_critic_with_eligibility_traces_linear_features(mdp, Î»_Î¸, Î»_w, update_feature_vector!, num_features, max_episodes, max_steps; Î±_Î¸ = Î±_Î¸, Î±_w = Î±_w, kwargs...) |> x -> isempty(x.episode_rewards) ? 
-T(Inf) : sum(x.episode_rewards) / length(x.episode_rewards)) |> foldxt(+) |> x -> x / nruns end traces = [begin scatter(x = Î±_Î¸_list, y = average_runs.(Î±_Î¸_list, Î±_w), name = "Î±_w = $Î±_w") end for Î±_w in Î±_w_list] plot(traces, Layout(xaxis_title = "Î±_Î¸", yaxis_title = "Average Reward Per Episode in the First
$max_episodes Episodes Averaged Over $nruns Runs", xaxis_type = "log", title = "Linear Feature Encoding with $num_features Features, Î»_Î¸ = $Î»_Î¸, Î»_w = $Î»_w")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$281360af-46bf-4c73-bf11-3cb1153ad3e2„§cell_idÙ$281360af-46bf-4c73-bf11-3cb1153ad3e2¤codeÙUcartpole_tilecoding_reinforce_parameter_study(2f0 .^ (-12:-7), 2f0 .^ (-13:-10), 100)¨metadataƒ©show_logsÃ¨disabledÃ®skip_as_scriptÂ«code_foldedÂÙ$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690c„§cell_idÙ$9ae58dd6-3cde-4943-9ac1-bd9d4f7d690c¤codeÚ%begin function update_squashed_gaussian_eligibility_vector!(âˆ‡lnÏ€::BinarySquashedGaussianEligibilityVector{T, T, T, B}, dist_params::Vector{T}, x::B, action::T, policy_params::Matrix{T}) where {T<:Real, B<:BinaryFeatureVector} âˆ‡lnÏ€.binary_features = x âˆ‡lnÏ€.a = action âˆ‡lnÏ€.Î¼ = first(dist_params) âˆ‡lnÏ€.Ïƒ = exp(last(dist_params)) return âˆ‡lnÏ€ end function update_squashed_gaussian_eligibility_vector!(âˆ‡lnÏ€::BinarySquashedGaussianEligibilityVector{T, NTuple{N, T}, Vector{T}, B}, dist_params::Vector{T}, x::B, action::NTuple{N, T}, policy_params::Matrix{T}) where {T<:Real, N, B<:BinaryFeatureVector} âˆ‡lnÏ€.binary_features = x âˆ‡lnÏ€.a = action for k in 1:N âˆ‡lnÏ€.Î¼[k] = dist_params[k] âˆ‡lnÏ€.Ïƒ[k] = exp(dist_params[k+N]) end return âˆ‡lnÏ€ end end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$da3cb392-78f2-48b2-b0dc-5f016664798c„§cell_idÙ$da3cb392-78f2-48b2-b0dc-5f016664798c¤codeÙXshow_mountaincar_trajectory(mountaincar_continuing_tile_test.policy_sample_action, 1000)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$dca2f8e2-76af-4679-bf81-3824c15fc76d„§cell_idÙ$dca2f8e2-76af-4679-bf81-3824c15fc76d¤codeÚ const reinforce_test3 = actor_critic_with_eligibility_traces_binary_features(cartpole_setup.mdps.episodic.discrete, 0.85f0, 0.5f0, cartpole_setup.get_active_features, cartpole_setup.num_features, typemax(Int64), 100_000; Î±_Î¸ = 2f0 ^-6, Î±_w = 2f0 ^-4, Î³ = 
0.99f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$8019bec9-1228-407b-9199-2fe29f26a981„§cell_idÙ$8019bec9-1228-407b-9199-2fe29f26a981¤codeÚ md""" > ### *Exercise 13.1* > Use your knowledge of the gridworld and its dynamics to determine an *exact* symbolic expression for the optimal probability of selecting the right action in Example 13.1 Example 13.1 is a gridworld with 3 non-terminal states and a terminal state at the far right. The reward is -1 per step. States 1 and 3 have actions left/right that move in the expected directions but state 2 reverses the directions. We use a performance measure $J(\mathbf{\theta}) = v_{\pi_\theta}(S)$. Given our feature representations of $\mathbf{x}(s, \text{right}) = [1, 0]^{\top}$ and $\mathbf{x}(s, \text{left}) = [0, 1]^{\top}$, we can only learn policies that are stochastic in terms of left/right action selection but do not vary between states. Also observe that due to probability constraints $p_{\text{right}} = 1 - p_{\text{left}}$. For simplicity, we will use the notation $p \doteq p_{\text{left}}$ and the following for the three state values: $v_1, v_2, v_3$. 
$\begin{flalign} v_1 &= p \times v_1 + (1-p) \times v_2 - 1 \tag{1} \\ v_1 (1-p) &= v_2 (1-p) - 1 \\ v_1 &= v_2 - \frac{1}{1-p} \tag{1â€²}\\ v_2 &= p \times v_3 + (1-p) \times v_1 - 1 \tag{2} \\ v_3 &= p \times v_2 - 1 \tag{3}\\ v_2 &= p \times [p\times v_2 - 1] +(1-p) \times v_1 - 1 \tag{substituting 3 into 2} \\ v_2(1 - p^2) &= -p +(1-p) \times v_1 - 1 \\ v_2 &= \frac{(1-p) v_1 - (1+p)}{(1+p)(1-p)} \tag{collecting terms} \\ &= \frac{(1-p) v_2 - 1 - (1+p)}{(1+p)(1-p)} \tag{using 1â€²} \\ &= \frac{v_2}{1+p} - \frac{2 + p}{(1+p)(1-p)} \\ v_2 \left [1 - \frac{1}{1+p} \right ] &= - \frac{2 + p}{(1+p)(1-p)} \\ v_2 \frac{1+p-1}{1+p} &= - \frac{2 + p}{(1+p)(1-p)} \\ v_2 &= - \frac{2 + p}{(1-p)p} \\ v_1 &= - \frac{2 + p}{(1-p)p} - \frac{1}{1-p} \\ &= \frac{-2 - p - p}{(1-p)p} \\ &= -\frac{2 + 2p}{(1-p)p} \\ v_3 &= -\frac{2 + p}{1-p} - 1\\ &= \frac{-2 - p - 1 + p}{1-p}\\ &= -\frac{3}{1-p}\\ \end{flalign}$ To summarize all the state values: $\begin{flalign} v_1 &= -\frac{2 + 2p}{(1-p)p} \\ v_2 &= - \frac{2 + p}{(1-p)p} \\ v_3 &= -\frac{3}{1-p} \end{flalign}$ """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$fd964539-2baf-4ff1-b286-5a0bb1b222c4„§cell_idÙ$fd964539-2baf-4ff1-b286-5a0bb1b222c4¤codeÚmd""" The beta distribution has two parameters like the normal distribution but is only defined from 0 to 1. The two parameters $\alpha$ and $\beta$ are positive real numbers and control the shape of the distribution. The density function is given below: $f(x; \alpha, \beta) = \frac{x^{\alpha-1} (1-x)^{\beta - 1}}{\text{B}(\alpha, \beta)}$ where $\text{B}(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}$ and $\Gamma(z) = \int_0^\infty t^{z-1}e^{-t} \text{d} t$ We saw earlier from the treatment of the gaussian distribution that we need to find the gradient of a function of each distribution parameter with respect to the parameters of the function approximation. 
Luckily, the maximum likelihood estimator already computes the gradient we are interested in for this distribution. Note that the log-likelihood function for a single sample $X$ of a random variable which follows the beta distribution is given by $\mathcal{L}(\alpha, \beta \vert X) = \ln(f(X; \alpha, \beta))$ and the partial derivative of this function with respect to each parameter $\alpha$ and $\beta$ is given by: $\frac{\partial \mathcal{L}(\alpha, \beta \vert X)}{\partial \alpha} = \ln X - \frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \alpha}$ $\frac{\partial \mathcal{L}(\alpha, \beta \vert X)}{\partial \beta} = \ln (1-X) - \frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \beta}$ where $\frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \alpha} = -\psi(\alpha + \beta) + \psi(\alpha)$ and $\frac{\partial \ln \text{B}(\alpha, \beta)}{\partial \beta} = -\psi(\alpha + \beta) + \psi(\beta)$ and $\psi$ is the digamma function, which is just the derivative of the logarithm of the gamma function. Since both $\alpha$ and $\beta$ must be greater than zero, we can estimate each one by applying the exponential function to a dot product of a parameter vector with the feature vector: $\alpha(s, \boldsymbol{\theta}) \doteq \exp \left (\boldsymbol{\theta}_\alpha^\top \mathbf{x}(s) \right )$ and $\beta(s, \boldsymbol{\theta}) \doteq \exp \left (\boldsymbol{\theta}_\beta^\top \mathbf{x}(s) \right )$. The eligibility vector for this distribution is then: $\nabla \ln f(a \vert \alpha(s, \boldsymbol{\theta}_\alpha), \beta(s, \boldsymbol{\theta}_\beta))$ where $\alpha$ is a function of one parameter vector and $\beta$ is a function of the other. The gradient components corresponding to each parameter vector depend only on the partial derivative of the log density with respect to $\alpha$ or $\beta$, respectively. 
That is, since $\frac{\partial \alpha}{\partial \theta_{\beta_i}} = 0 \forall i$ and vice versa, we can treat each part of the gradient separately. $\begin{flalign} \nabla_{\boldsymbol{\theta}_\alpha} \ln f(a \vert \alpha, \beta) &= \frac{\partial \ln f(a \vert \alpha, \beta)}{\partial \alpha} \nabla_{\boldsymbol{\theta}_\alpha}\alpha \\ &= \left ( \ln a + \psi(\alpha + \beta) - \psi(\alpha) \right ) \nabla_{\boldsymbol{\theta}_\alpha} \alpha \\ &= \left ( \ln a + \psi(\alpha + \beta) - \psi(\alpha) \right ) \nabla_{\boldsymbol{\theta}_\alpha} \exp \left ( \boldsymbol{\theta}_\alpha^\top \mathbf{x}(s) \right ) \\ &= \left ( \ln a + \psi(\alpha + \beta) - \psi(\alpha) \right ) \alpha \mathbf{x}(s)\\ \end{flalign}$ $\begin{flalign} \nabla_{\boldsymbol{\theta}_\beta} \ln f(a \vert \alpha, \beta) &= \frac{\partial \ln f(a \vert \alpha, \beta)}{\partial \beta} \nabla_{\boldsymbol{\theta}_\beta}\beta \\ &= \left ( \ln (1-a) + \psi(\alpha + \beta) - \psi(\beta) \right ) \nabla_{\boldsymbol{\theta}_\beta} \beta \\ &= \left ( \ln (1-a) + \psi(\alpha + \beta) - \psi(\beta) \right ) \nabla_{\boldsymbol{\theta}_\beta} \exp \left ( \boldsymbol{\theta}_\beta^\top \mathbf{x}(s) \right ) \\ &= \left ( \ln (1-a) + \psi(\alpha + \beta) - \psi(\beta) \right ) \beta \mathbf{x}(s)\\ \end{flalign}$ """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$5720e942-d3f8-4329-83a8-8bcedf078b6a„§cell_idÙ$5720e942-d3f8-4329-83a8-8bcedf078b6a¤codeÙ”reinforce_monte_carlo_control_linear_features(corridor_mdp, update_corridor_features!, 1, 1_000; Î± = 2f0^-14, max_steps = 1_000).policy_function(1)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$62e677ac-2070-4f6b-9df2-90849d89fa9f„§cell_idÙ$62e677ac-2070-4f6b-9df2-90849d89fa9f¤codeÙQconst corridor_terminal_probabilities = 1 .- sum(corridor_state_counts, dims = 2)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$11b9beea-b0cd-45eb-84c6-151728894df0„§cell_idÙ$11b9beea-b0cd-45eb-84c6-151728894df0¤codeÚfunction 
form_state_and_policy_function_outputs(update_feature_vector!::Function, update_action_distribution!::Function, action_dist_params::Vector{T}, action_sampler::Function, value_function::Function, feature_vector, policy_params, value_params) where T<:Real Ï€! = form_state_continuous_policy_function(update_feature_vector!, update_action_distribution!) Ï€(s) = Ï€!(feature_vector, action_dist_params, s, policy_params) Ï€_sample(s) = action_sampler(Ï€(s)) v! = form_state_value_function(update_feature_vector!, value_function) estimate_state_value(s; x = deepcopy(feature_vector)) = v!(x, s, value_params) function policy_and_value(s; x = deepcopy(feature_vector), action_dist_params = copy(action_dist_params)) Ï€!(x, action_dist_params, s, policy_params) vÌ‚ = value_function(x, value_params) return (action_distribution_parameters = action_dist_params, state_value_estimate = vÌ‚) end (policy_function = Ï€, policy_sample_action = Ï€_sample, estimate_state_value = estimate_state_value, policy_and_value = policy_and_value) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290„§cell_idÙ$961f02ee-a6e5-4fe8-b1d2-eb3f8824d290¤codeÚxfunction reinforce_monte_carlo_control_binary_features(mdp::StateMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, max_episodes::Integer; params::Matrix{T} = zeros(T, num_features, length(mdp.actions)), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} setup = setup_binary_policy_arguments(mdp, get_active_features, num_features) reinforce_monte_carlo_control!(params, setup.eligibility_vector, mdp, update_binary_action_preferences!, update_binary_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, max_episodes; action_preferences = setup.action_preferences, kwargs...) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$55ba8725-0ddf-4196-a41d-3f3c490a8d84„§cell_idÙ$55ba8725-0ddf-4196-a41d-3f3c490a8d84¤codeÚofunction actor_critic_binary_episodic_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, Î»_Î¸::T, Î»_w::T, Î±_Î¸_list::AbstractVector{T}, Î±_w_list::AbstractVector{T}, max_episodes::Integer; nruns::Integer = 100, max_steps::Integer = 10_000, seed = rand(UInt64), init_policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), init_value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) function average_runs(Î±_Î¸, Î±_w) 1:nruns |> Map(_ -> actor_critic_with_eligibility_traces_binary_features_gaussian_actions(mdp, Î»_Î¸, Î»_w, get_active_features, num_features, max_episodes, max_steps; Î±_Î¸ = Î±_Î¸, Î±_w = Î±_w, policy_params = copy(init_policy_params), value_params = copy(init_value_params), kwargs...) |> x -> isempty(x.episode_rewards) ? -T(Inf) : mean(x.episode_rewards)) |> foldxt(+) |> x -> x / nruns end traces = [begin scatter(x = Î±_Î¸_list, y = average_runs.(Î±_Î¸_list, Î±_w), name = "Î±_w = $Î±_w") end for Î±_w in Î±_w_list] plot(traces, Layout(xaxis_title = "Î±_Î¸", yaxis_title = "Average Reward Per Episode in the First
$max_episodes Episodes Averaged Over $nruns Runs", xaxis_type = "log", title = "Binary Feature Encoding with $num_features Features, Î»_Î¸ = $Î»_Î¸, Î»_w = $Î»_w")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$a540814a-57a1-4b98-9443-59e401425444„§cell_idÙ$a540814a-57a1-4b98-9443-59e401425444¤codeÙûfunction binary_value_function(binary_features::BinaryFeatureVector, params::Vector{T})::T where T<:Real v = zero(T) @inbounds @simd for i in 1:binary_features.num_features j = binary_features.active_features[i] v += params[j] end return v end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$1b102220-6d78-480d-a77f-0e57bad23dca„§cell_idÙ$1b102220-6d78-480d-a77f-0e57bad23dca¤codeÚ!cartpole_binary_continuing_parameter_study(args...; kwargs...) = actor_critic_linear_parameter_study(cartpole_continuing_mdp, s -> cartpole_tilecoding_setup.get_active_features((s.x, s.Î¸, s.xÌ‡, s.Î¸Ì‡)), cartpole_tilecoding_setup.num_features, binary_features = true, args...; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$4d4ae57b-afc3-44f9-b6fc-892f59f82921„§cell_idÙ$4d4ae57b-afc3-44f9-b6fc-892f59f82921¤codeÚ º#version of reinforce for general function approximation function one_step_actor_critic!(policy_params, âˆ‡lnÏ€, value_params, âˆ‡vÌ‚, mdp::StateMDP{T, S, A, PTF, F1, F2, F3}, update_action_preferences!::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, value_function::Function, update_value_gradient!::Function, max_episodes::Integer, max_steps::Integer; Î±_w::T = one(T)/10, Î±_Î¸::T = one(T)/10, Î³::T = one(T), action_preferences = zeros(T, length(mdp.actions)), save_episode_steps = false) where {T<:Real, S, A, PTF, F1, F2, F3} step_rewards = Vector{T}() episode_steps = Vector{Int64}() episode_rewards = Vector{T}() #initialize variables ep = 1 step = 1 rtot = zero(T) c = one(T) s = mdp.initialize_state() update_feature_vector!(x, s) # @info "initial value params: $value_params" while (ep <= 
max_episodes) && (step <= max_steps) update_value_gradient!(âˆ‡vÌ‚, x, value_params) vÌ‚ = value_function(x, value_params) update_action_preferences!(action_preferences, x, policy_params) soft_max!(action_preferences) i_a = sample_action(action_preferences) update_eligibility_vector!(âˆ‡lnÏ€, action_preferences, x, i_a, policy_params) (r, sâ€²) = mdp.ptf(s, i_a) rtot += r save_episode_steps && push!(step_rewards, r) step += 1 if mdp.isterm(sâ€²) push!(episode_steps, step) push!(episode_rewards, rtot) vÌ‚â€² = zero(T) ep += 1 rtot = zero(T) c = one(T) s = mdp.initialize_state() update_feature_vector!(x, s) else update_feature_vector!(x, sâ€²) vÌ‚â€² = value_function(x, value_params) s = sâ€² c *= Î³ end Î´ = r + Î³*vÌ‚â€² - vÌ‚ # @info "About to update value params with gradient $âˆ‡vÌ‚ and constant $(Î±_w * Î´)" update_params_with_gradient!(value_params, Î±_w*Î´, âˆ‡vÌ‚) # @info "About to update policy params with eligibility vector $âˆ‡lnÏ€ and constant $(Î±_Î¸*c*Î´)" update_params_with_gradient!(policy_params, Î±_Î¸*c*Î´, âˆ‡lnÏ€) # @info "policy params after $step updates: $policy_params" # @info "value params after $step updates: $value_params" end function_outputs = form_state_and_policy_function_outputs(update_feature_vector!, update_action_preferences!, value_function, x, action_preferences, policy_params, value_params) return (;step_rewards = step_rewards, episode_steps = episode_steps, episode_rewards = episode_rewards, policy_parameters = policy_params, value_parameters = value_params, function_outputs...) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$61949faa-8174-4b7b-8fbc-01d5f850b419„§cell_idÙ$61949faa-8174-4b7b-8fbc-01d5f850b419¤codeÚ:function actor_critic_binary_continuing_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, get_active_features::Function, num_features::Integer, Î»_Î¸::T, Î»_w::T, Î±_Î¸_list::AbstractVector{T}, Î±_w_list::AbstractVector{T}, Î±_rÌ„::T, max_steps::Integer; nruns::Integer = 100, seed = rand(UInt64), init_policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), init_value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) function average_runs(Î±_Î¸, Î±_w) 1:nruns |> Map(_ -> actor_critic_with_eligibility_traces_binary_features_gaussian_actions(mdp, Î»_Î¸, Î»_w, get_active_features, num_features, max_steps; Î±_Î¸ = Î±_Î¸, Î±_w = Î±_w, Î±_rÌ„ = Î±_rÌ„, policy_params = copy(init_policy_params), value_params = copy(init_value_params), kwargs...).total_reward) |> foldxt(+) |> x -> x / nruns / max_steps end traces = [begin scatter(x = Î±_Î¸_list, y = average_runs.(Î±_Î¸_list, Î±_w), name = "Î±_w = $Î±_w") end for Î±_w in Î±_w_list] plot(traces, Layout(xaxis_title = "Î±_Î¸", yaxis_title = "Average Reward Per Step in the First
$max_steps Steps Averaged Over $nruns Runs", xaxis_type = "log", title = "Binary Feature Encoding with $num_features Features, Î»_Î¸ = $Î»_Î¸, Î»_w = $Î»_w, Î±_rÌ„ = $Î±_rÌ„")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$5b15f5c9-80bf-47f0-898a-f8dead5b927c„§cell_idÙ$5b15f5c9-80bf-47f0-898a-f8dead5b927c¤codeÚìmd""" ### *Continuing Case Actor-Critic Implementation* Note that this function has the same name as the episodic version. The only difference other than keyword arguments is that the `max_episodes` argument is missing. Since we already defined the versions of the algorithm for linear and non-linear cases in a generic manner, we only need to define the core version of this algorithm and the other functions will dispatch to it if they are called without the `max_episodes` argument. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$266d2234-26c8-43f1-9e75-49440a230ed6„§cell_idÙ$266d2234-26c8-43f1-9e75-49440a230ed6¤codeÚ …#version of reinforce for general function approximation function actor_critic_with_eligibility_traces!(policy_params::P1, âˆ‡lnÏ€, value_params::P2, âˆ‡vÌ‚, mdp::StateMDP{T, S, A, PTF, F1, F2, F3}, Î»_Î¸::T, Î»_w::T, update_action_preferences!::Function, update_eligibility_vector!::Function, x, update_feature_vector!::Function, value_function::Function, update_value_gradient!::Function, max_episodes::Integer, max_steps::Integer; Î±_w::T = one(T)/10, Î±_Î¸::T = one(T)/10, Î³::T = one(T), action_preferences = zeros(T, length(mdp.actions)), z_Î¸::P1 = deepcopy(policy_params), z_w::P2 = deepcopy(value_params), save_step_rewards = false) where {P1, P2, T<:Real, S, A, PTF, F1, F2, F3} step_rewards = Vector{T}() episode_steps = Vector{Int64}() episode_rewards = Vector{T}() #initialize variables ep = 1 step = 1 rtot = zero(T) c = one(T) zero_params!(z_Î¸) zero_params!(z_w) s = mdp.initialize_state() update_feature_vector!(x, s) while (ep <= max_episodes) && (step <= max_steps) update_value_gradient!(âˆ‡vÌ‚, x, 
value_params) vÌ‚ = value_function(x, value_params) update_action_preferences!(action_preferences, x, policy_params) soft_max!(action_preferences) i_a = sample_action(action_preferences) update_eligibility_vector!(âˆ‡lnÏ€, action_preferences, x, i_a, policy_params) (r, sâ€²) = mdp.ptf(s, i_a) rtot += r save_step_rewards && push!(step_rewards, r) step += 1 if mdp.isterm(sâ€²) push!(episode_steps, step) push!(episode_rewards, rtot) vÌ‚â€² = zero(T) rtot = zero(T) zero_params!(z_Î¸) zero_params!(z_w) ep += 1 c = one(T) s = mdp.initialize_state() update_feature_vector!(x, s) else update_feature_vector!(x, sâ€²) vÌ‚â€² = value_function(x, value_params) s = sâ€² c *= Î³ end Î´ = r + Î³*vÌ‚â€² - vÌ‚ update_traces_with_gradient!(Î³*Î»_w, z_w, âˆ‡vÌ‚) update_traces_with_gradient!(Î³*Î»_Î¸, z_Î¸, c, âˆ‡lnÏ€) update_params_with_gradient!(value_params, Î±_w*Î´, z_w) update_params_with_gradient!(policy_params, Î±_Î¸*c*Î´, z_Î¸) end function_outputs = form_state_and_policy_function_outputs(update_feature_vector!, update_action_preferences!, value_function, x, action_preferences, policy_params, value_params) return (;step_rewards = step_rewards, episode_steps = episode_steps, episode_rewards = episode_rewards, policy_parameters = policy_params, value_parameters = value_params, function_outputs...) 
end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$aa69e4ea-91e0-496a-a7be-529e67f4dbec„§cell_idÙ$aa69e4ea-91e0-496a-a7be-529e67f4dbec¤codeÙ´reinforce_with_baseline_monte_carlo_control_fcann(corridor_mdp, 1, [10, 10], update_corridor_features!, 100; Î±_Î¸ = 2f0^-14, Î±_w = 2f0^-14, max_steps = 10_000).policy_function(1)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$10ee7709-0816-48d2-abe0-9be3dd04700f„§cell_idÙ$10ee7709-0816-48d2-abe0-9be3dd04700f¤codeÙLplot_continuing_step_rewards(mountaincar_continuing_fcann_test.step_rewards)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$7d94922e-dc9f-4953-b539-24aaa2c85b12„§cell_idÙ$7d94922e-dc9f-4953-b539-24aaa2c85b12¤codeÙ˜@bind continuing_study_params create_actor_critic_continuing_params_UI(;Î»_Î¸ = 0.75f0, Î»_w = 0.25f0, log2Î±_Î¸ = -6, log2Î±_w = -10, Î±_rÌ„ = 0.005f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$df7f84e8-b42a-4001-9dbf-6bc3ced94207„§cell_idÙ$df7f84e8-b42a-4001-9dbf-6bc3ced94207¤codeÙŽusing PlutoDevMacros, Random, Statistics, LinearAlgebra, Transducers, Base.Threads, Random, Distributions, Statistics, StatsBase, StaticArrays¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$352d2952-cb83-47d3-9078-2b2ef9927443„§cell_idÙ$352d2952-cb83-47d3-9078-2b2ef9927443¤codeÚ·#create a cart pole MDP environment function create_cartpole_functions(; m::T = 1f0, #mass at the end of the pole in kg m_c::T = 10f0, #mass of the cart in kg l::T = 1f0, #length of the pole in meters g::T = 9.8f0, #gravitational constant in meters per second squared h::T = 4f-2, #step size parameter of simulation in seconds k::T = 1f0, #inertial constant of pendulum, m_f::T = 0f0, #friction of the rotating pole Î¼_c::T = 0f0, #friction of the cart wheels against the track fmax::T = 300f0, #force applied by throttle x_max::T = 50f0, #maximum horizontal position Î¸_max::T = deg2rad(70f0), #maximum pole angle xÌ‡_max::T = 50f0, Î¸Ì‡_max::T = 10f0, init_x::Function = () -> 0f0, 
#initialize each of the 4 state variables init_Î¸::Function = () -> Float32(rand([-0.02f0, 0.02f0])), init_xÌ‡::Function = () -> 0f0, init_Î¸Ì‡::Function = () -> 0f0) where T<:Real #the action space is full throttle forward or backwards or idle in the discrete case actions = [-fmax, zero(T), fmax] #create a vehicle to use in simulation steps vehicle = CartPoleVehicle(m, m_c, l, k, m_f, Î¼_c) initialize_state(;t = 0f0) = CartPoleState(init_x(), init_Î¸(), init_xÌ‡(), init_Î¸Ì‡(), t) function failure(s::CartPoleState) (abs(s.x) > x_max) || (abs(s.Î¸) > Î¸_max) || (abs(s.xÌ‡) > xÌ‡_max) || (abs(s.Î¸Ì‡) > Î¸Ì‡_max) end step(s::CartPoleState{T}, f::T) = cartpole_runge_kutta_step(vehicle, s, g, clamp(f, -fmax, fmax), h) min_vals = (-x_max, -Î¸_max, -xÌ‡_max, -Î¸Ì‡_max) max_vals = (x_max, Î¸_max, xÌ‡_max, Î¸Ì‡_max) (step = step, failure = failure, initialize_state = initialize_state, discrete_actions = actions, min_vals = min_vals, max_vals = max_vals, h = h) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$0964133c-3a5b-433b-a8c4-a97813c37583„§cell_idÙ$0964133c-3a5b-433b-a8c4-a97813c37583¤codeÚ,function plot_continuing_step_rewards(r::Vector{T}; npoints = 1000) where T<:Real rsum = cumsum(r) ravg = rsum ./ (1:length(r)) inds = round.(Int64, LinRange(1, length(r), npoints)) plot(scatter(x = inds, y = ravg[inds]), Layout(xaxis_title = "Training Step", yaxis_title = "Reward Average")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$349631b2-4686-49a9-9f3a-1e4ad588b568„§cell_idÙ$349631b2-4686-49a9-9f3a-1e4ad588b568¤codeÙ\const mountaincar_continuous_mdp2 = create_continuous_action_mountaincar(;slipforce = 100f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$8544eddb-2095-4a3c-82e0-920123a88e6d„§cell_idÙ$8544eddb-2095-4a3c-82e0-920123a88e6d¤codeÚ²md""" ### Test REINFORCE With and Without Baseline The following function calls execute the REINFORCE algorithm on Example 13.1. 
The output displayed is the policy function evaluated on the single state representation for the problem. The two values are the probabilities of taking the left and right actions respectively. If the algorithm has converged properly, the right action probability should be the higher of the two, approaching a value of about 60%. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$31f7e903-30b6-4193-9174-88093e004de4„§cell_idÙ$31f7e903-30b6-4193-9174-88093e004de4¤codeÚ¬md""" In policy gradient methods, the policy can be parameterized in any way, as long as $\pi(a \vert s, \boldsymbol{\theta})$ is differentiable with respect to its parameters, that is, as long as $\nabla \pi(a \vert s, \boldsymbol{\theta})$ exists and is finite for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$, and $\boldsymbol{\theta} \in \mathbb{R}^{d^\prime}$ where $d^\prime$ is the number of parameters. If the action space is discrete and not too large, then we can define a numerical preference for each state/action pair parameterized by $\boldsymbol{\theta}$: $h(s, a, \boldsymbol{\theta})$. The corresponding policy can then select actions according to the probability distribution generated by the soft-max: $\pi(a|s, \boldsymbol{\theta}) \doteq \frac{\exp{h(s, a, \boldsymbol{\theta})}}{\sum_b \exp{h(s, b, \boldsymbol{\theta})}}$. One advantage of using the soft-max is that the optimal policy can be stochastic, or we can approach a deterministic policy by selecting the action with the highest probability. If we include a temperature parameter in the soft-max then we can vary the same policy to be more or less stochastic as needed. If we calculate preferences with linear features, then we would have feature vectors $\mathbf{x}(s, a) \in \mathbb{R}^{d^\prime}$ to match with the parameter vector $\boldsymbol{\theta} \in \mathbb{R}^{d^\prime}$. 
Then the preferences would be calculated: $h(s, a, \boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s, a)$ Another advantage is that for some problems the policy may be easier to approximate than the action-value function. We can also inject some prior knowledge of the environment into how the policy is parametrized. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$fee14dfe-c5ca-4126-a830-cc9d7eda5433„§cell_idÙ$fee14dfe-c5ca-4126-a830-cc9d7eda5433¤codeÚ1const mountaincar_continuous_test_train2 = actor_critic_with_eligibility_traces_binary_features_gaussian_actions(mountaincar_continuous_mdp2, 0.05f0, 0.8f0, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, typemax(Int64), 100_000; Î±_Î¸ = 5f-4, Î±_w = 0.0008f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$b53dba81-a9e9-41da-8fc2-7736bf25f2dc„§cell_idÙ$b53dba81-a9e9-41da-8fc2-7736bf25f2dc¤codeÚjif run_mountaincar_binary_episodic_countinuous_param_study > 0 actor_critic_binary_episodic_gaussian_parameter_study(mountaincar_continuous_mdp, mountaincar_tilecoding_setup.get_active_features, mountaincar_tilecoding_setup.num_features, mountaincar_binary_continuous_params, 4, 3, 1000; max_steps = 100_000) else md""" Waiting to run parameter study """ end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$beb01fb8-c77d-4b5c-a66d-3812415e04a3„§cell_idÙ$beb01fb8-c77d-4b5c-a66d-3812415e04a3¤codeÚ ### *Exercise 13.4* > For the Gaussian policy parameterization, derive the formula for the eligibility vector $\nabla \ln{\pi(a|s, \mathbf{\theta})}$ Starting with our expression for the parameter function, we can calculate the gradient: $\nabla \pi(a|s, \mathbf{\theta}) = \nabla \left ( \frac{1}{\sigma(s, \mathbf{\theta}) \sqrt{2\pi}} \exp \left ( - \frac{(a-\mu(s, \mathbf{\theta}))^2}{2\sigma(s, \mathbf{\theta})^2} \right ) \right )$ We will eventually need $\nabla \mu$ and $\nabla \sigma$ so let's calculate them now. 
$\nabla (\sigma(s, \mathbf{\theta})) = \nabla \exp{( \mathbf{\theta}_\sigma ^ \top \mathbf{x}_\sigma (s))} = \sigma(s, \mathbf{\theta})\mathbf{x}_\sigma (s)$ $\nabla(\mu(s, \mathbf{\theta})) = \nabla ( \mathbf{\theta}_\mu ^\top \mathbf{x}_\mu(s)) = \mathbf{x}_\mu (s)$ The first application of the quotient rule is trivial, I will omit the input arguments to Î¼ and Ïƒ keeping in mind that these are functions of the parameters. Also let $\left ( - \frac{(a-\mu)^2}{2\sigma^2} \right ) = f(\mu, \sigma)$ which results in $\pi(a|s, \mathbf{\theta}) = \frac{1}{\sigma \sqrt{2\pi}} \exp{(f(\mu, \sigma))}$. Therefore: $\begin{flalign} \nabla \pi(a|s, \mathbf{\theta}) \sqrt{2\pi} &= \frac{1}{\sigma ^2} \left (- \exp{(f(\mu, \sigma))} \nabla \sigma + \sigma \exp{(f(\mu, \sigma))}\nabla f(\mu, \sigma) \right ) \\ &= \frac{1}{\sigma ^2} \left ( -\exp{(f(\mu, \sigma))} \sigma\mathbf{x}_\sigma + \sigma \exp{(f(\mu, \sigma))}\nabla f(\mu, \sigma) \right ) \\ &=\frac{\exp{(f(\mu, \sigma))}}{\sigma} \left (-\mathbf{x}_\sigma + \nabla f(\mu, \sigma) \right ) \\ \end{flalign}$ Now we need only calculate the gradient of $f$: $\begin{flalign} \nabla f(\mu, \sigma) &= \frac{-1}{2} \nabla \left [ \frac{(a-\mu)^2}{\sigma^2} \right ] \\ & = \frac{-1}{2\sigma^4} \left [-2 \sigma^2 (a - \mu) \nabla \mu - (a - \mu)^2 2\sigma \nabla \sigma \right ] \\ & = \frac{-1}{\sigma^3} \left [ -\sigma (a - \mu) \nabla \mu - (a - \mu)^2 \nabla \sigma \right ] \\ & = \frac{-1}{\sigma^3} \left [ -\sigma (a - \mu) \mathbf{x}_\mu (s) - (a - \mu)^2 \sigma \mathbf{x}_\sigma \right ] \tag{substituting gradients}\\ & = \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu) \tag{simplifying}\\ \end{flalign}$ Now substitute this back into the policy gradient: $\nabla \pi(a|s, \mathbf{\theta}) \sqrt{2\pi} = \frac{\exp{(f(\mu, \sigma))}}{\sigma} \left (-\mathbf{x}_\sigma + \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu) \right )$ Furthermore, observe that $\pi(a|s, 
\mathbf{\theta}) = \frac{1}{\sigma\sqrt{2\pi}} \exp(f(\mu, \sigma))$ So our expression for the policy gradient is: $\nabla \pi(a|s, \mathbf{\theta}) = \pi(a|s, \mathbf{\theta}) \left (-\mathbf{x}_\sigma + \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu) \right )$ To get the eligibility vector we must divide this by the policy, which is conveniently already a factor in the expression: $\begin{flalign} \frac{\nabla \pi(a|s, \mathbf{\theta})}{\pi(a|s, \mathbf{\theta})} &= -\mathbf{x}_\sigma + \frac{(a - \mu)}{\sigma^2} ((a - \mu) \mathbf{x}_\sigma + \mathbf{x}_\mu)\\ &= \mathbf{x}_\mu \left [ \frac{(a - \mu)}{\sigma^2} \right ] + \mathbf{x}_\sigma \left [\frac{(a-\mu)^2}{\sigma^2} -1 \right ] \\ \end{flalign}$ There are two components to the sum, one for $\mu$ and one for $\sigma$. If we think of the parameters and feature vectors as concatenated, then this sum would be an element-wise sum where $\mathbf{x}_\mu$ has a zero value for all the feature indices corresponding to $\sigma$ and vice versa. This way the sum forms one complete vector that has gradient components for all the parameters $\mathbf{\theta}_\mu$ and $\mathbf{\theta}_\sigma$. Alternatively, the two terms can be kept separate throughout the calculation, with each gradient containing only the components for its own parameter vector. """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$8bc280db-e57d-4e40-be46-1790f4f7d9e7„§cell_idÙ$8bc280db-e57d-4e40-be46-1790f4f7d9e7¤codeÚ¡function actor_critic_fcann_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, hidden_layers::Vector{Int64}, Î»_Î¸::T, Î»_w::T, Î±_rÌ„::T, Î±_Î¸_list::AbstractVector{T}, Î±_w_list::AbstractVector{T}, max_steps::Integer; nruns::Integer = 100, seed = rand(UInt64), kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) init_policy_params = FCANN.initializeparams_saxe(num_features, hidden_layers, length(mdp.actions)) make_trace_data(Î±_Î¸_list, Î±_w) = [average_continuing_runs(nruns, seed, Î±_Î¸, Î±_w, Î±_rÌ„, init_policy_params, actor_critic_with_eligibility_traces_fcann, mdp, Î»_Î¸, Î»_w, num_features, hidden_layers, update_feature_vector!, max_steps; kwargs...) for Î±_Î¸ in Î±_Î¸_list] traces = [begin scatter(x = Î±_Î¸_list, y = make_trace_data(Î±_Î¸_list, Î±_w), name = "Î±_w = $Î±_w") end for Î±_w in Î±_w_list] plot(traces, Layout(xaxis_title = "Î±_Î¸", yaxis_title = "Average Reward Per Step in the First
$max_steps Steps Averaged Over $nruns Runs", xaxis_type = "log", title = "$num_features Input, $hidden_layers Hidden Non Linear Approximation, Î»_Î¸ = $Î»_Î¸, Î»_w = $Î»_w")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$89901156-b874-416b-89c1-6dc434a4eb17„§cell_idÙ$89901156-b874-416b-89c1-6dc434a4eb17¤codeÙ(md""" ### *REINFORCE Implementation* """¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$ff76ef94-fdf5-41f3-a31a-21c4629efabe„§cell_idÙ$ff76ef94-fdf5-41f3-a31a-21c4629efabe¤codeÙ(const corridor_mdp = make_corridor_mdp()¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÃ«code_foldedÂÙ$581f7e9b-a5c2-4841-9605-85f9585b0274„§cell_idÙ$581f7e9b-a5c2-4841-9605-85f9585b0274¤codeÙºupdate_linear_action_preferences!(action_preferences::Vector{T}, x::Vector{T}, params::Matrix{T}) where T<:AbstractFloat = BLAS.gemv!('T', one(T), params, x, zero(T), action_preferences)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$8aa16866-bfda-48df-9cf1-cf3d2e203ccb„§cell_idÙ$8aa16866-bfda-48df-9cf1-cf3d2e203ccb¤codeÚofunction cartpole_tilecoding_reinforce_continuous_parameter_study(Î±1_list, Î±2_list, max_episodes; num_trials = 100, kwargs...) setup = setup_cartpole_problem(;kwargs...) 
traces = [begin steps = [begin 1:num_trials |> Map() do i solution = reinforce_with_baseline_monte_carlo_control_binary_features_gaussian_actions(cartpole_setup.mdps.episodic.continuous, cartpole_setup.get_active_features, cartpole_setup.num_features, max_episodes; Î±_Î¸ = Î±1, Î±_w = Î±2) steps = solution.episode_steps isempty(steps) && return 0 mean(steps) end |> foldxt(+) |> x -> x / num_trials end for Î±1 in Î±1_list] scatter(x = Î±1_list, y = steps, name = "Î±_w = $Î±2") end for Î±2 in Î±2_list] plot(traces, Layout(xaxis_title = "Policy Learning Rate Î±_Î¸", yaxis_title = "Average Episode Duration Over First $max_episodes Episodes", xaxis_type = "log")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$04b5929a-2058-49c9-963a-96c752a1d67d„§cell_idÙ$04b5929a-2058-49c9-963a-96c752a1d67d¤codeÙIplot_continuing_step_rewards(cartpole_continuing_fcann_test.step_rewards)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$f0104778-81a6-417b-8501-f916e5e7f3af„§cell_idÙ$f0104778-81a6-417b-8501-f916e5e7f3af¤codeÚAfunction make_corridor_continuing_mdp() function step(s::Integer, i_a::Integer) Î´ = 2*i_a - 3 #calculates the s change -1 for left (1) and 1 for right (2) switch = iseven(s) #returns true in state 2 which is where actions are switched, when switch is true, multiply Î´ by -1, otherwise by 1 c = 1 - 2*switch sâ€² = s + c*Î´ goal = s == 4 left_limit = s == 0 sâ€² = ifelse(left_limit || goal, 1, sâ€²) r = Float32(goal) (r, sâ€²) end actions = [:left, :right] ptf = StateMDPTransitionSampler(step, 1) StateMDP(actions, ptf, () -> 1, Returns(false)) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$3e3c5897-809f-46e3-bb58-f115b082443e„§cell_idÙ$3e3c5897-809f-46e3-bb58-f115b082443e¤codeÚfunction actor_critic_with_eligibility_traces_binary_features_beta_actions(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, Î»_Î¸::T, Î»_w::T, get_active_features::Function, num_features::Integer, args...; policy_params::Matrix{T} = 
make_n_param_dist_policy_params(2, num_features, rand(A)), value_params::Vector{T} = zeros(T, num_features), kwargs...) where {T<:Real, S, N, A <: Union{T, NTuple{N, T}}, P, F1, F2, F3} setup = setup_binary_beta_policy_arguments(mdp, get_active_features, num_features) actor_critic_with_eligibility_traces!(policy_params, setup.eligibility_vector, value_params, BinaryFeatureVector(), mdp, Î»_Î¸, Î»_w, update_binary_action_preferences!, setup.action_distribution_parameters, make_beta_sampler(rand(A)), update_beta_eligibility_vector!, setup.feature_vector, setup.update_feature_vector!, binary_value_function, update_binary_value_gradient!, args...; kwargs...) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$a9db3f85-ff56-4bbc-be87-47b893ef3b7b„§cell_idÙ$a9db3f85-ff56-4bbc-be87-47b893ef3b7b¤codeÙÛfunction mountaincar_continuing_step(s, i_a::Integer) a = MountainCarTask.actions[i_a] sâ€² = MountainCarTask.step(s, a) (sâ€²[1] == 0.5f0) && return (1f0, MountainCarTask.initialize_state()) return (0f0, sâ€²) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$08505e88-9c23-4e95-91e3-d18bf5133dbc„§cell_idÙ$08505e88-9c23-4e95-91e3-d18bf5133dbc¤codeÚfunction actor_critic_binary_episodic_squashed_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, amax::A, get_active_features::Function, num_features::Integer, Î»_Î¸::T, Î»_w::T, Î±_Î¸_list::AbstractVector{T}, Î±_w_list::AbstractVector{T}, max_episodes::Integer; nruns::Integer = 100, max_steps::Integer = 10_000, seed = rand(UInt64), init_policy_params::Matrix{T} = make_n_param_dist_policy_params(2, num_features, rand(A)), init_value_params::Vector{T} = zeros(T, num_features), kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) function average_runs(Î±_Î¸, Î±_w) 1:nruns |> Map(_ -> actor_critic_with_eligibility_traces_binary_features_squashed_gaussian_actions(mdp, amax, Î»_Î¸, Î»_w, get_active_features, num_features, max_episodes, max_steps; Î±_Î¸ = Î±_Î¸, Î±_w = Î±_w, policy_params = copy(init_policy_params), value_params = copy(init_value_params), kwargs...) |> x -> isempty(x.episode_rewards) ? -T(Inf) : mean(x.episode_rewards)) |> foldxt(+) |> x -> x / nruns end traces = [begin scatter(x = Î±_Î¸_list, y = average_runs.(Î±_Î¸_list, Î±_w), name = "Î±_w = $Î±_w") end for Î±_w in Î±_w_list] plot(traces, Layout(xaxis_title = "Î±_Î¸", yaxis_title = "Average Reward Per Episode in the First
$max_episodes Episodes Averaged Over $nruns Runs", xaxis_type = "log", title = "Binary Feature Encoding with $num_features Features, Î»_Î¸ = $Î»_Î¸, Î»_w = $Î»_w")) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$ad0009af-2cfc-4820-bd4a-698ad391f459„§cell_idÙ$ad0009af-2cfc-4820-bd4a-698ad391f459¤codeÙbplot(scatter(x = LinRange(0, 1, 1000), y = make_beta_dist(beta_params...).(LinRange(0, 1, 1000))))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$16fcc2d0-9f2f-4226-9dcc-6d86248cab26„§cell_idÙ$16fcc2d0-9f2f-4226-9dcc-6d86248cab26¤codeÚXfunction plot_state_distributions(p_left; kwargs...) state_visits = collect_state_distributions(;p = p_left, kwargs...) Î· = sum(state_visits) Î¼ = sum(state_visits, dims=1)[:] Î¼s_plot = plot(bar(x = 1:3, y = Î¼ ./ sum(Î¼)), Layout(yaxis_range = [0, 1], xaxis_tickvals = [1, 2, 3], xaxis_title = "State", yaxis_title = "Probability", title = "Stationary State Distribution")) p_not_term = sum(state_visits, dims = 2) pterm = 1 .- p_not_term termplot = plot(pterm, Layout(xaxis_title = "Step", yaxis_title = "Probability", title = "Probability of Episode Terminating On or Before Step")) (n, m) = size(state_visits) plots = [begin v = state_visits[k, :][:] vterm = pterm[k] t = bar(x = 1:4, y = [v; vterm], name = "k = $k") p = plot(t, Layout(width = 270, height = 250, yaxis_range = [0, 1], xaxis = attr(tickvals = 1:4, ticktext = ["1", "2", "3", "Term"], title = "State"), yaxis_title = "Probability", title = "Step $k")) end for k in vcat(1:5, 10:10:50)] full_p = plot(heatmap(x = 0:20, y = 1:3, z = state_visits[1:21, :]' ./ sum(state_visits)), Layout(xaxis_title = "Step", yaxis_title = "State", title = "Probability Over States and Steps", yaxis_tickvals = [1, 2, 3])) # p3 = plot(traces2) @htl(""" Policy Probability for Left Action is $p_left and Average Episode Length is $Î·

State Distribution Per Step Including Terminal State

$plots

$termplot $Î¼s_plot

$full_p

""") end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$11063fff-4d36-46d5-828f-dbed0f46b9cf„§cell_idÙ$11063fff-4d36-46d5-828f-dbed0f46b9cf¤codeÚrfunction actor_critic_fcann_parameter_study(mdp::StateMDP{T, S, A, P, F1, F2, F3}, update_feature_vector!::Function, num_features::Integer, hidden_layers::Vector{Int64}, Î»_Î¸_list::AbstractVector{T}, Î»_w_list::AbstractVector{T}, Î±_rÌ„_list::AbstractVector{T}, Î±_Î¸_list::AbstractVector{T}, Î±_w_list::AbstractVector{T}, num_tests::Integer, max_steps::Integer; nruns::Integer = 100, seed = rand(UInt64), kwargs...) where {T<:Real, S, A, P, F1, F2, F3} Random.seed!(seed) init_policy_params = FCANN.initializeparams_saxe(num_features, hidden_layers, length(mdp.actions)) run_test(Î±_Î¸, Î±_w, Î±_rÌ„, Î»_Î¸, Î»_w) = average_continuing_runs(nruns, seed, Î±_Î¸, Î±_w, Î±_rÌ„, init_policy_params, actor_critic_with_eligibility_traces_fcann, mdp, Î»_Î¸, Î»_w, num_features, hidden_layers, update_feature_vector!, max_steps; kwargs...) test_params = [(Î±_Î¸ = rand(Î±_Î¸_list), Î±_w = rand(Î±_w_list), Î±_rÌ„ = rand(Î±_rÌ„_list), Î»_Î¸ = rand(Î»_Î¸_list), Î»_w = rand(Î»_w_list)) for _ in 1:num_tests] DataFrame([begin output = run_test(params...) 
(;params..., output = output) end for params in test_params]) end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$8fcdca63-01a0-4d4b-933c-06a7621d980a„§cell_idÙ$8fcdca63-01a0-4d4b-933c-06a7621d980a¤codeÙW#add neural network implementation of continuous policy gradient and do parameter study¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$33c99850-67cd-4754-94b9-6df97b238e27„§cell_idÙ$33c99850-67cd-4754-94b9-6df97b238e27¤codeÙûfunction soft_max!(x::AbstractVector{T}) where T<:Real minx, maxx = extrema(x) if minx == maxx x .= one(T) / length(x) return x end s = zero(T) @inbounds @simd for i in eachindex(x) h = exp(x[i] - maxx) s += h x[i] = h end x ./= s end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$786a5385-b648-4fc3-8e19-bf6582828136„§cell_idÙ$786a5385-b648-4fc3-8e19-bf6582828136¤codeÚKmd""" #### Continuous Action Space Now that we have verified the success of policy gradient methods on this problem, we can consider using a continuous action space where the policy can output a distribution over throttles. In the original problem, the maximum throttle value is 1, but the velocity of the car is already capped at 0.07. We can see if a policy attempts to use much higher throttle values to end the episode faster even if the physics is unrealistic. That observation would confirm a successful use of continuous actions where the throttle is an unbounded continuous value. The optimal policy would likely try to use the highest throttle possible to reach the maximum speed in either direction faster. We could apply friction to the problem so that the car would actually slip if it attempts to accelerate too quickly. 
"""¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$573878bb-020d-40f6-9329-3d5f91843010„§cell_idÙ$573878bb-020d-40f6-9329-3d5f91843010¤codeÙ^get_corridor_episode_stats(corridor_train.greedy_policy; max_steps = 100, ntrials = 1_000_000)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$2e7c737c-c798-4442-a7e1-d74ccfd73119„§cell_idÙ$2e7c737c-c798-4442-a7e1-d74ccfd73119¤codeÙ6@bind xÌ‡ Slider(-50:50; default = 0, show_value=true)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$9d264543-33ab-498a-90f5-5f913c252484„§cell_idÙ$9d264543-33ab-498a-90f5-5f913c252484¤codeÙ]plot(reinforce_test4.episode_steps[1:10:end] ./ (1:10:length(reinforce_test4.episode_steps)))¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$9cf3dc5f-8a25-479f-93db-06e34f0d37a0„§cell_idÙ$9cf3dc5f-8a25-479f-93db-06e34f0d37a0¤codeÙ%plot_state_distributions(dist_plot_p)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÃÙ$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70„§cell_idÙ$d04d4234-d97f-11ed-2ea3-85ee0fc3bd70¤codeÙ‡begin using PlutoUI, PlutoPlotly, LaTeXStrings, PlutoProfile, HypertextLiteral, ProgressLogging, BenchmarkTools TableOfContents() end¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÃ«code_foldedÂÙ$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19f„§cell_idÙ$bd6a7c16-6c25-4fc2-8e1b-4dab693ce19f¤codeÚdactor_critic_binary_episodic_squashed_gaussian_parameter_study(mdp::ContinuousMDP{T, S, A, P, F1, F2, F3}, amax::A, get_active_features::Function, num_features::Integer, params::@NamedTuple{Î»_Î¸::T, Î»_w::T, Î±_Î¸_min::Int64, Î±_w_min::Int64}, num_Î¸::Integer, num_w::Integer, num_episodes::Integer; kwargs...) 
where {T<:Real, S, A, P, F1, F2, F3} = actor_critic_binary_episodic_squashed_gaussian_parameter_study(mdp, amax, get_active_features, num_features, params.Î»_Î¸, params.Î»_w, 2f0 .^(params.Î±_Î¸_min:params.Î±_Î¸_min+num_Î¸-1), 2f0 .^(params.Î±_w_min:params.Î±_w_min+num_w-1), num_episodes; kwargs...)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂÙ$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72c„§cell_idÙ$3e5fc75b-61a5-49d5-b5bd-3d2847f5f72c¤codeÙ„corridor_train = sarsa_Î»(corridor_mdp, 1f0, 0.99f0, typemax(Int64), 1_000_000, 1, get_corridor_features; Ïµ = 0.5f0, Î± = 0.0001f0)¨metadataƒ©show_logsÃ¨disabledÂ®skip_as_scriptÂ«code_foldedÂ«notebook_idÙ$6d683db8-38f5-11f0-0729-898e37e867d8«in_temp_dirÂ¨metadata€