[[/public:t-720-atai:atai-21:main|T-720-ATAI-2021 Main]]
\\

\\
====Engineering Assignment 1:====
=====Learning=====
\\
  
**Aim:** This assignment is meant to give you better insight into deep reinforcement learning and into how it differs from cumulative, life-long learning, using human learning as an example.
  
**Summary:** In this first exercise you are asked to evaluate a given Deep-Reinforcement Learner (an actor-critic learner, to be specific; for further information see [[https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf|Konda & Tsitsiklis 2000]]) in different task-environments, coded in Python. The task-environment (or just task, really) is the well-known cart-pole task: balancing a pole on a moving platform, in 1-D (left and right movements).
  
Additionally, you are to get a better idea of cumulative learning through the example of a human learner - i.e. yourself. Furthermore, this assignment will highlight the importance of perception for learning and task solving. You are given Python code implementing the same cart-pole task as the one the RL learner faces, designed as a game for a human to play. For this, four different conditions have been implemented, giving you the chance to experience for yourself how much the presentation of data matters.\\
  
\\
  
|  {{/public:t-720-atai:cart-pole-task.jpg?500}}  |
|  The cart-pole task.  |
  
  
=== Setup ===
Install python3 on your computer (https://www.python.org/downloads/).\\
Download the attached zip file, extract it to some location, and cd into the folder.\\
Install the included requirements.txt file:\\
  $ pip install -r requirements.txt
Run the code:\\
  $ python main.py

For the first task (Deep Reinforcement Learning) you will need to install PyTorch. Since the installation differs depending on which OS you use and on whether you have a GPU that supports CUDA (Nvidia GPUs only), you should follow the installation instructions [[https://pytorch.org/get-started/locally/|here]].
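On many CPU-only setups the default PyPI package is typically sufficient, e.g. (treat this as a rough guess and prefer the exact command generated on the linked page for your OS and CUDA version):
  $ pip install torch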
  
Zip Files:\\
{{:public:t-720-atai:atai-21:assignment_1_rl.zip|Assignment 1 Reinforcement Learning}}\\
{{:public:t-720-atai:atai-21:assignment_1_rl_new.zip|Assignment 1 Reinforcement Learning (updated env.py file to correctly apply action noise)}}\\
{{:public:t-720-atai:atai-21:assignment_1_hl.zip|Assignment 1 Human Learning}}
  
====Assignment 1.1: Deep Reinforcement Learning====

===Your task:===

  - **Plain Vanilla.** Evaluate the actor-critic’s performance on the cart-pole task given to you as Python code:
    - Run the learner repeatedly and collect the data. Stop each run when either 1000 epochs are reached or the agent averages more than 200 iterations per epoch over at least 100 consecutive epochs (this is usually the case at around 400-500 epochs).
    - Plot its improvement in performance over time.
  - **Modified Version.** Evaluate the learner’s performance on a modified version of the cart-pole task. Evaluate at least two of the following modifications of the environment and compare the results to those from 1.:
    - Noise on observations and actions.
    - Hide each variable once (x, v, theta, omega) and run the setup with only three observables.
    - Introduce extremely high noise on one observable, once for each of the four observables (three normal, one noisy variable).
    - Change the task after a certain number of epochs. Think of at least three different changes; one is given as an example in the code.
    - Change the discreteness of time/observables, increasing or decreasing the variable resolution.
  - Calculate the average score, median score, maximum score, and standard deviation of each task (see the sketch below).
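For the last point, here is a minimal sketch of how the summary statistics could be computed, assuming the iterations-per-epoch values of a run have been collected in a Python list (variable names and placeholder data are illustrative, not part of the provided code):
  # Sketch: summary statistics for one run; `scores` stands in for the recorded
  # iterations-per-epoch values (placeholder data -- replace with your own).
  import statistics
  scores = [12, 35, 48, 200, 210, 195]
  print("average:", statistics.mean(scores))
  print("median: ", statistics.median(scores))
  print("maximum:", max(scores))
  print("std dev:", statistics.stdev(scores))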
\\

=== Further information ===
In the file “ac_learner.py” you can find the source code of an actor-critic learner.

In each of these methods you can implement a different way to adjust the task or the information passed to or from the agent. In “env.py” a helper class for noise is included, which you can use.
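As a rough illustration only - the names below are hypothetical and //not// taken from the provided env.py (prefer the included noise helper class where possible) - injecting Gaussian noise into the observation passed to the agent could look like this:
  # Illustrative sketch only -- not the helper class from env.py.
  # Adds Gaussian noise to a cart-pole observation vector [x, v, theta, omega].
  import numpy as np
  def add_observation_noise(state, std_devs=(0.2, 0.2, 0.02, 0.2)):
      """Return a noisy copy of the observation; the std-dev values are just examples."""
      return np.asarray(state, dtype=float) + np.random.normal(0.0, std_devs)
  print(add_observation_noise([0.0, 0.1, 0.05, -0.2]))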
  
After the agent has run for the defined maximum number of epochs, a plot of iterations per epoch is created. Use this to evaluate the learning performance of the learner. Extend the plots to include whatever information you deem important.
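For example, a running average over the last 100 epochs makes the success criterion from Assignment 1.1 easy to read off the plot. A minimal sketch, assuming matplotlib is available (install it with pip if it is not pulled in by requirements.txt); the variable names and placeholder data are illustrative, not from the provided code:
  # Illustrative sketch: iterations per epoch with a 100-epoch running average.
  import numpy as np
  import matplotlib.pyplot as plt
  iterations = np.random.randint(10, 210, size=500)  # placeholder -- use your recorded run data
  window = 100
  running_avg = np.convolve(iterations, np.ones(window) / window, mode="valid")
  plt.plot(iterations, label="iterations per epoch")
  plt.plot(np.arange(window - 1, len(iterations)), running_avg, label="100-epoch running average")
  plt.xlabel("epoch")
  plt.ylabel("iterations")
  plt.legend()
  plt.show()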

**HINT:** You are allowed to change any parts of the learner or the environment; just make sure to document all changes and explain //**how and why**// they influence the learning performance of the agent.
  
  
\\

\\
  
====Assignment 1.2: Human Learning====


====The Game====
**Condition 1: SYMBOLIC**\\
You are presented with an alphanumeric display of the continuously updated state of the observable variables relevant to the task (e.g. x, v, theta, and omega). With the arrow keys you can apply a force of -10 (left) or +10 (right) Newton to the cart. Your task is to keep the pole upright for as long as possible. In the top-right you can see your current score (the total reward you achieved in this epoch, +1 for each successful iteration); in the center are the values of the observables. You can set the environment to run synchronously or asynchronously, meaning that if sync is not set the environment updates automatically every 100 ms. Further, you can invert the forces by pressing the i key on your keyboard.\\
|  {{/public:t-720-atai:atai-20:condition_1.png?500}}  |
|  Condition 1 - symbolic  |
\\
\\
**Condition 2: COLOR CODING**\\
You are presented with a color for each of the variables, as well as an arrow indicating whether the value is negative (pointing to the left) or positive (pointing to the right). Green means a value around 0; the redder the color becomes, the closer you are to either edge. For v and omega (both not part of the solution-space restriction), red implies a high value and green a low value. Other things are the same as in Condition 1.\\
|  {{/public:t-720-atai:atai-20:condition_2.png?500}}  |
|  Condition 2 - color coding  |
\\
\\
**Condition 3: COLORS & LINES**\\
In the colors_lines setting the same colors are used as in Condition 2; additionally, a line is drawn on each bar, giving you further information about the current state of the observables. Otherwise things are the same as in Conditions 1 and 2.\\
|  {{/public:t-720-atai:atai-20:condition_3.png?500}}  |
|  Condition 3 - colors and lines  |
\\
\\
**Condition 4: ANIMATION**\\
In the animation setting an animation of the cart-pole is presented to you, including the cart (rectangle), the pole (line to the top of the image), the velocity (line inside the cart pointing left or right), and the angular velocity (line at the top of the pole), indicating the current position and velocity of the cart. Otherwise this is the same as in Condition 1.\\
|  {{/public:t-720-atai:atai-20:condition_4.png?500}}  |
|  Condition 4 - animation  |
\\
\\
  
====Your task:====
**IMPORTANT: Read the full instructions before beginning.**\\
  - Your task is to get good at performing the task in all 4 conditions. Record the data for your training in each condition.
  - Apply the following settings to the environment (if you downloaded and use the provided environment, this should already be the case):
    - All variables (x, v, theta, omega) are observables.
    - Apply noise (all of it at the same time) to the environment with a mean of 0 and the following standard deviations (a small sketch of this configuration is given after this list):
      - x: 0.2 m
      - v: 0.2 m/s
      - theta: 1.0 deg
      - omega: 0.2 rad/s
    - Set the environment to run asynchronously.
  - Play the game in Conditions 1, 2, 3, and 4, in that order, for at least 10 epochs each (or better phrased: until you are confident playing in the condition, but for at least 10 epochs), and note for each condition your:
    - highest score,
    - average score and standard deviation,
    - median score.
  - **From now on only play the two conditions in which you performed worst.**
  - Invert the forces by pressing the “i” key on your keyboard during a run (after 5-10 restarts/fails) and continue for another 5-10 episodes. Do this in the two conditions in which you performed worst earlier, in the same order as previously (i.e. redo instruction 3 with this inversion). What can you say about your learning speed with force inversion?
  - Reset the settings back to the ones from the beginning and replay the game (as described in the third instruction) in the two conditions in which you performed worst in your first tries.
  - Compare your results from the first tries (instruction 3) to the later ones. What can you conclude about the possibilities of cumulative, life-long learning?
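To make the noise settings in instruction 2 concrete, here is a small sketch of how noise with these parameters could be drawn (names and structure are illustrative; the provided environment already applies the noise when configured as described):
  # Illustrative sketch: observation noise with mean 0 and the per-variable
  # standard deviations listed above (x [m], v [m/s], theta [deg], omega [rad/s]).
  import numpy as np
  NOISE_STD = {"x": 0.2, "v": 0.2, "theta": 1.0, "omega": 0.2}
  noise = {var: np.random.normal(0.0, std) for var, std in NOISE_STD.items()}
  print(noise)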
  
\\
  
====Further information====
Try to not only do what you are asked, but rather investigate learning further by yourself (if you have the time to do so, of course). Some of the things you could do:
  - Change dt, making it faster or slower.
  - Try out the synchronous environment - this is actually how reinforcement learners would play the game.
  - Try on yourself the things you changed for the RL in the reinforcement-learning part of this assignment, for example changing the task in the middle of a run.
  - Try even more things the RL could not do, for example changing the observation state during a run (e.g. after 100 iterations).
  - What else can you think of?
  - You can adjust the plot_iterations function in the env.py file to plot additional information like mean, std_dev, etc.
  - Generally speaking: feel free to add and/or remove anything you like. This is an assignment where you should reflect on your own learning abilities. Whatever you think might help you get better insight into your own learning is more than welcome. Please document all changes you made in the report!
Try to make a point about the advantages and disadvantages of cumulative, human learning.
\\
\\

====Assignment 1.3: Report====

  - **Report.** Write a 4-5 page report in which you describe your results. Draw some insights in relation to learning in general, try to make some generalizations based on them, and discuss, e.g.:
    - Regarding RL:
      - When does the actor-critic learner fail?
      - Which changes will be //impossible// for the actor-critic to adjust to (try this out yourself from what you know of neural networks; hint: input and output layers of ANNs).
      - What is your opinion of the //generality// and //adaptability// of the actor-critic learner with respect to **//novelty//** (novel task-environments)?
    - Regarding human learning:
      - Discuss the advantages and disadvantages of human learning (and human nature).
      - This might include (but is not restricted to):
        - Previously acquired knowledge used in this game.
        - Cumulative learning.
        - Transfer learning.
        - Boredom.
    - Is RL in any way similar to how humans learn? If 'yes', how? If 'no', what's different, and why?
    - Compare the RL to human learning.


\\
**Set up the conditions:**\\
For most of the tasks given to you, all you need is to change parts of main.py or cart_pole_env.py. I hope this is self-explanatory; if any questions arise, please ask!