ATAI-21 Reykjavik University
Engineering Assignment 1:
Learning
Aim: This assignment is meant to give you better insight into cumulative and life-long learning, using human learning as the example.
Summary: In this assignment you will get a better idea of cumulative learning by studying a human learner - i.e. yourself. The assignment also highlights the importance of perception for learning and task solving. You are given Python code implementing the cart-pole task as a game for a human to play. Four different conditions have been implemented, giving you the chance to experience for yourself how much the presentation of data matters.
The files for this assignment can be found here: Assignment 1
Setup
Install Python 3 on your computer (https://www.python.org/downloads/).
Download the attached zip file, extract it to some location (e.g. …/assignment_1/), and cd into the folder.
Install the dependencies from the included requirements.txt file:
$ pip install -r requirements.txt
Run the code:
$ python main.py
The requirements.txt file was tested with Python 3.9. If you use a different Python version or run into installation problems, please contact us well before the deadline so that we can help sort it out.
The Game
Condition 1: SYMBOLIC
You are presented with an alphanumeric display of the continuously updated state of the observable variables relevant to the task (e.g. x, v, theta, and omega). With the arrow keys you can apply a force of -10 (left) or +10 (right) Newton to the cart, just as the reinforcement learner did previously. Your task is to keep the pole upright for as long as possible. In the top-right you can see your current score, i.e. the total reward you achieved in this epoch (+1 for each successful iteration); in the center are the values of the observables. You can set the environment to run synchronously or asynchronously; if sync is not set, the environment updates automatically every 100 ms. Furthermore, you can invert the forces by pressing the i key on your keyboard.
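To make this loop concrete, below is a minimal sketch of one episode of the game in this condition. The names (env, read_key, step, play_episode) are illustrative stand-ins, not the actual API of the provided cart_pole_env.py; the point is only the mapping from the arrow keys to a force of ±10 N, the +1 reward per successful iteration, and the automatic update every 100 ms in async mode.

import time

FORCE_LEFT, FORCE_RIGHT = -10.0, +10.0   # Newtons, applied via the arrow keys
UPDATE_MS = 100                          # async mode: environment advances every 100 ms

def play_episode(env, read_key):
    """Illustrative only: env and read_key are hypothetical stand-ins for the game objects."""
    score = 0
    inverted = False                     # toggled with the 'i' key
    done = False
    while not done:
        key = read_key()                 # e.g. 'left', 'right', 'i', or None
        if key == 'i':
            inverted = not inverted      # invert the forces mid-run
        force = FORCE_LEFT if key == 'left' else FORCE_RIGHT if key == 'right' else 0.0
        if inverted:
            force = -force
        x, v, theta, omega, done = env.step(force)   # the observables shown on screen
        score += 1                       # +1 reward for each successful iteration
        time.sleep(UPDATE_MS / 1000)     # async mode: wait for the next update
    return score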
Condition 2: COLOR CODING
You are presented with a color for each of the variables as well as an arrow indicating whether the value is negative (pointing to the left) or positive (pointing to the right). Green means a value around 0; the redder the color becomes, the closer you are to either edge. For v and omega (both not in the solution-space restriction), red implies a high value and green a low value. Everything else is the same as in Condition 1.
Condition 3: COLORS & LINES
In the colors_lines setting the same colors are used as in Condition 2; additionally, a line is drawn on each bar, giving you more precise information about the current state of the observables. Otherwise things are the same as in Conditions 1 and 2.
Condition 4: ANIMATION
In the animation setting an animation of the cart-pole is presented to you, including the cart (rectangle), the pole (line towards the top of the image), the velocity (line inside the cart pointing to the left or right), and the angular velocity (line at the top of the pole), indicating the current position and velocity of the cart. Otherwise this is the same as in Condition 1.
Your task:
IMPORTANT: Read the full instructions before beginning.
- Your task is to get good at performing the task in all 4 Conditions. Record the data for your training in each condition.
- Apply the following settings to the environment (if you downloaded and use the provided environment, this should already be the case); an illustrative code sketch of these settings follows this task list:
- All variables (x, v, theta, omega) are observables.
- Apply noise to all of these variables at the same time, with a mean of 0 and a standard deviation of:
- x: 0.2 m
- v: 0.2 m/s
- theta: 1.0 deg
- omega: 0.2 rad/s
- Set the environment to run asynchronously.
- Play the game in Conditions 1, 2, 3, and 4, in that order, for at least 10 epochs each (or better: until you are confident playing in the condition, but for at least 10 epochs), and note for each condition your
- highest score,
- average score and standard deviation,
- median score.
- Invert the forces by pressing the “i” key on your keyboard during a run (after 5-10 restarts/fails) and continue for another 5-10 epochs. Do this in all four conditions in the same order as previously (i.e. redo instruction 3 with this inversion). What can you say about your learning speed with force inversion?
- Apply the following settings to the environment (all of them at the same time):
- Only the variables x, v, and omega are observables.
- Do not apply noise.
- Set the environment to run asynchronously.
- Then replay the game as stated under instruction 3.
- Change the environment in any way you like to make it harder (or easier) to play; note your changes and your scores accordingly.
- Implement the adjust_task function and at least one other function in the cart_pole_env.py file (e.g. apply_action_noise, apply_discretization, etc.); a possible skeleton for these functions is sketched after this task list.
- Play the game using those implemented conditions (one by one and all of them together).
- Again note your scores under these different conditions.
- Reset the settings back to the ones from the beginning and replay the game (as described in the third instruction).
- Compare your results from the first tries from number 3 to the others, and especially the last tries from number 8. What can you conclude about the possibilities of cumulative, life-long learning?
- Report. Write a report on your results, including the different scores and comparisons between the different tries. Discuss the advantages and disadvantages of human learning (and human nature); this might include (but is not restricted to):
- Previously acquired knowledge used in this game.
- Cumulative learning.
- Transfer learning.
- Boredom.
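As referenced in the settings item above, here is a sketch of how the observation noise and the asynchronous setting from the task list might look in code. The names (NOISE_STD, apply_observation_noise, CartPoleEnv, observables, noise_std, sync) are assumptions for illustration only; check main.py and cart_pole_env.py for the actual names and adapt accordingly.

import math
import numpy as np

# Zero-mean Gaussian noise with the standard deviations from the task list
# (theta is given in degrees; convert only if your environment works in radians).
NOISE_STD = {
    "x": 0.2,                    # m
    "v": 0.2,                    # m/s
    "theta": math.radians(1.0),  # 1.0 deg
    "omega": 0.2,                # rad/s
}

def apply_observation_noise(state, rng=None):
    """Add zero-mean Gaussian noise to an observation given as {name: value}."""
    rng = rng or np.random.default_rng()
    return {name: value + rng.normal(0.0, NOISE_STD[name]) for name, value in state.items()}

# Hypothetical usage, assuming the environment accepts such keyword arguments:
# env = CartPoleEnv(observables=("x", "v", "theta", "omega"), noise_std=NOISE_STD, sync=False)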
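Likewise, for the adjust_task instruction, a possible skeleton of adjust_task and apply_action_noise is sketched below. The signatures and the environment attributes used (theta_threshold, dt) are assumptions; treat this only as a starting point and match it to the stubs actually present in cart_pole_env.py.

import numpy as np

def adjust_task(env, iteration):
    """Sketch: gradually make the task harder as the run progresses.
    The attributes theta_threshold and dt are assumed to exist on env."""
    if iteration > 0 and iteration % 100 == 0:
        env.theta_threshold *= 0.9           # tighter allowed pole angle
        env.dt = max(0.01, env.dt * 0.9)     # faster simulation steps

def apply_action_noise(force, std=1.0, rng=None):
    """Sketch: perturb the applied force with zero-mean Gaussian noise."""
    rng = rng or np.random.default_rng()
    return force + rng.normal(0.0, std)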
Further information
Try not only to do what you are asked, but also to investigate learning further by yourself (if you have the time to do so, of course). Some of the things you could do:
- Change dt to make the simulation run faster or slower.
- Try out the synchronous environment - this is actually how reinforcement learners would play the game.
- Try the other functions which you did not implement in step 7 earlier. How do changes to the task-environment influence your learning?
- Try even more (for example change the observation state during a run (e.g. after 100 iterations)).
- What else can you think of?
- You can adjust the plot_iterations function in the env.py file to plot additional information like mean, std_dev, etc. (a small helper for computing such statistics is sketched below).
- Generally speaking: feel free to add and/or remove anything you like. This is an assignment where you should reflect on your own learning abilities. Whatever you think might help you gain better insight into your own learning is more than welcome. Please describe all changes you have made in the report!
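For noting the highest, average, standard deviation, and median score per condition (and for extending plot_iterations as mentioned above), a small helper like the following may be handy; it only assumes that you record your per-epoch scores in a list.

import numpy as np

def summarize_scores(scores):
    """Summarize the per-epoch scores recorded for one condition."""
    scores = np.asarray(scores, dtype=float)
    return {
        "highest": scores.max(),
        "mean": scores.mean(),
        "std_dev": scores.std(ddof=1),   # sample standard deviation
        "median": np.median(scores),
    }

# Example with made-up scores from 10 epochs in one condition:
print(summarize_scores([12, 35, 28, 41, 33, 56, 60, 48, 72, 80]))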
Try to make a point about the advantages and disadvantages of cumulative/human learning.
Set up the conditions:
For most of the tasks given to you, all you need is to change parts of main.py or cart_pole_env.py. We hope it is self-explanatory; if any questions arise, please ask!