Aim: This assignment is meant to give you a better insight into deep reinforcement learning and its difference to cumulative and life-long learning on the example of human learning.
Summary: In this first exercise you are asked to evaluate a given Deep-Reinforcement Learner (an actor-critic learner, to be specific; for further information see Konda & Tsitsiklis 2000) in different task-environments, coded in Python. The task-environment (or just task, really) is the well-known cart-pole task: Balancing a pole on a moving platform, in 1-D (left and right movements).
Additionally you are supposed to get a better idea on cumulative learning by example of a human learner - i.e. yourself. Furthermore, this assignment will highlight the importance of perception for learning and task solving. You are given Python code with an implementation of the same cart-pole task as the RL designed as a game for a human to play. For this, 4 different conditions have been implemented, giving you the chance to experience the importance of the presentation of data for yourself.
Install python3 on your computer (https://www.python.org/downloads/).
Download the attached zip file and extract it to some location and cd into the folder.
Install the included requirements.txt file:
$ pip install -r requirements.txt
Run the code:
$ python main.py
For the first task (Deep-Reinforcement-Learning) you will need to install pytorch. Since this is different depending on which OS you use and whether you have a GPU which supports CUDA (Nvidia GPUs only) you should follow the installation instructions here.
In the file “ac_learner.py” you can find the source code of an actor critic learner.
In the file “cart_pole_env.py” you can find the source code for the task environment. In order to evaluate the learner on the different environment settings the following methods are of importance:
Apply noise only to the observations received by the agent, NOT the internal environment variables
Apply noise to the force applied to the cart-pole
Apply noise to the dynamics of the environment, but not to the observations of the agent
Change the task in a chosen manner
discretize the data passed to the agent
In each of these methods you can implement a different method to adjust the task or the information passed to or from the agent. In “env.py” a helper class for noise is included, which you can use.
After the agent runs for the defined max number of epochs a plot is created of iterations per epoch. Use this to evaluate the learning performance of the learner. Extend the plots to include whatever information you deem to be important.
HINT You are allowed to change any parts of the learner or the environment, just make sure to document all changes and explain how and why they influence the learning performance of the agent.
Condition 1: SYMBOLIC
You are presented with an alphanumeric display of a continuously updated state of observable variables relevant to the task (e.g. x, v, theta, and omega). With the arrow keys you can apply a force of -10 (left) or +10 (right) Newton to the cart. Your task is to keep the pole upright for as long as possible. In the top-right you can see your current score (the total reward you achieved in this epoch (+1 for each successful iteration)), in the center are the values of the observables. You can set the environment to run synchronous or async, meaning that, if sync is not set the environment updates automatically after 100 ms. Further, you can invert the forces by pressing the i key on your keyboard.
Condition 2: COLOR CODING
You are presented with colors for each of the variables as well as an arrow indicating if the value is negative (pointing to the left) or positive (pointing to the right). Green means a value around 0, the redder it becomes the closer to either edge you are. For v and omega (both not in the solution space restriction) red implies a high value, green a low value. Other things are the same as in Condition 1.
Condition 3: COLORS & LINES
In the setting colors_lines the same colors are used as in Level 2; however, additionally a line is drawn on the bar giving you additional information of the current state of the observables. Otherwise things are the same as in Conditions 1 and 2.
Condition 4: ANIMATION
In the animation setting an animation of the cart-pole is presented to you, including the cart (rectangle), the pole (line to the top of the image), the velocity (line inside the cart to the left or right), and the angular velocity (line at the top of the pole) indicating the current position and velocity of the cart. Otherwise this is the same as in Conditions 1.
IMPORTANT: Read the full instructions before beginning.
Try to not only do what you are asked, but rather investigate learning further by yourself (if you have the time to do so, of course). Some of the things you could to:
Try to make a point on the advantages and disadvantages of cumulative/ human learning.
Set up the conditions:
For most of the tasks given to you all you need is to change parts in the main.py or cart_pole_env.py, I hope it is self explaining, if any questions arise please ask!