Reinforcement Learning
Deep Q Networks

Assignment 2
Due Date: November 25, 2024

Exercises

In this set of exercises, you will implement the DQN algorithm and several of its extensions. Your task is to implement the algorithms, evaluate them on benchmark environments, and answer questions about their performance and properties. You may work alone or in groups of two. If you work in a pair, include both names and UCOs in your report. To pass the homework, you need to get at least 70 out of 100 points.

Implementation (max. 40 points)

Complete the classes DQNNet and DQNTrainer in the hw2.py file. Submit the modified source file to the file vault (odevzdávárna) called code_hw2 in the IS.

Gym Environments

In contrast to the previous assignment, no environment wrapper will be used this time. Instead, you will interact directly with the environments from the OpenAI Gym library. An example interaction is shown in the train function of DQNTrainer. For further details, see the documentation. You will evaluate your algorithms on the following environments: CartPole-v1, Acrobot-v1, and LunarLander-v2.

Logging

As in the previous homework, you can use the Logger class to log data during training to a CSV file.

Interface

You will complete the trainer class for DQN, which is required to have the constructor and the train method. The train method should train a new DQNPolicy from scratch in a provided environment and return it. The signatures (positional arguments, return type) of the prepared train method and the constructor must remain unchanged for evaluation purposes; however, you are free to add your own methods or your own training (hyper)parameters via keyword arguments. Once again, make sure you provide default values for these parameters, since we will only supply the positional arguments during the evaluation.

We have also provided several helper functions that should aid you in starting the assignment. They illustrate some core PyTorch functionality, such as how to train your model, as well as some useful tensor operations. However, feel free to delete or modify these methods as you like, since they are not part of the required interface. You can also check out this PyTorch tutorial, which gives some more examples.

DQN and Its Extensions

[25 points] Complete the train method of the trainer class DQNTrainer to implement the DQN algorithm. It should support three modes, 'dqn', 'dqn+target', and 'ddqn', specified by the argument mode. See the Algorithms Overview section for a brief description of the differences. Use an ε-greedy policy for exploration.

[5 points] Modify the train method of the DQNTrainer class to support linear and/or exponential decay of the exploration rate ε over time. The initial value of ε is specified by the argument initial_eps, and its final value by the argument final_eps. You can experiment with various decay schedules; two common schedules are sketched below.

[10 points] Modify the train method of the DQNTrainer class to support n-step bootstrapping. The number of steps is specified by the argument n_steps; see the sketch of an n-step helper below.
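A minimal sketch of the two decay schedules mentioned above, assuming ε is recomputed from the current environment step; the helper names and the decay_steps parameter are illustrative assumptions, not part of the required interface:

```python
def linear_eps(step, initial_eps, final_eps, decay_steps):
    """Linearly anneal epsilon from initial_eps to final_eps over decay_steps
    environment steps, then keep it at final_eps."""
    frac = min(1.0, step / decay_steps)
    return initial_eps + frac * (final_eps - initial_eps)


def exponential_eps(step, initial_eps, final_eps, decay_steps):
    """Multiplicative decay that reaches final_eps after decay_steps steps
    (requires final_eps > 0), then stays at final_eps."""
    rate = (final_eps / initial_eps) ** (1.0 / decay_steps)
    return max(final_eps, initial_eps * rate ** step)
```

Inside the training loop, the exploration rate would then be refreshed once per step, e.g. eps = linear_eps(step, initial_eps, final_eps, decay_steps), before the ε-greedy action selection.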
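For n-step bootstrapping, one possible approach is to keep a sliding window of the most recent transitions and collapse it into a single n-step transition before storing it in the replay buffer. The sketch below shows one way to do this; the helper name and the returned discount term are assumptions, not part of the provided interface:

```python
def make_n_step_transition(window, gamma):
    """Collapse consecutive transitions (s, a, r, s_next, done) into a single
    n-step transition (s, a, R, s_n, done_n, gamma_n), where
    R = sum_k gamma^k * r_k and gamma_n = gamma^(number of steps used), so the
    bootstrapped target is R + gamma_n * max_a Q(s_n, a) * (1 - done_n)."""
    state, action = window[0][0], window[0][1]
    n_step_return, discount = 0.0, 1.0
    for _, _, reward, next_state, done in window:
        n_step_return += discount * reward
        discount *= gamma
        last_state, last_done = next_state, done
        if done:  # the episode ended inside the window; do not look past it
            break
    return state, action, n_step_return, last_state, last_done, discount
```

During the rollout, the window can be a collections.deque(maxlen=n_steps); once it is full (and again for the shorter suffixes left over at the end of an episode), the collapsed transition is pushed into the replay buffer.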
Evaluation (max. 60 points)

Submit your solutions to the following exercises in a single PDF file to the file vault (odevzdávárna) called report_hw2 in the IS. Evaluate your implementation on the environments CartPole-v1, Acrobot-v1, and LunarLander-v2.

• Run the algorithm for 50000 steps on the CartPole and Acrobot environments and for 100000 steps on Lunar Lander. Experiment with different hyperparameters. Use the discount factor γ = 0.99; the other hyperparameters are up to you.¹ We have provided reasonable default values for them in the source code to get you started easily. However, please keep the batch size < 128 so that we are able to evaluate your implementation. With a reasonable amount of hyperparameter tuning, you should, on average, achieve approximately the following mean undiscounted returns in each environment: 400 on CartPole, −100 on Acrobot, and 200 on Lunar Lander.
• Repeat the training process multiple times and aggregate the results.
• [7 points] Plot how the mean discounted and undiscounted return of the trained policy and the value estimate max_a Q_θ(s_0, a) of the initial state change over time. Use whatever plot you think is the most informative; ideally, it should display more than merely average values [+4 points].
• [10 points] Describe the obtained figures.

Answer the following questions in your report. Make conjectures based on your experimental results and your understanding of the algorithms. Keep your responses brief and concise.²

• [15 points] Compare the results for the individual DQN modes and explain the differences in performance.
• [12 points] What is the impact of the decay of the exploration rate on the performance of the DQN algorithm?
• [12 points] What is the impact of n-step bootstrapping on the performance of the DQN algorithm? Experiment with various values of n.

¹ You do not have to train all combinations of the extensions. It is up to you to select a reasonable subset.
² The assignment is intentionally open-ended, encouraging you to experiment with the algorithms and consider their properties. If you would feel more comfortable with additional structural guidelines, three reasonable bullet points for each question should be sufficient.

Algorithms Overview

Deep Q-Network (Lecture 7)

DQN is a model-free reinforcement learning algorithm based on Q-learning that leverages neural-network approximators for estimating the Q-function. The line of research on DQN-based algorithms includes several features that improve the overall stability of learning, namely replay buffers, target networks, and double Q-learning.

Vanilla DQN ('DQN')

The simplest version of DQN uses a replay buffer to store transitions experienced in the past. Every step through the environment, the transition (s, a, r, s') is stored in the buffer for later training. Whenever the number of samples in the buffer exceeds its capacity, the oldest samples are discarded. Every K steps through the environment, the algorithm samples a mini-batch of size B from the buffer to train the Q-network. Let us define an MSE loss, which is a differentiable equivalent of the Bellman equation:

    L(θ, θ') = (1/B) · Σ_{i=1}^{B} ( Q_θ(s_i, a_i) − y_i )²,

where (s_i, a_i, r_i, s'_i) is the i-th sampled transition and y_i is the target, calculated as follows:

    y_i = r_i                                    if s'_i is terminal,
    y_i = r_i + γ · max_{a'} Q_{θ'}(s'_i, a')     otherwise.

The vanilla DQN update proceeds by taking a semi-gradient descent step in the direction ∇_θ L(θ, θ') evaluated at θ' = θ. Note that y_i is treated as a constant when calculating the gradient with respect to θ, even though θ' is later set to θ; hence the name "semi"-gradient. The update is in line with the Bellman update, which computes the new Q-value estimate from the old one.
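A minimal PyTorch sketch of this update for one sampled mini-batch; the dqn_update function, its arguments, and the layout of the batch tuple are illustrative assumptions rather than the provided interface. In the plain 'dqn' mode, target_net is simply the online network itself, and torch.no_grad() is what makes y_i a constant in the gradient computation:

```python
import torch
import torch.nn.functional as F


def dqn_update(q_net, target_net, optimizer, batch, gamma):
    """One semi-gradient step on the MSE loss L(theta, theta').

    `batch` is assumed to hold tensors: states (B, obs_dim), actions (B,),
    rewards (B,), next_states (B, obs_dim), and dones (B,) with 1.0 for terminal.
    In the plain 'dqn' mode, target_net can simply be q_net itself.
    """
    states, actions, rewards, next_states, dones = batch

    # Q_theta(s_i, a_i) for the actions actually taken
    q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # y_i = r_i for terminal s'_i, r_i + gamma * max_a' Q_theta'(s'_i, a') otherwise;
    # no_grad() treats y_i as a constant w.r.t. theta (the "semi"-gradient).
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```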
DQN with Target Network ('DQN+target')

Unlike the discrete Bellman operator, semi-gradient descent needs to perform multiple updates to converge to the optimal minimizer θ* := argmin_θ L(θ, θ'). This results in a stability issue: changing the parameters θ also changes the parameters θ' := θ, and therefore both the target values y and the expected loss minimizer θ* constantly shift as we optimize θ. The DQN algorithm with a target network resolves this issue by decoupling θ' from θ. It maintains a separate, lagged target network with parameters θ', which are updated less frequently than the original network with parameters θ. To update θ, the algorithm performs a gradient descent step on the loss L(θ, θ'), this time with a fixed value of θ'. There are two basic approaches to updating the target parameters θ':

1. Hard updates: every M learning steps, copy the parameters, θ' ← θ.
2. Soft updates, or "Polyak averaging": every learning step, set θ' ← ρ · θ + (1 − ρ) · θ', where ρ is a sufficiently small constant (e.g. 0.005).

A hard update every M steps and a soft update with ρ := 1/M should be somewhat comparable in terms of their effect on the learning process. You can implement either, or experiment with both; both rules appear in the sketch below.

Double DQN ('DDQN')
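The 'ddqn' mode corresponds to double Q-learning: the online network selects the greedy next action and the target network evaluates it. A minimal sketch of the two target-update rules above and of this target; the function names and tensor shapes are illustrative assumptions, not the provided interface:

```python
import torch


def hard_update(target_net, q_net):
    """Hard update: copy theta' <- theta (called every M learning steps)."""
    target_net.load_state_dict(q_net.state_dict())


def soft_update(target_net, q_net, rho=0.005):
    """Polyak averaging: theta' <- rho * theta + (1 - rho) * theta' (every learning step).
    Copies only parameters; if the network had buffers (e.g. batch norm), copy those too."""
    with torch.no_grad():
        for tp, p in zip(target_net.parameters(), q_net.parameters()):
            tp.mul_(1.0 - rho).add_(rho * p)


def ddqn_targets(q_net, target_net, rewards, next_states, dones, gamma):
    """Double-DQN target: the online net picks a* = argmax_a' Q_theta(s', a'),
    the target net evaluates it as Q_theta'(s', a*)."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q
```

The resulting targets plug into the same MSE loss and semi-gradient step as before; in the 'dqn+target' mode, the max over the target network's own Q-values is used instead.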