Reinforcement Learning: Policy Gradient
Assignment 3
Due Date: December 16, 2024

Exercises

In this last homework, you will implement a variant of the policy gradient algorithm and compete with your classmates in a tournament. You may work alone or in groups of two. If you work in a pair, include both names and UČOs in your report. To pass the homework, you need to meet the minimal implementation requirements and submit the report. This time, neither plotting nor analysis is required ;).

Tournament Rules

What environments will you compete on? CartPole-v1, Acrobot-v1, LunarLander-v2, and CarRacing-v2. You may focus on one or more of the environments. The environments are listed in order of increasing difficulty.

What algorithm can you use? Any policy gradient-based algorithm. That includes all algorithms described in the section Algorithms Overview, as well as any other policy gradient algorithm you find in the literature. The concrete version, modifications, and extensions are up to you.

Can you play with the task itself? Yes, you can use reward shaping, modify the observations, or even change the action space through wrappers of the provided environments.

What will be evaluated? We will run your algorithm on each of the four environments 10 times (i.e., train 10 policies) and count the second-best score. A leaderboard of the achieved scores will be published for each environment.

What will you win? Glory and a better understanding of state-of-the-art RL algorithms. Good results will also be reflected in the final exam. The most impressive algorithm, based on the evaluation, will be awarded a very special prize¹.

¹ Please report your food allergies :p

Implementation

Implement the functions train_cartpole, train_acrobot, train_lunarlander, and train_carracing. Submit the modified source file to the file vault (odevzdávárna) called code_hw3 in the IS.

Interface

For the tournament, we only require you to implement the functions that train the policy on the given environment: train_cartpole, train_acrobot, train_lunarlander, and train_carracing. This allows you to fine-tune the hyperparameters for each environment separately.

Besides the required interface, we provide a template for the policy gradient algorithm to help you get started. However, you are free to modify the code as you see fit. You can even add wrappers to the provided environments that will, e.g., trim the episodes after a certain number of steps or modify the observations (e.g., downsample the image); see the sketch below. This might be especially useful for the CarRacing environment. As always, do not introduce new imports, so that the evaluation script can run your code without any issues.
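As an illustration of the wrapper idea, here is a minimal sketch of an observation wrapper that grayscales and downsamples the CarRacing image. The class name, target resolution, and the use of gymnasium with continuous=False are our own illustrative choices, not part of the provided template; adapt the sketch to whatever the template actually imports and provides.

# Hypothetical example: grayscale and downsample CarRacing frames.
# The class name, target size, and use of gymnasium are illustrative assumptions;
# reuse the imports and environment constructors from the provided template.
import numpy as np
import gymnasium as gym

class DownsampleObservation(gym.ObservationWrapper):
    def __init__(self, env, size=48):
        super().__init__(env)
        self.size = size
        self.observation_space = gym.spaces.Box(
            low=0.0, high=1.0, shape=(size, size), dtype=np.float32
        )

    def observation(self, obs):
        # Convert the (96, 96, 3) RGB frame to grayscale in [0, 1].
        gray = obs.mean(axis=-1) / 255.0
        # Downsample by striding (96 / 48 = 2); simple but effective.
        stride = obs.shape[0] // self.size
        return gray[::stride, ::stride].astype(np.float32)

env = DownsampleObservation(gym.make("CarRacing-v2", continuous=False), size=48)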
Minimal Implementation Requirements

Implement the actor-critic algorithm with GAE advantage estimation. It should work on the CartPole-v1 environment (i.e., obtain a score of at least 100).

Report

Implementation Description

As the first part of your report, write a short paragraph that guides us through your implementation and explains your design choices. Which extensions of the policy gradient algorithm have you implemented, and why did you choose them? Have you implemented any other extensions that are not mentioned in the assignment? Lastly, give your agent a name that we can feature in the leaderboard.

Feedback

We would like to hear your feedback on the implementation and evaluation parts of the assignments. As the second part of the report, please answer the following questions:

• Did you find the implementation part overall interesting and helpful?
• Did you find the evaluation part overall interesting and helpful?
• What should definitely be kept in future assignments? Either you enjoyed it or found it useful.
• What should we work on in future assignments? Either you found it boring or not helpful.
• Any other feedback you would like to share?

Algorithms Overview

Policy Gradient Methods

Policy gradient methods are a class of model-free reinforcement learning algorithms that directly optimize a parameterized policy π_θ(a|s) using gradient-based optimization techniques. These methods aim to maximize the expected cumulative reward by updating the policy parameters θ in the direction of the gradient of the objective.

Vanilla Policy Gradient

The simplest policy gradient algorithm is REINFORCE, which uses the Monte Carlo estimate of the return to update the policy. The objective function for policy optimization is defined as

    J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right],

where τ represents a trajectory (s_0, a_0, r_0, s_1, ..., s_T) sampled from the policy π_θ. Using the policy gradient theorem, the gradient of the objective can be expressed as

    \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right],

where G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k is the discounted return from time step t onwards, and γ ∈ (0, 1) is the discount factor. The REINFORCE update is performed by sampling N trajectories, calculating G_t for each time step, and taking gradient ascent steps with the gradient estimate

    \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i-1} \gamma^t \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \cdot G_{i,t}.    (1)

Here, T_i is the length of trajectory i. For every gradient update, a new batch of trajectories needs to be sampled.

REINFORCE with Baseline

The variance of the REINFORCE gradient estimator can be high, making convergence slow and unstable. This is due to various reasons, including the "offset" of the return G_t from the expected return. If, say, all returns were large and positive, every term in formula (1) would push the policy to increase the probability of the action taken. Only with many samples would the actions with high returns gain more weight in the resulting direction, causing the probability of the underperforming actions to decrease. To address this, we can directly estimate whether an action underperforms with respect to a certain baseline value. Namely, we can choose any function b(s_t) that depends only on the state and subtract it from the return, yielding the following gradient estimate:

    \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t))\right].

A common choice for b(s_t) is the state value function V_ϕ(s_t), which represents the expected return from state s_t under the current policy. It can be shown that the baseline does not affect the expected value of the gradient estimate; it can, however, significantly reduce the variance of the estimator. The term G_t − V_ϕ(s_t) is called the advantage function, as it measures the relative benefit of taking action a_t compared to the average return from state s_t under the current policy. The value function V_ϕ(s) can be learned using any suitable evaluation method, such as Monte Carlo estimation or temporal difference learning. Examples of update rules for the value function include:

    Monte Carlo:  \phi \leftarrow \phi + \alpha (G_t - V_\phi(s_t)) \nabla_\phi V_\phi(s_t),
    TD(0):        \phi \leftarrow \phi + \alpha (r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)) \nabla_\phi V_\phi(s_t),
    TD(n):        \phi \leftarrow \phi + \alpha (G_t^{(n)} - V_\phi(s_t)) \nabla_\phi V_\phi(s_t).
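To make the above concrete, here is a minimal sketch of a REINFORCE-with-baseline update, assuming a PyTorch policy network `policy` that returns action logits and a value network `value_fn`; these names, the discrete-action assumption, and the loss weighting are illustrative choices, not part of the provided template.

# Minimal sketch of REINFORCE with a learned baseline (not the provided template).
# Assumes PyTorch; `states` is a float tensor of shape (T, obs_dim), `actions` a long
# tensor of shape (T,), and `rewards` a list or 1-D tensor of per-step rewards.
import torch

def discounted_returns(rewards, gamma):
    # Backward pass computing G_t = sum_{k >= t} gamma^(k - t) * r_k.
    returns = torch.empty(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def reinforce_baseline_loss(policy, value_fn, states, actions, rewards, gamma=0.99):
    returns = discounted_returns(rewards, gamma)           # G_t
    values = value_fn(states).squeeze(-1)                  # V_phi(s_t)
    advantages = returns - values.detach()                 # G_t - b(s_t)
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    # Note: the per-step gamma^t weighting from equation (1) is omitted here,
    # as is commonly done in practice (see the actor-critic section below).
    policy_loss = -(log_probs * advantages).mean()         # minimize the negative objective
    value_loss = (returns - values).pow(2).mean()          # Monte Carlo target for the critic
    return policy_loss + 0.5 * value_loss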
Actor-Critic Methods

In actor-critic methods, the value function estimate V_ϕ(s) is used not only as a baseline for the policy gradient estimator but also to guide the policy optimization. In this scheme, the policy is called the actor, and the value function is called the critic, as it directly tells the actor how good the action taken was. The critic estimates the value of states and helps to reduce the variance of the policy gradient estimator, similar to how bootstrapping is used in Q-learning and SARSA. Concretely, the Monte Carlo estimate of the return G_t is replaced by the estimate r_t + γ V_ϕ(s_{t+1}), by G_t^{(n)}, or by any other suitable estimate of the expected return. A concrete update rule for the actor can be the following:

    \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t))\right].

Note that the discount factor γ^t is often omitted in actor-critic methods. This simplifies the implementation and assumes that we are actually interested in the non-discounted return. However, note that this simplification does not correspond to the gradient of any reasonable objective function and thus loses convergence guarantees.

Generalized Advantage Estimation (GAE)

While the TD(0) advantage estimate has low variance, it can have high bias, especially in the early stages of learning. On the other hand, TD(n) or even Monte Carlo (which can be seen as TD(T)) estimates have lower bias but much higher variance. Generalized Advantage Estimation introduces a hyperparameter λ that controls the trade-off between bias and variance in the advantage calculation. Let us use \hat{A}_t^k to denote the k-step advantage estimate at time step t:

    \hat{A}_t^k = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}) - V(s_t).

Note that V(s_T) should be set to zero if it is a terminal state. The GAE advantage estimate is then defined as the following weighted sum of the k-step advantage estimates:

    \hat{A}_t^{GAE(\lambda)} = (1 - \lambda) \sum_{k=1}^{T-1} \lambda^{k-1} \hat{A}_t^k + \lambda^{T-1} \hat{A}_t^T.

The formula above can be rewritten as

    \hat{A}_t^{GAE(\lambda)} = \sum_{k=0}^{T-1} (\gamma\lambda)^k \delta_{t+k},

where δ_{t+k} is the TD residual, defined as

    \delta_{t+k} = r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k}).

This allows us to compute the GAE estimate for every time step in an episode in a single linear backward pass, using the recursive formula

    \hat{A}_t^{GAE(\lambda)} = \delta_t + \gamma\lambda \, \hat{A}_{t+1}^{GAE(\lambda)}.

By adjusting λ ∈ [0, 1], we can tune the bias-variance trade-off: λ = 0 corresponds to TD(0) (high bias, low variance), and λ = 1 approaches Monte Carlo estimation (low bias, high variance).
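The backward recursion above translates directly into code. The following is a minimal sketch assuming NumPy arrays of per-step rewards and value estimates; the function name and array conventions are our own choices.

# Minimal sketch of the GAE backward pass (names and conventions are illustrative).
# `rewards` has length T; `values` has length T + 1, with values[T] = 0 for a terminal state.
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual delta_t
        gae = delta + gamma * lam * gae                          # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages[t] = gae
    return advantages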
Proximal Policy Optimization

The disadvantage of the policy gradient methods discussed so far is that they cannot reuse the collected data for multiple gradient updates. This is because the policy changes during training, and the collected trajectories no longer reflect the updated policy. To address this issue, importance sampling is introduced, which allows us to correct for the mismatch between the policy used to collect the data (the behavioral policy) and the current policy being optimized (the target policy). The importance sampling ratio is defined as

    r_t(\theta) = \prod_{k=0}^{t} \frac{\pi_\theta(a_k|s_k)}{\pi_{\theta_{old}}(a_k|s_k)} = \frac{\pi_\theta(a_0|s_0)}{\pi_{\theta_{old}}(a_0|s_0)} \cdot \frac{\pi_\theta(a_1|s_1)}{\pi_{\theta_{old}}(a_1|s_1)} \cdots \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}.

The policy gradient objective with importance sampling is then defined as

    L^{IS}(\theta) = \mathbb{E}_t\left[r_t(\theta) \hat{A}_t\right],

where \hat{A}_t is the advantage estimate. However, this formulation can lead to high variance and instability in optimization. To simplify the computation and reduce the variance, we retain only the last multiplicative term, resulting in a per-time-step importance sampling ratio:

    r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}.

This simplification stabilizes the optimization process; however, it assumes that the policy does not deviate too much from the behavioral policy. For that reason, Proximal Policy Optimization (PPO) constrains the importance sampling ratio to prevent overly large updates. The PPO objective is defined as

    L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t\right)\right],

where \hat{A}_t is the advantage estimate, and ϵ is a small hyperparameter that constrains the importance sampling ratio. By clipping the ratio, PPO prevents updates that would steer the policy too far from the behavioral policy (a sketch of the clipped loss is given at the end of this document). This combination of importance sampling and clipping makes PPO both robust and sample efficient, establishing it as a widely used algorithm in reinforcement learning.

Disclaimer: The real implementation of the PPO algorithm is more complex and includes additional tricks to stabilize the training process. It is rather difficult to implement PPO from scratch, so do not be discouraged if PPO initially performs worse than REINFORCE with GAE. For useful tips on how to implement PPO, see the following blog post https://ppo-details.cleanrl.dev/2021/11/05/ppo-implementation-details/ and the references to concrete implementations therein.
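As a companion to the L^CLIP formula above (and explicitly not a full PPO implementation), here is a minimal sketch of the clipped surrogate loss, assuming PyTorch tensors of current and stored log-probabilities and of advantages (e.g., from GAE); all names are illustrative.

# Minimal sketch of the clipped surrogate loss L^CLIP (not a full PPO implementation).
# Assumes PyTorch tensors: `log_probs` under the current policy, `old_log_probs`
# recorded when the data was collected, and `advantages` (e.g., from GAE).
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, epsilon=0.2):
    ratio = torch.exp(log_probs - old_log_probs)                       # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()                       # minimize the negative objective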