SimplyPPO: A Minimal Proximal-Policy-Optimization PyTorch Implementation

SimplyPPO replicates PPO in roughly 250 lines of clean, readable PyTorch, while trying to use as few additional tricks and hyper-parameters as possible.
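For context, the heart of PPO is the clipped surrogate objective from the original paper. Below is a minimal sketch of that loss in PyTorch; the function name and the clip_eps value are illustrative, not necessarily what this repository uses.

```python
import torch

def clipped_surrogate_loss(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s).
    ratio = torch.exp(new_log_prob - old_log_prob)
    # Clipped surrogate objective from the PPO paper (maximized, so negate for a loss).
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```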

Implementation details:

  • Advantage and state normalization.
  • Gradient clipping.
  • Entropy bonus.
  • Tanh squashing to enforce action bounds, plus log_std clamping (as in SAC); see the sketch below.

That's it! All other things follow the original paper.
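To illustrate the last item, here is a minimal sketch of a tanh-squashed Gaussian actor with log_std clamping, roughly in the SAC style. The layer sizes, clamp range, and all names are assumptions for illustration, not the exact code in this repository.

```python
import torch
import torch.nn as nn

LOG_STD_MIN, LOG_STD_MAX = -20, 2  # assumed clamp range, as in common SAC implementations

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.net(state)
        mu = self.mu(h)
        log_std = self.log_std(h).clamp(LOG_STD_MIN, LOG_STD_MAX)
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()      # pre-squash Gaussian sample
        a = torch.tanh(u)       # tanh squashing keeps actions in [-1, 1]
        # Log-prob correction for the tanh change of variables (as in the SAC paper).
        log_prob = dist.log_prob(u).sum(-1) - torch.log(1 - a.pow(2) + 1e-6).sum(-1)
        return a, log_prob
```

The correction term accounts for the tanh change of variables, so the clipped objective is computed with the log-probability of the squashed action rather than the raw Gaussian sample.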

Also check out SimplySAC, a minimal Soft-Actor-Critic PyTorch implementation.

Note

This is a single-threaded PPO implementation for continuous control tasks. The particular implementation of state normalization is adopted from here, where various other tricks are also discussed.
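As a rough illustration of that trick, a running mean/std normalizer along the following lines is commonly used. This is a sketch only and is not necessarily identical to the linked implementation; the class name and update rule are assumptions.

```python
import numpy as np

class RunningMeanStd:
    """Tracks a running mean and variance of observed states (Welford-style single-sample updates)."""
    def __init__(self, shape):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 1e-4  # small initial count avoids division by zero

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x):
        return (x - self.mean) / (np.sqrt(self.var) + 1e-8)
```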

PyBullet benchmarks:

You can find the performance of Stable Baselines3 here as a reference.

[Learning curves: hopper_b, walker_b, cheetah_b, ant_b]

These figures are produced with:

  • One evaluation episode every 1e4 steps.
  • 5 random seeds, with the solid line showing the mean return and the shaded area the max/min returns.
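A sketch of how such curves could be produced, assuming the per-seed evaluation returns are stored in an array of shape (n_seeds, n_evals); the file name and layout below are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

returns = np.load('returns.npy')                    # hypothetical: shape (n_seeds, n_evals)
steps = np.arange(1, returns.shape[1] + 1) * 1e4    # one evaluation episode every 1e4 steps

plt.plot(steps, returns.mean(axis=0))               # solid line: mean return over seeds
plt.fill_between(steps, returns.min(axis=0), returns.max(axis=0), alpha=0.3)  # shaded: max/min
plt.xlabel('environment steps')
plt.ylabel('episode return')
plt.show()
```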

To execute a single run:

python learn.py -g [gpu_id] -e [env_id] -l [log_id]

Experiments use pybullet==3.0.8.
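The command-line flags suggest an argument parser roughly like the following sketch; the types, defaults, and help strings are assumptions, not the repository's actual definitions.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-g', type=int, default=0, help='GPU id to run on (assumed)')
parser.add_argument('-e', type=str, default='HopperBulletEnv-v0', help='PyBullet environment id (assumed)')
parser.add_argument('-l', type=int, default=0, help='log id used to label the run (assumed)')
args = parser.parse_args()
```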
