Objective: Benchmark Reinforcement Learning (RL) algorithms from Stable Baselines 2.10 on OpenAI Gym environments. The codebase focuses on verifying reproducibility of results when repeating experiments with a fixed random seed (motivation: this gave me a lot of trouble during my Thesis!). An extension of this work is available here.
Idea: Pick your favourite (environment, RL algorithm) pair -> train the RL algorithm on the env's reward function and evaluate the learned policy -> re-run the experiment to verify that you get the same result. The goal is to achieve identical performance when repeating the experiment (which may not be the case with TRPO).
Framework, language, OS: Tensorflow 1.14 (now bumped to 2.4.0), Python 3.7, Windows 10
The implementation uses Stable Baselines 2.10. I have included the `utils` and `hyperparams` folders from Baselines Zoo.
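For reference, the Zoo-style hyperparameter files are plain YAML keyed by environment id. Here is a minimal sketch of how tuned values can be read (it assumes the folder contains Zoo's `sac.yml` with a `Pendulum-v0` entry; exact file names and keys may differ in your copy):

```python
import yaml

# A sketch: load tuned SAC hyperparameters for Pendulum-v0 from a Zoo-style yaml.
# File name and keys follow the Baselines Zoo convention and may differ in your copy.
with open('hyperparams/sac.yml') as f:
    hyperparams = yaml.safe_load(f)['Pendulum-v0']
print(hyperparams)  # e.g. n_timesteps, policy, learning_starts, ...
```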
```bash
# create virtual environment (optional)
conda create -n myenv python=3.7
conda activate myenv

git clone https://github.com/prabhasak/reproducibilty.git
cd reproducibility
pip install -r requirements.txt  # recommended
pip install stable-baselines[mpi]  # MPI is needed for the TRPO algorithm
```
For CustomEnvs: register your CustomEnv with Gym (examples), and add your custom env and/or algorithm details. You can use the `airsim_env` folder for reference.
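As a rough sketch, registration looks like this (the module path, class name, and kwargs below are placeholders for your own code, not part of this repo):

```python
# e.g. in your package's __init__.py -- all names here are placeholders
from gym.envs.registration import register

register(
    id='MyCustomEnv-v0',                            # then pass --env MyCustomEnv-v0
    entry_point='my_custom_envs.envs:MyCustomEnv',  # '<module path>:<env class>'
    max_episode_steps=1000,
    kwargs={'some_arg': 42},                        # forwarded to your env's __init__
)
```

After registration, `gym.make('MyCustomEnv-v0')` should work anywhere the package is imported.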
```bash
python reproducibility.py --seed 42 --env Pendulum-v0 --algo sac -trl 1e5 -tb -check -eval -m -params learning_starts:1000
```
Here, I use a fixed random seed of 42 (the answer to life, the universe, and everything) to run the experiments. I choose the Pendulum-v0 environment and the Soft Actor-Critic (SAC) algorithm to learn a model from the reward. SAC is trained for 100,000 steps with the tuned hyperparameters from the `hyperparams` folder. Some features are enabled to monitor training and performance, explained next. Ideally, running this code should give you the following result:
Verify reproducibility: 65/100 successful episodes on SAC policy evaluation, with (mean, std) = (-161.68, 86.39)
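Under the hood, the check boils down to training and evaluating twice with the same seed and comparing the numbers. Here is a minimal sketch of that idea using the Stable Baselines 2.10 API (the helper `run_once` and the exact settings are illustrative, not the script's actual code):

```python
import gym
import numpy as np
from stable_baselines import SAC
from stable_baselines.common.evaluation import evaluate_policy

def run_once(seed=42, total_timesteps=int(1e5)):
    """Train SAC on Pendulum-v0 with a fixed seed; return (mean, std) of the evaluation reward."""
    env = gym.make('Pendulum-v0')
    env.seed(seed)
    # n_cpu_tf_sess=1 keeps the TF session single-threaded, which helps determinism
    model = SAC('MlpPolicy', env, seed=seed, learning_starts=1000, n_cpu_tf_sess=1, verbose=0)
    model.learn(total_timesteps=total_timesteps)
    return evaluate_policy(model, env, n_eval_episodes=100)

# Two runs with the same seed should yield (near-)identical evaluation statistics
first, second = run_once(seed=42), run_once(seed=42)
print(first, second)
assert np.allclose(first, second), "Experiment is not reproducible!"
```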
- Tensorboard: monitor performance during training (`tensorboard --logdir "/your/path"`)
- Callbacks (see the sketch after this list):
  a. Save the model periodically (useful for continual learning and for resuming training)
  b. Evaluate the model periodically and save the best model seen so far (you can choose to save and evaluate just the best model with `-best`)
- Multiprocessing: speed up training (I observed a 6x speedup for CartPole-v0 on my CPU with 12 threads). Note: TRPO uses MPI, so it has multiprocessing enabled by default
- VecNormalize: normalize the environment's observation and action spaces (useful for MuJoCo environments)
- Monitor: record internal state information during training (episode length, rewards). You can save a plot of the episode reward by modifying `results_plotter.py`
- Passing arguments to your CustomEnv, and loading hyperparameters from Baselines Zoo (some of which are tuned)
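Here is a rough sketch of how checkpointing, best-model evaluation, multiprocessing, normalization, and Monitor logging can be wired together with the Stable Baselines 2.10 API (the paths, frequencies, and the choice of PPO2/CartPole-v0 are illustrative, not this repo's exact setup):

```python
import os
import gym
from stable_baselines import PPO2
from stable_baselines.bench import Monitor
from stable_baselines.common.callbacks import CheckpointCallback, EvalCallback
from stable_baselines.common.vec_env import SubprocVecEnv, VecNormalize

def make_env(rank, seed=42, log_dir='./logs'):
    """Return a thunk that builds one seeded, Monitor-wrapped env (one per worker)."""
    def _init():
        os.makedirs(log_dir, exist_ok=True)
        env = gym.make('CartPole-v0')
        env.seed(seed + rank)
        return Monitor(env, '{}/env_{}'.format(log_dir, rank))
    return _init

if __name__ == '__main__':  # required on Windows, since SubprocVecEnv spawns worker processes
    # Multiprocessing: one environment per worker process
    env = SubprocVecEnv([make_env(i) for i in range(4)])
    # VecNormalize: normalize observations (reward normalization is optional)
    env = VecNormalize(env, norm_obs=True, norm_reward=False)

    # Callbacks: periodic checkpoints, plus periodic evaluation that keeps the best model.
    # Caveat: for a faithful evaluation the eval env should share the VecNormalize
    # statistics of the training env; that is omitted here for brevity.
    checkpoint_cb = CheckpointCallback(save_freq=10000, save_path='./checkpoints/')
    eval_cb = EvalCallback(gym.make('CartPole-v0'), best_model_save_path='./best_model/',
                           eval_freq=5000, n_eval_episodes=10, deterministic=True)

    model = PPO2('MlpPolicy', env, seed=42, tensorboard_log='./tb_logs/', verbose=1)
    model.learn(total_timesteps=int(1e5), callback=[checkpoint_cb, eval_cb])
```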
Check out my imitation learning repo!