Objective: Benchmark Reinforcement Learning (RL) algorithms from Stable Baselines 2.10 on OpenAI Gym environments. The codebase focuses on verifying reproducibility of results when repeating experiments with a fixed random seed (motivation: this gave me a lot of trouble during my Thesis!). An extension of this work is available here.
Idea: Pick your favourite (environment, RL algorithm) pair -> train the RL algorithm on the env's reward function and evaluate the learned policy -> re-run the experiment to verify that you get the same result. The goal is to achieve identical performance when repeating the experiment (which may not be the case with TRPO).
Framework, language, OS: Tensorflow 1.14 (now bumped to 2.4.0), Python 3.7, Windows 10
The implementation uses Stable Baselines 2.10. I have included the `utils` and `hyperparams` folders from Baselines Zoo.
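For reference, the Zoo-style hyperparameter files are plain YAML keyed by environment id. Here is a minimal sketch of how tuned values can be read (it assumes the folder contains Zoo's `sac.yml` with a `Pendulum-v0` entry; exact file names and keys may differ in your copy):

```python
import yaml

# A sketch: load tuned SAC hyperparameters for Pendulum-v0 from a Zoo-style yaml.
# File name and keys follow the Baselines Zoo convention and may differ in your copy.
with open('hyperparams/sac.yml') as f:
    hyperparams = yaml.safe_load(f)['Pendulum-v0']
print(hyperparams)  # e.g. n_timesteps, policy, learning_starts, ...
```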
```bash
# create virtual environment (optional)
conda create -n myenv python=3.7
conda activate myenv

git clone https://github.com/prabhasak/reproducibilty.git
cd reproducibility
pip install -r requirements.txt  # recommended
pip install stable-baselines[mpi]  # MPI is needed for the TRPO algorithm
```
For CustomEnvs: register your CustomEnv with Gym (examples), and add your custom env and/or algorithm details. You can use the `airsim_env` folder for reference.
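As a rough sketch, registration looks like this (the module path, class name, and kwargs below are placeholders for your own code, not part of this repo):

```python
# e.g. in your package's __init__.py -- all names here are placeholders
from gym.envs.registration import register

register(
    id='MyCustomEnv-v0',                            # then pass --env MyCustomEnv-v0
    entry_point='my_custom_envs.envs:MyCustomEnv',  # '<module path>:<env class>'
    max_episode_steps=1000,
    kwargs={'some_arg': 42},                        # forwarded to your env's __init__
)
```

After registration, `gym.make('MyCustomEnv-v0')` should work anywhere the package is imported.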
```bash
python reproducibility.py --seed 42 --env Pendulum-v0 --algo sac -trl 1e5 -tb -check -eval -m -params learning_starts:1000
```
Here, I use a fixed random seed of 42 (the answer to life, the universe, and everything) to run the experiments. I choose the Pendulum-v0 environment and the Soft Actor-Critic (SAC) algorithm to learn a model from the reward. SAC is trained for 100,000 steps with the tuned hyperparameters from the `hyperparams` folder. Some features are enabled to monitor training and performance, explained next. Ideally, running this code should give you the following result:
Verify reproducibility: 65/100 successful episodes on SAC policy evaluation, with (mean, std) = (-161.68, 86.39)
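Under the hood, the check boils down to training and evaluating twice with the same seed and comparing the numbers. Here is a minimal sketch of that idea using the Stable Baselines 2.10 API (the helper `run_once` and the exact settings are illustrative, not the script's actual code):

```python
import gym
import numpy as np
from stable_baselines import SAC
from stable_baselines.common.evaluation import evaluate_policy

def run_once(seed=42, total_timesteps=int(1e5)):
    """Train SAC on Pendulum-v0 with a fixed seed; return (mean, std) of the evaluation reward."""
    env = gym.make('Pendulum-v0')
    env.seed(seed)
    # n_cpu_tf_sess=1 keeps the TF session single-threaded, which helps determinism
    model = SAC('MlpPolicy', env, seed=seed, learning_starts=1000, n_cpu_tf_sess=1, verbose=0)
    model.learn(total_timesteps=total_timesteps)
    return evaluate_policy(model, env, n_eval_episodes=100)

# Two runs with the same seed should yield (near-)identical evaluation statistics
first, second = run_once(seed=42), run_once(seed=42)
print(first, second)
assert np.allclose(first, second), "Experiment is not reproducible!"
```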
- Tensorboard: monitor performance during training (`tensorboard --logdir "/your/path"`)
- Callbacks (see the sketch after this list):
  a. Save the model periodically (useful for continual learning and for resuming training)
  b. Evaluate the model periodically and save the best model seen so far (you can choose to save and evaluate just the best model with `-best`)
- Multiprocessing: speed up training (I observed a 6x speedup for CartPole-v0 on my CPU with 12 threads). Note: TRPO uses MPI, so it has multiprocessing enabled by default
- VecNormalize: normalize the environment's observation and action spaces (useful for MuJoCo environments)
- Monitor: record internal state information during training (episode length, rewards). You can save a plot of the episode reward by modifying `results_plotter.py`
- Passing arguments to your CustomEnv, and loading hyperparameters from Baselines Zoo (some of which are tuned)
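Here is a rough sketch of how checkpointing, best-model evaluation, multiprocessing, normalization, and Monitor logging can be wired together with the Stable Baselines 2.10 API (the paths, frequencies, and the choice of PPO2/CartPole-v0 are illustrative, not this repo's exact setup):

```python
import os
import gym
from stable_baselines import PPO2
from stable_baselines.bench import Monitor
from stable_baselines.common.callbacks import CheckpointCallback, EvalCallback
from stable_baselines.common.vec_env import SubprocVecEnv, VecNormalize

def make_env(rank, seed=42, log_dir='./logs'):
    """Return a thunk that builds one seeded, Monitor-wrapped env (one per worker)."""
    def _init():
        os.makedirs(log_dir, exist_ok=True)
        env = gym.make('CartPole-v0')
        env.seed(seed + rank)
        return Monitor(env, '{}/env_{}'.format(log_dir, rank))
    return _init

if __name__ == '__main__':  # required on Windows, since SubprocVecEnv spawns worker processes
    # Multiprocessing: one environment per worker process
    env = SubprocVecEnv([make_env(i) for i in range(4)])
    # VecNormalize: normalize observations (reward normalization is optional)
    env = VecNormalize(env, norm_obs=True, norm_reward=False)

    # Callbacks: periodic checkpoints, plus periodic evaluation that keeps the best model.
    # Caveat: for a faithful evaluation the eval env should share the VecNormalize
    # statistics of the training env; that is omitted here for brevity.
    checkpoint_cb = CheckpointCallback(save_freq=10000, save_path='./checkpoints/')
    eval_cb = EvalCallback(gym.make('CartPole-v0'), best_model_save_path='./best_model/',
                           eval_freq=5000, n_eval_episodes=10, deterministic=True)

    model = PPO2('MlpPolicy', env, seed=42, tensorboard_log='./tb_logs/', verbose=1)
    model.learn(total_timesteps=int(1e5), callback=[checkpoint_cb, eval_cb])
```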
Check out my imitation learning repo!