Loads a range of baseball data to Google BigQuery.
Gets data from four main sources:
- Baseball Savant (Statcast): uses the Statcast Search tool to collect pitch-by-pitch logs for every team and player. Sample download here.
- Crunchtime Baseball player maps: a full table of current MLB players by MLBAM ID, mapped to their IDs in other "systems". Sample download here.
- Baseball Prospectus player maps: a table containing current and retired MLB players. Not as complete as the Crunchtime Baseball maps. Sample download here.
- Bill Petti's weather data (hosted on Box): a table containing weather conditions for every game. Sample here.
Requirements:
- Python 3.6+ (versions 3.5 and earlier haven't been tested)
Before using any of this tool's features, create a BigQuery project and dataset whose credentials match those in config.yaml.
For a quick introduction to Google BigQuery, have a look at their tutorials here.
To set up the repository's virtual environment, run:
> make venv
To initialize the BigQuery tables, run:
> make tables
To run a standard database update (all events for the current year and players), run:
> make data
To make more granular updates, refer to the documentation in the src/data.py file. For example, to update all events from 2016 without updating the players table, run:
> python src/update.py --year=2016 --no-players
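For readers unfamiliar with this flag style, the sketch below shows one way a script could accept `--year` and `--no-players` using Python's standard `argparse` module. It is only an illustration: the function name `parse_args`, the `players` attribute, and the default of the current year are assumptions, and the repository's actual update script may be written differently.

```python
import argparse
import datetime


def parse_args(argv=None):
    """Parse update flags (hypothetical sketch, not the repository's code)."""
    parser = argparse.ArgumentParser(
        description="Update baseball tables in BigQuery."
    )
    # Assumed default: the current year, matching the "standard update"
    # described for `make data`.
    parser.add_argument(
        "--year",
        type=int,
        default=datetime.date.today().year,
        help="Season whose events should be updated.",
    )
    # `store_false` makes `players` default to True and flips it to False
    # when --no-players is passed.
    parser.add_argument(
        "--no-players",
        dest="players",
        action="store_false",
        help="Skip updating the players table.",
    )
    return parser.parse_args(argv)


# Equivalent to passing: --year=2016 --no-players
args = parse_args(["--year=2016", "--no-players"])
print(f"year={args.year} update_players={args.players}")
```

Using a negative flag keeps the common case (update everything) free of arguments, while a granular run opts out of the players table explicitly.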