feat: Added model type GBM (LightGBM tree learner), as an alternative to ECD #2027

Merged: 101 commits, Jun 29, 2022
Changes shown below are from 19 commits.

Commits (101)
c844bea
WIP LightGBM
tgaddair Nov 14, 2021
b61e993
Fixed training e2e
tgaddair Nov 14, 2021
7c54ec4
Added example
tgaddair Nov 14, 2021
9c221dd
Added more boosting rounds
tgaddair Nov 14, 2021
d54364a
Hummingbird
tgaddair Nov 14, 2021
ee5184a
WIP refactor
tgaddair Nov 18, 2021
faa8707
add requirements for tree models
jppgks Apr 24, 2022
2161baf
format workspace
jppgks Apr 24, 2022
fc5f47d
incorporate api changes
jppgks Apr 24, 2022
0adcdb3
WIP training metrics
jppgks Apr 24, 2022
9bf5f79
WIP ray
jppgks Apr 25, 2022
4ddb10e
cache preprocessed dask df
jppgks Apr 27, 2022
c1b181e
only log eval results on rank 0
jppgks Apr 27, 2022
cc4097c
add abstract model class
jppgks May 10, 2022
408e7d8
rename lightgbm to trainer_lightgbm
jppgks May 11, 2022
4d97c6e
create trainer based on config
jppgks May 11, 2022
34fbd37
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 12, 2022
2c50c8d
add accuracy for GBM
jppgks May 18, 2022
04c9c00
expose GBM params through config
jppgks May 18, 2022
3ea822f
move build input/output to abstractmodel
jppgks May 19, 2022
36c5294
use abstractmodel in predictor
jppgks May 19, 2022
1aa2704
fix wrong defaults
jppgks May 19, 2022
21bc00d
fix hummingbird conversion
jppgks May 19, 2022
9d7eb96
parse logits from GBM predictions
jppgks May 20, 2022
29e110d
perform evaluation after training
jppgks May 20, 2022
7c4793b
fix hang by closing summary writers and ckpt mgr
jppgks May 24, 2022
2398615
fix tests
jppgks May 25, 2022
c47e5a5
test: verify binary predictions + add non-supported output
jppgks May 25, 2022
610a17d
add test for categorical output + multiple outputs
jppgks May 25, 2022
6f477bd
test: use local and ray backend
jppgks May 25, 2022
2f85e73
test: disable ray
jppgks May 25, 2022
c73d0c9
add test for numerical output
jppgks May 25, 2022
9d5c3b7
fix model loading and include load in tests
jppgks May 26, 2022
63c5815
modify ray trainers interface to support kwargs
jppgks May 26, 2022
6811637
replace occurences of ECD with AbstractModel where appropriate
jppgks May 26, 2022
a417ec9
Merge branch 'master' into gbt
jppgks May 26, 2022
9f4f6d4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 26, 2022
886666a
rename test file
jppgks May 26, 2022
6883b73
add test for model saving/loading
jppgks May 26, 2022
48afcd2
add gbm to torchscript test
jppgks May 26, 2022
1068b71
fix ray tests
jppgks May 26, 2022
eba4a7e
rename abstractmodel to basemodel
jppgks May 26, 2022
c9e44b8
remove logger object
jppgks May 26, 2022
7a69b9e
add type as abstract method to basemodel
jppgks May 26, 2022
ebf3542
add docstrings to basemodel
jppgks May 26, 2022
c1bcfcb
address review comments Piero
jppgks Jun 1, 2022
a87c17d
fix: required params for trainer init
jppgks Jun 3, 2022
b345bf8
error if default trainer already registered
jppgks Jun 8, 2022
b356db7
distributed eval (#2117)
jppgks Jun 8, 2022
6b380ab
localized ray imports + type hints
jppgks Jun 8, 2022
e1ce56d
fix ray
jppgks Jun 10, 2022
e820685
Trainer schema separation. (#2107)
ksbrar Jun 11, 2022
ff591e5
import all trainers
jppgks Jun 11, 2022
428e150
Merge remote-tracking branch 'origin/master' into gbt
jppgks Jun 12, 2022
0d922d2
remove need for backend to infer trainer type
jppgks Jun 12, 2022
d12e83a
Merge remote-tracking branch 'origin/master' into gbt
jppgks Jun 12, 2022
61ef893
remove unused get_default_from_registry
jppgks Jun 12, 2022
6331276
fix flake8
jppgks Jun 12, 2022
edfab03
make gbm imports optional
jppgks Jun 12, 2022
fb01120
Adding default input size 1 to PassthroughEncoder
w4nderlust Jun 14, 2022
3b65cd5
add default eval_batch_size for GBM trainer
jppgks Jun 14, 2022
9f7efc1
passthrough encoder/decoder for GBM
jppgks Jun 14, 2022
55ba4d9
lightgbm ray allow_less_than_two_cpus
jppgks Jun 14, 2022
6cf101e
add missing on_trainer_train_setup callback
jppgks Jun 14, 2022
1799335
fix ray serialization error
jppgks Jun 15, 2022
51b6def
Merge remote-tracking branch 'origin/master' into gbt
jppgks Jun 16, 2022
b2cbd00
remove unused import
jppgks Jun 16, 2022
49089f0
mark distributed gbm tests, exclude lightgbm-ray from non-distributed…
jppgks Jun 23, 2022
59bb10c
Merge remote-tracking branch 'origin/master' into gbt
jppgks Jun 23, 2022
c4cf1d0
do not require dask import for type hint
jppgks Jun 23, 2022
0b8c230
schema: remove required type + add description
jppgks Jun 23, 2022
f144752
tests: reduce resource request
jppgks Jun 23, 2022
ece96f7
shutdown ray cluster after each test
jppgks Jun 23, 2022
e7b9d11
test: optionally import ray
jppgks Jun 23, 2022
ea31f03
fix test: allow none for gbm eval batch size
jppgks Jun 23, 2022
14e9a51
mimic test_ray
jppgks Jun 24, 2022
2eaf25a
Merge remote-tracking branch 'origin/master' into gbt
jppgks Jun 24, 2022
ebec048
push model save/load to model classes
jppgks Jun 26, 2022
e48535f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 26, 2022
641e1eb
fix(ray eval): do not pass non-serializable callbacks
jppgks Jun 26, 2022
a97cdcb
Merge branch 'gbt' of https://github.com/ludwig-ai/ludwig into gbt
jppgks Jun 26, 2022
26cb32a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 26, 2022
d8c48d0
fix tests
jppgks Jun 27, 2022
2fd52be
add on_batch_start, on_batch_end callbacks
jppgks Jun 27, 2022
0387da1
Merge branch 'gbt' of https://github.com/ludwig-ai/ludwig into gbt
jppgks Jun 27, 2022
4cbe0d6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 27, 2022
feefa79
fix test: increase num examples to 100
jppgks Jun 27, 2022
30ff1eb
format
jppgks Jun 27, 2022
de545a4
Merge branch 'gbt' of https://github.com/ludwig-ai/ludwig into gbt
jppgks Jun 27, 2022
6106b93
Merge remote-tracking branch 'origin/master' into gbt
jppgks Jun 27, 2022
0de3c4c
use DataFrame type from utils
jppgks Jun 28, 2022
c0ba8ec
Merge remote-tracking branch 'origin/master' into gbt
jppgks Jun 28, 2022
971eda9
fix circular import
jppgks Jun 28, 2022
85f7654
fix merge with master
jppgks Jun 28, 2022
e70cf62
Apply suggestions from code review
justinxzhao Jun 28, 2022
6ba8730
rename TrainerConfig to ECDTrainerConfig
jppgks Jun 28, 2022
1eb0046
Merge remote-tracking branch 'origin/master' into gbt
jppgks Jun 28, 2022
3c0e700
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 28, 2022
ccb524c
update docstring occurences of TrainerConfig
jppgks Jun 28, 2022
67334c7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 28, 2022
ca7db50
Merge remote-tracking branch 'origin/master' into gbt
jppgks Jun 28, 2022
35 changes: 35 additions & 0 deletions examples/lightgbm/config.yaml
@@ -0,0 +1,35 @@
model_type: gbm
input_features:
- name: age
type: number
- name: workclass
type: category
- name: fnlwgt
type: number
- name: education
type: category
- name: education-num
type: number
- name: marital-status
type: category
- name: occupation
type: category
- name: relationship
type: category
- name: race
type: category
- name: sex
type: category
- name: capital-gain
type: number
- name: capital-loss
type: number
- name: hours-per-week
type: number
- name: native-country
type: category
output_features:
- name: income
type: category
training:
learning_rate: 0.05
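
A minimal sketch of how this config might be driven from the Python API, assuming the local backend and the adult_census_income dataset loader that train.py below references in a comment; this snippet is illustrative and not part of the PR diff:

import logging

from ludwig.api import LudwigModel
from ludwig.datasets import adult_census_income

# Load the Adult Census Income dataset as a single DataFrame (no split).
df = adult_census_income.load(split=False)

# Train the GBM model described by examples/lightgbm/config.yaml on the local backend.
model = LudwigModel(config="./config.yaml", logging_level=logging.INFO)
train_stats, preprocessed_data, output_directory = model.train(dataset=df)
print("results written to:", output_directory)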
70 changes: 70 additions & 0 deletions examples/lightgbm/config_higgs.yaml
@@ -0,0 +1,70 @@
model_type: gbm
input_features:
- name: lepton_pT
type: number
- name: lepton_eta
type: number
- name: lepton_phi
type: number
- name: missing_energy_magnitude
type: number
- name: missing_energy_phi
type: number
- name: jet_1_pt
type: number
- name: jet_1_eta
type: number
- name: jet_1_phi
type: number
- name: jet_1_b-tag
type: number
- name: jet_2_pt
type: number
- name: jet_2_eta
type: number
- name: jet_2_phi
type: number
- name: jet_2_b-tag
type: number
- name: jet_3_pt
type: number
- name: jet_3_eta
type: number
- name: jet_3_phi
type: number
- name: jet_3_b-tag
type: number
- name: jet_4_pt
type: number
- name: jet_4_eta
type: number
- name: jet_4_phi
type: number
- name: jet_4_b-tag
type: number
- name: m_jj
type: number
- name: m_jjj
type: number
- name: m_lv
type: number
- name: m_jlv
type: number
- name: m_bb
type: number
- name: m_wbb
type: number
- name: m_wwbb
type: number
output_features:
- name: label
type: binary
trainer:
learning_rate: 0.1
num_boost_round: 50
num_leaves: 255
feature_fraction: 0.9
bagging_fraction: 0.8
bagging_freq: 5
min_data_in_leaf: 1
min_sum_hessian_in_leaf: 100
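
For readers more familiar with LightGBM's native API, the trainer block above maps roughly onto the following call. This is only an illustration of the hyperparameters, not code from the PR, and it uses synthetic stand-in data in place of the preprocessed HIGGS features:

import lightgbm as lgb
import numpy as np

# Synthetic stand-in for the 28 numeric HIGGS features and the binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 28))
y = rng.integers(0, 2, size=1000)
train_set = lgb.Dataset(X[:800], label=y[:800])
valid_set = lgb.Dataset(X[800:], label=y[800:], reference=train_set)

params = {
    "objective": "binary",  # the output feature `label` is binary
    "learning_rate": 0.1,
    "num_leaves": 255,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "min_data_in_leaf": 1,
    "min_sum_hessian_in_leaf": 100,
}

booster = lgb.train(params, train_set, num_boost_round=50, valid_sets=[valid_set])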
40 changes: 40 additions & 0 deletions examples/lightgbm/train.py
@@ -0,0 +1,40 @@
#!/usr/bin/env python

import logging
import os
import shutil

from ludwig.api import LudwigModel
from ludwig.backend import initialize_backend
from ludwig.data.cache.types import CacheableDataframe
from ludwig.datasets import higgs # adult_census_income

shutil.rmtree("./results", ignore_errors=True)

backend_config = {
"type": "ray",
"processor": {
"parallelism": 6,
"type": "dask",
},
"trainer": {
"num_actors": 3,
"cpus_per_actor": 2,
},
}
backend = initialize_backend(backend_config)
model = LudwigModel(config="./config_higgs.yaml", logging_level=logging.INFO, backend=backend)

# df = adult_census_income.load(split=False)
df = higgs.load(split=False, add_validation_set=True)
df = CacheableDataframe(df=df, name="cache_higgs", checksum="9YeB0J_fiQ9Dh_lL84NdZg==")

(
train_stats, # dictionary containing training statistics
preprocessed_data, # tuple Ludwig Dataset objects of pre-processed training data
output_directory, # location of training results stored on disk
) = model.train(dataset=df)

print("contents of output directory:", output_directory)
for item in os.listdir(output_directory):
print("\t", item)
23 changes: 10 additions & 13 deletions ludwig/api.py
@@ -43,6 +43,7 @@
EVAL_BATCH_SIZE,
FULL,
LEARNING_RATE,
MODEL_TYPE,
PREPROCESSING,
TEST,
TRAINER,
@@ -61,18 +62,19 @@
set_disable_progressbar,
TRAIN_SET_METADATA_FILE_NAME,
)
from ludwig.models.ecd import ECD
from ludwig.models.abstractmodel import AbstractModel
from ludwig.models.inference import InferenceModule
from ludwig.models.predictor import (
calculate_overall_stats,
print_evaluation_stats,
save_evaluation_stats,
save_prediction_outputs,
)
from ludwig.models.trainer import Trainer
from ludwig.models.registry import model_type_registry
from ludwig.modules.metric_modules import get_best_function
from ludwig.schema import validate_config
from ludwig.schema.utils import load_config_with_kwargs
from ludwig.trainers.trainer import Trainer
from ludwig.utils import metric_utils
from ludwig.utils.data_utils import (
figure_data_format,
@@ -84,7 +86,7 @@
)
from ludwig.utils.defaults import default_random_seed, merge_with_defaults
from ludwig.utils.fs_utils import makedirs, open_file, path_exists, upload_output_directory
from ludwig.utils.misc_utils import get_file_names, get_output_directory
from ludwig.utils.misc_utils import get_file_names, get_from_registry, get_output_directory
from ludwig.utils.print_utils import print_boxed

logger = logging.getLogger(__name__)
@@ -1470,8 +1472,8 @@ def _check_initialization(self):
raise ValueError("Model has not been trained or loaded")

@staticmethod
def create_model(config: dict, random_seed: int = default_random_seed) -> ECD:
"""Instantiates Encoder-Combiner-Decoder (ECD) object.
def create_model(config: dict, random_seed: int = default_random_seed) -> AbstractModel:
"""Instantiates AbstractModel object.

# Inputs
:param config: (dict) Ludwig config
@@ -1480,15 +1482,10 @@ def create_model(config: dict, random_seed: int = default_random_seed) -> ECD:
splits and any other random function.

# Return
:return: (ludwig.models.ECD) Instance of the Ludwig model object.
:return: (ludwig.models.AbstractModel) Instance of the Ludwig model object.
"""
# todo: support loading other model types based on config
return ECD(
input_features_def=config["input_features"],
combiner_def=config["combiner"],
output_features_def=config["output_features"],
random_seed=random_seed,
)
model_type = get_from_registry(config[MODEL_TYPE], model_type_registry)
return model_type(**config, random_seed=random_seed)

@staticmethod
def set_logging_level(logging_level: int) -> None:
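
The rewritten create_model above replaces the hard-coded ECD constructor with a lookup of config[MODEL_TYPE] in model_type_registry (imported from ludwig.models.registry). A simplified sketch of that pattern follows; the class bodies and registry contents here are placeholders for illustration, not the actual Ludwig implementations:

MODEL_ECD = "ecd"
MODEL_GBM = "gbm"

# Placeholder model classes standing in for the real ECD and GBM model implementations.
class ECD: ...
class GBM: ...

model_type_registry = {
    MODEL_ECD: ECD,
    MODEL_GBM: GBM,
}

def get_from_registry(key, registry):
    # Simplified stand-in for ludwig.utils.misc_utils.get_from_registry: fail on unknown keys.
    if key in registry:
        return registry[key]
    raise ValueError(f"{key!r} not in registry; available keys: {list(registry)}")

model_cls = get_from_registry("gbm", model_type_registry)  # -> GBM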
20 changes: 14 additions & 6 deletions ludwig/backend/base.py
@@ -22,7 +22,10 @@
from ludwig.data.dataframe.pandas import PANDAS
from ludwig.data.dataset.base import DatasetManager
from ludwig.data.dataset.pandas import PandasDatasetManager
from ludwig.models.ecd import ECD
from ludwig.models.abstractmodel import AbstractModel
from ludwig.schema.trainer import TrainerConfig
from ludwig.trainers.registry import trainers_registry
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.torch_utils import initialize_pytorch


@@ -54,7 +57,7 @@ def initialize_pytorch(self, *args, **kwargs):

@contextmanager
@abstractmethod
def create_trainer(self, **kwargs):
def create_trainer(self, **kwargs) -> "BaseTrainer": # noqa: F821
raise NotImplementedError()

@abstractmethod
@@ -106,12 +109,17 @@ class LocalTrainingMixin:
def initialize_pytorch(self, *args, **kwargs):
initialize_pytorch(*args, **kwargs)

def create_trainer(self, **kwargs):
from ludwig.models.trainer import Trainer
def create_trainer(self, **kwargs) -> "BaseTrainer": # noqa: F821
config: TrainerConfig = kwargs["config"]
model: AbstractModel = kwargs["model"]

return Trainer(**kwargs)
trainers_for_model = get_from_registry(model.type(), trainers_registry)

def create_predictor(self, model: ECD, **kwargs):
trainer_cls = get_from_registry(config.type, trainers_for_model)

return trainer_cls(**kwargs)

def create_predictor(self, model: AbstractModel, **kwargs):
from ludwig.models.predictor import Predictor

return Predictor(model, **kwargs)
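
create_trainer above performs a two-level lookup: the model type (model.type()) selects a family of trainers, then config.type selects the concrete trainer class within that family. A rough sketch of how such a registry and registration decorator could look; the decorator name, trainer names, and classes below are placeholders, while the "refuse to overwrite a default" behavior echoes commit b345bf8. The real registries live in ludwig/trainers/registry.py, and the same pattern is used for ray_trainers_registry in ludwig/backend/ray.py:

# Two-level registry: {model_type: {trainer_name: trainer_cls}}, plus one default per model type.
trainers_registry = {}
default_trainers = {}

def register_trainer(name, model_type, default=False):
    def wrap(cls):
        trainers_registry.setdefault(model_type, {})[name] = cls
        if default:
            if model_type in default_trainers:
                # Echoes commit b345bf8: error if a default trainer is already registered.
                raise ValueError(f"Default trainer already registered for model type {model_type!r}")
            default_trainers[model_type] = cls
        return cls
    return wrap

@register_trainer("trainer", "ecd", default=True)
class Trainer: ...

@register_trainer("lightgbm_trainer", "gbm", default=True)
class LightGBMTrainer: ...

# Lookup as in LocalTrainingMixin.create_trainer: model type first, then trainer type.
trainers_for_model = trainers_registry["gbm"]
trainer_cls = trainers_for_model["lightgbm_trainer"]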
7 changes: 5 additions & 2 deletions ludwig/backend/horovod.py
@@ -17,10 +17,11 @@
import time

from ludwig.backend.base import Backend, LocalPreprocessingMixin
from ludwig.constants import MODEL_GBM, MODEL_TYPE
from ludwig.data.dataset.pandas import PandasDatasetManager
from ludwig.models.ecd import ECD
from ludwig.models.predictor import Predictor
from ludwig.models.trainer import Trainer
from ludwig.trainers.trainer import Trainer
from ludwig.utils.horovod_utils import initialize_horovod
from ludwig.utils.torch_utils import initialize_pytorch

@@ -36,7 +37,9 @@ def initialize(self):
def initialize_pytorch(self, *args, **kwargs):
initialize_pytorch(*args, horovod=self._horovod, **kwargs)

def create_trainer(self, **kwargs):
def create_trainer(self, **kwargs) -> "BaseTrainer": # noqa: F821
if kwargs.get(MODEL_TYPE, "") == MODEL_GBM:
raise ValueError("Horovod backend does not support GBM models.")
return Trainer(horovod=self._horovod, **kwargs)

def create_predictor(self, model: ECD, **kwargs):
33 changes: 24 additions & 9 deletions ludwig/backend/ray.py
@@ -32,12 +32,16 @@
from ray.util.dask import ray_dask_get

from ludwig.backend.base import Backend, RemoteTrainingMixin
from ludwig.constants import NAME, PREPROCESSING, PROC_COLUMN
from ludwig.constants import MODEL_ECD, MODEL_GBM, NAME, PREPROCESSING, PROC_COLUMN
from ludwig.data.dataset.ray import RayDataset, RayDatasetManager, RayDatasetShard
from ludwig.models.abstractmodel import AbstractModel
from ludwig.models.ecd import ECD
from ludwig.models.predictor import BasePredictor, get_output_columns, Predictor, RemotePredictor
from ludwig.models.trainer import BaseTrainer, RemoteTrainer, TrainerConfig
from ludwig.schema.trainer import TrainerConfig
from ludwig.trainers.registry import ray_trainers_registry, register_ray_trainer
from ludwig.trainers.trainer import BaseTrainer, RemoteTrainer
from ludwig.utils.horovod_utils import initialize_horovod
from ludwig.utils.misc_utils import get_from_registry
from ludwig.utils.torch_utils import initialize_pytorch

_ray112 = LooseVersion(ray.__version__) >= LooseVersion("1.12")
@@ -271,6 +275,7 @@ def tune_learning_rate_fn(
hvd.shutdown()


@register_ray_trainer("ray_trainer_v2", MODEL_ECD, default=True)
class RayTrainerV2(BaseTrainer):
def __init__(self, model, trainer_kwargs, data_loader_kwargs, executable_kwargs):
self.model = model.cpu()
@@ -454,6 +459,7 @@ def __init__(self, **kwargs):
super().__init__(horovod=horovod, **kwargs)


@register_ray_trainer("ray_legacy_trainer", MODEL_ECD)
class RayLegacyTrainer(BaseTrainer):
def __init__(self, horovod_kwargs, executable_kwargs):
# TODO ray: make this more configurable by allowing YAML overrides of timeout_s, etc.
@@ -717,17 +723,26 @@ def initialize_pytorch(self, **kwargs):
initialize_pytorch(gpus=-1)
self._pytorch_kwargs = kwargs

def create_trainer(self, model: ECD, **kwargs):
def create_trainer(self, model: AbstractModel, **kwargs) -> "BaseTrainer": # noqa: F821
executable_kwargs = {**kwargs, **self._pytorch_kwargs}
if not self._use_legacy:
trainers_for_model = get_from_registry(model.type(), ray_trainers_registry)

config: TrainerConfig = kwargs["config"]
trainer_cls = get_from_registry(config.type, trainers_for_model)

# Deep copy to workaround https://github.com/ray-project/ray/issues/24139
return RayTrainerV2(
model,
copy.deepcopy(self._horovod_kwargs),
self._data_loader_kwargs,
executable_kwargs,
)
all_kwargs = {
"model": model,
"trainer_kwargs": copy.deepcopy(self._horovod_kwargs),
"data_loader_kwargs": self._data_loader_kwargs,
"executable_kwargs": executable_kwargs,
}
return trainer_cls(**all_kwargs)
else:
if model.name == MODEL_GBM:
raise RuntimeError("Legacy trainer not supported for GBM models.")

# TODO: deprecated 0.5
return RayLegacyTrainer(self._horovod_kwargs, executable_kwargs)

4 changes: 4 additions & 0 deletions ludwig/constants.py
@@ -167,3 +167,7 @@
)

DEFAULT_AUDIO_TENSOR_LENGTH = 70000

MODEL_TYPE = "model_type"
MODEL_ECD = "ecd"
MODEL_GBM = "gbm"
5 changes: 4 additions & 1 deletion ludwig/data/dataset/pandas.py
@@ -22,7 +22,7 @@
from ludwig.data.dataset.base import Dataset, DatasetManager
from ludwig.data.sampler import DistributedSampler
from ludwig.utils import data_utils
from ludwig.utils.data_utils import DATA_TRAIN_HDF5_FP, to_numpy_dataset
from ludwig.utils.data_utils import DATA_TRAIN_HDF5_FP, from_numpy_dataset, to_numpy_dataset
from ludwig.utils.fs_utils import download_h5
from ludwig.utils.misc_utils import get_proc_features

@@ -34,6 +34,9 @@ def __init__(self, dataset, features, data_hdf5_fp):
self.size = len(dataset)
self.dataset = to_numpy_dataset(dataset)

def to_df(self, features):
Contributor review comment: Please add docstring and type hints.
return from_numpy_dataset({feature.feature_name: self.dataset[feature.proc_column] for feature in features})

def get(self, proc_column, idx=None):
if idx is None:
idx = range(self.size)
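
Addressing the review comment above, a docstring-and-type-hints version of to_df could look like the sketch below; the pandas return type is an assumption based on from_numpy_dataset being used in the pandas dataset manager:

from typing import Iterable

import pandas as pd

def to_df(self, features: Iterable) -> pd.DataFrame:
    """Convert the numpy-backed dataset into a DataFrame, one column per feature.

    :param features: iterable of feature objects exposing `feature_name` and
        `proc_column` attributes.
    :return: DataFrame keyed by feature name, with values taken from the
        corresponding preprocessed columns.
    """
    return from_numpy_dataset(
        {feature.feature_name: self.dataset[feature.proc_column] for feature in features}
    )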