
feat: Adding feature type shared parameter capability for hyperopt #2133

Merged: 66 commits merged into master from hyperopt_shared_parameters on Jun 29, 2022

Conversation

@arnavgarg1 (Contributor) commented Jun 13, 2022

This change enables using the defaults keyword within hyperopt to set default parameters for all features of a given type. This helps define search spaces more concisely for datasets with a large number of features, while also reducing the search space to allow for a deeper search during hyperopt.

For example, a user can now add

"defaults.text.num_filters": {"space": "choice", "categories": [64, 128, 256, 512]}

to their hyperopt search space so that all text features (inputs and outputs) use the sampled num_filters value for that particular trial.

The API for this change is as follows: defaults.<feature_type>.<parameter>, where:

  • feature_type: a valid input or output feature type, e.g., text, category, or sequence
  • parameter: the parameter to search over for that particular feature type, e.g., embedding_size for category features

Example Configuration:

hyperopt:
    goal: minimize
    parameters:
        trainer.learning_rate:
            lower: 0.0001
            upper: 0.01
            space: loguniform
        defaults.text.num_filters:
            space: choice
            categories: [64, 128, 256, 512]
        defaults.category.embedding_size:
            space: choice
            categories: [64, 128, 256, 512]
        ....
    ....
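To make the mechanics concrete, here is a minimal sketch of how a sampled defaults.<feature_type>.<parameter> value could be applied to a config. This is an illustration only, not Ludwig's actual implementation, and the helper name substitute_default is hypothetical:

def substitute_default(config: dict, parameter_name: str, sampled_value) -> dict:
    """Apply a sampled defaults.* hyperopt parameter to every matching feature."""
    # e.g. "defaults.text.num_filters" -> ("defaults", "text", "num_filters")
    _, feature_type, param = parameter_name.split(".", 2)
    for section in ("input_features", "output_features"):
        for feature in config.get(section, []):
            if feature.get("type") == feature_type:
                feature[param] = sampled_value
    return config

# Example: a trial samples num_filters=256 for all text features.
config = {
    "input_features": [
        {"name": "title", "type": "text"},
        {"name": "body", "type": "text"},
        {"name": "genre", "type": "category"},
    ],
    "output_features": [{"name": "label", "type": "category"}],
}
substitute_default(config, "defaults.text.num_filters", 256)
assert all(f["num_filters"] == 256 for f in config["input_features"] if f["type"] == "text")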

Note: The trial status table that gets logged every few seconds will show the parameter under its defaults.<feature_type>.<parameter> name (e.g., defaults.text.num_filters), but under the hood the sampled value is applied to each matching feature correctly. This is unfortunately not something we can control, but I think it is okay to leave as-is, since the table width would become too large if we added a column for every feature whose default parameter was updated in a trial.

[Screenshot: trial status table, Jun 13, 2022]

@arnavgarg1 arnavgarg1 self-assigned this Jun 13, 2022
@arnavgarg1 arnavgarg1 marked this pull request as draft June 13, 2022 20:36
@arnavgarg1 (Contributor, Author)

Working on writing some tests, so I will update this PR, but I wanted to create it now so we can test whether it works.

@github-actions bot commented Jun 13, 2022

Unit Test Results

6 files ±0 · 6 suites ±0 · 2h 11m 2s ⏱️ (−17m 17s)
2,887 tests +7: 2,841 passed ✔️ +7 · 46 skipped 💤 ±0 · 0 failed ±0
8,661 runs +21: 8,519 passed ✔️ +21 · 142 skipped 💤 ±0 · 0 failed ±0

Results for commit 1c9f65a. ± Comparison against base commit a53f9f8.

♻️ This comment has been updated with latest results.

@arnavgarg1 arnavgarg1 changed the title feat: Shared Parameters For Hyperopt enh: Shared Parameters For Hyperopt Jun 13, 2022
@arnavgarg1 arnavgarg1 marked this pull request as ready for review June 14, 2022 00:19
@arnavgarg1: This comment was marked as outdated.

Review threads on ludwig/hyperopt/execution.py and ludwig/hyperopt/run.py (outdated, resolved)
@arnavgarg1 (Contributor, Author)

> Not sure where my comment went, but I have an issue with vocab_size. It is an internal-only parameter, derived from most_common and preprocessing; it's not something to hyperopt over, but the PR searches over it. We should choose another parameter to search over.

This should be fixed now!

@w4nderlust (Collaborator)

> This should be fixed now!

In test_hyperopt I can see it though.

@arnavgarg1 (Contributor, Author)

> In test_hyperopt I can see it though.

Good catch! I've updated all hyperopt tests in test_hyperopt to use parameters that are user-facing now :)

@w4nderlust (Collaborator)

> Good catch! I've updated all hyperopt tests in test_hyperopt to use parameters that are user-facing now :)

Just to provide additional clarification: in the tests, when we define features like category_feature(), we provide a vocab_size. The reason is that that function has two purposes: it creates a dictionary that can be used both by synthesize_data and as part of a Ludwig config. vocab_size is a parameter for synthesize_data (it determines how many categories there will be in the generated data) and not a parameter for the Ludwig config (because it is determined internally by Ludwig).

So, its use in, say, tests/integration_tests/test_hyperopt.py lines 89 and 92 is completely fine, and indeed helps generate data with a small vocabulary, which in turn helps make those tests faster (although I think for category features it should be a minimum of 3). That usage should be kept, so I suggest reintroducing those uses of vocab_size.

What I was arguing should be removed (and sorry if I wasn't more precise about it before) is the presence of vocab_size as a parameter in the hyperopt section of the test config, since there it is basically a config override of an internal-only parameter. embedding_size, on the other hand, is a good replacement for vocab_size, because it is actually a parameter we expect users of hyperopt to be able to change.

Another unrelated comment:

num_filters_search_space = [128, 512]
embedding_size_search_space = [128, 512]

Those are reasonable values a user may set, although in these tests we want to make them as small as possible to keep the tests quick. Here I suggest something like 4 and 8 as the two options.
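The distinction above can be summarized in a short sketch, assuming Ludwig's test helper category_feature() from tests/integration_tests/utils (values are illustrative): vocab_size stays in the feature definition, where it only controls data synthesis, while embedding_size, a real user-facing parameter, goes into the hyperopt search space with deliberately tiny options.

from tests.integration_tests.utils import category_feature  # Ludwig test helper

# vocab_size keeps the synthesized data small (minimum 3 for category
# features, per the note above); it is NOT something hyperopt should
# search over.
input_features = [category_feature(vocab_size=3)]

# embedding_size is a user-facing config parameter, so it belongs in the
# search space, with small options to keep the test fast.
hyperopt_parameters = {
    "defaults.category.embedding_size": {"space": "choice", "categories": [4, 8]},
}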

@arnavgarg1 (Contributor, Author)

Re: tests/integration_tests/test_hyperopt.py

That makes a lot of sense, @w4nderlust, thanks for the in-depth clarification! I've added vocab_size back everywhere a categorical feature is created, but removed it from every hyperopt search space. I've also reduced the sample space to [4, 8] to speed up the tests.

@arnavgarg1 arnavgarg1 changed the title feat: Shared Parameters For Hyperopt feat: Adding feature type shared parameter capability for hyperopt Jun 25, 2022
@arnavgarg1 (Contributor, Author)

@justinxzhao @ShreyaR @w4nderlust Does this PR look good to go? Happy to make any additional changes if needed!

Review threads on tests/integration_tests/test_hyperopt.py (outdated, resolved)
@tgaddair (Collaborator) left a comment


An open question in my mind is whether it's necessary to specify the input_features section, or whether we want to just specify the feature type.

@arnavgarg1 (Contributor, Author) commented Jun 27, 2022

> An open question in my mind is whether it's necessary to specify the input_features section, or whether we want to just specify the feature type.

@tgaddair I think that's a good question.

Ideally we should allow users to do both, so they can control the level of granularity. If they want to set defaults for only a specific feature group (say, input_features vs. output_features), they can do that; otherwise they could just write something like defaults.text.cell_type.

For now, the way I've implemented it technically gives users both levels of control, because they can copy and paste the same entry in their configs and just replace input_features with output_features, which is only a little additional work. If we see requests to drop this distinction and allow setting defaults for inputs and outputs together in one go, it can be added in a v2.

Another reason to keep this distinction is that the parameters for input and output features of the same type (e.g., text) can differ, so it's neater to specify them separately. We could theoretically infer from the parameter name whether a parameter is for the encoder or the decoder, but I think that would be more error-prone in terms of usability. It could be confusing for an end user to specify these things in one place without any indication of whether the parameter applies to the encoder, the decoder, or both, or to have it apply only to, say, the output feature of that type.

Happy to discuss this further if you feel differently!
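For illustration, here are the two granularity levels under discussion, written as Python dicts of hyperopt parameters. The feature-group-specific form is hypothetical syntax and, per the resolution below, is not the shape that ultimately shipped:

# Per-feature-group granularity (hypothetical syntax):
granular = {
    "defaults.input_features.text.cell_type": {"space": "choice", "categories": ["rnn", "gru"]},
    "defaults.output_features.text.cell_type": {"space": "choice", "categories": ["rnn", "gru"]},
}

# Feature-type-only granularity: one entry applies to both input and
# output text features.
shared = {
    "defaults.text.cell_type": {"space": "choice", "categories": ["rnn", "gru"]},
}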

@tgaddair (Collaborator)

@arnavgarg1, a few things I would say as a counter-point to the above:

  1. In most cases the user will not have more than one output feature (at the extreme, a handful), so I wouldn't over-optimize for the case where the user has many input and output features that each need their own distinct defaults. The 99% scenario is the user having a lot of input features; any output-feature-specific logic would then naturally belong to the specific output feature.
  2. Once we implement the hierarchical encoder and decoder sections, it should be easy for users to separate the input- and output-feature-specific sections (sketched below).
  3. This would be consistent with the existing behavior of the global preprocessing section (which we wish to replace with this defaults section).
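A hedged sketch of point 2: once encoder and decoder sections become hierarchical, input- vs. output-specific defaults separate naturally by nesting. The exact schema shown here is speculative, not a confirmed Ludwig config shape:

defaults_section = {
    "text": {
        "encoder": {"num_filters": 256},  # would apply to text input features
        "decoder": {"cell_type": "gru"},  # would apply to text output features
    }
}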

@arnavgarg1 (Contributor, Author)

@tgaddair Makes sense, I'll update this to not be feature group (input or output) specific for now and make sure both input and output features use the same shared parameters.

@arnavgarg1 arnavgarg1 merged commit 236cf54 into master Jun 29, 2022
@arnavgarg1 arnavgarg1 deleted the hyperopt_shared_parameters branch June 29, 2022 17:31
Labels: feature (New feature or request)
Projects: None yet

Successfully merging this pull request may close these issues:

  • [Hyperopt] Allow default shared parameter search spaces

5 participants