Allow passing fixed transformers to evaluations #367
Hi @PierreGtch! You don't need to change the evaluation inside MOABB. You will need to build a function or transformation using a pre-trained deep learning model to extract this new feature space and use this representation for the classification step. The scikit-learn library has enough flexibility for you. A possible path is to follow this tutorial: https://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#sphx-glr-auto-examples-ensemble-plot-feature-transformation-py and translate it for deep learning, something like:

```python
from sklearn.preprocessing import FunctionTransformer

pre_trained_torch_model = ...

def feature_from_deep_learning(X, model):
    return model(X)

transformation_step = FunctionTransformer(func=feature_from_deep_learning,
                                          kw_args={"model": pre_trained_torch_model})
```

Maybe you will need some trick or another to indicate that the model is already "fitted", and pay attention to how you load the model to ensure there is no data leakage.
As Igor and I are facing a similar issue, I asked for his opinion on the subject.
Hi @bruAristimunha, for example, in the (I can do the implementation, of course)
If I understand correctly, and you are using skorch, you just need to extract the trained PyTorch model.

Now, suppose you want to apply a more agnostic transformation to the dataset before the split. In that case, I think it wouldn't be the evaluation that would need to change, but the paradigm, together with the pre-processing functions (i.e., resample). It seems to me that we are agreeing but in different words. Maybe it would be good to talk it through with a concrete example. Perhaps we can follow up with a small code example. I was wondering, can you make one?
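For instance, a minimal sketch of what "extract the trained PyTorch model" could look like with skorch (the variable names here are illustrative assumptions):

```python
from sklearn.preprocessing import FunctionTransformer

trained_net = ...  # a skorch NeuralNetClassifier that has already been fit
pytorch_module = trained_net.module_  # the underlying trained torch.nn.Module

# Reuse the feature-extraction function sketched earlier in the thread.
transformation_step = FunctionTransformer(
    func=feature_from_deep_learning,
    kw_args={"model": pytorch_module},
)
```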
Yes, using a PyTorch model in a transformation function was not a problem, it was just an example. Here are code examples.

Currently, we have to do this:

```python
from sklearn.pipeline import make_pipeline
from moabb.evaluations import WithinSessionEvaluation
from moabb.paradigms import LeftRightImagery

transformer = ...
classifier = ...
pipelines = {'transformer+classifier': make_pipeline(transformer, classifier)}
paradigm = LeftRightImagery()
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines)
```

What I propose could be implemented in two different ways.

Option 1:

```python
pipelines = {'classifier': classifier}
paradigm = LeftRightImagery(transformer=transformer)
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines)
```

Option 2:

2.a. the same transformer for all pipelines:

```python
pipelines = {'classifier1': classifier1, 'classifier2': classifier2}
paradigm = LeftRightImagery()
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines, transformers=transformer)
```

2.b. or a different one for each:

```python
pipelines = {'classifier1': classifier1, 'classifier2': classifier2}
transformers = {'classifier1': transformer1, 'classifier2': transformer2}
paradigm = LeftRightImagery()
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines, transformers=transformers)
```
I like the first option; what is your preference, @sylvchev?
Hi @PierreGtch |
Hi @sylvchev, I also prefer option 2; I think it can be simpler to understand for users.

What do you mean by caching the transformer results? On disk or in memory? I was assuming we should cache them in memory while they are used, within each call to `process`.

Also, I just had another question: do you think we should also introduce a parameter like `transformer_suffix`?

```python
result1 = eval.process(pipelines, transformer1, transformer_suffix="_transformer1")
result2 = eval.process(pipelines, transformer2, transformer_suffix="_transformer2")
```

But when `process` is called several times with different transformers, the results of the different calls would be saved under the same pipeline names. Without a suffix, we could not tell apart which transformer produced which results.

An option 3 could be to still pass the transformers as a dict, and use its keys to name the results.

Also, a potential risk is that some unsupervised algorithms make use of the data passed to `transform`, which could cause data leakage.
I think I now prefer option 3: this way, the pipelines and transformers parameters would have relatively similar behaviours and would not depend too much on each other. Also, there wouldn't be multiple cases as with option 1. What do you think @bruAristimunha @sylvchev?
I was thinking about a disk cache. This could be useful when there is computationally intensive preprocessing or transformation of the dataset. I used this kind of disk caching in a previous project and it worked quite well to speed up computation.
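For illustration, a minimal sketch of this kind of disk caching with joblib (the cache directory and wrapper function are assumptions, not MOABB code):

```python
from joblib import Memory

memory = Memory("./moabb_transform_cache", verbose=0)

@memory.cache
def cached_transform(transformer, X):
    # Recomputed only for unseen (transformer, X) inputs; otherwise loaded from disk.
    return transformer.transform(X)
```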
I think adding a suffix to the pipeline name that depends on the applied transformer is a good idea. It could be simpler to extract it directly from the transformer dict key rather than to add a specific argument to the `process` call.
I am not sure I understand why it is necessary to change the key between calls. The same transformer applied to the same pipeline should give the same pipeline+suffix name, shouldn't it?
Yes, I also prefer this 3rd option.
The leakage of information is a major risk. I think we could guard against that by checking that the transformer estimator or pipeline is purely unsupervised and does not use the label information, and raising a warning if it does.
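As a rough illustration of such a check, a heuristic sketch (the function name is an assumption, and a real check would likely need to be more thorough):

```python
from warnings import warn
from sklearn.base import is_classifier, is_regressor
from sklearn.pipeline import Pipeline

def warn_if_supervised(transformer):
    # Walk the steps of a Pipeline, or treat a bare estimator as a single step.
    steps = transformer.steps if isinstance(transformer, Pipeline) else [("transformer", transformer)]
    for name, est in steps:
        if is_classifier(est) or is_regressor(est):
            warn(f"Transformer step '{name}' looks supervised; it may use label "
                 "information and leak it into the evaluation.")
```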
Hi @sylvchev, here is a sketch:

```python
for subject in dataset.subject_list:
    X_no_tf, labels, metadata = paradigm.get_data(dataset, [subject])
    for name_tf, transformer in transformers.items():
        X = transformer.transform(X_no_tf)
        for session in metadata.session.unique():
            ix = metadata.session == session
            for name_pipe, pipeline in pipelines.items():
                ...
                pipeline.fit(X[ix], labels[ix])
                ...
                results.add(name=name_tf + " + " + name_pipe, ...)
```

Do you have any preliminary comments?
Yes, it would be great to have such a caching mechanism automatically taken care of by MOABB! I created a new issue for that: #385.
Even unsupervised algorithms can be problematic if they train on the (unlabelled) data we ask them to transform. For example, in a cross-session evaluation, we will probably pass the data from all the sessions simultaneously to the transformer. If this transformer uses all the sessions to train an unsupervised algorithm, it breaks the train/test separation...
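To make the risk concrete, a small sketch with PCA as a stand-in unsupervised transformer (the arrays are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA

X_session1 = np.random.randn(50, 8)  # hypothetical training session
X_session2 = np.random.randn(50, 8)  # hypothetical test session

# Leaky: fitting on both sessions means the representation of the training
# data already depends on the test session.
pca_leaky = PCA(n_components=4).fit(np.concatenate([X_session1, X_session2]))

# Respects the train/test separation: fit only on the training session.
pca_safe = PCA(n_components=4).fit(X_session1)
```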
Now we have PR #408. The API would be:

```python
pipelines = {'classifier1': classifier1, 'classifier2': classifier2}
paradigm = LeftRightImagery()
evaluation = WithinSessionEvaluation(paradigm=paradigm)
results = evaluation.process(pipelines, transformer=transformer)
```
Closed with #408.
This is not completely implemented in #408. Currently, we can only pass a fixed transformer to
Oh sorry >.<
Closed by #372, which implements option 2.a. (see above).
Hi @sylvchev,
I think it would be very convenient to allow passing "fixed" sklearn transformer(s) (i.e. with a `transform` method but no `fit` method) to the evaluations. A pipeline starting with such a fixed transformer currently applies the transformation to the data k-fold times when being evaluated. One time is enough if the transformer does not need to be trained. If we evaluate multiple pipelines all starting with the same transformer, the time gain can be even greater.
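For illustration, a minimal sketch of such a "fixed" transformer (the class is hypothetical; in practice sklearn pipelines still call `fit`, so it is a no-op here):

```python
from sklearn.base import BaseEstimator, TransformerMixin

class FixedTransformer(BaseEstimator, TransformerMixin):
    """Applies a fixed, already-trained transformation; nothing is learned."""

    def __init__(self, func):
        self.func = func

    def fit(self, X, y=None):
        return self  # no-op: the transformation is fixed

    def transform(self, X):
        return self.func(X)
```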
concrete use-case
I use pre-trained (and frozen) neural networks for feature extraction (with skorch). The expensive part of the evaluation is the feature extraction. Training a classifier on the extracted features is relatively fast.
implementation
Implementing that in `BaseParadigm.get_data` would require only small changes to the different evaluations. What do you think?
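As a rough sketch of the idea (not the actual MOABB implementation), the paradigm could apply the fixed transformer once, right after loading the data, so the evaluations never re-run it inside their cross-validation loops:

```python
class TransformedParadigm:
    """Hypothetical wrapper: applies a fixed transformer inside get_data."""

    def __init__(self, paradigm, transformer):
        self.paradigm = paradigm
        self.transformer = transformer

    def get_data(self, dataset, subjects):
        X, labels, metadata = self.paradigm.get_data(dataset, subjects)
        return self.transformer.transform(X), labels, metadata
```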