
Speed up dependencies indexing #243

Merged
merged 28 commits into from
Jun 9, 2021

Conversation


@Remi-Gau Remi-Gau commented May 31, 2021

fixes #209
supersedes #185

  • speed up indexing by copying dependencies from the previously indexed file when possible / appropriate
  • improve regexp on filename and prefix parsing
  • minor changes in copy_to_derivatives
    • unzip files directly when possible
    • make pipeline name an obligatory argument
    • change default output path
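The core of the speed-up can be sketched roughly as follows (a hypothetical simplification in MATLAB; the variable and helper names, including list_dependencies, are illustrative and not the actual bids-matlab internals):

```matlab
% Sketch: while indexing files in order, reuse the dependencies of the
% previously indexed file when both share the same base name (same
% entities, different extension), instead of rescanning the directory.
% 'list_dependencies' is a hypothetical stand-in for the expensive lookup.
previous_base = '';
previous_deps = {};
for iFile = 1:numel(file_list)
  [~, basename] = fileparts(regexprep(file_list{iFile}, '\.gz$', ''));
  if strcmp(basename, previous_base)
    deps = previous_deps;                       % cheap copy
  else
    deps = list_dependencies(file_list{iFile}); % expensive directory scan
  end
  previous_base = basename;
  previous_deps = deps;
end
```

Since consecutive files in a sorted listing often differ only by extension (e.g. .nii and .json siblings), this avoids most of the repeated directory scans.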

@Remi-Gau Remi-Gau linked an issue Jun 1, 2021 that may be closed by this pull request
@Remi-Gau Remi-Gau requested a review from nbeliy June 1, 2021 04:53

@nbeliy nbeliy left a comment


It should work. The only problem is that you don't clone the last file when it has the same base name.
So in that case you need to update the metafile (just copy it from the previous file?) and the already calculated dependencies.


Remi-Gau commented Jun 2, 2021

@nbeliy

Thanks for all those pointers.

I think that I have implemented your suggestion.

Added some tests to make sure that we get the correct number of files listed in dependencies for a couple of datasets.

Can you cross check that:

  • it behaves as expected on your datasets
  • I have not messed up something that would kill the speed optimization


codecov bot commented Jun 2, 2021

Codecov Report

Merging #243 (3c49dd6) into dev (bb8a80a) will not change coverage.
The diff coverage is 0.00%.


@@          Coverage Diff          @@
##             dev    #243   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files         31      31           
  Lines       1446    1480   +34     
=====================================
- Misses      1446    1480   +34     
Flag Coverage Δ
unittests 0.00% <0.00%> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
+bids/+internal/append_to_layout.m 0.00% <0.00%> (ø)
+bids/+internal/parse_filename.m 0.00% <0.00%> (ø)
+bids/copy_to_derivative.m 0.00% <0.00%> (ø)
+bids/layout.m 0.00% <0.00%> (ø)

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb8a80a...3c49dd6.


nbeliy commented Jun 2, 2021

For some reason GitHub doesn't allow me to see the changes when I am logged in. I'll put my review in comments.

append_to_layout.m:
l31: Should we put a warning here, to tell the user that some files are skipped?
l74: Just in case BIDS adds more formats that are folders, it would be more generic to test whether the path is a directory.

copy_to_derivatives.m:
l3: Misleading description; in the implementation the function returns nothing. But it would be good to return the path to the created derivatives folder (e.g. derivatives/bids-matlab).
l24: To my taste, the pipeline name argument must be mandatory.
l94: From my understanding of [BIDS](https://bids-specification.readthedocs.io/en/stable/02-common-principles.html#storage-of-derived-datasets), by default derivatives should be within the dataset folder, so out_path = fullfile(BIDS.pth, 'derivatives');. Also, you provide a default value for out_path -- it will never be empty, so the if will never trigger.
l258: This will be slow. You first copy the data, then extract it, and then remove the zipped files. You could extract the files directly to the destination; for example, in copy_with_symlink add if ends_with('.gz') && unzip; unzip_data(...). Obviously, in that case you need to drop the removal of the original zipped files.

layout.m:
It takes ~4s per subject (each containing ~800 files), so performance is improved.
However it treats JSON files as data; the search patterns must be adapted, e.g. as in #185.

Otherwise, it looks good to me.


Remi-Gau commented Jun 2, 2021

layout.m:
It takes ~4s per subject (each containing ~800 files), so performance is improved.

😄

However it treats JSON files as data; the search patterns must be adapted, e.g. as in #185.

Yup, I am working on that in a different PR, in large part inspired by #185.


Remi-Gau commented Jun 2, 2021

append_to_layout.m:
l31: Should we put a warning here, to tell the user that some files are skipped?
l74: Just in case BIDS adds more formats that are folders, it would be more generic to test whether the path is a directory.

Done


nbeliy commented Jun 2, 2021

Tell me when I can retest


Remi-Gau commented Jun 2, 2021

l24: To my taste, the pipeline name argument must be mandatory.

Agreed. Done. Mind you, this changes the order of the input arguments.

l94: From my understanding of [BIDS](https://bids-specification.readthedocs.io/en/stable/02-common-principles.html#storage-of-derived-datasets), by default derivatives should be within the dataset folder, so out_path = fullfile(BIDS.pth, 'derivatives');. Also, you provide a default value for out_path -- it will never be empty, so the if will never trigger.

Good point. Actually the whole question of folder organization for derivatives and raw data is mostly a set of suggestions, but I agree that we can make nesting the derivatives inside the raw dataset the default.

l258: This will be slow. You first copy the data, then extract it, and then remove the zipped files. You could extract the files directly to the destination; for example, in copy_with_symlink add if ends_with('.gz') && unzip; unzip_data(...). Obviously, in that case you need to drop the removal of the original zipped files.

Yup, I know, but I am pretty sure (as far as I can recall) that I had to do things that way because the behavior of Octave and Matlab differs slightly for gunzip. Need to double check.


nbeliy commented Jun 2, 2021

Yup, I know, but I am pretty sure (as far as I can recall) that I had to do things that way because the behavior of Octave and Matlab differs slightly for gunzip. Need to double check.

From a quick look at the docs, the only difference I could spot is that Matlab automatically creates the folder if it does not exist. But in any case, this has no relation to the dependency calculations.


Remi-Gau commented Jun 2, 2021

The latest commits implement #185 but relax the constraint so filenames can contain more than one instance of the string sub-: we'll just assume that the prefix is whatever comes before the first one.
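A minimal sketch of that relaxed parsing rule (the pattern below is illustrative, not necessarily the exact regexp used in parse_filename.m):

```matlab
% Assume the prefix is whatever comes before the FIRST 'sub-' occurrence;
% a lazy quantifier keeps any later 'sub-' inside the remainder.
name = 'swasub-01_task-rest_bold.nii';
tokens = regexp(name, '^(.*?)(sub-.*)$', 'tokens', 'once');
prefix    = tokens{1};  % 'swa'
remainder = tokens{2};  % 'sub-01_task-rest_bold.nii'
```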


Remi-Gau commented Jun 2, 2021

So the difference in gunzip behavior (and not unzip, as the name of the function implies) between Octave and Matlab is that Octave deletes the source file.

Here is an MWE below.

Will try to add a workaround with an if is_octave to keep the speed-up with Matlab.

Not sure if gunzip plays nice with symbolic links when working with DataLad datasets. Will try it out.

Script

src_folder = fullfile(pwd, 'input');
tgt_folder = fullfile(pwd, 'output');

% set up
copyfile('sub-01_T1w.nii.gz', src_folder);

% do
gunzip(fullfile(src_folder, 'sub-01_T1w.nii.gz'), tgt_folder)

Starting point

├── gunzip_cmp_matlab_octave.m
├── input
├── output
└── sub-01_T1w.nii.gz

Results

With Matlab

├── gunzip_cmp_matlab_octave.m
├── input
│   └── sub-01_T1w.nii.gz
├── output
│   └── sub-01_T1w.nii
└── sub-01_T1w.nii.gz

With Octave

├── gunzip_cmp_matlab_octave.m
├── input
├── output
│   └── sub-01_T1w.nii
└── sub-01_T1w.nii.gz
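Under the assumption stated above (Octave's gunzip deletes the source file, Matlab's keeps it), one possible shape for the if is_octave workaround could be the following (a sketch only, not the actual implementation):

```matlab
% Hypothetical workaround: only under Octave, gunzip a copy of the archive
% so the original .gz survives; under Matlab, extract directly.
src = fullfile(src_folder, 'sub-01_T1w.nii.gz');
if exist('OCTAVE_VERSION', 'builtin') ~= 0   % the usual is_octave test
  copyfile(src, tgt_folder);                 % sacrifice a copy instead
  gunzip(fullfile(tgt_folder, 'sub-01_T1w.nii.gz'));  % Octave deletes this copy
else
  gunzip(src, tgt_folder);                   % Matlab keeps the source
end
```

This keeps the direct-extraction speed-up on Matlab while paying the extra copy only on Octave.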


Remi-Gau commented Jun 2, 2021

Tell me when I can retest

@nbeliy

It should be good for another cross check.

Added a test for copying to derivatives and unzipping, to try to make sure Octave and Matlab behave the same way on Linux.

Will quickly run things on Windows.


@nbeliy nbeliy left a comment


Looks good to me. Performance on my tests is the same:

>> tic; BIDS = bids.layout('funct_3D', false); toc;
Elapsed time is 13.879293 seconds.
>> size(bids.query(BIDS, 'data')) 
ans =
        4793           1

It is still a noticeable wait, but I have the impression we are hitting the limits of Matlab itself.

I clicked Approve, but for some reason it didn't work. For me it works and I "approve" the merge :-D


nbeliy commented Jun 3, 2021

P.S. How are the advances on the regexp for file selection going? I can't find the corresponding merge request.


Remi-Gau commented Jun 3, 2021

P.S. How are the advances on the regexp for file selection going? I can't find the corresponding merge request.

Sorry.

It was done on a local branch that I merged here:

e43f2e3


nbeliy commented Jun 3, 2021

It was done on a local branch that I merged here:

OK, that's why I still have JSON files -- you exclude them only for datasets following the schema.
Is there a reason why schema-less datasets should have JSON files as data?


Remi-Gau commented Jun 3, 2021

Is there a reason why schema-less datasets should have JSON files as data?

It was suggested by @arnodelorme here:
#97 (comment)


nbeliy commented Jun 3, 2021

It was suggested by @arnodelorme here:

Then would it be possible to get JSON files only if they are not metadata for data?
A JSON file will be created for each file in the derivatives, so to do a query I would need to add 'extension', '.nii' each time.

Also, if one wants to get the metadata JSON files, they can be retrieved with query('metafiles')


Remi-Gau commented Jun 4, 2021

It was suggested by @arnodelorme here:

Then would it be possible to get JSON files only if they are not metadata for data?
A JSON file will be created for each file in the derivatives, so to do a query I would need to add 'extension', '.nii' each time.

Also, if one wants to get the metadata JSON files, they can be retrieved with query('metafiles')

Would you be OK if I opened a separate issue for this? I feel that otherwise we will end up changing behavior in this PR, which was supposed to be only about optimization and has already grown into way more than that.

Tempted to have an issue + meeting where several of the people interested in this can come together and agree on what we would like the API and the behavior to look like for a first version.


nbeliy commented Jun 4, 2021

Tempted to have an issue + meeting where several of the people interested in this can come together and agree on what we would like the API and the behavior to look like for a first version.

It could be a good idea, if I am invited :)


Remi-Gau commented Jun 4, 2021

Tempted to have an issue + meeting where several of the people interested in this can come together and agree on what we would like the API and the behavior to look like for a first version.

It could be a good idea, if I am invited :)

Most definitely.

@Remi-Gau Remi-Gau merged commit 63d6215 into bids-standard:dev Jun 9, 2021
@Remi-Gau Remi-Gau deleted the speed_up_dependencies_indexing branch June 15, 2021 12:45

Successfully merging this pull request may close these issues.

Inefficient dependencies calculations