
Speed up dependencies indexing #243

Merged
merged 28 commits into from
Jun 9, 2021

Conversation


@Remi-Gau Remi-Gau commented May 31, 2021

fixes #209
supersedes #185

  • speed up indexing by copying dependencies from the previously indexed file when possible / appropriate
  • improve regexp on filename and prefix parsing
  • minor changes in copy_to_derivatives
    • unzip files directly when possible
    • make pipeline name an obligatory argument
    • change default output path
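The core of the speed-up can be sketched roughly as follows (a hypothetical simplification in MATLAB; the variable and helper names, including list_dependencies, are illustrative and not the actual bids-matlab internals):

```matlab
% Sketch: while indexing files in order, reuse the dependencies of the
% previously indexed file when both share the same base name (same
% entities, different extension), instead of rescanning the directory.
% 'list_dependencies' is a hypothetical stand-in for the expensive lookup.
previous_base = '';
previous_deps = {};
for iFile = 1:numel(file_list)
  [~, basename] = fileparts(regexprep(file_list{iFile}, '\.gz$', ''));
  if strcmp(basename, previous_base)
    deps = previous_deps;                       % cheap copy
  else
    deps = list_dependencies(file_list{iFile}); % expensive directory scan
  end
  previous_base = basename;
  previous_deps = deps;
end
```

Since consecutive files in a sorted listing often differ only by extension (e.g. .nii and .json siblings), this avoids most of the repeated directory scans.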

@Remi-Gau Remi-Gau linked an issue Jun 1, 2021 that may be closed by this pull request
@Remi-Gau Remi-Gau requested a review from nbeliy June 1, 2021 04:53

@nbeliy nbeliy left a comment


It should work. The only problem is that you don't clone the last file when it has the same base name.
So in that case you need to update the metafile (just copy it from the previous file?) and the already calculated dependencies.


Remi-Gau commented Jun 2, 2021

@nbeliy

Thanks for all those pointers.

I think that I have implemented your suggestion.

Added some tests to make sure that we get the correct number of files listed in dependencies for a couple of datasets.

Can you cross check that:

  • it behaves as expected on your datasets
  • I have not messed up something that would kill the speed optimization


codecov bot commented Jun 2, 2021

Codecov Report

Merging #243 (3c49dd6) into dev (bb8a80a) will not change coverage.
The diff coverage is 0.00%.


@@          Coverage Diff          @@
##             dev    #243   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files         31      31           
  Lines       1446    1480   +34     
=====================================
- Misses      1446    1480   +34     
Flag Coverage Δ
unittests 0.00% <0.00%> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
+bids/+internal/append_to_layout.m 0.00% <0.00%> (ø)
+bids/+internal/parse_filename.m 0.00% <0.00%> (ø)
+bids/copy_to_derivative.m 0.00% <0.00%> (ø)
+bids/layout.m 0.00% <0.00%> (ø)

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb8a80a...3c49dd6.


nbeliy commented Jun 2, 2021

For some reason GitHub doesn't allow me to see the changes when I am logged in. I'll put my review in comments.

append_to_layout.m:
l31: Should we put a warning here, to tell the user that some files are skipped?
l74: Just in case BIDS adds more formats that are folders, it would be more generic to test whether the path is a directory.

copy_to_derivatives.m:
l3: Misleading description; in the implementation the function returns nothing. But it would be good to return the path to the created derivatives folder (e.g. derivatives/bids-matlab).
l24: To my taste, the pipeline name argument must be mandatory.
l94: From my understanding of [BIDS](https://bids-specification.readthedocs.io/en/stable/02-common-principles.html#storage-of-derived-datasets), by default derivatives should be within the dataset folder, so out_path = fullfile(BIDS.pth, 'derivatives');. Also, you provide a default value for out_path -- it will never be empty, so the if will never trigger.
l258: This will be slow. You first copy the data, then extract it, and then remove the zipped files. You could extract the files directly to the destination; for example, in copy_with_symlink add if ends_with('.gz') && unzip; unzip_data(...). Obviously, in that case you need to drop the removal of the original zipped files.

layout.m:
It takes ~4s per subject (each containing ~800 files), so performance is improved.
However it treats JSON files as data; the search patterns must be adapted, e.g. as in #185.

Otherwise, it looks good to me.


Remi-Gau commented Jun 2, 2021

layout.m:
It takes ~4s per subject (each containing ~800 files), so performance is improved.

😄

However it treats JSON files as data; the search patterns must be adapted, e.g. as in #185.

Yup, I am working on that in a different PR, in large part inspired by #185.


Remi-Gau commented Jun 2, 2021

append_to_layout.m:
l31: Should we put a warning here, to tell the user that some files are skipped?
l74: Just in case BIDS adds more formats that are folders, it would be more generic to test whether the path is a directory.

Done


nbeliy commented Jun 2, 2021

Tell me when I can retest


Remi-Gau commented Jun 2, 2021

l24: To my taste, the pipeline name argument must be mandatory.

Agreed. Done. Mind you, this changes the order of the input arguments.

l94: From my understanding of [BIDS](https://bids-specification.readthedocs.io/en/stable/02-common-principles.html#storage-of-derived-datasets), by default derivatives should be within the dataset folder, so out_path = fullfile(BIDS.pth, 'derivatives');. Also, you provide a default value for out_path -- it will never be empty, so the if will never trigger.

Good point. Actually the whole question of folder organization for derivatives and raw data is mostly a set of suggestions, but I agree that we can make nesting the derivatives inside the raw dataset the default.

l258: This will be slow. You first copy the data, then extract it, and then remove the zipped files. You could extract the files directly to the destination; for example, in copy_with_symlink add if ends_with('.gz') && unzip; unzip_data(...). Obviously, in that case you need to drop the removal of the original zipped files.

Yup, I know, but I am pretty sure (as far as I can recall) that I had to do things that way because the behavior of Octave and Matlab differs slightly for gunzip. Need to double check.


nbeliy commented Jun 2, 2021

Yup, I know, but I am pretty sure (as far as I can recall) that I had to do things that way because the behavior of Octave and Matlab differs slightly for gunzip. Need to double check.

From a quick look at the docs, the only difference I could spot is that Matlab automatically creates the folder if it does not exist. But in any case, this has no relation to the dependency calculations.


Remi-Gau commented Jun 2, 2021

The latest commits implement #185 but relax the constraint so filenames can contain more than one instance of the string sub-: we'll just assume that the prefix is whatever comes before the first one.
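A minimal sketch of that relaxed parsing rule (the pattern below is illustrative, not necessarily the exact regexp used in parse_filename.m):

```matlab
% Assume the prefix is whatever comes before the FIRST 'sub-' occurrence;
% a lazy quantifier keeps any later 'sub-' inside the remainder.
name = 'swasub-01_task-rest_bold.nii';
tokens = regexp(name, '^(.*?)(sub-.*)$', 'tokens', 'once');
prefix    = tokens{1};  % 'swa'
remainder = tokens{2};  % 'sub-01_task-rest_bold.nii'
```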


Remi-Gau commented Jun 2, 2021

So the difference in gunzip behavior (and not unzip, as the name of the function implies) between Octave and Matlab is that Octave deletes the source file.

Here is an MWE below.

Will try to add a workaround with an if is_octave to keep the speed-up with Matlab.

Not sure if gunzip plays nice with symbolic links when working with DataLad datasets. Will try it out.

Script

src_folder = fullfile(pwd, 'input');
tgt_folder = fullfile(pwd, 'output');

% set up
copyfile('sub-01_T1w.nii.gz', src_folder);

% do
gunzip(fullfile(src_folder, 'sub-01_T1w.nii.gz'), tgt_folder)

Starting point

├── gunzip_cmp_matlab_octave.m
├── input
├── output
└── sub-01_T1w.nii.gz

Results

With Matlab

├── gunzip_cmp_matlab_octave.m
├── input
│   └── sub-01_T1w.nii.gz
├── output
│   └── sub-01_T1w.nii
└── sub-01_T1w.nii.gz

With Octave

├── gunzip_cmp_matlab_octave.m
├── input
├── output
│   └── sub-01_T1w.nii
└── sub-01_T1w.nii.gz
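Under the assumption stated above (Octave's gunzip deletes the source file, Matlab's keeps it), one possible shape for the if is_octave workaround could be the following (a sketch only, not the actual implementation):

```matlab
% Hypothetical workaround: only under Octave, gunzip a copy of the archive
% so the original .gz survives; under Matlab, extract directly.
src = fullfile(src_folder, 'sub-01_T1w.nii.gz');
if exist('OCTAVE_VERSION', 'builtin') ~= 0   % the usual is_octave test
  copyfile(src, tgt_folder);                 % sacrifice a copy instead
  gunzip(fullfile(tgt_folder, 'sub-01_T1w.nii.gz'));  % Octave deletes this copy
else
  gunzip(src, tgt_folder);                   % Matlab keeps the source
end
```

This keeps the direct-extraction speed-up on Matlab while paying the extra copy only on Octave.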


Remi-Gau commented Jun 2, 2021

Tell me when I can retest

@nbeliy

It should be good for another cross check.

Added a test for copying to derivatives and unzipping, to try to make sure Octave and Matlab behave the same way on Linux.

Will quickly run things on Windows.


@nbeliy nbeliy left a comment


Looks good to me. Performance on my tests is the same:

>> tic; BIDS = bids.layout('funct_3D', false); toc;
Elapsed time is 13.879293 seconds.
>> size(bids.query(BIDS, 'data')) 
ans =
        4793           1

It is still a noticeable wait, but I have the impression we are hitting the limits of Matlab itself.

I clicked Approve, but for some reason it didn't work. For me it works and I "approve" the merge :-D


nbeliy commented Jun 3, 2021

P.S. How are the advances on the regexp for file selection going? I can't find the corresponding merge request.


Remi-Gau commented Jun 3, 2021

P.S. How are the advances on the regexp for file selection going? I can't find the corresponding merge request.

Sorry.

It was done on a local branch that I merged here:

e43f2e3


nbeliy commented Jun 3, 2021

It was done on a local branch that I merged here:

OK, that's why I still have JSON files -- you exclude them only for datasets following the schema.
Is there a reason why schema-less datasets should have JSON files as data?


Remi-Gau commented Jun 3, 2021

Is there a reason why schema-less datasets should have JSON files as data?

It was suggested by @arnodelorme here:
#97 (comment)


nbeliy commented Jun 3, 2021

It was suggested by @arnodelorme here:

Then would it be possible to get JSON files only if they are not metadata for data?
A JSON file will be created for each file in the derivatives, so to do a query I would need to add 'extension', '.nii' each time.

Also, if one wants to get the metadata JSON files, they can be retrieved with query('metafiles')


Remi-Gau commented Jun 4, 2021

It was suggested by @arnodelorme here:

Then would it be possible to get JSON files only if they are not metadata for data?
A JSON file will be created for each file in the derivatives, so to do a query I would need to add 'extension', '.nii' each time.

Also, if one wants to get the metadata JSON files, they can be retrieved with query('metafiles')

Would you be OK if I opened a separate issue for this? I feel that otherwise we will end up changing behavior in this PR, which was supposed to be only about optimization and has already grown into way more than that.

Tempted to have an issue + meeting where several of the people interested in this can come together and agree on what we would like the API and the behavior to look like for a first version.


nbeliy commented Jun 4, 2021

Tempted to have an issue + meeting where several of the people interested in this can come together and agree on what we would like the API and the behavior to look like for a first version.

It could be a good idea, if I am invited :)


Remi-Gau commented Jun 4, 2021

Tempted to have an issue + meeting where several of the people interested in this can come together and agree on what we would like the API and the behavior to look like for a first version.

It could be a good idea, if I am invited :)

Most definitely.

@Remi-Gau Remi-Gau merged commit 63d6215 into bids-standard:dev Jun 9, 2021
@Remi-Gau Remi-Gau deleted the speed_up_dependencies_indexing branch June 15, 2021 12:45

Successfully merging this pull request may close these issues.

Inefficient dependencies calculations