[.data.table needs to be more modular. #852

Open · arunsrinivasan opened this issue Oct 3, 2014 · 11 comments

Comments

@arunsrinivasan (Member)

No description provided.

@rsaporta (Contributor) commented Oct 4, 2014

Can you elaborate on this?

@jangorecki (Member)

My suggestion regarding this issue would be to isolate [.data.table and bmerge by adding a separate layer function that prepares the i data (key manipulation, attributes, etc.) from the i argument (and possibly some others).
At the moment the auto.index feature could fit in that new intermediate function, and there are multiple FRs which could be implemented there without increasing the complexity of [.
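
A minimal sketch of how that intermediate layer could look, assuming a hypothetical prepare_i() helper (the name and return shape are illustrative only, not existing data.table internals):

library(data.table)

# hypothetical sketch only -- prepare_i() does not exist in data.table;
# it just illustrates pulling the i-preparation out of [.data.table
prepare_i = function(x, i) {
  list(
    i       = i,
    is_join = is.data.table(i),                # join vs. plain row subset
    on_key  = is.data.table(i) && haskey(x),   # could bmerge use the key of x?
    icols   = if (is.data.table(i)) names(i) else NULL
  )
}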

@jangorecki (Member)

Any ideas on this?

Currently [.data.table has huge chunks of code like if (missing(i)) ..., if (missing(j)) ..., # Grouping ..., each with a lot of nested conditions.
I think separating the evaluation of control-flow conditions from the computation on the dataset would make the code more readable. It would also isolate the scenarios to compute, resulting in more modular code.
A high-level view of the current [.data.table implementation, based on a few samples of the code:

...
if (!missing(i)) {
        ...
        if (is.call(isub) && isub[[1L]] == as.name("order") && getOption("datatable.optimize") >= 1) {...}
        ...
        if (is.call(isub) && isub[[1L]] == quote(forder)) {...} else if (is.call(isub) && getOption("datatable.auto.index") &&
                 as.character(isub[[1L]]) %chin% c("==","%in%") &&
                 is.name(isub[[2L]]) &&
                 (isub2<-as.character(isub[[2L]])) %chin% names(x) && 
                 is.null(attr(x, '.data.table.locked'))) {...}
} else {...}
...
if (missing(j)) {...} else {
        ...
        if (is.call(jsub) && jsub[[1L]]==as.name(":=")) {...}
        ...
}
...

The above code in the proposed structure:

# control flow conditions evaluation
do_i = !missing(i)
do_j = !missing(j)
optimize_i = do_i && getOption("datatable.optimize") >= 1 && is.call(isub) && isub[[1L]] == as.name("order")
x_unlocked = is.null(attr(x, '.data.table.locked'))
create_index = do_i && getOption("datatable.auto.index") && x_unlocked && is.call(isub) && as.character(isub[[1L]]) %chin% c("==","%in%") && is.name(isub[[2L]]) && as.character(isub[[2L]]) %chin% names(x)
update_by_ref = do_j && is.call(jsub) && jsub[[1L]]==as.name(":=")

# computation on dataset
if(do_i){
        if(optimize_i) {...}
        if(create_index) {...}
        ...
} else {...}
...
if (!do_j) {...} else {
        ...
        if (update_by_ref) {...}
        ...
}
...

In the above code, calls of the kind is.call(x) && x[[1L]] == ... could be replaced by some if_call_and_if_matching_to wrapper function.
Additionally, besides the control-flow and computation-on-dataset sections, there could be a section to compute variables like leftcols, rightcols, byval, ansvars, etc., or to prepare the auto.index (see the post above).
If we were able to use only a single return at the end of the function, that would also simplify handling the visibility/invisibility of the returned object.
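
For example, a sketch of such a wrapper (the name below is shortened and purely illustrative, not an existing data.table helper):

is_call_to = function(e, funs) {
  is.call(e) && is.name(e[[1L]]) && as.character(e[[1L]]) %in% funs
}
# the control flow section above could then read:
# optimize_i    = do_i && getOption("datatable.optimize") >= 1 && is_call_to(isub, "order")
# create_index  = do_i && getOption("datatable.auto.index") && x_unlocked && is_call_to(isub, c("==","%in%")) && ...
# update_by_ref = do_j && is_call_to(jsub, ":=")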

If you would consider switching to such a structure, I can prepare a PR after the 1.9.6 release. The PR would be only a first step, just wrapping the current structure into the new shape, yet enough to pass all unit tests, and it could later be updated to reduce the huge chunks into more isolated code blocks.

@arunsrinivasan (Member, Author)

@jangorecki thanks. Yes, it should be more functional, agreed. I'll be working on this for v1.9.8; I have spent some time on it already.

@jangorecki (Member)

@arunsrinivasan you could push your dev version to a new branch and maybe put a task list of TODO features here, so the goals are defined. It might additionally bring some feedback from potential contributors.

@jangorecki (Member)

Ideally, the processing would be isolated into atomic functions, like process.i(subi) and process.j(subj).
I am not sure how big the overhead of a solution like this would be, but it could probably be avoided by simplifying the code. This is a skeleton of how the isolation could be done; just run it in the console: https://gist.github.com/jangorecki/1ef34c0ccdc947ae7a29c158330e7e92
The example meta.data.table is like a data.table but remembers the most recent transformation (i, j, by) and the source data, together with call information and statistics.
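
A rough, self-contained illustration of the idea (the gist above has the real skeleton; process_i()/process_j() here are hypothetical and far simpler than what [.data.table needs):

library(data.table)

process_i = function(x, isub, env = parent.frame()) {
  irows = eval(isub, x, env)                 # evaluate i against the columns
  if (is.logical(irows)) irows = which(irows)
  irows
}
process_j = function(x, jsub, irows, env = parent.frame()) {
  eval(jsub, x[irows], env)                  # evaluate j on the row subset
}

dt = data.table(a = 1:5, b = letters[1:5])
irows = process_i(dt, quote(a > 2L))
process_j(dt, quote(paste(b, collapse = ",")), irows)
# [1] "c,d,e"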

@ColeMiller1 (Contributor)

@jangorecki I have been working on code to make .SDcols more modular. Specifically, my aim is to reuse similar code between j, by and .SDcols via a new function. Users would be able to use the same selection code in any of the above and get the same returns; it is largely that way currently, except the same bits of code live in three separate places. The function now works for .SDcols; the other variables are more tangled and need to be checked more thoroughly.
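
Roughly, the shared selection helper could look something like this (hypothetical name and signature, not the actual branch code):

library(data.table)

resolve_cols = function(x, sel) {
  nm = names(x)
  if (is.function(sel))  return(nm[vapply(x, sel, logical(1L))])  # e.g. is.numeric
  if (is.numeric(sel))   return(nm[sel])                          # column positions
  if (is.character(sel)) return(intersect(sel, nm))               # column names
  stop("unsupported column selection")
}

dt = data.table(a = 1:3, b = letters[1:3], c = runif(3))
resolve_cols(dt, is.numeric)   # "a" "c"
resolve_cols(dt, 2:3)          # "b" "c"
resolve_cols(dt, c("a", "z"))  # "a"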

I saw you assigned yourself some related items. Should I hold off?

@jangorecki (Member)

I assigned it to myself because I am working on it too. I will try to push today, so we can work together.

@ColeMiller1 (Contributor)

I have created the modular_dt2 branch. This is a tangled web, so this attempt was more of a light refactoring than full-fledged modularity. Most of the work was copying and pasting code elsewhere. In one case, we have if (notjoin) i = !i with a note to update that to allow for which_; I have incorporated that in this branch.

Help / feedback welcome. Most of my comments include !!! to distinguish the changes I made from the comments that already existed.

Single row subset - 1 row out of 1e5 rows

Branch               Expression                     Minimum time (microseconds)
1.12.9 master        CsubsetDT                      102
1.12.9 modular_dt2   dt[1L]                         130
1.12.9 with2         dt[1L, with = c(i = FALSE)]    173

column select - 1e5 rows; 3 columns

Branch               Expression   Minimum time (microseconds)
1.12.9 master        dt[]         15
1.12.9 modular_dt2   dt[, .SD]    67
1.12.9 master        dt[, .SD]    2190

Note: for modular_dt2, shallow is used when is.null(irows) && jsub == quote(.SD), so this might not be the fairest comparison. It is mainly a proof of concept of how we can exit j early.

looping and selecting single row:

## 1.12.9 master
##   user  system elapsed
##  18.60    4.53   24.18

## 1.12.9 with2
##   user  system elapsed
##  12.69    4.61   17.84

## 1.12.9 modular_dt2
##   user  system elapsed
##  10.71    3.20   14.39

## negating boolean index from parent.frame

library(data.table)
n = 1e7 ; grps = 1e2
set.seed(123L)
setDTthreads(1L)
dt = data.table(x = sample(grps, n, TRUE), y = runif(n), z = runif(n))
ind = sample(c(TRUE, FALSE), n, TRUE)
bench::mark(dt[!ind])
## 1.12.9 master
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dt[!ind]      133ms    191ms      5.82     192MB     11.6

## 1.12.9 modular_dt2
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dt[!ind]      115ms    138ms      7.16     154MB     12.5

@ColeMiller1 (Contributor)

@jangorecki, how does this actually get implemented? I was trying to do a smaller PR for the dt[!logical_ind] change, but the git diff looks like I changed everything.

FWIW, recompiling without OpenMP (for all tests) gives the following timings:

library(data.table)
##setDTthreads(1L) ## doesn't matter because OpenMP isn't compiled in
DoSomething <- function(row) someCalculation <- row[["v1"]] + 1
allIterations <- data.table(v1 = runif(1e5), v2 = runif(1e5))

system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, ]))
system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, , with=c(i=FALSE)]))
Version       Expression                    Time (seconds)
master        dt[r]                         17.8
with2         dt[r, with = c(i=FALSE)]      6.6
modular_dt2   dt[r]                         6.4
modular_dt2   dt[1L]                        5.9
data.frame    df[r, ]                       5.0
CsubsetDT     fx(r)                         2.3

So in summary, modular_dt2 is a 3x speed-up for a simple row subset without any new arguments. I am not saying this is what you were envisioning (I was hoping for something more modular / impressive, too), but I do not know how to get to a modular dt without hurting the open PRs or leaving reviewers wondering what was changed versus merely rearranged.

@jangorecki jangorecki removed the High label Jun 8, 2020
@MichaelChirico (Member)

Just for fun, I calculated the cyclomatic complexity of [.data.table:

cyclocomp::cyclocomp(data.table:::`[.data.table`)
# [1] 1033

For some context, lintr::cyclocomp_linter() has a default of 15, seeing a number higher than 30 is rare, and no other data.table function is higher than 200.
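
For comparison across the package, one rough way to rank every function in the data.table namespace by cyclomatic complexity (a sketch only; it loops over all internal objects, so expect it to take a while):

# rough sketch: list every closure in the namespace and score it
ns  = asNamespace("data.table")
fns = Filter(is.function, mget(ls(ns), envir = ns))
cc  = sort(sapply(fns, cyclocomp::cyclocomp), decreasing = TRUE)
head(cc)   # `[.data.table` should dominate by a wide margin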
