[.data.table needs to be more modular. #852

Open · arunsrinivasan opened this issue Oct 3, 2014 · 11 comments

Comments

@arunsrinivasan (Member)

No description provided.

@rsaporta (Contributor) commented Oct 4, 2014

Can you elaborate on this?

@jangorecki (Member)

My suggestion regarding this issue would be to isolate [.data.table and bmerge by adding a separate layer function that prepares the i data (key manipulation, attributes, etc.) from the i argument (and possibly some others).
At the moment the auto.index feature could fit in that new intermediate function, and there are multiple FRs which could be implemented there without increasing the complexity of [.
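
A minimal sketch of how that intermediate layer could look, assuming a hypothetical prepare_i() helper (the name and return shape are illustrative only, not existing data.table internals):

library(data.table)

# hypothetical sketch only -- prepare_i() does not exist in data.table;
# it just illustrates pulling the i-preparation out of [.data.table
prepare_i = function(x, i) {
  list(
    i       = i,
    is_join = is.data.table(i),                # join vs. plain row subset
    on_key  = is.data.table(i) && haskey(x),   # could bmerge use the key of x?
    icols   = if (is.data.table(i)) names(i) else NULL
  )
}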

@jangorecki (Member)

Any ideas on this?

Currently [.data.table has huge chunks of code like if (missing(i)) ..., if (missing(j)) ..., # Grouping ..., each with a lot of nested conditions.
I think separating the evaluation of control-flow conditions from the computation on the dataset would make the code more readable. It would also isolate the scenarios to compute, resulting in more modular code.
A high-level view of the current [.data.table implementation, based on a few samples of the code:

...
if (!missing(i)) {
        ...
        if (is.call(isub) && isub[[1L]] == as.name("order") && getOption("datatable.optimize") >= 1) {...}
        ...
        if (is.call(isub) && isub[[1L]] == quote(forder)) {...} else if (is.call(isub) && getOption("datatable.auto.index") &&
                 as.character(isub[[1L]]) %chin% c("==","%in%") &&
                 is.name(isub[[2L]]) &&
                 (isub2<-as.character(isub[[2L]])) %chin% names(x) && 
                 is.null(attr(x, '.data.table.locked'))) {...}
} else {...}
...
if (missing(j)) {...} else {
        ...
        if (is.call(jsub) && jsub[[1L]]==as.name(":=")) {...}
        ...
}
...

The above code in the proposed structure:

# control flow conditions evaluation
do_i = !missing(i)
do_j = !missing(j)
optimize_i = do_i && getOption("datatable.optimize") >= 1 && is.call(isub) && isub[[1L]] == as.name("order")
x_unlocked = is.null(attr(x, '.data.table.locked'))
create_index = do_i && getOption("datatable.auto.index") && x_unlocked && is.call(isub) && as.character(isub[[1L]]) %chin% c("==","%in%") && is.name(isub[[2L]]) && as.character(isub[[2L]]) %chin% names(x)
update_by_ref = do_j && is.call(jsub) && jsub[[1L]]==as.name(":=")

# computation on dataset
if(do_i){
        if(optimize_i) {...}
        if(create_index) {...}
        ...
} else {...}
...
if (!do_j) {...} else {
        ...
        if (update_by_ref) {...}
        ...
}
...

In the above code, calls of the kind is.call(x) && x[[1L]] == ... could be replaced by some if_call_and_if_matching_to wrapper function.
Additionally, besides the control-flow and computation-on-dataset sections, there could be a section to compute variables like leftcols, rightcols, byval, ansvars, etc., or to prepare the auto.index (see the post above).
If we were able to use only a single return at the end of the function, that would also simplify handling the visibility/invisibility of the returned object.
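
For example, a sketch of such a wrapper (the name below is shortened and purely illustrative, not an existing data.table helper):

is_call_to = function(e, funs) {
  is.call(e) && is.name(e[[1L]]) && as.character(e[[1L]]) %in% funs
}
# the control flow section above could then read:
# optimize_i    = do_i && getOption("datatable.optimize") >= 1 && is_call_to(isub, "order")
# create_index  = do_i && getOption("datatable.auto.index") && x_unlocked && is_call_to(isub, c("==","%in%")) && ...
# update_by_ref = do_j && is_call_to(jsub, ":=")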

If you would consider switching to such a structure, I can prepare a PR after the 1.9.6 release. The PR would be only a first step, just wrapping the current structure into the new shape, yet enough to pass all unit tests, and it could later be updated to reduce the huge chunks into more isolated code blocks.

@arunsrinivasan (Member, Author)

@jangorecki thanks. Yes, it should be more functional, agreed. I'll be working on this for v1.9.8; I have spent some time on it already.

@jangorecki (Member)

@arunsrinivasan you could push your dev version to a new branch and maybe put a task list of TODO features here, so the goals are defined. It might additionally bring some feedback from potential contributors.

@jangorecki (Member)

Ideally, the processing would be isolated into atomic functions, like process.i(subi) and process.j(subj).
I am not sure how big the overhead of a solution like this would be, but it could probably be avoided by simplifying the code. This is a skeleton of how the isolation could be done; just run it in the console: https://gist.github.com/jangorecki/1ef34c0ccdc947ae7a29c158330e7e92
The example meta.data.table is like a data.table but remembers the most recent transformation (i, j, by) and the source data, together with call information and statistics.
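
A rough, self-contained illustration of the idea (the gist above has the real skeleton; process_i()/process_j() here are hypothetical and far simpler than what [.data.table needs):

library(data.table)

process_i = function(x, isub, env = parent.frame()) {
  irows = eval(isub, x, env)                 # evaluate i against the columns
  if (is.logical(irows)) irows = which(irows)
  irows
}
process_j = function(x, jsub, irows, env = parent.frame()) {
  eval(jsub, x[irows], env)                  # evaluate j on the row subset
}

dt = data.table(a = 1:5, b = letters[1:5])
irows = process_i(dt, quote(a > 2L))
process_j(dt, quote(paste(b, collapse = ",")), irows)
# [1] "c,d,e"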

@ColeMiller1 (Contributor)

@jangorecki I have been working on code to make .SDcols more modular. Specifically, my aim is to reuse similar code between j, by and .SDcols via a new function. Users would be able to use the same selection code in any of the above and get the same returns; it is largely that way currently, except the same bits of code live in three separate places. The function now works for .SDcols; the other variables are more tangled and need to be checked more thoroughly.
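
Roughly, the shared selection helper could look something like this (hypothetical name and signature, not the actual branch code):

library(data.table)

resolve_cols = function(x, sel) {
  nm = names(x)
  if (is.function(sel))  return(nm[vapply(x, sel, logical(1L))])  # e.g. is.numeric
  if (is.numeric(sel))   return(nm[sel])                          # column positions
  if (is.character(sel)) return(intersect(sel, nm))               # column names
  stop("unsupported column selection")
}

dt = data.table(a = 1:3, b = letters[1:3], c = runif(3))
resolve_cols(dt, is.numeric)   # "a" "c"
resolve_cols(dt, 2:3)          # "b" "c"
resolve_cols(dt, c("a", "z"))  # "a"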

I saw you assigned yourself some related items. Should I hold off?

@jangorecki (Member)

I assigned it to myself because I am working on it too. I will try to push today, so we can work together.

@ColeMiller1 (Contributor)

I have created the modular_dt2 branch. This is a tangled web, so this attempt was more of a light refactoring than full-fledged modularity. Most of the work was copying and pasting code elsewhere. In one case, we have if (notjoin) i = !i with a note to update that to allow for which_; I have incorporated that in this branch.

Help / feedback welcome. Most of my comments include !!! to distinguish the changes I made from the comments that already existed.

Single row subset - 1 row out of 1e5 rows

Branch               Expression                     Minimum time (microseconds)
1.12.9 master        CsubsetDT                      102
1.12.9 modular_dt2   dt[1L]                         130
1.12.9 with2         dt[1L, with = c(i = FALSE)]    173

column select - 1e5 rows; 3 columns

Branch               Expression   Minimum time (microseconds)
1.12.9 master        dt[]         15
1.12.9 modular_dt2   dt[, .SD]    67
1.12.9 master        dt[, .SD]    2190

Note: for modular_dt2, shallow is used when is.null(irows) && jsub == quote(.SD), so this might not be the fairest comparison. It is mainly a proof of concept of how we can exit j early.

looping and selecting single row:

## 1.12.9 master
##   user  system elapsed
##  18.60    4.53   24.18

## 1.12.9 with2
##   user  system elapsed
##  12.69    4.61   17.84

## 1.12.9 modular_dt2
##   user  system elapsed
##  10.71    3.20   14.39

## negating boolean index from parent.frame

library(data.table)
n = 1e7 ; grps = 1e2
set.seed(123L)
setDTthreads(1L)
dt = data.table(x = sample(grps, n, TRUE), y = runif(n), z = runif(n))
ind = sample(c(TRUE, FALSE), n, TRUE)
bench::mark(dt[!ind])
## 1.12.9 master
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dt[!ind]      133ms    191ms      5.82     192MB     11.6

## 1.12.9 modular_dt2
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dt[!ind]      115ms    138ms      7.16     154MB     12.5

@ColeMiller1 (Contributor)

@jangorecki, how does this actually get implemented? I was trying to do a smaller PR for the dt[!logical_ind] change, but the git diff looks like I changed everything.

FWIW, recompiling without OpenMP (for all tests) gives the following timings:

library(data.table)
##setDTthreads(1L) ## doesn't matter because OpenMP isn't compiled in
DoSomething <- function(row) someCalculation <- row[["v1"]] + 1
allIterations <- data.table(v1 = runif(1e5), v2 = runif(1e5))

system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, ]))
system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, , with=c(i=FALSE)]))
Version       Expression                    Time (seconds)
master        dt[r]                         17.8
with2         dt[r, with = c(i=FALSE)]      6.6
modular_dt2   dt[r]                         6.4
modular_dt2   dt[1L]                        5.9
data.frame    df[r, ]                       5.0
CsubsetDT     fx(r)                         2.3

So in summary, modular_dt2 is a 3x speed-up for a simple row subset without any new arguments. I am not saying this is what you were envisioning (I was hoping for something more modular / impressive, too), but I do not know how to get to a modular dt without hurting the open PRs or leaving reviewers wondering what was changed versus merely rearranged.

@jangorecki jangorecki removed the High label Jun 8, 2020
@MichaelChirico (Member)

Just for fun, I calculated the cyclomatic complexity of [.data.table:

cyclocomp::cyclocomp(data.table:::`[.data.table`)
# [1] 1033

For some context, lintr::cyclocomp_linter() has a default of 15, seeing a number higher than 30 is rare, and no other data.table function is higher than 200.
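
For comparison across the package, one rough way to rank every function in the data.table namespace by cyclomatic complexity (a sketch only; it loops over all internal objects, so expect it to take a while):

# rough sketch: list every closure in the namespace and score it
ns  = asNamespace("data.table")
fns = Filter(is.function, mget(ls(ns), envir = ns))
cc  = sort(sapply(fns, cyclocomp::cyclocomp), decreasing = TRUE)
head(cc)   # `[.data.table` should dominate by a wide margin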
