Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster by=key(DT)[1] and keyby= index #3040

Merged
merged 10 commits into from
Sep 13, 2018
Merged

Faster by=key(DT)[1] and keyby= index #3040

merged 10 commits into from
Sep 13, 2018

Conversation

mattdowle
Copy link
Member

@mattdowle mattdowle commented Sep 8, 2018

N = 5e8
set.seed(1)
DT = data.table(
  id = sample(LETTERS[1:6], N, replace=TRUE),
  val = sample(10, N, replace=TRUE)
)
setkey(DT,id)

Before this PR :

> system.time(DT[,sum(val),by=id,verbose=TRUE]) 
Detected that j uses these columns: val 
Finding groups using uniqlist ... 6.924sec 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.002sec 
lapply optimization is on, j unchanged as 'sum(val)'
GForce optimized j to 'gsum(val)'
Making each group and running j (GForce TRUE) ... 3.156sec 
   user  system elapsed 
 10.567   0.601  11.168 

With this PR :

> system.time(DT[,sum(val),by=id,verbose=TRUE]) 
Detected that j uses these columns: val 
Finding groups using uniqlist ... 0.373sec 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.002sec 
lapply optimization is on, j unchanged as 'sum(val)'
GForce optimized j to 'gsum(val)'
Making each group and running j (GForce TRUE) ... 3.190sec 
   user  system elapsed 
  4.046   0.615   4.662 

The uniqlist part reduces from 6.9s to 0.4s.

ToDo:

  • add missing test of by=key(DT)[1] where that is integer64 wasn't missing.
  • numeric rounding to pass test 1196.2

Found as a result of investigating and thanks to #2962.

@mattdowle mattdowle changed the title Added simpler one-column case to uniqlist Faster by=key(DT)[1] Sep 8, 2018
@jangorecki
Copy link
Member

do we want this in patch release? I think there are potentially other improvements we could do here, thus might make sense to postpone to 1.12.0

@codecov
Copy link

codecov bot commented Sep 12, 2018

Codecov Report

Merging #3040 into master will decrease coverage by <.01%.
The diff coverage is 92.2%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3040      +/-   ##
==========================================
- Coverage   90.87%   90.87%   -0.01%     
==========================================
  Files          61       61              
  Lines       11740    11783      +43     
==========================================
+ Hits        10669    10708      +39     
- Misses       1071     1075       +4
Impacted Files Coverage Δ
R/setkey.R 83.33% <100%> (+0.21%) ⬆️
src/forder.c 97.92% <100%> (ø) ⬆️
R/data.table.R 91.71% <90.47%> (ø) ⬆️
src/uniqlist.c 95.1% <91.83%> (-1.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5049e56...294a16b. Read the comment docs.

@jangorecki
Copy link
Member

The other improvements I had in mind was #3042, and it will probably takes some time. Lets get this one merged for 1.11.6, they don't seems to be that much related.

Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mattdowle mattdowle added this to the 1.11.6 milestone Sep 12, 2018
@mattdowle mattdowle changed the title Faster by=key(DT)[1] Faster by=key(DT)[1] and keyby= index Sep 13, 2018
@mattdowle mattdowle merged commit 073c284 into master Sep 13, 2018
@mattdowle mattdowle deleted the faster_uniqlist branch September 13, 2018 01:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants