uniqueN could be GForce optimised + GForce could be optimised for := too. #3725
Comments
There are a couple of things here.

First point: When you use .N:

require(data.table)
foo <- function(n=3e8) {
  card <- 3000
  chars <- substr(openssl::sha2(as.character(1:card)), 1L, 5L)
  dist <- runif(card)
  DT <- data.table(
    A=sample(chars, n, TRUE, dist),
    B=sample(chars, n, TRUE, dist)
  )
  DT
}
set.seed(1L)
DT <- foo(5e7L)
DT[, .N, by=B, verbose=TRUE]
# Detected that j uses these columns: <none>
# Finding groups using forderv ... 0.687s elapsed (1.131s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.001s elapsed (0.001s cpu)
# Getting back original order ... 0.001s elapsed (0.001s cpu)
# lapply optimization is on, j unchanged as '.N'
# GForce optimized j to '.N' ### <~~~~~~~~
# Making each group and running j (GForce TRUE) ... 1.301s elapsed (1.651s cpu)
# B N
# 1: 71ee4 18647
# 2: b1718 31722
# 3: 2c1f3 33496
# 4: 13b3f 31041
# 5: 12132 19033
# ---
# 2994: 46635 20
# 2995: 5787a 23
# 2996: 7611f 57
# 2997: c30c6 39
# 2998: a8a2c 23

You can see that the expression is optimised to use GForce. Similarly, we need to optimise uniqueN to use GForce.

Second point: Even then, GForce is not yet applied when j uses :=, so that needs to be optimised too.

When both these are done, things should speed up. Until then, the best way to go about this (not benchmarked) would be:

unique(DT, by=c("A", "B"))[, .N, by=B]

I think this does what you want to do, but of course this returns an aggregated result, which means you'll have to join and update back to your original data.table.
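The "aggregate, then join and update back" step mentioned above could look roughly like the following. This is only a sketch on a toy table, not benchmarked; the column name n_unique_A is made up for the example, and the toy DT stands in for the one built by foo().

```r
library(data.table)

# Toy table standing in for the DT built by foo() above
DT <- data.table(
  A = c("a", "a", "b", "c", "c", "c"),
  B = c("x", "x", "x", "y", "y", "y")
)

# Step 1: drop duplicate (A, B) pairs, then count rows per B.
# The count is the number of distinct A values within each B.
agg <- unique(DT, by = c("A", "B"))[, .N, by = B]

# Step 2: update join. Copy the per-group count back onto every row of DT.
# n_unique_A is a hypothetical column name; i.N refers to column N of agg.
DT[agg, on = "B", n_unique_A := i.N]

print(DT)
```

On real data, the result can be checked against DT[, uniqueN(A), by=B] on a sample of groups before relying on it.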
@arunsrinivasan Second part of the title looks like a dupe of #1414
First part is dupe of #1120
I agree that both parts are dups. Closing this as it's clearly a dup. But would be nice to up the priority on this one since there seems to be some [...]
This issue follows a discussion during useR!2019 after the presentation of data.table by Arun @arunsrinivasan
Hello,
Thanks for the amazing job, I love data.table!
I am using uniqueN to verify l-diversity for anonymization purposes.
The data I am working with is around 30M rows, easily ingested by data.table.
Unfortunately uniqueN is not as fast as other functions.
I tried to parallelize the grouping using setDTthreads, as I can go up to 16 threads on my RStudio Server instance.
First I get a baseline benchmark using a simple sum over a numeric column.
Then I do basically the same thing but apply uniqueN over a character column (a factor would give the same results).
Here is the code for a reprex: https://github.com/phileas-condemine/repex_slow_uniqueN/blob/master/repex_slow_uniqueN.R
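For readers who do not want to open the link, the comparison described above can be sketched as follows. This is not the linked script: the table size, column names (g, x, s) and thread count are assumptions picked so the example runs quickly, and timings will differ by machine.

```r
library(data.table)

set.seed(1L)
n <- 1e6L  # hypothetical size, much smaller than the ~30M rows described above
DT <- data.table(
  g = sample(1000L, n, TRUE),    # grouping column
  x = runif(n),                  # numeric column for the baseline sum
  s = sample(letters, n, TRUE)   # character column for uniqueN
)

setDTthreads(2L)  # arbitrary thread count for the sketch

# Baseline: grouped sum over a numeric column
t_sum <- system.time(r_sum <- DT[, sum(x), by = g])

# Same grouping, but counting distinct values of a character column
t_uniq <- system.time(r_uniq <- DT[, uniqueN(s), by = g])

print(rbind(sum = t_sum[["elapsed"]], uniqueN = t_uniq[["elapsed"]]))
```

Adding verbose=TRUE to either call prints whether GForce was used for that j expression.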
Additional info :
Here is my session_info()
also lscpu call