R 3.4.3 -> 3.5 performance drop #2962
Thank you for reporting. Confirmed on latest devel.
library(data.table)
set.seed(1)
types <- c("A", "B", "C", "D", "E", "F")
obs <- 4e7
dt1 <- data.table(percent = round(runif(obs, min = 0, max = 1), digits = 2),
type = as.factor(sample(types, obs, replace = TRUE)))
system.time(dt1[, list(percent_total = sum(percent)), by = type])
Timings on 3.4.4 are the same as on 3.4.3.
I tested the latest devel on 3.5.1 and 3.4.4 and there is no performance difference anymore. It was likely fixed by the recent rework of many C functions. Please re-open if the issue is still valid for you on the latest devel.
@jangorecki
And 1.12.0 gives me a 2x time increase on both R versions (I think this is due to #3395):
It looks like I don't have permission to re-open this issue. Please re-open it if you can confirm my tests.
@s-Nick-s I think it is related to the number of threads (as in the link you mentioned). Below is R 3.5.2 vs R 3.4.4 using only a single thread in both cases, data.table 1.12.0.
library(data.table)
set.seed(1)
types <- c("A", "B", "C", "D", "E", "F")
setDTthreads(1L)
obs <- 4e6
dt1 <- data.table(percent = round(runif(obs, min = 0, max = 1), digits = 2),
type = as.factor(sample(types, obs, replace = TRUE)))
system.time(dt1[, list(percent_total = sum(percent)), by = type])
microbenchmark::microbenchmark(
  test1 <- dt1[, list(percent_total = sum(percent)), by = type],
  times = 30)
3.5.2
3.4.4
@jangorecki
Hi @jangorecki, I worked with @s-Nick-s a while back when he discovered the performance drop. I ran some tests recently, and although I find that 1.12.+ gives the same or better times on R 3.5+ vs R 3.4.4, I also find that running R 3.4.4 with data.table 1.11.8 is about 2x as fast as running this example on anything newer than that combination (any R version from R 3.5 all the way to R 4.0.0, with any version of data.table 1.12.x). (I did use
R 3.4.4 + data.table 1.11.8
R 3.6.3 + data.table 1.12.8
If you have the time, please make a clean install of R 3.4.4 + data.table 1.11.8 and verify this. Looking forward to your reply.
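When comparing timings across installs like this, it can help to print the versions and thread count next to each result. The snippet below is only a sketch of that idea, assuming data.table is already installed in the library being tested; the `env_info` name is made up for illustration.

```r
# Sketch: record which R / data.table combination produced a timing.
# Assumes data.table is installed; `env_info` is a hypothetical name.
library(data.table)

env_info <- list(
  r_version  = as.character(getRversion()),                 # running R version
  dt_version = as.character(packageVersion("data.table")),  # installed data.table version
  dt_threads = getDTthreads()                               # threads data.table will use
)
str(env_info)
```

Pasting this output together with each benchmark makes it easier to tell whether two sets of numbers really come from the combinations being compared.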
@RaynorJim Thanks for letting us know. I ran some benchmarks:
#docker run --rm -it r-base:3.4.4
#docker run --rm -it r-base:3.6.3
install.packages("microbenchmark", quiet=TRUE)
#install.packages("https://cloud.r-project.org/src/contrib/Archive/data.table/data.table_1.11.8.tar.gz", repos=NULL)
#install.packages("https://cloud.r-project.org/src/contrib/Archive/data.table/data.table_1.12.0.tar.gz", repos=NULL)
#install.packages("data.table", quiet=TRUE) # 1.12.8
options(width=999)
library(data.table)
set.seed(1)
types = c("A", "B", "C", "D", "E", "F")
# uncomment one of the two problem sizes below before running
#obs = 4e6
#obs = 4e7
dt1 = data.table(
percent = round(runif(obs, min = 0, max = 1), digits = 2),
type = as.factor(sample(types, obs, replace = TRUE))
)
setDTthreads(1L)
th1=microbenchmark::microbenchmark(th01 = dt1[, list(percent_total = sum(percent)), by = type], times = 10)
setDTthreads(40L)
th2=microbenchmark::microbenchmark(th40 = dt1[, list(percent_total = sum(percent)), by = type], times = 10)
rbind(th1,th2)
q("no")
And got the following timings (milliseconds):
I will come back soon with a conclusion. I would appreciate it if you could briefly check whether those numbers match yours, more or less.
One observation that is quite clear from the timings above is that single-threaded code is slower starting from 1.12.0 compared to 1.11.8; this is covered by #3330. Let's then focus on R 3.4.4 vs post-R 3.5, here R 3.6.3. As we can see here, there was a difference, but it was not really that big:
Now let's see if that is still valid. It doesn't seem so:
If you are after maximum speed and you cannot use multiple threads, then going back to 1.11.8 is probably the best choice for you. You may also want to upvote and subscribe to #3330. Closing this issue for now. @RaynorJim If you believe your case is not covered by #3330 please let us know, we can always re-open.
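To see how much the thread count matters on a given machine, the same grouped sum can be timed with one thread and with the default number of threads. This is a rough sketch rather than the benchmark used above: `obs` is reduced to 1e6 to keep the run short, and `time_grouped_sum` is a hypothetical helper name.

```r
library(data.table)

set.seed(1)
types <- c("A", "B", "C", "D", "E", "F")
obs <- 1e6  # assumption: smaller than the 4e6 / 4e7 used in this thread
dt1 <- data.table(
  percent = round(runif(obs, min = 0, max = 1), digits = 2),
  type = as.factor(sample(types, obs, replace = TRUE))
)

# Hypothetical helper: elapsed seconds of the grouped sum at a given thread count.
time_grouped_sum <- function(dt, threads) {
  old <- setDTthreads(threads)  # setDTthreads() returns the previous setting invisibly
  on.exit(setDTthreads(old))    # restore the previous setting when the function exits
  system.time(dt[, list(percent_total = sum(percent)), by = type])[["elapsed"]]
}

default_threads <- getDTthreads()
single <- time_grouped_sum(dt1, 1L)
multi  <- time_grouped_sum(dt1, default_threads)
c(single = single, multi = multi)
```

A single `system.time()` call is noisy, so for anything beyond a quick check the `microbenchmark` approach used earlier in the thread is the better choice.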
Apologies, I have been busy all day. I wasn't aware single-threaded performance is covered by #3330; I will watch that issue. Indeed, I cannot move away from 1.11.8 because we run all our code on a parallel cluster and all our CPUs are maxed out, so I am looking for as much speed from data.table as possible per thread. Thanks again for checking this out, I appreciate it.
I've noticed a significant drop in data.table performance after upgrading to R 3.5.
For example:
On 3.5 gives me:
But on 3.4.3 it's ~20% faster: