`[.data.table` is very slow with a single column #5650
But you aren't even benchmarking the same queries on

The first call to

```r
library(bench)
df <- data.frame(a = runif(10000), b = as.character(runif(10000)))
index <- runif(10000) <= 0.5
library(data.table)
dt <- as.data.table(df)
mark(
  df[index, "a"],
  df[["a"]][index],
  dt[index, a],
  dt[["a"]][index])
```

Result:

```
# A tibble: 4 × 13
  expression  min     median  `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 "df[index…  78.3µs  81.1µs     11541.    98.3KB     21.7   5319    10      461ms
2 "df[[\"a\…  71.3µs  73.6µs     13177.    98.3KB     23.6   6132    11      465ms
3 "dt[index… 354.3µs 370.4µs      2597.   114.7KB     6.15   1267     3      488ms
4 "dt[[\"a\…  71.2µs    74µs     13165.    98.3KB     23.7   6115    11      464ms
```
Does data.table have a benchmarking testsuite to track regressions? -> No, but this is one of the goals for the next two years; we plan to hire someone to work on that full time.
I don't think there was a regression here anyway. Looking at the time units, I doubt it will ever get much attention from the team, as we would have to subset columns in a loop tens of thousands of times for the problem to be really noticeable, and that would be a rather uncommon use case.
I'm not implying that the benchmarks are wrong, I'm just saying that
OK, following your suggestion:

```r
library(bench)
df <- data.frame(a = runif(10000), b = as.character(runif(10000)))
index <- runif(10000) <= 0.5
library(data.table)
dt <- as.data.table(df)
mark(
  df[index, "a"],
  df[["a"]][index],
  dt[index, a],
  dt[["a"]][index],
  as.list(dt[index, "a"])$a)
```

The result is:

```
# A tibble: 5 × 13
  expression  min     median  `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 "df[index…  78.3µs  80.8µs     11308.    97.7KB     19.1   5339     9      472ms
2 "df[[\"a\…  71.5µs  73.5µs     13192.    97.7KB     23.7   6113    11      463ms
3 "dt[index…   357µs   373µs      2629.   114.1KB     6.15   1283     3      488ms
4 "dt[[\"a\…  71.9µs  74.2µs     12873.    97.7KB     23.9   5928    11      461ms
5 "as.list(… 250.2µs   262µs      3729.   129.9KB     10.4   1795     5      481ms
```

so slightly faster (which is surprising, since this is not the recommended way in
Is it really an uncommon case to select just one column within a tight loop? The only reason that I noticed this is because I understand that the primary use of

In any case, given that there were already 1K issues reported before this one, I do not expect a fix anytime soon, but other users may find the information useful or contribute workarounds. I didn't know that
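One workaround for the tight-loop case, assuming the table itself does not change between iterations while the subset index does: hoist the column out of the data.table once, so each iteration only pays for plain vector subsetting rather than `[.data.table` dispatch. This is a sketch of the idea, not an official recommendation; the list of indices is made up for illustration.

```r
library(data.table)
dt <- data.table(a = runif(1e4))
idx_list <- replicate(100, runif(1e4) <= 0.5, simplify = FALSE)

# Slow: dispatches through `[.data.table` on every iteration
res1 <- lapply(idx_list, function(index) dt[index, a])

# Fast: extract the column vector once, then subset it directly
col_a <- dt[["a"]]
res2 <- lapply(idx_list, function(index) col_a[index])

identical(res1, res2)  # TRUE
```

The results are identical because `dt[index, a]` already returns a plain vector when `j` is a single unquoted column name; only the per-call overhead differs.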
If speed is really important here then you need to avoid method dispatch anyway. If you're only interested in the underlying column then you should be able to use

```r
library(data.table)
df <- data.frame(a = runif(10000), b = as.character(runif(10000)))
index <- runif(10000) <= 0.5
dt <- as.data.table(df)
microbenchmark::microbenchmark(
  df[index, "a"],
  df[["a"]][index],
  .subset2(df, "a")[index],
  dt[index, a],
  dt[["a"]][index],
  as.list(dt[index, "a"])$a,
  .subset2(dt, "a")[index]
)
#> Unit: microseconds
#>                       expr     min       lq      mean   median       uq      max neval
#>             df[index, "a"]  60.989  81.5590  94.47888 100.6775 105.0240  120.672   100
#>           df[["a"]][index]  55.149  65.9050  85.58422  93.3225  96.5160  114.245   100
#>   .subset2(df, "a")[index]  49.975  81.0015  80.79162  85.2605  89.0560  103.070   100
#>               dt[index, a] 251.248 275.0495 311.04374 304.2420 312.0940  880.575   100
#>           dt[["a"]][index]  55.672  68.5170  85.95876  94.0135  97.7035  112.374   100
#>  as.list(dt[index, "a"])$a 172.928 217.2780 326.48526 226.8850 237.7520 7082.201   100
#>   .subset2(dt, "a")[index]  48.631  77.8495  79.47604  85.7895  88.7160   98.199   100
```

Created on 2023-06-16 with reprex v2.0.2
Maybe one could gain 10 microseconds per call, but the confidence intervals appear to overlap and it is easier to write

The issue remains that the "natural" approach of replacing
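For context on why a drop-in replacement is not straightforward: with a data.frame, `df[index, "a"]` drops to a vector by default, while the data.table method keeps a one-column table. A minimal sketch of the difference (the toy data here is made up):

```r
library(data.table)
df <- data.frame(a = runif(10), b = letters[1:10])
dt <- as.data.table(df)
index <- rep(c(TRUE, FALSE), 5)

is.numeric(df[index, "a"])              # TRUE: data.frame drops to a vector
inherits(dt[index, "a"], "data.table")  # TRUE: data.table keeps a table

# Extracting the underlying vector from the data.table result:
v <- dt[index][["a"]]
identical(v, df[index, "a"])            # TRUE
```

So code written against data.frame semantics silently changes its return type when the object becomes a data.table, which is exactly why the benchmarks above compare differently-shaped queries.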
There is a PR of mine which already speeds up DT[index]; it will possibly work with column selection as well. You can try it out.
This is rather a problem of optimizing tight loops. If there are many iterations then some care is needed. I documented that (amongst other things to be careful about in tight loops) in Lines 218 to 225 in 21da019
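The same tight-loop advice applies on the assignment side: data.table provides `set()` precisely so loops can skip `[.data.table` dispatch. A short sketch of that documented pattern; the loop body here is made up for illustration:

```r
library(data.table)
dt <- data.table(x = numeric(10))

# set() bypasses the overhead of `[.data.table` and `:=`, which matters
# when it is called tens of thousands of times inside a loop
for (i in 1:10) set(dt, i = i, j = "x", value = i^2)

dt[["x"]][10]  # 100
```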
Does data.table have a benchmarking testsuite to track regressions? -> Is this a regression? Has this code ever worked faster in the past?
AFAIK it is not a regression. We could surely call it a regression if we compared to data.table v1.0.0. Over the years the interface for processing and optimizing user input has only grown, so the overhead was adding up. The question "does it matter?" is a matter of balance between common usage patterns and the absolute time added, and a use case like this definitely does not get high priority. It has been said multiple times in different places that it is better to vectorize your work on a DT, rather than iteratively running many queries against it, whenever possible, to avoid the overhead of `[`. This is the cost of the user-friendliness that data.table is packed with. There are some ideas, like
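A minimal illustration of the vectorize-rather-than-iterate point: one grouped query pays the `[` overhead once and lets data.table do the grouping internally, instead of paying the overhead once per group. The group structure here is made up for the sketch:

```r
library(data.table)
dt <- data.table(g = rep(1:1000, each = 10), v = runif(10000))

# Many queries: pays the `[.data.table` overhead once per group
means_loop <- sapply(1:1000, function(i) mean(dt[g == i, v]))

# One query: overhead is paid once, grouping is done internally
means_by <- dt[, mean(v), by = g][["V1"]]

all.equal(means_loop, means_by)  # TRUE
```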
#4687 is the FR for continuous benchmarking |