Run 2E9 rows in-ram on EC2 #71
Comments
Rdatatable/data.table#2956 can also be confirmed as resolved when running the 2E9 benchmark.
Blocked by tidyverse/dplyr#4334 as of now.
tidyverse/dplyr#4334 has recently been resolved; once it lands on CRAN we should be good to proceed with this issue.
We can wait for dplyr 1.0 to be released, as it seems to be the next major version. pandas also reached version 1.0 recently.
Need to postpone this to dplyr 1.1.0. Performance polishing was shifted to the 1.1.0 release, and dplyr 1.0 is expected to be slower.
It is now blocked on tidyverse/dplyr#5291.
The same machine as in 2014 was used, with 244GB of memory, running recent stable versions as of today.
Minor changes to 2014's script:
Results:
data.table got the regression fixed in Rdatatable/data.table#4297. Results:
db-bench runs on a dedicated machine (provided by H2O) which has 125GB of RAM, so 2E9 won't fit in RAM (the data itself takes 100GB, which leaves too little working memory). This machine has a fast, large disk though, and testing out-of-RAM is a much higher priority than testing bigger RAM; i.e. adding a 500GB (1E10) test (#39) on the same 125GB RAM db-bench machine, where spark and pydatatable will work but the other products will fail.
However, for completeness, it would still be nice to know if pandas works now on 2E9 on a node with 250GB RAM (it didn't 4 years ago but data.table did).
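For reference, the memory arithmetic above can be sketched as follows. This is only a rough illustration: the function names are hypothetical, the row counts, data sizes and RAM sizes are the ones quoted in this thread, and linear scaling of data size with row count is an assumption (it does match the 500GB figure quoted for 1E10).

```python
# Figures quoted in this thread; everything else is derived from them.
ROWS_2E9 = 2_000_000_000
ROWS_1E10 = 10_000_000_000
DATA_GB_2E9 = 100  # "the data itself takes 100GB"


def scaled_data_gb(rows):
    """Estimated data size in GB, ASSUMING size scales linearly with rows."""
    return DATA_GB_2E9 * rows // ROWS_2E9


def headroom_gb(ram_gb, rows=ROWS_2E9):
    """RAM left for working memory once the data itself is loaded."""
    return ram_gb - scaled_data_gb(rows)


if __name__ == "__main__":
    # 125GB is the db-bench machine, 244GB is the EC2 machine used in 2014.
    for ram_gb in (125, 244):
        print(f"{ram_gb}GB RAM, 2E9 rows: {headroom_gb(ram_gb)}GB headroom")
    print(f"1E10 rows: about {scaled_data_gb(ROWS_1E10)}GB of data")
```

How much headroom a given tool actually needs depends on the query and the tool, so a positive number here only says the data can be loaded, not that the benchmark will complete.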
This issue was moved here from Rdatatable/data.table#823