regression in big grouping #4818
Comments
I tweaked the error message a little bit, so that each allocation failure is reported separately:

```diff
diff --git a/src/gsumm.c b/src/gsumm.c
index 372ae594..d1e006d4 100644
--- a/src/gsumm.c
+++ b/src/gsumm.c
@@ -112,8 +112,9 @@ SEXP gforce(SEXP env, SEXP jsub, SEXP o, SEXP f, SEXP l, SEXP irowsArg) {
   int highSize = ((nrow-1)>>shift) + 1;
   //Rprintf(_("When assigning grp[o] = g, highSize=%d nb=%d shift=%d nBatch=%d\n"), highSize, nb, shift, nBatch);
   int *counts = calloc(nBatch*highSize, sizeof(int)); // TODO: cache-line align and make highSize a multiple of 64
+  if (!counts) error(_("Internal error: Failed to allocate counts when assigning g in gforce"));
   int *TMP = malloc(nrow*2*sizeof(int));
-  if (!counts || !TMP ) error(_("Internal error: Failed to allocate counts or TMP when assigning g in gforce"));
+  if (!TMP ) error(_("Internal error: Failed to allocate TMP when assigning g in gforce"));
   #pragma omp parallel for num_threads(getDTthreads(nBatch, false)) // schedule(dynamic,1)
   for (int b=0; b<nBatch; b++) {
     const int howMany = b==nBatch-1 ? lastBatchSize : batchSize;
```

With that split we now know that it is the allocation at line 115 in 8480b6a which, in the case of 2e9 rows, is trying to allocate 1.6e10 bytes, which is around 15GB.
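For reference, a minimal standalone sketch of that size arithmetic (the 2e9 row count comes from this report; everything else in the snippet is illustrative):

```c
#include <stdio.h>

int main(void) {
  // Size requested by the TMP allocation quoted above:
  //   int *TMP = malloc(nrow*2*sizeof(int));
  // size_t is used in this sketch so the product cannot overflow: for
  // nrow = 2e9, nrow*2 alone would already exceed a signed 32-bit int.
  const size_t nrow  = 2000000000;              // 2e9 rows, as in this benchmark
  const size_t bytes = nrow * 2 * sizeof(int);  // 2e9 * 2 * 4 = 1.6e10 bytes
  printf("TMP needs %zu bytes (~%.1f GiB)\n",
         bytes, (double)bytes / (1024.0 * 1024.0 * 1024.0));
  return 0;
}
```

This prints 16000000000 bytes (~14.9 GiB), matching the ~15GB figure above.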
---
The machine (a different one this time) has 64 cores and 256GB of memory.
---
I tried scaling up to a machine with 488GB of memory. On R 4.0.3, peak memory usage was again 195GB, despite much more memory being available.
---
Thanks, yes, it seems to be a duplicate of #4818. Not closing yet, to ensure that the fix for the other issue resolves this one as well.
---
I re-ran latest master to confirm the issue is fixed. Below are the timings, using the default 50% of cores as well as 100%. Compared to data.table 1.9.2 (Feb 2014) running on R 4.0.3, timings in recent master are 2 up to 10 times smaller.
Using 16 cores (50%):
Using 32 cores (100%):
---
While running a grouping benchmark on a 2e9-row dataset (96GB csv) using the recent stable data.table 1.13.2, I am getting an exception raised from data.table/src/gsumm.c, line 116 in 8480b6a.
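Counting lines from the `@@ -112,8 +112,9 @@` hunk in the diff quoted above, line 115 at that commit is the TMP malloc and line 116 is the combined failure check, so the exception should be the pre-patch message (reconstructed from that diff, not pasted from the actual session):

```c
int *TMP = malloc(nrow*2*sizeof(int));  // line 115: two ints per row
if (!counts || !TMP ) error(_("Internal error: Failed to allocate counts or TMP when assigning g in gforce"));  // line 116
```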
It is the same machine as the one used in 2014: 32 cores and 244GB of memory.
I ran data.table 1.9.2 as well, to confirm that the version which previously worked fine for this data size continues to work on a recent R version.
Timings are slower than they were in the past, but AFAIK this matches what we observed in other issues: newer versions of R introduced overhead that data.table then addressed in later releases. So if users upgrade R, they should also upgrade data.table.