gshift as gforce optimized shift #5205

Merged: 39 commits, Oct 20, 2021
Changes from 31 commits
Commits
5813501
init gshift
ben-schwen Aug 31, 2021
13fe55c
remodel R gshift
ben-schwen Sep 3, 2021
55f5775
fix lead error
ben-schwen Sep 4, 2021
6364212
add tests
ben-schwen Sep 4, 2021
ad63180
working version
ben-schwen Sep 4, 2021
4eada97
Merge remote-tracking branch 'origin/master' into gforce_shift
ben-schwen Oct 8, 2021
1b08e59
lag/lead working
ben-schwen Oct 9, 2021
3437ea0
update dispatch & tests
ben-schwen Oct 9, 2021
91696e5
undo newlines
ben-schwen Oct 9, 2021
9c88df1
reorder
ben-schwen Oct 9, 2021
5073709
fixed R dispatch
ben-schwen Oct 10, 2021
db16894
fix gshift
ben-schwen Oct 10, 2021
60a7509
coverage
ben-schwen Oct 11, 2021
0280a45
long index vector support
ben-schwen Oct 13, 2021
381eaea
update testnumber
ben-schwen Oct 13, 2021
3187a8c
merge master
ben-schwen Oct 13, 2021
c11d9ba
forgot two testnums
ben-schwen Oct 13, 2021
98848ce
works for n <= grpsize
ben-schwen Oct 14, 2021
e0dd981
add cyclic
ben-schwen Oct 14, 2021
b591b74
rm unnecessary code
ben-schwen Oct 14, 2021
d03d30e
add cov
ben-schwen Oct 14, 2021
373217f
more coverage
ben-schwen Oct 14, 2021
bdc7fa0
coverage escaped gshift
ben-schwen Oct 14, 2021
7d63674
add NEWS
ben-schwen Oct 14, 2021
9e339b6
removed 'when fill is not a call' from news item because fill being a…
mattdowle Oct 15, 2021
7a930ea
add 64bit
ben-schwen Oct 15, 2021
4a79037
Merge branch 'gforce_shift' of github.com:Rdatatable/data.table into …
ben-schwen Oct 15, 2021
9599dbe
merge current master
ben-schwen Oct 15, 2021
7f9e358
safer optimization dispatch
ben-schwen Oct 15, 2021
9a98673
add benchmark example to NEWS
ben-schwen Oct 15, 2021
48b1775
add example from SO
ben-schwen Oct 15, 2021
9b9f0bb
include column 'r' by avoiding repetition
mattdowle Oct 15, 2021
973f749
used an .Rdata to test against a fixed y, added RAW support
mattdowle Oct 15, 2021
f2e3361
added comment about integer64 case
mattdowle Oct 15, 2021
f5aad19
news item tweak
mattdowle Oct 15, 2021
1e5c2b3
added coerce test for shift by group
ben-schwen Oct 15, 2021
0f766c0
add unsupported type test
ben-schwen Oct 16, 2021
a6abac3
merge master
mattdowle Oct 20, 2021
5eec8f2
added/tweaked comments
mattdowle Oct 20, 2021
37 changes: 37 additions & 0 deletions NEWS.md
@@ -173,6 +173,43 @@

29. `setkey()` now supports type `raw` as value columns (not as key columns), [#5100](https://github.com/Rdatatable/data.table/issues/5100). Thanks Hugh Parsonage for requesting, and Benjamin Schwendinger for the PR.

30. `shift()` is now GForce optimised, [#1534](https://github.com/Rdatatable/data.table/issues/1534). Thanks to Gerhard Nachtmann for requesting, and Benjamin Schwendinger for the PR.

```R
# Benchmark
N = 1e7
DT = data.table(x = sample(N), y = sample(1e6,N,TRUE))
basic_shift = shift
microbenchmark::microbenchmark(
DT[, shift(x, 1, type="lag"), y],
DT[, basic_shift(x, 1, type="lag"), y],
DT[, c(NA, head(x,-1)), y],
times = 10L, unit = "s")
# Unit: seconds
# expr min lq mean median uq max neval
# DT[, shift(x, 1, type = "lag"), y] 0.4865 0.5238 0.5463 0.5446 0.5725 0.5982 10
# DT[, basic_shift(x, 1, type = "lag"), y] 20.5500 20.9000 21.1600 21.3200 21.4400 21.5200 10
# DT[, c(NA, head(x, -1)), y] 8.7620 9.0240 9.1870 9.2800 9.3700 9.4110 10
```

Benchmarking example from [stackoverflow](https://stackoverflow.com/questions/35179911/shift-in-data-table-v1-9-6-is-slow-for-many-groups)
```R
library(microbenchmark)
set.seed(1)
basic_shift = shift
mg <- data.table(expand.grid(year = 2012:2016, id = 1:1000),
value = rnorm(5000))
microbenchmark(dt194 = mg[, c(value[-1], NA), by = id],
dt196 = mg[, basic_shift(value, n = 1, type = "lead"), by = id],
dtnow = mg[, shift(value, n = 1, type = "lead"), by = id],
unit = "ms")
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt194 3.6600 3.8250 4.4930 4.1720 4.9490 11.700 100
# dt196 18.5400 19.1800 21.5100 20.6900 23.4200 29.040 100
# dtnow 0.4826 0.5586 0.6586 0.6329 0.7348 1.318 100
```

## BUG FIXES

1. `by=.EACHI` when `i` is keyed but `on=` different columns than `i`'s key could create an invalidly keyed result, [#4603](https://github.com/Rdatatable/data.table/issues/4603) [#4911](https://github.com/Rdatatable/data.table/issues/4911). Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a `data.table` is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries.
24 changes: 23 additions & 1 deletion R/data.table.R
@@ -1745,6 +1745,10 @@ replace_dot_alias = function(e) {
if (!(is.call(q) && is.symbol(q[[1L]]) && is.symbol(q[[2L]]) && (q1 <- q[[1L]]) %chin% gfuns)) return(FALSE)
if (!(q2 <- q[[2L]]) %chin% names(SDenv$.SDall) && q2 != ".I") return(FALSE) # 875
if ((length(q)==2L || (!is.null(names(q)) && startsWith(names(q)[3L], "na")))) return(TRUE)
if (length(q)>=2L && q[[1L]] == "shift") {
q_named = match.call(shift, q)
if (!is.call(q_named[["fill"]]) && is.null(q_named[["give.names"]])) return(TRUE)
} # add gshift support
# ^^ base::startWith errors on NULL unfortunately
# head-tail uses default value n=6 which as of now should not go gforce ... ^^
# otherwise there must be three arguments, and only in two cases:
@@ -1848,6 +1852,17 @@ replace_dot_alias = function(e) {
gi = if (length(o__)) o__[f__] else f__
g = lapply(grpcols, function(i) groups[[i]][gi])

# returns all rows instead of one per group
nrow_funs = c("gshift")
.is_nrows = function(q) {
if (!is.call(q)) return(FALSE)
if (q[[1L]] == "list") {
any(vapply(q, .is_nrows, FALSE))
} else {
q[[1L]] %chin% nrow_funs
}
}

# adding ghead/gtail(n) support for n > 1 #5060 #523
q3 = 0
if (!is.symbol(jsub)) {
@@ -1865,6 +1880,8 @@ replace_dot_alias = function(e) {
if (q3 > 0) {
grplens = pmin.int(q3, len__)
g = lapply(g, rep.int, times=grplens)
} else if (.is_nrows(jsub)) {
g = lapply(g, rep.int, times=len__)
}
ans = c(g, ans)
} else {
@@ -2970,7 +2987,7 @@ rleidv = function(x, cols=seq_along(x), prefix=NULL) {
# (2) edit .gforce_ok (defined within `[`) to catch which j will apply the new function
# (3) define the gfun = function() R wrapper
gfuns = c("[", "[[", "head", "tail", "first", "last", "sum", "mean", "prod",
- "median", "min", "max", "var", "sd", ".N") # added .N for #334
+ "median", "min", "max", "var", "sd", ".N", "shift") # added .N for #334
`g[` = `g[[` = function(x, n) .Call(Cgnthvalue, x, as.integer(n)) # n is of length=1 here.
ghead = function(x, n) .Call(Cghead, x, as.integer(n)) # n is not used at the moment
gtail = function(x, n) .Call(Cgtail, x, as.integer(n)) # n is not used at the moment
@@ -2984,6 +3001,11 @@ gmin = function(x, na.rm=FALSE) .Call(Cgmin, x, na.rm)
gmax = function(x, na.rm=FALSE) .Call(Cgmax, x, na.rm)
gvar = function(x, na.rm=FALSE) .Call(Cgvar, x, na.rm)
gsd = function(x, na.rm=FALSE) .Call(Cgsd, x, na.rm)
gshift = function(x, n=1L, fill=NA, type=c("lag", "lead", "shift", "cyclic")) {
type = match.arg(type)
stopifnot(is.numeric(n))
.Call(Cgshift, x, as.integer(n), fill, type)
}
gforce = function(env, jsub, o, f, l, rows) .Call(Cgforce, env, jsub, o, f, l, rows)

.prepareFastSubset = function(isub, x, enclos, notjoin, verbose = FALSE){
40 changes: 40 additions & 0 deletions inst/tests/tests.Rraw
@@ -18292,3 +18292,43 @@ DT = data.table(A=1:3, key="A")
test(2223.1, DT[.(4), nomatch=FALSE], data.table(A=integer(), key="A"))
test(2223.2, DT[.(4), nomatch=NA_character_], data.table(A=4L, key="A"))

# gshift
options(datatable.optimize = 2L)
esc = shift
DT = data.table(x = sample(letters[1:5], 20, TRUE),
y = rep.int(1:2, 10), # to test 2 grouping columns get rep'd properly
i = sample(c(-2L,0L,3L,NA), 20, TRUE),
d = sample(c(1.2,-3.4,5.6,NA), 20, TRUE),
s = sample(c("foo","bar",NA), 20, TRUE),
c = sample(c(0+3i,1,-1-1i,NA), 20, TRUE),
l = sample(c(TRUE, FALSE, NA), 20, TRUE),
r = as.raw(sample(1:5, 20, TRUE)),
if (test_bit64) i64 = as.integer64(sample(c(-2L,0L,2L,NA), 20, TRUE)))

nn = list(1, 5, -1, -5, c(1,2), c(-1,1))
cols = c("y", "i", "d", "s", "c", "l", if (test_bit64) "i64")
testnum = 2224
for (n in nn) {
for (c in cols) {
testnum = testnum + 0.001
test(testnum, EVAL(sprintf("DT[, shift(%s, %d, type='lag'), by=x]", c, n)), EVAL(sprintf("DT[, esc(%s, %d, type='lag'), by=x]", c, n)))
testnum = testnum + 0.001
test(testnum, EVAL(sprintf("DT[, shift(%s, %d, type='lead'), by=x]", c, n)), EVAL(sprintf("DT[, esc(%s, %d, type='lead'), by=x]", c, n)))
testnum = testnum + 0.001
test(testnum, EVAL(sprintf("DT[, shift(%s, %d, type='shift'), by=x]", c, n)), EVAL(sprintf("DT[, esc(%s, %d, type='shift'), by=x]", c, n)))
testnum = testnum + 0.001
test(testnum, EVAL(sprintf("DT[, shift(%s, %d, type='cyclic'), by=x]", c, n)), EVAL(sprintf("DT[, esc(%s, %d, type='cyclic'), by=x]", c, n)))
# check if shift with opposite type is same as shift with n*-1
testnum = testnum + 0.001
test(testnum, EVAL(sprintf("DT[, shift(%s, %d, type='lag'), by=x]", c, n)), EVAL(sprintf("DT[, esc(%s, %d, type='lead'), by=x]", c, -n)))
testnum = testnum + 0.001
test(testnum, EVAL(sprintf("DT[, shift(%s, %d, type='lead'), by=x]", c, n)), EVAL(sprintf("DT[, esc(%s, %d, type='lag'), by=x]", c, -n)))
}
}

test(2224.51, DT[, shift(i, fill=1:10), by=x], error="fill must be a vector of length 1")
test(2224.52, DT[, shift(i, type="shift"), by=x], DT[, esc(i, type="shift"), by=x])
test(2224.53, DT[, shift(r), by=x], error="Type 'raw' is not supported by GForce gshift")
# use fill argument with length > 1 which is not a call
a=1:2
test(2224.54, DT[, shift(i, fill=a), by=x], error="fill must be a vector of length 1")
88 changes: 88 additions & 0 deletions src/gsumm.c
@@ -1162,3 +1162,91 @@ SEXP gprod(SEXP x, SEXP narmArg) {
return(ans);
}

SEXP gshift(SEXP x, SEXP nArg, SEXP fillArg, SEXP typeArg) {

@MichaelChirico (Member) commented on Oct 9, 2021:

shouldn't we prefer to re-use the code in shift.c here (to the extent possible)? otherwise it'll be a pain to always update code in both places going forward.

It might require some refactoring/modularizing of the original shift.c.

@ben-schwen (Member Author) commented on Oct 9, 2021:

I can try to reuse as much as possible, but the two implementations differ in several ways:

| Basic shift | gshift |
| --- | --- |
| works on data.tables/vectors/lists | operates on vectors |
| main workhorse is memmove | memory used may not be in contiguous blocks |
| no subset on i | concurrent subsetting on irows |
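
To illustrate the memory-layout row of the comparison, here is a minimal sketch, not taken from `shift.c` or `gsumm.c`. It assumes `int` data, a non-negative `n`, and a 1-based order vector; the names `lag_contiguous`, `lag_gathered` and `ord` are hypothetical. When the values sit in one contiguous block, a lag is a fill followed by a single block copy. When the rows of a group are scattered and reached through an order vector, each element has to be gathered individually.

```c
#include <string.h>

/* Contiguous case: the values to shift form one block, so a lag is
 * "write k fill values, then copy the rest in one go". Assumes n >= 0. */
static void lag_contiguous(const int *x, int *ans, int len, int n, int fill) {
  const int k = n < len ? n : len;  /* number of leading fill values */
  for (int j = 0; j < k; ++j) ans[j] = fill;
  memmove(ans + k, x, (size_t)(len - k) * sizeof(int));
}

/* Scattered case: the group's rows are located through a 1-based order
 * vector `ord` (hypothetical name), so no single block copy is possible. */
static void lag_gathered(const int *x, const int *ord, int *ans,
                         int len, int n, int fill) {
  const int k = n < len ? n : len;
  for (int j = 0; j < k; ++j) ans[j] = fill;
  for (int j = 0; j < len - k; ++j) ans[k + j] = x[ord[j] - 1];
}
```

Here `len` is the number of values being shifted (the group length in the gathered case), not the length of the underlying column.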

Member commented:

Maybe basic shift could be redirected to call gshift then? It would have to call gforce prep function first, perhaps in a dummy way to trick gshift into working. Then it would be common code and all existing tests of shift would cover the new gforce gshift too.
Perhaps not now. Happy to merge this PR as is, subject to integer64. A task could be created to redirect shift to gshift in future.

const bool nosubset = irowslen == -1;
const bool issorted = !isunsorted;
const int n = nosubset ? length(x) : irowslen;
if (nrow != n) error(_("Internal error: nrow [%d] != length(x) [%d] in %s"), nrow, n, "gshift");

int nprotect=0;
enum {LAG, LEAD/*, SHIFT*/,CYCLIC} stype = LAG;
if (!(length(fillArg) == 1))
error(_("fill must be a vector of length 1"));

if (!isString(typeArg) || length(typeArg) != 1)
error(_("Internal error: invalid type for gshift(), should have been caught before. please report to data.table issue tracker")); // # nocov
if (!strcmp(CHAR(STRING_ELT(typeArg, 0)), "lag")) stype = LAG;
else if (!strcmp(CHAR(STRING_ELT(typeArg, 0)), "lead")) stype = LEAD;
else if (!strcmp(CHAR(STRING_ELT(typeArg, 0)), "shift")) stype = LAG;
else if (!strcmp(CHAR(STRING_ELT(typeArg, 0)), "cyclic")) stype = CYCLIC;
else error(_("Internal error: invalid type for gshift(), should have been caught before. please report to data.table issue tracker")); // # nocov

bool lag;
const bool cycle = stype == CYCLIC;

R_xlen_t nx = xlength(x), nk = length(nArg);
if (!isInteger(nArg)) error(_("Internal error: n must be integer")); // # nocov
const int *kd = INTEGER(nArg);
for (int i=0; i<nk; i++) if (kd[i]==NA_INTEGER) error(_("Item %d of n is NA"), i+1);

SEXP ans = PROTECT(allocVector(VECSXP, nk)); nprotect++;
SEXP thisfill = PROTECT(coerceAs(fillArg, x, ScalarLogical(0))); nprotect++;
for (int g=0; g<nk; g++) {
lag = stype == LAG || stype == CYCLIC;
int m = kd[g];
// switch
if (m < 0) {
m = m * (-1);
lag = !lag;
}
R_xlen_t ansi = 0;
SEXP tmp;
SET_VECTOR_ELT(ans, g, tmp=allocVector(TYPEOF(x), nx));
#define SHIFT(CTYPE, RTYPE, ASSIGN) { \
const CTYPE *xd = (const CTYPE *)RTYPE(x); \
const CTYPE fill = (const CTYPE)RTYPE(thisfill)[0]; \

@mattdowle (Member) commented on Oct 15, 2021:

Locally I'm seeing this warning (from gcc with -std=c99 I think but it might be due to gcc 10.3.0) :

gsumm.c: In function ‘gshift’:
gsumm.c:1207:26: warning: ISO C forbids casting nonscalar to the same type [-Wpedantic]
 1207 |       const CTYPE fill = (const CTYPE)RTYPE(thisfill)[0];                                                         \
      |                          ^
gsumm.c:1241:59: note: in expansion of macro ‘SHIFT’
 1241 |       case CPLXSXP: { Rcomplex *ansd=COMPLEX(tmp);        SHIFT(Rcomplex, COMPLEX,  ansd[ansi++]=val); } break;
      |                                                           ^~~~~

Member Author commented:

AFAIC it should also work without the cast.

Member commented:

What about the casting of fill between NA_REAL and NA_INTEGER64, though?
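
The pedantic warning discussed above can be reproduced outside of R. The sketch below is an illustration only: `cplx_t` is a hypothetical stand-in for `Rcomplex`, and `FIRST_ELEMENT` is a made-up macro shaped like the first lines of `SHIFT`. ISO C only allows casts to scalar types (or `void`), so `(const cplx_t)v[0]` is diagnosed under `-Wpedantic`, whereas plain initialisation without the cast is valid for scalar and struct element types alike.

```c
/* Hypothetical stand-in for Rcomplex: a struct, i.e. a nonscalar type. */
typedef struct { double r, i; } cplx_t;

/* Same shape as the start of the SHIFT macro, minus the cast.
 * Initialising from v[0] copies the element whether CTYPE is a
 * scalar (int, double) or a struct (cplx_t). */
#define FIRST_ELEMENT(CTYPE, NAME)                        \
  static CTYPE NAME(const CTYPE *v) {                     \
    const CTYPE fill = v[0]; /* no (const CTYPE) cast */  \
    return fill;                                          \
  }

FIRST_ELEMENT(int, first_int)
FIRST_ELEMENT(double, first_double)
FIRST_ELEMENT(cplx_t, first_cplx)
```

Dropping the cast this way does not address the separate question of how `fill` is represented for integer64 columns; that depends on how `thisfill` was coerced before the macro runs.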

for (int i=0; i<ngrp; ++i) { \
const int grpn = grpsize[i]; \
const int mg = cycle ? (((m-1) % grpn) + 1) : m; \
const int thisn = MIN(mg, grpn); \
const int jstart = ff[i]-1+ (!lag)*(thisn); \
const int jend = jstart+ MAX(0, grpn-mg); /*if m > grpn -> jend = jstart */ \
if (lag) { \
const int o = ff[i]-1+(grpn-thisn); \
for (int j=0; j<thisn; ++j) { \
const int k = issorted ? (o+j) : oo[o+j]-1; \
const CTYPE val = cycle ? (nosubset ? xd[k] : (irows[k]==NA_INTEGER ? fill : xd[irows[k]-1])) : fill; \
ASSIGN; \
} \
} \
for (int j=jstart; j<jend; ++j) { \
const int k = issorted ? j : oo[j]-1; \
const CTYPE val = nosubset ? xd[k] : (irows[k]==NA_INTEGER ? fill : xd[irows[k]-1]); \
ASSIGN; \
} \
if (!lag) { \
const int o = ff[i]-1; \
for (int j=0; j<thisn; ++j) { \
const int k = issorted ? (o+j) : oo[o+j]-1; \
const CTYPE val = cycle ? (nosubset ? xd[k] : (irows[k]==NA_INTEGER ? fill : xd[irows[k]-1])) : fill; \
ASSIGN; \
} \
} \
} \
}
switch(TYPEOF(x)) {
case LGLSXP: { int *ansd=LOGICAL(tmp); SHIFT(int, LOGICAL, ansd[ansi++]=val); } break;
case INTSXP: { int *ansd=INTEGER(tmp); SHIFT(int, INTEGER, ansd[ansi++]=val); } break;
case REALSXP: { double *ansd=REAL(tmp); SHIFT(double, REAL, ansd[ansi++]=val); } break;

Member commented:

Was just about to merge (looks awesome!) and then just saw integer64 missed here (and test_bit64 added to tests). Bug fix 45 starts shift(xInt64, fill=0) so now that shift is optimized I wonder if the shift problems of int64 would return. Maybe existing tests of shift int64 don't test it by group then.
Can't think of anything else.

@ben-schwen (Member Author) commented on Oct 15, 2021:

Added integer64 support. Interestingly we don't do anything special about integer64 anymore in shift.c or gshift except using coerceAs for coercing and copyMostAttrib for copying class integer64.

Member commented:

Interesting. The coerceAs is for fill but fill isn't currently tested by the new gforce test 2224. A loop of differently typed fill needs adding to the test then, and I suspect a specific case here for integer64 will then be needed to pass that? Can I leave that to you, and the C99 warning, hence WIP again.
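
As background for the integer64 concern, the following sketch shows why a fill value has to be coerced with the integer64 class in mind rather than copied as an ordinary double. It relies on two assumptions about `bit64` that are not stated in this thread: integer64 values are stored bit for bit in the 8-byte slots of a double vector, and the NA sentinel is the minimum int64 value. The helper names and `NA_INT64` are hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Assumed NA sentinel for integer64: the minimum int64 value. */
#define NA_INT64 INT64_MIN

/* Read the 8 bytes of a double slot as an int64. */
static int64_t slot_as_int64(double slot) {
  int64_t v;
  memcpy(&v, &slot, sizeof v);
  return v;
}

/* Store an int64 in a double slot, bit for bit (no numeric conversion). */
static double int64_as_slot(int64_t v) {
  double slot;
  memcpy(&slot, &v, sizeof slot);
  return slot;
}
```

Under these assumptions a double NaN placed directly into an integer64 column would not read back as `NA_INT64`, and the double `42.0` would not read back as the integer 42, which is why the coercion step matters for `fill`.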

case CPLXSXP: { Rcomplex *ansd=COMPLEX(tmp); SHIFT(Rcomplex, COMPLEX, ansd[ansi++]=val); } break;
case STRSXP: { SHIFT(SEXP, STRING_PTR, SET_STRING_ELT(tmp,ansi++,val)); } break;
//case VECSXP: { SHIFT(SEXP, SEXPPTR_RO, SET_VECTOR_ELT(tmp,ansi++,val)); } break;
default:
error(_("Type '%s' is not supported by GForce gshift. Either add the namespace prefix (e.g. data.table::shift(.)) or turn off GForce optimization using options(datatable.optimize=1)"), type2char(TYPEOF(x)));
}
copyMostAttrib(x, tmp); // needed for integer64: without it the integer64 class is not carried over to the result
}
UNPROTECT(nprotect);
return(ans);
}
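
The per-group logic of the `SHIFT` macro may be easier to follow in isolation. What follows is a simplified sketch, not the merged code: it assumes `int` data already sorted by group (so no `oo` order vector), no `irows` subset, and group sizes of at least 1. The names `grouped_shift` and `shift_type` are made up for illustration.

```c
#include <stdbool.h>

typedef enum { LAG, LEAD, CYCLIC } shift_type;

/* Shift x by n within each group and write the result to ans (same length
 * as x). grpsize[i] is the length of group i; groups are assumed to be
 * contiguous in x and each group size is assumed to be >= 1. */
static void grouped_shift(const int *x, int *ans, const int *grpsize, int ngrp,
                          int n, shift_type type, int fill) {
  bool lag = (type == LAG || type == CYCLIC);
  const bool cycle = (type == CYCLIC);
  int m = n;
  if (m < 0) { m = -m; lag = !lag; }  /* negative n flips the direction */

  int start = 0;  /* offset of the current group in x */
  int ansi = 0;   /* next write position in ans */
  for (int i = 0; i < ngrp; ++i) {
    const int grpn = grpsize[i];
    /* cyclic shifts wrap around, so reduce m so it never exceeds grpn */
    const int mg = cycle ? (((m - 1) % grpn) + 1) : m;
    const int thisn = mg < grpn ? mg : grpn;     /* values filled or wrapped */
    const int keep = grpn > mg ? grpn - mg : 0;  /* values copied unchanged */

    if (lag) {  /* head of the group: fill, or the wrapped-around tail */
      const int o = start + grpn - thisn;
      for (int j = 0; j < thisn; ++j) ans[ansi++] = cycle ? x[o + j] : fill;
    }
    const int jstart = start + (lag ? 0 : thisn);
    for (int j = jstart; j < jstart + keep; ++j) ans[ansi++] = x[j];
    if (!lag) { /* tail of the group: fill, or the wrapped-around head */
      for (int j = 0; j < thisn; ++j) ans[ansi++] = cycle ? x[start + j] : fill;
    }
    start += grpn;
  }
}
```

With `x = {1, 2, 3, 4, 5}` and group sizes `{3, 2}`, a lag of 1 with fill `-1` gives `{-1, 1, 2, -1, 4}`, a lead of 1 gives `{2, 3, -1, 5, -1}`, and a cyclic shift of 1 gives `{3, 1, 2, 5, 4}`.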

2 changes: 2 additions & 0 deletions src/init.c
@@ -107,6 +107,7 @@ SEXP dim();
SEXP gvar();
SEXP gsd();
SEXP gprod();
SEXP gshift();
SEXP nestedid();
SEXP setDTthreads();
SEXP getDTthreads_R();
@@ -197,6 +198,7 @@ R_CallMethodDef callMethods[] = {
{"Cgvar", (DL_FUNC) &gvar, -1},
{"Cgsd", (DL_FUNC) &gsd, -1},
{"Cgprod", (DL_FUNC) &gprod, -1},
{"Cgshift", (DL_FUNC) &gshift, -1},
{"Cnestedid", (DL_FUNC) &nestedid, -1},
{"CsetDTthreads", (DL_FUNC) &setDTthreads, -1},
{"CgetDTthreads", (DL_FUNC) &getDTthreads_R, -1},