Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shift fix type coercion for integer64 #5189

Merged
merged 20 commits into from
Oct 8, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 21 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -376,6 +376,27 @@

44. In v1.13.2 a version of an old bug was reintroduced where during a grouping operation list columns could retain a pointer to the last group. This affected only attributes of list elements and only if those were updated during the grouping operation, [#4963](https://github.com/Rdatatable/data.table/issues/4963). Thanks to @fujiaxiang for reporting and @avimallu and Václav Tlapák for investigating and the PR.

45. `shift(xInt64, fill=0)` and `shift(xInt64, fill=as.integer64(0))` (but not `shift(xInt64, fill=0L)`) would error with `INTEGER() can only be applied to a 'integer', not a 'double'` where `xInt64` conveys `bit64::integer64`, `0` is type `double` and `0L` is type integer, [#4865](https://github.com/Rdatatable/data.table/issues/4865). Thanks to @peterlittlejohn for reporting and Benjamin Schwendinger for the PR.

46. `DT[i, strCol:=classVal]` did not coerce using the `as.character` method for the class, resulting in either an unexpected string value or an error such as `To assign integer64 to a target of type character, please use as.character() for clarity`. Discovered during work on the previous issue, [#5189](https://github.com/Rdatatable/data.table/pull/5189).

```R
DT
# A
# <char>
# 1: a
# 2: b
# 3: c
DT[2, A:=as.IDate("2021-02-03")]
DT[3, A:=bit64::as.integer64("4611686018427387906")]
DT
# A
# <char>
# 1: a
# 2: 2021-02-03 # was 18661
# 3: 4611686018427387906 # was error 'please use as.character'
```

## NOTES

1. New feature 29 in v1.12.4 (Oct 2019) introduced zero-copy coercion. Our thinking is that requiring you to get the type right in the case of `0` (type double) vs `0L` (type integer) is too inconvenient for you the user. So such coercions happen in `data.table` automatically without warning. Thanks to zero-copy coercion there is no speed penalty, even when calling `set()` many times in a loop, so there's no speed penalty to warn you about either. However, we believe that assigning a character value such as `"2"` into an integer column is more likely to be a user mistake that you would like to be warned about. The type difference (character vs integer) may be the only clue that you have selected the wrong column, or typed the wrong variable to be assigned to that column. For this reason we view character to numeric-like coercion differently and will warn about it. If it is correct, then the warning is intended to nudge you to wrap the RHS with `as.<type>()` so that it is clear to readers of your code that a coercion from character to that type is intended. For example :
Expand Down
7 changes: 4 additions & 3 deletions inst/tests/nafill.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -281,7 +281,7 @@ if (test_bit64) {
x = as.integer64(1L)
test(10.81, coerceAs(x, 1), 1, output="double[integer64] into double[numeric]")
test(10.82, coerceAs(x, 1L), 1L, output="double[integer64] into integer[integer]")
test(10.83, coerceAs(x, "1"), error="please use as.character", output="double[integer64] into character[character]") # not yet implemented
test(10.83, coerceAs(x, "1"), "1", output="double[integer64] into character[character]")
test(10.84, coerceAs(1, x), x, output="double[numeric] into double[integer64]")
test(10.85, coerceAs(1L, x), x, output="integer[integer] into double[integer64]")
test(10.86, coerceAs("1", x), x, output="character[character] into double[integer64]", warning="Coercing.*character")
Expand All @@ -294,14 +294,15 @@ if (test_nanotime) {
x = nanotime(1L)
test(10.91, coerceAs(x, 1), 1, output="double[nanotime] into double[numeric]")
test(10.92, coerceAs(x, 1L), 1L, output="double[nanotime] into integer[integer]")
test(10.93, coerceAs(x, "1"), error="please use as.character", output="double[nanotime] into character[character]") # not yet implemented
test(10.93, substring(coerceAs(x, "1"),1,11) %in% c("1","1970-01-01T"), output="double[nanotime] into character[character]")
# ^ https://github.com/eddelbuettel/nanotime/issues/92; %in% so as not to break if nanotime adds as.character method
test(10.94, coerceAs(1, x), x, output="double[numeric] into double[nanotime]")
test(10.95, coerceAs(1L, x), x, output="integer[integer] into double[nanotime]")
test(10.96, coerceAs("1", x), x, output="character[character] into double[nanotime]", warning="Coercing.*character")
}
options(datatable.verbose=FALSE)
test(11.01, coerceAs(list(a=1), 1), error="is not atomic")
test(11.02, coerceAs(1, list(a=1)), error="is not atomic")
test(11.02, coerceAs(1, list(a=1)), list(1))
test(11.03, coerceAs(sum, 1), error="is not atomic")
test(11.04, coerceAs(quote(1+1), 1), error="is not atomic")
test(11.05, coerceAs(as.name("x"), 1), error="is not atomic")
Expand Down
29 changes: 28 additions & 1 deletion inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -14611,7 +14611,7 @@ if (test_bit64) {
warning="-1.*integer64.*position 1 taken as 0 when assigning.*raw.*column 3 named 'c'")
test(2005.66, DT[2:3, f:=as.integer64(c(NA,"2147483648"))]$f, as.complex(c(-42,NA,2147483648)))
DT[,h:=LETTERS[1:3]]
test(2005.67, DT[2:3, h:=as.integer64(1:2)], error="To assign integer64 to.*type character, please use as.character.")
test(2005.67, DT[2:3, h:=as.integer64(1:2)]$h, c("A","1","2")) # PR#5189
}

# rbindlist raw type, #2819
Expand Down Expand Up @@ -18222,3 +18222,30 @@ DT1 = data.table(id=1:3, grp=c('a', 'a', 'b'), value=4:6)
DT2 = data.table(grp = c('a', 'b'), agg = list(c('1' = 4, '2' = 5), c('3' = 6)))
test(2217, DT1[, by = grp, .(agg = list(setNames(as.numeric(value), id)))], DT2)

# shift integer64 when fill isn't integer32, #4865
testnum = 2218
funs = c(as.integer, as.double, as.complex, as.character, if (test_bit64) as.integer64)
# when test_bit64==FALSE these all passed before; now passes with test_bit64==TRUE too
for (f1 in funs) {
DT = data.table(x=f1(1:4))
for (f2 in funs) {
testnum = testnum + 0.01
test(testnum, DT[, shift(x)], f1(c(NA, 1:3)))
testnum = testnum + 0.01
w = if (identical(f2,as.character) && !identical(f1,as.character)) "Coercing.*character.*to match the type of target vector"
test(testnum, DT[, shift(x, fill=f2(NA))], f1(c(NA, 1:3)), warning=w)
testnum = testnum + 0.01
if (identical(f1,as.character) && identical(f2,as.complex)) {
# one special case due to as.complex(0)=="0+0i"!="0"
test(testnum, DT[, shift(x, fill="0")], f1(0:3))
} else {
test(testnum, DT[, shift(x, fill=f2(0))], f1(0:3), warning=w)
}
}
}

# subassign coerce a class to character, part of PR#5189
DT = data.table(A=letters[1:3])
test(2219.1, DT[2, A:=as.IDate("2021-02-03")], data.table(A=c("a","2021-02-03","c")))
if (test_bit64) test(2219.2, DT[3, A:=as.integer64("4611686018427387906")], data.table(A=c("a","2021-02-03","4611686018427387906")))

53 changes: 34 additions & 19 deletions src/assign.c
Original file line number Diff line number Diff line change
Expand Up @@ -830,7 +830,7 @@ const char *memrecycle(const SEXP target, const SEXP where, const int start, con
}
}
} else if (isString(source) && !isString(target) && !isNewList(target)) {
warning(_("Coercing 'character' RHS to '%s' to match the type of %s."), type2char(TYPEOF(target)), targetDesc);
warning(_("Coercing 'character' RHS to '%s' to match the type of %s."), targetIsI64?"integer64":type2char(TYPEOF(target)), targetDesc);
// this "Coercing ..." warning first to give context in case coerceVector warns 'NAs introduced by coercion'
// and also because 'character' to integer/double coercion is often a user mistake (e.g. wrong target column, or wrong
// variable on RHS) which they are more likely to appreciate than find inconvenient
Expand All @@ -856,7 +856,7 @@ const char *memrecycle(const SEXP target, const SEXP where, const int start, con
// inside BODY that cater for 'where' or not. Maybe there's a way to merge the two macros in future.
// The idea is to do these range checks without calling coerceVector() (which allocates)

#define CHECK_RANGE(STYPE, RFUN, COND, FMT, TO) {{ \
#define CHECK_RANGE(STYPE, RFUN, COND, FMT, TO, FMTVAL) {{ \
const STYPE *sd = (const STYPE *)RFUN(source); \
for (int i=0; i<slen; ++i) { \
const STYPE val = sd[i+soff]; \
Expand All @@ -865,7 +865,7 @@ const char *memrecycle(const SEXP target, const SEXP where, const int start, con
const char *tType = targetIsI64 ? "integer64" : type2char(TYPEOF(target)); \
snprintf(memrecycle_message, MSGSIZE, \
"%"FMT" (type '%s') at RHS position %d "TO" when assigning to type '%s' (%s)", \
val, sType, i+1, tType, targetDesc); \
FMTVAL, sType, i+1, tType, targetDesc); \
/* string returned so that rbindlist/dogroups can prefix it with which item of its list this refers to */ \
break; \
} \
Expand All @@ -875,28 +875,36 @@ const char *memrecycle(const SEXP target, const SEXP where, const int start, con
switch(TYPEOF(target)) {
case LGLSXP:
switch (TYPEOF(source)) {
case RAWSXP: CHECK_RANGE(Rbyte, RAW, val!=0 && val!=1, "d", "taken as TRUE")
case INTSXP: CHECK_RANGE(int, INTEGER, val!=0 && val!=1 && val!=NA_INTEGER, "d", "taken as TRUE")
case RAWSXP: CHECK_RANGE(Rbyte, RAW, val!=0 && val!=1, "d", "taken as TRUE", val)
case INTSXP: CHECK_RANGE(int, INTEGER, val!=0 && val!=1 && val!=NA_INTEGER, "d", "taken as TRUE", val)
case REALSXP: if (sourceIsI64)
CHECK_RANGE(int64_t, REAL, val!=0 && val!=1 && val!=NA_INTEGER64, PRId64, "taken as TRUE")
else CHECK_RANGE(double, REAL, !ISNAN(val) && val!=0.0 && val!=1.0, "f", "taken as TRUE")
CHECK_RANGE(int64_t, REAL, val!=0 && val!=1 && val!=NA_INTEGER64, PRId64, "taken as TRUE", val)
else CHECK_RANGE(double, REAL, !ISNAN(val) && val!=0.0 && val!=1.0, "f", "taken as TRUE", val)
} break;
case RAWSXP:
switch (TYPEOF(source)) {
case INTSXP: CHECK_RANGE(int, INTEGER, val<0 || val>255, "d", "taken as 0")
case INTSXP: CHECK_RANGE(int, INTEGER, val<0 || val>255, "d", "taken as 0", val)
case REALSXP: if (sourceIsI64)
CHECK_RANGE(int64_t, REAL, val<0 || val>255, PRId64, "taken as 0")
else CHECK_RANGE(double, REAL, !R_FINITE(val) || val<0.0 || val>256.0 || (int)val!=val, "f", "either truncated (precision lost) or taken as 0")
CHECK_RANGE(int64_t, REAL, val<0 || val>255, PRId64, "taken as 0", val)
else CHECK_RANGE(double, REAL, !R_FINITE(val) || val<0.0 || val>256.0 || (int)val!=val, "f", "either truncated (precision lost) or taken as 0", val)
} break;
case INTSXP:
if (TYPEOF(source)==REALSXP) {
if (sourceIsI64)
CHECK_RANGE(int64_t, REAL, val!=NA_INTEGER64 && (val<=NA_INTEGER || val>INT_MAX), PRId64, "out-of-range (NA)")
else CHECK_RANGE(double, REAL, !ISNAN(val) && (!R_FINITE(val) || (int)val!=val), "f", "truncated (precision lost)")
switch (TYPEOF(source)) {
case REALSXP: if (sourceIsI64)
CHECK_RANGE(int64_t, REAL, val!=NA_INTEGER64 && (val<=NA_INTEGER || val>INT_MAX), PRId64, "out-of-range (NA)", val)
else CHECK_RANGE(double, REAL, !ISNAN(val) && (!R_FINITE(val) || (int)val!=val), "f", "truncated (precision lost)", val)
case CPLXSXP: CHECK_RANGE(Rcomplex, COMPLEX, !((ISNAN(val.i) || (R_FINITE(val.i) && val.i==0.0)) &&
(ISNAN(val.r) || (R_FINITE(val.r) && (int)val.r==val.r))), "f", "either imaginary part discarded or real part truncated (precision lost)", val.r)
} break;
case REALSXP:
if (targetIsI64 && isReal(source) && !sourceIsI64) {
CHECK_RANGE(double, REAL, !ISNAN(val) && (!R_FINITE(val) || (int)val!=val), "f", "truncated (precision lost)")
switch (TYPEOF(source)) {
case REALSXP: if (targetIsI64 && !sourceIsI64)
CHECK_RANGE(double, REAL, !ISNAN(val) && (!R_FINITE(val) || (int)val!=val), "f", "truncated (precision lost)", val)
break;
case CPLXSXP: if (targetIsI64)
CHECK_RANGE(Rcomplex, COMPLEX, !((ISNAN(val.i) || (R_FINITE(val.i) && val.i==0.0)) &&
(ISNAN(val.r) || (R_FINITE(val.r) && (int64_t)val.r==val.r))), "f", "either imaginary part discarded or real part truncated (precision lost)", val.r)
else CHECK_RANGE(Rcomplex, COMPLEX, !(ISNAN(val.i) || (R_FINITE(val.i) && val.i==0.0)), "f", "imaginary part discarded", val.i)
}
}
}
Expand Down Expand Up @@ -992,6 +1000,7 @@ const char *memrecycle(const SEXP target, const SEXP where, const int start, con
if (sourceIsI64)
BODY(int64_t, REAL, int, (val==NA_INTEGER64||val>INT_MAX||val<=NA_INTEGER) ? NA_INTEGER : (int)val, td[i]=cval)
else BODY(double, REAL, int, ISNAN(val) ? NA_INTEGER : (int)val, td[i]=cval)
case CPLXSXP: BODY(Rcomplex, COMPLEX, int, ISNAN(val.r) ? NA_INTEGER : (int)val.r, td[i]=cval)
default: COERCE_ERROR("integer"); // test 2005.4
}
} break;
Expand All @@ -1008,6 +1017,7 @@ const char *memrecycle(const SEXP target, const SEXP where, const int start, con
memcpy(td, (int64_t *)REAL(source), slen*sizeof(int64_t)); break;
} else BODY(int64_t, REAL, int64_t, val, td[i]=cval)
} else BODY(double, REAL, int64_t, R_FINITE(val) ? val : NA_INTEGER64, td[i]=cval)
case CPLXSXP: BODY(Rcomplex, COMPLEX, int64_t, ISNAN(val.r) ? NA_INTEGER64 : (int64_t)val.r, td[i]=cval)
default: COERCE_ERROR("integer64");
}
} else {
Expand All @@ -1022,6 +1032,7 @@ const char *memrecycle(const SEXP target, const SEXP where, const int start, con
memcpy(td, (double *)REAL(source), slen*sizeof(double)); break;
} else BODY(double, REAL, double, val, td[i]=cval)
} else BODY(int64_t, REAL, double, val==NA_INTEGER64 ? NA_REAL : val, td[i]=cval)
case CPLXSXP: BODY(Rcomplex, COMPLEX, double, val.r, td[i]=cval)
default: COERCE_ERROR("double");
}
}
Expand Down Expand Up @@ -1060,9 +1071,13 @@ const char *memrecycle(const SEXP target, const SEXP where, const int start, con
}
break;
}
if (sourceIsI64)
error(_("To assign integer64 to a target of type character, please use as.character() for clarity.")); // TODO: handle that here as well
source = PROTECT(coerceVector(source, STRSXP)); protecti++;
if (OBJECT(source) && getAttrib(source, R_ClassSymbol)!=R_NilValue) {
// otherwise coerceVector doesn't call the as.character method for Date, IDate, integer64, nanotime, etc; PR#5189
// this if() is to save the overhead of the R call eval() when we know there can be no method
source = PROTECT(eval(PROTECT(lang2(sym_as_character, source)), R_GlobalEnv)); protecti+=2;
} else {
source = PROTECT(coerceVector(source, STRSXP)); protecti++;
}
}
BODY(SEXP, STRING_PTR, SEXP, val, SET_STRING_ELT(target, off+i, cval))
}
Expand Down
1 change: 1 addition & 0 deletions src/data.table.h
Original file line number Diff line number Diff line change
Expand Up @@ -103,6 +103,7 @@ extern SEXP sym_datatable_locked;
extern SEXP sym_tzone;
extern SEXP sym_old_fread_datetime_character;
extern SEXP sym_variable_table;
extern SEXP sym_as_character;
extern double NA_INT64_D;
extern long long NA_INT64_LL;
extern Rcomplex NA_CPLX; // initialized in init.c; see there for comments
Expand Down
2 changes: 2 additions & 0 deletions src/init.c
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@ SEXP sym_datatable_locked;
SEXP sym_tzone;
SEXP sym_old_fread_datetime_character;
SEXP sym_variable_table;
SEXP sym_as_character;
double NA_INT64_D;
long long NA_INT64_LL;
Rcomplex NA_CPLX;
Expand Down Expand Up @@ -366,6 +367,7 @@ void attribute_visible R_init_data_table(DllInfo *info)
sym_tzone = install("tzone");
sym_old_fread_datetime_character = install("datatable.old.fread.datetime.character");
sym_variable_table = install("variable_table");
sym_as_character = install("as.character");

initDTthreads();
avoid_openmp_hang_within_fork();
Expand Down
Loading