fix melt(na.rm=TRUE) with list columns #5044
Part of the problem was that dt_na in data.table's frank.c did not handle columns of type VECSXP (R list), even though is.na(some_list) in R returns a logical vector that is TRUE for list elements containing a scalar NA. With this PR, dt_na handles VECSXP, but anyNA (also in frank.c) still does NOT. Is it OK to leave it like this, or would it be preferable to update anyNA as well, for consistency? The C code I used comes from the base R source code, file coerce.c -> function do_isna, |
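To make the intended semantics concrete, here is a minimal standalone C sketch of the rule described above: a list element counts as NA only when it is a length-1 atomic vector holding NA. It does not use the R API. The `elt` struct, the `elt_type` enum, the `NA_INT` macro, and the functions `elt_is_na` and `list_is_na` are hypothetical stand-ins for SEXP, its type tags, and the real dt_na code, so treat this as an illustration under those assumptions rather than the code in this PR. String and complex elements are left out for brevity.

```c
#include <limits.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for R's vector types; this is NOT the R API. */
typedef enum { ELT_LOGICAL, ELT_INTEGER, ELT_REAL, ELT_LIST } elt_type;

typedef struct {
  elt_type type;
  size_t length;
  const int *ints;     /* payload for ELT_LOGICAL and ELT_INTEGER */
  const double *reals; /* payload for ELT_REAL */
} elt;

/* R represents both NA_integer_ and logical NA as INT_MIN. */
#define NA_INT INT_MIN

/* An element is NA only if it is a length-1 atomic vector holding NA. */
static bool elt_is_na(const elt *e) {
  if (e->length != 1) return false;
  switch (e->type) {
  case ELT_LOGICAL:
  case ELT_INTEGER:
    return e->ints[0] == NA_INT;
  case ELT_REAL:
    return isnan(e->reals[0]); /* true for NA_real_ and NaN alike */
  default:
    return false; /* nested lists and other types are never NA */
  }
}

/* Fill out[i] with the NA status of each of the n list elements. */
static void list_is_na(const elt *list, size_t n, bool *out) {
  for (size_t i = 0; i < n; i++) out[i] = elt_is_na(&list[i]);
}
```

For a list modelled on list(1L, NA, NaN, c(NA, NA)), this sketch reports FALSE, TRUE, TRUE, FALSE: the length-2 element is not treated as missing.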
Codecov Report
@@ Coverage Diff @@
## master #5044 +/- ##
=========================================
Coverage ? 99.47%
=========================================
Files ? 75
Lines ? 14841
Branches ? 0
=========================================
Hits ? 14763
Misses ? 78
Partials ? 0
Continue to review full report at Codecov.
|
After the PR we get the following results (no errors, no NA in value columns):
> DT.list.missing = data.table(l1=list(1,NA), l2=list(NA,2), n34=c(3,4), NA5=c(NA,5))
> melt(DT.list.missing, measure.vars=c("n34","NA5"), na.rm=TRUE)
l1 l2 variable value
<list> <list> <fctr> <num>
1: 1 NA n34 3
2: NA 2 n34 4
3: NA 2 NA5 5
> melt(DT.list.missing, measure.vars=c("l1","l2"), na.rm=TRUE)
n34 NA5 variable value
<num> <num> <fctr> <list>
1: 3 NA l1 1
2: 4 5 l2 2
> melt(DT.list.missing, measure.vars=c("l1","n34"), na.rm=TRUE)
l2 NA5 variable value
<list> <num> <fctr> <list>
1: NA NA l1 1
2: NA NA n34 3
3: 2 5 n34 4
Warning message:
In melt.data.table(DT.list.missing, measure.vars = c("l1", "n34"), :
'measure.vars' [l1, n34] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'list'. All measure variables not of type 'list' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.
> melt(DT.list.missing, measure.vars=c("l1","NA5"), na.rm=TRUE)
l2 n34 variable value
<list> <num> <fctr> <list>
1: NA 3 l1 1
2: 2 4 NA5 5
Warning message:
In melt.data.table(DT.list.missing, measure.vars = c("l1", "NA5"), :
'measure.vars' [l1, NA5] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'list'. All measure variables not of type 'list' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.
> melt(DT.list.missing, measure.vars=list(l=c("l1","l2"), n=c("n34","NA5")), na.rm=TRUE)
variable l n
<fctr> <list> <num>
1: 1 1 3
2: 2 2 5 |
So something funny happens when you call base R
> is.na(NA_integer64_)
[1] TRUE
> is.na(list(NA_integer64_, NA_real_, NA_integer_))
[1] FALSE TRUE TRUE
> sapply(list(NA_integer64_, NA_real_, NA_integer_), is.na)
[1] TRUE TRUE TRUE
The code in this PR for
> DT.wide <- data.table(l1=list(NA, c(NA,NA)), l2=list(NA_complex_, NA_integer64_))
> (DT.long.na.rm <- melt(DT.wide, measure=c("l1","l2"), na.rm=TRUE))
variable value
<fctr> <list>
1: l1 NA,NA
> DT.long.na.keep <- melt(DT.wide, measure=c("l1","l2"), na.rm=FALSE)
> DT.long.na.keep[!is.na(value)]
variable value
<fctr> <list>
1: l1 NA,NA
2: l2 NA
So either we can
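One plausible explanation for the FALSE returned by is.na(list(NA_integer64_, ...)) above, offered as an assumption rather than something confirmed in this thread: bit64 stores integer64 values in double storage, with NA represented by the bit pattern of INT64_MIN. A check that ignores the class attribute and only tests the double for NaN will therefore miss it. The standalone C sketch below illustrates this; the helper names `int64_bits_as_double`, `na_as_double`, and `na_as_int64` are hypothetical and not part of R, bit64, or data.table.

```c
#include <math.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a 64-bit integer as a double, which is how an
   integer64 value would look when stored in a double-typed vector. */
static double int64_bits_as_double(int64_t v) {
  double d;
  memcpy(&d, &v, sizeof d);
  return d;
}

/* Class-agnostic check: only asks whether the double is NaN. */
static bool na_as_double(double d) {
  return isnan(d);
}

/* Class-aware check: reads the same bits back as int64 and compares them
   with INT64_MIN (assumed here to be the integer64 NA value). */
static bool na_as_int64(double d) {
  int64_t v;
  memcpy(&v, &d, sizeof v);
  return v == INT64_MIN;
}
```

With these helpers, the bits of INT64_MIN reinterpreted as a double are negative zero, so `na_as_double` returns false while `na_as_int64` returns true. That matches the difference between is.na() on the list and sapply(..., is.na), where method dispatch on the integer64 class takes place.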
|
The simple loops checking for NA in list columns in this PR are pretty trivial, and you could equally have copied this R-API usage from other parts of data.table, or just used the R API documentation to write it. But the fact that you wrote that you did copy, even simple loops, from base R's source code may be problematic. R's source code is GPL with headers LGPL, but data.table is MPL. I believe the intention of the R core team (as stated in doc/COPYRIGHTS) would be better represented by the use of MPL, and therefore I can't imagine they would have any problem with this PR being accepted, but I think any change to the licensing of R's C source is very unlikely. Perhaps you can research this topic further and propose a way forward. |
My intention was to look at the base R source code so that the data.table implementation would be consistent with the base R implementation of |
I think we're fine to merge this one, this time. It was very useful you pointed to the parts of R's source you looked at. First of all it's only 25 lines. These 25 lines include single line It's looping through a
That's all there is to it: those particular 7 lines. We have code like this throughout data.table. It's API usage documented in the manual. There is no other logic in the lines of code you've referred to. There's no algorithm here. More interesting perhaps is whether I'll leave it a few days before merging to give time for further comments. |
I get the impression that REAL_ELT etc. are related to ALTREP (but I don't see them mentioned in the docs either, for some reason). Last month there was an ALTREP-related R-devel post mentioning REAL_ELT and INTEGER_ELT as "fairly new" https://www.mail-archive.com/[email protected]/msg43455.html so we should change back to REAL etc. to support old R. (We don't have a CI test on R-3.1.0?) |
R 3.1.0 is tested on GitLab after a PR is merged:
https://gitlab.com/Rdatatable/data.table/-/pipelines
|
By the way, I asked for *_ELT docs on R-devel, and I got this response from Luke Tierney
|
…the NULL is not removed. Thanks to GLCI which has two instances running without bit64 to catch this
Linking to @tdhocks's reply: 3a5de96#r52510904 |
This is a follow-up to #4737 which proposed, for consistency, to allow melt(na.rm=TRUE) on list columns. For now I have just added some tests which fail given the current melt code:
There are two problems: (1) an error in SET_VECTOR_ELT, and (2) NA values remain in the value columns even though we asked for na.rm=TRUE.