Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix melt(na.rm=TRUE) with list columns #5044

Merged
merged 14 commits into from
Jun 21, 2021
Merged

Conversation

tdhock
Copy link
Member

@tdhock tdhock commented Jun 12, 2021

This is a follow-up to #4737 which proposed, for consistency, to allow melt(na.rm=TRUE) on list columns. For now I have just added some tests which fail given the current melt code:

> DT.list.missing = data.table(l1=list(1,NA), l2=list(NA,2), n34=c(3,4), NA5=c(NA,5))
> melt(DT.list.missing, measure.vars=c("n34","NA5"), na.rm=TRUE)
Error in melt.data.table(DT.list.missing, measure.vars = c("n34", "NA5"),  : 
  attempt to set index 3/3 in SET_VECTOR_ELT
> melt(DT.list.missing, measure.vars=c("l1","l2"), na.rm=TRUE)
     n34   NA5 variable  value
   <num> <num>   <fctr> <list>
1:     3    NA       l1      1
2:     4     5       l1     NA
3:     3    NA       l2     NA
4:     4     5       l2      2
> melt(DT.list.missing, measure.vars=c("l1","n34"), na.rm=TRUE)
       l2   NA5 variable  value
   <list> <num>   <fctr> <list>
1:     NA    NA       l1      1
2:      2     5       l1     NA
3:     NA    NA      n34      3
4:      2     5      n34      4
Warning message:
In melt.data.table(DT.list.missing, measure.vars = c("l1", "n34"),  :
  'measure.vars' [l1, n34] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'list'. All measure variables not of type 'list' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.
> melt(DT.list.missing, measure.vars=c("l1","NA5"), na.rm=TRUE)
Error in melt.data.table(DT.list.missing, measure.vars = c("l1", "NA5"),  : 
  attempt to set index 3/3 in SET_VECTOR_ELT
In addition: Warning message:
In melt.data.table(DT.list.missing, measure.vars = c("l1", "NA5"),  :
  'measure.vars' [l1, NA5] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'list'. All measure variables not of type 'list' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.
> melt(DT.list.missing, measure.vars=list(l=c("l1","l2"), n=c("n34","NA5")), na.rm=TRUE)
   variable      l     n
     <fctr> <list> <num>
1:        1      1     3
2:        1     NA     4
3:        2      2     5

There are two problems (1) error in SET_VECTOR_ELT, (2) there are NA in value columns even though we asked for na.rm=TRUE.

@tdhock
Copy link
Member Author

tdhock commented Jun 14, 2021

Part of the problem was that dt_na in data.table frank.c did not handle columns of type VECSXP (R list), even though is.na(some_list) in R returns a logical vector with TRUE for list elements that contain scalar NA. So now in this PR dt_na handles VECSXP but anyNA (also in frank.c) still does NOT handle VECSXP. Is that OK to leave like this? Or would it be preferable to also update anyNA as well, for consistency?

The C code I used comes from base R source code, file coerce.c -> function do_isna,
https://github.com/wch/r-source/blob/b560647e74459fa2f40262dcaf1abf171c197efc/src/main/coerce.c#L2247-L2271
This is the C function that is used to implement R function is.na.

@codecov
Copy link

codecov bot commented Jun 14, 2021

Codecov Report

❗ No coverage uploaded for pull request base (master@4c600a2). Click here to learn what that means.
The diff coverage is 100.00%.

❗ Current head ac466a5 differs from pull request most recent head 8b81d43. Consider uploading reports for the commit 8b81d43 to get more accurate results
Impacted file tree graph

@@            Coverage Diff            @@
##             master    #5044   +/-   ##
=========================================
  Coverage          ?   99.47%           
=========================================
  Files             ?       75           
  Lines             ?    14841           
  Branches          ?        0           
=========================================
  Hits              ?    14763           
  Misses            ?       78           
  Partials          ?        0           
Impacted Files Coverage Δ
R/data.table.R 99.94% <ø> (ø)
src/assign.c 99.71% <ø> (ø)
src/between.c 99.21% <ø> (ø)
src/bmerge.c 100.00% <ø> (ø)
src/chmatch.c 100.00% <ø> (ø)
src/cj.c 100.00% <ø> (ø)
src/coalesce.c 100.00% <ø> (ø)
src/dogroups.c 99.67% <ø> (ø)
src/fastmean.c 96.82% <ø> (ø)
src/fcast.c 100.00% <ø> (ø)
... and 64 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4c600a2...8b81d43. Read the comment docs.

@tdhock tdhock marked this pull request as ready for review June 14, 2021 02:00
@tdhock tdhock changed the title WIP: fix melt(na.rm=TRUE) with list columns fix melt(na.rm=TRUE) with list columns Jun 14, 2021
@tdhock
Copy link
Member Author

tdhock commented Jun 14, 2021

After the PR we get the following results (no errors, no NA in value columns)

> DT.list.missing = data.table(l1=list(1,NA), l2=list(NA,2), n34=c(3,4), NA5=c(NA,5))
> melt(DT.list.missing, measure.vars=c("n34","NA5"), na.rm=TRUE)
       l1     l2 variable value
   <list> <list>   <fctr> <num>
1:      1     NA      n34     3
2:     NA      2      n34     4
3:     NA      2      NA5     5
> melt(DT.list.missing, measure.vars=c("l1","l2"), na.rm=TRUE)
     n34   NA5 variable  value
   <num> <num>   <fctr> <list>
1:     3    NA       l1      1
2:     4     5       l2      2
> melt(DT.list.missing, measure.vars=c("l1","n34"), na.rm=TRUE)
       l2   NA5 variable  value
   <list> <num>   <fctr> <list>
1:     NA    NA       l1      1
2:     NA    NA      n34      3
3:      2     5      n34      4
Warning message:
In melt.data.table(DT.list.missing, measure.vars = c("l1", "n34"),  :
  'measure.vars' [l1, n34] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'list'. All measure variables not of type 'list' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.
> melt(DT.list.missing, measure.vars=c("l1","NA5"), na.rm=TRUE)
       l2   n34 variable  value
   <list> <num>   <fctr> <list>
1:     NA     3       l1      1
2:      2     4      NA5      5
Warning message:
In melt.data.table(DT.list.missing, measure.vars = c("l1", "NA5"),  :
  'measure.vars' [l1, NA5] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'list'. All measure variables not of type 'list' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.
> melt(DT.list.missing, measure.vars=list(l=c("l1","l2"), n=c("n34","NA5")), na.rm=TRUE)
   variable      l     n
     <fctr> <list> <num>
1:        1      1     3
2:        2      2     5

@tdhock tdhock requested a review from mattdowle June 14, 2021 02:04
@tdhock
Copy link
Member Author

tdhock commented Jun 14, 2021

So something funny happens when you call base R is.na on a list.
As mentioned above, usually it will return a logical vector with TRUE if the list element is a scalar NA, and FALSE otherwise.
However it does not know about bit64::NA_integer64_ so:

> is.na(NA_integer64_)
[1] TRUE
> is.na(list(NA_integer64_, NA_real_, NA_integer_))
[1] FALSE  TRUE  TRUE
> sapply(list(NA_integer64_, NA_real_, NA_integer_), is.na)
[1] TRUE TRUE TRUE

The code in this PR for melt(na.rm=TRUE) will actually remove NA_integer64_, but using is.na after na.rm=FALSE will NOT:

> DT.wide <- data.table(l1=list(NA, c(NA,NA)), l2=list(NA_complex_, NA_integer64_))
> (DT.long.na.rm <- melt(DT.wide, measure=c("l1","l2"), na.rm=TRUE))
   variable  value
     <fctr> <list>
1:       l1  NA,NA
> DT.long.na.keep <- melt(DT.wide, measure=c("l1","l2"), na.rm=FALSE)
> DT.long.na.keep[!is.na(value)]
   variable  value
     <fctr> <list>
1:       l1  NA,NA
2:       l2     NA

So either we can

  • a. keep as is, and get results that are inconsistent between na.rm=TRUE and FALSE. (but our dt_na would no longer be consistent with base R is.na)
  • b. modify our dt_na C function so that it DOES NOT consider a list element with scalar NA_integer64_ as missing (consistent with base R is.na). Then melt would yield consistent results between na.rm=TRUE and FALSE but NA_integer64_ would be an odd exception to the rule about when list elements should be considered missing values (as in base R which is not aware of that type).
  • c. modify base R is.na so that everything is consistent? Seems like it would be reasonable to ask base R is.na to call a S3 method if it exists, as it does in this case, is.na.integer64. I will post an issue with bit64 about this.

@mattdowle
Copy link
Member

The C code I used comes from base R source code

The simple loops checking for NA in list columns in this PR are pretty trivial, and you could equally have copied this R-API-usage from other parts of data.table, or just used R API documentation to write it.

But the fact you wrote that you did copy, even simple loops, from base R's source code, may be problematic. R's source code is GPL with headers LGPL, but data.table is MPL. I believe the intention of the R core team (as stated in doc/COPYRIGHTS) would be better represented by the use of MPL, and therefore I can't imagine they would have any problem with this PR being accepted, but I think any change on the licensing of R's C source is very unlikely.

Perhaps you can research further on this topic and propose a way forward.

@tdhock
Copy link
Member Author

tdhock commented Jun 15, 2021

My intention was to look at the base R source code so that the data.table implementation would be consistent with the base R implementation of is.na. The only way I could be sure about that was to either use is.na directly (which seemed like a bad idea due to efficiency) or to use the same logic as in its source code.
Sorry for the trouble. I did not think about the licensing issue, and I do not know how to fix that. (I'm not a lawyer...)

@mattdowle
Copy link
Member

mattdowle commented Jun 16, 2021

I think we're fine to merge this one, this time.

It was very useful you pointed to the parts of R's source you looked at. First of all it's only 25 lines. These 25 lines include single line {, }, case:, and break; statements. Essentially it's just one switch and one if.

It's looping through a list and returning whether each item is a length-1 NA value. The switch is for each of the types. Subtracting the {, }, case: and break lines leaves us with the following meat in this case:

if (!isVector(s) || length(s) != 1)
TYPEOF(s)
(INTEGER_ELT(s, 0) == NA_INTEGER)
ISNAN(REAL_ELT(s, 0))
(STRING_ELT(s, 0) == NA_STRING)
Rcomplex v = COMPLEX_ELT(s, 0)
(ISNAN(v.r) || ISNAN(v.i))

That's all there is to it: those particular 7 lines. We have code like this throughout data.table. It's API usage documented in the manual. There is no other logic in the lines of code you've referred to. There's no algorithm here.

More interesting perhaps is whether INTEGER_ELT is actually part of R's API. I don't see it in R-exts (so that function is not actually documented in the manual). R-exts is BDR's definition of the R-API, iiuc, which differs from other R-core member's definition. Anyway, we haven't used INTEGER_ELT before in data.table (I grep'd our source) so I wonder how long it has been available and whether this PR would pass R 3.1.0 which is our stated dependency. Unless there is a good reason to use it, I would just use INTEGER(list_element)[0] which we've done forever and is exactly the same as INTEGER_ELT(list_element, 0)?

I'll leave it a few days before merging to give time for further comments.

@tdhock
Copy link
Member Author

tdhock commented Jun 16, 2021

I get the impression that REAL_ELT etc are related to ALTREP (but I dont see them mentioned in the docs either for some reason). Last month there was an ALTREP-related R-devel post mentioning REAL_ELT and INTEGER_ELT as "fairly new" https://www.mail-archive.com/[email protected]/msg43455.html so we should change back to REAL/etc to support old R. (we don't have a CI test on R-3.1.0?)

@MichaelChirico
Copy link
Member

MichaelChirico commented Jun 16, 2021 via email

@tdhock
Copy link
Member Author

tdhock commented Jun 17, 2021

by the way I asked for *_ELT docs on R-devel, and I got this response from Luke Tierney

> One more question: are there any circumstances in which one should use
> REAL_ELT(x,i) rather than REAL(x)[i] or vice versa? Or can they be used
> interchangeably?

For a single call it is better to use REAL_ELT(x, i) since it doesn't
force allocating a possibly large object in order to get a pointer to
its data with REAL(x).  If you are iterating over a whole object you
may want to get data in chunks. There are iteration macros that
help. Some examples are in src/main/summary.c.

@mattdowle mattdowle merged commit 793cec0 into master Jun 21, 2021
@mattdowle mattdowle deleted the fix-melt-narm-list-example branch June 21, 2021 20:58
mattdowle added a commit that referenced this pull request Jun 22, 2021
…the NULL is not removed. Thanks to GLCI which has two instances running without bit64 to catch this
@mattdowle
Copy link
Member

@tdhock maybe list items with the value NULL should be removed too? Up to you. Just letting you know in case you weren't aware. See 3a5de96 just now which I just did quickly to pass GLCI until you can look. Feel free to change this test back if you decide to remove NULL.

@mattdowle
Copy link
Member

Linking to @tdhocks's reply: 3a5de96#r52510904

@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants