fix melt(na.rm=TRUE) with list columns #5044

tdhock · 2021-06-12T13:26:39Z

This is a follow-up to #4737 which proposed, for consistency, to allow melt(na.rm=TRUE) on list columns. For now I have just added some tests which fail given the current melt code:

> DT.list.missing = data.table(l1=list(1,NA), l2=list(NA,2), n34=c(3,4), NA5=c(NA,5))
> melt(DT.list.missing, measure.vars=c("n34","NA5"), na.rm=TRUE)
Error in melt.data.table(DT.list.missing, measure.vars = c("n34", "NA5"),  : 
  attempt to set index 3/3 in SET_VECTOR_ELT
> melt(DT.list.missing, measure.vars=c("l1","l2"), na.rm=TRUE)
     n34   NA5 variable  value
   <num> <num>   <fctr> <list>
1:     3    NA       l1      1
2:     4     5       l1     NA
3:     3    NA       l2     NA
4:     4     5       l2      2
> melt(DT.list.missing, measure.vars=c("l1","n34"), na.rm=TRUE)
       l2   NA5 variable  value
   <list> <num>   <fctr> <list>
1:     NA    NA       l1      1
2:      2     5       l1     NA
3:     NA    NA      n34      3
4:      2     5      n34      4
Warning message:
In melt.data.table(DT.list.missing, measure.vars = c("l1", "n34"),  :
  'measure.vars' [l1, n34] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'list'. All measure variables not of type 'list' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.
> melt(DT.list.missing, measure.vars=c("l1","NA5"), na.rm=TRUE)
Error in melt.data.table(DT.list.missing, measure.vars = c("l1", "NA5"),  : 
  attempt to set index 3/3 in SET_VECTOR_ELT
In addition: Warning message:
In melt.data.table(DT.list.missing, measure.vars = c("l1", "NA5"),  :
  'measure.vars' [l1, NA5] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'list'. All measure variables not of type 'list' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.
> melt(DT.list.missing, measure.vars=list(l=c("l1","l2"), n=c("n34","NA5")), na.rm=TRUE)
   variable      l     n
     <fctr> <list> <num>
1:        1      1     3
2:        1     NA     4
3:        2      2     5

There are two problems (1) error in SET_VECTOR_ELT, (2) there are NA in value columns even though we asked for na.rm=TRUE.

tdhock · 2021-06-14T01:05:33Z

Part of the problem was that dt_na in data.table frank.c did not handle columns of type VECSXP (R list), even though is.na(some_list) in R returns a logical vector with TRUE for list elements that contain scalar NA. So now in this PR dt_na handles VECSXP but anyNA (also in frank.c) still does NOT handle VECSXP. Is that OK to leave like this? Or would it be preferable to also update anyNA as well, for consistency?

The C code I used comes from base R source code, file coerce.c -> function do_isna,
https://github.com/wch/r-source/blob/b560647e74459fa2f40262dcaf1abf171c197efc/src/main/coerce.c#L2247-L2271
This is the C function that is used to implement R function is.na.

codecov · 2021-06-14T01:23:21Z

Codecov Report

❗ No coverage uploaded for pull request base (master@4c600a2). Click here to learn what that means.
The diff coverage is 100.00%.

❗ Current head ac466a5 differs from pull request most recent head 8b81d43. Consider uploading reports for the commit 8b81d43 to get more accurate results

@@            Coverage Diff            @@
##             master    #5044   +/-   ##
=========================================
  Coverage          ?   99.47%           
=========================================
  Files             ?       75           
  Lines             ?    14841           
  Branches          ?        0           
=========================================
  Hits              ?    14763           
  Misses            ?       78           
  Partials          ?        0

Impacted Files	Coverage Δ
R/data.table.R	`99.94% <ø> (ø)`
src/assign.c	`99.71% <ø> (ø)`
src/between.c	`99.21% <ø> (ø)`
src/bmerge.c	`100.00% <ø> (ø)`
src/chmatch.c	`100.00% <ø> (ø)`
src/cj.c	`100.00% <ø> (ø)`
src/coalesce.c	`100.00% <ø> (ø)`
src/dogroups.c	`99.67% <ø> (ø)`
src/fastmean.c	`96.82% <ø> (ø)`
src/fcast.c	`100.00% <ø> (ø)`
... and 64 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4c600a2...8b81d43. Read the comment docs.

tdhock · 2021-06-14T02:04:25Z

After the PR we get the following results (no errors, no NA in value columns)

> DT.list.missing = data.table(l1=list(1,NA), l2=list(NA,2), n34=c(3,4), NA5=c(NA,5))
> melt(DT.list.missing, measure.vars=c("n34","NA5"), na.rm=TRUE)
       l1     l2 variable value
   <list> <list>   <fctr> <num>
1:      1     NA      n34     3
2:     NA      2      n34     4
3:     NA      2      NA5     5
> melt(DT.list.missing, measure.vars=c("l1","l2"), na.rm=TRUE)
     n34   NA5 variable  value
   <num> <num>   <fctr> <list>
1:     3    NA       l1      1
2:     4     5       l2      2
> melt(DT.list.missing, measure.vars=c("l1","n34"), na.rm=TRUE)
       l2   NA5 variable  value
   <list> <num>   <fctr> <list>
1:     NA    NA       l1      1
2:     NA    NA      n34      3
3:      2     5      n34      4
Warning message:
In melt.data.table(DT.list.missing, measure.vars = c("l1", "n34"),  :
  'measure.vars' [l1, n34] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'list'. All measure variables not of type 'list' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.
> melt(DT.list.missing, measure.vars=c("l1","NA5"), na.rm=TRUE)
       l2   n34 variable  value
   <list> <num>   <fctr> <list>
1:     NA     3       l1      1
2:      2     4      NA5      5
Warning message:
In melt.data.table(DT.list.missing, measure.vars = c("l1", "NA5"),  :
  'measure.vars' [l1, NA5] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'list'. All measure variables not of type 'list' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.
> melt(DT.list.missing, measure.vars=list(l=c("l1","l2"), n=c("n34","NA5")), na.rm=TRUE)
   variable      l     n
     <fctr> <list> <num>
1:        1      1     3
2:        2      2     5

tdhock · 2021-06-14T17:12:18Z

So something funny happens when you call base R is.na on a list.
As mentioned above, usually it will return a logical vector with TRUE if the list element is a scalar NA, and FALSE otherwise.
However it does not know about bit64::NA_integer64_ so:

> is.na(NA_integer64_)
[1] TRUE
> is.na(list(NA_integer64_, NA_real_, NA_integer_))
[1] FALSE  TRUE  TRUE
> sapply(list(NA_integer64_, NA_real_, NA_integer_), is.na)
[1] TRUE TRUE TRUE

The code in this PR for melt(na.rm=TRUE) will actually remove NA_integer64_, but using is.na after na.rm=FALSE will NOT:

> DT.wide <- data.table(l1=list(NA, c(NA,NA)), l2=list(NA_complex_, NA_integer64_))
> (DT.long.na.rm <- melt(DT.wide, measure=c("l1","l2"), na.rm=TRUE))
   variable  value
     <fctr> <list>
1:       l1  NA,NA
> DT.long.na.keep <- melt(DT.wide, measure=c("l1","l2"), na.rm=FALSE)
> DT.long.na.keep[!is.na(value)]
   variable  value
     <fctr> <list>
1:       l1  NA,NA
2:       l2     NA

So either we can

a. keep as is, and get results that are inconsistent between na.rm=TRUE and FALSE. (but our dt_na would no longer be consistent with base R is.na)
b. modify our dt_na C function so that it DOES NOT consider a list element with scalar NA_integer64_ as missing (consistent with base R is.na). Then melt would yield consistent results between na.rm=TRUE and FALSE but NA_integer64_ would be an odd exception to the rule about when list elements should be considered missing values (as in base R which is not aware of that type).
c. modify base R is.na so that everything is consistent? Seems like it would be reasonable to ask base R is.na to call a S3 method if it exists, as it does in this case, is.na.integer64. I will post an issue with bit64 about this.

mattdowle · 2021-06-14T23:02:51Z

The C code I used comes from base R source code

The simple loops checking for NA in list columns in this PR are pretty trivial, and you could equally have copied this R-API-usage from other parts of data.table, or just used R API documentation to write it.

But the fact you wrote that you did copy, even simple loops, from base R's source code, may be problematic. R's source code is GPL with headers LGPL, but data.table is MPL. I believe the intention of the R core team (as stated in doc/COPYRIGHTS) would be better represented by the use of MPL, and therefore I can't imagine they would have any problem with this PR being accepted, but I think any change on the licensing of R's C source is very unlikely.

Perhaps you can research further on this topic and propose a way forward.

tdhock · 2021-06-15T00:36:08Z

My intention was to look at the base R source code so that the data.table implementation would be consistent with the base R implementation of is.na. The only way I could be sure about that was to either use is.na directly (which seemed like a bad idea due to efficiency) or to use the same logic as in its source code.
Sorry for the trouble. I did not think about the licensing issue, and I do not know how to fix that. (I'm not a lawyer...)

mattdowle · 2021-06-16T06:51:17Z

I think we're fine to merge this one, this time.

It was very useful you pointed to the parts of R's source you looked at. First of all it's only 25 lines. These 25 lines include single line {, }, case:, and break; statements. Essentially it's just one switch and one if.

It's looping through a list and returning whether each item is a length-1 NA value. The switch is for each of the types. Subtracting the {, }, case: and break lines leaves us with the following meat in this case:

if (!isVector(s) || length(s) != 1)
TYPEOF(s)
(INTEGER_ELT(s, 0) == NA_INTEGER)
ISNAN(REAL_ELT(s, 0))
(STRING_ELT(s, 0) == NA_STRING)
Rcomplex v = COMPLEX_ELT(s, 0)
(ISNAN(v.r) || ISNAN(v.i))

That's all there is to it: those particular 7 lines. We have code like this throughout data.table. It's API usage documented in the manual. There is no other logic in the lines of code you've referred to. There's no algorithm here.

More interesting perhaps is whether INTEGER_ELT is actually part of R's API. I don't see it in R-exts (so that function is not actually documented in the manual). R-exts is BDR's definition of the R-API, iiuc, which differs from other R-core member's definition. Anyway, we haven't used INTEGER_ELT before in data.table (I grep'd our source) so I wonder how long it has been available and whether this PR would pass R 3.1.0 which is our stated dependency. Unless there is a good reason to use it, I would just use INTEGER(list_element)[0] which we've done forever and is exactly the same as INTEGER_ELT(list_element, 0)?

I'll leave it a few days before merging to give time for further comments.

tdhock · 2021-06-16T14:14:30Z

I get the impression that REAL_ELT etc are related to ALTREP (but I dont see them mentioned in the docs either for some reason). Last month there was an ALTREP-related R-devel post mentioning REAL_ELT and INTEGER_ELT as "fairly new" https://www.mail-archive.com/[email protected]/msg43455.html so we should change back to REAL/etc to support old R. (we don't have a CI test on R-3.1.0?)

MichaelChirico · 2021-06-16T15:57:40Z

3.1.0 is tested on gitlab after a PR is merged : https://gitlab.com/Rdatatable/data.table/-/pipelines

tdhock · 2021-06-17T19:43:05Z

by the way I asked for *_ELT docs on R-devel, and I got this response from Luke Tierney

> One more question: are there any circumstances in which one should use
> REAL_ELT(x,i) rather than REAL(x)[i] or vice versa? Or can they be used
> interchangeably?

For a single call it is better to use REAL_ELT(x, i) since it doesn't
force allocating a possibly large object in order to get a pointer to
its data with REAL(x).  If you are iterating over a whole object you
may want to get data in chunks. There are iteration macros that
help. Some examples are in src/main/summary.c.

…s R 3.1

…the NULL is not removed. Thanks to GLCI which has two instances running without bit64 to catch this

mattdowle · 2021-06-22T02:42:09Z

@tdhock maybe list items with the value NULL should be removed too? Up to you. Just letting you know in case you weren't aware. See 3a5de96 just now which I just did quickly to pass GLCI until you can look. Feel free to change this test back if you decide to remove NULL.

mattdowle · 2021-06-22T16:26:47Z

Linking to @tdhocks's reply: 3a5de96#r52510904

tdhock added 8 commits June 12, 2021 09:18

tests that fail for melt(na.rm=TRUE) with list columns

dd76128

change na.rm=TRUE example for list column

60162ff

fix for list column in id.vars with na.rm=TRUE

e55ac23

TODO complex

2a0c17a

dt_na case VECSXP

edc5507

fix na.omit test result

a5637ea

test fails for complex NA

bf653d3

dt_na case VECSXP element new case CPLXSXP

304d0ec

melt on a data table with list columns

97b9a06

test removing NA_integer64_

ac466a5

tdhock marked this pull request as ready for review June 14, 2021 02:00

tdhock changed the title ~~WIP: fix melt(na.rm=TRUE) with list columns~~ fix melt(na.rm=TRUE) with list columns Jun 14, 2021

tdhock requested a review from mattdowle June 14, 2021 02:04

tdhock mentioned this pull request Jun 14, 2021

is.na returns FALSE for list element with scalar NA_integer64_ r-lib/bit64#18

Closed

mattdowle added this to the 1.14.1 milestone Jun 14, 2021

mattdowle added 4 commits June 21, 2021 14:11

Merge branch 'master' into fix-melt-narm-list-example

c117a82

INTEGER_ELT(x,i) => INTEGER(x)[i] (and similarly REAL_ELT etc) to pas…

c8667af

…s R 3.1

avoid DtoLL

43dbaaa

more on int64

8b81d43

mattdowle merged commit 793cec0 into master Jun 21, 2021

mattdowle deleted the fix-melt-narm-list-example branch June 21, 2021 20:58

mattdowle added a commit that referenced this pull request Jun 22, 2021

#5044 follow-up: when bit64 is not available test_bit64 is FALSE but …

3a5de96

…the NULL is not removed. Thanks to GLCI which has two instances running without bit64 to catch this

jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix melt(na.rm=TRUE) with list columns #5044

fix melt(na.rm=TRUE) with list columns #5044

tdhock commented Jun 12, 2021 •

edited

Loading

tdhock commented Jun 14, 2021

codecov bot commented Jun 14, 2021 •

edited

Loading

tdhock commented Jun 14, 2021 •

edited

Loading

tdhock commented Jun 14, 2021

mattdowle commented Jun 14, 2021

tdhock commented Jun 15, 2021

mattdowle commented Jun 16, 2021 •

edited

Loading

tdhock commented Jun 16, 2021

MichaelChirico commented Jun 16, 2021 via email •

edited by mattdowle

Loading

tdhock commented Jun 17, 2021

mattdowle commented Jun 22, 2021

mattdowle commented Jun 22, 2021

fix melt(na.rm=TRUE) with list columns #5044

fix melt(na.rm=TRUE) with list columns #5044

Conversation

tdhock commented Jun 12, 2021 • edited Loading

tdhock commented Jun 14, 2021

codecov bot commented Jun 14, 2021 • edited Loading

Codecov Report

tdhock commented Jun 14, 2021 • edited Loading

tdhock commented Jun 14, 2021

mattdowle commented Jun 14, 2021

tdhock commented Jun 15, 2021

mattdowle commented Jun 16, 2021 • edited Loading

tdhock commented Jun 16, 2021

MichaelChirico commented Jun 16, 2021 via email • edited by mattdowle Loading

tdhock commented Jun 17, 2021

mattdowle commented Jun 22, 2021

mattdowle commented Jun 22, 2021

tdhock commented Jun 12, 2021 •

edited

Loading

codecov bot commented Jun 14, 2021 •

edited

Loading

tdhock commented Jun 14, 2021 •

edited

Loading

mattdowle commented Jun 16, 2021 •

edited

Loading

MichaelChirico commented Jun 16, 2021 via email •

edited by mattdowle

Loading