List column support in data.table #4290

hope-data-science · 2020-03-09T12:55:00Z

Recently, I tried to get rid of tidyverse in every aspect and see if data.table could do the same more efficiently. I know that now data.table supports functions like nest and unnest in tidyr. However, could I find an example for all the data.table way to run examples in tidyr::nest and tidyr::chop? Any hints? Thanks.

The text was updated successfully, but these errors were encountered:

shrektan · 2020-03-09T13:36:36Z

library(tidyr)
library(data.table)

df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
df %>% nest(data = c(y, z))
#> # A tibble: 3 x 2
#>       x data            
#>   <dbl> <list>          
#> 1     1 <tibble [3 × 2]>
#> 2     2 <tibble [2 × 2]>
#> 3     3 <tibble [1 × 2]>
df %>% chop(c(y, z))
#> # A tibble: 3 x 3
#>       x y         z        
#>   <dbl> <list>    <list>   
#> 1     1 <int [3]> <int [3]>
#> 2     2 <int [2]> <int [2]>
#> 3     3 <int [1]> <int [1]>


dt <- as.data.table(df)
dt[, .(data = list(.SD)), keyby = x]
#>        x              data
#>    <num>            <list>
#> 1:     1 <data.table[3x2]>
#> 2:     2 <data.table[2x2]>
#> 3:     3 <data.table[1x2]>
dt[, lapply(.SD, list), keyby = x, .SDcols = c('y', 'z')]
#>        x      y      z
#>    <num> <list> <list>
#> 1:     1  1,2,3  6,5,4
#> 2:     2    4,5    3,2
#> 3:     3      6      1

^{Created on 2020-03-09 by the reprex package (v0.3.0)}

MichaelChirico · 2020-03-09T13:44:43Z

You might also check out @TysonStanley 's tidyfast package and his recent talk about list columns in data.table:

https://resources.rstudio.com/rstudio-conf-2020/list-columns-in-data-table-tyson-s-barrett

hope-data-science · 2020-03-09T13:47:48Z

Thank you for the prompt feedback. I tried some of these, but meet some trouble.
By example, I mean every cases in the "Examples" in tidyr::nest and tidyr::chop. I find that data.table treat integer and double quite differently, which making the unnest in data.table fails. And I don't know how to write consistent codes to unnest or nest multiple columns, for list of data.tables as well as any vectors or even lists. It is just complicated. Let me give an example:

library(tidyr)

df <- tibble(
  x = 1:3,
  y = list(
    NULL,
    tibble(a = 1, b = 2),
    tibble(a = 1:3, b = 3:1)
  )
)
df %>% unnest(y)
df %>% unnest(y, keep_empty = TRUE)

####################################

library(data.table)
dt <- data.table(
  x = 1:3,
  y = list(
    NULL,
    data.table(a = 1, b = 2),
    data.table(a = 1:3, b = 3:1)
  )
)

dt[,unlist(y),by = x]

# Error in `[.data.table`(dt, , unlist(y), by = x) : 
#   Column 1 of result for group 3 is type 'integer' but expecting type 'double'. Column types must be consistent for each group.

MichaelChirico · 2020-03-09T13:51:46Z

For that, you'll have to wait for this:

#4156

In turn I've been waiting for the work on rbindlist to get merged first to see if I can't leverage some of that code to make cases like this easier rather than writing some more custom code.

For now, you'll have to know your input types -- dt[ , as.numeric(unlist(y)), by = x]

hope-data-science · 2020-03-09T13:51:48Z

You might also check out @TysonStanley 's tidyfast package and his recent talk about list columns in data.table:

https://resources.rstudio.com/rstudio-conf-2020/list-columns-in-data-table-tyson-s-barrett

Thanks, I know his work, there is an article too. Check https://osf.io/f6pxw/download.
Actually I'm trying to write some myself, but it seems there are many problems ahead, which make me very hard to do the same as in tidyr. If data.table could not beat in speed for similar tasks, I might have to use tidyr, which seems to work on data.table too. But I prefer the whole workflow in data.table if possible.

shrektan · 2020-03-09T14:00:57Z

A verbose but work solution should be :

library(data.table)
dt <- data.table(
  x = 11:13,
  y = list(
    NULL,
    data.table(a = 1, b = 2),
    data.table(a = 1:3, b = 3:1)
  )
)
dt[, ID := seq_len(.N)]
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID", nomatch = 0L][, ID := NULL][]
#>        a     b     x
#>    <num> <num> <int>
#> 1:     1     2    12
#> 2:     1     3    13
#> 3:     2     2    13
#> 4:     3     1    13
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID"][, ID := NULL][]
#>        a     b     x
#>    <num> <num> <int>
#> 1:    NA    NA    11
#> 2:     1     2    12
#> 3:     1     3    13
#> 4:     2     2    13
#> 5:     3     1    13

^{Created on 2020-03-09 by the reprex package (v0.3.0)}

hope-data-science · 2020-03-09T14:32:01Z

Still, there's no uniform way to handle it. Some examples:

dt <- data.table(
  x = c(2,3,1),
  y = list(
    1:3,
    4:5,
    7:9
  )
)

# not working
dt[, ID := seq_len(.N)]
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID", nomatch = 0L][, ID := NULL][]
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID"][, ID := NULL][]

# still works
tidyr::unnest(dt,y)

I am considering what could be in each cell of the table. A data.table, a vector, a list. How could we handle all of them in a consistent way (which tidyr seems to tackle already).

shrektan · 2020-03-09T15:02:34Z

Just add a as.data.table() in advance.

library(data.table)
dt <- data.table(
  x = c(2,3,1),
  y = list(
    1:3,
    4:5,
    7:9
  )
)

# tidyr
tidyr::unnest(dt,y)
#> # A tibble: 8 x 2
#>       x     y
#>   <dbl> <int>
#> 1     2     1
#> 2     2     2
#> 3     2     3
#> 4     3     4
#> 5     3     5
#> 6     1     7
#> 7     1     8
#> 8     1     9

# data.table
dt[, ID := seq_len(.N)]
rbindlist(lapply(dt$y, as.data.table), idcol = "ID")[dt[, .(x, ID)], on = "ID", nomatch = 0L][, ID := NULL][]
#>    V1 x
#> 1:  1 2
#> 2:  2 2
#> 3:  3 2
#> 4:  4 3
#> 5:  5 3
#> 6:  7 1
#> 7:  8 1
#> 8:  9 1
rbindlist(lapply(dt$y, as.data.table), idcol = "ID")[dt[, .(x, ID)], on = "ID"][, ID := NULL][]
#>    V1 x
#> 1:  1 2
#> 2:  2 2
#> 3:  3 2
#> 4:  4 3
#> 5:  5 3
#> 6:  7 1
#> 7:  8 1
#> 8:  9 1

^{Created on 2020-03-09 by the reprex package (v0.3.0)}

hope-data-science · 2020-03-09T15:10:07Z

Nice, turn everything into data.table and then use rbindlist. I'll try with more examples, this is great!

hope-data-science · 2020-03-09T15:24:30Z

Any hints for the tidyr::unchop in the data.table?

shrektan · 2020-03-09T15:29:05Z

I need an example. I don't know the differences between tidyr::unchop() and tidyr::unnest().

hope-data-science · 2020-03-09T15:36:42Z

Example:

library(tidyr)
df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
df %>% chop(c(y, z))
#> # A tibble: 3 x 3
#>       x y         z        
#>   <dbl> <list>    <list>   
#> 1     1 <int [3]> <int [3]>
#> 2     2 <int [2]> <int [2]>
#> 3     3 <int [1]> <int [1]>
df %>% unchop(c(y,z))
#> # A tibble: 6 x 3
#>       x     y     z
#>   <dbl> <int> <int>
#> 1     1     1     6
#> 2     1     2     5
#> 3     1     3     4
#> 4     2     4     3
#> 5     2     5     2
#> 6     3     6     1
df %>% unnest(c(y,z))
#> # A tibble: 6 x 3
#>       x     y     z
#>   <dbl> <int> <int>
#> 1     1     1     6
#> 2     1     2     5
#> 3     1     3     4
#> 4     2     4     3
#> 5     2     5     2
#> 6     3     6     1

While this is no difference in this example, I don't know how to make it back in data.table.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

List column support in data.table #4290

List column support in data.table #4290

hope-data-science commented Mar 9, 2020

shrektan commented Mar 9, 2020

MichaelChirico commented Mar 9, 2020

hope-data-science commented Mar 9, 2020

MichaelChirico commented Mar 9, 2020

hope-data-science commented Mar 9, 2020

This comment has been minimized.

This comment has been minimized.

shrektan commented Mar 9, 2020 •

edited

Loading

This comment has been minimized.

This comment has been minimized.

hope-data-science commented Mar 9, 2020

shrektan commented Mar 9, 2020 •

edited

Loading

hope-data-science commented Mar 9, 2020

hope-data-science commented Mar 9, 2020

shrektan commented Mar 9, 2020 •

edited

Loading

hope-data-science commented Mar 9, 2020

List column support in data.table #4290

List column support in data.table #4290

Comments

hope-data-science commented Mar 9, 2020

shrektan commented Mar 9, 2020

MichaelChirico commented Mar 9, 2020

hope-data-science commented Mar 9, 2020

MichaelChirico commented Mar 9, 2020

hope-data-science commented Mar 9, 2020

This comment has been minimized.

This comment has been minimized.

shrektan commented Mar 9, 2020 • edited Loading

This comment has been minimized.

This comment has been minimized.

hope-data-science commented Mar 9, 2020

shrektan commented Mar 9, 2020 • edited Loading

hope-data-science commented Mar 9, 2020

hope-data-science commented Mar 9, 2020

shrektan commented Mar 9, 2020 • edited Loading

hope-data-science commented Mar 9, 2020

shrektan commented Mar 9, 2020 •

edited

Loading

shrektan commented Mar 9, 2020 •

edited

Loading

shrektan commented Mar 9, 2020 •

edited

Loading