Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

List column support in data.table #4290

Closed
hope-data-science opened this issue Mar 9, 2020 · 16 comments
Closed

List column support in data.table #4290

hope-data-science opened this issue Mar 9, 2020 · 16 comments
Labels

Comments

@hope-data-science
Copy link

Recently, I tried to get rid of tidyverse in every aspect and see if data.table could do the same more efficiently. I know that now data.table supports functions like nest and unnest in tidyr. However, could I find an example for all the data.table way to run examples in tidyr::nest and tidyr::chop? Any hints? Thanks.

@shrektan
Copy link
Member

shrektan commented Mar 9, 2020

library(tidyr)
library(data.table)

df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
df %>% nest(data = c(y, z))
#> # A tibble: 3 x 2
#>       x data            
#>   <dbl> <list>          
#> 1     1 <tibble [3 × 2]>
#> 2     2 <tibble [2 × 2]>
#> 3     3 <tibble [1 × 2]>
df %>% chop(c(y, z))
#> # A tibble: 3 x 3
#>       x y         z        
#>   <dbl> <list>    <list>   
#> 1     1 <int [3]> <int [3]>
#> 2     2 <int [2]> <int [2]>
#> 3     3 <int [1]> <int [1]>


dt <- as.data.table(df)
dt[, .(data = list(.SD)), keyby = x]
#>        x              data
#>    <num>            <list>
#> 1:     1 <data.table[3x2]>
#> 2:     2 <data.table[2x2]>
#> 3:     3 <data.table[1x2]>
dt[, lapply(.SD, list), keyby = x, .SDcols = c('y', 'z')]
#>        x      y      z
#>    <num> <list> <list>
#> 1:     1  1,2,3  6,5,4
#> 2:     2    4,5    3,2
#> 3:     3      6      1

Created on 2020-03-09 by the reprex package (v0.3.0)

@MichaelChirico
Copy link
Member

You might also check out @TysonStanley 's tidyfast package and his recent talk about list columns in data.table:

https://resources.rstudio.com/rstudio-conf-2020/list-columns-in-data-table-tyson-s-barrett

@hope-data-science
Copy link
Author

Thank you for the prompt feedback. I tried some of these, but meet some trouble.
By example, I mean every cases in the "Examples" in tidyr::nest and tidyr::chop. I find that data.table treat integer and double quite differently, which making the unnest in data.table fails. And I don't know how to write consistent codes to unnest or nest multiple columns, for list of data.tables as well as any vectors or even lists. It is just complicated. Let me give an example:

library(tidyr)

df <- tibble(
  x = 1:3,
  y = list(
    NULL,
    tibble(a = 1, b = 2),
    tibble(a = 1:3, b = 3:1)
  )
)
df %>% unnest(y)
df %>% unnest(y, keep_empty = TRUE)

####################################

library(data.table)
dt <- data.table(
  x = 1:3,
  y = list(
    NULL,
    data.table(a = 1, b = 2),
    data.table(a = 1:3, b = 3:1)
  )
)

dt[,unlist(y),by = x]

# Error in `[.data.table`(dt, , unlist(y), by = x) : 
#   Column 1 of result for group 3 is type 'integer' but expecting type 'double'. Column types must be consistent for each group.

@MichaelChirico
Copy link
Member

For that, you'll have to wait for this:

#4156

In turn I've been waiting for the work on rbindlist to get merged first to see if I can't leverage some of that code to make cases like this easier rather than writing some more custom code.

For now, you'll have to know your input types -- dt[ , as.numeric(unlist(y)), by = x]

@hope-data-science
Copy link
Author

You might also check out @TysonStanley 's tidyfast package and his recent talk about list columns in data.table:

https://resources.rstudio.com/rstudio-conf-2020/list-columns-in-data-table-tyson-s-barrett

Thanks, I know his work, there is an article too. Check https://osf.io/f6pxw/download.
Actually I'm trying to write some myself, but it seems there are many problems ahead, which make me very hard to do the same as in tidyr. If data.table could not beat in speed for similar tasks, I might have to use tidyr, which seems to work on data.table too. But I prefer the whole workflow in data.table if possible.

@shrektan

This comment has been minimized.

@shrektan

This comment has been minimized.

@shrektan
Copy link
Member

shrektan commented Mar 9, 2020

A verbose but work solution should be :

library(data.table)
dt <- data.table(
  x = 11:13,
  y = list(
    NULL,
    data.table(a = 1, b = 2),
    data.table(a = 1:3, b = 3:1)
  )
)
dt[, ID := seq_len(.N)]
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID", nomatch = 0L][, ID := NULL][]
#>        a     b     x
#>    <num> <num> <int>
#> 1:     1     2    12
#> 2:     1     3    13
#> 3:     2     2    13
#> 4:     3     1    13
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID"][, ID := NULL][]
#>        a     b     x
#>    <num> <num> <int>
#> 1:    NA    NA    11
#> 2:     1     2    12
#> 3:     1     3    13
#> 4:     2     2    13
#> 5:     3     1    13

Created on 2020-03-09 by the reprex package (v0.3.0)

@hope-data-science

This comment has been minimized.

@shrektan

This comment has been minimized.

@hope-data-science
Copy link
Author

Still, there's no uniform way to handle it. Some examples:

dt <- data.table(
  x = c(2,3,1),
  y = list(
    1:3,
    4:5,
    7:9
  )
)

# not working
dt[, ID := seq_len(.N)]
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID", nomatch = 0L][, ID := NULL][]
rbindlist(dt$y, idcol = "ID")[dt[, .(x, ID)], on = "ID"][, ID := NULL][]

# still works
tidyr::unnest(dt,y)

I am considering what could be in each cell of the table. A data.table, a vector, a list. How could we handle all of them in a consistent way (which tidyr seems to tackle already).

@shrektan
Copy link
Member

shrektan commented Mar 9, 2020

Just add a as.data.table() in advance.

library(data.table)
dt <- data.table(
  x = c(2,3,1),
  y = list(
    1:3,
    4:5,
    7:9
  )
)

# tidyr
tidyr::unnest(dt,y)
#> # A tibble: 8 x 2
#>       x     y
#>   <dbl> <int>
#> 1     2     1
#> 2     2     2
#> 3     2     3
#> 4     3     4
#> 5     3     5
#> 6     1     7
#> 7     1     8
#> 8     1     9

# data.table
dt[, ID := seq_len(.N)]
rbindlist(lapply(dt$y, as.data.table), idcol = "ID")[dt[, .(x, ID)], on = "ID", nomatch = 0L][, ID := NULL][]
#>    V1 x
#> 1:  1 2
#> 2:  2 2
#> 3:  3 2
#> 4:  4 3
#> 5:  5 3
#> 6:  7 1
#> 7:  8 1
#> 8:  9 1
rbindlist(lapply(dt$y, as.data.table), idcol = "ID")[dt[, .(x, ID)], on = "ID"][, ID := NULL][]
#>    V1 x
#> 1:  1 2
#> 2:  2 2
#> 3:  3 2
#> 4:  4 3
#> 5:  5 3
#> 6:  7 1
#> 7:  8 1
#> 8:  9 1

Created on 2020-03-09 by the reprex package (v0.3.0)

@hope-data-science
Copy link
Author

Nice, turn everything into data.table and then use rbindlist. I'll try with more examples, this is great!

@hope-data-science
Copy link
Author

Any hints for the tidyr::unchop in the data.table?

@shrektan
Copy link
Member

shrektan commented Mar 9, 2020

I need an example. I don't know the differences between tidyr::unchop() and tidyr::unnest().

@hope-data-science
Copy link
Author

Example:

library(tidyr)
df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
df %>% chop(c(y, z))
#> # A tibble: 3 x 3
#>       x y         z        
#>   <dbl> <list>    <list>   
#> 1     1 <int [3]> <int [3]>
#> 2     2 <int [2]> <int [2]>
#> 3     3 <int [1]> <int [1]>
df %>% unchop(c(y,z))
#> # A tibble: 6 x 3
#>       x     y     z
#>   <dbl> <int> <int>
#> 1     1     1     6
#> 2     1     2     5
#> 3     1     3     4
#> 4     2     4     3
#> 5     2     5     2
#> 6     3     6     1
df %>% unnest(c(y,z))
#> # A tibble: 6 x 3
#>       x     y     z
#>   <dbl> <int> <int>
#> 1     1     1     6
#> 2     1     2     5
#> 3     1     3     4
#> 4     2     4     3
#> 5     2     5     2
#> 6     3     6     1

While this is no difference in this example, I don't know how to make it back in data.table.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants