FR: auto-infer LHS of := when absent #1543

Open
MichaelChirico opened this issue Feb 23, 2016 · 13 comments
Labels
top request One of our most-requested issues

Comments

@MichaelChirico
Member

It's something of a nuisance to assign ragged columns with tstrsplit, if we don't necessarily know how many columns will result ex ante. Is it possible to add an assign argument which would use := or set to add the result as extra columns to the table?

DT <- data.table(x = c("a/b/c/d", "a/b/c", "a/b"),
                 y = 1:3, z = 4:6)
DT[ , tstrsplit(x, "/")]
#    V1 V2 V3 V4 #`x`, `y`, and `z` are irrecoverable
#1:  a  b  c  d
#2:  a  b  c NA
#3:  a  b NA NA

Workaround (ugly)

DT[ , paste0("V", 1:max(sapply(spl <- strsplit(x, "/"), length))) := transpose(spl)][]
#          x y z V1 V2 V3 V4
#1: a/b/c/d 1 4  a  b  c  d
#2:   a/b/c 2 5  a  b  c NA
#3:     a/b 3 6  a  b NA NA
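
A less dense version of the same workaround, as a minimal sketch: split once outside `[`, so the number of resulting columns is known, then assign with generated names. The names `spl` and `new_cols` are only illustrative, and the `V1`, `V2`, ... naming simply mirrors what tstrsplit prints above.

```r
library(data.table)

DT <- data.table(x = c("a/b/c/d", "a/b/c", "a/b"),
                 y = 1:3, z = 4:6)

# Split once, outside `[`, so the number of resulting columns is known
spl <- tstrsplit(DT$x, "/", fixed = TRUE)

# Build one name per resulting column, then assign by reference
new_cols <- paste0("V", seq_along(spl))
DT[ , (new_cols) := spl]
```

This still takes two steps and a temporary variable, which is the nuisance the request is about.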

Inspired by this SO question

@mrdwab

mrdwab commented Feb 24, 2016

Seconding @MichaelChirico here. But I imagine that it's somewhat cumbersome to pull off since tstrsplit works both within and outside of a data.table.

In reworking some of the functions for "splitstackshape", I've written the following wrapper functions (both of which will probably use copy in the end instead of modifying the original data.table):

## vectorized equivalent of `listCol_w`
flatten <- function(indt, cols, drop = FALSE) {
  require(data.table)
  if (!is.data.table(indt)) indt <- as.data.table(indt)
  x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
  nams <- paste(rep(cols, x), sequence(x), sep = "_")
  indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE), .SDcols = (cols)]
  if (isTRUE(drop)) indt[, (cols) := NULL]
  indt[]
}

and

## vectorized equivalent of `listCol_l`
flattenLong <- function(indt, cols) {
  ob <- setdiff(names(indt), cols)
  x <- flatten(indt, cols, TRUE)
  mv <- lapply(cols, function(y) grep(sprintf("^%s_", y), names(x)))
  setorderv(melt(x, measure.vars = mv, value.name = cols), ob)[]
}

With @MichaelChirico's example, the approach would be:

flatten(copy(DT)[, x := strsplit(x, "/", TRUE)], "x", drop = TRUE)
#    y z x_1 x_2 x_3 x_4
# 1: 1 4   a   b   c   d
# 2: 2 5   a   b   c  NA
# 3: 3 6   a   b  NA  NA

(Or, of course, just use cSplit.)

@MichaelChirico
Member Author

@mrdwab thanks for the feedback! Indeed I had thought about how to implement this a bit but was stymied -- the only other function I know of that can assign things is :=, and that doesn't work outside j. But I do occasionally use tstrsplit on a non-data.table object -- it's a nice feature.

Of course if this is possible this is the way to go:

if (assign){
  # behave like `:=`
}else{
  # behave like now
}

If not, an alternative would be to add a keep.n argument to transpose (I see in the C code this is just a matter of returning maxlen), which works like:

if (keep.n){
  # return length-2 list -- 
  # [[1]]: "list" = as before
  # [[2]]: "v.name" = paste0("V", 1:maxlen)
}else{
  # as before
}

And used like

DT[ , (x <- tstrsplit(x, "/", keep.n = TRUE))$v.name := x$list]

@DavidArenburg
Member

DavidArenburg commented Mar 8, 2016

I think this should be a general feature of `:=`(). If no names were provided, then it should automatically generate them, e.g. for the data provided by @MichaelChirico, this should look like

DT[, `:=`(tstrsplit(x, "/", fixed = TRUE))]
DT
#          x y z x_1 x_2 x_3 x_4
# 1: a/b/c/d 1 4   a   b   c   d
# 2:   a/b/c 2 5   a   b   c  NA
# 3:     a/b 3 6   a   b  NA  NA

This could be a truly awesome feature that will make our lives much easier when the number of columns is unknown. It will also make the code much more concise.

This could be generalized to other data.table (or any) functions too, for instance

DT[, `:=`(shift(y, 1:.N))]
DT
#          x y z x_1 x_2 x_3 x_4 y_1 y_2 y_3
# 1: a/b/c/d 1 4   a   b   c   d  NA  NA  NA
# 2:   a/b/c 2 5   a   b   c  NA   1  NA  NA
# 3:     a/b 3 6   a   b  NA  NA   2   1  NA

The only caveat I see here is that it could overwrite existing columns. In that case it should either return an error or generate slightly different column names; that part needs to be figured out.
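
To make the naming and collision question concrete, here is a rough sketch of the "return an error" option written as an ordinary function on top of set(). `assign_auto` and its `prefix` argument are hypothetical names, not part of data.table, and the `<prefix>_<i>` scheme is just the one used in the examples above.

```r
library(data.table)

# Hypothetical helper (not part of data.table): add the elements of `value`
# to `DT` by reference, naming them <prefix>_1, <prefix>_2, ...
assign_auto <- function(DT, value, prefix) {
  stopifnot(is.data.table(DT), is.list(value))
  new_cols <- paste(prefix, seq_along(value), sep = "_")
  # Refuse to overwrite columns that already exist
  clash <- intersect(new_cols, names(DT))
  if (length(clash)) {
    stop("would overwrite existing column(s): ", paste(clash, collapse = ", "))
  }
  set(DT, j = new_cols, value = value)
  invisible(DT)
}

DT <- data.table(x = c("a/b/c/d", "a/b/c", "a/b"), y = 1:3, z = 4:6)
assign_auto(DT, tstrsplit(DT$x, "/", fixed = TRUE), prefix = "x")
```

Calling it a second time with the same prefix stops with an error instead of silently replacing `x_1` to `x_4`. Generating alternative names instead would be the other option mentioned above.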

@MichaelChirico
Member Author

@DavidArenburg good point. do you happen to know if there's an outstanding FR for that? seems like an obvious one, after all... if not, I'll change the title of this one.

MichaelChirico changed the title from "FR: assign argument for tstrsplit" to "FR: auto-infer LHS of := when absent" on Mar 8, 2016
@franknarf1
Contributor

Maybe update this SO question if this is implemented.

@franknarf1
Contributor

franknarf1 commented Aug 15, 2018

Another SO post, asking to do it by group (so the workaround "answer" for the last one I linked doesn't work): https://stackoverflow.com/q/51861038

# reproducible example (since the link doesn't have one)
library(data.table)
dt <- data.table(x = 1 : 4, id = c(1,1,2,2))
ff = function(x) list(a = x + 1, b = 2)

# desired syntax    
dt[, `:=`(res <- names(ff(x)), res), by=id]
# or 
dt[, `:=`(ff(x)), by=id]

I know there's a warning about the inefficiency of returning named lists in j for grouped operations, but maybe there's some good way to handle their use-case.
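
For reference, one possible workaround with the current syntax is to probe the function once to learn its output names and then use them on the LHS. This is only a sketch: `nms` is an illustrative name, and it assumes the names returned by `ff` do not depend on the input.

```r
library(data.table)

dt <- data.table(x = 1:4, id = c(1, 1, 2, 2))
ff <- function(x) list(a = x + 1, b = 2)

# Probe the function on one row to learn the output names
# (assumes the names do not depend on the input)
nms <- names(ff(dt$x[1L]))

# Assign by group using the probed names
dt[ , (nms) := ff(x), by = id]
```

The extra call to `ff` is wasted work and the names have to be known before the grouped call runs, which is what the desired syntax would avoid.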

@MichaelChirico
Member Author

MichaelChirico commented Apr 15, 2019

Following up on @DavidArenburg's suggested approach, the more natural R way to accomplish this would, I think, be through do.call, but do.call(`:=`, list_of_column_assignments) won't work with the current API for :=, which is NSE-based. A way around this would be to define a proxy function like list_set or list_assign which does the same thing as := as a function, including auto-naming un-named components: do.call(list_set, tstrsplit(some_string, some_sep)).

This is a bit redundant -- there's a reason we can't use := as a function:

print(data.table:::`:=`)
function(...) stop('Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").')

And list_set is redundant to that...

@myoung3
Contributor

myoung3 commented Apr 2, 2021

Another SO post here:
https://stackoverflow.com/questions/66917673/set-multiple-columns-in-r-data-table-with-a-named-list-and

@MichaelChirico Is it really not possible to hack in a way for do.call(":=", list()) to work? Using the base R syntax is a more appealing solution to me than adding another function or changing the default behavior of :=

@myoung3
Contributor

myoung3 commented Apr 2, 2021

Since j already gets substituted, we could just add a switch for when do.call(":=", ...) or do.call(`:=`, ...) is detected.

@MichaelChirico
Member Author

I think it could be doable but I'm not a big fan... (1) added NSE maintenance overhead and (2) I usually avoid do.call like the plague myself. I would lean towards a more elegant solution if possible...

@myoung3
Contributor

myoung3 commented Apr 6, 2021

Yeah as someone familiarizing myself with the [.data.table code I'll agree on that point--the amount of NSE is pretty overwhelming already.

@shapenaji

shapenaji commented Apr 6, 2021

Throwing in my 2c here, could the do.call and other := tricks be avoided if there was a new keyword?

dt[, .NEW := tstrsplit(col, ...)]

(or something better named than .NEW)
This would assign using the names of the output without modifying the normal flow.

@jangorecki
Member
