Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow a single column to be used as rownames in as.matrix #2702

Merged
merged 22 commits into from
Apr 7, 2018
Merged

Allow a single column to be used as rownames in as.matrix #2702

merged 22 commits into from
Apr 7, 2018

Conversation

sritchie73
Copy link
Contributor

Implements #2692

Added a rownames argument to as.matrix.data.table and as.data.frame.data.table with the following behaviour:

  • rownames = TRUE takes key(x) as the rownames of the new matrix / data.frame if it is a single column, or the first column if !haskey(x) or length(key(x)) > 1.
  • rownames = "column" takes the named column as the rownames of the new matrix / data.frame.
  • rownames = 3 looks up the column by index and uses that column as the rownames of the new matrix / data.frame.

Use cases include:

  • Converting a long data.table to a matrix via dcast()
  • Casting to a data.frame for other package's functions that expect data.frame arguments to have rownames for subsetting / row matching in their internal workings.

I've added documentation pages for as.matrix.data.table and as.data.frame.data.table, describing the use and behaviour of the rownames() argument, and highlight examples where it might be useful. The as.data.frame.data.table documentation also highlights the setDF function for cases where rownames nor a copy of the data.table are not required.

I'd appreciate feedback on the amount of error checking implemented (I tend to be overzealous here), and based on that additional unit tests can be implemented.

@MichaelChirico
Copy link
Member

MichaelChirico commented Mar 23, 2018

I'd say no need for as.data.frame.data.table to be extended -- setDF is preferred and has a rownames argument already, unless I'm missing something.

There is #1719...

@sritchie73
Copy link
Contributor Author

I did see that argument in setDF. Does it allow you to use a column in dt as the rownames as suggested in #1719 ? Its documentation still suggests you must manually provide the vector yourself.

My understanding was also that as.data.frame and setDF have two different use cases:

  • setDF for when you want to coerce a data.table to a data.frame by reference, and
  • as.data.frame when you want to make a copy that is a data.frame.

Otherwise why not just alias as.data.frame.data.table to setDF? i.e.:

as.data.frame.data.table <- function(x, ...) { setDF(x) }

@MichaelChirico
Copy link
Member

MichaelChirico commented Mar 23, 2018

In fact I'd be fine with setDF(copy(x)), which is essentially what's done currently:

data.table:::as.data.frame.data.table
function (x, ...) 
{
    ans = copy(x)
    setattr(ans, "row.names", .set_row_names(nrow(x)))
    setattr(ans, "class", "data.frame")
    setattr(ans, "sorted", NULL)
    setattr(ans, ".internal.selfref", NULL)
    ans
}


# vs the is.data.table(x) branch of setDF:
if (is.null(rownames)) {
  rn <- .set_row_names(nrow(x))
}
else {
  if (length(rownames) != nrow(x))stop("rownames incorrect length; expected ", nrow(x), " names, got ", length(rownames))
  rn <- rownames
}
setattr(x, "row.names", rn)
setattr(x, "class", "data.frame")
setattr(x, "sorted", NULL)
setattr(x, ".internal.selfref", NULL)

More that I'm advocating for dev time to improve setDF over as.data.frame.data.table.

Thanks for the PR btw!

@sritchie73
Copy link
Contributor Author

Ah I see, that makes sense.

The tricky part will be trying to remove the column and adding it as the rownames attributes all by reference. I don't see why it couldn't theoretically be done, but might require some C code to implement.

@sritchie73 sritchie73 changed the title Allow a single column to be used as rownames in as.matrix and as.data.frame Allow a single column to be used as rownames in as.matrix Mar 23, 2018
@sritchie73
Copy link
Contributor Author

I've rolled this back so that rownames is only implemented for as.matrix.

data.frames require more thought and work. I agree the development work should go into setDF, and then as.data.frame should just call setDF. Complicating this however, the as.data.frame generic has a row.names argument that has the same functionality as rownames in setDF currently (i.e. a character vector of rownames can be supplied). We'd need to think about how to handle this conflict in arguments.

@codecov-io
Copy link

codecov-io commented Mar 23, 2018

Codecov Report

Merging #2702 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2702      +/-   ##
==========================================
+ Coverage    93.4%   93.42%   +0.02%     
==========================================
  Files          61       61              
  Lines       12236    12276      +40     
==========================================
+ Hits        11429    11469      +40     
  Misses        807      807
Impacted Files Coverage Δ
R/data.table.R 97.85% <100%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1ab4baa...cde74a2. Read the comment docs.

@sritchie73
Copy link
Contributor Author

A couple of queries:

-rownames could also accept a vector to use as the rownames of the matrix - Should I add this?

Merge branch 'master' into as_matrix_rownames

# Conflicts:
#	NEWS.md
#	inst/tests/tests.Rraw
R/data.table.R Outdated
# E.g. because rownames is some sort of object that cant be converted to a column index
stop("rownames must be TRUE, a column index, or a column name in x")
} else {
if (is.logical(rownames) && isTRUE(rownames)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isTRUE(rownames) is sufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I've changed this statement to identical(rownames, TRUE) which I think is clearer (and is used elsewhere in data.table.R).

R/data.table.R Outdated
rn <- x[[rnc]]
dm <- dim(x) - c(0, 1)
cn <- names(x)[-rnc]
X <- x[, -rnc, with = FALSE]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think x[, .SD, .SDcols = c(cn)] or x[, (rn) := NULL] could work -- not 100% on what the variables are, or the best way to do this within data.table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HughParsonage - I like the .SDcols approach. I had also thought about using the new .. syntax, but then that raises the problem with R CMD check --as-cran complaining that ..rnc is a global variable with no visible binding.

rn <- x[[rnc]]
dm <- dim(x) - c(0, 1)
cn <- names(x)[-rnc]
X <- x[, .SD, .SDcols = cn]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x <- x[, -..rnc] now works here in dev instead of these 2 lines. A copy is needed in this case at this point, iiuc, otherwise, x[, (rnc):=NULL] to remove that column by reference, currently. And maybe x[, ..rnc := NULL] in future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, yes I discovered -.. was implemented as I was working on this. Ultimately I decided not to use -..rnc as you then have to define a dummy ..rnc at the top of the function to avoid R CMD check --as-cran complaining ..rnc is a global variable with no visible binding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I see. That is a shame about the no visible binding note.
In general we've tried to use idiomatic data.table internally, and then deal with the no-visible-binding note explicitly by adding a dummy NULL as you say. This way, when folk look at the internals they see how we use data.table ourselves. If and when there is ever a solution for the CRAN note, we can fix it one distinct place by taking away the NULL definitions.

dm <- dim(x)
cn <- names(x)
as.matrix.data.table <- function(x, rownames, ...) {
rn <- NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure we have a style guide on this, but I note that the corresponding CRAN cheat for [.data.table symbols are defined in the package environment rather than the function body:

https://github.com/Rdatatable/data.table/blob/master/R/data.table.R#L11

  • side is cutting that tiny bit of overhead for use cases that might repeatedly call this method; - side is increasing potential for unintentional collisions (so if moving outside the body, perhaps use some more obscure name)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are defined there because they are exported. rn won't be used by user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap - rn is internal here, it will contain the vector of rownames to put in the matrix (after all the processing in if (!missing(rownames)) {}. rnc will contain the index of the column in x to be dropped.

R/data.table.R Outdated
stop("rownames must be a single column in x or a vector of row names of length nrow(x)")
} else if (is.na(rownames)) {
warning("rownames is NA, ignoring rownames")
} else if (identical(rownames, FALSE)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i find it a bit weird that rownames = TRUE is accepted, but rownames = FALSE is incorrect usage and results in this warning. Setting rownames dynamically by some condition evaluating to TRUE or FALSE seems like a natural use case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can drop the warning

@sritchie73
Copy link
Contributor Author

@MichaelChirico can you tell me how to pull your changes to my branch? They're not appearing for me locally so I'm not sure how GitHub works here.

@MichaelChirico
Copy link
Member

MichaelChirico commented Mar 28, 2018 via email

R/data.table.R Outdated
warning("rownames is TRUE but multiple keys found in key(x), using first column instead")
rownames <- 1
if (length(rownames) > 1L) {
warning(sprintf("rownames is TRUE but multiple keys [%s] found for x; defaulting to first key column [%s]",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelChirico not quite - I've defaulted to just using the first column (rather than the first key column) if there are multiple keys. We can certainly take the first key column if you think its the right approach? (Or perhaps the last key since that is how the rows will be ordered?)

@sritchie73
Copy link
Contributor Author

Thanks @MichaelChirico it turned out git was lying to me that my branch was up to date with origin.

I've tried to add comments to your changes but I don't quite understand the GitHub review feature so i'll copy in the comments here:

  • I will drop the warning where rownames = FALSE
  • Your change to the warning message where multiple keys are detected is not quite correct - I've defaulted to use x[,1] as the rownames in that case rather than x[,key(x)[1]]. I can certainly change this – but would think maybe it could be useful to take the last key (since the rows of x will be ordered by that column)?

@MichaelChirico
Copy link
Member

@sritchie73 x[ , 1] sounds fine. just a problem with my editing in-browser.

@sritchie73
Copy link
Contributor Author

@MichaelChirico i've updated the code and tests so that rownames = FALSE no longer generates a warning. I've also removed the similar warnings generated when rownames = NULL or rownames = NA.

@MichaelChirico
Copy link
Member

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants