Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature request] Allow setnames() to skip column names that don't exist #3030

Closed
M-YD opened this issue Sep 4, 2018 · 13 comments
Closed

[feature request] Allow setnames() to skip column names that don't exist #3030

M-YD opened this issue Sep 4, 2018 · 13 comments
Milestone

Comments

@M-YD
Copy link
Contributor

M-YD commented Sep 4, 2018

There have been a number of similar posts to this, but nothing that seems to directly address the issue that I am proposing and which I think would be a very useful update.

The issue I am having is that I have implemented setnames() as part of a large and complicated function (500 lines) that I have built and to be presented with an error after having spent quite some time waiting for it to execute is unhelpful and also annoying because it means that the time I had spent waiting for it to complete was for nothing.

What would have been really helpful in this scenario is an if/else-type condition when checking old columns and skipping them if they didn't exist.

Something along the lines of:

if(!(old %in% colnames)){ # Checks if old columns names do not exist in existing column names
    next # Skip to the next item in the list if TRUE
}

The reason I would check it with a ! (NOT) is because the very nature of setnames() checks to see if the names do exist; if so, it continues as normal because that is literally what it is designed for. Therefore, checking for existing names within setnames() is futile because that is what setnames() is doing in the first instance.

The trouble with the current setup of setnames() is that it works on the assumption that I know precisely which columns exist at any given point.

I don't.

I have a good idea, but it isn't always correct because there are literally dozens of potential scenarios that could occur based on numerous variations of how the data I am extracting are gathered.

Approximately 90.00%-95.00% of the time, the columns that I am working with are typically the same ones, but there are those odd occasions (the other 5.00%-10.00%) where this is not the case and all of a sudden, something breaks unexpectedly and I am left trying to figure out a new way to hack another version of this function together to handle this single use case just one time, which is neither a good thing to have to do nor good practice.

All of this could be avoided with a simple check inside setnames() to see if the columns exist or not.

Referring to columns by number might work in some cases, but I think it's neither here nor there because different people will need to refer to columns in different ways; names work best in some scenarios and numbers in others.

In my case, names work best because the columns aren't always in the same order and as such using column numbers could break my data frame and it's possible that I wouldn't know until the end.

Also, in my case if the column doesn't exist then it isn't a problem for me and I am happy to proceed onto the next name in the list. I'm sure that some people would like a warning notification if this happens, and I agree that it is a helpful feature to have, which is why I'm not advocating for it to be removed - not at all. Perhaps a more condensed version of the warning can appear on the fly, or even a summary at the end would be good to have.

An if/else-type condition like this would most certainly make things easier for me, at least, and I'm sure several others would agree.


Update:

@HughParsonage suggested an additional argument and this is precisely what I was thinking of.

Something along the lines of:

setnames(DT, old, new, skip_absent = FALSE)

It would include an additional argument (skip_absent) which is set to FALSE by default and which won't affect that the way that setnames() functions for anybody unless the skip_absent flag is specifically set to TRUE at the time of calling.

@jangorecki
Copy link
Member

jangorecki commented Sep 4, 2018

It would be much easier to follow your request if you would include minimal reproducible example and expected output.
also...

in my case if the column doesn't exist then it isn't a problem for me and I am happy to proceed onto the next name in the list.

I think your case is not very common one.

@M-YD
Copy link
Contributor Author

M-YD commented Sep 4, 2018

@jangorecki I can't give a reproducible example because it isn't fitting in this case; the function I have built gathers data from several external backend systems and then several other operations are carried out from there and the parameters used in the function/queries can affect the outcome in several ways, most of which can't be predicted accurately.

The problem I have is how setnames() handles columns that don't exist; I know for certain that I am not the only person to experience this due to the other issues and questions that have been posted online over the last few years.

In terms of an expected outcome, what I want isn't an outcome, per se; my issue revolves more around the behaviour/execution of setnames() when dealing with column names that don't exist.

In that case, my expected outcome is for setnames() to ignore names that don't exist and to proceed onto the next.

In terms of how common my case is, I think it's more common than you realise because people have been posting issues about very similar things for a number of years, the flavour of which is typically focused on wanting setnames() to skip columns that don't exist rather than throwing an error.

@M-YD M-YD changed the title Allow setnames() to skip column names that don't exist and move onto the next. Allow setnames() to skip column names that don't exist and move onto the next Sep 4, 2018
@franknarf1
Copy link
Contributor

If your workflow is complicated enough that its behavior cannot be predicted, you can afford a one-liner to define your own setnames, I guess:

setfoundnames = function(x, old, new){ 
  w <- match(old, names(x), nomatch=0); setnames(x, old[w], new[w])
}

This is a reproducible example (illustrating what I assume you're after):

library(data.table)
DT = data.table(a = 1, b = 2, d = 3)
setfoundnames(DT, c("a", "A"), c("z", "Z"))

It is fitting to make an example to be clear about the change in behavior that you're requesting. Within your wall of text, it may be a chore to parse out which parts are describing the request and which are arguing for it.

@HughParsonage
Copy link
Member

Probably an extra argument to setnames could be sufficient.

setnames(DT, old, new, halt_if_absent = TRUE)
library(data.table)
DT = data.table(a = 1, b = 2, d = 3)
setnames(DT, "A", "B") 
#> Error ...
setnames(DT, "A", "B", halt_if_not_found = FALSE)
#> (Returns DT)

Note that the current behaviour of setnames has saved my bacon a few times. And it has probably made debugging much easier than I appreciate. I've reviewed code that used plyr::rename on non-existent names. The warnings were simply ignored; but when the names were corrected, the conclusion was the opposite of that originally calculated.

I appreciate that it sucks when a long-running script is halted by a seemingly inconsequential error, but it's hard to see that skipping in the absence of a direction to change names is a useful default.

In a similar vein, one could imagine someone writing

DT[, x := y / 1000]

when y does not exist in DT. In the philosophy of 'no errors', we could just leave DT unchanged. It's not difficult to construct an example where this would be disastrous.

@M-YD
Copy link
Contributor Author

M-YD commented Sep 6, 2018

@HughParsonage I concur; this is precisely the sort of functionality I had in mind.

setnames(DT, old, new, skip_absent = FALSE)

An additional argument (skip_absent) which is set to FALSE by default would be ideal here because it would mean that the way that setnames() functions won't change for anybody by default and people in my scenario who need this extra functionality can simply set the skip_absent flag to TRUE at the time of calling the function.

@MichaelChirico
Copy link
Member

I'm really not sold on the utility of such an argument.

IMHO it's an error to try and supply non-existent names.

The workaround only takes one line of code for those, so the convenience factor is next to 0 as well.

@M-YD
Copy link
Contributor Author

M-YD commented Sep 6, 2018

@MichaelChirico

IMHO it's an error to try and supply non-existent names.

I disagree because people won't be trying to supply non-existent names deliberately.

In my specific case, all of the names in old could exist, and I would expect them to almost every time, but there are some edge cases where they won't always exist, and it is those times that I need to be able to define within the function that it is okay to automatically skip and move onto the next name in the list.

I appreciate that not everybody will want this, which is why the argument should be set to FALSE by default; as such, it won't affect the majority of users in the way that it currently operates in that it won't skip missing values by default.

Also, I anticipate that there will be lots of people using setnames() for the first time over the coming years and some of those people will encounter this error without knowing what is wrong. When they look at ?setnames they will see the skip_absent argument and also see that it can be set to TRUE. From there, they will be able to make a decision based on their specific needs as to whether or not they want to continue stopping on missing old values or if they want to skip those instances and move onto the next.

Ultimately, I expect that this would provide a vehicle for a much more self-sufficient resolution because the whole investigative process for newcomers can all happen in a shorter space of time compared to having to trawl through Google/StackOverflow/GitHub in search of a solution.

@MichaelChirico
Copy link
Member

people won't be trying to supply non-existent names deliberately.

R is not really a language like Python where you might try to make errors deliberately...

Adding an argument that will benefit a tiny % of users is not really "for free" as it adds another part of the API for us to potentially maintain.

It might be another thing if the workaround was particularly data.table-flavored; it's not. It's extremely standard base R to use match like in Frank's suggestion.

@slin30
Copy link

slin30 commented Sep 6, 2018

IMO, the implication of setnames is that the target columns are the reference, and therefore I expect an error if I attempt to set a name on a non-existent column. The possibility that this expectation could be violated, even when requiring a bespoke additional arg to be changed from a hypothetical default, means I would no longer be able to take the current explicit behavior for granted.

I'd much rather the base function remain explicit, which allows me to write a custom version that is less explicit if needed. Especially when, as mentioned by others, doing so is quite trivial in this case.

@M-YD
Copy link
Contributor Author

M-YD commented Sep 6, 2018

...means I would no longer be able to take the current explicit behavior for granted.

Of course you could because the default behaviour will remain exactly the same unless the additional argument is specifically changed to TRUE (from a default of FALSE).

In which scenario(s) do you imagine the default FALSE setting being violated?

@mattdowle
Copy link
Member

mattdowle commented Sep 7, 2018

It may help that I write down that on this project we've always tried to lean on the side of 'yes'. If a user wants something, especially this strongly, then why not? If 5% or even 1% of users would find it useful, that's good enough for me. Unless it could harm the other 95%, or it's particularly difficult or time-consuming to implement. In this case though, adding skip_absent with a default of FALSE would cause no harm and is easy and quick to implement. The extra match() call Frank proposed, while one extra call, is not quite as nice as a built-in skip_absent. In this case, Mus is claiming others raise the same issue, so that should tilt us to say 'yes' even more.
But Mus - the original post was way too long and set everyone off on the wrong foot, right?
@MusTheDataGuy Are you able to submit a PR? The changes would be to data.table.R, man/setattr.Rd, tests.Rraw and NEWS. More info here: https://github.com/Rdatatable/data.table/wiki/Contributing

@M-YD
Copy link
Contributor Author

M-YD commented Sep 7, 2018

@mattdowle I'm glad that you agree and that you see value in this.

Yes, admittedly the original post was too long; I wanted to make a good case is all.

I would be delighted to submit a PR; before doing so I will first take a look at the current codebase to familiarise myself with it.

@M-YD M-YD changed the title Allow setnames() to skip column names that don't exist and move onto the next [feature request] Allow setnames() to skip column names that don't exist and move onto the next Sep 7, 2018
@mattdowle mattdowle changed the title [feature request] Allow setnames() to skip column names that don't exist and move onto the next [feature request] Allow setnames() to skip column names that don't exist Sep 7, 2018
@mattdowle mattdowle added this to the 1.12.0 milestone Oct 17, 2018
@UweBlock
Copy link
Contributor

For the sake of completeness, there is a question on SO, Rename variables based on values in another dataframe, which was answered using this new feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants