Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

:= could warn/error when RHS is ambiguous #3077

Closed
MichaelChirico opened this issue Sep 27, 2018 · 7 comments
Closed

:= could warn/error when RHS is ambiguous #3077

MichaelChirico opened this issue Sep 27, 2018 · 7 comments

Comments

@MichaelChirico
Copy link
Member

Was scraping the table here:

https://en.wikipedia.org/wiki/List_of_nominations_to_the_Supreme_Court_of_the_United_States

And didn't realize the output had duplicated column names:

SC = data.table(
  Nominee = "Name", 
  Nominee = "Age", 
  `Nominated by` = "President", 
  `Nominated by` = "Party", 
  Succession = "(Seat)   Justice", 
  Nomination = "Majorityparty", 
  Nomination = "Datesubmitted", 
  Nomination = "Outcome (vote)and date", 
  Nomination = "ND", 
)

SC[ , nominated := as.IDate(Nomination, '%b. %d, %Y')]

nominated will attempt to compute the RHS and will select the first Nomination column.

This is easily avoided by being careful about check.names up front, but if it's easy to check/warn on the fly, we should.

@franknarf1
Copy link
Contributor

(Looking at the title...) Seems like a useful precaution in any j call, not just :=...?

@MichaelChirico
Copy link
Member Author

Thinking again, maybe this can be solved up front by adding gateways in setDT, as.data.table.*, fread... if check.names is `FALSE, check the names anyway and warn by default?

Not sure it makes sense to run this code repeatedly in every j call

@franknarf1
Copy link
Contributor

Yeah, a warning upstream like you suggest would be more useful to me, though there are so many points at which data.tables are created, it may be too much of a headache that way, too. (Eg, besides the ones you mention, there's even rbindlist(list(list(x = 1, x = 2, x = 3))))

For me, := isn't special relative to other DT[...] contexts where I might make such a mistake: SC[, table(Nomination)]; SC[, .N, by=Nomination]. I don't feel strongly about it, though and this problem hardly affects me (since I work with predictable data structures these days).

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Sep 27, 2018 via email

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Aug 24, 2019

Have a fix for this issue like this:

-        allcols = c(names(x), xdotprefix, names(i), idotprefix)
-        ansvars = setdiff(intersect(av,allcols), bynames)
+        if (anyDuplicated(used_from_x <- dupintersect(names_x, av))) {
+          dupcols = unique(used_from_x[duplicated(used_from_x)])
+          stop("Unable to disambiguate reference to duplicated columns in x: ", brackify(dupcols))
+        }
+        allcols = c(names_x, xdotprefix, names_i, idotprefix)
+        ansvars = sdvars = setdiff(intersect(allcols, av), bynames)

But it breaks tests 1290 where there are some rules laid out for how data.table is supposed to act with duplicates, namely:

########################################
# Extensve testing for "duplicate" names
########################################
# Rules: Basically, if index is directly given in 'j', just those columns are touched/operated on. But if 'column' names are given and there are more than one
# occurrence of that column, then it's hard to decide which to keep and which to remove. So, to remove, all are removed, to keep, always the first is kept.
# 1) when i,j,by are all absent (or) just 'i' is present then ALL duplicate columns are returned.
# 2) When 'with=FALSE' and 'j' is a character and 'notj' is TRUE, all instances of the column to be removed will be removed.
# 3) When 'with=FALSE' and 'j' is a character and 'notj' is FALSE, only the first column will be recognised in presence of duplicate columns.
# 4) When 'with=FALSE' and 'j' is numeric and 'notj' is TRUE, just those indices will be removed.
# 5) When 'with=FALSE' and 'j' is numeric and 'notj' is FALSE, all columns for indices given, if valid, are returned. (FIXES #5688)
# 6) When .SD is in 'j', but '.SDcols'  is not present, ALL columns are subset'd - FIXES BUG #5008.
# 7) When .SD and .SDcols are present and .SDcols is numeric, columns corresponding to the given indices are returned.
# 8) When .SD and .SDcols are present and .SDcols is character, duplicate column names will only return the first column, each time.
# 9) When .SD and .SDcols are present and .SDcols is numeric, and it's -SDcols, then just those columns are removed.
# 10) When .SD and .SDcols are present and .SDcols is character and -SDcols, then all occurrences of that object is removed.
# 11) When no .SD and no .SDcols and no with=FALSE, only duplicate column names will return only the first column each time.
# 12) With 'get("col")', it's the same as with all character types.
# 13) A logical expression in 'j'.
# 14) Finally, no tests but.. using 'by' with duplicate columns and aggregating may not return the intended result, as it may operate on column names in some cases.

Not sure how married we are to the behavior laid out here (a comment in tests hardly seems like documentation)... thoughts @jangorecki / @mattdowle ?

@jangorecki
Copy link
Member

jangorecki commented Aug 24, 2019

As for me I would move towards not supporting duplicated column names wherever possible. The only use case I can think about is to produce table for reporting/formatting.

@MichaelChirico
Copy link
Member Author

Hmm with hindsight I'd rather not pursue this. Duplicate names can be a pain but they are a fact of life -- I suspect it would cause more pain than help to impose this on user workflows.

E.g. I've recently dealt with this working with {dbplyr} -- I want to run a query like (distilled):

SELECT A.*, B.*
FROM A JOIN B USING (x,y,z)

But A and B might have columns besides x,y,z in common, causing this to fail because {dbplyr} (or maybe tibble/dplyr?) refuse to allow duplicates. The workaround is the much more painful approach of writing out all the column names.

Another example: I received DF and I want to convert to data.table to do some processing. If there are duplicate names in the object I receive and it won't affect me, I'll be annoyed I have to further process DF. If there are duplicates and it will affect me, this change amounts to shuffling the code to deal with duplicates around (from after to before setDT()/as.data.table()) -- seems like busywork.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants